SFASTA: Fast Index building

What is SFASTA? Genomic and bioinformatic-adjacent sequences (RNA, protein, peptides) are stored as FASTA files. Sequencing reads off a machine are stored as FASTQ files, which add a quality score for each nucleotide. Currently, these are human-readable plaintext files. As sequencing output increases, we need to be able to process many more gigabytes and terabytes of files rapidly and…

Species-wide genomics of kākāpō provides tools to accelerate recovery

The kākāpō is a critically endangered, intensively managed, long-lived nocturnal parrot endemic to Aotearoa New Zealand. We generated and analysed whole-genome sequence data for nearly all individuals living in early 2018 (169 individuals) to generate a high-quality species-wide genetic variant callset. We leverage extensive long-term metadata to quantify genome-wide diversity of the species over time and present new…

Dissertation Defense Announcement

In partial fulfillment of my doctoral degree, I will be presenting my work on “Genomic complexities in the legume-rhizobial symbiosis.” My work is primarily computational and the work is generalizable to many different systems. Open to the public. I will be presenting my work on Thursday, May 31 @ Noon @ the BioSciences building room 257 in St Paul,…

Machine Learning for Variant calling with DeepVariant from Google Brain

Last December Google Brain released DeepVariant, a machine-learning-based variant caller using convolutional neural networks. While PacBio and Nanopore (long-read) sequencing are becoming more mainstream, massive amounts of 2nd generation sequencing* data exist for populations and are still very useful. For the Medicago HapMap project, we have 262 accessions with varying depths of 2nd generation sequencing.…

Partition implemented in Python

Coming from a functional programming mindset, I needed a partition function in Python. I discovered this on the internet and wanted to share it here with anyone else looking for similar functions. You can see what partition typically does at ClojureDocs to get an idea if you are curious.
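For readers who want a concrete picture, here is a minimal sketch of a Clojure-style partition in Python. It is not necessarily the version referred to above; the function name `partition` and the `step` parameter are illustrative assumptions modelled on the behaviour described at ClojureDocs, where a trailing incomplete chunk is dropped.

```python
def partition(n, iterable, step=None):
    """Yield tuples of n items taken at offsets `step` apart.

    Assumption: mirrors Clojure's (partition n step coll) without padding,
    so a trailing chunk with fewer than n items is dropped.
    """
    if step is None:
        step = n  # non-overlapping chunks by default
    if n < 1 or step < 1:
        raise ValueError("n and step must be positive integers")
    items = list(iterable)  # materialise so overlapping windows can be sliced
    for start in range(0, len(items) - n + 1, step):
        yield tuple(items[start:start + n])


if __name__ == "__main__":
    print(list(partition(2, range(5))))
    print(list(partition(3, "ACGTACG", step=1)))
```

With `step` smaller than `n` the chunks overlap, which is handy for sliding windows such as k-mers over a sequence. Note that this sketch materialises the input as a list, so it is not lazy the way Clojure's version is.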

Select and resequence manuscript published

Select and resequence reveals relative fitness of bacteria in symbiotic and free-living environments. Abstract: Assays to accurately estimate relative fitness of bacteria growing in multistrain communities can advance our understanding of how selection shapes diversity within a lineage. Here, we present a variant of the “evolve and resequence” approach both to estimate relative fitness and to identify genetic…

Machine Learning Crash Course from Google

Earlier this month Google made their internal Machine Learning Crash Course available. You can read more about it on their developer blog. I have a few machine learning projects going, mostly to learn but also to create an alignment-free sequence origin-identification tool. The (unorganized, incomplete) code is available at my GitHub repository. I’m curious about methods to improve genome…

Using ODG from the Neo4j Web Console

The ODG query interface should suffice for many operations, and the command-line interface supports only certain analyses. If you have more advanced queries to run, you can interact with ODG’s generated database from nearly any programming language, using a library or package, via the REST API, or through Neo4j’s Web Console. This tutorial will cover accessing it via…

ODG, the Omics Database Generator, has been published

ODG: Omics Database Generator has been published in BMC Bioinformatics and is available online now. ODG is a tool that integrates user-supplied -omics data into a coherent database and generates a web-based user interface. Advanced users can query the database directly, through a programming language or by using the Cypher query language. ODG uses Neo4j’s…