Monday, January 19, 2015

Bash scripts for QIIME work: akutils

The initial problem:
In mid-December I was feeling pretty confident that I would be finishing up my dissertation within a couple of months, or at least enough of it to convince my committee to allow me to graduate.  That was when I hit a rough patch that has ultimately left me far better off in terms of bioinformatic capability.  It all started when I was playing with some of my data.  I was randomly blasting sequences out of my demultiplexed data as a sanity check.  The data were fungal ITS2, and the sequences were all hitting various fungi.  Then one came back as PhiX174.  Yes, the genome used as the control sequence for most Illumina runs.  This surprised me a little, so I kept going and eventually found another PhiX read.  This is not supposed to happen, and since my data are community amplicon data, it seemed I could be incorporating random PhiX reads into my data and possibly reaching the wrong conclusions, depending on the extent of the contamination (or infiltration).  So how could this have happened?  First of all, my data were generated using single-indexed primers.  This is fine, but Kircher et al. showed a couple of years ago that some sample-sample bleed occurs on Illumina runs and that it can be mitigated by dual indexing.  So I was thinking it was simple bleed, but I had to determine the extent of it.  I found a program called smalt (https://www.sanger.ac.uk/resources/software/smalt/) which can efficiently map reads to a reference.  Then I downloaded the PhiX reference for Illumina data (http://www.ncbi.nlm.nih.gov/nucleotide/NC_001422) and built an index at the highest sensitivity:

smalt index -k 11 -s 1 phix-k11-s1 <sourcefasta>

Where sourcefasta is a fasta version of NC_001422.
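Mapping reads against that index then looks roughly like this (the read file names are placeholders, and options may vary by smalt version, so check the smalt help on your install):

smalt map -n 4 -o phix_hits.sam phix-k11-s1 read1.fastq read2.fastq

The resulting SAM file can then be screened for reads that actually aligned to PhiX174.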

I found my data to be not terribly contaminated, but as I have multiple runs' worth of data (single-indexed, dual-indexed, 2x150, 2x250, 2x300, 2x75), I wanted to efficiently determine the rate of PhiX infiltration.  And down the rabbit hole I went...

This turned into about four weeks of obsessive bash scripting interspersed with a week of unrelated, yet equally obsessive lab work.  My first scripts worked, on my system at least, but were pretty terrible.  Then I started learning new things (conditional statements, new functions, etc.), and useful scripts began to take form.  I am proud to say I now have a GitHub repository that many may find useful in their QIIME work.  I am calling it akutils (https://github.com/alk224/akutils), and it seems to run well on Ubuntu 14.04 and CentOS 6.  First I will give a brief description of the scripts, and then I will describe how to easily install akutils on your system, whether it be your own computer or an account on a cluster.

Biom handling:

  • biomtotxt.sh - Quick one-liner to convert a biom table to tab-delimited text (biom v1 or v2; see the example after this list).
  • txttobiom.sh - Take tab-delimited table back to biom.
  • biom-summarize_folder.sh - Summarize an entire folder of biom tables at once.
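If you prefer to do these conversions by hand, the equivalent commands look roughly like this, assuming the biom-format 2.x command line (file names are placeholders):

# biom to tab-delimited text, keeping taxonomy strings if present
biom convert -i otu_table.biom -o otu_table.txt --to-tsv --header-key taxonomy

# tab-delimited text back to biom (HDF5, i.e. biom v2)
biom convert -i otu_table.txt -o otu_table.biom --to-hdf5 --table-type="OTU table" --process-obs-metadata taxonomy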

Data preprocessing:

  • strip_primers_parallel.sh - Efficiently remove primer sequences from your raw data.  After I posted this blog I found that the parallel version keeps read 1 and read 2 in phase with each other, but they end up out of phase with the index read, so I have put it back to a single core for now.  Still useful if you have genome data you wish to assemble.
  • strip_primers.sh - Remove primer sequences from raw data.
  • primers.16S.ITS.fa - Fasta containing 515/806 and ITS4mod/5.8SLT1 sequences, decoded into nondegenerate sequences for use with strip_primers_parallel.sh. If you have 16S or ITS2 data from EnGGen, this file is probably for you!
  • PhiX_filtering_workflow.sh - Take the PhiX out of your data before you even start and estimate the rate of PhiX infiltration in your data.
  • Single_indexed_fqjoin_workflow.sh - Run fastq-join to join your reads while keeping your index reads in phase with the joined result (see the example after this list).
  • Dual_indexed_fqjoin_workflow.sh - Same, but for dual-indexed data.
  • concatenate_fastqs.sh - Just like the Excel CONCATENATE function, but for your fastq files.
  • ITSx_parallel.sh - A parallel implementation of the excellent ITSx.  Run split_libraries on fungal ITS data first, then use this to screen for sequences that match fungal ITS HMMer profiles.
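The underlying fastq-join call that the join workflows run looks roughly like this (file names are placeholders; the workflows' added value is keeping the index reads in phase with the joined output):

fastq-join read1.fastq read2.fastq -o joined.%.fastq

fastq-join expands the % into join, un1, and un2, so you get the successfully joined reads plus the unjoined leftovers from each direction.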
QIIME workflows:
  • open_reference_workflow.sh - QIIME has its own built-in workflows, including an open reference workflow (pick_open_reference_otus.py; see the example after this list), but this one is custom-built by me to be just how I like.  Maybe you will like it too!
  • chained_workflow.sh - Multi-step OTU picking with a precollapsing step and subsequent de novo picking with cdhit.  Extremely efficient and good for discerning a subtle effect size.
  • cdiv_2d_and_stats.sh - core_diversity_analyses.py in QIIME is a fantastic script, but I thought it was missing a few things (this one still is, too).  Runs your core diversity analyses and also produces biplots, 2D PCoA plots (remember those from QIIME 1.7?), and collated beta diversity stats tables (PERMANOVA and ANOSIM) for the factors you specify.  This script relies on interactive user input, so it is the least appropriate for use on a cluster, but you can do this sort of thing on your laptop (e.g. Ubuntu in a VM).
(Note: I fixed a problem with chimera filtering of large data sets today.  If you pulled this repo prior to 1:30pm on 1/20, do a fresh git pull.)
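For comparison, the stock QIIME open reference call that my workflow customizes looks something like this (the input, output, reference, and parameters file names are placeholders):

pick_open_reference_otus.py -i seqs.fna -o open_ref_otus/ -r reference_seqs.fasta -p parameters.txt -aO 4

The -aO 4 runs the parallelizable steps with four jobs; drop it if you are running serially.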

Settings management:
  • akutils_config_utility.sh - This is the backend for my scripts.  Use this to set your global (or local) parameters (such as a reference database) for the various workflows.  After you clone this repo, you need to run this script first so the global parameters can be properly referenced.
SLURM management:
  • slurm_builder.sh - Rapidly produce a slurm script for your job without any errors (specific to the monsoon cluster at NAU, but easily modified for your specific cluster).
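If you are not on monsoon, what the builder produces is just an ordinary batch script; a minimal example of the kind of script it generates looks like this (the directives, module names, and command are all placeholders you would adjust for your job and cluster):

#!/bin/bash
#SBATCH --job-name=<yourjobname>
#SBATCH --time=24:00:00
#SBATCH --cpus-per-task=12
#SBATCH --mem=32000

module load <modules your job needs>
<your akutils or QIIME command here>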
I know you are thinking, "This sounds awesome!  How can I get this on my computer?"  Well first, make sure you are using Ubuntu 14.04 or CentOS 6.  These scripts might run on other distros, but I make no effort to ensure compatibility.  Now that you have the correct OS, make sure you have git installed (on Ubuntu: sudo apt-get install git), go to your home directory, and run the following:

git clone https://github.com/alk224/akutils.git

This will pull the repo to your computer.  Now you just need to add it to your path.  You will find 11 different ways to do this and twice as many opinions on Stack Overflow, but these are the correct ways as far as I can tell.

In CentOS, modify .bashrc.

While in your home directory, execute:
nano .bashrc

Then add something like this to the end of the file:

# User specific aliases and functions
PATH=$PATH:/home/<youruserid>/akutils

export PATH

Log out and log back in and you should be able to call my scripts.

In Ubuntu, change /etc/environment (need sudo power).

While anywhere, execute:
sudo nano /etc/environment

Add (or append) a PATH line just as in the CentOS instructions:
PATH=$PATH:/home/<youruserid>/akutils

Reboot, and you should be able to call my scripts.

Install dependencies!!

Important note: If you are running on the monsoon cluster at NAU, all dependencies are already present except for ngsutils (which you can install without admin privileges to your home directory) and the slurm_builder.sh script will automatically call necessary modules.

Otherwise, you still have that pesky step of getting all the programs these scripts call into your path.  If you don't need them all, just install what you need.  Most of the scripts check for dependencies before running, so they will let you know what is still missing.  Here's the list:


I like to put extra scripts (such as fasta-splitter.pl) into another directory called ~/added_scripts.  To add another directory to your path, change the necessary file according to your OS (see above) and append it like so:

PATH=$PATH:/home/<youruserid>/akutils:/home/<youruserid>/added_scripts

Then reboot or logout/login (as above) and it should all work.

People using monsoon can retrieve fasta-splitter.pl from /common/contrib/enggen/added_scripts/ (copy it to your own added_scripts directory and add that to your path).

With dependencies in place, run the config utility first.  You can call it directly, or from any of the workflows just pass config as the only argument (see the example below).  Follow the instructions and you will be good.  I typically set the global defaults up for 16S, and build a local config file for ITS or LSU data (other).

Example to call config utility:
chained_workflow.sh config

So how should I be using these?

Good question!  Say I have a 2x250 run of ITS2 data (single indexed).  My order of processing would be:
  1. strip_primers_parallel.sh (remove primers prior to read joining).
  2. PhiX_filtering_workflow.sh (get rid of random PhiX reads prior to data processing).
  3. Single_indexed_fqjoin_workflow.sh (join data if read 2 doesn't look too bad; otherwise skip this and just use read 1, and make sure your PhiX filtering is based on read 1 only; see --help for the PhiX workflow command).
  4. Split libraries.  You could do this manually, or you can call either chained_workflow.sh or open_reference_workflow.sh and interrupt the script (ctrl-C) once split libraries has completed.
  5. ITSx_parallel.sh to ensure I am using valid input sequences for ITS analysis.
  6. chained_workflow.sh to get an initial view of the data.  This runs in a fraction of the time of the open reference workflow.  The rate-limiting steps here are OTU picking and especially taxonomy assignment.  I prefer RDP for taxonomy assignment, which can be a little testy but gives consistently good results for every locus I have tested (don't exceed 12 cores, and watch the RAM you request to ensure you don't run out).
  7. open_reference_workflow.sh to get the "standard" output.
  8. Manual filtering of the raw OTU table output.  This is super important and run-specific, so carefully inspect your output first (see the example commands after this list).
      1. Eliminate samples you will not be using (other experiments and/or controls) with filter_samples_from_otu_table.py.
      2. Eliminate low count samples (usually use min 1000 reads) with filter_samples_from_otu_table.py.
      3. Eliminate non-target taxa with filter_taxa_from_otu_table.py (this may require adjusting your database, UNITE in this context, so that non-target organisms such as host plant species can be detected).
      4. Filter singletons/doubletons, unshared OTUs with filter_otus_from_otu_table.py (pass -n 3 -s 2).
      5. Filter at some abundance threshold (usually 0.005% according to Bokulich et al, 2013) with filter_otus_from_otu_table.py (pass --min_count_fraction 0.00005).
  9. cdiv_2d_and_stats.sh on filtered tables from the chained and open reference workflows. 
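To make step 8 concrete, here is a rough sketch of those filtering commands chained together; the file names, the sample list, and the taxonomy string are placeholders that depend on your run and database (the taxonomy example assumes a UNITE-style string where fungi are labeled k__Fungi):

# keep only the samples you are using (samples_to_keep.txt lists one sample ID per line)
filter_samples_from_otu_table.py -i otu_table.biom -o table_step1.biom --sample_id_fp samples_to_keep.txt
# drop low count samples (minimum 1000 reads)
filter_samples_from_otu_table.py -i table_step1.biom -o table_step2.biom -n 1000
# keep only target taxa (here, fungi)
filter_taxa_from_otu_table.py -i table_step2.biom -o table_step3.biom -p k__Fungi
# remove singletons/doubletons and unshared OTUs
filter_otus_from_otu_table.py -i table_step3.biom -o table_step4.biom -n 3 -s 2
# apply the 0.005% abundance threshold
filter_otus_from_otu_table.py -i table_step4.biom -o table_final.biom --min_count_fraction 0.00005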
And as always, give every result the "sniff test" before accepting it.  Everyone makes mistakes.

I will keep trying to update the documentation (run any command plus -h or --help) over the next few months.  It is a little incomplete now, but things will get better.  To benefit from my updates, you need to learn one more thing: how to update your locally cloned copy of my git repo.

To update akutils (if/when updates are available):
Navigate to the repo directory (~/akutils in above examples).  Then execute:
git pull

It should update everything and you will have my changes.

OK great, but what about PhiX infiltration?  Is this a problem?
That's what I was initially trying to figure out.  At first I thought that if PhiX was bleeding into my data, every sample must also be bleeding everywhere else at the same rate.  But that doesn't seem to be the case.  I have runs with mock communities that I built.  I know what is in them, and I don't see their laboratory strains showing up in my environmental data from the same run.  However, the simpler the mock community, the more apparent the sample-sample bleed appears to be.  That is, for a given run, I can see sample-sample bleed within the mock dataset even though I don't see it at all in the environmental data.  I can't explain it all yet, but I am relieved this seems to be the case.  My dissertation work appears to still be based on high-quality data.  2x250 data seems to have the highest PhiX infiltration.  2x150 and 2x300 data have comparable, lower rates.  Read 2 always has a higher rate than read 1.  Now that I can remove it, I can stop worrying and also stop wasting computational time clustering non-target sequences.

But how can this happen?
I contacted Illumina about this, and though the person I was corresponding with initially insisted that demultiplexing separates indexed from non-indexed data (this was my prior understanding as well), I passed them the Kircher paper and they eventually got back to me, acknowledging my observations.  The consensus at Illumina is that this can and does happen, especially if the control library is non-indexed.  During the indexing reads, the non-indexed PhiX clusters generate no real signal, and signal from nearby, possibly partially overlapping clusters is read by the instrument at the PhiX cluster, thus attributing PhiX reads to some sample.  If this is indeed the mechanism, it seems unlikely there will be much sample-sample bleed this way, as any cluster will generate more signal than its neighbors, so neighboring signal will be noise at worst.

To give you some idea of PhiX infiltration on a single-indexed data set (reduced for testing purposes), here is some output from the PhiX workflow:

Processed 11250 single reads.
104 reads contained PhiX174 sequence.
Contamination level is approximately 1 percent.
Contamination level (decimal value): .0092444444
---

All workflow steps completed.  Hooray!
Mon Jan 02:54 PM MST 2015

Total runtime: 0 days 00 hours 00 minutes 1.9 seconds

One more thing!!
That output reminded me.  The workflows will report the total run time (useful for planning resource requests via SLURM), and the QIIME workflows will pick up where they left off as long as you delete any partial results from the last step.  That is, if it broke for some strange reason at pick_rep_set.py (possibly a slow file refresh rate on your system), delete that output (see the log file) and restart the workflow.

Happy power-qiimeing!!