Tuesday, May 12, 2015

Getting quality data from your sequencing provider

Computer science gave us the adage "garbage in, garbage out," and it is certainly appropriate here.  The most important thing is that you provide your sequencing facility with high-quality template to work with.  If you want data you can be confident in, and you want it in a reasonable amount of time, the most important thing is to provide an excellent sample.  I basically just said the same thing twice, but it merits repeating.  So how can you be certain that what you provide to your sequencing facility is good enough to produce quality results?

One option is to become very good at all the molecular techniques so that you can do all the prep yourself.  This has the added benefit of giving you, the researcher, complete control over the process, so you can adjust things as you see fit for your particular sample type (sometimes improving on existing protocols), and it can save you a substantial amount of money over time.  Be aware, however, that if you submit a prepped library to a sequencing provider, you take on some liability: failed runs may be assumed to be the result of a poorly-prepped library (this can be expensive).  If you plan to do a lot of sequencing, I recommend this option, but that is for another post (coming very soon, I promise).

Another option, and the right way to start in my opinion, is to simply be able to produce high quality preps for your particular sample type (DNA or RNA).  I will focus on DNA preps here, but the same concerns apply to RNA preps.  I will try to note when your RNA protocol should deviate from my advice.

Nucleic acid extraction:
The hardest part of any of these projects will always be the DNA extraction.  As long as you have high quality DNA (or RNA), the rest of your prep should progress without much effort.  As a service provider, we know well the importance of being able to confidently produce a library for a user under a deadline.  Maybe you work with E. coli and genomic DNA is easy to come by.  You can grow up some liquid cultures overnight, have fresh tissue in the morning, and procure micrograms of pure, clean DNA by lunch time.  But I work in an environmental lab, and things are never that easy unless we are doing clone libraries (which we haven't done in years now!).  Soils are notorious for being inconsistent across sites in terms of just about everything, and some of the other tissues we work with (fungi, feces, roots) are also difficult to obtain high quality DNA from.  You might think this is no problem, and that you should just purchase a kit from a company that is appropriate for your tissue.  I would say you are mostly right, but this is still no guarantee that your DNA will be any good.  Many great kits are available from suppliers such as MoBio, Qiagen, Life Technologies and Sigma, to name a few that I have used with success in the past.  It's OK to pick up the phone and call a company and ask for a recommendation.

Getting the right kit is only half the battle; you now need to use it effectively in order to complete your project.  For your sake, I hope that you purchased the kit that will do 100 preps even though you only have 45 samples.  The reason I say this is that practice is required in order to do good work.  Period.  In this case, of course, I am referring to work in the molecular lab.  Before consuming any of your precious sample, you might want to extract DNA from something that best approximates your actual samples (an extra sample?).  This will tell you a lot, and some kits even come with troubleshooting guides that can help you adjust your technique.

In my experience with soil and root extractions, I find that adding a few small steel beads to the usually less-dense grinding beads that come in kits, and adding a heating step to the lysis component, makes all the difference.  For beads I use 2.3mm chrome-steel from BioSpec.  I usually grind soils for 2-10 minutes at 30 Hz, though I do this in bursts of no more than 2 minutes at a time.  Keep in mind that the longer you grind, the more likely you are to physically shear the nucleic acids you wish to obtain.  I would suggest that while you might be OK grinding a soil sample for 10 minutes total, a plant leaf or root won't need long at all to be thoroughly pulverized (maybe 2 min maximum).

All kits have a lysis step, but not all kits do this step under heat.  In most cases, you are OK to identify this step (indicated in all good protocols), extend it, and add heat (be careful doing this with RNA, but it is still necessary for some tissues).  For example, instead of violent grinding and heated lysis, MoBio PowerSoil has you add sample to the bead tubes and place them on a vortex adapter for 10 minutes.  The 96-well version has you shake your extraction plates in a tissue disruptor for a total of 20 minutes.  In both cases, I have found that with the soils for my dissertation project I need to do something else.  I add 5 of the aforementioned chrome-steel beads to each tube or well.  Then, I get a water bath (96-well blocks) or a heated block (minipreps) up to 70C before starting the prep.  For both versions of the extraction I make use of our GenoGrinder for the initial grind (ours is a GenoGrinder 2000, but the latest model is a 2010).  I prefer the GenoGrinder over other disruptors (e.g. Retsch MixerMill, BioSpec BeadBeater96) because it can be fine-tuned and, more importantly, has linear throw (versus swinging throw) that is longer than most tubes are tall.  If using a disruptor with swinging throw, make sure you rotate your plates or tube racks part way through the grind so that all samples get reasonably even exposure to this vital step.  For tubes, I grind samples twice for 2 minutes each (4 min total) and then place them on the preheated block (fill wells with water to ensure even heating of samples).  Then I set a timer for 10 minutes and vortex tubes briefly in pairs before moving to the next step.  For plates, I grind the samples twice for 5 minutes each (10 min total) and then place them in the water bath.  After 20 minutes, I grind them again for 1 minute and return them to the water bath for another 30 minute incubation.
The troubleshooting guide with the MoBio kits suggests a shorter heated incubation, warning that yield may be reduced if this step runs longer than 5 minutes, but I didn't get any yield doing less than 10 or 20 minutes.  This is why it is important to practice!  After the heated lysis step, proceed with the remainder of the protocol as written.

Quality control:
There are several things you can and should do to check the quality of your samples.  The first is to get an absorbance reading.  Many people refer to this process as "Nanodrop," but that is simply the nifty spectrophotometer we use to obtain absorbance data while consuming a very small amount of sample.  Often the Nanodrop data is not encouraging, or people look only at the ng/uL value and get the wrong idea if there is coabsorbance at 260nm (where your DNA - the nitrogenous bases, that is - maximally absorbs light) due to something else absorbing maximally elsewhere (usually at 230nm).  To give you an idea, here is an example of excellent absorbance data from a Nanodrop spec:


This is a result I got from a pine needle extraction, and you will never see something like this from a soil extraction, but it is a nearly perfect result (I keep it taped on the wall next to the instrument so people can see what to aspire to).  Ideally your 260/280 ratio is at least 1.8 and your 260/230 ratio is at least 1.5.  If this is the case, the reported concentration can be reasonably trusted, though I don't put much stock in Nanodrop concentrations of 10ng/uL or less, as this is about the detection limit of the instrument.  Doubt this statement?  Consider this Nanodrop result from my 10ng/uL fluorescence standard (high quality calf thymus DNA):


You can see the 260/280 is quite high while the 260/230 is extremely low.  In this case, both of these values are meaningless, and the concentration calculation is almost half what it should be.  The only thing you can trust here is that you can clearly see the bump at 260nm, so you know something is there.
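If you export your absorbance readings, you can screen a whole batch against these rules of thumb at once.  Here is a quick sketch (the column layout and the example values are made up, so adjust to match your instrument's export):

```shell
# Screen spectrophotometer readings against the rules of thumb above:
# 260/280 >= 1.8, 260/230 >= 1.5, and treat <= 10 ng/uL as near the
# instrument's detection limit.  Columns (sample, ng/uL, 260/280,
# 260/230) and the values are made up; real Nanodrop exports differ.
result=$(awk -F',' 'NR > 1 {
  flag = ""
  if ($3 < 1.8) flag = flag " low-260/280"
  if ($4 < 1.5) flag = flag " low-260/230"
  if ($2 <= 10) flag = flag " near-detection-limit"
  if (flag == "") print $1, "OK"
  else print $1, "CHECK:" flag
}' <<'EOF'
sample,ng_per_uL,260/280,260/230
pine_needle,85.2,1.92,2.10
soil_extract,8.4,1.75,0.90
EOF
)
printf '%s\n' "$result"
```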

Here is an example of a Nanodrop result of a typical soil extraction:


If you are embarking on a dissertation or thesis project and this was the result you got, you might feel like crying, but have no worry, this is quite acceptable!  You can barely discern the 260nm bump, but more importantly, you don't have a giant peak at 230nm.  Consider this result from a similar sample:


The concentration looks encouraging, but that giant peak on the left is truly worrisome.  You can see a shoulder where the curve crosses 260nm, which indicates the presence of DNA, but that giant peak at 240nm indicates that something is contaminating your sample.  In this case, I believe the problem was carried-over EDTA, as I was playing with different purification solutions at the time (that trace was from 2012).  EDTA in high concentration can sequester divalent cations that are important to some crucial downstream step (think of the Mg2+ in PCR).  Another common problem with a similar absorbance profile is guanidine salt (a lysis agent) that carried over from the extraction kit.  Guanidine salts (GuSCN or GuHCl) are powerful denaturants that can destroy active enzymes you might need for something like PCR or restriction digestion.  Fear not, as such things can often be removed with a cleanup.  Many might opt for an ethanol precipitation here.  Just make sure to rinse your precipitated pellet with plenty of 70% ethanol before drying and resuspending, as you want as much of that contamination as possible solubilized and removed.  Check out my post on bead cleanups, which are also effective at removing such contamination and are now my preferred approach for these problems.  There are also column-based cleanups, but these can be expensive and tend to lose a surprising percentage of sample as you remove the impurities.  If the last image were the result I had for some set of samples, I would use the bead cleanup approach, then wash three times with 70% ethanol, making sure that at least the first wash has enough volume to contact all the inner surfaces of the sample vessel (plate well or tube).  The second and third washes can have lower volumes.  Washes need not be very long; a minute or two will suffice (probably still more than is necessary).
Make sure to dry the samples, and then resuspend in Tris-Cl pH 8-9 and get a new Nanodrop result.

An example of a gel-extracted PCR product which contains a LOT of guanidine salt.  Agarose solubilization buffers typically contain GuSCN.  Note the peak maximum is at 230nm in contrast to the EDTA example above.  This sample was submitted to EnGGen for Sanger sequencing, but failed to produce any signal:


A couple of notes about guanidine and EDTA carry-over.  DNA binding to either beads or a silica membrane relies on salt-bridge formation between the DNA and the binding substrate, and guanidine can participate in this process.  This means that while most of the salt will be washed away, some will likely remain.  Conversely, EDTA can sequester the very salts you need to facilitate the binding of DNA to your immobilization substrate of choice, so when you do the binding step, you might lose most or all of your DNA.  You can increase the volume of bead solution used in a bead cleanup, or the ratio of binding solution used in a column cleanup, to avoid this problem.  Alternatively, an ethanol precipitation avoids the complications of both approaches.

After Nanodrop:
You may have noticed that I wasn't seriously suggesting you use the concentrations provided by the Nanodrop.  I am far more interested in what the Nanodrop can tell me about the quality of samples in terms of potential contaminants that can cause problems downstream.  If you want real concentrations, you need to determine them by fluorescence.  Most labs use something called a Qubit for this purpose.  The Qubit works OK, but I prefer the Dynaquant300 fluorometer from Hoefer, which has apparently been discontinued.  I think this unit from Promega may be an acceptable replacement.  The DQ300 can be equipped with a micro-cuvette adapter, which allows me to use the same PicoGreen reagent used in the Qubit, but at much lower volume (60uL), and the sample is read through a borosilicate cuvette rather than a plastic tube, which offers more consistent concentration determination.  However you do it, fluorometry is the best way to get an accurate concentration determination, as the dyes that are used (good ones, anyway; some are less good) are specific to the substrate of interest (e.g., dsDNA or RNA).  While it is not always necessary to obtain fluorescence quantification values, I maintain that it is a good idea, and for all of my own work I use fluorescence to normalize all samples within an experiment to the same concentration prior to embarking on molecular analyses.  In many publications you will see something much less involved, such as "DNA was extracted with some kit, and diluted 100-fold prior to PCR."  This really bothers me, as it lacks useful details and is essentially not reproducible.  In short, it is laziness on the part of the researcher.  On the other hand, if you have something that works, why scrutinize it endlessly?  I can see why such things are sometimes skipped.  If you need accurate quantities of sample to send to your sequencing provider, use fluorescence to ensure you are actually providing the correct quantity.
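The normalization step itself is just C1V1 = C2V2 arithmetic.  Here is a minimal sketch (all concentrations and volumes are made-up examples):

```shell
# C1*V1 = C2*V2 normalization sketch: given a fluorometric concentration
# (C1, ng/uL), compute how much sample and how much Tris-Cl to combine
# for a final volume V2 at a target concentration C2.  Values are made up.
C1=42.5   # measured concentration, ng/uL
C2=5      # target concentration, ng/uL
V2=50     # final volume, uL
sample_ul=$(awk -v c1="$C1" -v c2="$C2" -v v2="$V2" 'BEGIN{printf "%.2f", c2 * v2 / c1}')
diluent_ul=$(awk -v v2="$V2" -v s="$sample_ul" 'BEGIN{printf "%.2f", v2 - s}')
echo "Combine ${sample_ul} uL sample with ${diluent_ul} uL Tris-Cl"
```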

Gel analysis:
This is the old-school method for checking your DNA.  Run some DNA extract (~10uL) on a gel; intact genomic DNA should migrate well above the largest ladder band.  I used to use a HindIII digest of lambda phage for this (largest band ~23kb), but now I just use my favorite ladder (KAPA express ladder), and if the genomic DNA runs larger than the largest band I am satisfied.  For difficult samples, you may not have enough DNA to see anything.  If you do this check, you want a large, discrete band.  If your sample shows smearing, the DNA may be degraded.

PCR testing:
The proof is in the pudding, as they say, so the real test of your sample is whether it can be manipulated.  Despite everything I have put in this post so far, if you can get your sample to work, you have little to worry about.  For preps that rely directly on PCR, PCR is of course the best test of sample viability.  But even if your sample is destined for a different prep which utilizes a different initial enzyme (e.g., restriction endonuclease for RADseq or transposase for Nextera), PCR can still tell you if your sample is clean enough for enzymatic treatment.  For this reason, I will typically do a PCR test of every sample to see if I can get some standard loci to amplify.  In most cases, this might be an rDNA locus (e.g., 16S or ITS) since primers for these loci are universal.  However, if you are doing an mRNA pulldown, make sure you test your converted cDNA with a housekeeping gene instead (such as CO1), since if your pulldown was efficient, your sample should not contain any remaining ribosomal sequences.  Ideally, your sample will work well with a generic polymerase and not require an adjuvant (e.g., BSA, betaine) or an advanced enzyme such as 2G Robust from KAPA.  It may be useful to contact your provider and find out some protocol details if your sample is to be subjected to PCR.  For instance, for PCR-based amplicon preps such as 16S that we generate here at EnGGen, we use Phusion polymerase, so we encourage users to test their samples with Phusion using our PCR protocol (more on that in my next post).

While PCR testing might seem easy and straightforward, there are a couple of tricks you can apply at the outset that can save you a lot of time and headache later on.  The first I recommend is to boost your MgCl2 concentration to 3.0 mM while testing your sample.  This has the effect of increasing the activity of the polymerase and can ensure that you see a product if a product can be produced.  In some cases this also causes non-specific amplification (PCR artifacts), in which case you should roll back your MgCl2 to where this doesn't happen, and/or increase your annealing temperature.  The second trick is to start off with a dilution series PCR.  This does a few things simultaneously.  Crucially, it allows you to determine a dilution factor which improves results for your sample without actually creating any dilutions (which would needlessly occupy freezer space and consume part of your precious sample volume).  In some cases you may find that an already low-concentration template works well at a high dilution factor such as 1/100x.  As with suspect Nanodrop data, this can indicate the presence of PCR inhibitors, which are sufficiently diluted away by the dilution series even though enough template DNA remains for the reaction to proceed efficiently.  If this is the case, you should still consider doing a cleanup, as very dilute samples may amplify, but may do so with low efficiency and increased bias, which will diminish your ability to discern community patterns from your data.

So how to do a PCR dilution series?  It is very easy as long as you keep in mind that like all molecular biology processes, homogenization of your reactions is crucial to your success.  For each sample, plan to prepare 3 PCR reactions in 10uL total volume, using 1 uL of template DNA each.  Make enough master mix to aliquot 9 uL for each PCR reaction (minus the template), and distribute to your PCR plate (avoid strip tubes as they often have issues with evaporation).  I prefer to distribute the master mix so that each sample will be represented in three adjacent wells of the same row.  Add 1 uL DNA to the first reaction, seal, mix and spin down.  Now take 1 uL of the first reaction and deliver it (as template) to the second reaction.  This creates your 1/10x dilution.  Repeat the seal, mix, spin down, and move 1 uL of the second reaction to the 3rd reaction (for 1/100x).  If you would prefer a 1/50x dilution, move 2 uL instead.  Upon getting gel results (2 uL per well is more than enough), examine which reactions worked the best, and determine approximately the optimal dilution for your samples and what the actual concentrations are of the successful products.
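The master mix arithmetic above is easy to get wrong at the bench, so here is a minimal sketch of it (the sample count is a made-up example, and the 10% overage is my own habit, not from any kit protocol):

```shell
# Master mix sizing for the dilution-series PCR described above:
# 3 wells per sample at 10 uL each, 9 uL of master mix per well
# (the remaining 1 uL is template), plus 10% overage so you don't
# run dry on the last well.  SAMPLES is a placeholder.
SAMPLES=24
WELLS=$((SAMPLES * 3))
MM_UL=$(awk -v w="$WELLS" 'BEGIN{printf "%.0f", w * 9 * 1.1}')
echo "$WELLS wells -> prepare $MM_UL uL master mix"
```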

EnGGen PCR testing suggestions:
Use the same polymerase we will use for your prep (usually Phusion HotStart II).
Get cycling conditions from us before you start.
Get primers from us or compare sequences against ours before you start.

Happy DNA prepping!

Monday, January 19, 2015

Bash scripts for QIIME work: akutils

The initial problem:
In mid-December I was feeling pretty confident that I would be finishing up my dissertation within a couple of months, or at least enough of it to convince my committee to allow me to graduate.  That was when I hit a bit of a rough patch that has left me far better off than I previously was in terms of bioinformatic capability.  It all started when I was playing with some of my data.  I was randomly blasting sequences out of my demultiplexed data as a sanity check.  The data were fungal ITS2, and they were all hitting various fungi.  Then one came back as PhiX174.  Yes, the genome used as the control sequence for most Illumina runs.  This surprised me a little, so I continued and eventually found another PhiX read.  This is not supposed to happen, and as my data are community amplicon data, it seemed I could be incorporating random PhiX reads into my data and possibly reaching the wrong conclusions, depending on the extent of the contamination (or infiltration).  So how can this have happened?  First of all, my data were generated using single-indexed primers.  This is fine, but it was shown a couple of years ago by Kircher et al. that some sample-sample bleed occurs on Illumina runs and that this can be mitigated by use of dual indexing.  So I was thinking it was simple bleed, but I had to determine the extent of it.  So I found a program called smalt (https://www.sanger.ac.uk/resources/software/smalt/) which can efficiently map reads to a reference.  Then I downloaded the PhiX reference for Illumina data (http://www.ncbi.nlm.nih.gov/nucleotide/NC_001422) and built an index at the highest sensitivity:

smalt index -k 11 -s 1 phix-k11-s1 <sourcefasta>

Where sourcefasta is a fasta version of NC_001422.
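With the index built, the mapping step might look something like this; the flags are a sketch from memory of smalt's CLI (check smalt map -H before trusting them), the read file name is a placeholder, and samtools is an extra assumption on my part for counting the hits:

```shell
# Map reads against the PhiX index built above; output goes to SAM.
# -n 4 requests 4 threads, -o names the output file.
smalt map -n 4 -o phix_hits.sam phix-k11-s1 read1.fastq

# Count mapped (i.e., putative PhiX) reads: -c counts, -F 4 excludes
# records with the "unmapped" flag set.
samtools view -c -F 4 phix_hits.sam
```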

I found my data to be not terribly contaminated, but as I have multiple runs worth of data (single-indexed, dual-indexed, 2x150, 2x250, 2x300, 2x75), I wanted to efficiently determine the rate of PhiX infiltration.  And down the rabbit hole I went...

This turned into about 4 weeks of obsessive bash scripting interspersed with a week of unrelated, yet equally obsessive lab work.  My first scripts worked, on my system at least, but were pretty terrible.  Then I started learning new things (conditional statements, new functions, etc.) and useful scripts began to take form.  I am proud to say I now have a github repository that many may find useful in their QIIME work, which I am calling akutils (https://github.com/alk224/akutils); it seems to run well on Ubuntu 14.04 and CentOS 6.  First I will start with a brief description of the scripts, and then I will describe how to easily install akutils on your system, whether it be your own computer or an account on a cluster.

Biom handling:

  • biomtotxt.sh - Quick one-liner to convert biom table to tab-delimited (biom v1 or v2).
  • txttobiom.sh - Take tab-delimited table back to biom.
  • biom-summarize_folder.sh - Summarize an entire folder of biom tables at once.
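These wrappers sit on top of the biom-format CLI; for reference, roughly equivalent direct calls look like this under biom-format 2.x (older 1.x releases used different flags, so treat these as a sketch and the file names as placeholders):

```shell
# biom table -> tab-delimited text (carrying taxonomy metadata, if present)
biom convert -i otu_table.biom -o otu_table.txt --to-tsv --header-key taxonomy

# tab-delimited text -> biom (HDF5)
biom convert -i otu_table.txt -o otu_table.biom --to-hdf5 --table-type="OTU table"

# per-sample read count summary for one table
biom summarize-table -i otu_table.biom -o otu_table_summary.txt
```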

Data preprocessing:

  • strip_primers_parallel.sh - Efficiently remove primer sequences from your raw data.  After I posted this blog I found the script would keep read 1 and read 2 in phase with each other, but they would end up out of phase with the index read, so I have put it back to a single core for now.  Still useful if you have genome data you wish to assemble.
  • strip_primers.sh - Remove primer sequences from raw data.
  • primers.16S.ITS.fa - Fasta containing 515/806 and ITS4mod/5.8SLT1 sequences, expanded into nondegenerate sequences for use with strip_primers_parallel.sh.  If you have 16S or ITS2 data from EnGGen, this file is probably for you!
  • PhiX_filtering_workflow.sh - Take the PhiX out of your data before you even start and estimate the rate of PhiX infiltration in your data.
  • Single_indexed_fqjoin_workflow.sh - Run fastq-join to join your reads while keeping your index reads in phase with the joined result.
  • Dual_indexed_fqjoin_workflow.sh - Same, but for dual-indexed data.
  • concatenate_fastqs.sh - Just like the Excel function, but for your fastq files.
  • ITSx_parallel.sh - A parallel implementation of the excellent ITSx.  Run split_libraries on fungal ITS data first, then use this to screen for sequences that match fungal ITS HMMER profiles.

QIIME workflows:
  • open_reference_workflow.sh - QIIME has its own built-in workflows, including an open reference workflow (pick_open_reference_otus.py), but this one is custom-built by me to be just how I like it.  Maybe you will like it too!
  • chained_workflow.sh - Multi-step OTU picking with a precollapsing step and subsequent de novo picking with cdhit.  Extremely efficient and good for discerning a subtle effect size.
  • cdiv_2d_and_stats.sh - core_diversity_analyses.py in QIIME is a fantastic script, but I thought it was missing a few things (it still is).  Runs your core diversity analyses and also produces biplots, 2D PCoA plots (remember those from 1.7?) and collated beta diversity stat tables for the factors you specify (permanova and anosim).  This script relies on interactive user input, so it is the least appropriate for use on a cluster, but you can do this sort of thing on your laptop (e.g. Ubuntu in a VM).
(note: fixed problem with chimera filtering large data sets today.  If you pulled this repo prior to 1:30pm 1/20, do a fresh git pull.)

Settings management:
  • akutils_config_utility.sh - This is the backend for my scripts.  Use this to set your global (or local) parameters (such as a reference database) for the various workflows.  After you clone this repo, you need to run this script first so the global parameters can be properly referenced.

SLURM management:
  • slurm_builder.sh - Rapidly produce a slurm script for your job without any errors (specific to the monsoon cluster at NAU, but easily modified for your specific cluster).
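For context, a minimal slurm submission script of the sort slurm_builder.sh produces might look like the following sketch; every directive value is a placeholder to size for your own job and cluster:

```shell
#!/bin/bash
# Minimal sketch of a slurm job script.  All values below are
# placeholders; size --time and --mem from the runtimes the akutils
# workflows report, and keep --cpus-per-task within sensible limits.
#SBATCH --job-name=akutils_job
#SBATCH --time=24:00:00
#SBATCH --cpus-per-task=12
#SBATCH --mem=64G
#SBATCH --output=akutils_job_%j.log

# your akutils workflow command goes here
```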

I know what you are thinking: "This sounds awesome!  How can I get it on my computer?"  Well first, make sure you are using Ubuntu 14.04 or CentOS 6.  These scripts might run on other distros, but I make no effort to ensure compatibility.  Now that you have the correct OS, make sure you have git installed (on Ubuntu: sudo apt-get install git), go to your home directory, and run the following:

git clone https://github.com/alk224/akutils.git
This will pull the repo to your computer.  Now you just need to add it to your path.  You will find 11 different ways to do this and twice as many opinions on Stack Overflow, but these are the correct ways as far as I can tell.

In CentOS, modify .bashrc.

While in your home directory, execute:
nano .bashrc

Then add something like this to the end of the file:

# User specific aliases and functions
PATH=$PATH:/home/<youruserid>/akutils

export PATH

Log out and log back in and you should be able to call my scripts.

In Ubuntu, change /etc/environment (requires sudo).

While anywhere, execute:
sudo nano /etc/environment

Add to (or extend) the PATH line.  Note that /etc/environment is not a shell script, so $PATH will not be expanded there; instead, append your akutils directory to the end of the existing colon-separated PATH value, e.g.:
PATH="...existing entries...:/home/<youruserid>/akutils"

Reboot, and you should be able to call my scripts.

Install dependencies!!

Important note: If you are running on the monsoon cluster at NAU, all dependencies are already present except for ngsutils (which you can install without admin privileges to your home directory) and the slurm_builder.sh script will automatically call necessary modules.

Otherwise, you still have that pesky step of getting all the programs these scripts call into your path.  If you don't need them all, just install what you need.  Most of the scripts check for dependencies before running, so they will let you know what is still missing.  Here's the list:


I like to put extra scripts (such as fasta-splitter.pl) into another directory called ~/added_scripts.  To add another directory to your path, change the appropriate file according to your OS (see above) and append the new directory like so:

PATH=$PATH:/home/<youruserid>/akutils:/home/<youruserid>/added_scripts

Then reboot or logout/login (as above) and it should all work.

People using monsoon can retrieve fasta-splitter.pl from /common/contrib/enggen/added_scripts/ (copy it to your own added_scripts directory and add that to your path).

With dependencies in place, run the config utility first.  You can call it directly, or from any of the workflows just pass config as the only argument (see the example below).  Follow the instructions and you will be good.  I typically set this up to default to 16S, and build a local config file for ITS or LSU data (other).

Example to call config utility:
chained_workflow.sh config

So how should I be using these?

Good question!  Say I have a 2x250 run of ITS2 data (single indexed).  My order of processing would be:
  1. strip_primers_parallel.sh (remove primers prior to read joining).
  2. PhiX_filtering_workflow.sh (get rid of random PhiX reads prior to data processing).
  3. Single_indexed_fqjoin_workflow.sh (join data if read 2 doesn't look too bad; otherwise skip this and just use read 1, and make sure your PhiX filtering is based on read 1 only - see --help for the phix workflow command).
  4. Split libraries.  You could do this manually, or you can call either chained_workflow.sh or open_reference_workflow.sh and interrupt the script (ctrl-C) once split_libraries has completed.
  5. ITSx_parallel.sh to ensure I am using valid input sequences for ITS analysis.
  6. chained_workflow.sh to get an initial view of the data.  This runs in a fraction of the time of the open reference workflow.  The rate-limiting steps here are OTU picking and especially taxonomy assignment.  I prefer RDP for taxonomy assignment, which can be a little testy but gives consistently good results for every locus I have tested (don't exceed 12 cores, and watch your requested RAM to ensure you don't run out).
  7. open_reference_workflow.sh to get the "standard" output.
  8. Manual filtering of the raw otu table output.  This is super important and run-specific, so carefully inspect your output first.
      1. Eliminate samples you will not be using (other experiments and/or controls) with filter_samples_from_otu_table.py.
      2. Eliminate low count samples (usually use min 1000 reads) with filter_samples_from_otu_table.py.
      3. Eliminate non-target taxa (may require adjusting your database, UNITE in this context to detect non-target organisms such as host plant species) with filter_taxa_from_otu_table.py.
      4. Filter singletons/doubletons, unshared OTUs with filter_otus_from_otu_table.py (pass -n 3 -s 2).
      5. Filter at some abundance threshold (usually 0.005% according to Bokulich et al, 2013) with filter_otus_from_otu_table.py (pass --min_count_fraction 0.00005).
  9. cdiv_2d_and_stats.sh on filtered tables from the chained and open reference workflows. 
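To make step 8 concrete, the filtering sequence might look something like this in QIIME 1.  The file names are placeholders, the host-plant taxonomy string is hypothetical, and you should check each script's -h output rather than trusting my flags:

```shell
# 8.1/8.2: keep only the samples for this experiment, then drop
# samples with fewer than 1000 reads
filter_samples_from_otu_table.py -i otu_table.biom -o t1.biom \
    --sample_id_fp samples_to_keep.txt
filter_samples_from_otu_table.py -i t1.biom -o t2.biom -n 1000

# 8.3: remove non-target taxa (hypothetical host-plant assignment shown)
filter_taxa_from_otu_table.py -i t2.biom -o t3.biom -n k__Plantae

# 8.4: drop singletons/doubletons and OTUs observed in only one sample
filter_otus_from_otu_table.py -i t3.biom -o t4.biom -n 3 -s 2

# 8.5: abundance threshold per Bokulich et al. 2013 (0.005%)
filter_otus_from_otu_table.py -i t4.biom -o otu_table_filtered.biom \
    --min_count_fraction 0.00005
```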
And as always, give every result the "sniff test" before accepting it.  Everyone makes mistakes.

I will keep trying to update the documentation (run any command plus -h or --help) over the next few months.  It is a little incomplete now, but things will get better.  To benefit from my updates, you need to learn one more thing: how to update your locally cloned copy of my git repo.

To update akutils (if/when updates are available):
Navigate to the repo directory (~/akutils in above examples).  Then execute:
git pull

It should update everything and you will have my changes.

OK great, but what about PhiX infiltration?  Is this a problem?
That's what I was initially trying to figure out.  At first I thought that if PhiX was bleeding into my data, every sample must also be bleeding everywhere else at the same rate.  But that doesn't seem to be the case.  I have runs with mock communities that I built.  I know what is in them, and I don't see their laboratory strains showing up in my environmental data from the same run.  However, the simpler the mock community, the more apparent the sample-sample bleed appears to be.  That is, for a given run, I can see sample-sample bleed within the mock dataset even though I don't see it at all in the environmental data.  I can't explain it all yet, but I am relieved this seems to be the case.  My dissertation work appears to still be based on high-quality data.  2x250 data seems to have the highest PhiX infiltration.  2x150 and 2x300 data have comparable, lower rates.  Read 2 always has a higher rate than read 1.  Now that I can remove it, I can stop worrying and also stop wasting computational time clustering non-target sequences.

But how can this happen?
I contacted Illumina about this, and though the person I was corresponding with initially insisted that demultiplexing separates indexed from non-indexed data (this was my prior understanding as well), I passed them the Kircher paper and they eventually got back to me, acknowledging my observations.  The consensus at Illumina is that this can and does happen, especially if the control library is non-indexed.  During the index reads, the non-indexed PhiX clusters generate no real signal, so signal from nearby, possibly partially-overlapping clusters is read by the instrument at the PhiX cluster's position, thus attributing PhiX reads to some sample.  If this is indeed the mechanism, it seems unlikely there will be much sample-sample bleed this way, as any cluster will generate more signal than its neighbors, so neighboring signal will be noise at worst.

To give you some idea of PhiX infiltration on a single-indexed data set (reduced for testing purposes), here is some output from the PhiX workflow:

Processed 11250 single reads.
104 reads contained PhiX174 sequence.
Contamination level is approximately 1 percent.
Contamination level (decimal value): .0092444444
---

All workflow steps completed.  Hooray!
Mon Jan 02:54 PM MST 2015

Total runtime: 0 days 00 hours 00 minutes 1.9 seconds
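The decimal value in that output is just the PhiX read count divided by the total read count (the workflow rounds it to the nearest percent for the summary line).  A quick sanity check of the arithmetic, separate from the workflow itself:

```shell
# 104 PhiX reads out of 11250 total reads, to 10 decimal places
awk 'BEGIN { printf "%.10f\n", 104 / 11250 }'
# prints 0.0092444444 (i.e. a bit under 1 percent)
```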

One more thing!!
That output reminded me.  The workflows will report the total run time (useful for planning resource requisition via slurm), and the qiime workflows will pick up where they left off as long as you delete any partial results from the last step.  That is, if it broke for some strange reason at pick_rep_set.py (possibly slow file refresh rate on your system), delete that output (see the log file) and restart the workflow.
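A sketch of that restart procedure (the directory and file names here are hypothetical; check your own log file for the actual step and its partial output):

```shell
# Suppose the log shows the run died during pick_rep_set.py and left a
# partial rep set file behind (names are examples -- consult your log).
WORKDIR=otus_out
rm -f "$WORKDIR"/*_rep_set.fasta
# Then re-launch the exact same workflow command you used before;
# steps with complete output are detected and skipped, e.g.:
# pick_de_novo_otus.py -i seqs.fna -o "$WORKDIR"   (your actual QIIME call)
```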

Happy power-qiimeing!!

Wednesday, August 6, 2014

EnGGen bioinformatics computer

Bioinformatic processing is an essential part of any lab that produces, or simply utilizes, NGS data.  The grant that funded our MiSeq also funded the acquisition of a new computer to provide some bioinformatic power.  The process was a learning experience for me, albeit an enjoyable one, as I like playing with hardware and getting software to work even if it means a lot of time spent.  Most of the time the computer is very reliable and provides the needed power for most questions we might have.  Occasionally something happens, and then time is needed to find the problem and work it out.  This can take 5 minutes or it can take a week, and you must be prepared to persevere so that the hardware you have invested money in can continue to function as desired.  Hopefully with every rebuild you become more adept at setting up your system and see ways in which better performance can be achieved.  I am writing this post not only to illustrate to others how to set up a decent bioinformatic system, but also for myself (or my successor) so that I have all the necessary details in one place.

The system is a Mac Pro tower from late 2012.  It has dual Intel Xeon processors (E5645 @ 2.4GHz, offering 24 virtual cores) and 64GB RAM.  It came with Mac OSX on a single 1TB drive, but we purchased four 3TB drives (the tower has four SATA bays) and I initially installed Ubuntu Linux 12.04 over a RAID 10 configuration.  This means the drives were paired up to create two 6TB volumes, with each 6TB volume mirroring the other (redundancy).  Externally we also have a Synology DS1813+ RAID enclosure containing eight 4TB NAS drives on Synology Hybrid RAID with two-drive fault tolerance.

Last week I came back from vacation and my external RAID had finished expanding as I had added drives to it just before leaving.  The computer also wanted to reboot to finish installing updates so I figured it was an ideal time for a restart and then I would finish administering the external RAID.  Unfortunately, the computer failed to reboot and refused every other attempt to do so.  It could have been a number of things (I was immediately suspicious of the software updates), but I eventually learned one of the drives failed.  Turns out that my initial install on RAID 10 was a smart move, but I had to overcome some other problems as well which I will detail here.

Desktop computers are moving toward something called EFI (Extensible Firmware Interface) or UEFI (Unified EFI).  It makes sense to update hardware standards sometimes, but we are in that painful period when most computers in use still rely on a BIOS (Basic Input/Output System) instead.  Our Mac is EFI-based while all the other computers in the lab are still BIOS-based.  Thus, every bootable USB Ubuntu drive I made failed, since the Mac hardware was incompatible with the boot partition set up by the BIOS-based computers.  Luckily I still had the old 1TB OSX drive lying around, so I swapped it in to boot OSX and produced a compatible USB drive that way (Ubuntu 14.04 AMD64/Mac version).

That problem solved, I was finally able to get the computer functioning and attempted to examine the RAID array.  Since I had a RAID 10, I decided to simply install the OS over the first drive, though in retrospect this was a bit of Russian roulette and I should have simply worked from the USB install.  The first thing I did was install gparted:

sudo apt-get install gparted

Then run gparted to examine the state of each disk:

sudo gparted

This showed a problem with volume sdb, the second drive in the array; volumes sdc and sdd still had healthy RAID partitions on them.  To administer software RAID in Linux, you should install something called mdadm:

sudo apt-get install mdadm

To get things going I first had to stop the RAID array (puzzling, but nothing worked until this was done; md0 is the name of the RAID array, for "multidisk"):

sudo mdadm --stop /dev/md0

Then a simple command to assemble the RAID got things functioning immediately:

sudo mdadm --assemble --scan

This started up the remnant RAID 10 with two of the four original drives and mounted it automatically.
I was lucky that the drive used for the OS and the failed drive together constituted one RAID0 pair, and not the two identical components of a RAID1 mirror (RAID 10 is sometimes called RAID1+0), or some data would have been lost.

Before I could back up the contents of the remaining RAID I had some other things to do first.  I planned to back up the data to the external Synology RAID, but all it had done was expand the RAID array, which doesn't change the addressable size of the RAID volume; despite now having over 20TB of RAID, the volume as addressed by Linux was still only about 7TB.  To top it off, now that I had a fresh OS install, I no longer had the software set up to communicate between the Synology box and the Linux system.

So I went to the Synology website and downloaded SynologyAssistant (global 64-bit).  Unzip the files into some directory (I used a subdirectory within Downloads).  The install script doesn't come executable, so you have to change that first, then run it:

sudo chmod a+x install.sh
sudo ./install.sh

Do what the script tells you and, when finished, start SynologyAssistant from the command line (type SynologyAssistant).  It should automatically detect the active array (it did for me).  Click on it and then click "connect."  The web interface opens in a browser and you need to enter your username and password (if you ever lose these....).  From here I was able to administer the RAID to expand the existing volume to fill out the drives.  Unfortunately the first step is a parity check, and with so many large drives it took about 30 hours.  Once that is done, expanding the volume is a simple click job that only takes a few minutes.

You next need to tell Linux how to communicate with the Synology RAID.  The instructions on the Synology website are very outdated (written for Ubuntu 10.04), use cifs rather than nfs, and rely on extra credential files that may represent a security risk.  Others have documented the correct way to do this elsewhere (this is a good post: http://www.ryananddebi.com/2013/01/15/linuxmint-or-ubuntu-how-to-automount-synology-shares/).
First you need to install the NFS software (not sure if you need both packages or not, but just to be safe):

sudo apt-get install nfs-common nfs-kernel-server

I had a hard time with NFS at first until I realized it is a client service.  NFS needs the directory the OS will use to address the external RAID to be identified in the /etc/exports file.  I added the following two lines to the end of the exports file (use sudo nano /etc/exports):

#Synology nfs directory
/home/enggen/external ip.address.here.from.synology.interface(rw,sync,no_root_squash,no_subtree_check)

You can then start the nfs service (sudo service nfs-kernel-server start).

Next you need to add a line to /etc/fstab to tell the computer how to mount your external RAID so add the following to the end of fstab:

# Automount Synology RAID device
synology.ip.address.here:/volume1/homes /home/enggen/external nfs rw,hard,intr,nolock 0 0

To make things go you can either reboot or type: sudo mount -a

Now I had a functioning RAID with over 20TB of space, and I archived folders from my desktop and a few other database locations to a backup directory, using the tar command to gzip everything (tar -czvf /path/to/external/RAID/device/archivename.tar.gz /path/to/folder/to/archive).  I opened multiple terminals and set everything to archiving simultaneously, which can slow performance, but it was now Sunday and I didn't plan to be in lab all day long.  Once that was going I left, and all my data was safe and backed up by Monday morning.
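The archiving pattern, with hypothetical paths (the destination would be your mounted external RAID; -C is used so the archive stores a relative path rather than an absolute one):

```shell
# Archive one project folder to the external RAID (paths are examples)
SRC=/home/enggen/projects/soil_survey
DEST=/home/enggen/external/backups
mkdir -p "$DEST"
tar -czvf "$DEST/soil_survey.tar.gz" -C "$(dirname "$SRC")" "$(basename "$SRC")"
# Repeat in separate terminals, one folder each, to archive in parallel
```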

However, all was not well.  I spent the next day or two trying to get the old drives prepared for a clean install.  I found that the remnant file systems were causing me problems, so I used gparted to eliminate the various partitions from each drive until I was left with a complete volume of unallocated space.  One drive was less cooperative, and it wasn't until I connected it to a Windows computer via an external adapter and tried to format it from the command line with all zeros (some windows code here) that I learned why: Windows reported over 2000 bad sectors, which means this piece of hardware was probably responsible for my non-booting status, and even if I had managed to recover it, it likely would have failed again soon after.

So, I was down to only three 3TB drives, which sounds like a lot, but I have one project, for instance, that quickly bloated to nearly 6TB in size.  So I need space.  I looked around the existing lab computers for a drive to harvest and found a 500GB drive.  As a bonus it spins at 7200rpm, so it should be ideal for running the Ubuntu OS.  Next I plan to establish a RAID0 (striped RAID) with the existing, healthy 3TB drives, which will offer 9TB of capacity and speed benefits from spreading the data across three physical volumes.  To protect this data I will set up daily backups to the external RAID using cron.  To make this easy, I will do this via a service called webmin.  More on that once it is set up.
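The planned nightly backup could look something like this (script name, paths, and schedule are all assumptions on my part; webmin would just be a friendlier way to manage the same cron entry):

```shell
#!/bin/bash
# Hypothetical nightly backup script (all paths are assumptions), saved
# e.g. as /usr/local/bin/raid_backup.sh and scheduled via cron:
#   0 2 * * * /usr/local/bin/raid_backup.sh
SRC=${SRC:-/home/enggen/data}                 # striped RAID0 volume (assumed)
DEST=${DEST:-/home/enggen/external/nightly}   # Synology NFS mount (assumed)
mkdir -p "$DEST"
# One dated gzipped archive per night; -C keeps the stored path relative
tar -czf "$DEST/backup_$(date +%F).tar.gz" -C "$(dirname "$SRC")" "$(basename "$SRC")"
```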

Wednesday, April 2, 2014

Reliable Sanger sequencing with 0.2uL BigDye

     Sanger sequencing might be the way of the past, but it remains an essential tool for many applications.  Researchers can submit samples for processing as raw DNA (requiring PCR amplification), plasmid, or PCR product for sequencing.  However, as costs for pretty much everything continue to rise, the only way to control your Sanger costs is to become proficient at this technique and roll back the amount of BigDye that you use per reaction.  List price on the Life Tech website is now about $1100 for 800uL of the stuff, and that doesn't include tax, handling, or dry ice charges.  But we need our sequences, and some of us just don't have the budget to produce sequence according to the ABI protocol.  Here I present a brief protocol for producing high quality sequences using just 0.2uL BigDye per reaction.


1) PCR a clean product (no extra bands)

2) ExoSAP your reactions:
     Combine (adjust volumes to maintain ratios) 50uL H2O, 5uL SAP (1U/ul), 0.5uL ExoI (10U/ul).  Add 2uL to each reaction per about 5uL volume, mix and spin down, and run cycler program (37C 40min, 80C 20min, 10C forever).  Alternatively, you can add less exosap (1uL perhaps), mix and spin down, then let reactions sit on the bench overnight.  In the morning kill the enzymes with 20min at 80C.

3) Prepare sequencing reactions:
     First, I do this in 384well plates.  This means I have very little headspace into which to evaporate any sample volume, and this presumably keeps my chemistry much more stable than if you were to do this in a 96well plate.  That said, I have done many many 5uL reactions in 96well plates with no problems, but I very much prefer 384well plates these days.
     Start with a high concentration primer working dilution (20uM is good, 15uM is easier for fewer reactions).  For the following calculations, I use these solutions: BigDye v3.1, 5X BigDye sequencing buffer, 50mM MgCl2, 20uM primer.
     Each reaction contains:
0.2uL BigDye
1uL Sequencing buffer (final at 1X)
0.15uL MgCl2 (final at 1.5mM extra)
0.75uL primer (final at 3uM)
2uL template
0.9uL H2O

Multiply by the number of samples you have and add 10% for pipetting error.  Distribute 3uL of this mixture to each well, and follow with 2uL of template.  I prefer to seal PCR plates for any thermal cycling applications with reusable silicone mats (http://www.phenixresearch.com/products/smx-pcr384-sealing-mat.asp; http://www.phenixresearch.com/products/mpcs-3510-sealing-mat-pressure-fit-lid.asp) since microseals gave me some grief many years ago (mostly edge evaporation).  You just need to wash these with water.  Making yourself crazy with bleach and autoclaving will shorten their life substantially, plus it's pretty much a waste of time.  Run the following thermal cycle: 95C 2min; 60 cycles of 95C 10s, 50C 10s, 60C 2min; 10C forever.
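The master-mix arithmetic, scaled here for a hypothetical plate of 96 reactions (per-reaction volumes from the recipe above, times the reaction count, times 1.1 for the 10% pipetting overage):

```shell
n=96   # number of reactions (example)
awk -v n="$n" 'BEGIN {
  f = n * 1.1                            # add 10% for pipetting error
  printf "BigDye:      %6.2f uL\n", 0.2  * f
  printf "5X buffer:   %6.2f uL\n", 1.0  * f
  printf "50mM MgCl2:  %6.2f uL\n", 0.15 * f
  printf "20uM primer: %6.2f uL\n", 0.75 * f
  printf "H2O:         %6.2f uL\n", 0.9  * f
  # 3 uL of mix per well, plus 2 uL template = 5 uL reaction
}'
```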

4) Retrieve your plate and get ready for cleanup.  For 384well plates there is not enough space for an ethanol cleanup, so I use a modification of the Rohland and Reich bead cleanup (http://enggen-nau.blogspot.com/2013/03/bead-cleanups.html).  Make a higher percentage PEG solution (25% instead of 18%) with this recipe (see other post for part numbers):

2650uL H2O
50uL 10% Tween-20
100uL 1M Tris (pH 7)
2000uL 5M NaCl
5000uL 50% PEG8000
200uL carboxylated beads


Mix the solution very well, and be careful pipetting the PEG as it is like honey.  Add 15uL to each sequencing reaction.  Seal thoroughly with adhesive foil and mix by inversion.  Spin the solution down gently, just fast/long enough to get the solution into the bottom of the wells.  If you see pelleted beads, you need to mix again and spin down more gently.  This may take some experimenting with your centrifuge.  I use an Eppendorf 5804R with an A-2-MTP rotor, and I let it spin up to about 1000 rpm and hit stop to get things into the wells.  Let stand for ~45 min.  The precipitation is somewhat time-dependent as well as concentration-dependent, so the longer you wait (to a point), the more sequence you will see close to the primer.
When your timer goes off, or you think you have waited long enough, apply your plate to a magnet stand (http://www.alpaqua.com/Products/MagnetPlates/384PostMagnetPlate.aspx).  Tape it in place on either end to keep it from moving.  Separation should take about 5 min, but waiting another 5 min doesn't hurt.  Now you can pipette the waste volume out, or you can do some inverted centrifuging and save a lot of time and tips in the process.  With my centrifuge, 1 min inverted spins on 3 folded paper towels (if the plate is full) at 400rpm work well.  It is very important that acceleration and deceleration are set to 1.  Any brown you see on the paper towel afterward is usually residual beads that didn't make it to the magnet.  I like to "lube" each well by adding 5uL 70% ethanol before the first spin; this reduces the viscosity and eases the solution from each well.  Multichannel repeat-dispensing electronic pipettes are very useful here.  After the first inverted spin, take the magnet/plate back to your bench.  Add 25uL 70% EtOH to each well.  No need to wait; take the plate right to the centrifuge and spin inverted again.  Repeat the 70% wash twice more.
After the third wash, allow beads to dry (~30 min at room temp, or 3 min in vacuum centrifuge at 60C, mode D-AL so rotor does not turn).  Note that over-dried beads can be very hard to resuspend, and this translates into samples where the DNA doesn't want to go back into solution.  Once dry, resuspend samples in 20uL sterile water.  It helps to seal with foil so you can vortex the plate.  Samples should look like mud during resuspension.  If you see any that look clear with brown flakes, keep vortexing.  Once samples have had the appearance of mud for ~2-5 min, place plate back on magnet and transfer 10uL to a 96 well plate for sequencing.  A little bead carry over will make no difference.  Just spin the plate down hard to pellet the beads before submission.  If you need to reinject any samples, you still have 10uL of backup sequence product.  Also note that you do NOT need to denature.  Cycle sequencing produces only single stranded products.  Put them right on the instrument.

5) Enjoy your data!!

A note on sequencer usage: our lab has both a 3130 (4 capillaries) and a 3730xl (96 capillaries).  They do the same thing, and yet their stock protocols were not equal.  I wondered at first if this had to do with something else in the instrument, but when I ran the same product on both instruments I got crappy, low-signal peaks on the 3130.  I checked, and it injected samples for less time and at a lower voltage.  Further, it cut off sequences after about 600 bases.  Ask your sequencing lab about the module they use.  If at all possible, have them set the injection voltage to 1.5 kV and the injection time to 20 sec, as this fixed all my problems.  I also extended the run time on the 3130 from 1200 to 1800s, and now I get 1000 bases of sequence.  We have many, many runs on both instruments (we just replaced the array on the 3130 after 1200+ injections), so this protocol adjustment isn't going to shorten instrument life at all.

And finally, some notes on my sequencing recipe.  We had some difficult samples last year and did a lot of troubleshooting.  First we ran a MgCl2 gradient (on the ABI control plasmid) to see how this affected results.  Addition of 1.5mM MgCl2 seemed to give an extra 20-40 bases of high quality data.  However, it was the addition of copious amounts of primer that made the real difference.  This allowed us to sequence very dilute samples with high success and get nice long read lengths (900+ bases).  We experimented with lower volumes of BigDye and got nice data with as little as 0.05uL/reaction, though the signal dropped off after a few hundred bases.  Perhaps more cycles would improve this, but evaporation can become a challenge when you start doing 100 cycles, and then the cycler runs forever and you can't get anything else done.  I have also found that faster cycling works OK with BigDye (try cutting extension time to 1 min and raising the temperature to 68C), but I am not confident enough to use it as a general protocol yet.

Wednesday, January 22, 2014

My electronic lab book

You may or may not have any experience with electronic lab books.  Many of the "better" ones are meant to be integrated with some sort of LIMS (Laboratory Information Management System/Software), which may or may not cost a lot of money and may or may not be useful to your specific needs (for a real joke check out LabBook Beta on Play Store).  Personally, I have tried to digitize various parts of my lab life for years, but I always come back to paper and pen, and securely taping important items (product inserts, gel photos, etc) into my notebook.  As a result, I now have numerous notebooks that span all the way back to 2001.  Since my notes are organized by date, I can usually recall approximately when something was done that I need to reference, but it can take some time to go through everything to find what I need.  I have also watched several other people do one task or another on the computer, their lab notes scattered among excel files, google docs, and the traditional lab book.  So I have been looking for an electronic notebook that is as similar to paper and pen as possible, and may allow for better organization.  Most importantly, it has to feel natural.  If I am forcing myself into the e-notebook exercise, it isn't going to work well and I will be back to paper pretty soon.

I've had a smartphone for about a year now, so I am familiar with the Android OS.  I also have an ipod video that ran faithfully from 2006 until recently, and occasionally help people out who prefer Mac OS.  Given the various issues getting the ipod to play nice with Windows and Linux, and my recent positive experience with Android, I was pretty sure I should go for Android.  Also, it hurts the pocketbook less.

The tablet.  Elegant-looking Samsung hardware.
I settled on a Samsung Galaxy Tab 3 10.1.  I got a refurbished device off Newegg for about $300 with shipping, and simultaneously purchased a cover, stylus, and screen protector.  The cover was another $15, the stylus $25, and the screen protector (pack of 3) was $6.  I had played around with some software using my phone, and planned to use the popular free app Papyrus (looks like a paper notebook) to test drive the new tablet.

Then everything arrived, and I learned a few things...  First, the tablet I purchased has only a capacitive screen.  These are far better than their resistive predecessors, but do not have the stylus functionality of a Galaxy Note series tablet (a few other manufacturers offer this as well).  The Note has an integrated stylus called an S-pen, which is a digitizing device.  When you enable S-pen functionality in your handwriting software, the screen no longer responds to your finger touch, as a means of "palm rejection."  Unfortunately, I had purchased an S-pen stylus that was totally incompatible with my capacitive screen.  And how was I going to make this thing work anyway?  I went to Staples and picked up a Wacom Bamboo Alpha stylus for $15, which seemed to have a finer point than most other capacitive styluses, was thick like a pen, and had decent user feedback online.

The cover.  Wakes and sleeps your device when opened or closed.
Unfortunately, I could use my chosen app (Papyrus) for writing only if I also kept a small piece of bubble wrap present to insulate my hand from the screen.  As I wrote across the screen I would have to stop and adjust the position of the bubble wrap.  This is not practical, and I was doubting myself already.  So I went through Google Play store and downloaded free versions of other possibly useful handwriting apps with decent reviews.  If they didn't have a free version to test, I just ignored them since I can't spend university money on apps that might be completely useless (hint hint, Devs).  I tested Papyrus, FreeNote, INKredible, and LectureNotes.  As I mentioned previously, Papyrus lacked palm rejection for a capacitive screen.  Same with FreeNote and INKredible, although INKredible definitely felt really nice when writing.  Hard to explain, but you need an app that lets your brain respond like it would to the physical act and immediate feedback (seeing your written strokes) of writing on paper.  The ONLY app I tested that has a useful palm rejection function is LectureNotes.  Luckily it writes well also.  There are a lot of people online disparaging the use of a capacitive screen, or even the functionality of palm rejection in LectureNotes, but I tell you it works very well.  Many people online suggested downloading a free app called TouchscreenTune to adjust the sensitivity of the screen to improve the palm rejection, but all this app does for me is open briefly before crashing, so it was no help whatsoever.  I did need to go out and purchase another stylus.  For $30, I picked up a Jot Pro by Adonit.  This is the only capacitive stylus you will find that has a usefully small tip.  It is embedded in a plastic disc that allows you to see your writing and not damage your screen.  A little strange at first, but you forget it's there pretty fast.  
Adonit has a new stylus called the Touch which has Bluetooth functionality and an onboard accelerometer to yield pressure sensitivity and better palm rejection, but the advanced functions don't work for Android (yet), only for iOS (ipad).  It is unclear if the company (or other app Devs) have any intention to port these functions to Android.

The stylus.  It's magnetic and sticks to the back of the cover.
Almost all the pieces were in place, but I still didn't have a completely functional electronic lab book.  I do DNA work, so I run a lot of gels that I am accustomed to taping into my notebooks.  Also, I wanted the ability to export my notes to the cloud so that I could share specific notebooks with collaborators.  This turned out to be pretty easy.  LectureNotes ($4.50 or so for full version) has a splendid amount of available customizations.  I can export each notebook as a pdf, and specify the destination folder in my directory structure.  Then I use a second app called FolderSync ($2.50 or so to get rid of ads) to sync the contents of that directory with a cloud service.  I chose Dropbox since I got 50GB free for purchasing the tablet, but I would probably use my Ubuntu One or Google Drive account instead if I hadn't had that resource.  FolderSync can use each of these services and many more.  After adding the computer I use to take gel photos to dropbox, I can now import gel photos by telling LectureNotes to import photo from...  Then I choose Dropbox and browse to the new photo, resize, and move it to the position on the page I want and done!!  In order to upload my notebook to the cloud, I still have to physically choose "export" in LectureNotes, but this goes pretty fast.

And now I have something that is working.  Certainly a Note series tablet (or other device with active stylus capability) would be better suited to my needs, but they are still pretty expensive.  I find myself already coveting the 12" Note that Samsung recently announced for release in the next few months, both for the increased real estate and the active stylus functionality (S-pen), but I expect this device to cost at least $700.  So to recap: you absolutely can use a capacitive screen and stylus for your lab book (detractors, please sit down!).  The tablet hardware may be important to my success, so I wouldn't count on a much cheaper device functioning as well.  I am using a Samsung Galaxy Tab 3 10.1 with LectureNotes (using heuristic palm rejection at 6000ms delay) and an Adonit Jot Pro stylus.  With FolderSync my notes are synced as pdf files to my Dropbox account for sharing with collaborators.  Happy e-notebooking, scientists!!

A shot of some notes I took today in LectureNotes, complete with gel photo.  My handwriting isn't much worse on the tablet than on paper.  I was also able to import the pages I had managed to produce previously in Papyrus by exporting them as .jpg and importing them as images to new pages I placed before my existing page

A shot of my old lab book for comparison.

Saturday, November 2, 2013

Remote Desktop Connection from Ubuntu

I have been enjoying Ubuntu Linux now since 2008.  Like many, I didn't see it as a viable replacement for Windows, as I still require the suite of MS Office programs in order to functionally collaborate with colleagues.  In 2011, I acquired a netbook which came with Windows 7 but was far too puny to run this OS properly, let alone drive normal programs once it came up.  Before long, I ditched Windows from this machine in favor of Ubuntu (11.04 at the time, now 13.10).  This isn't a super laptop, far from it, but it is nice to know there is an OS I can handle with my ultra-portable, bombproof netbook.  The thing has a 32GB SSD, 2GB SDRAM, wifi, and an 8.9" screen.  I used the xrandr command to build a short script that changes the lame 1280x600 native resolution of the screen to a more comfortable 1368x768, and with Libre 4.1 and access to the offerings of Google Drive, I am more compatible than ever with my Windows/Mac-loving colleagues.  And still, I can't shake Windows, as I have several machines I use at work.

Until an Ubuntu edition of MS Office is available (ever?), I probably will never get away from Windows, but today I found one more reason to need Windows even less.  I was bouncing from computer to computer today taking data from various places and consolidating everything in Google Drive spreadsheets.  Once collated, I will then need to send my data to my on-campus Windows image where certain statistical packages reside (JMP, SAS).  I was busy all day with lab work and lamenting that I was going to have to come back in tomorrow to do this work, or else stay very late.  If only I could run my stats from my couch...

This is when I discovered rdesktop, an easy to use client for connecting to a Windows Remote Desktop Connection from an Ubuntu computer.  From Ubuntu, it is easy to install (probably in the software center too):

sudo apt-get install rdesktop

It is a small application and installs quickly.  You are almost done...

From the terminal (doesn't come up in the dash), type

rdesktop servername

(for me, rdesktop vlab.nau.edu)

You should be at your familiar login screen.  For other NAU users, you will need to click the "other user" button and change your domain (e.g. NAU\username) to log in.  However, the native rdesktop window is uncomfortably small, and for some reason window resizing is not an option.  Fortunately you can use the -g option to set a specific resolution (say, -g 800x600) or a percentage of your screen size.  I like the percent option and found 90% to work best in most cases (with the scaling, a portion of your window can protrude into neighboring workspaces at >95%).  So now I login as follows:

rdesktop vlab.nau.edu -g 90%

But that is too much to type, so I wrote a little one line script to fill in the details for me.  In my local scripts directory I did the following:

nano vlab

This puts me into the text editor nano and starts me editing the new file called vlab.

Add the following text to the text editor:

     #!/bin/bash

     rdesktop vlab.nau.edu -g 90%

Hit Ctrl+X to exit nano, saving as you go.

Change the permissions of the file:

sudo chmod a+x vlab

Test your script locally:

./vlab

If it works, copy the script to your bin directory so it will be called no matter your working directory:

sudo cp vlab /bin/

That's it.  Now I can go home and run my stats, and all I have to do to get the thing running is open a terminal and type:

vlab

Best of all, I can now access a Windows computer running remotely from my Ubuntu Linux netbook, giving me one less reason to need/desire Windows on my portable computer.  Of course someday I will graduate, but that also means I could purchase a competent desktop computer in the future, keep it at work (or at home if workplace firewalls are too cumbersome), and access all my Windows needs from elsewhere, and negate any need to maintain synced cloud accounts (Dropbox, Ubuntu One etc) for my workplace documents.

Happy remote desktoping, Ubuntuers!

Wednesday, May 1, 2013

Mystery of pH change when flying

I have a little story to relate here, and I would be interested to hear back from anyone who has an idea what is happening.

It all started last summer when our lab took on a project for another lab at a different institution.  The researcher shipped me their DNA in plates, plus primers, and instructed me to perform multilocus genotyping on roughly 600 samples.  Upon receipt, I ran a quick PCR check of a few samples for each locus, and everything looked beautiful, so I tossed the project into the freezer, intending to process everything in a week or two when I knew I could devote time directly to this job.  When I got back to the project, I ran the same quick PCR check just to be sure, and this time nothing really worked.  Perplexed, I repeated the exercise, thinking that perhaps I had forgotten to add something crucial, but again the same non-result.  I spent the next two weeks frantically troubleshooting this project, hesitant to contact the client since I had no idea what had happened to once perfectly good DNA that had been opened only once and had been kept in a freezer with no temperature fluctuations.

Did I contaminate the DNA with some degrading compound in the brief period I had it opened?  This seemed unlikely since I do this process all the time, using the same lab practices I used during these PCR checks.  Eventually I contacted the other lab and they sent me more DNA to work with.  When I received that shipment, I processed all of the samples immediately in fear that they also would degrade.  During this processing, I stumbled onto a bit of evidence about what may have happened.  In my PCR mix, I use phenol red as a colorant.  This is also a handy pH indicator which is a lovely dark red above about pH 8, but goes to an alarming yellow when the pH drops.  I was doing small PCR reactions (4uL in 384well plates), so I had 3uL mastermix in a plate to which I was adding 1uL DNA.  I added some DNA to a set of these reactions, and watched as they immediately changed from red to yellow.  Immediately I took a few microliters of a sample and streaked it across a pH strip -- pH 5!!  This prompted me to inquire to the other lab about the method used to extract the DNA and the buffer in which it was stored, etc.  Samples were all extracted by the popular Qiagen kit, but this was actually done at a third lab so they weren't sure of the storage buffer.  I was put in contact with the next lab, and they claimed the DNA was always eluted in Tris-Cl pH 9.0 (hey, my favorite buffer!!).  I insisted this couldn't be the case and wondered if they had accidentally used nanopure water from an RO source or something that might actually have such a low pH, but they stated otherwise, and there was no use arguing anymore.  I finished processing the samples and put the whole mess behind me, thinking it would always remain a nagging mystery.

In November I traveled to another lab to learn a technique for a new instrument we had received.  I brought some DNA with me that I had prepared for the process, but, paranoid as I can be, I decided to bring all the pieces of my chemistry along in case anything went wrong.  These included several plates of PCR reactions containing phenol red.  Everything traveled with me in my luggage in an insulated container with the samples on a bit of dry ice.  I inspected the contents upon arrival, and all seemed in order, so I tossed them into the freezer.  The next day, I prepared to process these samples, retrieving them from the freezer and allowing them to thaw.  I was making some notes in my lab book when I picked up a plate to check whether it had thawed, and was horrified to find everything had gone yellow!

"Not again," I thought.

I frantically started looking for some Tris buffer to bring the pH back to where it should be.  Surprisingly, the lab I was in had none on hand, so I headed down the hall, bothering anyone I found in a lab for a little Tris.  I located some within 10 minutes, took an aliquot into a Falcon tube, and headed back to my precious samples.  I grabbed the first plate, and just before I tore the foil seal off, I saw the wells had gone from yellow back to red.  What the hell??  Upon closer inspection, I saw this was only the case in the few wells I happened to have opened briefly, and thus exposed to the atmosphere, before resealing.  Curious, I tore the foil seal off, put a new seal on, vortexed, and spun my plate down.  Now all the wells were back to red.

So exposure to the atmosphere seemed to have solved my pH problem.  But what can pH do to DNA?  DNA is actually a pretty stable molecule (read up on the RNA world hypothesis for why).  It is an acid, and as such is most stable in a slightly basic buffer solution (that's why we love Tris so much).  Raise the pH too much, though, and the bases no longer pair (e.g., alkaline denaturation as in Illumina preps); drop it too low and other bad things start to happen.  Low pH, I have read, leads to depurination (loss of A or G bases), effectively fragmenting DNA into unusable bits (no longer than about 30 nucleotides).  How low does it need to be?  In theory, anything acidic will contribute to this effect, but the more acidic you get, the faster it occurs.  If my chemistry background serves correctly, things really start to change as you approach the pKa of DNA, which is somewhere around pH 5.0 -- right about where I had measured the DNA from the project last summer.  Storing DNA in water rather than a buffered solution is known to be less than ideal, and this may be related to atmospheric carbon dioxide dissolving into standing water as carbonic acid, but the pH of such water is generally measured around 6.7 or so -- not terribly acidic at all.
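The carbonic acid effect is easy to estimate.  Here is a back-of-the-envelope sketch, assuming standard 25 C textbook values for the Henry's law constant of CO2 and the first dissociation constant of carbonic acid, and ignoring the second dissociation and activity corrections:

```python
import math

# Pure, unbuffered water equilibrated with atmospheric CO2 at 25 C
K_H  = 3.3e-2    # Henry's law constant for CO2, mol/(L*atm)
pCO2 = 4.0e-4    # atmospheric CO2 partial pressure, atm (~400 ppm)
Ka1  = 4.45e-7   # first dissociation constant of carbonic acid

co2_aq = K_H * pCO2           # dissolved CO2, mol/L
h = math.sqrt(Ka1 * co2_aq)   # [H+] from CO2 + H2O <=> H+ + HCO3-
print(f"equilibrium pH ~ {-math.log10(h):.1f}")
```

The full-equilibrium answer comes out near pH 5.6, lower than the ~6.7 typically measured at the bench -- consistent with standing water rarely reaching complete equilibrium with the air (and with pH electrodes being notoriously unreliable in low-ionic-strength water).  Either way, dissolved CO2 alone gets you into mildly acidic territory, not to pH 5 in a nominally Tris-buffered eluate.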

So what could be happening here?  When you place samples into a plate sealed with foil, your storage buffer slowly evaporates (or sublimates) over time, presumably due to slow air exchange through the adhesive layer holding your foil in place.  During an average airplane flight, despite the pressurization of the cabin, everything on the plane is at a markedly lower pressure than when the plane is on the ground.  The gaseous composition of the cabin isn't terribly different from what you find at sea level -- otherwise there wouldn't be enough oxygen to remain conscious at 35,000 ft.  So: lower partial pressures of the gaseous components, plus the subtle permeability of your sealed plate.  That should release dissolved gases back into the atmosphere, and the loss of carbonic acid should actually raise the pH.  But that's not what I saw, and everything can be fixed by thawing, briefly removing the seal, and applying a new one upon arrival at your destination.  So the problem is solved, but what was the problem in the first place?
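We can even put a number on the outgassing argument.  A rough sketch, assuming a typical cabin pressure of about 0.75 atm (the common 8,000 ft cabin-altitude figure -- an assumption, not a measurement) and an unchanged CO2 mixing ratio, so the CO2 partial pressure scales with total pressure:

```python
import math

K_H = 3.3e-2    # Henry's law constant for CO2, mol/(L*atm), 25 C
Ka1 = 4.45e-7   # first dissociation constant of carbonic acid

def ph_from_pco2(pco2_atm):
    """pH of unbuffered water in equilibrium with a given CO2 partial pressure."""
    h = math.sqrt(Ka1 * K_H * pco2_atm)
    return -math.log10(h)

ground = ph_from_pco2(1.00 * 4e-4)   # sea level, ~400 ppm CO2
cabin  = ph_from_pco2(0.75 * 4e-4)   # same mixing ratio at ~0.75 atm cabin pressure
print(f"pH shift from cabin pressure: {cabin - ground:+.2f}")
```

The shift is only a few hundredths of a pH unit -- in the expected direction (up), but nowhere near enough to matter.  Whatever flipped the phenol red from red to yellow, CO2 exchange during the flight doesn't seem to be a big enough lever by itself.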

Anyone??