I have been working all afternoon to try to get something to make sense from the sequence data Christina and I have.
I have met with no success.
Well, maybe minor success.
Success at tracking things down.
Kind of like when you’re hiking, and you finally realize YOU’RE OFF THE TRAIL. So that’s kind of critical and a key piece towards getting on the trail. But I’m at that stage where I’ve just sorta recognized that I’m on a side trail and I have some idea which direction to walk to get to the main trail, but I’m still a long ways off. Might have to camp in between.
So, basically, I finally realized that this github project that I forked and have been trying to use since I was in Indiana – all because I wanted these little pretty network graphs, and I thought this idea of OTUs instead of species was ‘cool’ [seriously, discovered this with a google image search for like ‘beautiful microbes’ or something equally absurd] – is expecting not the raw sequence data from the company, which is what I have, but something much more processed.
In the meanwhile, while being confused by my github files, I have contemplated this system called Mothur, and another thing called USEARCH, and another thing called QIIME.
Things that exist on my computer now:
dhcp-v214-067:tools ellie$ ls
FunFrame_v0.9.3        mothur            python_scripts
HMMFRAME_3_16          otu-frontiers     helper scripts
prinseq-lite-0.20.4
Mothur is the hilarious-est because its logo is seriously a mother nursing a baby. WTF?!
But, I didn’t make it far with that because, you know, I’m kinda lacking biology knowledge to make sense of things.
Part of USEARCH is UPARSE, which I learned is meant to be used as a pipeline.
Encountering this pipeline page was where I finally realized there was some serious step-wise processing that had to happen; not just feed your files to the algorithm and get something out.
Manual feeding of files into script 1.
Take output and manually feed into script 2.
Repeat for all four million steps of the ‘pipeline.’
So, I sort of stumbled through a UPARSE process, and did in fact produce several files through their pipeline. But I wandered over (and around) lots of stumbling blocks and didn’t feel like I was really doing it right.
So I kept bouncing back to this github project, which seemed simpler, maybe just because there are only a small handful of files and there is no documentation to confuse me (nor documentation to tell me what the hell to do) … And then FINALLY discovered that, with that code, I ALSO needed to be doing some pre-processing of my data.
I found their publication (from just last month!) about the marsh bacteria which I’d originally seen on one of their blogs. And in this paper, I read that they run their files through this thing called FunFrame – which is another of their own systems – before they use the stuff that’s up on github to make the graphs.
So, the picture above is of the flow chart that’s included with the download of FunFrame that’s linked from this previous paper. It was SO NICE to find this file and see what I’m supposed to include and how it all fits together! Even if it is crazy huge!
And, I’m thinking, yes, I have data files that are one per sample, OH GOOD! I CAN DO THIS! WITHOUT KNOWING SCIENCE!
So I briefly got all excited about that, but was quickly stymied by getting their FunFrame thing working because of all its dependencies: mainly NumPy, which is all 32-bit and doesn’t play nice with my 64-bit Python.
So, then I found a nice little scientist-specific Python that includes all the things in a simple one-click install.
I got registered for my free copy as an academic, and then, while it was downloading, went back to reading the FunFrame directions.
Damn if it doesn’t say that actually those .fasta files that sit at the top of the flow chart as inputs are, in fact, supposed to be pre-processed FIRST by something like – yup, you guessed it – Mothur or QIIME. HAHAHA.
So, I went back to the USEARCH thing and stumbled around some more, and made it through to the last step that produces a file that maybe contains some OTUs. But, what to do with it?
So, I tried to do the last step, which says you make a table with it, and this thing that I had skipped at the beginning because I couldn’t figure out what information it wanted, well, now it’s hurting me. Because the last step just says:
**ERROR** barcodelabel= not found in read label 'M02233:23:000000000-A75VJ:1:1101:21328:1808 1:N:0:1'
I mean, I know there’s no barcode, because I don’t think there was one to begin with, and so I didn’t run the removal script. And not only for that reason: I also didn’t know what information it wanted, and the page that said to run it, well, I didn’t encounter that page till after I’d already moved along to a step past it – and why go backwards? I never hike backwards, I don’t care how lost I am; I prefer to soldier on in the general direction of the trail – so … yes.
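For future me: what that error seems to want is a `barcodelabel=<sample>;` tag inside each read’s label, which is how the table step tells samples apart. Since my files are already one per sample, I suspect I could just stamp the sample name in myself rather than running the barcode-stripping script. A minimal sketch of that idea (the sample name `S1` and any filenames are made up by me, not from the UPARSE docs):

```python
# Stamp a barcodelabel= tag onto every FASTQ header so the UPARSE
# OTU-table step can tell which sample each read came from.
# Assumes one FASTQ file per sample (like my data), so the "barcode"
# is really just a sample name I pick myself.

def relabel(fastq_lines, sample):
    """Yield FASTQ lines, replacing each header (every 4th line)
    with '@<sample>.<n>;barcodelabel=<sample>;'."""
    n = 0
    for i, line in enumerate(fastq_lines):
        if i % 4 == 0:          # header line of a FASTQ record
            n += 1
            yield "@%s.%d;barcodelabel=%s;\n" % (sample, n, sample)
        else:
            yield line
```

Then something like `open('S1_R1.relabeled.fastq', 'w').writelines(relabel(open('S1_R1.fastq'), 'S1'))` would rewrite a whole file (those filenames are hypothetical).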
Now I am going to bed.
And I have no cool graphs.
But I know I need to know some things, like:
What is the barcode sequence in our data? Is there one?
What is the read layout of our data? http://drive5.com/usearch/manual/read_layout.html I think it’s the Illumina one. Why do we have R1 and R2 for each sample? Are they paired?
What’s the primer sequence that was used? (Needed for, e.g. http://drive5.com/python/fastq_strip_barcode_relabel_py.html )
What can I learn from this stats file to actually choose proper parameters for quality trimming/filtering the fastq data and transforming it into trimmed fasta data, instead of just making up some that are “reasonable,” as suggested by the guy who writes USEARCH (where ‘reasonable’ in his suggestion ranges from 0 to 1, when I think we’re talking percentages… ??)
S1_R1_stats.log
Started Sat Mar 15 18:59:14 2014

Read length distribution
      L          N      Pct   AccPct
-------  ---------  -------  -------
 >= 301      54659    45.3%    45.3%
    300      61874    51.3%    96.6%
    299       1651     1.4%    98.0%
    298       2433     2.0%   100.0%
    297         16     0.0%   100.0%
    296          4     0.0%   100.0%
... [and more tables] ...
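One thing I did eventually figure out about that “reasonable, 0 to 1” business: the USEARCH filter’s maxee threshold is an expected number of errors per read, not a percentage. Each base’s Phred quality Q translates to an error probability of 10^(-Q/10), and you just add those up across the read. A minimal sketch of the math, assuming the quality strings are Phred+33 encoded (which I believe MiSeq fastq is):

```python
# Expected errors of a read = sum of per-base error probabilities,
# where a Phred quality Q means P(error) = 10^(-Q/10).
# Assumes the quality string is Phred+33 encoded (ASCII offset 33).

def expected_errors(qual):
    return sum(10.0 ** (-(ord(c) - 33) / 10.0) for c in qual)

# 'I' is Q40 (P = 0.0001), '+' is Q10 (P = 0.1): forty Q40 bases
# carry only ~0.004 expected errors, while ten Q10 bases already
# add up to a full 1.0 expected error.
```

So a maxee of 0.5 means “keep reads expected to have fewer than half an error,” which is why the suggested range sits between 0 and 1 even for 300-base reads.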