Ellie Harmon

colorful circles

Finally some success at bioinformatics!

I spent far too much time this weekend playing with the bacteria taxonomy data again.

This time, I’ve got a couple of lovely things to show for it, though!

I adapted mbostock’s clustered circles example to work with our taxonomy data a couple of weekends ago, before my last meeting with Christina. One of the problems I had in displaying them, though, is that the highest taxa level, phylum, still has like 22 different members. This is too many different colors to make sense of visually.

So, Christina told me that usually when scientists report on bacteria, they only graph colors for the top 4 or so phyla, and then use an ‘other’ category for everything else. This would make the color-choosing much more reasonable!
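
A rough sketch of that top-few-plus-‘other’ grouping, for the curious. The function name and the field names (`phylum`, `count`, `group`) are made up for illustration and are not taken from the actual data files:

```javascript
// Hypothetical helper: keep the N most abundant phyla, lump the rest into "Other".
// `rows` is assumed to look like [{ phylum: "Firmicutes", count: 120 }, ...].
function groupTopPhyla(rows, topN) {
  // Total the counts per phylum.
  const totals = new Map();
  for (const row of rows) {
    totals.set(row.phylum, (totals.get(row.phylum) || 0) + row.count);
  }

  // Rank phyla from most to least abundant and keep the first topN names.
  const ranked = Array.from(totals.entries()).sort((a, b) => b[1] - a[1]);
  const keep = new Set(ranked.slice(0, topN).map(([name]) => name));

  // Tag each row with the group it should be colored by.
  return rows.map((row) => ({
    ...row,
    group: keep.has(row.phylum) ? row.phylum : "Other",
  }));
}
```

With `topN` set to 4, this leaves five groups to pick hues for: the four biggest phyla and one catch-all.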

Yesterday, I was quite pleased to get these displaying, with the new colors worked out. Sample 1 is up at the top of this post. For comparison, here’s Sample 2:

This is the color key for both:

Christina and I had also talked about keying the hues off of this phyla / other grouping, and then shading each node within a given cluster based on class. So, I spent some time this morning messing with the colors again, getting the shade variation working.

Here is an earlier attempt, showing Sample 1, and then the color scheming I settled on, also showing Sample 1.

This involved a lot of tweaking to get colors that looked pleasing, plus manual re-arranging of the color assignments so that they all seem well-differentiated and so that clusters with lots of internal variation have colors that work better for that (e.g. you can see how the green can take lots of shading before it becomes invisible; yellow & orange, not so much; that turquoise color I used for a while was awful looking).

ALSO, while I never quite got it sorted out exactly what state our raw sequence data is in, or if it even makes sense to do my own OTUs (they have already gone through a UPARSE OTU categorization by the sequencing company, it turns out), I DID figure out how to recreate, with d3 instead of R, the style of chart that the Beautiful Bacteria people were making.

Mostly this involved starting, again, with one of Mike Bostock’s excellent examples, and then spending quite a bit of time with some python scripts, translating my data from a wonky CSV file into something ready to be network-graphed. Then, tweaking the javascript just a little bit.
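
To give a feel for what that reshaping step has to produce, here is a minimal sketch. It is not the actual script (that was python, and the real CSV layout isn't shown here); it is written in javascript to match the rest of the code in this post, and the row shape, field names, and `buildGraph` itself are all assumptions for illustration:

```javascript
// Sketch only: the row shape below is an assumption, not the real CSV layout.
// Each row: { order: "Bacillales", phylum: "Firmicutes",
//             counts: { "Sample 1": 12, "Sample 2": 0 } }
function buildGraph(rows, sampleNames) {
  const nodes = [];
  const links = [];

  // One node per sample; remember each sample's index in `nodes`.
  const sampleIndex = new Map();
  for (const name of sampleNames) {
    sampleIndex.set(name, nodes.length);
    nodes.push({ name: name, type: "sample" });
  }

  // One node per order, linked to every sample where it was found.
  for (const row of rows) {
    const orderIndex = nodes.length;
    nodes.push({ name: row.order, type: "order", phylum: row.phylum });
    for (const name of sampleNames) {
      if ((row.counts[name] || 0) > 0) {
        links.push({ source: orderIndex, target: sampleIndex.get(name) });
      }
    }
  }

  return { nodes: nodes, links: links };
}
```

The `{ nodes, links }` shape, with links pointing at node indices, is the kind of input a force-directed layout expects. Orders with a zero count in every listed sample still become nodes, just with no links attached.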

I’m so much more comfortable messing around with python & javascript than R. Check these out:

Each colored circle represents a particular order of bacteria. Colors are coded at the phylum level, same key as above. White labeled circles represent different samples. Lines connect each order of bacteria to the samples in which it was found.

The first one shows the comparison of similarity between the three samples I took at the first collection site.

The second one is for the samples at the third collection site. I think it’s weird that Sample 7 only contains things that exist in Sample 8 and Sample 9, nothing unique. And, also, that Samples 8 & 9 share a bunch of things that aren’t in Sample 7 at all. Weird!

I love how the circles representing the orders of bacteria that aren’t present in this particular sample cluster around the edges.

Here are some denser networks showing all of the samples that were taken (1) directly from the footpath of the trail, (2) from water, (3) from the soil beneath a plant.

I’m really excited about this!!

I still think I’m not quite doing the color thing in the best way, in particular by not really taking advantage of d3’s scale system, but I can’t quite wrap my head around it. Instead I’ve got this beast:

// `classes` (shade cache) and `clusters_darker` (base color per cluster) are defined elsewhere.
function assignColor(dataObject) {
	// First node seen for this cluster: start its shade counter at zero.
	if (typeof classes[dataObject.cluster] == "undefined") {
		classes[dataObject.cluster] = {"count": 0};
	}

	// First node seen for this class: brighten the cluster's base color one more step and cache it.
	if ((typeof classes[dataObject.cluster][dataObject.Class] == "undefined")) {
		var cluster_color = d3.lab(clusters_darker[dataObject.cluster]);
		classes[dataObject.cluster][dataObject.Class] = cluster_color.brighter(classes[dataObject.cluster]["count"]*.9);
		classes[dataObject.cluster]["count"] = classes[dataObject.cluster]["count"]+1;
		return classes[dataObject.cluster][dataObject.Class];
	}
	else {
		return classes[dataObject.cluster][dataObject.Class];
	}
}
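
One possible way to tame the bookkeeping, offered as a sketch rather than a fix: keep a separate lookup per cluster and hand each newly seen class the next shade, which is similar in spirit to an ordinal scale that builds its domain as it goes. This does not actually use d3's scales. `makeColorAssigner`, `baseColors`, `brighten`, and `step` are hypothetical names, and the brightening function is passed in so the sketch runs without d3 loaded:

```javascript
// Sketch: one lazily built shade lookup per cluster.
// `baseColors` maps cluster -> base color; `brighten(color, k)` is supplied by
// the caller. With d3 loaded it could be, for example:
//   (color, k) => d3.lab(color).brighter(k).toString()
function makeColorAssigner(baseColors, brighten, step) {
  const shadesByCluster = new Map(); // cluster -> Map(class name -> color)

  return function assignColor(d) {
    if (!shadesByCluster.has(d.cluster)) {
      shadesByCluster.set(d.cluster, new Map());
    }
    const shades = shadesByCluster.get(d.cluster);

    // First time this class is seen in its cluster: brighten the base color
    // by (number of classes already seen) * step, then remember the result.
    if (!shades.has(d.Class)) {
      shades.set(d.Class, brighten(baseColors[d.cluster], shades.size * step));
    }
    return shades.get(d.Class);
  };
}
```

Because the counter is just the size of the per-cluster lookup, there is no separate `count` key that could collide with a class name, and the state lives inside the returned function instead of in a global.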

I’m also still really curious about all the huge no-hit categories – what are the mystery sequences that were in them? And, what do they mean?

I’m not sure how to think about this in a helpful way with graphs and charts yet, which is all I want to do at the moment, apparently…

Christina & I are working on getting a website up and running, so maybe soon I can actually share the interactive versions of these, which are pretty fun.