So I've recently had a little setback wherein the samples I've gathered and been attempting to process for the past several months (nine, start to finish if I'm counting correctly) aren't fit for purpose. I can get perfectly good RNA out of them, but not RNA that is sufficiently good for sequencing.
Unfortunate really, given that that's the entire basis of the data for my PhD, and that I need a good time course of host/pathogen sequence data around which to build my models. No worries, apparently! Just go and get some datasets from other studies that have been published and use them to build your models! This has been suggested to me several times over the past year or two. Each time I've dutifully gone off and looked for some of these magical published datasets. I've even gone searching a few times when no one has suggested it.
The only difference the last time I went searching was that I actually found a relevant study with published data. I use the singular deliberately there. One. This one, if anyone is particularly interested.
There are institutions like NCBI that are collecting a large amount of data: sequence data, protein structure data, metabolomic data. There are 13 databases focused on RNA listed by Wikipedia, but most of them specialise in specific subsets of RNA - not the transcriptome in its entirety. One thing I think we forget though is that even though we're collecting large amounts of data, the amount of data generated by biologists in the last decade or two is significantly larger. Which means that even if there are people working on similar topics, there's a decent chance that the data either a) is difficult to find or b) hasn't been published in the first place*.
The other possibility that springs to mind is that there just haven't been many experiments of the sort that I am doing. One of the things that I have slowly gotten used to over the past year or two is the realisation that a lot of the work that I assume is fairly basic for understanding biological systems just hasn't been done. There's so much of it, it has only recently become possible to study whole systems, and there's a distinct lack of both money and people to do the work. So when I go out looking for host/pathogen sequence data in plants, even though the conventional wisdom that seems to have seeped into the scientific mindset is that there is plenty of data out there, I find one useful example.
So much to do, so little time (and money).
*Approaches like that proposed by the DNA Digest group will, I think, be quite useful in opening up access to the masses of data that have been backed up and then not used on thousands of servers around the world. Fingers crossed.