The Twelve Bar Blues of Open Science
This post originally appeared on the Software Carpentry website.
Most musicians can play along with a twelve-bar blues once they know what the key and tempo are. Many kinds of scientific work are equally well structured: the results aren't predictable—it wouldn't be research if they were—but the equipment setup, sample preparation, note-taking, statistics, and write-up are (mostly) structured in ways that other scientists are familiar with. This lets them pick up each other's projects more quickly; it also gives scientists more time to do what's unique about a particular project because they don't have to spend time thinking about the things that aren't.
Open science, by which I mean all the new ways of doing science that the web has inspired, isn't there yet. Many people have tools and techniques that work well for them, but every setup is one-of-a-kind, and everyone has had to assemble the pieces for themselves. I think we could do better, and the model I keep coming back to is Ruby on Rails.
A lot of Ruby on Rails' original simplicity is now hidden under a mass of extensions and auxiliary tools, but when it first appeared in 2004, it was a minor revolution. One reason was the "create a blog in 15 minutes" screencast that showed people just how easy simple things could be. Another was its emphasis on convention over configuration: instead of letting web developers do things however they wanted, it said, "Here's where stuff is and here's how it works." Once a competent developer knew what the application was supposed to do, she could (almost) immediately start building the things that were specific to it, and other competent developers could join in (more) easily. What's more, Rails' predictable structure and workflow made it easier for newcomers to adopt: certainly easier than contemporary competitors, which all too often required novices to make key decisions before they had the knowledge or experience to do so well.
So what's the equivalent for science? What would be in a combined template/tool/framework/worldview for small- to medium-sized scientific projects that would let scientists do science our way with near-zero startup overhead? (Note that I'm not asking, "What web programming framework should they use?") A few things I want are:
- Data organized in a particular way. I'd use William Stafford Noble's "A Quick Guide to Organizing Computational Biology Projects" as a starting point, though it's obviously not right for all cases. I also realize that projects using Data of Unusual Size will probably store metadata and slices locally, rather than entire data sets, but that's no worse than the diversity of data, images, and auxiliary files stored by some small web apps.
- Scripts too. The little bits of code used to do "last lap" statistics, create graphs, and so on should be in a predictable place, and be invocable in uniform ways.
- Everything automatically accessible on the web, for the long term. As soon as I create a project, other scientists should be able to query my data: not just download it, but query it. OData and GData are a step toward a solution for data; the scientific equivalent of `rails new project` should automatically generate the wrappers needed to pull a project's data and metadata. Everything else should automatically be accessible too: every paper, figure, or table should have a DOI and be searchable and fetchable. And note that "accessible" doesn't just mean "the bits are available": without correct documentation of formats and semantics (which most scientists never quite get around to writing by hand), data rusts almost as quickly as code.
- Standard commands to tie it all together. I mentioned `rails new project` a moment ago; Rails, Django, and other frameworks of the same ilk create a command application for each project so that people can add new features in a reliable, findable way with a few keystrokes. Science projects should have something similar for adding new data sets, producing new results, etc.
- Pluggable. Rails didn't just put the application code and data in particular places; it also provided slots for adding new management commands and analysis tools.
- Forkable and mergeable. Whatever Rails-like system scientists use must play nicely with version control, because as I wrote a couple of weeks ago, future scientists won't submit, publish, and download papers: they will fork and merge projects.
- Federated. The web lets us access information stored anywhere on the planet; ironically, our response so far has been to centralize like never before. Wikipedia, Facebook, and GitHub are effectively all single points of failure: one court order, change of policy, or (as we recently discovered with Mendeley) ideologically incompatible acquisition can have wide-reaching impact. We know how to distribute information so that it can survive the disappearance of its initial host—BitTorrent is just one example. I hope that whatever else "Science on Rails" is, it will be robustly federated from the ground up.
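To make a few of these wishes concrete, here is a minimal sketch of what a project command with a conventional layout and a plug-in slot might look like. Everything in it is an assumption made for illustration: the directory names are loosely modelled on Noble's guide, and `project.json`, `new`, `add-data`, and the `command` decorator are hypothetical names, not part of any existing tool.

```python
import json
from pathlib import Path

# Hypothetical conventional layout (an assumption, loosely after Noble's guide).
LAYOUT = ("data", "src", "bin", "results", "doc")

# Registry of project commands; plug-ins add entries with the decorator below.
COMMANDS = {}


def command(name):
    """Register a function as a named project command (the 'pluggable' slot)."""
    def register(func):
        COMMANDS[name] = func
        return func
    return register


@command("new")
def new_project(root):
    """Create the conventional directories and a machine-readable manifest."""
    root = Path(root)
    for name in LAYOUT:
        (root / name).mkdir(parents=True, exist_ok=True)
    manifest = {"name": root.name, "layout": list(LAYOUT), "datasets": []}
    (root / "project.json").write_text(json.dumps(manifest, indent=2))
    return root


@command("add-data")
def add_data(root, filename, description):
    """Record a data set and its description in the project manifest."""
    manifest_path = Path(root) / "project.json"
    manifest = json.loads(manifest_path.read_text())
    manifest["datasets"].append(
        {"path": f"data/{filename}", "description": description}
    )
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest


def run(name, *args):
    """Dispatch a command by name, as a command-line wrapper would."""
    return COMMANDS[name](*args)
```

Calling `run("new", "my-project")` followed by `run("add-data", "my-project", "samples.csv", "raw counts")` would leave a predictable directory tree plus a manifest that other tools could read. Because the manifest is plain JSON kept in the project directory, it also works with version control, which matters for the forkable and mergeable wish. Querying and federation are well beyond a sketch of this size.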
What else would you add to this list? What would you want to see in a framework for 4×8 projects (i.e., ones where four people work together for eight months to produce a result)? What pieces of the solution are already out there, waiting to be integrated?