Larry Page would like to get back to his fishes, the catfish and loaches he studies as a University of Florida ichthyologist, but first he and his colleagues have some work to do on plant, animal and fossil specimens — millions of them.
Page and an army of helpers are on year three of iDigBio, a 10-year, $12 million effort to digitize the biological specimens tucked away in museum collections across the country. He estimates there could be 1 billion nationally, representing the collected knowledge of biological diversity for vast swaths of the planet.
Although it has taken him away from his fishes, the effort already has contributed to his work in an unexpected way.
“A student and I just found a database where specimens of a loach have been collected in Pakistan,” Page says. “We thought these loaches only went as far west as India, but there they were, in Pakistan.”
Such moments of discovery can be rare. Scientists can labor long stretches between them, sometimes whole careers without them. But with iDigBio, Page says, “these aha! moments could almost be routine; that’s what this is all about, and it is happening all the time.”
Centuries of knowledge are stored in museum collections around the country, from a fern collected, pressed and labeled hundreds of years ago, to an insect collected just last year. For taxonomists and other scientists, finding these specimens can be daunting.
“There are specimens that have been around for 100, 200 years, but they’ve been in a drawer somewhere, and it’s hard to know where everything is,” says Page, the director of iDigBio. “If it’s online, you can touch a button and find in seconds what it might have taken you a lifetime career to know was there.”
By the most recent estimate, humans share the Earth with 8.7 million other life forms. Species are being lost and discovered all the time, but even with all the collected knowledge, much, much more remains to be learned. Scientists estimate that 86 percent of land plants and animals and 91 percent of those in the sea have yet to be identified.
Lately, says researcher Pamela Soltis, those discoveries are in specimen drawers.
“Most species discoveries are actually made in museums, not in the wild, anymore,” says Soltis, who studies molecular systematics and evolutionary genetics as distinguished curator at the Florida Museum of Natural History.
The specimen drawers in most museums are a treasure trove. In most cases, only a tiny fraction of what a museum owns is on display. In storage, a specimen is sometimes overlooked, sometimes incorrectly named or poorly described. The beauty of keeping it all is that a second look, years later, by a student or a scientist, can uncover an entirely new species. Armed with new methods, like genetic sequencing, researchers can make new discoveries or fill a gap in an evolutionary path. Digitizing that information could speed up the pace of such discoveries. The tree of life, Soltis says, could be far more complex than we think it is.
“There’s more information about biodiversity in museum collections than any other place — except nature itself,” adds Page, “but the problem is, it’s really difficult to get.”
Scientists who want to examine specimens outside their institution must travel far and wide, generally to several institutions. Alternatively, they can ask for a loan, which means someone at the host institution must pull the specimens off shelves, wrap them up, box them and ship them. There are other issues, too, with a loan. The institution receiving a loan must have the facilities to properly store it, and of course, things can go wrong in shipping.
Loans, Page says, are never quite satisfying because they leave you wondering what else might be there. And the most valuable specimens, the primary types — the specimens used to validate the name of a species — are so precious that often a museum will not loan them at all.
Putting this treasure online opens it to research, education and just plain curiosity. Both the specimen and the traditional label information are digitized, sometimes along with other information, such as the audio of Cornell University’s bird songs. Using the label data alone, a scientist can produce maps showing, for example, the range of an organism or change in its distribution over time.
The National Science Foundation is funding digitization of museum and university collections to the tune of $100 million over 10 years as part of its Advancing Digitization of Biodiversity Collections program, with iDigBio coordinating the national effort. So far, 156 collections, representing all 50 states, are participating. Museums and universities are grouped into thematic collections networks, which work together to digitize information on a particular research topic, for example insects that feed on plants. Museum collections are being added all the time, most recently the Field Museum in Chicago for its insects and Appalachian State University in North Carolina for its herbarium specimens.
The digitization effort is massive, and that’s where citizen scientists can help, Page says. Hobbyists, for example, could input the label data. Two different people would enter the data from the same label and a computer would cross-reference the two entries, so if there’s any discrepancy a specialist would know to take a look at that file.
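The double-entry check described above can be sketched in a few lines of Python. This is only an illustration, not iDigBio's actual software: the function name `find_discrepancies`, the field names and the sample values are all invented for the example.

```python
def find_discrepancies(entry_a, entry_b):
    """Return the fields on which two transcriptions of one label disagree."""
    mismatches = {}
    for field in sorted(set(entry_a) | set(entry_b)):
        # Ignore case and surrounding whitespace so that trivial typing
        # differences are not sent to a specialist for review.
        value_a = entry_a.get(field, "").strip().lower()
        value_b = entry_b.get(field, "").strip().lower()
        if value_a != value_b:
            mismatches[field] = (entry_a.get(field), entry_b.get(field))
    return mismatches


# Two hypothetical volunteers transcribe the same (invented) label.
first = {"species": "Example loach", "country": "Pakistan", "year": "1974"}
second = {"species": "Example loach ", "country": "Pakistan", "year": "1947"}

flagged = find_discrepancies(first, second)
print(flagged)  # only the "year" field differs, so only it is flagged
```

Any record with a non-empty result would go to a specialist; records where the two entries agree could be accepted without review.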
Already 14 million specimen records and 2 million images are online, accessible from a search portal developed by UF’s Advanced Computing and Information Systems Laboratory (ACIS). The data so far fit into banks of computers the size of two refrigerators in the ACIS lab.
iDigBio certainly qualifies as a big data project. But José A.B. Fortes, the computing director for iDigBio, says there are issues beyond size. The variety of the data and the differences in how data are recorded and input from institution to institution make it a thorny computing problem. Something as simple as a date can be entered in different ways: day first, month first or year first, with months written as numbers or as names. Storage is essentially a resource problem, one that money solves. The others can be tricky.
“There are challenges in the heterogeneity of the data, heterogeneity of practices and the degree to which different folks feel comfortable with different technologies,” says Fortes, an AT&T Eminent Scholar in the Department of Electrical and Computer Engineering and director of the ACIS Lab.
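The date problem gives a feel for the work involved. The sketch below, which is an assumption-laden illustration rather than anything iDigBio is known to run, tries a short list of formats and converts whatever matches to a single year-first form. The list `KNOWN_FORMATS` and the function `normalize_date` are hypothetical, and the code assumes English month names.

```python
from datetime import datetime

# Hypothetical list of formats; a real collection would need its own.
# An all-numeric date such as 03/04/1950 is ambiguous: this sketch ASSUMES
# day-first, which would be wrong for institutions that write month-first.
KNOWN_FORMATS = ["%Y-%m-%d", "%d %B %Y", "%B %d, %Y", "%d %b %Y", "%d/%m/%Y"]


def normalize_date(text, formats=KNOWN_FORMATS):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # this format did not fit; try the next one
    return None


for raw in ["1950-03-12", "12 March 1950", "March 12, 1950", "12/03/1950"]:
    print(raw, "->", normalize_date(raw))
```

Returning `None` instead of guessing keeps unreadable dates visible, so a person can look at them later.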
Fortes estimates 15 to 20 times more space will be needed long-term, along with a commitment to upgrading the equipment. Redundancy, too, is built in to safeguard the data from catastrophes that could wipe out the database. Perhaps the biggest commitment — and biggest opportunity — lies in hosting the data, Fortes says.
“The greatest asset in the information age is data, and that’s something that should be the responsibility of an institution of knowledge, like a university. Hosting this data could be a differentiator from one university to the other,” Fortes says.
The point of iDigBio is to have the data forever as a resource to do bigger and better things, to enable scientists to answer questions that cannot now be answered.
“If you are in a position to do that you clearly have a competitive advantage,” Fortes says.
“Just like having an excellent library is a differentiator, so is data,” Fortes says. “If we do it right everyone will come to us. They may not come by walking, they may come through the Internet, but that will enable other things to happen and that will pay for the long-term commitment to host the data.”
As rewarding as it is to expand access beyond the traditional realm of museum collections, Soltis also is excited about the novel ways the data can be studied. By using label data (where and when a specimen was collected), researchers can plot collection sites as GPS coordinates and construct models of species distribution based on variables such as temperature or rainfall. That is a much more powerful way of understanding the range of a species, Soltis says.
Recently Soltis and colleagues used the digital label data that exist for about half of Florida’s 4,000 species of plants to construct a model of the effects of climate change on plant communities in Florida.
Based on herbarium records, Soltis and her colleagues plotted each species on a map, then combined the maps to create a picture of plant species diversity in Florida today.
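Combining per-species maps into one diversity map can be approximated by counting how many distinct species were collected in each cell of a grid. The sketch below shows that idea only; it is not the method Soltis and her colleagues used, and the function `richness_by_cell`, the one-degree cell size and the sample records are all invented.

```python
import math
from collections import defaultdict


def richness_by_cell(records, cell_size=1.0):
    """Count distinct species per latitude/longitude grid cell."""
    species_in_cell = defaultdict(set)
    for species, lat, lon in records:
        # Each cell is identified by the grid indices of its south-west corner.
        cell = (math.floor(lat / cell_size), math.floor(lon / cell_size))
        species_in_cell[cell].add(species)
    return {cell: len(names) for cell, names in species_in_cell.items()}


# Invented records: (species, latitude, longitude).
records = [
    ("species A", 29.65, -82.32),
    ("species B", 29.10, -82.90),
    ("species A", 29.70, -82.40),  # repeat collection, counted once
    ("species C", 25.76, -80.19),
]

richness = richness_by_cell(records)
print(richness)
```

Using a set for each cell means a species collected many times in one place still counts once toward that cell's diversity.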
Then, using the characteristics that determine the niche of each species — for example, temperature, rainfall, soil type — the group looked at predictions for those attributes in future climate change scenarios. Her model for 2050 shows an alteration in species distribution as some areas dry out and others become wetter, with an overall loss of plant diversity in Florida, creating challenges for conservation (see map).
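One of the simplest ways to turn niche characteristics into a prediction is a "climate envelope": record the range of conditions at the sites where a species has been collected, then ask whether a place still falls inside that range under a future scenario. The sketch below illustrates that general idea under stated assumptions. It is far simpler than the group's model, and the function names, variables and numbers are all made up.

```python
def climate_envelope(observed_conditions):
    """Return the (min, max) of each variable across the collection sites."""
    variables = observed_conditions[0].keys()
    return {
        v: (min(site[v] for site in observed_conditions),
            max(site[v] for site in observed_conditions))
        for v in variables
    }


def is_suitable(envelope, conditions):
    """True if every variable falls inside the species' observed range."""
    return all(low <= conditions[v] <= high
               for v, (low, high) in envelope.items())


# Invented conditions at three sites where a hypothetical plant was collected.
sites = [
    {"temp_c": 20.0, "rain_mm": 1200},
    {"temp_c": 22.5, "rain_mm": 1500},
    {"temp_c": 24.0, "rain_mm": 1350},
]
envelope = climate_envelope(sites)

today = {"temp_c": 22.0, "rain_mm": 1300}   # invented present-day values
future = {"temp_c": 25.5, "rain_mm": 1100}  # invented 2050 scenario values

print(is_suitable(envelope, today), is_suitable(envelope, future))
```

Repeating the check for every species and every map cell, then counting the species that remain suitable in each cell, would give a rough picture of how diversity might shift.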
“This sort of analysis wouldn’t be possible at all if we didn’t have the digitized data from the museum specimens,” Soltis says. “We can use the digitized information to find links between evolutionary history, response to climate change and extinction risk.”
Soltis points out that her species diversity model was incredibly complicated even though it covered only Florida and only half of the state’s plant species. Translating it to a larger scale, regionally or nationally, with many more data points makes it a huge data problem because it involves heterogeneous data: distributions, evolutionary history, possibly genetics. There is talk already of linking iDigBio with GenBank, the national genetic sequence database, creating yet another avenue of inquiry.
“The possibilities of linking all these together present a whole new range of opportunities,” Soltis says. “There are almost no precedents for being able to do that. Big data takes on a unique shape when we think about biodiversity.”
Big data projects like iDigBio are changing the past emphasis in science on hypothesis testing, Soltis says. Sometimes a scientist might not know what to hypothesize, but patterns can emerge from the data and point to a hypothesis to test.
“Without access to the data, you might not even know to ask the question, so I think this whole concept of data-driven science is a really important paradigm shift for us,” Soltis says, “and I think our biodiversity data are perfect for exploring the bounds of that.”
Page says iDigBio is not only a big data issue, it’s a dark data issue, too, because so much of the data has been hidden from view.
“This is a clear example of bringing the data out of the darkness and into the light,” Page says.
The process of shedding light has won over the few institutions that were lukewarm about joining the digitization effort, Page says. Getting funding for collections work sometimes is difficult. It’s a challenge to convince administrators that a “bunch of dead fish in jars has any value,” but as more data go online, museums will have an easier time demonstrating the value of their collections.
He says he laughs when people ask what scientists are going to do with all these data. iDigBio, he says, is much more than an exercise in knowledge for the sake of knowledge. The data will be central to exploring climate change, populations, evolution, extinction, conservation, the number of species that share the planet with us, the very history of life on Earth.
These are huge research questions, so huge that they have been intractable, Page says, until now.
“I really prefer to just work with fish, but I’m more and more excited about this,” Page says. “The aha! moments, I think, will be happening all the time.”
By: Cindy Spence
Larry Page
Curator of Fishes, email@example.com
José A.B. Fortes
Professor of Electrical and Computer Engineering and Computer Science, firstname.lastname@example.org
Pamela Soltis
Curator of Molecular Systematics & Evolutionary Genetics, email@example.com