Highlights from "Managing the Data Explosion in Systems Biology"
6 March 2007
NANCY R GOUGH
At the December 2006 American Society for Cell Biology meeting, there was a Special Interest Subgroup entitled "Managing the Data Explosion in Systems Biology". One focus of the session was issues related to data integration. It appears that, instead of pushing toward a commonly adopted standard, some labs are building tools that allow data in different formats to be mapped and imported for analysis and interpretation of high-throughput results.
H. Steven Wiley (Pacific Northwest National Laboratory, WA) introduced the session. Wiley noted that, although there is a lot of data in cell biology, we are still data-poor in terms of the complex data required to really understand cell physiology and the response to signals and changes in the environment. There is a lot of "simple" data (gene sequences, protein interaction information, and structure information) but little "complex" data (information about the dynamics of the system). The research approach thus far has been to figure out the parts and then try to figure out the dynamics. The computational approaches used to analyze the "parts data" to produce dynamic systems biology interpretations and generate hypotheses range from highly specific models to more abstract ones. Wiley ranked the various approaches, from most specific to least, as models based on differential equations (these require the most complex and complete data sets), Markov chain models, Boolean models, Bayesian models, and finally statistical models.
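To make the contrast between the two ends of that spectrum concrete, here is a minimal sketch (not from the talk; the species names, rate constants, and update rule are invented for illustration) of the same two-step signaling motif modeled two ways: as differential equations, which demand quantitative kinetic parameters, and as a Boolean network, which needs only the wiring.

```python
# Minimal sketch contrasting an ODE model with a Boolean model of a
# two-step signaling motif (receptor activation -> effector activation).
# All species names and rate constants are illustrative, not from the talk.
import numpy as np
from scipy.integrate import odeint

def ode_model(state, t, k_act, k_deact, k_eff, k_off):
    """Differential-equation model: requires quantitative rate constants."""
    r_active, e_active = state
    dr = k_act * (1.0 - r_active) - k_deact * r_active
    de = k_eff * r_active * (1.0 - e_active) - k_off * e_active
    return [dr, de]

t = np.linspace(0, 10, 100)
trajectory = odeint(ode_model, [0.0, 0.0], t, args=(1.0, 0.5, 2.0, 0.3))

def boolean_step(r_active, e_active, ligand_present):
    """Boolean model: only the wiring is needed, no kinetic parameters."""
    return ligand_present, r_active  # the effector simply follows the receptor

state = (False, False)
for _ in range(5):
    state = boolean_step(*state, ligand_present=True)

print(trajectory[-1], state)  # steady-state levels vs. on/off logic states
```

The ODE version yields quantitative time courses but only if the rate constants can be measured, which is exactly the "complex" data Wiley argued is scarce; the Boolean version runs on wiring information alone but says nothing about magnitudes or timing.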
There are fundamental issues related to integrating the large, albeit simple, data sets that are generated and processing this information into meaningful knowledge. Some of the specific problems Wiley mentioned included redundancy in the sequence databases, redundancy in the probe sets used for microarray analysis, uncertainty in protein identification (from mass spectrometry experiments, for example, where the peptides generated could be derived from more than one protein), uncertainty in protein quantitation, missing data, and the lack of tools for cross-referencing gene and protein identifiers across databases. Unfortunately, the gene identification standards are not the same as the protein identification standards, so it can be very time-consuming to match a peptide to its protein and then the protein to the gene. Wiley indicated that researchers at the Pacific Northwest National Laboratory can generate 480 gigabytes of proteomic data per day but lack sufficient automated mechanisms for identifying and interpreting the results.
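As a rough illustration of the peptide-to-protein-to-gene cross-referencing problem (not a description of any particular pipeline; the tables, accession numbers, and gene identifiers below are made up), a sketch using simple two-column mapping tables might look like this:

```python
# Sketch of cross-referencing protein and gene identifiers across databases,
# assuming simple two-column mapping tables; all accessions are invented.
import pandas as pd

# Peptide-level hits from a hypothetical mass-spectrometry run; one peptide
# may map to more than one protein, which is the ambiguity Wiley described.
peptide_hits = pd.DataFrame({
    "peptide": ["LSSEQK", "LSSEQK", "AGFNPR"],
    "protein_accession": ["P00001", "P00002", "P00003"],
})

# Protein-to-gene mapping pulled from a second database with its own
# identifier convention.
protein_to_gene = pd.DataFrame({
    "protein_accession": ["P00001", "P00002", "P00003"],
    "gene_id": ["GENE_A", "GENE_A", "GENE_B"],
})

# Join the two tables, then flag peptides that remain ambiguous even at the
# gene level (i.e., that still map to more than one gene).
merged = peptide_hits.merge(protein_to_gene, on="protein_accession", how="left")
ambiguous = merged.groupby("peptide")["gene_id"].nunique()
print(merged)
print(ambiguous[ambiguous > 1])
```

Even this toy example shows why the step is slow in practice: every join depends on both databases using compatible accessions, and any peptide whose proteins collapse to a single gene must still be distinguished from one whose ambiguity survives the mapping.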
Following this introduction, several speakers described efforts related to mapping high-throughput data sets to information in existing databases. Some of the tools mentioned that are relevant to modeling or computational analysis of cell signaling included the following (a small Cytoscape import sketch follows the list):
- Cytoscape network visualization software
- Onto-Tools statistical gene expression analysis software
- Gaggle data management application
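For readers who have not used Cytoscape, one of the simplest ways to get a network into it is its plain-text simple interaction format (SIF), in which each line names a source node, an interaction type, and a target node. The sketch below writes such a file; the node names and interaction types are invented for illustration.

```python
# Sketch of writing a network in Cytoscape's simple interaction format (SIF),
# one of the plain-text formats the program can import.
# The node names and interaction types below are invented for illustration.
interactions = [
    ("EGFR", "activates", "GRB2"),
    ("GRB2", "binds", "SOS1"),
    ("SOS1", "activates", "HRAS"),
]

with open("example_network.sif", "w") as handle:
    for source, interaction_type, target in interactions:
        handle.write(f"{source}\t{interaction_type}\t{target}\n")
```

The resulting file can be imported into Cytoscape as a network and then overlaid with expression or other attribute data.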
Feel free to add to this list of tools and comment on any experiences you may have had in working with these or other tools for computational approaches to cell signaling.