ISB Informatics: Systems Biology in action by Nat Goodman

The Institute for Systems Biology (ISB), where I work, is an independent non-profit research institution located in Seattle. It is a multidisciplinary place and includes biologists, physicians, computer folks, engineers, physicists, mathematicians, education specialists, and others. Most projects involve a combination of large-scale data production, small-scale bench biology, and computational analysis. Software is developed or acquired to meet the needs of specific projects or research groups. Software is routinely shared among groups, but there’s no official ISB informatics system that everyone uses.

In this review, I describe a sample of the software used at ISB, focusing on elements that have moved beyond proof-of-principle and may be relevant to other groups engaged in systems biology.

The array facility provides processing pipelines for standard Affymetrix expression arrays and certain types of two-color spotted arrays. The Affymetrix pipeline uses Affymetrix software (GCOS, formerly MAS 5) for image acquisition and basic data analysis, and Bioconductor and Tibshirani’s SAM for more advanced analysis. The spotted array pipeline uses Buhler’s Dapple for image analysis and Ideker’s VERA and SAM for data analysis. Most users do additional analysis beyond what the pipeline gives them. Bioconductor and TIGR Multiexperiment Viewer (MeV) are widely used for this purpose, Nieselt’s Mayday is used by some people, and at least one group uses GeneSpring.

The proteomics group provides the Trans-Proteomic Pipeline for processing mass spec proteomics data generated using a variety of instruments. The pipeline includes tools for processing of raw mass spectra, database search to identify peptides and proteins, assessment of confidence in these identifications, identification of proteins in cross-linked complexes, and protein quantitation. The pipeline is based on the mzXML standard and can accommodate additional tools that conform to the standard.

Data from these pipelines and others end up in the SBEAMS database. SBEAMS stores many kinds of data but does not attempt to integrate this data in any deep sense. To integrate or analyze data, users export it into files or databases constructed specifically for this purpose. Caveat: Unlike most ISB software, which runs on Linux and MySQL or Postgres, SBEAMS requires Windows and Microsoft SQL Server.

Cytoscape is a program for visualizing and analyzing biological networks, such as protein-protein, protein-gene, and gene-gene interactions. Cytoscape provides add-on tools, called plug-ins, for doing integrated analysis of interaction data with other kinds of data. Examples include analysis of expression data to identify sub-networks with highly correlated expression, and annotation data, such as GO, to associate sub-networks with biological functions. Originally developed by Trey Ideker when he was at ISB, Cytoscape is now produced by a consortium he leads that includes ISB, the University of California San Diego, Sloan-Kettering, Institut Pasteur, and Agilent.
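To give a feel for what "sub-networks with highly correlated expression" means, here is a minimal Python sketch. It is not Cytoscape code and does not reproduce any plug-in's actual algorithm; it is a deliberately simple stand-in that keeps only the interactions whose two genes have strongly correlated expression profiles and then reports the connected groups that remain. The function names, the 0.9 threshold, and the toy gene names and values are all assumptions made up for illustration.

```python
from collections import defaultdict
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    # A flat profile has no defined correlation; treat it as uncorrelated.
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def correlated_subnetworks(edges, expression, threshold=0.9):
    """Keep edges whose endpoints correlate at or above `threshold`
    (an illustrative cutoff), then return the connected components."""
    adjacency = defaultdict(set)
    for a, b in edges:
        if a in expression and b in expression:
            if pearson(expression[a], expression[b]) >= threshold:
                adjacency[a].add(b)
                adjacency[b].add(a)

    seen, components = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first walk over the retained edges.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(sorted(component))
    return components


# Toy data (hypothetical genes and values, not from any ISB dataset).
edges = [("g1", "g2"), ("g2", "g3"), ("g3", "g4"), ("g4", "g5")]
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.1, 6.0, 8.2],
    "g3": [1.1, 2.0, 3.2, 3.9],
    "g4": [4.0, 1.0, 3.0, 2.0],
    "g5": [8.1, 2.0, 6.2, 3.9],
}
subnetworks = correlated_subnetworks(edges, expression)
print(subnetworks)
```

In the toy data the g3-g4 interaction joins two genes whose profiles move in opposite directions, so it is dropped and the network splits into two groups. Real analyses would use more careful statistics than a single hard cutoff.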

Similar capabilities are provided by Ingenuity Pathways Analysis (IPA), a commercial product that the company provides to ISB through a collaboration. A key strength of IPA is that it operates on Ingenuity’s database of interactions, which is more comprehensive and probably more accurate than available public databases. A drawback is that ISB is not allowed to integrate IPA or the Ingenuity database with other software or databases, making it harder to incorporate IPA into the main data-analysis flow. Discussions are underway with GeneGo for access to their MetaCore product.

Gaggle is a framework for integrating interactive software tools to support data exploration. A controller manages communication among the interactive tools (collectively, geese), which run as separate programs on the user’s desktop computer. Components communicate with each other, generally in response to user requests, by passing simple messages via Java’s Remote Method Invocation. Geese are constructed by modifying existing programs to implement the Gaggle communication protocol, a process which is generally straightforward for well-written Java programs. Existing geese include Cytoscape, TIGR MeV, Data Matrix Viewer (for viewing and graphing tabular data), R command console (for statistical and mathematical programming), Web interfaces to KEGG and STRING, and a Firefox extension that adds Gaggle communication to any Web page. Gaggle was developed by the Baliga laboratory at ISB and continues as a collaboration with Bonneau at New York University.
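The controller-and-geese arrangement is a hub-and-spoke message pattern, and a small Python sketch may make it concrete. This is not Gaggle's code or its protocol: the real system passes messages between separate programs over Java RMI, whereas this sketch keeps everything in one process. The class names, method names, goose names, and message format are all hypothetical, chosen only to illustrate the idea.

```python
class Goose:
    """A tool that can receive messages from the controller (hypothetical API)."""

    def __init__(self, name):
        self.name = name
        self.received = []  # (sender, message) pairs, in arrival order

    def handle(self, sender, message):
        # A real tool would update its display; here we just record the message.
        self.received.append((sender, message))


class Controller:
    """Central hub: tools never talk to each other directly."""

    def __init__(self):
        self._geese = {}

    def register(self, goose):
        self._geese[goose.name] = goose

    def broadcast(self, sender, message, target=None):
        """Deliver `message` to every registered goose except the sender,
        or only to `target` if one is named. Returns the recipients' names."""
        delivered = []
        for name, goose in self._geese.items():
            if name == sender:
                continue
            if target is not None and name != target:
                continue
            goose.handle(sender, message)
            delivered.append(name)
        return delivered


# Three stand-in tools (names are illustrative only).
controller = Controller()
network_viewer = Goose("network_viewer")
matrix_viewer = Goose("matrix_viewer")
stats_console = Goose("stats_console")
for goose in (network_viewer, matrix_viewer, stats_console):
    controller.register(goose)

# The user selects genes in one tool and sends them to all the others.
gene_list = {"type": "name_list", "names": ["geneA", "geneB"]}
everyone = controller.broadcast("network_viewer", gene_list)

# Or sends them to just one tool.
only_stats = controller.broadcast("matrix_viewer", gene_list, target="stats_console")
```

Because each tool only needs to know how to talk to the controller, adding a new tool means implementing one small interface rather than a connection to every other tool, which fits the article's point that converting an existing program is usually straightforward.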

GDxBase is a framework for disease-oriented Web sites developed by my group at ISB in collaboration with Smink in Todd’s laboratory at the University of Cambridge. The software integrates disease-specific and general biological data, presenting the information in a form suitable for disease researchers who are not experts in the underlying data types. Tools are provided for viewing data for lists of genes across the integrated datasets. We use GDxBase for a large type 1 diabetes Web site, T1DBase, funded by the Juvenile Diabetes Research Foundation, and a smaller Huntington’s Disease Web site, HDBase, funded by the Hereditary Disease Foundation and High Q Foundation. Other groups are using GDxBase for type 2 diabetes, prion disease, bloodomics, and diseases of energy metabolism.

Systems biology needs a ton of software to digest the data upon which it relies. This review gives a taste of the software used at one leading systems biology institution. The most important message, I think, is that diversity rules. While computer folks always want a coherent architecture to tie the software together, a less organized approach is probably better for this dynamic field at this point in time.