Open source software allows for integration and analysis of massive data sets
Meeting Systems Biology Data Demands
Nitin Baliga, Ph.D.
Interoperability is a challenge to the efficiency and effectiveness of information technology solutions across industries. How do you get information from database software built by company A, move it to analysis software developed by organization B, and then disseminate it through presentation software developed by company C? Even if the different solutions allow for data integration, the likelihood that the process will be smooth, efficient and effective is exceedingly low.
Addressing this challenge is critical in systems biology research. Systems biologists work with very large sets of data (terabytes in some cases) from many different sources, using a variety of software tools. These tools tend to be highly specialized, each providing, in minute detail, different and critical pieces of the microbiological puzzle being assembled. As a result, biologists find themselves manually cutting and pasting to transfer data between programs or databases, creating temporary files, running Web searches, and taking notes. This time-consuming process is an expensive and unnecessary use of research dollars.
The missing piece is a solution that allows data to be analyzed from a holistic perspective, so that a researcher can identify which distinct pieces fit together in a way that reveals the entire picture. This is, in fact, the very nature of systems biology: understanding how an entire system, rather than a single component of a system, responds to individual or multiple perturbations. The true challenge is to achieve this level of data integration with millions of data points generated from a living biological system that is constantly changing.
An elegantly simple solution
As is the case with most advances, necessity, especially in the face of resource constraints, is the mother of invention. After considering alternative software resources in the marketplace, our research team at the Institute for Systems Biology (ISB) decided to develop an elegantly simple and flexible solution that provided the most efficient and cost-effective means for meeting the complex data compilation and analysis needs associated with systems biology research. We developed the "Gaggle."
Gaggle is open source software that allows for integration and analysis of massive data sets from multiple sources, often through simple point-and-click manipulation. It is a simple Java software environment that uses the classic software engineering strategy of separation of concerns, together with a policy of semantic flexibility, to solve the problem of software and database integration. The software uses four simple data types (names, matrices, networks and associative arrays) to bring together diverse databases and software. Gaggle and the various display and analysis programs run separately and simultaneously on a desktop. The ‘Gaggle Boss’ is a simple server program that relays messages among the various analysis and display programs in use.
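The relay pattern described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not Gaggle's Java API: the class names `Boss` and `ToyProgram`, the method names, and the choice not to echo a broadcast back to its sender are all assumptions made for illustration. The four type aliases simply model the data types named in the text with plain Python structures.

```python
from typing import Callable, Dict, List, Tuple

# Plain-Python stand-ins for the four data types named in the article.
# These aliases are illustrative only, not Gaggle's Java classes.
NameList = List[str]
Matrix = Dict[str, List[float]]        # row name -> numeric values
Network = List[Tuple[str, str, str]]   # (source, interaction, target)
AssociativeArray = Dict[str, str]      # key -> value


class Boss:
    """Hypothetical relay: forwards a broadcast to every other registered program."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[NameList], None]] = {}

    def register(self, program: str, handler: Callable[[NameList], None]) -> None:
        self._handlers[program] = handler

    def broadcast(self, sender: str, names: NameList) -> List[str]:
        delivered = []
        for program, handler in self._handlers.items():
            if program != sender:      # assumption: do not echo back to the sender
                handler(list(names))   # each receiver gets its own copy
                delivered.append(program)
        return delivered


class ToyProgram:
    """Stand-in for a display or analysis tool; it only records what it receives."""

    def __init__(self) -> None:
        self.selection: NameList = []

    def handle(self, names: NameList) -> None:
        self.selection = names


boss = Boss()
network_viewer = ToyProgram()
matrix_viewer = ToyProgram()
boss.register("network_viewer", network_viewer.handle)
boss.register("matrix_viewer", matrix_viewer.handle)

delivered = boss.broadcast("network_viewer", ["tfbA"])
print(delivered)                # ['matrix_viewer']
print(matrix_viewer.selection)  # ['tfbA']
```

Note that the relay passes along only a list of names and knows nothing about what they mean. Each receiving program decides how to interpret them, which is one way to read the "semantic flexibility" mentioned above.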
Gaggle allowed the ISB research team to develop and validate visual representations, or maps, of gene regulatory networks in extremophiles for the first time. These organisms live in extremely harsh environments that are lethal to most organisms — hypersaline environments, such as the Great Salt Lake, or superheated thermal vents at the bottom of the ocean, or radioactive environments that would unravel and destroy human DNA in minutes.
Gaggle also allowed the team to efficiently harness, from disparate sources, the information necessary to develop predictive models of how gene regulatory networks would respond to environmental perturbations. Understanding how the genetic networks within these organisms are regulated, such that they can adapt and survive in environments not normally supportive of life, could eventually lead to breakthroughs in healthcare, bio-energy, agriculture, computing and more.
Halobacterium NRC-1 has 13 general transcription factors (GTFs), which are proteins that interact with control regions (promoters) of more than 2400 genes to direct organismal adaptation to environmental change. The GTFs form two related families (one of seven GTFs, the other of six) that function through mutual interactions, giving as many as 42 (7 × 6) distinct pairs. The team’s hypothesis was that each pair may have a different binding affinity for the promoters of some or all of the approximately 2400 genes.
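The pair count follows from taking one member of each family: 7 × 6 = 42. A small sketch makes the arithmetic explicit. The labels `A1` to `A7` and `B1` to `B6` are placeholders, not the actual GTF names.

```python
from itertools import product

# Placeholder labels; the real GTF names are not used here.
family_a = [f"A{i}" for i in range(1, 8)]  # family of seven GTFs
family_b = [f"B{i}" for i in range(1, 7)]  # family of six GTFs

# One member from each family per pair.
pairs = list(product(family_a, family_b))
print(len(pairs))  # 42
```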
To test this hypothesis, and so understand how Halobacterium regulates its genetic networks in response to varying environmental factors, we needed to look at several interacting elements of its genetic network, including mRNA, protein-protein and protein-DNA interactions, protein evolutionary history and protein metabolic function. Existing databases such as KEGG and STRING housed some of this information. The research team needed to generate additional experimental data: microarray experiments to measure the expression of all of the approximately 2400 genes in more than 300 environments, mass spectrometry analysis to map interactions among the two families of GTFs, statistical analysis of GTF-promoter binding site associations, and GTF deletion analysis to investigate functional consequences. Examples of programs used by biologists to view and analyze such data include:
• DataMatrixViewer (DMV): for navigating and selecting data from experiments as well as for displaying and plotting numerical data
• Cytoscape: for viewing protein-protein interactions, protein-DNA interactions, association networks and so on
• TIGR’s Microarray Expression Viewer (MeV): a popular tool for statistical analysis of gene expression
• "R": a statistical programming language with various associated packages including BioConductor
• Bioinformatics Web browsers: such as the Kyoto Encyclopedia of Genes and Genomes (KEGG), EMBL’s STRING and BioCyc
All of the above programs have been adapted to integrate with Gaggle and were used in mapping the genetic regulatory network of Halobacterium, as well as in exploring predictive models for network response to external perturbation.
In this particular investigation, for example, the team simply selected a gene, let’s say tfbA, and then clicked on "Broadcast." Gaggle then sent that information to the other open Gaggle-enabled programs and databases (e.g. KEGG, DMV, Cytoscape, TIGR’s microarray expression viewer, R). In each of these software tools, the team could expand and then filter this query to include additional genes that share some properties with tfbA and/or with each other. Data of interest resulting from such a filter in one program could then be "broadcast" to the other open programs simultaneously with the click of a mouse, at which point those programs would look up and display the information associated with the broadcast data, and allow it to be expanded and filtered further.
Figure: Data analysis with Gaggle
For instance, in Cytoscape all genes that have a promoter-binding site for TFBa (the protein encoded by tfbA) can be found; in DMV the expression values for all these genes can be selected; and in MeV or R these genes can be classified into groups with related expression values. Further, KEGG can be queried for the metabolic pathways in which the genes in each group are implicated, and the resulting list displayed. All these transactions among the various software tools and databases occur seamlessly, without the researcher ever manually entering queries, loading data into programs, changing data formats, or cutting and pasting information. Manual data exploration is time-consuming and can introduce significantly more errors than automated exploration with Gaggle. Also notable is that Gaggle turns Web searches into two-way explorations, because query results can be retrieved for analysis with other software and Web resources.
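The chain of steps in this example (find a regulator's target genes in a network, select their rows in an expression matrix, then group them) can be sketched with plain data structures. This is a toy illustration in Python, not the behavior of Cytoscape, DMV, MeV or R. The gene names, the expression values and the crude "up"/"down" grouping rule are all invented for the sketch.

```python
from typing import Dict, List, Tuple

Network = List[Tuple[str, str, str]]   # (source, interaction, target)
Matrix = Dict[str, List[float]]        # gene name -> expression values


def targets_of(network: Network, regulator: str) -> List[str]:
    """Network step: names of all genes linked to the given regulator."""
    return sorted({target for source, _, target in network if source == regulator})


def select_rows(matrix: Matrix, names: List[str]) -> Matrix:
    """Matrix step: keep only the rows for the broadcast names."""
    return {name: matrix[name] for name in names if name in matrix}


def group_by_direction(rows: Matrix) -> Dict[str, List[str]]:
    """Grouping step: a deliberately crude split by sign of mean expression."""
    groups: Dict[str, List[str]] = {"up": [], "down": []}
    for name, values in rows.items():
        if not values:
            continue  # skip genes with no measurements
        mean = sum(values) / len(values)
        groups["up" if mean >= 0 else "down"].append(name)
    return groups


# Made-up toy data, for illustration only.
network = [
    ("TFBa", "binds", "gene1"),
    ("TFBa", "binds", "gene2"),
    ("TFBf", "binds", "gene3"),
]
matrix = {"gene1": [1.2, 0.8], "gene2": [-0.5, -1.1], "gene3": [0.1, 0.3]}

names = targets_of(network, "TFBa")   # what one tool would broadcast
rows = select_rows(matrix, names)     # what the next tool would select
groups = group_by_direction(rows)
print(groups)  # {'up': ['gene1'], 'down': ['gene2']}
```

The only thing passed from step to step is a list of names, which matches the article's description of tools exchanging selections rather than sharing an internal data model.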
From an engineering perspective, Gaggle was designed to:
• use the fewest possible software elements
• keep each maximally ignorant of all others
• avoid biological semantics
• use mainstream programming languages, and only one such language if possible
• avoid operating systems dependencies
• make sure that existing popular software and data formats are supported
• place a priority on ease of installation and update.
Java was the clear choice of programming language for meeting these design goals. It runs across operating systems, provides robust remote method invocation for inter-process communication, and has the means to communicate with programs written in other languages.
As an example, using Gaggle our research team was able to identify that:
• The 13 GTFs appear to play a role in mediating large-scale responses to environmental changes (e.g. giving the organism the ability to survive and thrive in very harsh conditions)
• TFBf controls ribosome biogenesis
• Nearly two-thirds of the 1,048 Halobacterium promoters that were analyzed are associated with more than one GTF, which suggests a likely redundancy and an increased level of organismal fitness even in the event that some GTFs are damaged.
Arriving at these findings, all of which were first-time discoveries, would have been significantly more time-consuming and costly without an effective data integration solution. Integration capabilities like those made possible by solutions such as Gaggle will play a critical role in advancing systems exploration of more complicated organisms, such as humans, in the future.
1. ISB is a non-profit research institute dedicated to the study and application of systems biology. ISB’s systems approach integrates biology, computation and technological development, enabling scientists to analyze all elements in a biological system rather than one gene or protein at a time.
2. Facciotti MT, Reiss DJ, Pan M, Kaur A, Vuthoori M, Bonneau R, Shannon P, Srivastava A, Donohoe SM, Hood LE, Baliga NS. General transcription factor specified global gene regulation in archaea. Proc Natl Acad Sci U S A. 2007; 104(11): 4630-4635. PMID: 17360575.
3. These programs can be found at gaggle.systemsbiology.net.
Nitin Baliga is an assistant professor at the Institute for Systems Biology in Seattle, WA. He may be contacted at editor@ScientificComputing.com.
Gaggle integrates several software packages
Cytoscape (www.cytoscape.org), EMBL STRING (string.embl.de), Kyoto Encyclopedia of Genes and Genomes (KEGG) (www.genome.jp), R language (www.r-project.org), Microarray Expression Viewer (MeV) (www.tm4.org)