- Nearly a decade ago, ISB’s Baliga Lab published a landmark paper describing cMonkey, an innovative method to accurately map gene networks within any organism from microbes to humans.
- Two new papers describe the benchmark results of cMonkey and also the release of cMonkey2, which performs with higher accuracy.
- Using this approach, genetic and molecular data generated from any organism, be it a bacterium or a birch tree, can be explored and analyzed from a network perspective.
For most of the history of biology, data was a limiting quantity that was painstakingly gathered and meticulously curated. However, the meteoric rise of sequencing technology, accompanied by the parallel emergence of computing, has fueled a recent data explosion in biology. Such novel technologies allow scientists to ask new questions and revisit old ones with a fresh perspective, but they also bring new problems when interpreting the deluge of information.
Nearly a decade ago, members of the Baliga group at the Institute for Systems Biology published a landmark paper describing cMonkey, an innovative method to accurately map gene networks within any organism from microbes to humans. The method takes advantage of expanding biological datasets and computational power, and has since been applied to many different organisms across a huge range of publicly and privately generated experimental data.
Prior to cMonkey, it was common practice to group sets of genes that had similar expression levels across experimental conditions using a process termed clustering. However, this practice inherently assumes that gene clusters are static. While this assumption may apply in limited circumstances, such as a binary treatment vs. control experimental setup, it may not be true across a wide variety of conditions. Thus, ISB researchers set out to create a novel method to cluster both genes and conditions simultaneously, known as biclustering. Compared with clustering, biclustering is a substantially harder computational problem (NP-hard in its general formulation), meaning that no practical algorithm is guaranteed to find the optimal solution. Nevertheless, biclustering algorithms including cMonkey have demonstrated practical value in dealing with real biological data.
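To make the distinction concrete, here is a minimal sketch of why grouping genes over all conditions can miss a relationship that holds only in some of them. The expression values, the gene names, and the choice of Pearson correlation as the similarity measure are illustrative assumptions, not data or scoring taken from cMonkey.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical expression values for two genes across six conditions.
gene_a = [1.0, 2.0, 3.0, 4.0, 2.0, 5.0]
gene_b = [1.1, 2.1, 2.9, 4.2, 5.0, 1.0]

# Conditions 0-3 are a hypothetical subset in which the genes move together.
subset = [0, 1, 2, 3]

# Similarity over every condition, as ordinary clustering would measure it.
all_r = pearson(gene_a, gene_b)

# Similarity restricted to the subset, as a bicluster would measure it.
subset_r = pearson([gene_a[i] for i in subset], [gene_b[i] for i in subset])

print(f"all conditions:    r = {all_r:.2f}")
print(f"conditions 0 to 3: r = {subset_r:.2f}")
```

Across all six conditions the two genes look unrelated, while within the first four they track each other closely. A biclustering method searches for such gene and condition subsets together rather than assuming one fixed grouping.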
In an update, published on April 15 in the journal Nucleic Acids Research, ISB researchers present the evolution of cMonkey. The paper describes updates to the algorithm, released as cMonkey2, and assesses its performance against alternative platforms using three distinct datasets from two different bacterial organisms and human cancer cells. Detailed performance benchmarks demonstrate the efficacy of cMonkey2 in accurate network reconstruction as well as its broad applicability across cell types. Performance aside, a major change to the software is that it has been converted from being solely a biclustering algorithm to a biclustering and data integration platform. The new platform enables easy integration of many additional data types, allowing users to expand the data classes to include categories we have not thought of yet.
Indeed, one of the signature features of cMonkey is its ability to integrate additional relevant sources of information. For example, many metabolic pathways have been experimentally defined in baker's yeast and various bacteria. Such pathway information can be assimilated as an association network that guides the algorithm in finding parsimonious biclusters. Various association networks based on interactions among proteins, DNA, and other molecules can also be used to aid this process. In the end, the clusters that arise are the ones that can best account for all the disparate types of data that are fed into the algorithm. This integrative approach gave cMonkey an advantage over other biclustering algorithms when it was released, and it continues to be a distinguishing characteristic in its updated form.
There is a major opportunity to build bicluster networks from plentiful publicly available consortium datasets generated by multiple independent laboratories. This opportunity also presents a challenge because variation between the sources can introduce noise that reduces bicluster quality. In a paper published on April 15 in the journal BMC Systems Biology, ISB researchers have added a new metric to the cMonkey scoring algorithm to improve the quality of biclusters when dealing with highly variable source datasets. This metric improved the accuracy of condition-specific gene clustering, including a demonstrable enhancement in predicting a physiological response to nutrient shifts in yeast cells.
Beyond including the new bicluster quality metric, the revised cMonkey2 is now modularized to ease the incorporation of additional data types, as well as adjustment of the weights those data types receive in the biclustering calculations. The resulting biclusters are easily interrogated using an intuitive web-based framework, and the data can be further analyzed and visualized using additional software that has been developed by the Baliga laboratory, other groups at ISB, and the rest of the scientific community. A final note worth mentioning is the programming language: originally built in the statistical modeling environment R, the updated version has been rewritten in Python, one of the most widely used programming languages today.
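The idea of weighting data types can be pictured with a small sketch. This is not cMonkey2's scoring function or API: the function names, the data-type labels, the scores, and the use of a plain weighted average are all hypothetical, chosen only to show how changing the weights can change which gene best fits a bicluster.

```python
def combined_score(scores, weights):
    """Weighted average of per-data-type scores (higher means a better fit)."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(weights[name] * value for name, value in scores.items())
    return weighted_sum / total_weight


def best_candidate(candidates, weights):
    """Return the candidate gene with the highest combined score."""
    return max(candidates, key=lambda gene: combined_score(candidates[gene], weights))


# Hypothetical scores in [0, 1] describing how well each candidate gene fits
# one bicluster according to two data types.
candidates = {
    "gene_x": {"expression": 0.9, "network": 0.2},
    "gene_y": {"expression": 0.7, "network": 0.8},
}

# Two hypothetical weightings: expression data alone, then both data types.
expression_only = {"expression": 1.0, "network": 0.0}
integrated = {"expression": 1.0, "network": 1.0}

print(best_candidate(candidates, expression_only))  # gene_x
print(best_candidate(candidates, integrated))       # gene_y
```

With expression data alone, gene_x is the better fit. Once the association network receives equal weight, gene_y wins because both data types support it.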
The benchmark results in the paper speak for themselves, but suffice it to say this is a uniquely comprehensive and powerful tool for modern systems biology research. To encourage widespread adoption of cMonkey2, the documentation for usage and development has been updated and expanded. Using this approach, genetic and molecular data generated from any organism, be it a bacterium or a birch tree, can be explored and analyzed from a network perspective.
Image above: Scanning EM of bacteria being eaten by white blood cell. Photo Credit: Adrian Ozinsky