EGRIN2 Portal

http://egrin2.systemsbiology.net/

EGRIN 2.0 is a systems-level model that delineates the complex relationship between environment, gene regulation, and phenotype in prokaryotes.

Why EGRIN 2.0?

A foremost challenge in systems biology is to understand how just a few transcription factors (TFs) in a microbial genome generate a wide array of nuanced responses to varied environmental challenges. EGRIN 2.0 is a new model for the complete gene regulatory network (GRN) of a prokaryote. This model is reverse engineered directly from gene expression data and genomic sequence, and hence the methodology to generate EGRIN 2.0 is applicable to any prokaryotic organism.

References

Most Powerful Tool for Reconstructing a Gene Network

3 Bullets

  • Nearly a decade ago, ISB’s Baliga Lab published a landmark paper describing cMonkey, an innovative method to accurately map gene networks within any organism from microbes to humans.
  • Two new papers describe the benchmark results of cMonkey and also the release of cMonkey2, which performs with higher accuracy.
  • Using this approach, genetic and molecular data generated from any organism, be it a bacterium or a birch tree, can be explored and analyzed from a network perspective.

For most of the history of biology, data was a limiting quantity that was painstakingly gathered and meticulously curated. However, the meteoric rise of sequencing technology, accompanied by the parallel emergence of computing, has fueled a recent data explosion in biology. Such novel technologies allow scientists to ask new questions and revisit old ones with a fresh perspective. However, they also bring new problems when interpreting the deluge of information.
Nearly a decade ago, members of the Baliga group at the Institute for Systems Biology published a landmark paper describing cMonkey, an innovative method to accurately map gene networks within any organism from microbes to humans. The method takes advantage of expanding biological datasets and computational power, and has since been applied to many different organisms across a huge range of publicly and privately generated experimental data.
Prior to cMonkey, it was common practice to group sets of genes that had similar expression levels across the experimental conditions using a process termed clustering. However, this practice inherently assumes that gene clusters are static, that is, that the same genes are co-expressed under every condition. While this assumption may hold in limited circumstances, such as a binary treatment vs. control experimental setup, it may not be true across a wide variety of conditions. Thus, ISB researchers set out to create a novel method to cluster both genes and conditions simultaneously, known as biclustering. Biclustering is a much harder computational problem than one-dimensional clustering: its general formulations are NP-hard, so practical algorithms rely on heuristics and no solution is guaranteed to be optimal. Nevertheless, biclustering algorithms including cMonkey have demonstrated practical value in dealing with real biological data.
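
To make the clustering vs. biclustering distinction concrete, here is a minimal sketch on a made-up expression matrix. It scores a gene/condition submatrix with the mean squared residue, a coherence measure from the general biclustering literature (Cheng and Church); this is an illustrative assumption, not the scoring function cMonkey uses. The matrix values, the planted pattern, and the function name are all hypothetical.

```python
import numpy as np

def mean_squared_residue(expr, genes, conds):
    """Coherence of the submatrix expr[genes, conds]; lower means more coherent."""
    sub = expr[np.ix_(genes, conds)]
    row_means = sub.mean(axis=1, keepdims=True)
    col_means = sub.mean(axis=0, keepdims=True)
    # What is left after removing per-gene and per-condition effects.
    residue = sub - row_means - col_means + sub.mean()
    return float((residue ** 2).mean())

# Toy data: 4 genes x 6 conditions of random noise (illustrative only).
rng = np.random.default_rng(0)
expr = rng.normal(size=(4, 6))

# Plant a shared profile for genes 0-2, but only in conditions 0-2.
pattern = np.array([2.0, -1.0, 0.5])
gene_offsets = np.array([[0.0], [1.0], [-0.5]])
expr[:3, :3] = pattern + gene_offsets

# Scored over the conditions where the genes really co-vary, the group is coherent.
bicluster_msr = mean_squared_residue(expr, [0, 1, 2], [0, 1, 2])
# Scored over all conditions (as a static gene cluster would be), it is not.
all_conditions_msr = mean_squared_residue(expr, [0, 1, 2], list(range(6)))

print(bicluster_msr, all_conditions_msr)
```

The same three genes look tightly co-regulated only when the condition subset is chosen along with them, which is the motivation for searching over genes and conditions at the same time.
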
In an update published on April 15 in the journal Nucleic Acids Research, ISB researchers present the evolution of cMonkey. The paper describes updates to the algorithm, released as cMonkey2, and assesses its performance against alternative platforms using three distinct datasets from two different bacterial organisms and human cancer cells. Detailed performance benchmarks demonstrate the efficacy of cMonkey2 in accurate network reconstruction as well as its broad applicability across cell types. Performance aside, a major change to the software is that it has been converted from being solely a biclustering algorithm to a biclustering and data integration platform. The new platform enables easy integration of many different additional data types, allowing users to expand the data classes to include categories we have not thought of yet.
Indeed, one of the signature features of cMonkey is its ability to integrate additional relevant sources of information. For example, many metabolic pathways have been experimentally defined in baker's yeast and various bacteria. Such pathway information can be assimilated as an association network that guides the algorithm in finding parsimonious biclusters. Various association networks based on interactions among proteins, DNA, and other molecules can also be used to aid this process. In the end, the clusters that arise are the ones that can best account for all the disparate types of data that are fed into the algorithm. This integrative approach gave cMonkey an advantage compared to other biclustering algorithms when it was released, and continues to be a distinguishing characteristic in its updated form.
There is a major opportunity to build bicluster networks from plentiful publicly available consortium datasets generated by multiple independent laboratories. This opportunity also presents a challenge because variation between the sources can introduce noise that reduces bicluster quality. In a paper published on April 15 in the journal BMC Systems Biology, ISB researchers have added a new metric to the cMonkey scoring algorithm to improve the quality of biclusters when dealing with highly variable source datasets. This metric improved the accuracy of condition-specific gene clustering, including a demonstrable enhancement in predicting a physiological response to nutrient shifts in yeast cells.
Beyond including the new bicluster quality metric, the revised cMonkey2 is now modularized to make it easy to incorporate additional data types and to adjust the weights those data types receive in the biclustering calculations. The resulting biclusters can be explored through an intuitive web-based framework, and the data can be further analyzed and visualized using additional software developed by the Baliga laboratory, other groups at the ISB, and the rest of the scientific community. A final note worth mentioning is the programming language: originally built in the statistical modeling environment R, the updated version has been rewritten in Python, one of the most widely used programming languages today.
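
The idea of weighting data types can be sketched in a few lines. The following is a hypothetical illustration only: the function name, the score values, and the normalized weighted-sum rule are assumptions made for this example and are not taken from the cMonkey2 code base.

```python
import numpy as np

def combined_scores(scores_by_type, weights):
    """Weighted average of per-gene scores from several data types.

    scores_by_type: dict mapping a data-type name to one score per gene.
    weights: dict mapping the same names to non-negative weights.
    """
    total_weight = sum(weights[name] for name in scores_by_type)
    weighted = sum(
        weights[name] * np.asarray(values, dtype=float)
        for name, values in scores_by_type.items()
    )
    return weighted / total_weight

# Made-up scores for three genes from two data types.
scores = {
    "expression": [0.2, 0.9, 0.4],
    "network": [0.8, 0.1, 0.4],
}

# Expression data only: the network score has no influence.
expression_only = combined_scores(scores, {"expression": 1.0, "network": 0.0})
# Equal weights: both data types contribute to the ranking of genes.
balanced = combined_scores(scores, {"expression": 1.0, "network": 1.0})

print(expression_only, balanced)
```

Changing the weights changes which genes look like the best fit for a bicluster, which is why exposing them to the user matters when new data types are added.
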
The benchmark results in the paper speak for themselves, but suffice it to say this is a uniquely comprehensive and powerful tool for modern systems biology research. To encourage widespread adoption of cMonkey2, the documentation for usage and development has been updated and expanded. Using this approach, genetic and molecular data generated from any organism, be it a bacterium or a birch tree, can be explored and analyzed from a network perspective.
Image above: Scanning EM of bacteria being eaten by white blood cell. Photo Credit: Adrian Ozinsky

Publication

MeDiChI

The MeDiChI Model-Based ChIP-chip Deconvolution Algorithm

This is the download and instruction page for the MeDiChI software in support of the Bioinformatics manuscript

Please cite this publication if you utilize this package for your published research.

MeDiChI is a method for the automated, model-based deconvolution of protein-DNA binding data from ChIP-chip (chromatin immunoprecipitation followed by hybridization to a genomic tiling microarray). It discovers DNA binding sites at high resolution, higher than that of the tiling array itself. This enables more stringent analysis of functional binding (including regulated genes and DNA binding motifs) than would be possible using standard procedures for enrichment detection. The procedure uses a generative model of protein-DNA binding sites and a linear model of the cumulative effect of those sites on the intensity of microarray probes. It uses constrained linear regression and L1 shrinkage to estimate the parameters of the linear model, which correspond to the high-resolution locations and intensities of the binding peaks. Finally, a bootstrap is used to estimate the uncertainties and significance of each binding site.
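
The regression step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the MeDiChI R package: the triangular peak profile, the probe spacing, the penalty `lam`, and the proximal-gradient solver are all choices made for this example, and the bootstrap step is omitted.

```python
import numpy as np

def peak_profile(distance, width=300.0):
    """Hypothetical triangular profile: probe signal vs. distance (bp) from a site."""
    return np.clip(1.0 - np.abs(distance) / width, 0.0, None)

def deconvolve(probe_pos, intensity, candidate_pos, lam=0.1, n_iter=5000):
    """Non-negative, L1-penalized least squares solved by proximal gradient descent."""
    # Linear model: each candidate site adds a scaled peak profile to nearby probes.
    design = peak_profile(probe_pos[:, None] - candidate_pos[None, :])
    step = 1.0 / np.linalg.norm(design, 2) ** 2  # 1 / Lipschitz constant of the gradient
    coef = np.zeros(design.shape[1])
    for _ in range(n_iter):
        grad = design.T @ (design @ coef - intensity)
        # Gradient step, L1 shrinkage, and projection onto coef >= 0 in one line.
        coef = np.maximum(coef - step * (grad + lam), 0.0)
    return coef, design

# Simulated data: probes every 100 bp, candidate sites every 25 bp.
probe_pos = np.arange(0.0, 2000.0, 100.0)
candidate_pos = np.arange(0.0, 2000.0, 25.0)
true_sites = {450.0: 3.0, 1325.0: 2.0}  # position (bp) -> peak height, made up
intensity = sum(h * peak_profile(probe_pos - s) for s, h in true_sites.items())

coef, design = deconvolve(probe_pos, intensity, candidate_pos)
residual = np.linalg.norm(design @ coef - intensity)
print(residual, candidate_pos[coef > 0.1])
```

Because there are more candidate positions than probes, the fit is not unique and weight can be shared between neighboring candidates; the L1 penalty and the non-negativity constraint keep the solution sparse, and a resampling step such as the bootstrap mentioned above would be needed to judge how well each position is determined.
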

We have developed a MeDiChI R package (including all functions for analysis and visualization, and all novel data presented in the manuscript).

Source Code

Please visit our Github repository for downloads, source code, installation instructions, and basic usage.

Publications

Inferelator

The Inferelator is an algorithm for inferring predictive regulatory networks from gene expression data.

It does so by selecting the regulators (transcription factors or environmental factors) whose levels are most predictive of each gene or bicluster's expression (see cMonkey for more information). The method fits a multivariate kinetic model of gene expression that includes a sigmoidal activation model (via the logistic function) and a mean decay rate parameter (τ). To strictly enforce parsimony and avoid overfitting, it uses linear regression with L1 shrinkage and model selection via the LASSO, coupled with 10-fold cross-validation. The model allows simultaneous fitting of time-course (τ/Δt > 0) and steady-state (τ/Δt ≈ 0) data, and was chosen from the class of generalized linear models to allow fast parameter estimation and cross-validation. In addition, we developed a simple way of incorporating a generalized-linear extension of pairwise logical interactions (AND, OR, XOR) between predictors using the functions min and max (which mimics physical chemistry derivations of logical interactions).
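
The pieces of this model can be sketched in code. The sketch assumes a first-order kinetic form, τ·dy/dt = −y + g(β·X), discretized with finite differences, which is consistent with the τ/Δt notation above but is not copied from the Inferelator source; the function names are hypothetical, and the LASSO fit and cross-validation are left out.

```python
import numpy as np

def logistic(x):
    """Sigmoidal activation applied to the linear combination of predictors."""
    return 1.0 / (1.0 + np.exp(-x))

def kinetic_response(y, dt, tau):
    """Left-hand side of the discretized model: tau * (y[m+1] - y[m]) / dt[m] + y[m].

    With tau / dt close to 0 (steady state) this reduces to y[m] itself.
    """
    y = np.asarray(y, dtype=float)
    dt = np.asarray(dt, dtype=float)
    return tau * (y[1:] - y[:-1]) / dt + y[:-1]

def add_pairwise_terms(X):
    """Append min (AND-like) and max (OR-like) columns for every predictor pair."""
    n = X.shape[1]
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    mins = [np.minimum(X[:, i], X[:, j]) for i, j in pairs]
    maxs = [np.maximum(X[:, i], X[:, j]) for i, j in pairs]
    return np.column_stack([X] + mins + maxs)

def predict(design, beta):
    """Model prediction for the response: logistic of the weighted predictors."""
    return logistic(design @ beta)

# Time course of one gene (made-up values) with uneven sampling intervals.
time_course = kinetic_response([1.0, 2.0, 4.0], dt=[1.0, 2.0], tau=2.0)
steady_state = kinetic_response([1.0, 2.0, 4.0], dt=[1.0, 2.0], tau=0.0)

# Two regulators measured in two conditions, expanded with min and max terms.
regulators = np.array([[1.0, 2.0], [3.0, 0.0]])
design = add_pairwise_terms(regulators)

print(time_course, steady_state)
print(design)
```

Since time-course and steady-state measurements produce responses of the same form, rows from both kinds of experiment can be stacked into one regression problem, and a sparse fit then selects a small number of regulator or interaction columns.
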

Thus, our generalized-linear dynamical network model cleanly incorporates some details of kinetic models, while maintaining the simplicity, flexibility, and robustness of linear and boolean models.

When integrated with cMonkey, it can also pair potential regulators with their putative cis-elements (DNA binding sites). We used the Inferelator to learn the global regulatory network of H. salinarum NRC-1.

Source Code

Publication

Visualize and explore the Halobacterium regulatory influence network.

MTB Network Portal

http://networks.systemsbiology.net/mtb

The MTB Network Portal serves as the portal for a computational modeling program that aims to generate an integrated, predictive gene regulatory network model of host/pathogen interactions.

References