Computational Modeling

Translational Regulation of Gene Expression in Halobacterium salinarum

This summer, Michael Huang and Will Wick delved into computational biology.

In recent years, the conception of ribosomes as static machines, non-responsive to changing conditions within the cell has gradually fallen away, and now scientists are exploring how ribosomes might act as dynamic regulatory elements on their own.

A former postdoc in the Baliga lab, Arjun Raman, started research into ribosomal heterogeneity with the goal of “establishing a basis for conditional regulation of ribosomal protein genes and other translation system elements,” using the Archaea Halobacterium salinarum as a model organism due to its being easy to work with in the lab. Michael and I built off part of Arjun’s work, investigating some of the transcriptomic and proteomic data he had produced.

A variety of computational tools were involved in our research. Our programming was done in R and Python, which we used to call various programs. To quantify the reads for our RNA-seq data, we used several different pipelines: Kallisto, STAR and Cufflinks, and HTSeq. Then for further data analysis, we used Samtools, MEME, and DESeq2.

We were given a huge dataset to work with. It included 2509 genes by 4 timepoints by 3 replicates by 3 layers of data. This did not lend itself well to familiar programs like Excel, so we relied on R to handle the bulk of our data processing.

We used R’s outlierTest to identify outlier genes for which ribosomes preferentially attached. For the genes that were identified as outliers at multiple time points, we constructed coverage plots, with the relative position along the transcript on the x axis and the reads of ribosome protected fragments normalized by the million total reads in the genome on the y axis. This is where Samtools came into play, as we used Samtools to compute the reads at each loci of the genes. We noticed a pattern for these outlier genes: they all had peaks at a particular region. Further, for all the outlier genes except VNG_1207C, this peak got progressively lower from timepoint 1 to timepoint 4.

To see if this pattern was unique to coverage plots of the ribosome protected fragment data for outlier genes, we constructed additional coverage plots of both the mRNA data for outlier genes and ribosome protected fragment for control genes which fit the regression. The control genes still exhibited a peaks, as seen in the lower left, but these peaks were not as defined as those in the outlier genes. Also, when looking at the mRNA coverage plots for outlier genes, we found very slight peaks, but these were occurring in regions of the transcript compared to the ribosome protected fragment data, leading us to believe that rather than these peaks being an artifact of mismatched reads, they might be regions where the ribosome was stalling. We therefore sought to discover motifs among the DNA sequence of the outlier genes, hopefully finding a motif that was causing the peak.

The standard program for motif identification is MEME, an acronym for multiple expectation maximization for motif elicitation. We tried three different input models with MEME: the peak region, the peak region plus the ten bases on either side of it, and the entire transcript. We have shown on the projector MEME’s output for when we tried the peak and its surrounding region; however, none of the three input models produced any statistically significant motifs.

In addition to the RNA-Seq data we worked with, we also recently received proteomic data achieved through this SWATH proteomics procedure. Within the Lysate-only data, we found fewer (39) shared across all time points than with the ribosomal isolated data (48), with 15% of the ribosomal protein genes either not being detected across both datasets. In plotting the pertinent and significant data across the time points, we find that the log2 ratio of the genes seemed to reduce over time and cluster towards -1.5, which in log2 means approaching 1/3rd, while in the ribosomal isolation data the trend was much tighter and seemed to approach towards -1.0, which means half of the cells compared with TP 1. We are currently investigating the prospective outliers and quantifying their significance.

This research raises a few questions that can be investigated in depth in the future. Researchers have been constructing transcriptional regulatory networks for a while now, but can the same be done for translational regulatory networks, and can these be integrated with transcriptional models to gain a more a detailed understanding of how cells control how much protein is produced? Another question is to look at which specific conditions result in the use of certain ribosomes for translation. A particular case to apply this to would be latent state of tuberculosis, which occurs under hypoxia

In addition to computer-based analysis of Halo RNA-Seq data, we emulated a lab procedure, growing and incubating Wild-Type NRC-1 Halo strains in addition to a several mutant strains.

ISB High School Interns 2017