Sulfolobus solfataricus P2

Refseq: NC_002754

NCBI taxonomy ID: 273057

Transcriptome structure

Analysis of growth samples of S. solfataricus P2 identified 2113 transcriptional units, of which 459 were polycistronic and 1654 were monocistronic (Table 1). The genome contains many repeat sequences, including partial and full IS elements (~10% of the genome) (She et al., 2001), whose boundaries were hard to detect. The majority of transcripts spanned only a short distance around the coding-sequence boundaries, with 5' UTRs (694 transcripts, 65% of genes assigned to an experimentally determined TSS) and/or 3' UTRs (408 transcripts, 55% of the genes assigned to an experimentally determined TSS), which agrees with the previous observation that most Sulfolobus genes generate leaderless transcripts (Wurtzel et al., 2010).

We identified 151 transcripts that were not annotated in the original genome sequence analysis (She et al., 2001). A BLASTX search of these transcripts identified 97 putative novel proteins (query coverage > 34% and E-value < 5E-6), of which 32 matched putative proteins identified in a previous study (Wurtzel et al., 2010) (Fig 1A). We also identified 109 putative novel antisense RNAs, as well as previously reported ncRNAs at 280 genomic positions, of which 254 belong to 102 kinds of sequenced ncRNAs (Tang et al., 2005; Zago et al., 2005) and 26 were computationally predicted (Gardner et al., 2009; Grissa et al., 2007; Omer et al., 2000). Nearly all of the reported ncRNAs (277, 99%) were expressed (mean P = 0.98, std = 0.06). The ncRNAs expressed in the tiling array experiments were predominantly located in the opposite orientation to transposon-related genes (148 cases) (Fig 1B), which implies that these antisense RNAs have evolved to prevent excessive transposition (Tang et al., 2005; Wurtzel et al., 2010). The locations of four single transcripts covering large genomic regions (2 to 6.5 kb) with no sequence homology to nr proteins were almost identical to those of computationally predicted CRISPRs (Grissa et al., 2007) (Fig 1C). The remaining three predicted CRISPRs, at 1744007 (411 bp), 1809772 (1450 bp), and 1811328 (4230 bp), were not expressed (P < 0.5).
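The BLASTX thresholds quoted above (query coverage > 34% and E-value < 5E-6) can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the `Hit` record, the function names, and the example hits are hypothetical and serve only to show how the two cutoffs combine.

```python
from dataclasses import dataclass

# Thresholds quoted in the text.
MIN_QUERY_COVERAGE = 34.0   # percent, strict lower bound
MAX_EVALUE = 5e-6           # strict upper bound


@dataclass
class Hit:
    """One BLASTX hit (hypothetical record layout, for illustration only)."""
    query_id: str
    subject_id: str
    query_coverage: float   # percent of the query covered by the alignment
    evalue: float


def passes_thresholds(hit: Hit) -> bool:
    """A hit counts only if it satisfies BOTH cutoffs."""
    return hit.query_coverage > MIN_QUERY_COVERAGE and hit.evalue < MAX_EVALUE


def queries_with_hits(hits: list[Hit]) -> set[str]:
    """Return the transcripts that have at least one hit passing both cutoffs."""
    return {h.query_id for h in hits if passes_thresholds(h)}


# Made-up example hits, not data from the study.
example_hits = [
    Hit("tx001", "protA", 80.0, 1e-40),  # passes both cutoffs
    Hit("tx002", "protB", 20.0, 1e-50),  # fails the coverage cutoff
    Hit("tx003", "protC", 90.0, 1e-3),   # fails the E-value cutoff
    Hit("tx003", "protD", 35.0, 1e-6),   # passes both cutoffs
]
putative_coding = queries_with_hits(example_hits)
```

In this sketch a transcript is counted once even if several of its hits pass, which matches counting putative novel proteins per transcript rather than per alignment.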

In the genome annotation of S. solfataricus P2, a large number of genes (868 gene pairs) have partially overlapping coding sequences: 636 co-directional (→→), 204 convergent (→←), and 28 divergent (←→) (Palleja et al., 2009). For the 76 convergent or divergent overlaps of more than 25 bp, the probability of expression in the overlapping regions (P = 0.64) was much lower than that in the non-overlapping regions (P = 0.95) (t-test P value = 6E-17), which implies that the overlapping region of only one gene of each pair tends to be expressed. We identified more than 19 genes with suspected erroneous annotation (Fig 1D).
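The three overlap classes above can be made concrete with a short Python sketch. It assumes 1-based inclusive coordinates and "+"/"-" strand labels; the `Gene` record and function names are hypothetical, and this is not the classification code of Palleja et al. (2009) or of this study.

```python
from typing import NamedTuple, Optional


class Gene(NamedTuple):
    """Annotated gene (hypothetical record; 1-based inclusive coordinates)."""
    name: str
    start: int
    end: int
    strand: str  # "+" or "-"


def overlap_length(a: Gene, b: Gene) -> int:
    """Number of bases shared by the two coding sequences (0 if none)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def classify_overlap(a: Gene, b: Gene) -> Optional[str]:
    """Classify an overlapping pair by the orientation of the two genes.

    co-directional: both genes on the same strand
    convergent:     left gene on "+", right gene on "-"
    divergent:      left gene on "-", right gene on "+"
    Returns None when the genes do not overlap.
    """
    if overlap_length(a, b) == 0:
        return None
    left, right = sorted((a, b), key=lambda g: g.start)
    if left.strand == right.strand:
        return "co-directional"
    return "convergent" if left.strand == "+" else "divergent"


# Made-up coordinates, for illustration only.
g1 = Gene("geneA", 100, 200, "+")
g2 = Gene("geneB", 190, 300, "-")
g3 = Gene("geneC", 100, 200, "-")
g4 = Gene("geneD", 180, 300, "+")
```

Here `classify_overlap(g1, g2)` is convergent with an 11 bp overlap, and `classify_overlap(g3, g4)` is divergent with a 21 bp overlap; a filter such as `overlap_length(a, b) > 25` would select pairs comparable to the 76 discussed above.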

Tiling array analysis in this study identified TSSs for many transcriptional units for which no TSS was determined in the previous deep sequencing analysis (Wurtzel et al., 2010). Of the 638 TSSs of annotated genes determined by both methods, positional discrepancies were detected for 167 genes, mainly genes for hypothetical proteins, transposons, and operon constituents.

Table 1. Overview of the Transcriptome Structure of S. solfataricus P2

Figure 1. Examples of discoveries made through tiling array analysis of dynamic changes in transcriptome structure of S. solfataricus P2.

(A) Discovery of a new gene. We discovered at least 151 transcripts in genomic locations that were not assigned to any annotated feature. Here, we show an example of a newly discovered transcript that encodes a protein homologous to Gar1, a component of the H/ACA RNA-protein complex, from Sulfolobus solfataricus 98/2 (E-value = 6e-44). This putative gene showed the same expression pattern as the downstream gene SSO0946, which encodes transcription initiation factor IIB (TFB-2) (Pearson correlation ~ 1). (B) Discovery of an antisense ncRNA. At least 109 antisense ncRNAs were discovered. The example shown is an ncRNA that is antisense to the 5' end of transposase ISC1250. The expression of the transposase and that of its antisense ncRNA were moderately anti-correlated (Pearson correlation ~ -0.55). (C) Expression of CRISPR. Of seven computationally predicted CRISPRs, four were expressed distinctively and consistently across their regions. Here, we show the largest CRISPR, starting at 1233428 (~6.5 kb). Interestingly, a hypothetical gene with a CRISPR-associated (cas) Pfam domain is located on the opposite strand upstream of the largest CRISPR, but its expression was not correlated with that of the CRISPR (Pearson correlation ~ 0). (D) Identification of misannotation. Many overlapping genes appear to be subject to erroneous annotation. The purine biosynthesis operon is followed by an operon on the opposite strand in which the putative start codon is located 144 bp from the annotated start site. The TSS determined in this study was 148 bp apart.
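The Pearson correlations quoted in panels A to C compare the expression profiles of two features across samples. As a minimal sketch of that calculation, the Python function below computes the coefficient from two equal-length lists; the expression values are made up for illustration and are not measurements from this study.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression profiles across four samples.
gene_profile = [1.0, 2.0, 3.0, 4.0]
co_expressed = [2.0, 4.0, 6.0, 8.0]      # rises with gene_profile
anti_expressed = [4.0, 3.0, 2.0, 1.0]    # falls as gene_profile rises

r_same = pearson(gene_profile, co_expressed)
r_opposite = pearson(gene_profile, anti_expressed)
```

A value near 1 corresponds to co-expression as in panel A, a negative value to anti-correlation as in panel B, and a value near 0 to the absence of a linear relationship as in panel C.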

 

S. solfataricus P2 Resources

Data for this project can be accessed through the following software and databases that were generated through DOE funding.

Gaggle Genome Browser (ID/PW: maggie/archaea; Note: this link downloads large data files (~60-190 MB)). The Gaggle Genome Browser (GGB) is software developed by the Baliga Lab for visualizing systems biology data organized by genomic coordinates. You can learn more about GGB by going here, where you will find extensive information on data formats and features, along with demos and screencasts on how to use the software. Once you have launched the GGB software prepackaged with S. solfataricus P2 data, you can browse the annotated and curated information on transcriptome structure as follows: (1) right-click on the bookmarks file and save it to your desktop; (2) in GGB, click Bookmarks > Load Bookmarks to load this file; (3) finally, click Bookmarks > Show Bookmarks; the bookmarks should appear as a new pane on the right-hand side. GGB can communicate with other software in the Gaggle framework; for more information about Gaggle, go here. Additional software tools for analyzing S. solfataricus P2 data in Gaggle can be found here. [Using GGB: go here for information on how to interpret the information contained in the various tracks. IMPORTANT: Please go here to make sure your computer is set up correctly for using this software.]

 References