The Data Is Your Palette

[October 2008] By Matthew Dublin

While powerful compute muscle in the form of workstations, acceleration hardware, and sophisticated algorithms continues to help crunch data sets that seem to multiply at turbo-charged rates, it isn’t becoming any easier for researchers to make sense of it all. "With the vast amounts of data out there, if we rely solely on traditional science and technology methodologies, scientific problems are going to become more and more difficult to solve," says Haesun Park, a professor of computer science at Georgia Institute of Technology.

That is why data visualization software for biology holds so much promise: it offers a way for researchers to get a grip on data without drowning in it. Thanks to a recent $3 million grant from the National Science Foundation and the Department of Homeland Security, Park now heads up the Foundations of Data and Visual Analytics research initiative, aimed at conducting foundational research for visual analytics tool development. But Park says next-gen data visualization tools — that is, visualization tools that do not just represent data graphically but that actually allow researchers to view their data in novel ways — are still in the beginning stages.

Recently, software developers at the Broad Institute gave data visualization a nice boost in the form of the Integrative Genomics Viewer, a freely available tool that allows researchers to view different genome data sets and navigate, overlap, and intermix data from various sources — including genetic variation, epigenetic, and microRNA expression data, as well as RNAi screens and phenotype annotations of samples. Taking its cue from the University of California, Santa Cruz, Genome Browser, the IGV is designed to handle large-scale experiments. "We use the UCSC Genome Browser a lot and we like it, but it tends to bog down and become unusable as the size of your feature tracks grows," says Jim Robinson, senior software engineer at the Broad. "We do a lot of ChIP-seq here … and the way we process them is every 25 base pairs on the whole genome, so they cannot be loaded in the UCSC browser."

What really sets the IGV apart from other genome visualization tools, says Robinson, is its Google Map-like design. In much the same way that Google Maps allows users to zoom in and out on a particular place using a map that’s essentially made of tiled images pieced together, the IGV allows users to have the same kind of viewing freedom along the base pairs of a genome. "IGV took its inspiration from Google Maps, and it divides your data into manageable-sized chunks on specific zoom resolution scales, so if you’re looking at the whole genome, it processes the data and just loads into the viewer a small chunk of representative statistics for the data to visualize," Robinson says. "Then as you zoom down, it’s getting more and more detailed into the data until you’re finally looking at the raw data; however, you’re looking at smaller and smaller pieces, so it’s the tile concept that Google Maps uses."
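The tile concept Robinson describes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the IGV's actual code: the function names, the use of the mean as the summary statistic, and the `TILE_BINS` constant are all hypothetical.

```python
from statistics import mean

TILE_BINS = 4  # hypothetical: number of summary values stored per tile


def summarize(chunk, bins):
    """Reduce a chunk to at most `bins` mean values (raw data if it already fits)."""
    if len(chunk) <= bins:
        return list(chunk)
    size = len(chunk)
    return [mean(chunk[i * size // bins:(i + 1) * size // bins]) for i in range(bins)]


def build_pyramid(values, max_zoom):
    """Precompute summary tiles for zoom levels 0..max_zoom.

    Level z splits the data into 2**z tiles, and each tile holds at most
    TILE_BINS values, so a tile stays small no matter how large the data is.
    """
    pyramid = {}
    n = len(values)
    for zoom in range(max_zoom + 1):
        n_tiles = 2 ** zoom
        pyramid[zoom] = [
            summarize(values[t * n // n_tiles:(t + 1) * n // n_tiles], TILE_BINS)
            for t in range(n_tiles)
        ]
    return pyramid


def tiles_for_view(pyramid, zoom, start_frac, end_frac):
    """Return only the tiles covering the fraction [start_frac, end_frac) of the data."""
    n_tiles = 2 ** zoom
    first = int(start_frac * n_tiles)
    last = min(n_tiles - 1, int(end_frac * n_tiles - 1e-9))
    return pyramid[zoom][first:last + 1]


if __name__ == "__main__":
    signal = list(range(16))            # stand-in for a per-position genome signal
    pyramid = build_pyramid(signal, max_zoom=2)
    print(tiles_for_view(pyramid, 0, 0.0, 1.0))   # whole data set, coarse summary
    print(tiles_for_view(pyramid, 2, 0.5, 0.75))  # zoomed in, raw values
```

At zoom level 0 a single tile summarizes the whole data set; at deeper levels each tile covers a smaller region at finer resolution, so the viewer only ever loads a few small tiles at a time.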

Unlike Google Maps, where data is stored on Google’s servers, the IGV leaves users responsible for their own data, although it currently supports more than a dozen different file formats. That’s definitely a downside to the IGV, Robinson says: it places a lot of the compute burden on end users. Recently, the IGV team put the finishing touches on a client-server version. And a little further off, but definitely on the drawing board, is a Web-based version. "We’ve just begun to experiment with ways to combine different data types, different graphs, in a way that visually still makes sense, and it’s not actually that easy to do," Robinson says. "We’re looking at ways of enabling the IGV to do cross-species comparisons, and combining data from mouse and human is of particular interest here."

According to Robinson, who is also a member of the Cancer Genome Atlas project, developers at the Broad did not initially set out to build a tool for mass consumption. Instead, it was the nature of the Cancer Genome Atlas, with its myriad data sets, that left Robinson and his colleagues constantly wanting a more integrated way to view several data sets against one another.

A team at the European Bioinformatics Institute is developing a tool that also boasts Web 2.0-like interactivity: Dasty2, an open-source, Web-based client for visualizing protein sequence data. According to Henning Hermjakob, team leader of proteomics services at EBI, the heart of Dasty2 is its use of AJAX, which endows it with a colorful graphical interactivity that lets information pop up simply by scrolling across an item with the mouse. Hermjakob hopes that its open-source status will encourage users to continue to develop this functionality.

Dasty2 is built to seamlessly issue multiple asynchronous requests to the Distributed Annotation System (DAS) Web servers that host annotation data. DAS is a framework that uses XML to provide transparent access to multiple databases at the same time, so only a single client is needed to reach them all. That’s where Dasty2 comes in. More than 261 databases, including UniProt, Ensembl, and Pfam, are DAS-enabled. Hermjakob says his team decided to make Dasty2 a Web-based application to increase accessibility for users, rather than requiring them to download and install it. To improve compatibility, it uses the same color-coding scheme as UniProt and SRS, and uses colors, borders, shading, and line separation to contrast features with the background and clearly delineate the relationships among annotations. It also has a zooming feature that lets users explore a particular protein sequence.
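The pattern of querying several annotation servers at once and merging their XML replies can be sketched as follows. This is an illustration only, not Dasty2's code, which runs as AJAX in the browser: the source names, the canned responses, and the simplified XML shape are assumptions, and the network call is replaced by a stub so the example runs offline.

```python
import asyncio
import xml.etree.ElementTree as ET

# Canned replies standing in for real annotation servers (hypothetical names and data).
FAKE_SERVERS = {
    "source_a": (
        '<DASGFF><GFF><SEGMENT id="P12345">'
        '<FEATURE id="f1"><TYPE>domain</TYPE><START>10</START><END>80</END></FEATURE>'
        '<FEATURE id="f2"><TYPE>variant</TYPE><START>120</START><END>120</END></FEATURE>'
        "</SEGMENT></GFF></DASGFF>"
    ),
    "source_b": (
        '<DASGFF><GFF><SEGMENT id="P12345">'
        '<FEATURE id="g1"><TYPE>binding site</TYPE><START>45</START><END>60</END></FEATURE>'
        "</SEGMENT></GFF></DASGFF>"
    ),
}


async def fetch_features(source, protein_id):
    """Stub for an HTTP request; a real client would query the server for protein_id."""
    await asyncio.sleep(0.01)  # simulate network latency
    return source, FAKE_SERVERS[source]  # this stub ignores protein_id


def parse_features(source, xml_text):
    """Extract id, type, and coordinates from every FEATURE element in a reply."""
    root = ET.fromstring(xml_text)
    return [
        {
            "source": source,
            "id": feature.get("id"),
            "type": feature.findtext("TYPE"),
            "start": int(feature.findtext("START")),
            "end": int(feature.findtext("END")),
        }
        for feature in root.iter("FEATURE")
    ]


async def gather_annotations(protein_id, sources):
    """Query all sources concurrently, then merge features ordered by start position."""
    replies = await asyncio.gather(*(fetch_features(s, protein_id) for s in sources))
    features = []
    for source, xml_text in replies:
        features.extend(parse_features(source, xml_text))
    return sorted(features, key=lambda f: f["start"])


if __name__ == "__main__":
    for f in asyncio.run(gather_annotations("P12345", list(FAKE_SERVERS))):
        print(f["source"], f["type"], f["start"], f["end"])
```

Because the requests run concurrently, the total wait is roughly that of the slowest server rather than the sum of all of them, which is what makes a single client over many databases practical.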

Gaggle and GESTALT
Over at the Institute for Systems Biology, which hosts several visualization tools on its website, Nitin Baliga and his group recently added a tool to Gaggle, their database and software integration framework. The Gaggle Genome Browser allows users to download any genome from NCBI; the program then paints it on the fly. Users can upload their transcriptome data to the browser, where it appears as tracks, Baliga says. Possible application areas include the visualization of tiling arrays and ChIP-chip data. The genome browser has yet to be published, but it is freely available for download from the Gaggle website.

Baliga says he always reminds his software engineers that they are in the business of helping biologists see complex data in a setting that enables intuition, not more confusion. "There’s a lot of information that we need to put in context to build an understanding of how biological systems work, and with high-throughput data, it becomes very difficult to have all of that information in front of you in a meaningful way. And that’s where visualization tools are most effective," he says. "[Data visualization tools] need to go in a direction that will allow users to interact with these complicated models and build up on the intuition so that they can interact and curate the models that they developed in a way that feeds back into the next cycle of systems biology."

One of Baliga’s colleagues at ISB is Gustavo Glusman, a lead developer of the GESTALT workbench, an early visualization tool that works with the FEAST gene prediction program. Glusman emphasizes the importance of making tools that create visual metaphors that have real scientific meaning. "People frequently create too complicated graphs and publish them with little explanation," he says. "Such graphs have been called ‘ridiculograms’ because they are visually stunning, scientifically meaningless, and yet, published." According to Glusman, visualization software needs to be as simple as possible while still showing rich data. "The tool should be able to produce many different ways of looking at the data, and allow the user to modify and tinker at will," he says. "The tool should also quickly become second nature to the user, avoiding technical issues like complex installation, demanding system requirements, and frequent crashes."

Proteolens and GeneTerrain
Jake Chen, an assistant professor of informatics and computer science at Indiana University, recently released Proteolens, a freely available visualization platform tailored to the annotation and analysis of multi-scale biological networks. Chen and his team have already demonstrated the program’s effectiveness at visualizing protein-peptide mapping, human disease association network, and drug-target interaction data sets using a node-layout format. According to its developers, the most notable feature of Proteolens is that it enables the bioinformatics specialist to create tables and views as well as to build queries on top of one another, without leaving the visualization environment. That’s especially useful for exploratory data analysis and data mining, says Chen.

While other data visualization programs, such as Cytoscape, VisANT, and Pathway, are great tools for conducting visual annotation and analysis of biomolecular network data, Chen believes Proteolens offers something different in allowing the user to specify data definition. "Those are impressive tools, but we don’t believe that they have the full flexibility of allowing the user to use data definition language to build queries on top of another," says Chen. "That’s the key to the systems biology research that we conduct in our own group, so we don’t just build tools and let others use them; we build tools because we want to push the limits ourselves."

He and his colleagues are also hard at work refining a biomarker panel visualization tool with a unique graphical data-rendering presentation, which they plan to add to the Proteolens package later this year. GeneTerrain, as its name implies, displays a "terrain" of multicolored topological maps with peaks and valleys to represent the signals produced in noisy protein or gene expression datasets. Each nuance in the map allows users to identify possible biomarkers of interest without any deep understanding of the algorithmic engine running under the hood. "The topological features are a step up from the usual heat map layout, which doesn’t use the space in 2D to represent any information," Chen says. "But with GeneTerrain, we use all the real estate to represent the inherent biological properties, so the user just looks at which region is consistently up or down to make the decision whether that region, which represents clusters of genes linked by network, [is] important."

VariVis and MotifCluster
There are also visualization tools that are not intended to be downloaded by the end user at a desktop, but are instead geared toward the online database manager or curator. The VariVis toolkit, which is essentially a collection of Perl scripts that generate graphical models of gene sequences complete with their corresponding variant data, is one such software solution. As long as the database Web server is Perl CGI script-enabled, database managers can install the open-source VariVis and run it seamlessly inside the database’s existing architecture. "VariVis was designed to give the curators and owners of locus-specific databases access to some basic visualization tools," says Tim Smith, VariVis developer and a PhD student at the Genomic Disorders Research Centre, a branch of the Howard Florey Institute in Melbourne, Australia. "Most locus-specific databases are maintained in the spare time of their curators, often as a sideline to more mainstream research, [which] means that a lot of these databases don’t have the necessary time or money to develop tools of their own."

The tool offers two conceptual views, both of which display sequences and variation with structural annotations. The standard view displays a gene sequence and overlays positions where variations are present. Users can click on variant symbols to see a brief overview of the data or to perform simple PubMed and Google Scholar searches on the variant. VariVis also offers a gel view, which displays the sequence vertically in an unbroken data stream. This mode renders all possible nucleotide combinations for each position while highlighting nucleotides present in the reference sequence and any variations in contrasting colors.

Even though VariVis was only recently made available, Smith says the feedback from his fellow researchers has been very positive. "We’ve had a great response. Our recent paper in BMC Bioinformatics was just classified as ‘Highly Accessed,’ which is terrific," he says. "We’ve had a number of other database curators talking to us, and add to that the steady stream of people downloading it from our website. … We’re very happy."

The biggest development challenge for Smith and his colleagues was one that many visualization tool developers face: figuring out exactly what data users need to see with a graphical tool for a particular research application, and what they can live without. "As you can imagine, there were a lot of differing opinions on the matter, and in the end we just had to use our best guess, as we wouldn’t have been able to satisfy everyone," Smith says. "We wanted to keep VariVis as simple as possible, both to use and to run."

Meanwhile, Rob Knight, an assistant professor of evolutionary biology at the University of Colorado, Boulder, and his team have finished development of MotifCluster, a visualization tool for identifying related motifs in a set of sequences. This new app clusters sequences into families according to the motifs they contain. The user can then determine whether certain proteins are related; visualize motifs mapped onto trees, sequences, and 3D structures; and cluster sequences by shared conserved motifs.

"MotifCluster allows you to focus on the motifs in the context of a multiple sequence alignment, a set of 3D structures, and a network relating proteins to motifs, thus providing a unique combination of tools for rapid discovery of similarity between distantly related groups of proteins and specific changes that discriminate among functionally distinct classes of proteins," Knight says.

The biggest hurdle Knight and his team faced was parallelizing the software to run on their own compute cluster for quick response times. The aim is to serve many users at once, rather than to run a single visualization job that requires massive compute power. At this point, running visualization tools in parallel is still unusual in the field. "For large data sets, use of clusters is routine for providing rapid visualizations in other fields, such as physics and climate simulations, but genomics is lagging behind in this respect," Knight says. "So because the software doesn’t exist [for parallelizing data visualization tools], users do what they can on their desktops."

The steady stream of visualization tools suggests that developers hope visualization may be a silver bullet, but the inherent abstractness of putting large, complex data sets into visual metaphors has its limitations. From where Robinson is sitting, the big issue facing visualization software development is not just how to handle disparate data sets and present them visually; it is paring the massive data sets down so that they can actually be represented in an intelligible way.
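One common way to reduce a long signal to something a screen can show is to collapse it into one minimum and maximum per pixel column, so that spikes survive the reduction. The sketch below is a generic illustration of that idea, not a description of how the IGV or any other tool named here does it; the function name and the default width are assumptions.

```python
def reduce_to_pixels(values, width=700):
    """Collapse `values` into at most `width` (min, max) pairs, one per pixel column."""
    n = len(values)
    if n == 0:
        return []
    columns = min(width, n)  # never create more columns than there are data points
    envelope = []
    for col in range(columns):
        # Integer arithmetic splits the data into `columns` nearly equal chunks.
        chunk = values[col * n // columns:(col + 1) * n // columns]
        envelope.append((min(chunk), max(chunk)))
    return envelope


if __name__ == "__main__":
    signal = [0] * 1_000_000
    signal[123_456] = 9  # a single spike that naive subsampling could miss
    envelope = reduce_to_pixels(signal)
    print(len(envelope), max(high for _, high in envelope))
```

Keeping both extremes per column means a lone outlier still shows up as a visible peak, whereas taking every n-th point or the mean alone could hide it.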

"I think the big challenge is that you have way more data than you can view, so you have to come up with intelligent ways to view what’s important out of that data," Robinson says. "We’re just starting that now, to experiment [with] things … But if you have a gigabyte of data [and] you only have 700 x 700 pixels on the screen … somehow that has to get reduced to something that makes sense."