GLBIO 2019 – Fostering Interdisciplinary Computational Biology Research

December 3, 2019 PLOS Collections Biology & Life Sciences Computer & Information Sciences

The 14th Great Lakes Bioinformatics Conference 2019 (GLBIO) took place at the University of Wisconsin at Madison from May 19th to 22nd. The conference was organized by the Great Lakes Bioinformatics Consortium to provide an interdisciplinary forum for the discussion of research findings and methods in bioinformatics. An important goal for the conference was to foster long-term collaborative relationships and networking opportunities within the domain of computational approaches to biology. The conference committee accepted 5 full-length original research papers for oral presentation at the conference, all of which were then submitted and accepted for inclusion in the PLOS Collection devoted to the conference.

This year's conference attracted 277 attendees from eight countries. The agenda included three tutorials and workshops on a wide variety of topics, including the UCSC Genome Browser, dimensionality reduction methods for biomedical data, and the use of Docker for software development and reproducibility. In addition, it included 10 research sessions and 6 invited keynote speakers providing updates on several exciting areas of research and education, spanning population genetics, evolutionary biology, host-pathogen interactions, precision medicine, systems biology, big data analytics and drug repositioning. The conference also featured 114 posters presented in two poster sessions.

GLBIO 2019 Research Papers:

The accepted papers included in this collection comprise a new clustering approach for biological sequences using phylogenetic trees (Balaban et al., 2019); a framework for prediction of regulatory elements from partial data (Zhang and Mahony, 2019); a new resolution function that can efficiently quantify and compare the shapes of phylogenetic trees by encoding them with a single number (Hayati et al., 2019); improved connectivity measures for signaling networks based on hypergraphs to facilitate signaling pathway analysis and prediction (Franzese et al., 2019); and an integrative approach to predict protein-RNA binding by considering thermodynamic and sequence contexts of RNA (Su et al., 2019). Together, these papers reflect the increasing interest in integrative data mining and machine learning approaches on multi-omic datasets, and in employing such datasets for systems biology studies. Each of these works is summarized below:

A new resolution function to evaluate tree shape statistics

**Hayati *et al* 2019: The geometric perspective of a good and bad tree shape statistic.**
From a geometric perspective, a good statistic can discriminate between different trees, and place similar trees together. In these two figures, we embedded the set of trees with 9 tips using multi-dimensional scaling (MDS) and the NNI distance between the trees. The points in the top and bottom plots are colored based on their Sackin and I2 values respectively. The green, blue and red points correspond to the upper quartile, lower quartile, and the inter-quartile interval of the distribution of the statistics, respectively. The clustering pattern in the figure indicates that the Sackin index can separate the trees into groups in a way consistent with the NNI distances, while the I2 index is unable to do so.

Abstract: Phylogenetic trees are frequently used in biology to study the relationships between a number of species or organisms. The shape of a phylogenetic tree contains useful information about patterns of speciation and extinction, so powerful tools are needed to investigate the shape of a phylogenetic tree. Tree shape statistics are a common approach to quantifying the shape of a phylogenetic tree by encoding it with a single number. In this article, we propose a new resolution function to evaluate the power of different tree shape statistics to distinguish between dissimilar trees. We show that the new resolution function requires less time and space in comparison with the previously proposed resolution function for tree shape statistics. We also introduce a new class of tree shape statistics, which are linear combinations of two existing statistics that are optimal with respect to a resolution function, and show evidence that the statistics in this class converge to a limiting linear combination as the size of the tree increases. Our implementation is freely available at https://github.com/WGS-TB/TreeShapeStats.
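To make the idea of encoding a tree shape with a single number concrete, here is a minimal Python sketch of one classical statistic, the Sackin index (the sum of the depths of all leaves). It is an illustration only, not code from the TreeShapeStats repository: the nested-tuple tree representation and the function names are assumptions made for this example.

```python
def leaf_depths(tree, depth=0):
    """Yield the depth of every leaf in a tree given as nested tuples.

    Assumed representation: an internal node is a tuple of children,
    and anything that is not a tuple is a leaf label.
    """
    if not isinstance(tree, tuple):
        yield depth
        return
    for child in tree:
        yield from leaf_depths(child, depth + 1)


def sackin_index(tree):
    """Sackin index: the sum of leaf depths (larger means less balanced)."""
    return sum(leaf_depths(tree))


# Two rooted binary trees on the same four tips.
balanced = (("a", "b"), ("c", "d"))
caterpillar = ((("a", "b"), "c"), "d")

print(sackin_index(balanced))     # 8
print(sackin_index(caterpillar))  # 9
```

The two trees have the same tips but different shapes, and the statistic separates them: the fully balanced tree scores lower than the caterpillar tree.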

Direct prediction of regulatory elements from partial data without imputation

**Zhang *et al* 2019: Illustration of the IDEAS method.** When making an inference at a locus in a target cell type, IDEAS first identifies a set of cell types that share locally similar chromatin landscapes with the target cell type. Then IDEAS makes predictions based on the chromatin marks in the target cell type and the predictions made in the locally related cell types at the same locus. The IDEAS algorithm is a fully Bayesian nonparametric probabilistic model. All model parameters, except for hyperparameters, are learned from the data, including the number and parameters of chromatin states, the size of local intervals, the number of cell type clusters, and locus-specific profiles.

Author Summary: Histone modifications and other gene regulatory signals can be profiled across the genome in a given cell type, and each type of regulatory signal correlates with the presence of specific gene regulatory activities. Genome segmentation methods look for patterns across combinations of regulatory signals to annotate more general “regulatory states” (e.g. enhancers, promoters, repressed regions, etc.) across the genome. To see how regulatory states change across cell types, we need to run genome segmentation in a consistent way across the analyzed cell types. However, due to experimental and cost limitations, we may not have profiled the same regulatory signals in all available cell types. Current approaches deal with this missing data problem by either limiting genome segmentation analysis to the subset of regulatory signals that have been profiled in all analyzed cell types (which limits the types of regulatory states that can be detected and/or the numbers of cell types that can be analyzed), or by predicting what the missing regulatory signals would have looked like. The latter “imputation” approach is computationally costly, and is not always accurate. The current manuscript introduces a third strategy for handling missing data in the genome segmentation problem. Our approach, based on the IDEAS genome segmentation platform, removes the need for data imputation by directly accounting for missing data within the algorithm. In cell types where some regulatory signals are missing, IDEAS can still provide accurate regulatory state annotations based on a combination of the regulatory signals that have been observed in that cell type, the regulatory states annotated at the same location in other cell types (which may be based on more complete regulatory signal information), and the regulatory states in surrounding regions.

Hypergraph-based connectivity measures for signaling pathway topologies

**Franzese *et al* 2019: Hyperedges traversed to compute B0, B1, …, B4 from source pathway Mst1.**
Node colors represent B-relaxation distance from k = 0 (B0, blue) to B4 (bright green). Gray nodes are entities that are not in Bk but are involved in traversed hyperedges. Star-shaped nodes are members of the MET pathway. This network is available on GraphSpace.

Author Summary: Signaling pathways describe how cells respond to external signals through molecular interactions. As we gain a deeper understanding of these signaling reactions, it is important to understand how molecules may influence downstream responses and how pathways may affect each other. As the amount of information in signaling pathway databases continues to grow, we have the opportunity to analyze properties of pathway structure. We pose an intuitive question about signaling pathways: when are two molecules “connected” in a pathway? The answer varies dramatically based on the assumptions we make about how reactions link molecules. Here, we examine four approaches for modeling the structural topology of signaling pathways, and present methods to quantify whether two molecules are “connected” in a pathway database. We find that existing approaches are either too permissive (molecules are connected to many others) or too restrictive (molecules are connected to a handful of others), and we present a new measure that offers a continuum between these two extremes. We then expand our question to ask when an entire signaling pathway is “downstream” of another pathway, and show two case studies from the Reactome pathway database that uncover pathway influence. Finally, we show that the strict notion of connectivity can capture functional relationships among proteins using an independent benchmark dataset. Our approach to quantifying connectivity in pathways considers a biologically-motivated definition of connectivity, laying the foundation for more sophisticated analyses that leverage the detailed information in pathway databases.
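The gap between permissive and strict notions of connectivity can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a toy directed hypergraph in which each hyperedge is a (tail, head) pair of node sets, and it contrasts a graph-style rule (a reaction fires when any tail node is reached) with B-connectivity from the directed-hypergraph literature (a reaction fires only when every tail node is reached). The function names and the example reactions are hypothetical, and the B-relaxation distance itself is not reproduced here.

```python
def _closure(hyperedges, sources, rule):
    """Grow the reached set until no hyperedge can add new nodes.

    `rule` is `any` (permissive) or `all` (strict) and decides when a
    hyperedge may be traversed given which of its tail nodes are reached.
    """
    reached = set(sources)
    changed = True
    while changed:
        changed = False
        for tail, head in hyperedges:
            fires = rule(node in reached for node in tail)
            if fires and not set(head) <= reached:
                reached |= set(head)
                changed = True
    return reached


def graph_reachable(hyperedges, sources):
    """Permissive: one reached tail node is enough to traverse a hyperedge."""
    return _closure(hyperedges, sources, any)


def b_connected(hyperedges, sources):
    """Strict: every tail node must be reached before a hyperedge is traversed."""
    return _closure(hyperedges, sources, all)


# Hypothetical reactions: A -> B, then B + C -> D, then D -> E.
reactions = [
    ({"A"}, {"B"}),
    ({"B", "C"}, {"D"}),
    ({"D"}, {"E"}),
]

print(sorted(graph_reachable(reactions, {"A"})))  # ['A', 'B', 'D', 'E']
print(sorted(b_connected(reactions, {"A"})))      # ['A', 'B']
```

In this toy example the strict rule stops at B because C is never produced, whereas the permissive rule reaches everything downstream. A measure that interpolates between the two, as the summary describes, would sit between these extremes.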

Integrating thermodynamic and sequence contexts improves protein-RNA binding prediction

**Su *et al* 2019: (A) An overall framework of ThermoNet.** The model receives an RNA sequence as the input, and k-mers of various lengths are extracted as sequence features. (B) Prediction network structure. The convolutional neural network takes the sequence embedding and secondary structure representation as input. The output of the network is the predicted binding intensity for the input RNA.

Author Summary: RNA-binding proteins (RBPs) play a key role in modulating various cellular processes, including transcription, alternative splicing, and translational regulation. Identifying protein-RNA interactions and the binding preferences of RBPs is critical to unraveling the mechanism of post-transcriptional gene regulation. In the current study, we present a computational approach that integrates both structure and sequence contexts for protein-RNA binding prediction. We propose to incorporate the structure information using a thermodynamic ensemble of secondary structures, which effectively identifies RBP-binding structural preferences, especially for structured RNAs. Our model is further empowered by a deep neural network that combines the sequence and structure information to achieve improved protein-RNA binding prediction. Extensive experiments on both in vitro and in vivo datasets demonstrate the superior performance of our method compared to several state-of-the-art approaches. This study suggests the great potential of our method as a practical tool for identifying novel protein-RNA interactions and binding sites of RBPs.
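The sequence-feature step mentioned in the figure caption, extracting k-mers of several lengths from an RNA sequence, can be sketched in a few lines of Python. This is a generic illustration and not ThermoNet's code: the function name, the choice of k values, and the example sequence are assumptions, and the embedding and secondary-structure inputs are not modeled.

```python
from collections import Counter


def extract_kmers(sequence, k_values=(2, 3)):
    """Count all overlapping k-mers of each length in k_values."""
    counts = Counter()
    for k in k_values:
        # A sequence shorter than k yields an empty range and adds nothing.
        for start in range(len(sequence) - k + 1):
            counts[sequence[start:start + k]] += 1
    return counts


features = extract_kmers("AUGGCAU", k_values=(2, 3))
print(features["AU"])   # 2
print(features["AUG"])  # 1
```

Counts like these are one simple way to turn a variable-length sequence into features; a learned embedding of the k-mers would replace the raw counts in a neural model.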

TreeCluster: Clustering biological sequences using phylogenetic trees

**Balaban *et al* 2019: When the phylogenetic tree is ultrametric, clustering is trivial.** For a threshold α, cut the tree at height (A). When the tree is not ultrametric, it is not obvious how to cluster leaves (B). In both cases, a set of cut edges defines a clustering.

Abstract: Clustering homologous sequences based on their similarity is a problem that appears in many bioinformatics applications. The fact that sequences cluster is ultimately the result of their phylogenetic relationships. Despite this observation and the natural ways in which a tree can define clusters, most applications of sequence clustering do not use a phylogenetic tree and instead operate on pairwise sequence distances. Due to advances in large-scale phylogenetic inference, we argue that tree-based clustering is under-utilized. We define a family of optimization problems that, given an arbitrary tree, return the minimum number of clusters such that all clusters adhere to constraints on their heterogeneity. We study three specific constraints, limiting (1) the diameter of each cluster, (2) the sum of its branch lengths, or (3) chains of pairwise distances. These three problems can be solved in time that increases linearly with the size of the tree, and for two of the three criteria, the algorithms have been known in the theoretical computer science literature. We implement these algorithms in a tool called TreeCluster, which we test on three applications: OTU clustering for microbiome data, HIV transmission clustering, and divide-and-conquer multiple sequence alignment. We show that, by using tree-based distances, TreeCluster generates more internally consistent clusters than alternatives and improves the effectiveness of downstream applications. TreeCluster is available on GitHub.
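The observation in the figure caption that a set of cut edges defines a clustering can be shown with a short Python sketch. It is not TreeCluster's code and does not choose which edges to cut (that is the optimization problem the paper addresses); it only assumes a tree given as a list of undirected edges, with hypothetical node names, and groups the leaves that remain connected once the cut edges are removed.

```python
from collections import defaultdict


def clusters_from_cut_edges(edges, cut_edges, leaves):
    """Group leaves by connected component after removing the cut edges."""
    cut = {frozenset(edge) for edge in cut_edges}
    adjacency = defaultdict(set)
    for u, v in edges:
        if frozenset((u, v)) not in cut:
            adjacency[u].add(v)
            adjacency[v].add(u)

    leaf_set, seen, clusters = set(leaves), set(), []
    for leaf in sorted(leaf_set):
        if leaf in seen:
            continue
        # Depth-first search over the edges that were not cut.
        component, stack = set(), [leaf]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        clusters.append(sorted(component & leaf_set))
    return clusters


# Hypothetical four-leaf tree: root -> x -> (a, b) and root -> y -> (c, d).
edges = [("root", "x"), ("root", "y"), ("x", "a"), ("x", "b"), ("y", "c"), ("y", "d")]
leaves = ["a", "b", "c", "d"]

print(clusters_from_cut_edges(edges, [("root", "y")], leaves))
# [['a', 'b'], ['c', 'd']]
```

Cutting more edges yields finer clusters; for instance, also cutting the edge between y and d would leave d in a cluster of its own.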