The Signaling Pathways Project: a FAIR portal into the cell signaling ‘omics data universe
Neil McKenna, Ph.D., Associate Professor of Molecular and Cellular Biology at Baylor College of Medicine and Director of the Signaling Pathways Project, introduces a collaboration between PLOS Collections and the Signaling Pathways Project (SPP) to connect datasets imported into the SPP Knowledgebase with their associated PLOS articles in the ‘Omics of Cellular Signaling Pathways Collection.
Mammalian receptors, enzymes, transcription factors and ancillary factors connect extracellular signals to their downstream genomic endpoints through a series of interdependent interactions that are commonly referred to as “signaling pathways”. These pathway nodes act as points of convergence and integration on the one hand, and divergence and distribution on the other, to ensure an appropriate response of any given cell to its afferent signaling cues. In recent years, broad-compass transcriptomic (expression array) and cistromic (ChIP-Seq) analysis platforms have gained widespread adoption in the dissection of these pathways. Whether they involve direct genetic manipulation of a node, or administration of a bioactive small molecule (BSM) that impacts the function of a node, publicly archived transcriptomic and cistromic datasets, as a collective, can help researchers infer regulatory connections between pathway nodes and their downstream genomic targets. We emphasize “as a collective” because in their default archived state, often cryptically annotated or lacking associated article information, these datasets can be intimidating for researchers to re-use for hypothesis generation and data validation purposes. The result is that many archived datasets fall short of the ideals espoused by proponents of the FAIR (findability, accessibility, interoperability and reusability) data movement.
The Signaling Pathways Project
In response to this knowledge gap, we have developed the Signaling Pathways Project (SPP). SPP is built upon a FAIR biocurational pipeline that leverages publicly archived transcriptomic and ChIP-Seq datasets to model regulatory relationships between the four major categories of signaling pathway node (transmembrane or nuclear receptors, signaling enzymes, transcription factors and co-nodes) and their downstream genomic targets [1]. To provide for discrimination of tissue-selective patterns of gene expression, the SPP data model also incorporates a hierarchy of physiological systems and organs and experimental biosamples (tissues and cultured or primary cell lines). The SPP resource is predicated on the idea that receptors, enzymes, transcription factors and other regulatory nodes are molecular free agents whose function is not necessarily tied to any single context and which, by extension, have the theoretical potential to associate in any modular combination in a given cellular context. Table 1 shows some examples of the perspectives that SPP datasets and Regulation Reports afford users to formulate hypotheses around relationships between pathway nodes and their genomic targets in a biological system of interest.
Although other ‘omics dataset re-use resources exist, SPP (to our knowledge at least) is unique in mapping individual datasets to nodes and node families. The virtue of this additional biocurational step is that few datasets are singletons, and no dataset is expendable: if it was generated in a human, mouse or rat tissue, it is relevant to SPP. This approach also allows us to carry out meta-analysis of datasets in the form of consensomes (for consensus ‘omics). In consensome analysis, we survey across these datasets in an unbiased and systematic manner to identify those genes that are most consistently and reproducibly impacted by genetic or small molecule perturbation of a signaling pathway node [1]. In the past, if you wanted to find target genes of a receptor or transcription factor of interest, your best bet was combing through review articles, or maybe coming across the occasional lab web page curated by a diligent graduate student. That literature-driven route has the effect of reinforcing literature biases, where the same targets end up being studied again and again. Consensomes give researchers an alternative: a purely data-driven way to identify those node targets that are most appropriate and relevant to their biological system of interest. Table 1 shows a couple of examples currently on the site – over the coming months we will be rolling out new ones, so please sign up for the SPP Twitter feed to keep up to date.
Table 1. Examples of Signaling Pathways Project queries.
|Query Type||Data type||Example|
|Dataset||Transcriptomic||Analysis of erb-b2 receptor tyrosine kinase 2 (ERBB2)-, and BCAR1 Cas family scaffolding protein (BCAR1)-dependent transcriptomes in human MCF10A breast cancer cells|
|Dataset||ChIP-Seq||Analysis of the Kdm1a, Kmt2a, Kmt2c, Men1, Sin3a and Wdr5 cistromes in mouse C2C12 skeletal muscle myoblasts|
|GO Term||Transcriptomic||Regulation of cell cycle genes by catalytic receptors|
|GO Term||ChIP-Seq||Regulation of inflammatory response genes by NFκB p50 subunit-like factors|
|Consensome||Transcriptomic||EGF receptors in the mammary gland|
|Consensome||Transcriptomic||PPARGC1 family in the mouse metabolic system|
|Consensome||Transcriptomic||All nodes in any mouse adipose tissue biosample|
|Consensome||ChIP-Seq||Mouse C/EBP transcription factors|
|Consensome||ChIP-Seq||Human CBP/p300 acetyltransferases|
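The consensome idea described above, ranking genes by how consistently they respond to perturbation of a node across many independent datasets, can be illustrated with a short Python sketch. This is not the published SPP algorithm: it is a deliberately simplified stand-in that assumes each experiment supplies a p-value per gene and scores a gene by the fraction of experiments in which it passes a significance cutoff. The function name, the cutoff, the gene labels and the p-values are all hypothetical.

```python
from collections import defaultdict


def rank_genes_by_consistency(experiments, p_cutoff=0.05):
    """Rank genes by how often they are significantly changed across experiments.

    `experiments` is a list of dicts, one per experiment, mapping a gene
    symbol to the p-value for its differential expression in that experiment.
    Simplified illustration only; NOT the published consensome method.
    """
    tested = defaultdict(int)  # experiments in which the gene was assayed
    hits = defaultdict(int)    # experiments in which it passed the cutoff
    for experiment in experiments:
        for gene, p_value in experiment.items():
            tested[gene] += 1
            if p_value < p_cutoff:
                hits[gene] += 1

    ranking = [(gene, hits[gene] / tested[gene], tested[gene]) for gene in tested]
    # Highest hit fraction first; ties broken by number of experiments, then name.
    ranking.sort(key=lambda row: (-row[1], -row[2], row[0]))
    return ranking


# Hypothetical p-values from three experiments perturbing the same node family.
experiments = [
    {"GENE_A": 0.001, "GENE_B": 0.20, "GENE_C": 0.03},
    {"GENE_A": 0.010, "GENE_B": 0.04, "GENE_C": 0.60},
    {"GENE_A": 0.002, "GENE_B": 0.70},
]

ranking = rank_genes_by_consistency(experiments)
for gene, fraction, n_tested in ranking:
    print(f"{gene}: significant in {fraction:.0%} of {n_tested} experiments")
```

Because the score depends only on what the datasets show, a gene that is rarely discussed in reviews can still rise to the top of the list, which is the sense in which a consensome is data-driven rather than literature-driven.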
PLOS Collection: ‘Omics of Cellular Signaling Pathways
We believe that hardening connections between peer-reviewed articles and their associated datasets is an essential component of a FAIR data universe. Far too many datasets are languishing in public archives with no information on what research article they are associated with. As a new service to PLOS readers, we are connecting SPP datasets to their associated PLOS articles and presenting them in a new PLOS Collection, ‘Omics of Cellular Signaling Pathways. As part of this Collection, we first identify PLOS articles describing original ‘omics-scale datasets that were generated in a human, mouse or rat biosample and that have been archived in a public repository. We then bin the datasets into three categories. Some involve genetic or small molecule manipulation of receptors, cytoplasmic or nuclear signaling enzymes, transcription factors or other major components of cellular signaling pathways. Others profile molecular events associated with a variety of animal and cellular model systems, such as the Zucker rat, Roux-en-Y gastric bypass, or – more topically – viral infections of human cells. A third group, clinical datasets, deploys discovery-driven approaches to characterizing disease states in human subjects, such as obesity or Type 2 diabetes. Next, we give the datasets names and descriptions that more accurately convey their design and structure than those typically found in public repositories. Finally, for each dataset we mint a digital object identifier (DOI) that, much as a bar code does for a product on a shelf in a grocery store, unambiguously asserts the identity and provenance of the dataset, no matter where on the internet its metadata might be distributed.
The re-annotated dataset is then hosted on the SPP website, whose features include: backlinks to the original article, firmly establishing the experimental context from which it emerged; one-click download of the citation into a favorite reference manager to provide attribution to the original dataset creators; and links into our pan-omics knowledgemine Ominer, to extend the biological compass of the dataset in ways not envisaged by the original investigators.
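The curation steps described above (eligibility check, binning into a category, re-naming, and attaching a DOI and article backlink) can be pictured as a small record type. The Python sketch below is purely illustrative: the class, field and function names are hypothetical and do not reflect the actual SPP schema, and the DOIs are placeholders rather than real identifiers.

```python
from dataclasses import dataclass
from enum import Enum


class DatasetCategory(Enum):
    """The three bins described in the text (labels are hypothetical)."""
    PATHWAY_NODE_MANIPULATION = "pathway node manipulation"
    MODEL_SYSTEM = "animal or cellular model system"
    CLINICAL = "clinical"


# Species named in the text as in scope for the Collection.
ELIGIBLE_SPECIES = {"human", "mouse", "rat"}


def is_eligible(species: str, archived_in_public_repository: bool) -> bool:
    """A dataset qualifies only if it is archived and from an in-scope species."""
    return archived_in_public_repository and species.lower() in ELIGIBLE_SPECIES


@dataclass(frozen=True)
class CuratedDataset:
    """A re-annotated dataset record (hypothetical layout, not the SPP schema)."""
    name: str                  # curator-assigned descriptive name
    description: str           # curator-assigned description
    category: DatasetCategory
    species: str
    article_doi: str           # backlink to the source article
    dataset_doi: str           # DOI minted for the dataset itself

    def citation(self) -> str:
        """Plain-text attribution linking the dataset to its source article."""
        return (f"{self.name}. Dataset DOI: {self.dataset_doi}. "
                f"Source article DOI: {self.article_doi}.")


# Placeholder values only; these are not real DOIs or a real dataset.
example = CuratedDataset(
    name="Example clinical transcriptomic dataset",
    description="Placeholder description of design and structure.",
    category=DatasetCategory.CLINICAL,
    species="human",
    article_doi="10.0000/placeholder-article",
    dataset_doi="10.0000/placeholder-dataset",
)

print(is_eligible(example.species, archived_in_public_repository=True))
print(example.citation())
```

Keeping the article DOI and the dataset DOI side by side in one record is what lets a dataset carry its provenance with it wherever its metadata is copied.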
We foresee an ideal future in which ‘omics datasets and the literature exist in a mutually enhancing relationship, the former providing researchers with insights that are limited in resolution but broad in scope, the latter providing the focused mechanistic and functional detail required to properly interpret and contextualize the node-target relationships. Paramount to such a scenario is equal ease of access to both the literature and ‘omics datasets, such that hypotheses can be generated from ‘omics datasets as readily and intuitively as abstracts can be accessed through literature search engines. Moreover, in an era of tightening research budgets, there is a pressing responsibility on the biomedical research community to re-purpose existing assets to allow bench researchers to routinely generate future research hypotheses. An important next step, therefore, will be to establish interoperability between SPP and knowledgebases such as Reactome that are based upon expert manual curation of the research literature. In addition, in a collaboration with the NIDDK Information Network (dkNET), we are designing a Hypothesis Center, the aim of which is to make connections between ‘omics datasets and community biological annotations (IMPC, Monarch, Gene Ontology, etc.) to enable researchers to formulate hypotheses around their signaling pathways in their biological contexts of interest.
Obviously, there are practical limitations on our approach and, in the absence of an army of biocurators, we need to design our annotation strategy to prioritize those datasets that map to pathways for which the greatest amount of information has been amassed in the research literature. We therefore welcome feedback from the PLOS reader and research community as to which pathway nodes they feel are under-represented in the research literature. In addition, although rates of deposition are improving across the community, we unfortunately encounter a considerable number of PLOS articles for which the associated ‘omics dataset has not been archived. Offering investigators the opportunity to have their articles featured in this Collection will hopefully incentivize them to ensure that ‘omics datasets associated with their papers are archived, and to let us know when this has been done. Although we are limited by budget as well as by the node demographics of archived datasets, we will do our best to respond to such feedback. Finally, we welcome input of any kind on the Collection and on SPP itself as a resource, and hope that PLOS readers will find it to be a useful tool in making sense of the growing volume of big data in cell signaling biology.
The peer-reviewed research literature is a rich source of information and has provided the vast majority of current knowledge on cellular signaling pathways. At SPP, however, we believe that traditional research articles, painstakingly detailed and essential though they are, are only part of the puzzle. Our mission at SPP is to improve the accessibility and re-usability of discovery-scale datasets through biocuration and analysis tool development, so that scientists have greater access to a universe of data points that has for the most part flown under the radar to date.
Check out and bookmark the Collection: https://collections.plos.org/signalingpathways
1 Ochsner SA, Abraham D, Martin K, Ding W, McOwiti A, Kankanamge et al. The Signaling Pathways Project, an integrated ‘omics knowledgebase for mammalian cellular signaling pathways. Scientific Data. 2019 Oct 31;6(1):252. pmid:31672983. PubMed Central PMCID: PMC6823428
Neil McKenna, Ph.D., is Associate Professor of Molecular and Cellular Biology at Baylor College of Medicine and leads the SPP knowledgebase. SPP is supported under the NIDDK Network Coordinating Unit U24 DK097771 and by a grant from the American Thyroid Association. Follow SPP on Twitter at @sigpathproject.