Data sharing in a modern world; well, maybe not so modern

November 10, 2016 PLOS Collections Research Analysis & Science Policy

By Melissa Haendel and Nicole Vasilevsky

When I was asked to assist in the selection of manuscripts to highlight for this collection, I was a bit stymied. What should be the most important inclusion criteria? The easiest place to start was to eliminate papers on which I’m an author. I don’t mention this because I am proud that my work was among the 80+ papers I was given to review (obviously a conflict of interest). Rather, thinking about why I would not have selected my own papers helped me to work out what was most important. The second thing I did was consult one of my colleagues, Nicole Vasilevsky, a biocurator extraordinaire. I knew that our respective selections would not be the same, and that the ensuing banter would help us come up with a worthy strategy that did justice to the diverse issues highlighted in the Open Data Collection.

So what is important for data sharing? The impact on policy change? Highlighting ethical issues? Data science that advances our abilities to share or vice versa? Technologies that leverage shared data (the noble discipline of “data parasitism”)? Community-focused efforts that implore the world to change? How sexy the figures are? And then of course, there is simply what people think: how much is the article being discussed? Finally, we felt that we must consider disciplinary perspectives that foster cross-pollination of ideas and approaches. So what were my favorite picks? I chose 20 to be in the collection, but here I describe a few to better highlight what I think is so important across the aforementioned criteria.

The most “meta”

“The Dawn of Open Access to Phylogenetic Data” in PLOS ONE was my favorite manuscript; it was “meta” in that it evaluated the availability of data used to derive exchangeable knowledge artifacts (phylogenies). The authors used Bayesian logistic regression to estimate the effect of variables such as “from a professor” or “weak sharing policy” on data sharing. The authors discovered that as much as 60% of phylogenetic data is lost to science! But they also found that things are rapidly improving.

Why data sharing doesn’t suck

“Sharing Detailed Research Data Is Associated with Increased Citation Rate” in PLOS ONE received a lot of attention, as it showed that papers that shared their data were cited more often than those that did not, fundamentally making data sharing far more inviting to many more people.

Why sucky data might lead to less sharing

In “Willingness to Share Research Data Is Related to the Strength of the Evidence and the Quality of Reporting of Statistical Results” (PLOS ONE), the authors hypothesized that researchers who have weaker data are more hesitant to share. The authors not only provided outstanding statistics and deep analysis to confirm this hypothesis, but in doing so, they also revealed significant reporting errors. This further validates my belief in research parasitism for bettering science.

Show me all your genes

“Submission of Microarray Data to Public Repositories” in PLOS Biology was one of the first articles to adequately proclaim that data repositories should be working with the journals to ensure the data was open and compliant, paving the way for policy change at many journals, including PLOS.

“Things are not always what they seem; the first appearance deceives many; the intelligence of a few perceives what has been carefully hidden” (Phaedrus)

“Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm” in PLOS Biology illustrated how our scientific conclusions depend on which data are presented and how they are visualized. The authors analyzed numerous papers presenting continuous data in bar and line graphs, which is problematic, as “many different data distributions can lead to the same bar or line graph.” The authors concluded that one can easily reach the wrong conclusion due to poor data visualization!
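
The claim that many different data distributions can lead to the same bar graph is easy to try out for yourself. What follows is only a minimal sketch in Python, not anything taken from the paper: the three samples are invented for illustration, and the `rescale` helper is a hypothetical name. It shifts and scales each sample so that all three end up with the same mean and the same standard deviation, which is all that a bar with an error bar would display.

```python
from statistics import mean, stdev


def rescale(values, target_mean, target_sd):
    """Shift and scale values so they have the given mean and sample SD."""
    m, s = mean(values), stdev(values)
    return [target_mean + target_sd * (v - m) / s for v in values]


# Three made-up samples with very different shapes (illustrative only).
raw = {
    "symmetric": [1, 2, 3, 4, 5, 6, 7, 8],
    "bimodal": [1, 1, 1, 1, 8, 8, 8, 8],
    "outlier": [1, 1, 1, 1, 1, 1, 1, 20],
}

# Force every sample to the same mean and SD; the shapes stay different.
samples = {name: rescale(vals, 10.0, 2.0) for name, vals in raw.items()}

for name, vals in samples.items():
    rounded = [round(v, 1) for v in vals]
    print(f"{name:10s} mean={mean(vals):.2f} sd={stdev(vals):.2f} values={rounded}")
```

A bar graph with error bars would draw these three samples identically, while a plot of the individual values would immediately show two clusters in one case and a single extreme point in another.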

An obstinacy of concerns

With my ethics teacher hat on, I’d be remiss if I didn’t highlight “Ethical Challenges of Big Data in Public Health” published in PLOS Computational Biology. Here, we are (fore)warned that while big epidemiologic data can be used for disease surveillance – saving lives is clearly a “good thing” – there are risks and unintended consequences, and there is a call to arms for the development of data sharing policies around specific types of data, whether they be social digital data or private genomic data in aggregate.

In the end, what is important to discuss with regard to data sharing? Basically anything and everything that assists the whole research cycle in moving forward more efficiently and effectively. Clearly, this is a team sport.

About the Authors

Melissa Haendel is trained as a developmental neuroscientist and is deeply invested in making every graduate student’s data count.

Nicole Vasilevsky aims to improve all research by educating the scientific bucket brigade in open data practices.

Both are faculty at Oregon Health & Science University.