How open is ‘open data’?
Science is about what’s objectively true, and that means the same results should appear regardless of who does the work or when they do it. As part of a general commitment to reproducibility [1] (justifiably a watch-word for modern psychology), authors are increasingly making data and analysis scripts available. This means that other researchers (and members of the general public) can explore the relationship between the data and the results based upon them. Data being available, however, is only part of the puzzle; another part concerns data accessibility: how easily can the data be understood? [2]
Recently my colleague Joshua Calder-Travis and I became intrigued enough by a paper to invest a couple of hours in accessing the data and running a simple correlation which the authors’ arguments had led us to suspect would demonstrate a statistically significant effect (spoiler alert: we succeeded and it did). In the process of doing this exploration we gained some insights into the practicalities of ‘open data’: investigating data can be enormously helpful in understanding a paper, [3] and sharing data is science communication.
Getting the data
The data and scripts were easy to download following links in the ‘data availability’ section of the paper. [4] We soon noticed that data from only 3 of the 4 experiments were included. Naturally, the data we particularly wanted to investigate were from the missing experiment! Still, hoping for a nice scatter plot before the afternoon was out, we pressed on with the data we had.
We turned to identifying the main analysis script. With no ‘readme’ file describing the contents of the download, we were on our own... The names of the scripts were very vague, but with only a handful of script files it was a matter of minutes to identify the most likely candidate.
Having found the correct script, we tried to run it… and immediately encountered an error arising from a missing function. The authors had used several custom functions to locate and load the correct data files, and these functions were not included in the download. Presumably they had forgotten that the analysis called these functions. [5] Thankfully, Joshua and I are familiar enough with MATLAB that it only took a couple of minutes to write rudimentary replacements for the missing functions.
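To give a flavor of what those replacements looked like, here is a minimal sketch of the kind of stand-in we wrote. Everything in it is hypothetical (the original helpers were not shared); it simply assumes one .mat file per participant sitting in a local data directory.

    % Hypothetical stand-in for one of the missing loader functions,
    % saved as loadSubjectData.m. Assumes one .mat file per participant,
    % e.g. 'sub01.mat', inside dataDir. All names here are our invention.
    function data = loadSubjectData(dataDir, subjectID)
        filePath = fullfile(dataDir, sprintf('sub%02d.mat', subjectID));
        data = load(filePath);   % returns a struct containing the file's variables
    end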
Analyzing data
With the data loaded, the script ran long enough to throw out a slew of analysis tables and figures, most of which were recognizable from the published paper, before crashing once again due to another missing function.
The missing function this time around was more serious: its role was to fit a variety of models to each participant. The modelling parameters formed a core part of the authors’ analysis in their paper, and one of the model parameters was involved in the effect we wanted to investigate. Given more time we would have contacted the authors and requested the modelling code. As it was, we calculated a simpler substitute variable from the raw data. To achieve this we had to familiarize ourselves with the script well enough to identify which, among a plethora of similarly-named, comment-free variables, were the specific ones required to calculate our derived value.
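To give a sense of what we mean by a ‘simpler substitute’, here is a sketch in the spirit of what we did. The names (trialData, subjectIDs, the per-trial correct column) are hypothetical stand-ins rather than the paper’s; the point is only that the summary comes straight from the raw trials rather than from a fitted model.

    % Hypothetical sketch: a per-participant summary computed directly from
    % the raw trials, standing in for the model parameter we could not re-fit.
    % trialData is assumed to be a table with 'subject' and 'correct' columns.
    nSubjects  = numel(subjectIDs);
    substitute = nan(nSubjects, 1);
    for s = 1:nSubjects
        trials        = trialData(trialData.subject == subjectIDs(s), :);
        substitute(s) = mean(trials.correct);   % e.g. raw accuracy as a crude proxy
    end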
Evaluation
We were successful insofar as we managed to run our regression and observe the effect we had predicted. That said, we were able to use neither the dataset nor the variables we had originally intended to analyze. A victory for open science, but only an equivocal victory for open data.
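For completeness, the final step was no more exotic than the sketch below; otherMeasure is a placeholder for the second per-participant variable in the relationship we predicted.

    % Hypothetical sketch of the final step: correlate the substitute variable
    % with the other per-participant measure and eyeball the scatter plot.
    % corr and lsline require the Statistics and Machine Learning Toolbox.
    [r, p] = corr(substitute, otherMeasure);
    scatter(substitute, otherMeasure, 'filled');
    lsline;                                 % add a least-squares fit line
    fprintf('r = %.2f, p = %.3f\n', r, p);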
Despite the difficulties we encountered, it was very satisfying to explore the ideas in the paper ourselves, rather than passively reading about them. It felt like the way science should be done: a community questioning and investigating ideas together. In sharing their data, the paper’s authors had taken psychology a step in that direction. Our comments here are intended to help others do likewise, and to lower the barriers to exploring data as a user.
Sharer’s perspective
Researchers are busy, we know. They’re also generous with their time when contacted by another researcher genuinely interested in their work. The result is a trade-off whereby the commitment to open data is fulfilled only provisionally: data are provided in a rough-and-ready way, with an acceptance that time will be found later if an email arrives requesting help with the data. As ‘open science’ becomes the norm, this approach doesn’t seem sustainable.
The ‘here if you need me’ approach does not scale well, and so it implicitly assumes that few people will ever use the data; in that sense it presupposes failure. When science was distinctly separate from the general public, and when enquiries were rare enough to be worth the time expended when they arrived, the approach represented a good balance of immediate versus potential future time investment. However, just as few people would write a blog with the goal that no one will read it, no researcher should make their data available with the goal that no one will investigate them. The more widely accessible data are, the more widely accessed we expect them to be, and thus the more queries we expect them to generate. Better to spend the couple of hours upfront ensuring everything is in order than face the long tail of enquiries…
The commitment to open science should go beyond merely ensuring that a data trail exists which can be appealed to in defense against disputes and allegations. Making data available is an act of science communication: the goal is to engage as many (relevant) people as possible. Data which are intended to be used by many people must be as user-friendly as possible.
Furthermore, the typical approach is fragile. So long as one of the original authors needs to be contacted in order to unlock the data for those investigating them, the data’s usability can easily be lost. It is also vulnerable to the vagaries of human memory: details of variables and scripts which could not be driven from one’s mind mere weeks ago can be entirely mysterious after a project is complete (as we know from firsthand experience).
Conclusions
It was very satisfying to investigate ideas from a paper by analyzing a large chunk of the supporting data. We encountered some serious problems, but overall, we feel that the goals of ‘open science’ were fairly well served in this interaction. Taking the perspective of the authors into account, here are a few points which will be on our checklist when we’re fulfilling our own commitments to open data.
- Make data hygiene routine: comment code, give variables sensible names, and maintain a readme file explaining what each file in your data directory is and why it is there (a minimal example follows this list). This will help future you as much as unknown data consumers.
- Download the packaged data and run the analysis scripts you’ve included on a non-lab computer, just to check that you’ve included all the odds and ends (custom functions especially).
- Make sure to include all the data from your paper, or include a note explaining what’s missing and why.
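As a rough idea of what that readme might look like (file names here are invented purely for illustration):

    README.txt
    ----------
    data/exp1_trials.csv    Raw trial data for Experiment 1 (one row per trial).
    data/exp2_trials.csv    Raw trial data for Experiment 2.
    scripts/loadData.m      Helper used by the main analysis to locate and load the data.
    scripts/mainAnalysis.m  Reproduces all figures and tables in the paper; run this first.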
[1] Some researchers differentiate ‘reproducible’, meaning ‘get the same results from the same data’, from ‘replicable’, meaning ‘see the same effect in a novel dataset’ (Alter & Gonzalez, 2018).
[2] The FAIR data principles suggest data must be findable, accessible, interoperable, and reusable.
[3] The understanding gained can be similar to taking the time to read a paper closely, though it is less time-efficient as a rule.
[4] A simple download without having to sign up, sign in, or fill out any forms. We expect it to be this way, but it’s worth acknowledging the benefits all the same.
[5] This is a surprisingly easy mistake to make in MATLAB, as we know from code sharing in the lab!