A crazy little thing called reproducible science

RAPIDS, 18th July 2018, London, UK

Presented By
Tania Allard, PhD

A bit more about me

  • Research Software Engineer
  • Software Carpentry trainer and instructor
  • Reproducibility nerd
  • OSS advocate, community builder, and contributor
  • I 💜 emojis and memes

Larger datasets, more computational resources, and science

Natural bedfellows?

Boon and challenge... all at once

  • Large datasets and complex models enable the discovery of new and more subtle scientific phenomena
  • We have access to more data and resources than ever before... the sky is the limit!
  • Data- and computationally-intensive modern methods have outgrown the traditional ways we communicate research
  • They have also forced new paradigms into existence

Who cares about reproducibility?


Many care and talk about it... but reproducible practices are not always part of the day-to-day tasks of ML, DS, and research in general

What is reproducibility?

It depends who you ask...
For experimental science: method + environment = results

Get the same results even if repeating in a different laboratory

The father of reproducibility

~150 years ago Pasteur demonstrated how experiments can be conducted reproducibly and the value of doing it that way.


💉 Antibiotics


🍻 Beers!!!


In a computational environment

It refers to being able to get the same results on your own or another computer using the same code... even years after the initial study/first run (time- and machine-independent).

🤔 So same code same results...

What about the data?

⭐ Low reproducibility: paper only

A well-written paper should, in theory, explicitly lay out the methodology in enough detail to allow for reproduction.

In practice, however, reproducing from the paper alone is often impractical due to incomplete information and the conceptual dependencies needed to understand the paper.

Sylvain Gelly and David Silver. Combining online and offline knowledge in UCT. In Proceedings of the 24th international conference on Machine learning, pages 273–280. ACM, 2007

⭐⭐ Medium reproducibility: code and data

This is the level of reproducibility encouraged by ICML.
Matej Balog, Nilesh Tripuraneni, Zoubin Ghahramani, and Adrian Weller. Lost relatives of the Gumbel trick. In 34th International Conference on Machine Learning (ICML), August 2017.


⭐⭐⭐ High reproducibility: code, data, and environment

Here we refer to all the libraries and dependencies necessary to run the code provided on a new machine.

It avoids the "it works on my machine" problem.

Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In International Conference on Machine Learning, pages 1885–1894, 2017.

https://worksheets.codalab.org/worksheets/0x2b314dc3536b482dbba02783a24719fd/
https://hub.docker.com/r/pangwei/tf1.1/


- Jenny Bryan, on project-oriented workflows

Let's revisit a typical scenario

What you did...

Open package 'x'. Click, click, drag, click, click, right-click, save, 'results.csv'

Load into Excel. Click, drag, generate graph, right-click, save, 'beautifulgraph1.png'

What you reported...

The data was analysed with package 'x' using the 'y' analysis. The results are shown in Figure 1.

A better technical scenario

Your objective is to have a complete chain of custody (provenance) from your raw data to your finished results and figures.

This way you would be able to figure out what code and data were used to generate which result.

If using version control you can also refer to specific versions of your study (i.e. manuscript, first quarter report, Nobel Prize committee version)
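As a sketch of what "specific versions" can look like with git: an annotated tag gives a meaningful name to the exact state of your project that produced a given output. The repository, file, and tag names below are made up for illustration.

```shell
set -e
# Toy repository just for illustration; all names are hypothetical.
rm -rf demo-analysis
git init -q demo-analysis
git -C demo-analysis config user.email "you@example.com"
git -C demo-analysis config user.name "Your Name"
echo "sample,score" > demo-analysis/results.csv
git -C demo-analysis add results.csv
git -C demo-analysis commit -q -m "Results used in the first quarter report"
# An annotated tag names this exact snapshot...
git -C demo-analysis tag -a first-quarter-report -m "Snapshot cited in the Q1 report"
# ...so you (or the Nobel Prize committee) can always get back to it:
git -C demo-analysis tag
```

Later, `git checkout first-quarter-report` restores the project exactly as it was when that report was produced.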

Practical scenario

Imagine someone manages to sneak into your office at night AND deletes E-V-E-R-Y-T-H-I-N-G except for your code and data ('cause these are in safe repositories)
Imagine being able to run a single command to generate everything including results, tables, and figures in their final polished form.
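In practice, that single command is often just `make`. A minimal sketch of such a pipeline, reusing file names from elsewhere in these slides (all script and data names are hypothetical; recipe lines are indented with tabs):

```make
# Hypothetical pipeline: raw data -> cleaned data -> figure.
all: figures/beautifulgraph1.png

data/clean.csv: data/raw.csv 01_cleandata.py
	python 01_cleandata.py data/raw.csv data/clean.csv

figures/beautifulgraph1.png: data/clean.csv 02_plot.py
	python 02_plot.py data/clean.csv figures/beautifulgraph1.png
```

Running `make` rebuilds only what changed, and the Makefile itself documents the provenance from raw data to finished figure.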

Wouldn't it be great?

Better yet, someone completely unfamiliar with your project could look at these files and understand what you did and why (i.e. readers, collaborators, your replacement, you in 6 months' time).

From a sustainability point of view

Speed scientific progress

Contribute to open source and our beloved community

Acquire more varied, highly valuable skills

Change the current academic culture ✨

And my all time favourite... increase the bus factor

Why is it so hard?

Barriers to reproducible science/research
  • Not considered for promotion*
  • Requires additional skills
  • Takes time
  • Publication bias towards *innovative* findings
  • Held to higher standards than other work

Many other barriers to open science

  • Paywalled articles
  • People decrying "methodological terrorism"
  • Broken institutions/processes

There is hope

The start to a long lasting relationship

What can I do to make my research more reproducible?

Start with small practical steps....

These are worth millions... just trust me

When shall I think about reproducibility?

The best time to think about reproducibility is also the best time to pick your evaluation/diagnostic/statistical methods: right when you're scoping the project.

This also increases your chances of success

Treat your digital research assets with care

The results are important, but the process you followed and the tools you used to get there are just as important.
Your scripts/code, null results, datasets, and iterations can make a positive difference in research.

Share with others

  • Well documented code.... even if for yourself in 6 months time
  • Data used to produce the results
  • The details of the workflows used
  • Information on how to cite your work
  • Information on how to use your work: licenses
  • Deterministic execution environments*
*To ensure that anyone else running your analysis on a different machine would get the same results
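One lightweight way to pin an execution environment is a conda `environment.yml` with exact versions; the project name and package versions below are placeholders:

```yaml
# environment.yml -- all names and versions here are placeholders.
name: my-analysis
channels:
  - defaults
dependencies:
  - python=3.6
  - numpy=1.14.5
  - pandas=0.23.3
```

Committing this file (or a pinned requirements.txt, or a Dockerfile for full ⭐⭐⭐ reproducibility) lets others recreate the environment with `conda env create -f environment.yml`.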

But sharing is not enough...

Always, always add a license! Data and code are covered by copyright by default; without a license they sit in a weird limbo: public, but not legally usable, reusable, or modifiable.

And please add metadata... data about data

Make things traceable

When sharing your data and code provide citation files, DOIs, links to archives.
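For code, one machine-readable convention is the Citation File Format (a `CITATION.cff` file in your repository); every value in this sketch is a placeholder:

```yaml
# CITATION.cff -- every value below is a placeholder.
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Doe
    given-names: Jane
title: "My analysis pipeline"
version: 1.0.0
doi: 10.5281/zenodo.0000000
date-released: 2018-07-18
```

Archives such as Zenodo can mint the DOI for you from a tagged release.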

When using others' data, make sure to cite it and add information on where others can access said data.

Make sure the link works!!!

Version control is your friend

Make sure you version control your code, manuscripts, methods, execution environment, and data (but remember raw data are sacrosanct)

Sharing formats

Avoid sharing data in language-specific serialized formats (pickles, .RData).

Yes, they are lightweight, but they can break across languages, versions, and environments. Favour .csv, .txt, .json, .hdf5
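A tiny illustration using only the Python standard library (file and column names are made up): the same records written as .csv and .json can be read back from any language or spreadsheet, unlike a pickle of the same data.

```python
import csv
import json

# Hypothetical results; any language or spreadsheet can read these back.
results = [{"sample": 1, "score": 0.91}, {"sample": 2, "score": 0.87}]

# Plain-text formats survive language and version changes; a pickle
# of the same data may fail to load in a future environment.
with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["sample", "score"])
    writer.writeheader()
    writer.writerows(results)

with open("results.json", "w") as f:
    json.dump(results, f, indent=2)

# Reading the CSV back (values come back as strings):
with open("results.csv") as f:
    loaded = list(csv.DictReader(f))
print(loaded[0]["score"])  # -> 0.91
```

The trade-off: you give up a little convenience (types come back as strings from CSV) in exchange for formats that will still open in ten years.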

Feeling self-conscious?

Get an extra pair of eyes on your code first... use a linter (lintr, pylint)

Choose the right tool for the job

  • Sharing a pipeline that needs to be run in a specific order? Consider sharing plain scripts
  • Need to integrate narrative and viz? Use Jupyter notebooks or Rmarkdown
  • Need interactivity? Use binder
  • Care about portability/reusability? Consider making a package

How much data/code?

  • Raw data and pre-processing code
  • Pre-processed data and modelling code
  • Trained model and evaluation code
  • Documentation and example cases

The art of naming

Naming things is hard. Aim to achieve the following:
  • Machine readable (plays well with grep, ordering): 01_cleandata.py
  • Human readable: 02_length-talk-vs-interest.png
  • Universal: 2018-07-19
  • Intentional use of delimiters: 05_embrace-the-slug.R
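A quick check of why the zero-padded, machine-readable prefixes above pay off (file names here are made up): lexicographic sorting is what `ls`, `grep`, and most tools give you, so it should match execution order.

```python
# Without zero-padding, string sorting does not match numeric order.
unpadded = ["1_clean-data.py", "2_fit-model.py", "10_make-figures.py"]
padded = ["01_clean-data.py", "02_fit-model.py", "10_make-figures.py"]

print(sorted(unpadded))  # "10_..." sorts before "1_..." and "2_...": wrong order
print(sorted(padded))    # ['01_clean-data.py', '02_fit-model.py', '10_make-figures.py']
```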

Is reproducible == open?

What if I cannot share my data/code?

That is fine...

Find what works for you

It is not only about disclosure

You can have FAIR assets without them being open

Adopt an open science approach

What does this even mean?

It is not only about the science

It is also about the people and empowering them to make better science

We are not the leaders of tomorrow, we are the leaders of today
Doing open and reproducible science can often be hard and frustrating. But...
"Unless someone like you cares a whole awful lot, nothing is going to get better.
It's not."

Dr. Seuss, The Lorax