Down the rabbit hole.
A 101 on reproducible workflows with Python
"...the user interface conflated input, output, code, and presentation, making testing code and discovering bugs difficult"
"It is like agreeing that we will all drive on the left or the right. A hallmark of civilization is following conventions that constrain your behaviour a little, in the name of public safety"
Always be able to figure out what code and data were used to generate which result.
If you use version control, you can also refer to specific versions of your study (e.g. manuscript, first quarter report, Nobel Prize committee version).
Wouldn't it be great?
Better yet if someone completely unfamiliar with your project could look at these files and understand what you did and why (e.g. readers, collaborators, your replacement, you in 6 months' time).
Speed scientific progress
Contribute to open source and our beloved community 🐍
Acquire more varied, highly valuable skills
And my all time favourite... increase the bus factor
|Questions to ask||Possible scenarios||Tools and helpers|
|What is your MVP? Final product?||Thesis, article, internal report||LaTeX, Jupyter notebooks, scripts|
|Who will use my data and code?||Only me, colleagues, other groups||Git (private/public), license, documentation|
|Who will maintain this, and for how long?||6 months, until study completion||Collaborators, OSS community|
|What about my other assets? (data, slides, workflows)||Can be shared, published, deposited in a repository||Figshare, institutional database, web page|
"Open-source licenses allow people to coordinate their work freely, within the confines of copyright law, while making access and wide distribution a priority. I’ve always thought that this is fundamentally aligned with the method of science, where we value academic freedom and wide dissemination of scientific findings."
We are going to use Cookiecutter to create our base directory structure and some support documents.
If you have not set up git before, configure your name and email first (use your own details):
$ git config --global user.name "Tania Allard"
$ git config --global user.email "email@example.com"
If you are not sure whether you have set up git before, or want to check your configuration, use:
$ git config --list
git status in your shell
However, it is important to consider how you are going to share your data. The best practices for sharing data on the Web summary is a good place to get started.
We will be using the Kaggle wine
The first step is to download the data and store it in our
You want to make sure that all your data comes with information describing how you got it, the meaning of the columns, etc. For your own use, make sure to create at least a README file describing the data as best as you can. Create a README.txt (or .md) file inside the data directory and add the following content or something similar.
Title: Reproducible Python
keywords: wine, reviews, magazine, kaggle
Data collected from Kaggle winemag reviews
URL: https://www.kaggle.com/residentmario/renaming-combining-data/data
Collected on: 09/05/2018 by Tania Allard
Commit the README to git
We want to do the following:
winemag-130k-v2.csv containing only the following columns:
country, designation, points, price (in GBP). Save in a .csv file
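As a rough sketch of this subsetting step (not the script provided with the workshop), the snippet below keeps the four columns and converts the price using only the standard library. The sample rows, the USD_TO_GBP rate, the choice to drop rows without a price, and the function name subset_to_gbp are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical exchange rate, for illustration only.
USD_TO_GBP = 0.75
COLUMNS = ["country", "designation", "points", "price"]


def subset_to_gbp(rows, rate=USD_TO_GBP):
    """Keep only COLUMNS and convert the price to GBP.

    Rows without a price are dropped (an assumption of this sketch).
    """
    subset = []
    for row in rows:
        if not row.get("price"):
            continue
        kept = {name: row[name] for name in COLUMNS}
        kept["price"] = round(float(row["price"]) * rate, 2)
        subset.append(kept)
    return subset


# Tiny made-up sample standing in for the raw CSV file.
raw = io.StringIO(
    "country,designation,points,price,variety\n"
    "Chile,Reserva,87,20.0,Merlot\n"
    "Italy,Vulka Bianco,87,,White Blend\n"
)
wines = subset_to_gbp(csv.DictReader(raw))

# Write the result as CSV (to memory here; an open file works the same way).
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(wines)
print(out.getvalue())
```

Keeping the conversion in a small function means the same logic can later be imported by a runall script and covered by a unit test.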
Don't worry you do not have to generate all of the scripts... we have provided some scripts for you to get
You should now have a directory called
You need to make sure that all the scripts and notebooks from the directory are in the appropriate directory inside your newly created project.
Once this is done commit your changes to git
Let's face it... there are going to be files. LOTS of files.
The three principles for (file) names:
* Avoid spaces, accented characters, and file names that differ only by case (e.g. 'foo' and 'Foo')
|🚫 NO||✔ YES|
|Joey's filename has spaces and punctuation.xlsx||joeys-filenames-are-getting-better.xlsx|
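If you inherit a pile of badly named files, a small helper can do the renaming for you. The function below is a minimal sketch using only the standard library; the name safe_filename and the exact rules (lowercase, hyphens instead of spaces and punctuation, accents stripped) are choices made for this example rather than a fixed standard.

```python
import re
import unicodedata


def safe_filename(name):
    """Return a lowercase, ASCII-only, hyphen-separated file name."""
    stem, dot, ext = name.rpartition(".")
    if not dot:  # no extension present
        stem, ext = name, ""
    # Split accented characters into base letter + accent, then drop the accents.
    stem = unicodedata.normalize("NFKD", stem).encode("ascii", "ignore").decode("ascii")
    stem = stem.lower().replace("'", "")  # "joey's" becomes "joeys"
    # Collapse every run of other characters into a single hyphen.
    stem = re.sub(r"[^a-z0-9]+", "-", stem).strip("-")
    return f"{stem}.{ext.lower()}" if ext else stem


print(safe_filename("Joey's filename has spaces and punctuation.xlsx"))
```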
$ jupyter lab
Let's start by checking the scripts and notebooks:
Make sure to be at the root of your directory e.g.
$ python src/data/01_subset-data-GBP.py data/raw/winemag-data-130k-v2.csv
$ python src/visualization/02_visualize-wines.py data/interim/2018-05-09-winemag_priceGBP.csv
$ python src/data/03_country-subset.py data/interim/2018-05-09-winemag_priceGBP.csv Chile
😕 What problems did you encounter?
Besides, this gets quite boring pretty soon and still depends a lot on the user.
We used a modular approach here, so we can use and reuse the functions more efficiently.
The next step is to make a runall script to minimize user interaction.
First, we need to make sure that Python recognizes our scripts as a package so we can call functions from the multiple modules.
From the shell:
$ touch src/data/__init__.py           # Ensures Python understands
$ touch src/visualization/__init__.py  # that we are creating a package
$ touch src/__init__.py
Since our module names start with digits (e.g. 02) we cannot do the import as we'd normally do:
from mypackage import myAwesomeModule
Instead we need to do it like so:
import importlib
subset = importlib.import_module('.data.01_subset-data-GBP', 'src')
plotwines = importlib.import_module('.visualization.02_visualize-wines', 'src')
country_sub = importlib.import_module('.data.03_country-subset', 'src')
Also, we need to make sure that the other subpackages/modules are imported correctly. Add the following to
from . import data
from . import visualization
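To see why importlib helps here, the self-contained sketch below builds a throwaway package in a temporary directory and imports a module whose name starts with a digit and contains a hyphen. The package name demo_pkg, the module name 01_demo-module and the function greet are invented for this demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny package: demo_pkg/__init__.py and demo_pkg/01_demo-module.py
    pkg = Path(tmp) / "demo_pkg"
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    (pkg / "01_demo-module.py").write_text(
        "def greet():\n    return 'hello from 01'\n"
    )

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        # "from demo_pkg import 01_demo-module" would be a SyntaxError;
        # passing the name as a string avoids the identifier rules.
        module = importlib.import_module(".01_demo-module", "demo_pkg")
        message = module.greet()
    finally:
        sys.path.remove(tmp)

print(message)
```

The leading dot makes the import relative to the package given as the second argument, which is the same pattern used for the src package above.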
How would you run the analysis from step 01 (process the data) to step 03 (subset for a country and plot the results)?
Once you have done this, make sure you can run it from your shell and commit the changes to git.
Note you might need to run this from the shell like so
python -m src.runall-wine-analysis
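One possible shape for such a runall script is sketched below. The three step functions are hypothetical stand-ins for whatever your modules 01, 02 and 03 actually expose, and the returned paths are placeholders; the point is only the pattern of chaining the steps from a single main().

```python
def process_data(raw_path):
    """Stand-in for step 01: returns the path of the interim file."""
    return raw_path.replace("raw", "interim")


def plot_wines(interim_path):
    """Stand-in for step 02: returns the path of the figure it would save."""
    return "figures/wines.png"


def subset_country(interim_path, country):
    """Stand-in for step 03: returns the path of the country subset."""
    return f"data/processed/winemag_{country}.csv"


def main(raw_path, country="Chile"):
    """Run the three steps in order, feeding each output to the next step."""
    interim = process_data(raw_path)
    figure = plot_wines(interim)
    subset = subset_country(interim, country)
    return interim, figure, subset


if __name__ == "__main__":
    for path in main("data/raw/winemag-data-130k-v2.csv"):
        print(path)
```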
😕 Such a shame we still cannot guarantee the results are correct... or that there are no bugs.
The next step is to include tests... in fact, testing should be a core part of our development process. All of our reproducible workflows are analogous to experimental design in the scientific world.
There are various approaches to test software:
Remember when you tried to run
It would not work unless you had created a figures directory beforehand.
We can catch these kinds of errors by adding this piece of code:
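A minimal sketch of such a guard, assuming the plots are written to a figures directory. The helper name ensure_dir is made up for this example, and a temporary directory stands in for the project root so the snippet runs anywhere.

```python
import tempfile
from pathlib import Path


def ensure_dir(path):
    """Create the directory (and any missing parents) if it does not exist."""
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    return directory


with tempfile.TemporaryDirectory() as project_root:
    figures_dir = ensure_dir(Path(project_root) / "figures")
    created = figures_dir.is_dir()
    ensure_dir(figures_dir)  # calling it again is harmless
```

Calling the helper right before saving a figure means the script no longer depends on the user having created the directory by hand.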
runall script should work!!! 🎉🎉
$ python -m src.runall-wine-analysis
03_country-subset.py and add the following function:
Now we can create our tests:
$ mkdir tests                              # Create tests directory
$ touch tests/__init__.py                  # Help find the test
$ touch tests/test_03_country_subset.py    # Create our first test
⭐ Your test script names must start with:
We tested each of the functions in our module.
Notice something in the functions we just wrote?
mean_price = country.get_mean(interim_data)
assert mean_price == 20.786
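Comparing floating point results with == can fail because of tiny rounding differences, so a tolerance is usually safer. The sketch below shows the idea with a hypothetical get_mean_price helper and made-up prices; it is not the function from 03_country-subset.py.

```python
import math


def get_mean_price(prices):
    """Hypothetical helper: mean of a list of prices."""
    if not prices:
        raise ValueError("need at least one price")
    return sum(prices) / len(prices)


def test_get_mean_price():
    mean_price = get_mean_price([10.0, 20.5, 31.858])
    # Accept any value within 0.001 of the expected mean.
    assert math.isclose(mean_price, 20.786, abs_tol=1e-3)


def test_get_mean_price_rejects_empty_input():
    try:
        get_mean_price([])
    except ValueError:
        return
    raise AssertionError("expected a ValueError for an empty list")


if __name__ == "__main__":
    test_get_mean_price()
    test_get_mean_price_rejects_empty_input()
```

pytest collects functions whose names start with test_, so the same file works both under pytest and when run directly.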
Now don't forget to commit your code:
$ git add .
$ git commit -m "Add unit test suite"
Regression tests assume that the past is “correct.” They are great for letting developers know when and how a code base has changed. They are not great for letting anyone know why the change occurred. The change between what a code produces now and what it computed before is called a regression.
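The idea can be sketched in a few lines: store the result of a run as the baseline, then compare later runs against it. The function names summarise and check_regression and the sample prices are invented for this example.

```python
import json
import tempfile
from pathlib import Path


def summarise(prices):
    """Toy analysis whose output we want to keep stable."""
    return {"count": len(prices), "mean": round(sum(prices) / len(prices), 3)}


def check_regression(result, baseline_path):
    """Return True if result matches the stored baseline.

    On the first run there is no baseline yet, so the result is stored
    and assumed to be correct.
    """
    baseline_path = Path(baseline_path)
    if not baseline_path.exists():
        baseline_path.write_text(json.dumps(result, sort_keys=True))
        return True
    return json.loads(baseline_path.read_text()) == result


with tempfile.TemporaryDirectory() as tmp:
    baseline = Path(tmp) / "baseline.json"
    first = check_regression(summarise([10.0, 20.0, 30.0]), baseline)    # stores the baseline
    same = check_regression(summarise([10.0, 20.0, 30.0]), baseline)     # nothing changed
    changed = check_regression(summarise([10.0, 20.0, 36.0]), baseline)  # output differs

print(first, same, changed)
```

Note that the check only tells you that the output changed, not why, which is exactly the limitation described above.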
How many times have you tried to run a script or a notebook you found online just to realize it is broken?
Let's do some regression testing on the Jupyter notebook using nbval
We first need to understand how a Jupyter notebook works. All the data is stored in a JSON-like format (organised as key-value pairs)... this includes the results, code, and markdown.
Nbval checks the stored values while doing a mock run on the notebook and compares the saved version of the notebook vs the results obtained from the mock run
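To make this concrete, the sketch below builds a tiny hand-written notebook in the nbformat 4 layout, reads the stored output of each code cell, re-runs the cell, and compares the two. This is a deliberately simplified illustration of the idea and not how nbval is implemented (nbval runs the cells in a real Jupyter kernel); the helper names are made up.

```python
import contextlib
import io
import json

# A minimal notebook: one markdown cell and one code cell with saved output.
notebook_json = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Explore"]},
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["print(2 + 2)"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["4\n"]}
            ],
        },
    ],
})


def stored_stdout(cell):
    """Concatenate the stdout text saved in a code cell."""
    return "".join(
        "".join(output["text"])
        for output in cell["outputs"]
        if output["output_type"] == "stream" and output["name"] == "stdout"
    )


def rerun_stdout(cell):
    """Execute the cell source and capture what it prints."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec("".join(cell["source"]), {})
    return buffer.getvalue()


notebook = json.loads(notebook_json)
results = [
    stored_stdout(cell) == rerun_stdout(cell)
    for cell in notebook["cells"]
    if cell["cell_type"] == "code"
]
print(results)
```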
Try it on your shell
$ pytest --nbval src/data/00_explore-data.ipynb
What would happen if you were to have a cell like this one?
import time
print('This notebook was last run on: ' + time.strftime('%d/%m/%y') + ' at: ' + time.strftime('%H:%M:%S'))
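Because the printed timestamp differs on every run, the stored output can never match a fresh run, so nbval would flag this cell. nbval recognises marker comments inside a cell; the sketch below uses # NBVAL_IGNORE_OUTPUT, which still runs the cell but does not compare its output. The variable name stamp is only there to keep the line short.

```python
# NBVAL_IGNORE_OUTPUT
import time

stamp = time.strftime('%d/%m/%y') + ' at: ' + time.strftime('%H:%M:%S')
print('This notebook was last run on: ' + stamp)
```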
Note that this has to be the first import. Run the script again:
python -m src.runall-wine-analysis
Type those commands in your shell (or copy and paste). Use your own details.
Refresh your web browser and... ta dah! Your project is online
Now, instead of running our tests manually every time, we want them to run automatically whenever we push something from our local computer to our GitHub account.
Some of the advantages of doing this are:
language: python
python:
  - 3.6
before_install:
  # Here we download miniconda and create our conda environment
  - export MINICONDA=$HOME/miniconda
  - export PATH="$MINICONDA/bin:$PATH"
  - hash -r
  - wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O miniconda.sh
  - bash miniconda.sh -b -f -p $MINICONDA
  - conda config --set always_yes yes
  - conda update conda
  - conda info -a
  # Create the environment from a yml file
  # Is this familiar?
  - conda env create -f testenv.yml -v
  - source activate testenv
script:
  - pytest
  - pytest --nbval notebooks/00_explore-data.ipynb
Complex analyses depend on a number of packages and libraries native to your OS, packages the user installs, environment variables and so on. Manually maintaining these dependencies is a rather tedious task.
That is why we use package managers, such as Conda. Do you remember we started this workshop by installing all the packages we needed and one of the steps was to use an environment.yml file? This ensures we all have the same packages.
Then at the beginning of the course we activated our reproPython environment and have been using the packages installed in this environment.
name: testenv
channels:
  - conda-forge
  - defaults
dependencies:
  - python>3.6
  - pytest
  - pandas
  - matplotlib
  - jinja2
  - pip:
    - nbval
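It can also help to record which versions actually ended up in the environment. The sketch below queries the installed versions with importlib.metadata from the standard library (Python 3.8 or newer); the function name installed_versions is made up, and packages that are not installed are reported as None.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(["pytest", "pandas", "matplotlib", "jinja2", "nbval"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```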
#NBVAL_SKIP to the top of cells 3 and 4.
This will trigger an automatic test of your project
$ git add .
$ git commit -m "Add files for CI"
$ git push
# Add data for test
!data/interim/2018-05-09-winemag_priceGBP.*
!data/processed/2018-05-09-winemag_Chile.*
We will be redirected to GitHub to create a release of the repository.
After creating the release, go back to Zenodo and refresh the page. You should now see the newly created DOI 🤩
Click on the DOI and copy the markdown text, add this to your README.md and push to GitHub!
Save as CITATION.cff and commit and push to GitHub
cff-version: 1.0.3
message: If you use this software, please cite it as below.
authors:
  - family-names: Allard
    given-names: Tania
    orcid: https://orcid.org/0000-0003-4925-7248
title: My reproducible python workflow
version: 0.1
doi: 10.5281/zenodo.1241049
date-released: 2018-05-09