Many care and talk about it... but reproducible practices are not always part of the day-to-day tasks of ML, DS, and research in general
Although it is often impractical due to incomplete information and the existence of conceptual dependencies needed to understand a paper.
It avoids the "it runs on my machine" problem
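One small step towards avoiding that problem is to record the environment a result was produced in. A minimal sketch, using only the Python standard library; the function name `capture_environment` and the layout of the returned dictionary are illustrative assumptions, not a standard:

```python
import json
import platform
from importlib import metadata


def capture_environment():
    """Describe the interpreter, OS, and installed packages (hypothetical layout)."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        # Skip distributions with broken or missing metadata.
        if name and dist.version:
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


if __name__ == "__main__":
    # Print the snapshot so it can be saved alongside the results.
    print(json.dumps(capture_environment(), indent=2))
```

Saving this output next to your results gives others (and future you) a starting point for rebuilding the same setup. It only records versions; it does not recreate the environment by itself.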
"It is like agreeing that we will all drive on the left or the right. A hallmark of civilization is following conventions that constrain your behaviour a little, in the name of public safety"
Open package 'x'. Click, click, drag, click, click, right-click, save, 'results.csv'
Load into Excel. Click, drag, generate graph, right-click, save, 'beautifulgraph1.png'
The data was analysed using package 'x' using the 'y' analysis. The results are shown in Figure 1.
This way you would be able to figure out what code and data were used to generate which result.
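One way to keep that link between code, data, and results explicit is to write a small provenance file next to every result. A sketch under assumptions: the helper names (`sha256_of`, `write_provenance`) and the `.provenance.json` suffix are made up for illustration, and any workflow tool or version control system can play the same role:

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_provenance(result_path, code_path, data_paths):
    """Write '<result>.provenance.json' beside the result and return the record."""
    record = {
        "result": Path(result_path).name,
        "created_utc": datetime.now(timezone.utc).isoformat(),
        # Hashes pin the exact script and input files that were used.
        "code": {"file": str(code_path), "sha256": sha256_of(code_path)},
        "data": [{"file": str(p), "sha256": sha256_of(p)} for p in data_paths],
    }
    sidecar = Path(str(result_path) + ".provenance.json")
    sidecar.write_text(json.dumps(record, indent=2))
    return record
```

Calling `write_provenance("results.csv", "analysis.py", ["raw_data.csv"])` at the end of a script (file names here are placeholders) leaves a `results.csv.provenance.json` file that answers "which code and which data made this?" without relying on memory.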
If using version control, you can also refer to specific versions of your study (e.g. manuscript, first quarter report, Nobel Prize committee version)
Wouldn't it be great?
Better yet, if someone completely unfamiliar with your project could look at these files and understand what you did and why (e.g. readers, collaborators, your replacement, you in 6 months' time).
Speed scientific progress
Contribute to open source and our beloved community
Acquire more varied, highly valuable skills
Change the current academic culture ✨
And my all-time favourite... increase the bus factor
This also increases your chances of success
And please add metadata... data about data
When using others' data, make sure to cite it and add information on where others can access said data
Make sure the link works!!!
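The points above on metadata and citing data can be sketched as a small helper that builds a metadata record for a CSV file. This is an illustration only: the function name `describe_csv` and the field names are assumptions, not an established metadata standard, and the function does not check that the link works, so you still need to do that yourself:

```python
import csv
import json
from pathlib import Path


def describe_csv(csv_path, description, source_url, citation, column_notes):
    """Build a metadata record (hypothetical layout) for a CSV file."""
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        n_rows = sum(1 for _ in reader)  # data rows, header excluded
    # Refuse to produce metadata with undocumented columns.
    missing = [c for c in header if c not in column_notes]
    if missing:
        raise ValueError(f"No description given for columns: {missing}")
    return {
        "file": Path(csv_path).name,
        "description": description,
        "source_url": source_url,
        "citation": citation,
        "rows": n_rows,
        "columns": [{"name": c, "description": column_notes[c]} for c in header],
    }


def save_metadata(record, out_path):
    """Write the metadata record as JSON so it travels with the data."""
    Path(out_path).write_text(json.dumps(record, indent=2))
```

Forcing a description for every column is a deliberate choice here: it is a cheap nudge to document units and meanings while you still remember them.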
Find what works for you
What does this even mean?
It is also about the people and empowering them to make better science