Research Division Seminar
Big data, Big responsibility: reproducible, archivable and branchable pipelines
For 400 years we have studied the sky through the filter of a telescope, not only with our naked eyes. In the last few decades we have also come to "see" the sky through software: major processing of the raw hardware outputs is necessary before anything can be done with the data. Therefore, just as it is important for an astronomer to calibrate and control the optical elements and layers of a telescope and camera that the light passes through, we also need to calibrate and control the layers of software that the data passes through. In this talk, I will introduce a solution: a template (or standard) for managing a project's workflow or pipeline that can easily be adopted in any computational analysis. This workflow matured at the IAC and was awarded a Research Data Alliance (RDA) adoption grant. RDA is a forum of more than 9500 members that proposes best practices and policies in open data. The template is a complete set of instructions for a project: 1) defining all necessary inputs (data or software source code, validated by checksums); 2) building the high-level science software and its dependencies, all the way down to the C library and C compiler(!), making the project fully independent of the host operating system, with all components under control; 3) running the software on the data, i.e., doing the analysis (to any level of complexity); and 4) a narrative description of the project and its outputs in PDF (for example, figures in a paper, or quality checks in a reduction pipeline). This complete description of the project is version controlled in Git, which preserves its history as the project evolves and, more importantly, allows projects to branch from each other, evolve in parallel, and later merge to import infrastructure improvements. This makes it very useful in the design of reduction pipelines for astronomical instruments.
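To make step 1 (inputs validated by checksums) concrete, here is a minimal sketch in Python. It is only an illustration of the idea, not the template's own implementation: the choice of SHA-256 and the function names `sha256_of` and `verify_input` are assumptions made for this example.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at `path`.

    The file is read in chunks so large data files do not need to fit
    in memory. (SHA-256 is an assumption of this sketch.)
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_input(path, expected_sha256):
    """Raise ValueError unless the file matches the recorded checksum.

    `expected_sha256` is the digest recorded with the project; the
    comparison is case-insensitive. (Hypothetical helper.)
    """
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(
            f"checksum mismatch for {path}: "
            f"expected {expected_sha256}, got {actual}"
        )
    return True
```

In such a scheme, the expected checksum of every input would be recorded alongside the project's instructions, and the analysis would refuse to start if any downloaded file differs from what was originally used.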
In the case of scientific papers, the full project can easily be uploaded to arXiv with the paper (it is all plain text: only ~200 kB), enabling worldwide mirroring and preservation far beyond the original authors. For example, see the following two recent papers: arXiv:1909.11230 and arXiv:1911.01430 (note the Git commit at the end of each abstract). A workshop will be held at the IAC (March 30th to April 3rd, 2020) to help in adopting this template in your research; please join if you are interested.