Probably Overthinking It: The six stages of computational science

This is the second in a series of articles related to computational science and education. The first article is here.

The Six Stages of Computational Science

When I was in grad school, I collaborated with a research group working on computational fluid dynamics. They had accumulated a large, complex code base, and it was starting to show signs of strain. Parts of the system, written by students who had graduated, had become black magic: no one knew how they worked, and everyone was afraid to touch them. When new students joined the group, it took longer and longer for them to get oriented. And everyone was spending more time debugging than developing new features or generating results.

When I inspected the code, I found what you might expect: low readability, missing documentation, large functions with complex interfaces, poor organization, minimal error checking, and no automated tests. In the absence of version control, they had many versions of every file, scattered across several machines.

I'm not sure if anyone could have helped them, but I am sure I didn't. To be honest, my own coding practices were not much better than theirs, at the time.

The problem, as I see it now, is that we were caught in a transitional form of evolution: the nature of scientific computing was changing quickly; professional practice, and the skills of the practitioners, weren't keeping up.

To explain what I mean, I propose a series of stages describing practices for scientific computing.

Stage 1, Calculating: Mostly plugging numbers into formulas, using a computer as a glorified calculator.

Stage 2, Scripting: Short programs using built-in functions, mostly straight-line code, few user-defined functions.

Stage 3, Hacking: Longer programs with poor code quality, usually lacking documentation.

Stage 4, Coding: Good-quality code that is readable, demonstrably correct, and well documented.

Stage 5, Architecting: Code organized in functions, classes (maybe), and libraries with well-designed APIs.

Stage 6, Engineering: Code under version control, with automated tests, build automation, and configuration management.
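To make the difference between Stage 2 and Stage 4 concrete, here is a small sketch in Python. The Reynolds number calculation, the function name `reynolds_number`, and the sample values are my own illustration, not code from the project I described above.

```python
# Stage 2, Scripting: straight-line code with unexplained values.
rho = 1000.0
v = 2.0
d = 0.05
mu = 1.0e-3
print(rho * v * d / mu)


# Stage 4, Coding: the same calculation, named, documented, and checked.
def reynolds_number(density, velocity, length, viscosity):
    """Return the dimensionless Reynolds number, Re = rho * v * L / mu.

    density:   fluid density in kg/m^3
    velocity:  characteristic flow speed in m/s
    length:    characteristic length in m
    viscosity: dynamic viscosity in Pa*s (must be positive)
    """
    if viscosity <= 0:
        raise ValueError("viscosity must be positive")
    return density * velocity * length / viscosity
```

Both versions compute the same number, but only the second one tells a reviewer what the inputs mean and refuses an input that makes no physical sense.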

These stages are, very roughly, historical. In the earliest days of computational science, most projects were at Stages 1 and 2. In the last 10 years, more projects have been moving into Stages 4, 5, and 6. But that project I worked on in grad school was stuck at Stage 3.

The Valley of Unreliable Science

These stages trace a U-shaped curve of reliability: high at Stages 1 and 2, lowest at Stage 3, and rising again through Stages 4, 5, and 6.

By "reliable", I mean science that provides valid explanations, correct predictions, and designs that work.

At Stage 1, Calculating, the primary scientific result is usually analytic. The correctness of the result is demonstrated in the form of a proof, using math notation along with natural and technical language. Reviewers and future researchers are expected to review the proof, but no one checks the calculation. Fundamentally, Stage 1 is no different from pre-computational, analysis-based science; we should expect it to be as reliable as our ability to read and check proofs, and to press the right buttons on a calculator.

At Stage 2, Scripting, the primary result is still analytic, the supporting scripts are simple enough to be demonstrably correct, and the libraries they use are presumed to be correct.

But Stage 2 scripts are not always made available for review, making it hard to check their correctness or reproduce their results. Nevertheless, Stage 2 was considered acceptable practice for a long time; and in some fields, it still is.

Stage 3, Hacking, has the same hazards as Stage 2, but at a level that's no longer acceptable. Small, simple scripts tend to grow into large, complex programs. Often, they contain implementation details that are not documented anywhere, and there is no practical way to check their correctness.

Stage 3 is not reliable because it is not reproducible. Leek and Peng define reproducibility as "the ability to recompute data analytic results given an observed dataset and knowledge of the data analysis pipeline."

Reproducibility does not guarantee reliability, as Leek and Peng acknowledge in the title of their article, "Reproducible research can still be wrong". But without reproducibility as a requirement of published research, there is no way to be confident of its reliability.
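One practical consequence of that definition: if an analysis uses random numbers, the seed is part of the pipeline. As a minimal sketch (the function name, the data, and the bootstrap procedure are hypothetical, chosen only for illustration), fixing the seed lets anyone recompute the same interval from the same dataset.

```python
import random
import statistics


def bootstrap_mean_interval(data, n_resamples=1000, seed=0):
    """Return a rough 90% bootstrap interval for the mean of data.

    The random number generator is created from an explicit seed, so the
    same data and the same seed always produce the same interval.
    """
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(data, k=len(data)))
        for _ in range(n_resamples)
    )
    low = means[int(0.05 * n_resamples)]
    high = means[int(0.95 * n_resamples) - 1]
    return low, high


# Hypothetical measurements, used only to exercise the function.
measurements = [4.1, 3.9, 4.4, 4.0, 4.2, 3.8, 4.3]
first = bootstrap_mean_interval(measurements, seed=17)
second = bootstrap_mean_interval(measurements, seed=17)
print(first == second)
```

Getting the same answer twice says nothing about whether the answer is right, which is Leek and Peng's point; it only makes the answer checkable.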

Climbing out of the valley

Stages 4, 5, and 6 are the antidote to Stage 3. They describe what's needed to make computational science reproducible, and therefore more likely to be reliable.

At a minimum, reviewers of a publication and future researchers should be able to:

1) Download all data and software used to generate the results.

2) Run tests and review source code to verify correctness.

3) Run a build process to execute the computation.

To achieve these goals, we need the tools of software engineering:

1) Version control makes it possible to maintain an archived version of the code used to produce a particular result. Examples include Git and Subversion.

2) During development, automated tests make programs more likely to be correct; they also tend to improve code quality. During review, they provide evidence of correctness, and for future researchers they provide what is often the most useful form of documentation. Examples include unittest and nose for Python and JUnit for Java.

3) Automated build systems document the high-level structure of a computation: which programs process which data, what outputs they produce, etc. Examples include Make and Ant.

4) Configuration management tools document the details of the computational environment where the result was produced, including the programming languages, libraries, and system-level software the results depend on. Examples include package managers like Conda that document a set of packages, containers like Docker that also document system software, and virtual machines that actually contain the entire environment needed to run a computation.

These are the ropes and grappling hooks we need to climb out of the Valley of Unreliable Science.
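Even without a full configuration management tool, a program can record some of the environment it ran in. The following is a minimal sketch, assuming Python 3.8 or later; the function name `snapshot_environment` and the package list are hypothetical. It is not a substitute for Conda or Docker, only a way to save a few key versions next to a result.

```python
import json
import platform
from importlib import metadata


def snapshot_environment(packages):
    """Return a dict describing the interpreter, OS, and package versions.

    packages: names of installed distributions to look up. A package that
    is not installed is recorded as None rather than raising an error.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


# The package names here are examples; list whatever your results depend on.
print(json.dumps(snapshot_environment(["numpy", "scipy"]), indent=2))
```

Writing this output to a file alongside each set of results gives future researchers at least a starting point for rebuilding the environment.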

Unfortunately, most people working in computational science did not learn these tools in school, and they are not easy to learn. For example, Git, which has emerged as the dominant version control system, is notoriously hard to use. Even with GitHub and graphical clients, it's still hard. We have a lot of work to do to make these tools better.

Nevertheless, it is possible to learn basic use of these tools with a reasonable investment of time. Software Carpentry offers a three-hour workshop on Git and a 4.5-hour workshop on automated build systems. You could do both in a day (although I'm not sure I'd recommend it).

Implications for practitioners

There are two ways to avoid getting stuck in the Valley of Unreliable Science:

1) Navigate Through It: One common strategy is to start with simple scripts; if they grow and get too complex, you can improve code quality as needed, add tests and documentation, and put the code under version control when it is ready to be released.

2) Jump Over It: The alternative strategy is to maintain good-quality code, write documentation and tests along with the code (or before), and keep all code under version control.

Naively, it seems like Navigating is better for agility: when you start a new project, you can avoid the costs of over-engineering and test ideas quickly. If they fail, they fail fast; and if they succeed, you can add elements of Stages 4, 5, and 6 on demand.

Based on that thinking, I used to be a Navigator, but now I am a Jumper. Here's what changed my mind:

1) The dangers of over-engineering during the early stages of a project are overstated. If you are in the habit of creating a new repository for each project (or creating a directory in an existing repository), and you start with a template project that includes a testing framework, the initial investment is pretty minimal. It's like starting every program with a copy of "Hello, World".

2) The dangers of engineering too late are much greater: if you don't have tests, it's hard to refactor code; if you can't refactor, it's hard to maintain code quality; when code quality degrades, debugging time goes up; and if you don't have version control, you can't revert to a previous working (?) version.

3) Writing documentation saves time you would otherwise spend trying to understand code.

4) Writing tests saves time you would otherwise spend debugging.

5) Writing documentation and tests as you go along also improves software architecture, which makes code more reusable, and that saves time you (and other researchers) would otherwise spend reimplementing the wheel.

6) Version control makes collaboration more efficient. It provides a record of who changed what and when, which facilitates code and data integrity. It provides mechanisms for developing new code without breaking the old. And it provides a better form of file backup, organized in coherent changes, rather than by date.

Maybe surprisingly, using software engineering tools early in a project doesn't hurt agility; it actually facilitates it.
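To give a sense of how small the initial investment can be, here is a sketch of the kind of file a template project might start with, using unittest from the Python standard library. The `trapezoid` function and the test names are hypothetical examples, not part of any particular template.

```python
import math
import unittest


def trapezoid(f, a, b, n=1000):
    """Approximate the integral of f from a to b with n equal trapezoids."""
    if n < 1:
        raise ValueError("n must be at least 1")
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


class TestTrapezoid(unittest.TestCase):
    def test_linear_function_is_exact(self):
        # The trapezoid rule is exact for straight lines.
        self.assertAlmostEqual(trapezoid(lambda x: 2 * x, 0.0, 1.0, n=4), 1.0)

    def test_sine_matches_analytic_result(self):
        # The integral of sin(x) from 0 to pi is exactly 2.
        self.assertAlmostEqual(trapezoid(math.sin, 0.0, math.pi), 2.0, places=4)

    def test_rejects_nonpositive_n(self):
        with self.assertRaises(ValueError):
            trapezoid(math.sin, 0.0, 1.0, n=0)
```

If this is saved as, say, test_trapezoid.py, running `python -m unittest` in the same directory discovers and runs the tests. The tests also serve as documentation: they show how the function is called and what answers it should give.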

Implications for education

For computational scientists, I think it's better to jump over the Valley of Unreliable Science than try to navigate through it. So what does that imply for education? Should we teach the tools and practices of software engineering right from the beginning? Or do students have to spend time navigating the Valley before they learn to jump over it?

I'll address these questions in the next article.

Friday, February 23, 2018
