Friday, February 23, 2018

The six stages of computational science

This is the second in a series of articles related to computational science and education.  The first article is here.

The Six Stages of Computational Science

When I was in grad school, I collaborated with a research group working on computational fluid dynamics.  They had accumulated a large, complex code base, and it was starting to show signs of strain.  Parts of the system, written by students who had graduated, had become black magic: no one knew how they worked, and everyone was afraid to touch them.  When new students joined the group, it took longer and longer for them to get oriented.  And everyone was spending more time debugging than developing new features or generating results.

When I inspected the code, I found what you might expect: low readability, missing documentation, large functions with complex interfaces, poor organization, minimal error checking, and no automated tests.  In the absence of version control, they had many versions of every file, scattered across several machines.

I'm not sure if anyone could have helped them, but I am sure I didn't.  To be honest, my own coding practices were not much better than theirs, at the time.

The problem, as I see it now, is that we were caught in a transitional form of evolution: the nature of scientific computing was changing quickly; professional practice, and the skills of the practitioners, weren't keeping up.

To explain what I mean, I propose a series of stages describing practices for scientific computing.
  • Stage 1, Calculating: Mostly plugging numbers into formulas, using a computer as a glorified calculator.
  • Stage 2, Scripting: Short programs using built-in functions; mostly straight-line code with few user-defined functions.
  • Stage 3, Hacking: Longer programs with poor code quality, usually lacking documentation.
  • Stage 4, Coding: Good-quality code that is readable, demonstrably correct, and well documented.
  • Stage 5, Architecting: Code organized into functions, classes (maybe), and libraries with well-designed APIs.
  • Stage 6, Engineering: Code under version control, with automated tests, build automation, and configuration management.
These stages are, very roughly, historical.  In the earliest days of computational science, most projects were at Stages 1 and 2.  In the last 10 years, more projects are moving into Stages 4, 5, and 6.  But that project I worked on in grad school was stuck at Stage 3.

The Valley of Unreliable Science

These stages trace a U-shaped curve of reliability.

By "reliable", I mean science that provides valid explanations, correct predictions, and designs that work.

At Stage 1, Calculating, the primary scientific result is usually analytic.  The correctness of the result is demonstrated in the form of a proof, using math notation along with natural and technical language.  Reviewers and future researchers are expected to review the proof, but no one checks the calculation.  Fundamentally, Stage 1 is no different from pre-computational, analysis-based science; we should expect it to be as reliable as our ability to read and check proofs, and to press the right buttons on a calculator.

At Stage 2, Scripting, the primary result is still analytic, the supporting scripts are simple enough to be demonstrably correct, and the libraries they use are presumed to be correct.

But Stage 2 scripts are not always made available for review, making it hard to check their correctness or reproduce their results.  Nevertheless, Stage 2 was considered acceptable practice for a long time; and in some fields, it still is.

Stage 3, Hacking, has the same hazards as Stage 2, but at a level that's no longer acceptable.  Small, simple scripts tend to grow into large, complex programs.  Often, they contain implementation details that are not documented anywhere, and there is no practical way to check their correctness.

Stage 3 is not reliable because it is not reproducible. Leek and Peng define reproducibility as "the ability to recompute data analytic results given an observed dataset and knowledge of the data analysis pipeline."

Reproducibility does not guarantee reliability, as Leek and Peng acknowledge in the title of their article, "Reproducible research can still be wrong". But without reproducibility as a requirement of published research, there is no way to be confident of its reliability.

Climbing out of the valley

Stages 4, 5, and 6 are the antidote to Stage 3.  They describe what's needed to make computational science reproducible, and therefore more likely to be reliable.

At a minimum, reviewers of a publication and future researchers should be able to:

1) Download all data and software used to generate the results.

2) Run tests and review source code to verify correctness.

3) Run a build process to execute the computation.

To achieve these goals, we need the tools of software engineering:

1) Version control makes it possible to maintain an archived version of the code used to produce a particular result.  Examples include Git and Subversion.

2) During development, automated tests make programs more likely to be correct; they also tend to improve code quality.  During review, they provide evidence of correctness, and for future researchers they provide what is often the most useful form of documentation.  Examples include unittest and nose for Python and JUnit for Java.

3) Automated build systems document the high-level structure of a computation: which programs process which data, what outputs they produce, etc.  Examples include Make and Ant.

4) Configuration management tools document the details of the computational environment where the result was produced, including the programming languages, libraries, and system-level software the results depend on.  Examples include package managers like Conda that document a set of packages, containers like Docker that also document system software, and virtual machines that actually contain the entire environment needed to run a computation.
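To make the second point concrete, here is a minimal sketch of what an automated test might look like, using Python's unittest module.  The function under test, zscore, is an invented example, not code from any particular project:

```python
# A minimal sketch of an automated test using Python's unittest module.
# The function under test (zscore) is a made-up example.
import unittest

def zscore(xs):
    """Standardize a sequence: subtract the mean, divide by the standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return [(x - mean) / var ** 0.5 for x in xs]

class TestZscore(unittest.TestCase):
    def test_mean_is_zero(self):
        # Standardized values should have mean zero.
        zs = zscore([1, 2, 3, 4, 5])
        self.assertAlmostEqual(sum(zs) / len(zs), 0)

    def test_symmetric_input(self):
        # For symmetric input, the standardized values are symmetric too.
        zs = zscore([10, 20, 30])
        self.assertAlmostEqual(zs[0], -zs[2])
```

Running `python -m unittest` in the directory containing this file discovers and runs the tests.  During development, tests like these catch regressions; during review, a passing suite is evidence that the code does what the tests claim; and for future researchers, the test cases document how the function is meant to be used.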

These are the ropes and grappling hooks we need to climb out of the Valley of Unreliable Science.

Unfortunately, most people working in computational science did not learn these tools in school, and they are not easy to learn.  For example, Git, which has emerged as the dominant version control system, is notoriously hard to use.  Even with GitHub and graphical clients, it's still hard.  We have a lot of work to do to make these tools better.

Nevertheless, it is possible to learn basic use of these tools with a reasonable investment of time.  Software Carpentry offers a three-hour workshop on Git and a 4.5-hour workshop on automated build systems.  You could do both in a day (although I'm not sure I'd recommend it).

Implications for practitioners

There are two ways to avoid getting stuck in the Valley of Unreliable Science:

1) Navigate Through It: One common strategy is to start with simple scripts; if they grow and get too complex, you can improve code quality as needed, add tests and documentation, and put the code under version control when it is ready to be released.

2) Jump Over It: The alternative strategy is to maintain good quality code, write documentation and tests along with the code (or before), and keep all code under version control.

Naively, it seems like Navigating is better for agility: when you start a new project, you can avoid the costs of over-engineering and test ideas quickly.  If they fail, they fail fast; and if they succeed, you can add elements of Stages 4, 5, and 6 on demand.

Based on that thinking, I used to be a Navigator, but now I am a Jumper.  Here's what changed my mind:

1) The dangers of over-engineering during the early stages of a project are overstated.  If you are in the habit of creating a new repository for each project (or creating a directory in an existing repository), and you start with a template project that includes a testing framework, the initial investment is pretty minimal.  It's like starting every program with a copy of "Hello, World".

2) The dangers of engineering too late are much greater: if you don't have tests, it's hard to refactor code; if you can't refactor, it's hard to maintain code quality; when code quality degrades, debugging time goes up; and if you don't have version control, you can't revert to a previous working (?) version.

3) Writing documentation saves time you would otherwise spend trying to understand code.

4) Writing tests saves time you would otherwise spend debugging.

5) Writing documentation and tests as you go along also improves software architecture, which makes code more reusable, and that saves time you (and other researchers) would otherwise spend reimplementing the wheel.

6) Version control makes collaboration more efficient.  It provides a record of who changed what and when, which facilitates code and data integrity.  It provides mechanisms for developing new code without breaking the old.  And it provides a better form of file backup, organized in coherent changes, rather than by date.

Maybe surprisingly, using software engineering tools early in a project doesn't hurt agility; it actually facilitates it.

Implications for education

For computational scientists, I think it's better to jump over the Valley of Unreliable Science than try to navigate through it.  So what does that imply for education?  Should we teach the tools and practices of software engineering right from the beginning?  Or do students have to spend time navigating the Valley before they learn to jump over it?

I'll address these questions in the next article.

Friday, February 16, 2018

Learning to program is getting harder

I have written several books that use Python to explain topics like Bayesian Statistics and Digital Signal Processing.  Along with the books, I provide code that readers can download from GitHub.  In order to work with this code, readers have to know some Python, but that's not enough.  They also need a computer with Python and its supporting libraries, they have to know how to download code from GitHub, and then they have to know how to run the code they downloaded.

And that's where a lot of readers get into trouble.

Some of them send me email.  They often express frustration, because they are trying to learn Python, or Bayesian Statistics, or Digital Signal Processing.  They are not interested in installing software, cloning repositories, or setting the Python search path!

I am very sympathetic to these reactions.  And in one sense, their frustration is completely justified:  it should not be as hard as it is to download a program and run it.

But sometimes their frustration is misdirected.  Sometimes they blame Python, and sometimes they blame me.  And that's not entirely fair.

Let me explain what I think the problems are, and then I'll suggest some solutions (or maybe just workarounds).

The fundamental problem is that the barrier between using a computer and programming a computer is getting higher.

When I got a Commodore 64 (in 1982, I think) this barrier was non-existent.  When you turned on the computer, it loaded and ran a software development environment (SDE).  In order to do anything, you had to type at least one line of code, even if all it did was load another program (like Archon).

Since then, three changes have made it incrementally harder for users to become programmers:

1) Computer retailers stopped installing development environments by default.  As a result, anyone learning to program has to start by installing an SDE -- and that's a bigger barrier than you might expect.  Many users have never installed anything, don't know how to, or might not be allowed to.  Installing software is easier now than it used to be, but it is still error prone and can be frustrating.  If someone just wants to learn to program, they shouldn't have to learn system administration first.

2) User interfaces shifted from command-line interfaces (CLIs) to graphical user interfaces (GUIs).  GUIs are generally easier to use, but they hide information from users about what's really happening.  When users really don't need to know, hiding information can be a good thing.  The problem is that GUIs hide a lot of information programmers need to know.  So when a user decides to become a programmer, they are suddenly confronted with all the information that's been hidden from them.  If someone just wants to learn to program, they shouldn't have to learn operating system concepts first.

3) Cloud computing has taken information hiding to a whole new level.  People using web applications often have only a vague idea of where their data is stored and what applications they can use to access it.  Many users, especially on mobile devices, don't distinguish between operating systems, applications, web browsers, and web applications.  When they upload and download data, they are often confused about where it is coming from and where it is going.  When they install something, they are often confused about what is being installed where.

For someone who grew up with a Commodore 64, learning to program was hard enough.  For someone growing up with a cloud-connected mobile device, it is much harder.

Well, what can we do about that?  Here are a few options (which I have given clever names):

1) Back to the future: One option is to create computers, like my Commodore 64, that break down the barrier between using and programming a computer.  Part of the motivation for the Raspberry Pi, according to Eben Upton, is to re-create the kind of environment that turns users into programmers.

2) Face the pain: Another option is to teach students how to set up and use a software development environment before they start programming (or at the same time).

3) Delay the pain: A third option is to use cloud resources to let students start programming right away, and postpone creating their own environments.

In one of my classes, we face the pain; students learn to use the UNIX command line interface at the same time they are learning C.  But the students in that class already know how to program, and they have live instructors to help out.

For beginners, and especially for people working on their own, I recommend delaying the pain.  Here are some of the tools I have used:

1) Interactive tutorials that run code in a browser, like this adaptation of How To Think Like a Computer Scientist;

2) Entire development environments that run in a browser, like PythonAnywhere;

3) Virtual machines that contain complete development environments, which users can download and run (provided that they have, or can install, the software that runs the virtual machine); and

4) Services like Binder that run development environments on remote servers, allowing users to connect using browsers.

On various projects of mine, I have used all of these tools.  In addition to the interactive version of "How To Think...", there is also this interactive version of Think Java, adapted and hosted by Trinket.

In Think Python, I encourage readers to use PythonAnywhere for at least the first four chapters, and then I provide instructions for making the transition to a local installation.

I have used virtual machines for some of my classes in the past, but recently I have used more online services, like this notebook from Think DSP, hosted by O'Reilly Media.  And the repositories for all of my books are set up to run under Binder.

These options help people get started, but they have limitations.  Sooner or later, students will want or need to install a development environment on their own computers.  But if we separate learning to program from learning to install software, their chances of success are higher.

UPDATE: Nick Coghlan suggests a fourth option, which I might call Embrace the Future: Maybe beginners can start with cloud-based development environments, and stay there.

UPDATE: Thank you for all the great comments!  My general policy is that I will publish a comment if it is on topic, coherent, and civil.  I might not publish a comment if it seems too much like an ad for a product or service.  If you submitted a comment and I did not publish it, please consider submitting a revision.  I really appreciate the wide range of opinion in the comments so far.

Thursday, February 8, 2018

Build your own SOTU

In the New York Times on Tuesday, John McWhorter argues that Donald Trump's characteristic speech patterns are not, as some have suggested, evidence of mental decline.  Rather, the quality of Trump's public speech has declined because, according to McWhorter:

1) "The younger Mr. Trump [...] had a businessman’s normal inclination to present himself in as polished a manner as possible in public settings", and 

2)  The older Trump has "settled into his normal" because as president, he "has no impetus to speak in a way unnatural to him in public".

It's an interesting article, and I encourage you to read it before I start getting silly about it.

I would like to suggest an alternative interpretation, which is that the older Trump's speech sounds as it does because it is being generated by a Markov chain.

A Markov chain is a random process that generates a sequence of tokens; in this case, the tokens are words.  I explain the details below, but first I want to show some results.  Compare these two paragraphs:
"You know, if you're a conservative Republican, if I were a liberal, if, like, okay, if I ran as a liberal Democrat, they would say I'm one of the smartest people anywhere in the world – it’s true! – but when you're a conservative Republican they try – oh, they do a number"
"I mean—think of this—I hate to say it but it’s the same wall that we’re always talking about. It’s—you know, wherever we need, we don’t make a good chance to make a deal on DACA, I really have gotten to like. And I know it’s a hoax."
One of those paragraphs was generated by a Markov chain I trained with the unedited transcript from this recent interview with the Wall Street Journal.  The other was generated by Donald Trump.  Can you tell which is which?

Ok, let's make it a little harder.  Here are ten examples: some are from Trump, some are from Markov.  See if you can tell which are which.

1) I would have said it’s all in the messenger; fellas, and it is fellas because, you know, they don’t, they haven’t figured that the women are smarter right now than the men, so, you know, it’s gonna take them about another 150 years — but the Persians are great negotiators, the Iranians are great negotiators, so, and they, they just killed, they just killed us.

2) And we have sort of interesting, but when people make misstatements somebody has some, you know, I went through some that weren’t so elegant. But all I’m asking is one thing, you know Obama felt—President Obama felt it was his biggest problem is going to be Dreamers also. But there’s a big difference—first of all, there’s a big problem, and they were only going to be solved. 

3) One of the promises that you know is being very seriously negotiated right now is the wall and the wall will happen. And if you look—point, after point, after point—now we’ve had some turns. You always have to have flexibility. 

4) Yeah, Rex and I think we’ll have something on that. We’ll find out. But people do leave. You guys may leave but I don’t know of one politician in Washington—if you’re a politician and somebody called up that they have phony sources, when the sources don’t exist, yeah I think would be frankly a positive for our country made wealthy.

5) They have an election coming up fairly shortly, and I understand that that makes it a little bit difficult for them, and I’m not looking to make the other side—so we’ll either make a deal or—there’s no rush, but I will say that if we don’t make a fair deal for this country, a Trump deal, then we’re not going to have—then we’re going to have a—I will terminate.

6) You’re here, you’ve got the wall is the same wall I’ve always talked about. I think we have companies pouring back into this country and you don’t know who’s there, you’ve got the wall will happen.  We have a very old report. Business, generally, manufacturing the same wall that we’re talking about or whatever it may be.

7) And they endorsed us unanimously. I had meetings with them, they need see-through. So, we need a form of fence or window. I said why you need that—makes so much sense? They said because we have to see who’s on the other side.

8) Well they will make sure that no country including Russia can have anything to do with my win. Hope, just out of the most elegant debate—I thought it was a dead meeting. No, I never forget, when I fired, all these people, they all wanted him fired until I said, ‘We got to get worse'. 

9) The governor of Wisconsin has been fantastic in their presentations and everything else. But I’m the one who got them to look at it. Now we need people because they’re going to have thousands of people working it’s going to be a—you know—that’s—that’s the company that makes the Apple iPhone.

10) So, they make up a television show. As you know, I went to the—I went to the—I went to the—I went to the employees—to millions and millions of employees. And AT&T started it, but I will terminate Nafta. OK? You know, we only have a thing called trade. 

The first person to submit correct answers will be sequestered in a sensory deprivation tank until January 20, 2021.

Here's the Jupyter notebook I used to generate the examples.  If you want to know more about how it works, see this section of Think Python, second edition.
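For readers who don't want to open the notebook, here is a minimal sketch of a word-level Markov generator along these lines.  The order-2 prefix length and the function names are my own choices for this sketch; the notebook may differ in its details.

```python
# A minimal sketch of a word-level Markov chain text generator.
# Order-2 prefixes: each pair of consecutive words maps to the list of
# words that followed that pair in the training text.
import random
from collections import defaultdict

def build_chain(text, order=2):
    """Map each prefix (a tuple of `order` words) to its possible successors."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        prefix = tuple(words[i:i + order])
        chain[prefix].append(words[i + order])
    return chain

def generate(chain, n=30):
    """Generate up to n words by repeatedly sampling a successor of the current prefix."""
    prefix = random.choice(list(chain))
    output = list(prefix)
    for _ in range(n):
        suffixes = chain.get(prefix)
        if not suffixes:
            # Dead end: this prefix never appeared as one, so restart.
            prefix = random.choice(list(chain))
            continue
        word = random.choice(suffixes)
        output.append(word)
        prefix = prefix[1:] + (word,)
    return ' '.join(output)
```

Train it on a transcript and each generated word is chosen at random from the words that actually followed the previous two in the source.  The result is locally plausible but globally incoherent, which is exactly what makes the quiz above hard.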