xkcd and I are in sync. In case you live in a cave, xkcd is a webcomic "of romance, sarcasm, math, and language," by Randall Munroe. Several times in the last year, the topics that appear in xkcd have been strangely synchronized with topics I am covering in class.
It happened again this week: I was planning to write something about survival analysis and yoga when I read this:
The figure in the first frame is a survival curve. The solid line is the survival function, S(t), which is the percentage of people who survive at least t years. The dotted line is the hazard function, which is proportional to the fraction of deaths that occur during year t. In this figure the death rate is low immediately after diagnosis, highest after 5 years, then decreasing.
The numbers in the second frame are survival rates: 81% survive at least 5 years and 77% survive at least 10.
To explore this topic, and to demonstrate computational techniques in survival analysis, I downloaded data from the Surveillance, Epidemiology and End Results (SEER) Program run by the National Cancer Institute, which is one of the U.S. National Institutes of Health.
SEER collects "data on patient demographics, primary tumor site, tumor morphology and stage at diagnosis, first course of treatment, and follow-up for vital status," currently covering "approximately 28 percent of the US population."
I will use data from the SEER 9 registry, which covers Atlanta, Connecticut, Detroit, Hawaii, Iowa, New Mexico, San Francisco-Oakland, Seattle-Puget Sound, and Utah. This dataset includes cases diagnosed from 1973 through 2007.
As usual I wrote Python programs to read and parse the datasets. For my analysis, I select cases of breast cancer where the behavior of the tumor was classified as malignant, where the patient had only one primary tumor, and where the case has been followed up. There are 370263 records in this subset.
I consider all causes of death, not just deaths due to cancer. This kind of analysis is most useful for prognosis, but because it includes other causes of death, it depends on patient characteristics like age, health, income, etc. Also, survival depends strongly on the stage of the cancer. The best way to deal with that problem is to generate curves that control for these characteristics. For this example, I limit the analysis to 19168 patients who were in their 30s at the time of diagnosis, but I don't control for other factors.
To estimate survival curves, I use the Kaplan-Meier product-limit method, which is sufficiently obvious that I re-invented it before I knew what it was called. The following figure shows the survival and hazard curves for this cohort:
These curves are similar to the ones in the cartoon figure, but the 5 and 10 year survival rates are lower, about 75% and 64%. However, there are a few things to keep in mind:
1) You should not use these curves to make a prognosis for any patient. This curve includes patients with different stages of cancer and other health factors. Also...
2) For purposes of demonstration, I am keeping the analysis simple. People who do this for a living do it more carefully. And finally...
3) There is a systematic source of bias in this figure: the longest survival times are necessarily based on patients with earlier dates of diagnosis. In this figure, the leftmost data point is based on a cohort with a mean date of diagnosis in 1992. The rightmost data point is based on a cohort centered around 1982. Since patient prognosis improves over time, the entire curve is too pessimistic for current patients, and it gets more pessimistic from left to right.
To see how prognoses are changing over time, I partitioned the dataset by date of diagnosis. The following figure shows the results:
So prognoses have been improving consistently. Part of this effect is due to improvements in treatment; part is due to earlier diagnosis. To distinguish between these causes, we could control for stage of cancer, but I have not done that analysis.
Another reason for optimism is that survival rates generally increase after diagnosis; that is, the longer a patient survives, the longer they are likely to survive. The following figure shows 5 and 10 year survival rates conditioned on t; that is, Pr(T > t+5 | T > t) and Pr(T > t+10 | T > t):
At the time of diagnosis, the 5-year survival rate is 75%. But for patients who survive 5 years, the chance of surviving 10 years is 85%, and if you survive 10 years, the chance of surviving 15 is 92%.
In manufacturing, this property is called UBNE, which stands for "used better than new in expectation:" a used part is expected to last longer than a new part, because it has demonstrated its longevity.
Newborn babies, cancer patients and novice motorcyclists have the UBNE property; the longer they live, the longer they are expected to live. So that's a reason to be optimistic.
Note: Sadly, the xkcd strip might be autobiographical. I don't know details, but Munroe has indicated that he is dealing with an illness in his family, and several recent strips have been on medical topics. If my inference is right, I want to extend my best wishes to Mr. Munroe and his family.
Data source: Surveillance, Epidemiology, and End Results (SEER) Program (www.seer.cancer.gov) Research Data (1973-2007), National Cancer Institute, DCCPS, Surveillance Research Program, Cancer Statistics Branch, released April 2010, based on the November 2009 submission.
Update July 31, 2011: Here's an update from xkcd, including a first-person view of a survival curve: