What I’m hoping for with these blogs is to stimulate debate about some of the things that flit across my mind. Last month’s missive about the impact of sloppy science did this, and I think we should continue a bit longer on this topic.
Last month, I wrote about sloppy science because I heard someone from the pharmaceutical industry say that when they try to repeat findings from the scientific literature, they often find that they are not reproducible.
Inadequate basic controls
The first response I got was an email from Dylan Taatjes, PhD, a cancer center member from CU Boulder, who pointed out that even the most basic controls often don’t get done adequately.
For example: People do experiments with antibodies that have not been validated as being specific for what they think they are targeting. Dylan thinks that this is a particularly troubling problem, and I’m sure he’s right. If you are going to spend a lot of time, effort and money on something like a ChIP-seq experiment but don’t know that the key reagent you are using is suitable, then you have to ask what on earth you were thinking (or whether you were thinking at all).
However, maybe sloppiness like this isn’t the only problem.
Randomness in statistics
I got another message from Anna Barón, PhD. Anna is director of the CU Cancer Center Biostatistics and Bioinformatics Shared Resource and she wanted me to consider the role of statistics, specifically randomness, in the lack of reproducibility that’s so often seen in our field.
Now, my statistics education was—well, let’s just say it wasn’t quite as optimal as it might have been. And, it was a long time ago, right around the time I was learning about metabolic pathways, and metabolic pathways led me to go to the pub a lot, which in turn meant that I was often nursing a hangover in stats class.
Even though I didn’t learn enough statistics when I was a boy, that’s really not an excuse for not using them properly. Why? Well, because the famous quote in my title (reputed to have originated from the British prime minister Benjamin Disraeli) isn’t really true.
Statistics isn’t just a third and most effective kind of lie. Statistics is the way we determine whether the effect we are seeing (or not seeing) is likely to be real or just a chance occurrence. Pretty important, right?
P values and statistical significance
One of my favorite questions for students in a PhD Comprehensive Exam comes when they let slip the phrase “this result is statistically significant.” (Students reading this should feel free to prepare yourselves—this question has beaten out my previous champion “How does an SDS gel work?” and even the one about siRNA controls.)
When I hear “statistically significant,” I interrupt and ask what they mean. Usually, the answer is “p value of less than 0.05.” So I ask, “What’s a p value and how did you get it?” Squirming often ensues at this point. The faculty on the exam committee sit up straight, because, like sharks smelling blood in the water, they sense that now things might be getting interesting. The PhD advisor begins closely examining that hangnail…
Frequently it turns out that the student doesn’t know how they got the p value, other than from “a statistical test.” Sometimes they don’t even know which test it was, which means that the obvious follow-up question of why they chose that particular test becomes moot.
So, after the silence has become REALLY painful, I try an easier question: “What’s special about 0.05?”
The answer: Nothing.
The number is an arbitrary significance level that derives from the famous statistician R.A. Fisher, who introduced the idea of significance testing in the first half of the last century. Fisher wasn’t proposing that a p value of 0.05 is a magical threshold when suddenly your conclusion becomes true. Instead he wrote: “If p is between 0.1 and 0.9 there is no good reason to suspect the hypothesis being tested … We shall not often be led astray if we draw a conventional line at 0.05.”
Note, Fisher said, “We shall not often be led astray…”
Not, “Moses brought this down on a big lump of rock off Mt. Sinai, and you’ve got to follow it.”
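For the squirming students: a p value is just the probability, if there were no real effect, of seeing a difference at least as big as yours by chance alone. Here is a minimal sketch of that idea using a permutation test, in which we shuffle the group labels and count how often chance beats the observed difference. Every number below is invented purely for illustration.

```python
# A toy permutation test: what a p value actually measures.
# All measurements here are made up for illustration.
import random

random.seed(1)

treated = [12.1, 9.8, 10.5, 8.9, 9.4, 10.0]   # hypothetical treated cells
control = [11.8, 12.6, 13.1, 12.0, 11.2, 12.9]  # hypothetical controls

observed = abs(sum(treated) / len(treated) - sum(control) / len(control))

# If the treatment had no effect, the group labels would be arbitrary.
# So shuffle the labels many times and ask: how often does chance alone
# produce a difference at least as large as the one we observed?
pooled = treated + control
n = len(treated)
reps = 10_000
extreme = 0
for _ in range(reps):
    random.shuffle(pooled)
    diff = abs(sum(pooled[:n]) / n - sum(pooled[n:]) / n)
    if diff >= observed:
        extreme += 1

p_value = extreme / reps
print(f"observed difference = {observed:.2f}, p = {p_value:.4f}")
```

Notice that no magic threshold appears anywhere in the calculation; 0.05 only enters when a human decides what to do with the number.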
Let’s imagine you are studying cancer cell growth after treating tumor cells with two drugs, and your calculated p values for the treated cells being different from the untreated controls are 0.055 for drug 1 and 0.049 for drug 2.
Did something special happen here with the second drug?
For too many people, this step across the “0.05 boundary” leads them to wrap up their study, write the paper and sit back to bask in the glory of their discovery that drug number two is the way to treat the tumor.
But in fact, what you have here, all else being equal (like sample size and variability), are two drugs that are almost equally likely (or unlikely) to affect the growth of the tumor cells. And if you are going to bet that drug two is the better one, it is quite possible that further studies will prove you wrong.
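If you doubt that, here is a little simulation (invented parameters, not real drug data): run the identical experiment 200 times with one fixed, modest true effect, and the p value hops back and forth across 0.05 purely by chance.

```python
# Same true drug effect every time; only the random sampling changes.
# Parameters are arbitrary, chosen so that power is roughly 50% at n = 20.
import math
import random

random.seed(42)

def two_sided_p(effect, n, sd=1.0):
    """Simulate treated vs. control groups and return a z-test p value."""
    treated = [random.gauss(effect, sd) for _ in range(n)]
    control = [random.gauss(0.0, sd) for _ in range(n)]
    diff = sum(treated) / n - sum(control) / n
    se = sd * math.sqrt(2.0 / n)   # known-variance z test, for simplicity
    z = diff / se
    return math.erfc(abs(z) / math.sqrt(2.0))

p_values = [two_sided_p(effect=0.62, n=20) for _ in range(200)]

below = sum(p < 0.05 for p in p_values)
print(f"{below}/200 runs 'significant', {200 - below}/200 'not significant'")
```

Roughly half the runs land on each side of 0.05, even though the drug’s true effect never changed. That is all that happened with drug 1 versus drug 2.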
Drop “significant” and get specific about p values
There is an easy way to make this kind of situation less problematic: Don’t talk about “significant” and “non-significant” results. Instead, tell us what the actual p value is. And use the p value as just one aspect of how you draw conclusions, in the context of the other evidence you have accumulated. That is, do not think that a p value smaller than 0.05 by itself ensures the validity of your conclusion. It doesn’t.
Does Dr. Ioannidis speak the truth?
Anna’s comments pointed me towards some interesting papers too. The most arresting title on my reading list was “Why Most Published Research Findings Are False,” published in PLoS Medicine by John Ioannidis in 2005. This is well worth a read (how could you not read something with that title?). To give you a flavor, Dr. Ioannidis lists several interesting corollaries about the probability that a research finding is actually true.
Now, “actual truth” is a difficult concept for some people. The word “actual” means “existing in fact.” For those of you following current affairs, you will know that “actual truth” doesn’t necessarily determine how we run things like, say, the United States of America. However, when we are tackling disease, there isn’t any way forward except to uncover “actual truth” and then use this to determine our response. So, unlike some politicians (feel free to guess who I’m thinking about), as cancer researchers we have to reside in the reality-based community.
But, I digress.
Dr. Ioannidis’s corollaries about the factors that make research findings less likely to be true make for a great list; they include:
The smaller the study, the less likely the research findings are to be true.
Consider one of my favorite experiments, where you take a group of three mice with cancer, treat them with your favorite drug (perhaps the one with the p value of 0.049) and find that one third of the mice get cured, one third show no difference in their tumor, but the other mouse can’t be analyzed because he escaped. This is a useless experiment, and I don’t blame the third mouse for not wanting to be a part of it.
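A quick toy simulation (all numbers invented) shows why tiny studies mislead. Give a drug that does absolutely nothing to groups of 3 mice or of 100 mice, and watch how wildly the estimated “cure rate” swings in the small groups:

```python
# The drug has NO real effect: treated and untreated mice are cured
# at the same background rate. Numbers are invented for illustration.
import random
import statistics

random.seed(7)

TRUE_CURE_RATE = 0.3   # same with or without the drug

def observed_rate(n):
    """Fraction of mice 'cured' in one experiment with n mice."""
    return sum(random.random() < TRUE_CURE_RATE for _ in range(n)) / n

small = [observed_rate(3) for _ in range(1000)]    # three-mouse studies
large = [observed_rate(100) for _ in range(1000)]  # hundred-mouse studies

print(f"3 mice:   estimates spread with sd = {statistics.stdev(small):.3f}")
print(f"100 mice: estimates spread with sd = {statistics.stdev(large):.3f}")
```

With three mice, a useless drug will “cure” two thirds of the group, or none of it, alarmingly often; with a hundred mice, the estimate settles down near the truth.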
The smaller the effect size, the less likely the research findings are to be true.
If you are chasing after a really small effect, it is more likely that any differences you see will turn out not to be real. And even if your difference is “actually true,” a real but small effect may not be important. Consider the case of a real but small increase in tumor cell killing when we add a second drug to your favorite.
Let’s say we increase the number of tumor cells that are killed from 60 percent to 70 percent at a given dose of drug 1 when we add a second drug that, on its own, has no effect. Now, if this is a real effect, it may point to an interesting interaction between the biological pathways that are targeted by the two drugs. However, even if that’s true, killing an additional 10 percent of the tumor cells is not likely to make a real difference in cancer treatment.
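To put rough numbers on the first point, here is a hedged sketch (a binomial simulation with made-up kill rates and sample sizes): at the same sample size, a 60-versus-70 percent difference is detected far less often than a 60-versus-90 percent one, so a study chasing the small effect is much more likely to report noise.

```python
# How often does a two-proportion z test at p < 0.05 "detect" a given
# true difference in kill rates? All rates and sizes are invented.
import math
import random

random.seed(3)

def detected(rate_a, rate_b, n):
    """One experiment: n cells per arm, two-proportion z test at p < 0.05."""
    a = sum(random.random() < rate_a for _ in range(n))
    b = sum(random.random() < rate_b for _ in range(n))
    pa, pb = a / n, b / n
    pooled = (a + b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return False
    z = abs(pa - pb) / se
    return math.erfc(z / math.sqrt(2.0)) < 0.05

reps = 500
power_small = sum(detected(0.60, 0.70, n=50) for _ in range(reps)) / reps
power_big = sum(detected(0.60, 0.90, n=50) for _ in range(reps)) / reps
print(f"power for 60% vs 70%: {power_small:.2f}")
print(f"power for 60% vs 90%: {power_big:.2f}")
```

When most experiments on a true small effect come back negative, the few “positives” that do get published are disproportionately flukes and overestimates, which is exactly Dr. Ioannidis’s point.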
Dr. Ioannidis offers other related corollaries too. My favorite was this:
The hotter a scientific field (with more scientific teams involved), the less likely the research findings are to be true.
The argument is that a hot field attracts lots of teams, all racing to beat the competition into print in the highest-impact journals. The teams tend to prioritize their most impressive “positive” results and ignore findings that argue the other way. This selective reporting, which is more common in a hot field, makes it more likely that you will believe and report a result that fits the prevailing view of the field, whether or not it is really right.
I’m still thinking about this issue, so expect more on this idea at a later time. It has some interesting implications, and has got me reading about Bayesian statistics.
Learn more about statistics, or use our shared resource
So, can we minimize these problems? I’m going to leave you with just one possibility to consider for now.
Maybe things would not be so bad if the people doing the experiments had a better grasp of statistics. But don’t worry: if you are like me, with an inadequate statistics education that took place a long time ago when you were mostly hung over, we have a solution.
It’s called the CU Cancer Center Biostatistics and Bioinformatics Shared Resource. Go talk with them about your experiment and let them help you work out whether your data show something that is “actually true.” But please, talk to them before you do the experiment. That way, they can also help you design the experiment better, so you are more likely to come to a conclusion that will turn out to be true in the reality-based community.
Oh, and remember: if you have bad data, your conclusions can’t be trusted no matter how good your statistics are. So test your antibody before you start the experiment, too.