4-stars

Rigor Mortis: How Sloppy Science Creates Worthless Cures, Crushes Hope, and Wastes Billions

Max Nova

Apr 23, 2017

"Rigor Mortis" explores how perverse incentives and a "broken" scientific culture are fueling the reproducibility crisis in modern biomedical research. Published in April, Harris's jeremiad was a perfect fit for my 2017 reading theme on "The Integrity of Western Science" and an excellent companion to Goldacre's "Bad Science".

Harris argues that competition for funding, politicized peer review, and pressure to be the first to publish have enabled sloppiness and dubious practices like "p-hacking" to flourish. He investigates how these incentives have eroded transparency in research and damaged the self-correcting nature of science. As one of his interviewees points out, "it’s unfortunately in nobody’s interest to call attention to errors or misconduct".

Diving into the technical details, Harris points out how cell line contamination, non-transferability of animal models to human systems, and statistical blundering (the batch effect!) often render vast swathes of biomedical research unreliable. He also explains how much bad science is driven by conflating "exploratory" and "confirmatory" research. You can't run an "exploratory" experiment, mine the results for correlations, and then publish it as a "confirmatory" result. Or rather, you can - and apparently it is done all the time - but it is bad science. This was a "big idea" for me and is one that I suspect the general public (and even most scientists) are not aware of.

One of the major questions I'm trying to answer in my year of reading about "The Integrity of Western Science" is "How much confidence do we need to have in the 'science' before we start making policy decisions based on it?" As another one of Harris's interviewees notes, "it takes years for a field to self-correct" - this complicates the situation even further. How long do we need to wait before we can have a high level of confidence that a particular finding is real? Harris offers no answers here - he just hammers home how difficult a question this actually is.

As he discusses the inability to impose standards of rigor and transparency on research, Harris briefly touches on how most biomedical science is funded (through the National Institutes of Health) and includes the odd and unexplained quote, “The NIH is terrified of offending institutions.” This strikes me as completely absurd. The NIH is funding the research. Don't they know the golden rule? "He who has the gold, rules." Why shouldn't the NIH mandate some easy wins like an experimental registry and a best-practices checklist for each publication? The institutions (read: universities) can either play ball or... find a more lenient funding source. I would love to read more about this.

As for concrete ideas for cleaning up science, Harris doesn't offer many new ones. Three caught my eye though. The first is that researchers should report their level of confidence in the results of every study they publish. The second is the idea of a one-time scientific "Jubilee" where researchers can retract any of their old research with no penalty. It's tough to see how these first two ideas could work culturally, but they're intriguing nonetheless. The final idea probably has the best chance of working because it has worked before. Harris discusses the 1975 Asilomar conference where leading geneticists developed guidelines to ensure the safety of recombinant DNA research. This seems to be an instance of relatively successful self-regulation of the scientific community, and it's nice to think there might be some hope of the community internally cleaning up its act. I'm not holding my breath though.

One final side effect of reading this book was being introduced to the institutions leading the effort to clean up science.

Overall, Harris's book is a great summary of current issues in biomedical research and it comes at a time when the general public is finally becoming aware of some of these problems.

My top highlights are below.

PREFACE

And lately I’ve come to realize that the reason medical research is so painfully slow isn’t simply because it’s hard — which indeed it is. It also turns out that scientists have been taking shortcuts around the methods they are supposed to use to avoid fooling themselves. The consequences are now haunting biomedical research. Simply too much of what’s published is wrong. It doesn’t have to be that way.

American taxpayers contribute more than $30 billion a year to fund the National Institutes of Health. Add in other sources, like the price of research that’s baked into the pills we swallow and the medical treatments we receive, and the average American household spends $900 a year to support biomedical studies.

In fact, of the 7,000 known diseases, only about 500 have treatments, many offering just marginal benefits.

Scientists often face a stark choice: they can do what’s best for medical advancement by adhering to the rigorous standards of science, or they can do what they perceive is necessary to maintain a career in the hypercompetitive environment of academic research. It’s a choice nobody should have to make.

The challenge now isn’t identifying the technical fixes. The much harder challenge is changing the culture and the structure of biomedicine so that scientists don’t have to choose between doing it right and keeping their labs and careers afloat.

Chapter One - BEGLEY’S BOMBSHELL

Each year about a million biomedical studies are published in the scientific literature. And many of them are simply wrong.

“So it wasn’t just that Amgen was unable to reproduce them,” Begley said. “What was more shocking to me was the original investigators themselves were not able to.” Of the fifty-three original exciting studies that Begley put to the test, he could reproduce just six. Six. That’s barely one out of ten.

Lee Ellis from the MD Anderson Cancer Center in Houston lent his name and analysis to the effort. He, too, had been outspoken about the need for more rigor in cancer research.

But he said the conversation was always different at the hotel bar, where scientists would quietly acknowledge that this was a corrosive issue for the field. “It was common knowledge; it just was unspoken. The shocking part was that we said it out loud.”

In 2005, John Ioannidis published a widely cited paper, titled “Why Most Published Research Findings Are False,” that highlighted the considerable problems caused by flimsy study design and analysis.

The ecosystem in which academic scientists work has created conditions that actually set them up for failure. There’s a constant scramble for research dollars. Promotions and tenure depend on their making splashy discoveries. There are big rewards for being first, even if the work ultimately fails the test of time. And there are few penalties for getting it wrong. In fact, given the scale of this problem, it’s evident that many scientists don’t even realize that they are making mistakes.

From the director’s office in Building 1 at the National Institutes of Health (NIH), Francis Collins and his chief deputy, Lawrence Tabak, declared in a 2014 Nature comment, “We share this concern” over reproducibility. In the long run, science is a self-correcting system, but, they warn, “in the shorter term — the checks and balances that once ensured scientific fidelity have been hobbled.”

“The scientific revolution has allowed humanity to avoid a Malthusian crisis over and over again,” he said. To get through the next couple of centuries, “we need to have a scientific enterprise that is working as best as it can. And I fundamentally think that it isn’t.”

The rate of new-drug approval has been falling since the 1950s. In 2012, Jack Scannell and his colleagues coined the term “Eroom’s law” to describe the steadily worsening state of drug development. “Eroom,” they explained, is “Moore” spelled backward. Moore’s law charts the exponential progress in the efficiency of computer chips; the pharmaceutical industry, however, is headed in the opposite direction.

These researchers blame Eroom’s law on a combination of economic, historical, and scientific trends. Scannell told me that a lack of rigor in biomedical research is an important underlying cause.

“Everyone says, ‘It’s not my problem,’” Barker told me. “But it has to be someone’s problem. What about accountability? At the National Cancer Institute, we spent a lot of money. Our budget was $5 billion a year. That’s not a trivial amount of investment. If a major percentage of our data is not reproducible, is the American taxpayer being well served?” Barker reads scientific journals with trepidation. “I have no clue whether to trust the data or not,” she said.

When an exciting scientific discovery is reported, scientists are quick to jump on the bandwagon, often without considering whether the original finding is in fact true. Here’s a case in point. In 1999 and 2000, several scientists made a startling claim: they announced that bone marrow stem cells could spontaneously transform themselves into cells of the liver, brain, and other organs.

In 2002, Wagers concluded with typical scientific understatement that transdifferentiation is “not a typical function” of normal stem cells found in the bone marrow.

“This episode illustrated how the power of suggestion could cause many scientists to see things in their experiments that weren’t really there and how it takes years for a field to self-correct,” Morrison wrote in a scientific editorial, noting that scientists are sometimes too eager to rush forward “without ever rigorously testing the central ideas. Under these circumstances dogma can arise like a house of cards, all to come crumbling down later when somebody has the energy to do the careful experiments and the courage to publish the results.”

Begley explained that the deal he’d cut with the researchers prevented him from exposing them, but the question did disturb him. His solution was to write a follow-up comment in Nature titled “Six Red Flags for Suspect Work.” In it, he ran down the list of the six most common preventable failures he encountered. They’re worth repeating here because they are very common failings found in biomedical research, and they explain a good deal of the reproducibility problem. Here are the questions that researchers should ask:

Were experiments performed blinded — that is, did scientists know, as they were doing the experiment, which cells or animals were the test group and which were the comparison group?
Were basic experiments repeated?
Were all the results presented? Sometimes researchers cherry-pick their best-looking results and ignore other attempts that failed, skewing their results.
Were there positive and negative controls? This means running parallel experiments as comparisons, one of which should succeed and the other of which should fail if the scientist’s hypothesis is correct.
Did scientists make sure they were using valid ingredients?
Were statistical tests appropriate? Very often biomedical scientists choose the wrong methods to analyze their data, sometimes invalidating the entire study.

This list should be as familiar to scientists as the carpenter’s dictum is to home builders: measure twice, cut once. Alas, the rules are often not applied.

Chapter Two - IT’S HARD EVEN ON THE GOOD DAYS

A natural scientist realized he could test “strange” facts by trying to replicate them, as David Wootton explains in The Invention of Science.

British philosopher Francis Bacon had not long before formalized the scientific method: make a hypothesis, devise a test, gather data, analyze and rethink, and ultimately draw broader conclusions. That rubric worked reasonably well when scientists were exploring easily repeatable experiments in the realm of physics (for example, studies involving vacuum pumps and gases). But biology is a much tougher subject, since there are many variables and a great deal of natural variation. It’s harder to see phenomena and harder to make sure personal biases don’t creep in.

But it’s not always comfortable for a scientist to raise these issues. The people you criticize “might be reviewing your grants” and deciding whether you deserve funding, Cech told me. “They might make a decision about whether they’ll give you a job offer. They might be reviewing some other papers of yours. So there’s a tendency to be careful about being too negative about other people’s work.”

Although deliberately vague about the unpublished ideas, he said his team had looked at some basic immunological findings observed in mice and often extrapolated to people. The new research shows that the mouse data don’t apply to humans and are leading immunologists astray. But journals keep rejecting the paper, Davis said. “One comment came back: ‘If this paper is published it will set the field back 10 or 20 years!’ And I thought that was a really remarkable statement. But ultimately I interpreted it as a cry for help. If you’re so insecure about your field that one paper could do so much damage, what have you got? What are you so proud of here that could be swept away so easily?”

Chapter Three - A BUCKET OF COLD WATER

Sean Scott, then head of the institute, decided to rerun those tests, this time with a valid experimental design involving an adequate number of mice that were handled more appropriately. He discovered that none of those drugs showed any signs of promise in mice. Not one. His 2008 study shocked the field but also opened a path forward. ALS TDI would devote its efforts to doing this basic biology right.

Perrin’s institute has shown clearly that cutting corners here can lead to pointless and wasteful experiments. Even so, “we still get some pushback from the academic community that we can’t afford to do an experiment like that,” he said. It’s so expensive that they choose to do the experiments poorly.

Congress started to wake up to the issue. Republican senator Richard Shelby of Alabama raised it at a hearing on March 28, 2012. He brought up a December 2011 Wall Street Journal story based on the Bayer study of replication failures from that fall. “This is a great concern, Dr. Collins,” Senator Shelby said at the hearing. “I don’t want to ever discourage scientific inquiry, and I know you don’t, or basic biomedical research. But I think we on this subcommittee, we need to know why so many published results in peer-reviewed publications are unable to be successfully reproduced. When the NIH requests $30 billion or more in taxpayer dollars for biomedical research — which I think is not enough — shouldn’t reproducibility, replication of these studies, be a part of the foundation by which the research is judged? And how can NIH address this problem? Is that a concern to you?”

As of January 2016, researchers must take some basic steps to avoid the most obvious pitfalls. When applying for a grant, they need a plan to show that the cells they are using are actually what they think they are (this is not a trivial issue, as we shall see). They need to show they’ve considered the sex of the animals they will use in their studies. They need to show that they’ve taken the time to find out whether the underlying science looks solid. And scientists must show in their applications that they will use “rigorous experimental design.” Researchers are supposed to be held accountable for all this during the annual reviews of their grants. It’s not clear how aggressively the various grant managers at NIH will enforce these new rules — officials historically have only canceled grants for egregious behavior, like fraud. So these steps are hardly cure-alls, but they are moves in the right direction.

TACT, which stands for the TREAT-NMD Advisory Committee for Therapeutics, was originally funded by the European Union but now runs with its own resources. It is a no-nonsense venue for reviewing potential drugs for neuromuscular diseases like muscular dystrophy. Twice a year, some of the world’s experts in the field review submissions, ask tough questions, and render judgment—often harsh judgment.

Chapter Four - MISLED BY MICE

“Nobody knows how well a mouse predicts a human,” said Thomas Hartung at Johns Hopkins University. In fact a test on mice doesn’t even predict how a drug will work in another rodent. For instance, certain drug-toxicity tests run separately on rats and mice only reach the same conclusion about 60 percent of the time.

Hartung said roughly half of the chemicals that show up as potential cancer-causing agents in mouse experiments are probably not human health hazards. Coffee is one example. Researchers have tested thirty-one compounds isolated from coffee, and of those twenty-three flunked the safety test.

“It is sobering that of over 1,000 publications from leading UK institutions, over two-thirds did not report even one of four items considered critical to reducing the risk of bias,” Macleod and colleagues wrote, “and only one publication [out of 1,000] reported all four measures.”

The biology of inflammation seemed to be dramatically different between the two species. “That was a bit of a shocking result,” he told me. It suggested that decades of inflammation research using mice was misguided and that scientists who continue to use mice for this research could be wasting their time. It was not a message the sepsis field wanted to hear.

when the researchers relaxed some of their strict requirements and tested a more heterogeneous group of mice, they paradoxically got more consistent results.

Neuroscience professor Gregory Petsko at Weill Cornell Medical College is in the latter camp. He has spent his career studying neurological diseases, including ALS and Alzheimer’s. “The animal models are a disaster,” he said. “I worry not just that they might be wrong. ‘Wrong’ animal models you can work with. If you know why it’s wrong, you can use the good parts of the model, and you don’t take any information from the parts that are bad. But what if the neurodegenerative disease models are not wrong but irrelevant? Irrelevant is much worse than wrong. Because irrelevance sends you in the wrong direction. And I think the animal models for nearly all the neurological disorders are in fact irrelevant. And that scares the shit out of me, if you pardon the expression.”

Back in the decades when drug development was progressing rapidly, doctors weren’t trying to create new drugs based on a deep understanding of biology. They just experimented on people — not mice — to see what worked. “I wouldn’t necessarily seek to defend the historic approach,” Scannell said. “I think today people would be horrified if they knew how drug discovery really worked in the fifties and sixties. But I also think it is a historical fact that it was an efficient way to discover drugs. It may be an ethically unpalatable fact and something you would never wish to revisit, but I think probably bits of it could be revisited with not huge risk.”

The lesson here is twofold. First, it’s better not to assume that a specific drug works with pinpoint precision. Most drugs “are magic shotguns, not magic bullets,” he said, and sometimes the “off-target” effects can be useful (clearly, they can also be nasty side effects). Second, it’s important to remember that many important discoveries start with human beings in a medical clinic rather than with mice in cages.

Chapter Five - TRUSTING THE UNTRUSTWORTHY

Through the 1960s and 1970s, Walter Nelson-Rees made no end of enemies in science by testing cell lines purported to be from many different cancers and pointing out correctly — but brusquely — that they were in fact HeLa cells.

In 1986, science writer Michael Gold wrote A Conspiracy of Cells, a lively history of Nelson-Rees and his campaign against HeLa. And how did the scientists using cells in their research respond? They mostly ignored the problem.

Even so, more than 7,000 published studies have used HEp-2 or Int-407 cells, unaware that they were actually HeLa, at an estimated cost of more than $700 million.

A 2007 study estimated that between 18 and 36 percent of all cell experiments use misidentified cell lines.

With the standard in hand, Capes-Davis was then anointed to chair an organization that sprang up around this issue, the International Cell Line Authentication Committee. It maintains and updates the list of corrupted cell lines, which by 2016 had grown to 438, with no end in sight.

It turned out that MDA-MB-435 was an imposter. The cell was unmasked quite by accident. Back in the late 1990s, scientists at Stanford University were developing a test that would allow them to look at a biological sample and see which genes are switched on or off in any given cell. Doug Ross was a postdoctoral researcher in a star-studded laboratory that helped develop these powerful new genetic tools.

There are now more than 1,000 papers in scientific journals featuring MDA-MB-435—most of them published since Ross’s 2000 report. It’s impossible to know how much this sloppy use of the wrong cells has set back research into breast cancer.

As a result of this painful experience, Rimm has become an evangelist for cleaning up the mess with antibodies. And it’s a big mess. Glenn Begley said faulty antibodies were apparently responsible for a lot of the results he was unable to reproduce at Amgen.

And as the story of cell lines makes clear, simply having a standard isn’t enough. Most scientists must be coerced by funding agencies or their employers to run these tests.

Chapter Six - JUMPING TO CONCLUSIONS

The batch effect is a stark reminder that, as biomedicine becomes more heavily reliant on massive data analysis, there are ever more ways to go astray. Analytical errors alone account for almost one in four irreproducible results in biomedicine, according to Leonard Freedman’s estimate. A large part of the problem is that biomedical researchers are often not well trained in statistics. Worse, researchers often follow the traditional practices of their fields, even when those practices are deeply problematic.

Baggerly now routinely checks the dates when data are collected—and if cases and controls have been processed at different times, his suspicions quickly rise. It’s a simple and surprisingly powerful method for rooting out spurious results.
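To see why the processing date matters so much, here is a small sketch of my own (not from the book). It assumes a treatment with no effect at all and a hypothetical instrument that reads one unit higher on the second processing day; the names `simulate` and `batch_shift` are made up for illustration.

```python
import random
from statistics import mean

def simulate(confounded, n=200, batch_shift=1.0, seed=0):
    """Return mean(cases) - mean(controls) when the treatment does nothing.

    Every measurement is pure noise plus a shift that depends only on
    which processing batch (day) the sample was run in.
    """
    rng = random.Random(seed)
    cases, controls = [], []
    for i in range(n):
        if confounded:
            # All controls are run on day A, all cases on day B.
            control_batch, case_batch = 0, 1
        else:
            # Each day processes a mix of cases and controls.
            control_batch = case_batch = i % 2
        controls.append(rng.gauss(0, 1) + batch_shift * control_batch)
        cases.append(rng.gauss(0, 1) + batch_shift * case_batch)
    return mean(cases) - mean(controls)

confounded_gap = simulate(confounded=True)
balanced_gap = simulate(confounded=False)
print(f"cases and controls run on different days: {confounded_gap:+.2f}")
print(f"cases and controls mixed within each day: {balanced_gap:+.2f}")
```

When cases and controls are split by day, the whole day-to-day shift shows up as an apparent "effect". When every day handles both groups, the shift cancels out and the gap hovers near zero.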

Over the years breathless headlines have celebrated scientists claiming to have found a gene linked to schizophrenia, obesity, depression, heart disease — you name it. These represent thousands of small-scale efforts in which labs went hunting for genes and thought they’d caught the big one. Most were dead wrong. John Ioannidis at Stanford set out in 2011 to review the vast sea of genomics papers. He and his colleagues looked at reported genetic links for obesity, depression, osteoporosis, coronary artery disease, high blood pressure, asthma, and other common conditions. He analyzed the flood of papers from the early days of genomics. “We’re talking tens of thousands of papers, and almost nothing survived” closer inspection. He says only 1.2 percent of the studies actually stood the test of time as truly positive results. The rest are what’s known in the business as false positives.

The formula for success was to insist on big studies, to make careful measurements, to use stringent statistics, and to have scientists in various labs collaborate with one another.

These improved standards for genomics research have largely taken hold, Ioannidis told me. “We went from an unreliable field to a highly reliable field.” He counts this as one of the great success stories in improving the reproducibility of biomedical science.

Fisher’s idea was that when scientists perform experiments, they should use this test as a guide to gauge the strength of their findings, and the p-value was part of that. He emphatically urged them to perform their experiments many times over to see whether the results held. And he didn’t establish a bright line that defines what qualifies as statistically significant. Unfortunately, most modern researchers have summarily dismissed his wise counsel. For starters, scientists have gradually come to use p-values as a shortcut that allows them to draw a bright line. The result of any one experiment is now judged statistically significant if it reaches a p-value of less than 0.05.

“We wouldn’t be where we are today if even 5 percent of people [scientists] understood this particular point,” said Stanford’s Steve Goodman, one of the beacons of statistical reasoning.

That’s not to say that all those results are wrong, just that scientists and journal editors place far too much confidence in them.

In a widely read 2011 paper, Simonsohn and his colleagues described this kind of manipulation. They called it p-hacking.

In the years since he published that paper, Simonsohn has come to realize that p-hacking is incredibly common in all branches of science. “Everybody p-hacks to some extent,” he told me. “Nobody runs an analysis once, and if it doesn’t work, throws everything away. Everybody tries more things.”
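One common way of "trying more things" is to peek at the data as it comes in and stop as soon as the result looks significant. The sketch below is my own illustration, not something from the book. It assumes there is no real effect, that the data are normal with a known standard deviation of 1, and that a simple z-test is used; `false_positive_rate` and its parameters are hypothetical names.

```python
import math
import random

def two_sided_p(sample):
    """p-value of a z-test that the mean is 0, assuming known sd = 1."""
    z = sum(sample) / math.sqrt(len(sample))
    return math.erfc(abs(z) / math.sqrt(2))

def false_positive_rate(peek, trials=2000, max_n=100, step=10, seed=1):
    """Fraction of null experiments that end up 'significant' at p < 0.05."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = []
        significant = False
        while len(sample) < max_n:
            sample.extend(rng.gauss(0, 1) for _ in range(step))
            # A peeking analyst tests after every batch; a disciplined one
            # tests only once, at the planned sample size.
            if (peek or len(sample) == max_n) and two_sided_p(sample) < 0.05:
                significant = True
                break
        hits += significant
    return hits / trials

fixed_rate = false_positive_rate(peek=False)
peeking_rate = false_positive_rate(peek=True)
print(f"test once at n=100:        {fixed_rate:.3f}")
print(f"test after every 10 draws: {peeking_rate:.3f}")
```

Testing once should land close to the nominal 5 percent false-positive rate. Peeking after every batch pushes it well above that, even though nobody faked anything.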

If p-hacking weren’t trouble enough, Simonsohn points to one other pervasive problem in research: scientists run an experiment first and come up with a hypothesis that fits the data only afterward. The “Texas sharpshooter fallacy” provides a good analogy. A man wanders by a barn in Texas and is amazed to see bullet holes in the exact bull’s-eyes of a series of targets. A young boy comes out and says he’s the marksman. “How did you do it?” the visitor asks. “Easy. I shot the barn first and painted the targets later,” the boy answers. In science, the equivalent practice is so common it has a name: HARKing, short for “hypothesizing after the results are known.”

It often starts out in all innocence, when scientists confuse exploratory research with confirmatory research. This may seem like a subtle point, but it’s not. Statistical tests that scientists use to differentiate true effects from random noise rest on an assumption that the scientist started with a hypothesis, designed an experiment to test that hypothesis, and is now measuring the results of that test. P-values and other statistical tools are set up explicitly for that kind of confirmatory test. But if a scientist fishes around and finds something provocative and unexpected in his or her data, the experiment silently and subtly undergoes a complete change of character. All of a sudden it’s an exploratory study.

Scientists at the lab bench slip easily back and forth between the exploratory and confirmatory modes of research. Both are vital to the enterprise. The problems come when scientists lose track of where they are in this fluid world of confirmation and exploration.
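A little arithmetic shows why fishing through data and then reporting the catch as a confirmatory result is so dangerous. This is my own sketch, assuming the comparisons are independent and that none of them reflects a real effect (so each p-value is uniform between 0 and 1); the function names are hypothetical.

```python
import random

def chance_of_spurious_hit(n_tests, alpha=0.05):
    """P(at least one p < alpha) across independent tests with no real effects."""
    return 1 - (1 - alpha) ** n_tests

def simulated_chance(n_tests, alpha=0.05, trials=20000, seed=2):
    """Same quantity estimated by drawing null p-values uniformly at random."""
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() < alpha for _ in range(n_tests))
        for _ in range(trials)
    )
    return hits / trials

for k in (1, 5, 20, 100):
    print(f"{k:>3} comparisons: {chance_of_spurious_hit(k):.2f}")
```

With a single planned comparison the chance of a false alarm is the advertised 5 percent. Mine twenty comparisons and the chance that at least one clears p < 0.05 by luck alone is about 64 percent.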

Chapter Seven - SHOW YOUR WORK

After thinking about his own research practices, Nosek had an epiphany. Simply increasing transparency could go a long way toward reducing the reproducibility problems that plague biomedical research. For starters, scientists would avoid the pitfall of HARKing if they did a better job of keeping track of their ideas — especially if they documented what they were planning to do before they actually sat down to do it. Though utterly basic, this idea is not baked into the routines in Nosek’s field of psychology or in biomedical research. So Nosek decided to do something about that. He started a nonprofit called the Center for Open Science, housed incongruously in the business center of the Omni Hotel in downtown Charlottesville, Virginia. His staff, mostly software developers, sit at MacBook computers hooked up to gleaming white displays. Everyone works in one big, open room and can wander over to cupboards stocked with free food. The main project at the center is a data repository called the Open Science Framework.

“Psychology’s Fears Confirmed: Rechecked Studies Don’t Hold Up,” read the page-one New York Times headline on August 28, 2015.

The Food and Drug Administration Modernization Act of 1997 requires scientists running clinical trials on potential new drugs or devices to register their hypotheses in advance in a federal repository called ClinicalTrials.gov, set up by the National Institutes of Health (NIH) in 2000.

Ben Goldacre, a doctor and gadfly in the United Kingdom, has exposed many examples of studies that don’t report the results they said they would or present unanticipated results that are in fact exploratory rather than confirmatory. He hectors journals to publish clarifications when he finds evidence of this but has met with limited success.

Robert Kaplan and Veronica Irvin at the NIH set out to see whether the law requiring scientists to declare their end points in advance really made a difference. They reviewed major studies of drugs or dietary supplements supported by the National Heart, Lung and Blood Institute between 1970 and 2012 and came up with a startling answer. Of the thirty big studies done before the law took effect, 57 percent showed that the drug or supplement being tested was beneficial. But once scientists had to announce in advance what exactly they were looking for, the success rate plummeted. Only 8 percent of the studies (two out of twenty-five) published after 2000 confirmed the preregistered hypothesis.
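How likely is a drop like that to be a fluke? Here is a rough check of my own, not an analysis from the book or the study. The "2 of 25" count is stated above; the "17 of 30" count is my inference from "57 percent" of thirty studies and should be treated as an assumption. The sketch uses a one-sided Fisher exact test built from `math.comb`.

```python
from math import comb

# 2 of 25 is stated in the text; 17 of 30 is inferred from "57 percent".
before_hits, before_total = 17, 30
after_hits, after_total = 2, 25

def fisher_one_sided(a, n1, b, n2):
    """P(first group has >= a successes) if both groups share one success rate."""
    total_hits = a + b
    total = n1 + n2
    upper = min(n1, total_hits)
    # Hypergeometric tail: ways to place x of the successes in group one.
    tail = sum(
        comb(total_hits, x) * comb(total - total_hits, n1 - x)
        for x in range(a, upper + 1)
    )
    return tail / comb(total, n1)

p_value = fisher_one_sided(before_hits, before_total, after_hits, after_total)
print(f"before registration: {before_hits / before_total:.0%} positive")
print(f"after registration:  {after_hits / after_total:.0%} positive")
print(f"one-sided Fisher exact p-value: {p_value:.5f}")
```

Under these assumptions the gap is very unlikely to be chance alone, though a test like this says nothing about other things that changed between the two eras.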

Federal rules have openness requirements as well, but these are rarely enforced. In truth, scientists don’t reliably play by these rules, even when taxpayers fund their research as a public good.

This is another example of the perverse incentives in biomedical research. What’s best for moving science forward isn’t necessarily best for a researcher’s career.

Researchers studying people in clinical trials also often hoard that data. Sharing isn’t straightforward — scientists have to be careful not to reveal private personal information in the process, and it’s not as easy to strip out every potentially revealing detail as you might think. That becomes a convenient excuse not to even make the effort. “The group that’s doing the study is quite happy with that,” Salzberg told me, “but that doesn’t really help public health, doesn’t help our understanding of cancer or other diseases.” And subjects aren’t in favor of keeping their data locked up in the files of the doctor doing the study. “If you ask the patient is it OK to share your data with every scientist who’s working on your type of cancer, of course they’ll say yes. That’s why they’re doing it. But they [researchers] don’t ask that question! I’d like to see that change.”

Transparency is at the core of a major effort to measure just how much basic cancer research can be reproduced reliably. Brian Nosek paired up with a Palo Alto company called Science Exchange in an effort to replicate fifty widely cited findings. The Reproducibility Project: Cancer Biology, as it is named, not only turned out to be a lesson in how to conduct science with maximum transparency but revealed just how challenging—and controversial—it can be to design credible experiments to reproduce the work of other labs. (It also revealed that science costs a lot of money: the team busted its multi-million-dollar budget and so had to drop about one-third of the experiments it had planned to reproduce.)

(notably, not a single experiment was described well enough in the original publication to be redone by anybody else).

He compared it with the 1975 conference in Asilomar, California, at which scientists voluntarily wrote rules to govern the early days of genetic engineering research.

Right now, the structure of science makes it difficult for scientists to live by the values that often motivated them to go into research. This is a first step. Nosek has grand ambitions: he wants to change the entire culture of science.

“I say, ‘Why did you go into science in the first place? Didn’t you go into science because you wanted to make the world a better place?’ Yeah, they did, but that was when they were a grad student or an undergraduate. They’ve forgotten that long ago. They’re in the rat race now.” They need to get their next grant, publish their next paper, and receive credit for everything they do. They have stepped into a world where career motivations discourage best scientific practices. As a result, the practice of science has drifted far from its intellectual roots.

A few years ago, the NIH put out a request to the nation’s graduate schools, asking for a list of the classes that teach biomedical methods. The idea was to take the best of these classes and make the curriculum more broadly available. Lorsch said the survey was a bust. Universities apparently don’t offer a deep curriculum in research methodology for biomedical students.

“But we should be encouraging people to move and not punish them. We punish them tremendously.” If you leave a field to change subjects, “they say you are not serious,” Casadevall said. And the new field will also consider you fickle. That’s unfortunate: switching fields can help break ideas that are accepted as dogma. “When a newcomer comes in the first thing they usually do is they disturb the dogma,” he said. They may have trouble getting their ideas published and are otherwise harassed, “but those people are incredibly important. Because they come in and they unsettle the table. It’s the only way to move forward.”

Chapter Eight - A BROKEN CULTURE

Darwin’s nineteenth-century career is also different in another important way. As a gentleman-scientist, he had no need to hustle for money. And he was in no hurry to publish his discoveries.

“If you think about the system for incentives now, it pays to be first,” Veronique Kiermer, executive editor of the Public Library of Science (PLOS) journals told me. “It doesn’t necessarily pay to be right. It actually pays to be sloppy and just cut corners and get there first. That’s wrong. That’s really wrong.”

Once young biomedical scientists finish their PhDs, they go into a twilight world of academia: postdoctoral research. This is nominally additional training, but in fact postdocs form a cheap labor pool that does the lion’s share of the day-to-day research in academic labs. Nobody tracks how many postdocs are in biomedicine, but the most common estimate is that there are at least 40,000 at any given point. They often work for five years in these jobs, which, despite heavy time demands, usually pay less than $50,000 a year — a rather modest salary for someone with an advanced degree and quite possibly piles of student debt.

Martinez figured she would need to get into a journal with a high “impact factor,” a measurement invented for commercial purposes: the rating helps journals sell ads and subscriptions. But these days it’s often used as a surrogate to suggest the quality of the research. Journals with higher impact factors publish papers that are cited more often and therefore commonly presumed to have more significance. At the top of the heap, the journal Nature has an impact factor over 40; Cell and Science have impact factors over 30.

In those sessions, “I’ve raised my hand and asked in my best meek voice... ‘Can you tell me what’s in those papers?’ Most of the time they can’t. They haven’t had time to read those papers. So they’re using where someone publishes as a proxy for the quality of what they published. I’m sorry. That’s wrong.”

She’s dismayed that the editors at Nature are essentially determining scientists’ fates when choosing which studies to publish. Editors “are looking for things that seem particularly interesting. They often get it right, and they often get it wrong. But that’s what it is. It’s a subjective judgment,” she told me. “The scientific community outsources to them the power that they haven’t asked for and shouldn’t really have.”

Schekman said the problem with impact factors is not only that they warp science’s career system. “It’s hand in hand with the issue of reproducibility because people know what it takes to get their paper into one of these journals, and they will bend the truth to make it fit because their career is on the line.” Scientists can be tempted to pick out the best-looking data and downplay the rest, but that can distort or even invalidate results. “I don’t want to impugn their integrity, but cherry picking is just too easy,” he said. And bad as it is in the United States, Schekman said, it’s even worse in Asia, “where the [impact factor] number is sacred. In China it’s everything.”

Chinese scientists get cash bonuses for publishing in Science, Nature, or Cell, and Schekman said they sell coauthorships for cash.

Schekman helped establish eLife in part to combat the tyranny of impact factors. He said he told people at Thomson Reuters, the company that generates the rating, that he didn’t want one. They calculated one anyway.

A journal can crank up the pressure even more by telling scientists that it will likely accept their paper if they can conduct one more experiment backing up their findings. Just think of the incentive that creates to produce exactly what you’re looking for. “That is dangerous,” Kiermer said. “That is really scary.”

Outright fraud also creeps into science, just as in any other human endeavor. Scientists concerned about reproducibility broadly agree that fraud is not a major factor, but it does sit at the end of a spectrum of problems confronting biomedicine. The thinly staffed federal Office of Research Integrity, which identifies about a dozen cases of scientific misconduct a year, catalogues the agency’s formal findings on its website.

Another way to measure misconduct, as well as less serious offenses, is to watch for retractions in the scientific literature. Ivan Oransky and Adam Marcus started doing that as a hobby in 2010 on a blog they set up called Retraction Watch.

Arturo Casadevall at Johns Hopkins University and colleague Ferric Fang at the University of Washington dug into retractions and discovered a more disturbing truth: 70 percent of the retractions they studied resulted from bad behavior, not simply error.

“We’re dealing with a real deep problem in the culture,” Casadevall said, “which is leading to significant degradation of the literature.” And even though retractions are on the rise, they are still rarities — only 0.02 percent of papers are retracted, Oransky estimates.

Allison and his colleagues sent letters to journals pointing out mistakes and asking for corrections. They were flabbergasted to find that some journals demanded payment, up to $2,100, just to publish their letter pointing out someone else’s error.

“If we created more of a fault-free system for admitting mistakes it would change the world,” said Sean Morrison, a Howard Hughes Medical Institute investigator at the University of Texas Southwestern Medical Center.

Biomedical science is nowhere near that point right now, and it’s hard to see how to change that culture. Morrison said that it’s unfortunately in nobody’s interest to call attention to errors or misconduct — especially the latter. The scientists calling out problems worry about their own careers; universities worry about their reputations and potential lawsuits brought by the accused. And journals don’t like to publish corrections, admitting errors that sharper editing and peer review could well have avoided.

“We had a lot of trepidation about writing that technical comment because Richard and Vivian were much more experienced, established investigators,” Akey told me, noting that he and his colleagues were just a few years into their careers. “It was not clear what the risk/reward ratio would be. With that said, everybody believes science is a self-correcting process, and ultimately we felt it was important to point this out and to let other people start thinking about some of these issues in more detail.”

“Most people who work in science are working as hard as they can. They are working as long as they can in terms of the hours they are putting in,” said social scientist Brian Martinson. “They are often going beyond their own physical limits. And they are working as smart as they can. And so if you are doing all those things, what else can you do to get an edge, to get ahead, to be the person who crosses the finish line first? All you can do is cut corners. That’s the only option left you.” Martinson works at HealthPartners Institute, a nonprofit research agency in Minnesota. He has documented some of this behavior in anonymous surveys. Scientists rarely admit to outright misbehavior, but nearly a third of those he has surveyed admit to questionable practices such as dropping data that weakens a result, based on a “gut feeling,” or changing the design, methodology, or results of a study in response to pressures from a funding source. (Daniele Fanelli, now at Stanford University, came to a similar conclusion in a separate study.)

If you perceive you have a fair shot, you’re less likely to bend the rules. “But if you feel the principles of distributive justice have been violated, you’ll say, ‘Screw it. Everybody cheats; I’m going to cheat too,’” Martinson said. If scientists perceive they are being treated unfairly, “they themselves are more likely to engage in less-than-ideal behavior. It’s that simple.” Scientists are smart, but that doesn’t exempt them from the rules that govern human behavior.

Paul Smaldino at the University of California, Merced, and Richard McElreath at the Max Planck Institute for Evolutionary Anthropology ran a model showing that labs that use quick-and-dirty practices will propagate more quickly than careful labs. The pressures of natural selection and evolution actually favor these labs because the volume of articles is rewarded over the quality of what gets published. Scientists who adopt these rapid-fire practices are more likely to succeed and to start new “progeny” labs that adopt the same dubious practices. “We term this process the natural selection of bad science to indicate that it requires no conscious strategizing nor cheating on the part of researchers,” Smaldino and McElreath wrote.
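To make the selection argument concrete, here is a minimal toy sketch in Python. It is not Smaldino and McElreath's actual model. It assumes three hypothetical lab types that differ only in effort per paper, assumes paper output is inversely proportional to effort, and assumes new labs are founded in proportion to publication volume. The function name `simulate_selection` and the effort values are made up for illustration.

```python
def simulate_selection(efforts, generations=20):
    """Toy replicator dynamics: a lab type's share grows with its paper output."""
    # Start with equal shares of each lab type.
    shares = [1.0 / len(efforts)] * len(efforts)
    history = []
    for _ in range(generations):
        # Assumption: papers per cycle are inversely proportional to effort per paper.
        output = [s / e for s, e in zip(shares, efforts)]
        total = sum(output)
        # Assumption: "progeny" labs are founded in proportion to publication volume.
        shares = [o / total for o in output]
        # Track the population-wide average effort per paper.
        history.append(sum(s * e for s, e in zip(shares, efforts)))
    return shares, history


# Hypothetical effort levels: quick-and-dirty, average, careful.
efforts = [1.0, 2.0, 4.0]
shares, history = simulate_selection(efforts)
print("final shares:", [round(s, 4) for s in shares])
print("mean effort, first vs. last generation:", round(history[0], 3), round(history[-1], 3))
```

Nobody in this sketch decides to cheat. Average effort falls simply because the low-effort lab type publishes more and therefore multiplies faster, which is the dynamic the quote describes.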

A driving force encouraging that behavior is the huge imbalance between the money available for biomedical research and the demand for it among scientists, Martinson argues. “The core issues really come down to the fact that there are too many scientists competing for too few dollars, and too many postdocs competing for too few faculty positions. Everything else is symptoms of those two problems,” Martinson said.

Congress inadvertently made the problem worse by showering the NIH with additional funding. The agency’s budget doubled between 1998 and 2003, sparking a gold rush mentality. The amount of lab space for biomedical research increased by 50 percent, and universities created a flood of new jobs. But in 2003 the NIH budget flattened out. Spending power actually fell by more than 20 percent in the following decade, leaving empty labs and increasingly brutal competition for the shrinking pool of grant funding. The system remains far out of balance.

Psychiatrist Christiaan Vinkers and his colleagues at the University Medical Center in Utrecht, Holland, have documented a sharp rise in hype in medical journals. They found a dramatic increase in the use of “positive words” in the opening section of papers, “particularly the words ‘robust,’ ‘novel,’ ‘innovative,’ and ‘unprecedented,’ which increased in relative frequency up to 15,000%” between 1974 and 2014.

“One must not underestimate the ingenuity of humans to invent new ways to deceive themselves,” he wrote.

In 2014, some leaders of the biomedical enterprise decided it was time to start a serious conversation about these issues. Bruce Alberts (then president of the National Academy of Sciences), Marc Kirschner (chair of systems biology at Harvard), Shirley Tilghman (former president of Princeton), and Harold Varmus (then head of the National Cancer Institute) wrote a paper titled “Rescuing US Biomedical Research from Its Systemic Flaws.”

Those with the most economic power — the federal funding agencies — can’t simply impose solutions from above. “The NIH is terrified of offending institutions,” said Henry Bourne at UCSF. Conventional politics in part drives congressional funding for biomedicine. Members of Congress support institutions in their districts because local economies grow when federal dollars flow to universities and medical centers. Congress has also funded biomedical research because so many politicians have a sick relative or a dying friend and want to support the search for treatments and cures.

“I think that is what the real problem is—balancing ambition and delight,” he told me. Scientists need both ambition and delight to succeed, but right now the money crunch has tilted them far too much in the direction of personal ambition. “Without curiosity, without the delight in figuring things out, you are doomed to make up stories.”

Chapter Nine - THE CHALLENGE OF PRECISION MEDICINE

Scientists are reluctant to create standards and even slower to adopt them. Something as commonsensical as authenticating cell lines has been a slog. Yet standards are hardly a new idea in science and technology. “We have tons of standards. We have more standards than you’d ever want to think about,” Barker said, for everything from lightbulbs and USB ports to food purity. But they don’t permeate biomedical research. “How many standards do we have in whole genome sequencing? That would be none at this point,” she said.

Of all the problems in biomedical research, “the irreproducibility and the lack of rigor on the biomarker side is probably the most painful,” Woodcock said. She blames academic researchers for insufficient rigor in their initial efforts to find biomarkers. “The biomedical research community believes if you publish a paper on a biomarker, then it’s real. And most of them are wrong. They aren’t predictive, or they don’t add additional value. Or they’re just plain old wrong.”

“There have been like ten thousand papers published on osteoarthritis biomarkers with no rigorous correlative science going on,” Woodcock said.

“I don’t know if there’s a way in academia to make it so that people can retain their positions but nonetheless walk away from data that isn’t looking encouraging. I think that’s a big part of this reproducibility problem. There’s this need to stick with what you find because your career depends upon it. If you could report it as negative and say it didn’t work and still survive, I think you’d be more inclined to do that.”

Chapter Ten - INVENTING A DISCIPLINE

Finally, Goodman decided to turn his full attention to issues of rigor and reproducibility in biomedicine. He moved to Stanford University in 2011 and two years later cofounded a new endeavor called METRICS, an acronym for Meta-Research Innovation Center at Stanford. “We do research on research,” Goodman told me. “To figure out what’s wrong and how to make research better, you have to study research. That’s what meta-research is. It’s not like metaphysics. It’s real. And we look at real things.”

Not only does METRICS have an unusual mission; it has an unlikely history. The codirectors, Goodman and John Ioannidis, are not natural partners. Goodman is deliberative, while Ioannidis moves rapidly from project to project, publishing dozens of papers every year. The two scientists even faced off in a very public disagreement a decade ago.

He suspects that scientists who had spent their careers studying vitamin E kept on defending the positive findings. “They were living in their own bubble, unperturbed by the evidence.” “This is one major reason why having lots of false results circulating in the literature is not a good idea. These results get entrenched. You cannot get rid of them,” he told me. “There will also be lots of people who are unaware of it who will just hit upon the paper and will never know that this thing has been refuted.”

Each field of science has its own particular culture, so each will have to develop its own ways to improve rigor. For example, Goodman discovered that in psychology, a single experiment often becomes the basis for an entire career, and replication is actually discouraged. “To redo an experiment is taken as a personal attack on the integrity and on the theories of the person who did the original work,” Goodman found. “I thought I couldn’t be shocked, but this is truly shocking.”

discussion touched on four topics that generally arise when scientists think about how to fix the broken system: getting individual scientists to change their ways, getting journals to change their incentives, getting funding agencies to promote better practices, and, last but not least, getting universities to grapple with these issues.

Robert Califf, then awaiting confirmation as Food and Drug Administration (FDA) commissioner, said that sensibility is starting to take hold in the United States as well. “Academia has to clean up its shop and get out of the ego business and get into the business of answering questions that matter to patients,” he said. “But the beauty is the patients are gradually going to be taking control.” If academics don’t answer the questions the public cares about, “there’s a very high chance you won’t get funded because they’re going to have a lot to say about it.” Politicians already steer some biomedical research dollars through the Defense Department, which is heavily influenced by patient advocacy groups that participate in the peer review process.

Michael Rosenblatt from Merck has suggested an even more aggressive remedy: drug companies should fund more research at universities, but, in exchange, universities should offer a money-back guarantee if the findings don’t hold up. That would obviously make universities take a more active role in ensuring the reproducibility of research conducted within their walls.

He went on to suggest that journals should consider a year of “scientific jubilee” during which papers could be self-retracted, no questions asked. “The literature would be purged, repentant scientists would be rewarded, and those who had sinned, blessed with a second chance, would avoid future temptation.”

Steve Goodman was startled to discover that Science magazine didn’t have a formal board of statistics editors until 2015 (though it did use statisticians as reviewers before then). “This has been recognized as absolutely critical to the review of empirical science for decades. And yet Science magazine just figured it out. How could that be?”

At the same time, she acknowledged the limits of scientific publication by recounting a story about John Maddox, longtime editor in chief of Nature. Someone once asked him how much of what Nature published was wrong, “and he famously answered, ‘All of it,’” McNutt said. “What he meant by that is, viewed through the lens of time, just about everything that we write down we’ll look back at and say, ‘That isn’t quite right. That doesn’t really look like how we would express things today.’ So most papers don’t stand up to the test of time.”

One idea he has pursued involves awarding “badges” to scientists who do the right thing. Like a gold star on an elementary school assignment, these visible tokens mark published papers whose authors have agreed to share their data. “Badges are stupid. But they work,” he said.

Kimmelman argued provocatively that since science can never free itself of missteps and irreproducible results, it would be helpful for scientists, when they report a result, to state how much confidence they have in their findings. If it’s a wild idea, declare that you don’t have a whole lot of confidence in the result, and scientists following up on it can proceed at their own risk.

ACKNOWLEDGMENTS

Last but not least, I am deeply grateful to Dan Sarewitz at the Consortium for Science, Policy & Outcomes of Arizona State University (ASU). He provided me a place to work in ASU’s Washington, DC, offices, as well as an appointment as a visiting scholar, which included financial support.