Beyond the Dead Salmon:

Why biomedical research exhibits

low reproducibility

and how we can help

Patrick Beukema

Ph.D.

The Dead Salmon of Systems Neuroscience

Experiment: fMRI analysis of emotional valence.

Analysis: Standard General Linear Model

Subject: North Atlantic salmon (deceased), n = 1

Study findings: Significant activation related to emotional valence

Standard practice can be flawed

Every year $100+ billion is spent on biomedical research

 

Generating 400 thousand scientific papers

 

More than 50% of research findings are likely false

 

Reproducibility rates in many fields are low (~25%)

Sources: Chakma 2014, nsf.gov, Ioannidis 2005, Stodden 2018

The Reproducibility Crisis in Science

"I realize that while I think I succeeded at getting lots of really cool papers published by cool scientists at fairly large costs—I think $20 billion—I don’t think we moved the needle in reducing suicide, reducing hospitalizations, improving recovery for the tens of millions of people who have mental illness." - Tom Insel, Director NIMH

Outline for Today

  1. Null hypothesis significance testing
    • Pitfall 1: Misuse of p-values
    • Pitfall 2: Flexible study designs
  2. What is the probability that a significant finding is true?
  3. Poor reproducibility rates across fields
  4. Methods to improve reproducibility
  5. Discussion

Pitfall 1: Misuse of p-values

Since the 1960s, p-values have been the standard criterion for establishing biomedical research claims

Source: Fisher 1926, Wikipedia

How are p-values used in biomedical research?

"Either there is something in the treatment, or a coincidence has occurred such as does not occur more than once in twenty trials."

The American Statistical Association recognized widespread misuse of null hypothesis testing

How should p-values be used?

  • You register the study
  • Your analysis is specified before observing the data
  • You are blind to the data
  • You do not perform the unblinding yourself
  • The decision to publish does not depend on the p-value (i.e., you publish even if the p-value exceeds the significance threshold)

The sampling plan is determined by the subjective intentions of the researcher

Source: Kruschke 2012

Change in intentions → change in critical value → change in conclusion

It is not always transparent how to correctly use p-values

How should p-values be used to quantify evidence?

Experiment B: p-value = 0.04, n = 10

Experiment F: p-value = 0.04, n = 100

Which experiment provides stronger evidence against the null hypothesis?

Source: Wagenmakers 2007

H0: No difference between groups

HA: A difference between groups

The p-value depends on a pre-specified sampling plan

Source: Wagenmakers 2007

Hypothetical replications determine the sampling distribution

$$ P\left(t(y^{\mathrm{rep}}) > t(y^{\mathrm{obs}}) \mid H_0\right) $$
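To make this concrete, here is a minimal sketch (not from the original slides) that computes a p-value exactly as this definition reads: generate hypothetical replications under H0 and count how often the replicated test statistic is at least as extreme as the observed one. The permutation scheme below is just one possible sampling plan.

import numpy as np

rng = np.random.default_rng(0)

def p_value_by_replication(y_a, y_b, n_rep=10_000):
    # H0: no difference between groups. Hypothetical replications are
    # generated by permuting group labels (one possible sampling plan);
    # the p-value is the fraction of replications whose test statistic
    # is at least as extreme as the observed one.
    observed = abs(y_a.mean() - y_b.mean())
    pooled = np.concatenate([y_a, y_b])
    count = 0
    for _ in range(n_rep):
        rng.shuffle(pooled)
        rep_a, rep_b = pooled[:len(y_a)], pooled[len(y_a):]
        count += abs(rep_a.mean() - rep_b.mean()) >= observed
    return count / n_rep

# Example with two groups drawn from the same distribution (H0 true)
a = rng.normal(0, 1, size=20)
b = rng.normal(0, 1, size=20)
print(p_value_by_replication(a, b))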

Pitfall 2: Flexible study designs

Source: Gelman and Loken 2013

Each study design corresponds to a different test statistic:

  • T(y): simple classical test based on a unique test statistic
  • T(y;\phi): classical test pre-chosen from a set of possible tests
  • T(y;\phi(y)): a single test based on the data, but a different test would have been performed given different data (the "garden of forking paths")
  • T(y;\phi^{best}(y)): "p-hacking": performing J tests and then reporting the best result

Flexible study designs

def analysis_1():
    p_value = experiment(parameters_1)
    return p_value < alpha


def analysis_2():
    p_value = experiment(parameters_2)
    return p_value < alpha

# ...

def analysis_n():
    p_value = experiment(parameters_n)
    return p_value < alpha

# write paper
import numpy as np
from sklearn.model_selection import ParameterGrid

candidates = {'param_1': [...],
              'param_2': [...],
              'param_n': [...]}

all_params = list(ParameterGrid(candidates))
p_value = np.inf
alpha = 0.05
n = 0

while p_value > alpha:
    # Exhaustive grid search over analysis parameters, stopping at the
    # first combination that crosses the significance threshold
    parameters = all_params[n]
    p_value = experiment(parameters)
    n += 1

# write paper
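A small simulation (hypothetical parameters, not from the slides) makes the cost explicit: even when the null hypothesis is true in every study, searching over analysis variants and stopping at the first "significant" p-value inflates the false positive rate far above alpha.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
n_variants = 20      # analysis variants tried per "study" (kernels, thresholds, subgroups, ...)
n_studies = 2000     # simulated studies, all with H0 true

false_positive_studies = 0
for _ in range(n_studies):
    for _ in range(n_variants):
        # Each variant is modeled here as an independent look at null data
        a = rng.normal(0, 1, size=20)
        b = rng.normal(0, 1, size=20)
        if stats.ttest_ind(a, b).pvalue < alpha:
            false_positive_studies += 1
            break

# Roughly 1 - (1 - alpha) ** n_variants, i.e. far above the nominal 5%
print(false_positive_studies / n_studies)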

Pitfall 2: Flexible study designs

Many sources contribute to flexible research designs

  • Adjusting frequencies, kernels, thresholds, etc.
  • Subgroup analyses when the main effect is not significant
  • Hypothesizing after seeing the results of an analysis

How often does this happen?

50% had selectively reported only studies that 'worked'

58% had peeked at the results

43% threw out data after checking its impact on the p-value

35% reported unexpected findings as if they had been predicted from the start

Source: Loewenstein 2012

Examples

Flexible study designs in Deep Learning

 "Graduate Student Descent"

Choice of hyperparameters determines performance

Flexible study designs present statistical challenges

The "Garden of Forking Paths"

Source: Bishop 2016

The importance of pre-study odds (R)

Identical p-values but different conclusions

Source: Nuzzo 2014

What is the positive predictive value of a research finding?

Source: Ioannidis 2005

PPV = f(Bias + Odds + Significance level + Power)

PPV = Positive Predictive Value

What is the average statistical power?

“When effect size is tiny and measurement error is huge, you’re essentially trying to use a bathroom scale to weigh a feather—and the feather is resting loosely in the pouch of a kangaroo that is vigorously jumping up and down.”

- Andrew Gelman

Source:  Gelman 2015 (drawing by Viktor Beekman)

Source: Nord et al. 2017, Button et al. 2013

Statistical power in neuroscience

Median statistical power is 21% (varies by subfield)

Source: Ioannidis 2005

$$ PPV = \frac{[1-\beta]R + u\beta R}{R + \alpha - \beta R + u - u\alpha + u\beta R} $$

  • [1-\beta] = Power
  • R = Pre-study odds
  • u = Bias (reporting any result as true)
  • \alpha = Significance level
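The formula translates directly into code. A sketch (the function name and example values below are illustrative, not from the slides):

def ppv(power, R, u, alpha=0.05):
    # Positive predictive value of a significant finding (Ioannidis 2005)
    # power = 1 - beta, R = pre-study odds, u = bias, alpha = significance level
    beta = 1 - power
    return (power * R + u * beta * R) / (R + alpha - beta * R + u - u * alpha + u * beta * R)

# Illustrative values: 1:1 pre-study odds, no bias
print(ppv(power=0.80, R=1.0, u=0.0))   # well-powered study
print(ppv(power=0.20, R=1.0, u=0.0))   # under-powered study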

What is the positive predictive value of a research finding?

Source:  Ioannidis 2005

Figure: PPV (%) for studies with 80% power vs 20% power, at α = 0.05.

If average positive predictive values are low, then most results should not be reproducible

Bias

Source: Fanelli et al. 2017

Misuse of significance testing (Pitfall 1) + Flexible study designs (Pitfall 2) → Selective reporting

Evidence of bias across many fields

Many fields exhibit low reproducibility

Sources: OSC 2015, Stodden 2018, Prinz 2011, Begley and Ellis 2012, Gundersen 2018

Computational Neuroscience: 24%

Cancer Biology: 11%

Psychology: 30%

Pharmacology: 7%

Machine Learning and AI: 24%

Reproducibility in Pharmacology

Source: Prinz et al. 2011, Nature

n = 67 papers

7% reproducibility rate

Source:  Gundersen 2017 AAAI

n = 400 papers

20-30% reproducibility rate

Reproducibility in Machine Learning and AI

Source: Joelle Pineau 2018 NeurIPS

n = 50 papers (NeurIPS, ICML, ICLR) 2018

20% reproducibility rate

Reproducibility in Machine Learning and AI

Hedging against academic risk

Source: Osherovich 2011, Nature

The “unspoken rule” among early stage VCs is that at least 50% of published studies, even those in top-tier academic journals, “can’t be repeated with the same conclusions by an industrial lab."

Psychology Reproducibility

Source: Open Science Collaboration, Science 2015

n = 100 papers

35% reproducibility rate

Figure: distribution of p-values and distribution of effect sizes.

n = 204 papers (from Science mag)

26% reproducibility rate

Computational Neuroscience Reproducibility

Preclinical Oncology Reproducibility

Source: Begley and Ellis, 2012

Irreproducible findings continue to be highly cited

n = 53 oncology papers

11% reproducibility rate

Flexible analyses + low statistical power = widespread irreproducibility

Three major strategies to do better

1. More stringent significance thresholds, and correct use of null hypothesis testing (per the ASA).

 

2. Move away from bright lines by using alternative measures of evidence: effect sizes, confidence intervals, Bayes factors, ...

 

3. Teach data science fundamentals in academia. Share data/code/computing environments.

1. Move threshold from 0.05 to 0.005

Pros:

  • Decrease false positive rate

Cons:

  • Increase false negative rate
  • Need for greater n
  • 0.005 is still a bright line

Source: Benjamin et al. 2017

How will changing the threshold affect the positive predictive value?

  • Decreasing alpha increases PPV
  • But depends heavily on bias.

Source: https://github.com/pbeukema/DeltaAlphaEffects

$$ PPV = \frac{[1-\beta]R+u\beta R}{R+\alpha-\beta R+u-u\alpha+u\beta R} $$

Figure: PPV under α = 0.05 vs α = 0.005, for bias levels u = 0.05, 0.20, 0.50, and 0.80.

Substantial controversy over this proposal.
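The comparison in the figure can be sketched in a few lines using the same formula (the power, pre-study odds, and bias values below are illustrative):

def ppv(power, R, u, alpha):
    beta = 1 - power
    return (power * R + u * beta * R) / (R + alpha - beta * R + u - u * alpha + u * beta * R)

# How much does lowering alpha help at different levels of bias?
for u in (0.05, 0.20, 0.50, 0.80):
    at_05 = ppv(power=0.8, R=0.25, u=u, alpha=0.05)
    at_005 = ppv(power=0.8, R=0.25, u=u, alpha=0.005)
    print(f"bias={u:.2f}  PPV(0.05)={at_05:.2f}  PPV(0.005)={at_005:.2f}")

With heavy bias, the stricter threshold barely moves the PPV, which is why the benefit of the proposal depends so strongly on bias.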

"The solution is not to reform p-values or to replace them with some other statistical summary or threshold, but rather to move toward a greater acceptance of uncertainty and embracing of variation.” - Gelman

2. Bayes factors offer one possible alternative to bright lines.

$$ \text{Bayes Factor} = \frac{P(D \mid M_A)}{P(D \mid M_B)} $$

Pros:

  • Avoids dichotomization
  • Can quantify evidence for H0

Cons:

  • Requires specification of priors

Source: Wagenmakers 2007
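A toy example (not from the slides): for binomial data, the Bayes factor comparing a point null (theta = 0.5) against an alternative with a uniform prior on theta is just the ratio of the two marginal likelihoods.

from scipy import stats
from scipy.integrate import quad

# Hypothetical data: 62 successes in 100 trials
k, n = 62, 100

# Marginal likelihood of the data under M0: theta fixed at 0.5
m0 = stats.binom.pmf(k, n, 0.5)

# Marginal likelihood under M1: theta ~ Uniform(0, 1), integrated out
m1, _ = quad(lambda theta: stats.binom.pmf(k, n, theta), 0, 1)

# Continuous measure of relative evidence; no significance threshold involved
print("BF_10 =", m1 / m0)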

3. Teach data science fundamentals in academia

3 Core Principles

Testing, Out of sample prediction, Reproducible compute

Improve reproducibility by teaching best practices in data science

Academic research projects are data science projects

 

Source: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/discriminant_analysis.py#L693

Testing

Test statistical assumptions automatically
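For example, a hypothetical pytest-style check (the function and data below are illustrative, not the actual tests from the slides) can verify an assumption such as approximate normality before an analysis that depends on it is run:

import numpy as np
from scipy import stats

def check_normality(residuals, alpha=0.05):
    # Shapiro-Wilk test; returns True if approximate normality is not rejected
    return stats.shapiro(residuals).pvalue > alpha

def test_residuals_are_approximately_normal():
    # Hypothetical residuals; in a real project these would come from the
    # analysis pipeline rather than being simulated here
    rng = np.random.default_rng(2)
    residuals = rng.normal(0, 1, size=200)
    assert check_normality(residuals)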

"With four parameters I can fit an elephant, and with five I can make him wiggle his trunk."

Von Neumann

Source: Dyson 2004

Out of sample inferences
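A minimal sketch of the idea, with made-up data and a generic model: judge the model by how well it predicts data it was never fit to.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))                 # hypothetical predictors
y = 2.0 * X[:, 0] + rng.normal(size=100)      # outcome driven by one predictor

# Held-out performance is the honest check; in-sample fit alone rewards flexibility
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(scores.mean())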

Reproducible compute

Figure: a container packages the paper with its data, software, and computing environment, so that Scientist 2 can reproduce Scientist 1's analysis.

Thank you for listening!

slides: pbeukema.github.io/science_aims

Additional Resources


The ASA statement on p-values

1. P-values can indicate how incompatible the data are with a specified statistical model

 

2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.

 

3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.

4. Proper inference requires full reporting and transparency.

 

5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.

 

6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

The ASA statement on p-values

There is widespread irreproducibility across fields

Results from a survey of 1,576 researchers

Source: Baker 2016, Nature

Are referees more likely to give red cards to soccer players with darker skin tone?

29 different analyses, 21 unique combinations of covariates

Odds ratios ranged from 0.89 to 2.93 (median 1.31)

Examples of PPV

Source:  Ioannidis 2005

Average statistical power in neuroscience

Median statistical power is 21%

Source: Button et al. 2013

Example: Genome-wide association study

$$ PPV = \frac{[1-\beta]R+u\beta R}{R+\alpha-\beta R+u-u\alpha+u\beta R} $$

Dataset: 100,000 polymorphisms

Power: 60%

Pre-study odds (R): 10/100,000

$$ \alpha = 0.05 $$

The positive predictive value is low even for a reasonably powered study with minimal or no bias:

Without bias (u = 0): PPV = 0.0012

With bias (u = 0.1): PPV = 0.00044

Source: Ioannidis 2005
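Both numbers follow directly from the PPV formula; a quick check using the values stated above:

def ppv(power, R, u, alpha):
    beta = 1 - power
    return (power * R + u * beta * R) / (R + alpha - beta * R + u - u * alpha + u * beta * R)

R = 10 / 100_000   # 10 true associations expected among 100,000 polymorphisms
print(ppv(power=0.60, R=R, u=0.0, alpha=0.05))   # ~0.0012
print(ppv(power=0.60, R=R, u=0.1, alpha=0.05))   # ~0.00044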

Problem 3: p-values depend on data that weren't observed

Source: Wagenmakers 2007

Suppose we have two sampling distributions, f(x) and g(x), that assign the same probability to the observed value but different probabilities to more extreme values that were never observed.

Observation: x = 5

Under f(x): p-value = 0.03 + 0.01 = 0.04 -> reject the null
Under g(x): p-value = 0.03 + 0.03 = 0.06 -> fail to reject the null

"What the use of P implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred" (Jeffreys, 1961)
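A toy numerical version (the two distributions below are made up to match the stated tail probabilities):

# Two hypothetical discrete sampling distributions over x = 0..6 that assign
# the same probability to the observed value x = 5 but differ at the
# never-observed extreme x = 6
f = {0: 0.05, 1: 0.15, 2: 0.30, 3: 0.26, 4: 0.20, 5: 0.03, 6: 0.01}
g = {0: 0.05, 1: 0.15, 2: 0.30, 3: 0.26, 4: 0.18, 5: 0.03, 6: 0.03}

x_obs = 5
p_f = sum(prob for x, prob in f.items() if x >= x_obs)   # 0.04 -> reject
p_g = sum(prob for x, prob in g.items() if x >= x_obs)   # 0.06 -> fail to reject
print(p_f, p_g)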

Advice from Rob Kass:

When Possible, Replicate!

When statistical inferences, such as p-values, follow extensive looks at the data, they no longer have their usual interpretation. Ignoring this reality is dishonest: it is like painting a bull’s eye around the landing spot of your arrow...

 

The only truly reliable solution to the problem posed by data snooping is to record the statistical inference procedures that produced the key results, together with the features of the data to which they were applied, and then to replicate the same analysis using new data.

 

 

Source: Kass et al. 2016

Preregister if possible

Source: Data Colada