Patrick Beukema
Ph.D.
The Dead Salmon of Systems Neuroscience
Experiment: fMRI analysis of emotional valence.
Analysis: Standard General Linear Model
Subject: North Atlantic salmon (deceased), n = 1
Study findings: Significant activation related to emotional valence
Standard practice can be flawed
Every year $100+ billion is spent on biomedical research
Generating roughly 400,000 scientific papers
More than 50% of research findings are likely false
Reproducibility rates in many fields are as low as 25%
Sources: Chakma 2014, nsf.gov, Ioannidis 2005, Stodden 2018
The Reproducibility Crisis in Science
"I realize that while I think I succeeded at getting lots of really cool papers published by cool scientists at fairly large costs—I think $20 billion—I don’t think we moved the needle in reducing suicide, reducing hospitalizations, improving recovery for the tens of millions of people who have mental illness." - Tom Insel, Director NIMH
Pitfall 1: Misuse of p-values
Since the 1960s, p-values have been the standard criterion for establishing biomedical research claims
Source: Fisher 1926, Wikipedia
How are p-values used in biomedical research?
"Either there is something in the treatment, or a coincidence has occurred such as does not occur more than once in twenty trials."
The American Statistical Association
recognized widespread misuse of null hypothesis testing
Source: Wasserstein et al. 2016
How should p-values be used?
Source: Wasserstein et al. 2016
The sampling plan is determined by the subjective intentions of the researcher
Source: Kruschke 2012
Change in intentions → change in critical value → change in conclusion
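One concrete consequence of researcher intentions (Kruschke 2012): a plan to "collect data until p < .05" changes the sampling distribution, and with it the real Type I error rate. A minimal simulation sketch (the sample sizes and stopping rule are illustrative assumptions) of such optional stopping under a true null:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def peeking_rejects(max_n=100, start_n=10, alpha=0.05):
    """One study under a true null where the researcher tests after
    every new observation and stops as soon as p < alpha."""
    data = rng.normal(0, 1, max_n)  # H0 is true: population mean is 0
    for n in range(start_n, max_n + 1):
        if stats.ttest_1samp(data[:n], 0).pvalue < alpha:
            return True  # declare "significance" and stop collecting
    return False

# Across many such studies the realized Type I error rate is far
# above the nominal 5%, purely because of the stopping intention.
false_positive_rate = np.mean([peeking_rejects() for _ in range(1000)])
```

The fixed-n analysis the p-value assumes was never the analysis that was run, which is exactly why the intention, not just the data, matters.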
It is not always transparent how to correctly use p-values
Experiment B:
p-value = 0.04
n = 10
How should p-values be used to quantify evidence?
Source: Wagenmakers 2007
Experiment F:
p-value = 0.04
n = 100
?
H0: No difference between groups
HA: A difference between groups
The p-value depends on a pre-specified sampling plan
Source: Wagenmakers 2007
Hypothetical replications determine the sampling distribution
$$ P(t(y^{\text{rep}}) > t(y_{\text{obs}}) \mid H_0) $$
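This tail probability can be read as a simulation recipe: generate hypothetical replications under H0, and count how often the replicated test statistic is as extreme as the observed one. A self-contained sketch with assumed illustrative data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Observed data (illustrative): 20 draws with a true mean of 0.5
y_obs = rng.normal(0.5, 1, 20)
t_obs = stats.ttest_1samp(y_obs, 0).statistic

# Hypothetical replications under H0 (true mean 0) determine the
# sampling distribution of the test statistic.
t_rep = np.array([stats.ttest_1samp(rng.normal(0, 1, 20), 0).statistic
                  for _ in range(20000)])

p_mc = np.mean(np.abs(t_rep) >= abs(t_obs))      # two-sided tail area
p_analytic = stats.ttest_1samp(y_obs, 0).pvalue  # agrees up to MC error
```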
Pitfall 2: Flexible study designs
Source: Gelman and Loken 2013
Single test based on the data, but a different test would have been performed given different data
Test-statistic
Simple classical test based on a unique test statistic
Classical test pre-chosen from a set of possible tests
"p-hacking": performing J tests and then reporting the best result
p-hacking
Flexible study designs
def analysis_1():
    p_value = experiment(parameters_1)
    return p_value < alpha

def analysis_2():
    p_value = experiment(parameters_2)
    return p_value < alpha

# ...

def analysis_n():
    p_value = experiment(parameters_n)
    return p_value < alpha

# write paper
candidates = {'param_1': [...],
              'param_2': [...],
              'param_n': [...]}
all_params = list(ParameterGrid(candidates))

p_value = np.inf
alpha = 0.05
n = 0
while p_value > alpha:
    # Exhaustive grid search over analysis choices
    parameters = all_params[n]
    p_value = experiment(parameters)
    n += 1

# write paper
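The loop above will eventually "succeed" even when every null hypothesis is true. A self-contained simulation (hypothetical two-sample experiments with no real effects anywhere) shows how searching over many analyses inflates the chance of at least one false positive toward 1 − (1 − α)^J:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha, n_analyses, n_sims = 0.05, 20, 2000

hits = 0
for _ in range(n_sims):
    # 20 independent "analyses" of pure-noise data: no true effects
    p_values = [stats.ttest_ind(rng.normal(0, 1, 30),
                                rng.normal(0, 1, 30)).pvalue
                for _ in range(n_analyses)]
    hits += min(p_values) < alpha  # report the best result

family_fp_rate = hits / n_sims
# Close to 1 - (1 - 0.05)**20, i.e. roughly 0.64: most "papers"
# contain a significant finding even though nothing is real.
```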
Pitfall 2: Flexible study designs
Many sources contribute to flexible research designs
50% had selectively reported only studies that 'worked'
58% had peeked at the results
43% threw out data after checking its impact on the p-value
35% reported unexpected findings as if predicted from start
Source: John, Loewenstein, and Prelec 2012
How often does this happen?
Examples
Flexible study designs in Deep Learning
"Graduate Student Descent"
Source: Lucic et al. 2018, NeurIPS
Choice of hyperparameters determines performance
Flexible study designs present statistical challenges
The "Garden of Forking Paths"
Source: Bishop 2016
The importance of pre-study odds (R)
Identical p-values but different conclusions
Source: Nuzzo 2014
What is the positive predictive value of a research finding?
Source: Ioannidis 2005
PPV = f(Bias, Pre-study odds, Significance level, Power)
PPV = Positive Predictive Value
What is the average statistical power?
“When effect size is tiny and measurement error is huge, you’re essentially trying to use a bathroom scale to weigh a feather—and the feather is resting loosely in the pouch of a kangaroo that is vigorously jumping up and down.”
- Andrew Gelman
Source: Gelman 2015 (drawing by Viktor Beekman)
Source: Nord et al. 2017, Button et al. 2013
Statistical power in neuroscience
Median statistical power is 21% (varies by subfield)
Source: Button et al. 2013
$$ PPV = \frac{[1-\beta]R + u\beta R}{R + \alpha - \beta R + u - u\alpha + u\beta R} $$
$$ [1-\beta] $$ = Power
$$ R $$ = Pre-study odds
$$ u $$ = Bias (report any result as true)
$$ \alpha $$ = Significance level
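The formula and legend above translate directly into a small helper; a sketch of the Ioannidis (2005) PPV:

```python
def ppv(power, R, alpha=0.05, u=0.0):
    """Positive predictive value of a claimed finding (Ioannidis 2005).
    power = 1 - beta, R = pre-study odds, u = bias."""
    beta = 1 - power
    return (power * R + u * beta * R) / (
        R + alpha - beta * R + u - u * alpha + u * beta * R)

# Well-powered test of a plausible hypothesis, no bias:
high = ppv(0.80, R=1.0)    # about 0.94
# Under-powered test of a long-shot hypothesis:
low = ppv(0.20, R=0.1)     # about 0.29
```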
What is the positive predictive value of a research finding?
Source: Ioannidis 2005
[Figure: PPV (%) as a function of pre-study odds, for 80% power and 20% power]
What is the positive predictive value of a research finding?
$$ \alpha = 0.05 $$
If average positive predictive values are low → most results should not be reproducible
Bias
Source: Fanelli et al. 2017
Misuse of significance testing (Pitfall 1) + flexible study designs (Pitfall 2) → selective reporting
Evidence of bias across many fields
Many fields exhibit low reproducibility
Sources: OSC 2015, Stodden 2018, Prinz 2011, Begley and Ellis 2012, Gundersen 2018
Computational Neuroscience: 24%
Cancer Biology: 11%
Psychology: 30%
Pharmacology: 7%
Machine Learning and AI: 24%
Reproducibility in Pharmacology
Source: Prinz et al. 2011, Nature
n = 67 papers
7% reproducibility rate
Source: Gundersen 2017 AAAI
n = 400 papers
20-30% reproducibility rate
Reproducibility in Machine Learning and AI
Source: Joelle Pineau 2018 NeurIPS
n = 50 papers (NeurIPS, ICML, ICLR) 2018
20% reproducibility rate
Reproducibility in Machine Learning and AI
Hedging against academic risk
Source: Osherovich 2011, Nature
The “unspoken rule” among early stage VCs is that at least 50% of published studies, even those in top-tier academic journals, “can’t be repeated with the same conclusions by an industrial lab."
Psychology Reproducibility
Source: Open Science Collaboration, Science 2015
n = 100 papers
35% reproducibility rate
[Figure: distributions of p-values and effect sizes, original vs. replication studies]
Source: Stodden et al. 2018
n = 204 papers (from Science magazine)
26% reproducibility rate
Computational Neuroscience Reproducibility
Preclinical Oncology Reproducibility
Source: Begley and Ellis, 2012
Irreproducible findings continue to be highly cited
n = 53 oncology papers
11% reproducibility rate
Flexible analyses + low statistical power = widespread irreproducibility
Three major strategies to do better
1. More stringent significance thresholds, using null hypothesis testing correctly (ASA).
2. Move away from bright lines, using alternative measures of evidence: effect sizes, confidence intervals, Bayes factors, ...
3. Teach data science fundamentals in academia. Share data/code/computing environments.
1. Move threshold from 0.05 to 0.005
Pros:
Cons:
Source: Benjamin et al. 2017
How will changing the threshold affect the positive predictive value?
$$ PPV = \frac{[1-\beta]R+u\beta R}{R+\alpha-\beta R+u-u\alpha+u\beta R} $$
Substantial controversy over this proposal.
[Figure: PPV vs. pre-study odds at α = 0.005 and α = 0.05, for bias u = 0.05, 0.20, 0.50, 0.80]
"The solution is not to reform p-values or to replace them with some other statistical summary or threshold, but rather to move toward a greater acceptance of uncertainty and embracing of variation.” - Gelman
2. Bayes Factors offer one possible alternative to bright-lines.
$$ \text{Bayes factor} = \frac{P(D \mid M_A)}{P(D \mid M_B)} $$
Pros:
Cons:
Source: Wagenmakers 2007
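Wagenmakers (2007) suggests sidestepping explicit priors by approximating the Bayes factor from BIC values, BF ≈ exp(ΔBIC / 2). A sketch with hypothetical regression data (the data and model names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 50)
y = 2.0 * x + rng.normal(0, 0.5, 50)   # data with a genuine slope

def bic(y, y_hat, k):
    """BIC for a Gaussian model with k mean parameters
    (additive constants shared by both models are dropped)."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    return n * np.log(rss / n) + k * np.log(n)

bic_a = bic(y, np.full_like(y, y.mean()), k=1)   # M_A: intercept only
coef = np.polyfit(x, y, 1)
bic_b = bic(y, np.polyval(coef, x), k=2)         # M_B: intercept + slope

# BIC approximation to the Bayes factor for M_B over M_A:
bf_ba = np.exp((bic_a - bic_b) / 2)
```

Unlike a p-value, the Bayes factor quantifies relative evidence for the two models and can also favor the null.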
3. Teach data science fundamentals in academia
3 Core Principles
Testing, Out of sample prediction, Reproducible compute
Improve reproducibility by teaching best practices in data science
Academic research projects are data science projects
Source: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/discriminant_analysis.py#L693
Testing
Test statistical assumptions automatically
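One way to test assumptions automatically is to wrap the analysis in guard checks that fail loudly instead of silently returning a p-value. A minimal sketch (the helper name and thresholds are illustrative, not a standard API):

```python
import numpy as np
from scipy import stats

def checked_ttest(a, b, alpha=0.05):
    """Two-sample t-test that refuses to run if its usual
    assumptions look violated (illustrative guard, not a real API)."""
    for name, sample in [("a", a), ("b", b)]:
        # Shapiro-Wilk: approximate normality of each sample
        assert stats.shapiro(sample).pvalue > alpha, f"{name} non-normal"
    # Levene: roughly equal variances across groups
    assert stats.levene(a, b).pvalue > alpha, "unequal variances"
    return stats.ttest_ind(a, b)

# Idealized normal-shaped samples pass the guards:
q = stats.norm.ppf(np.linspace(0.01, 0.99, 40))
result = checked_ttest(q, q + 0.5)
```

Checks like these can run in a project's test suite, so an assumption violation breaks the build rather than the paper.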
"With four parameters I can fit an elephant, and with five I can make him wiggle his trunk."
- John von Neumann
Source: Dyson 2004
Out of sample inferences
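The point of out-of-sample prediction in one sketch: with many candidate predictors, in-sample fit flatters the model, and only held-out data gives an honest estimate of generalization (simulated data, names illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 50))            # 50 candidate predictors
y = 2.0 * X[:, 0] + rng.normal(size=200)  # only the first one is real

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

r2_train = r2_score(y_tr, model.predict(X_tr))  # inflated by overfitting
r2_test = r2_score(y_te, model.predict(X_te))   # honest generalization
```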
Reproducible compute
paper
data
software
environment
Scientist 1
Scientist 2
Container
Thank you for listening!
slides: pbeukema.github.io/science_aims
Additional Resources
Additional Resources
The ASA statement on p-values
1. P-values can indicate how incompatible the data are with a specified statistical model.
2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
Source: ASA statement
4. Proper inference requires full reporting and transparency.
5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
Source: ASA statement
The ASA statement on p-values
Source: Baker 2016, Nature
There is widespread irreproducibility across fields
Results from a survey of 1,576 researchers
Source: Silberzahn et al. 2018
Are referees more likely to give red cards to dark-skin-toned professional soccer players?
29 different analyses, 21 unique combinations of covariates
Odds Ratio: 0.89 to 2.93 [median of 1.31]
Examples of PPV
Source: Ioannidis 2005
Average statistical power in neuroscience
Median statistical power is 21%
Source: Button et al. 2013
Example: Genome-wide association study
Dataset: 100,000 polymorphisms
Power: 60%
Pre-study odds (R): 10/100,000
$$ \alpha = 0.05 $$
Positive predictive value is low even for reasonably powered study and minimal/no bias
Without bias
u=0
PPV = .0012
With bias
u=0.1
PPV = .00044
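Both numbers on this slide follow directly from the Ioannidis PPV formula; a quick check:

```python
def ppv(power, R, alpha=0.05, u=0.0):
    # Ioannidis (2005) positive predictive value; u = bias
    beta = 1 - power
    return (power * R + u * beta * R) / (
        R + alpha - beta * R + u - u * alpha + u * beta * R)

R = 10 / 100_000                      # 10 true associations per 100,000
ppv_no_bias = ppv(0.60, R)            # ~0.0012
ppv_with_bias = ppv(0.60, R, u=0.1)   # ~0.00044
```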
Source: Ioannidis 2005
Problem 3: p-values depend on data that weren't observed
Source: Wagenmakers 2007
Observation: x = 5
Under f(x): p-value = 0.03 + 0.01 = 0.04 → reject null
Under g(x): p-value = 0.03 + 0.03 = 0.06 → fail to reject null
"What the use of P implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred" (Jeffreys, 1961)
Suppose we have the following two sampling distributions
Advice from Rob Kass:
When Possible, Replicate!
When statistical inferences, such as p-values, follow extensive looks at the data, they no longer have their usual interpretation. Ignoring this reality is dishonest: it is like painting a bull’s eye around the landing spot of your arrow...
The only truly reliable solution to the problem posed by data snooping is to record the statistical inference procedures that produced the key results, together with the features of the data to which they were applied, and then to replicate the same analysis using new data.
Source: Kass et al. 2016
Preregister if possible
Source: Data Colada