Comparing Classical and Bayesian Approaches to Hypothesis Testing James O. Berger Institute of Statistics and Decision Sciences Duke University www.stat.duke.edu Outline The apparent overuse of hypothesis testing When is point null testing needed? The misleading nature of P-values Bayesian and conditional frequentist testing of plausible hypotheses

Advantages of Bayesian testing Conclusions I. The apparent overuse of hypothesis testing Tests are often performed when they are irrelevant. Rejection by an irrelevant test is sometimes viewed as license to forget statistics in further analysis Prototypical example Habitat Type A B C

D E F Rank 1 2 3 4 5 6 Observed Hypothesis Usage 3.8 3.6

H0 : "mean usage is 2.8 equal for all habitats" 1.8 Rejected (P<.025) 1.5 0.7 Statistical mistakes in the example The hypothesis is not plausible; testing serves no purpose. The observed usage levels are given without confidence sets. The rankings are based only on observed means, and are given without uncertainties. (For instance, perhaps Pr (A>B)=0.6 only.)

Prototypical example Habitat Type A B C D E F Rank 1 2 3 4 5 6

Observed Hypothesis Usage 3.8 3.6 H0 : "mean usage is 2.8 equal for all habitats" 1.8 Rejected (P<.025) 1.5 0.7 Statistical mistakes in the example The hypothesis is not plausible; testing serves no purpose.
The observed usage levels are given without confidence sets. The rankings are based only on observed means, and are given without uncertainties. (For instance, perhaps Pr (A>B)=0.6 only.) Prototypical example Habitat Type A B C D E F Rank

1 2 3 4 5 6 Observed Hypothesis Usage 3.8 3.6 H0 : "mean usage is 2.8 equal for all habitats" 1.8 Rejected (P<.025)
1.5 0.7 II. When is testing of a point null hypothesis needed? Answer: When the hypothesis is plausible, to some degree. Note that, while H 0 : 0 is typically not plausible, it is a good approximation to H 0 :| | , as long as < (4 n ) (assuming n Gaussian observations with standard deviation ). Examples of hypotheses that are not realistically plausible H0: small mammals are as abundant on livestock grazing land as on non-grazing land

H0: survival rates of brood mates are independent H0: bird abundance does not depend on the type of forest habitat they occupy H0: cottontail choice of habitat does not depend on the season Examples of hypotheses that may be plausible, to at least some degree: H0: Males and females of a species are the same in terms of characteristic A. H0: Proximity to logging roads does not affect ground nest predation. H0: Pollutant A does not affect Species B. III. For plausible hypotheses, P-values are misleading as measures of evidence Example: Experimental drugs D1, D2, D3, . . .

are to be tested. Each Test: H0: Di has negligible effect H1: Di is effective Typical Bayesian Answer: The probability that H0 is true is 0.06. Classical Answer (P-value): If H0 were true, the probability of observing hypothetical data as or more "extreme" than the actual data is 0.06. DRUG D1 D2 D3 D4 D5 D6 P-VALUE 0.41 0.04 0.32 0.94 0.01 0.28

DRUG D7 D8 D9 D10 D11 D12 P-VALUE 0.11 0.05 0.65 0.009 0.09 0.66 Question: How strongly do we believe that Drug i has a nonnegligible effect when (i) the P-value is approximately 0.05? (ii) the P-value is approximately 0.01? A Surprising Fact: Suppose it is known that, apriori, about 50% of the Drugs will have negligible effect. Then, (i) of the Drugs for which the P-value 0.05, at least 25% (and typically over 50%)

will have negligible effect; (ii) of the Drugs for which the P-value 0.01, at least 7% (and typically over 15%) will have negligible effect. DRUG D1 D2 D3 D4 D5 D6 P-VALUE 0.41 0.04 0.32 0.94 0.01 0.28 DRUG D7 D8 D9

D10 D11 D12 P-VALUE 0.11 0.05 0.65 0.009 0.09 0.66 Question: How strongly do we believe that Drug i has a nonnegligible effect when (i) the P-value is approximately 0.05? (ii) the P-value is approximately 0.01? A Surprising Fact: Suppose it is known that, apriori, about 50% of the Drugs will have negligible effect. Then, (i) of the Drugs for which the P-value 0.05, at least 25% (and typically over 50%) will have negligible effect; (ii) of the Drugs for which the P-value 0.01, at least 7% (and typically over 15%) will have negligible effect.

DRUG D1 D2 D3 D4 D5 D6 P-VALUE 0.41 0.04 0.32 0.94 0.01 0.28 DRUG D7 D8 D9 D10 D11 D12 P-VALUE 0.11 0.05 0.65 0.009 0.09 0.66 Question: How strongly do we believe that

Drug i has a nonnegligible effect when (i) the P-value is approximately 0.05? (ii) the P-value is approximately 0.01? IV. Bayesian testing of point hypotheses Data and Model: X has density f ( x| ) Example: X # of eggs hatched out of n eggs in a recently polluted area (so f is binomial, and is the true proportion that would hatch). To Test: H 0 : 0 versus H1 : 0 Example: 0 is the historically known proportion of eggs that hatch in the area The prior distribution Let P1 and P2 be the prior probabilities of H1 and H 2 . (The usual default choice is P1 P2 0.5.) Under H1 , let ( ) be the density representing

information concerning the location of . (The usual default choice for the binomial problem is ( ) 1.) Note: There are two schools of Bayesian statistics, the subjective school, where the prior distribution reflects real extraneous information, and the objective school, where the prior is chosen in a default fashion. Posterior probability that H0 is true, given the data (from Bayes theorem): Pr( H 0 | data x ) P0 f ( x|0 ) P0 f ( x|0 ) P1 f ( x| ) ( ) d

{ 0} 1 0 x 1 0 x n Beta ( x 1, n x 1) (for the binomial testing problem) ( 1)

Note: Some prefer to use the Bayes Factor (or weighted likelihood ratio) of H 0 to H1 , B f ( x|0 ) f ( x| ) ( ) d { 0} likelihood of data under H 0 = , " average" likelihood of data under H1 since this does not involve prior probabilties of the H i . Example: Suppose x=40 eggs hatched out of n=100.

Then Pr( H 0 | data x ) 0.52 and B 0.92. (Here a classical test would yield P value 0.05.) Conditional frequentist interpretation of the posterior probability of H0 Pr( H 0 | data x ) is also the frequentist type I error probability , conditional on observing data of the same "strength of evidence" as the actual data x. (The classical type I error probability makes the mistake of reporting the error averaged over data of very different strengths.) V. Advantages of Bayesian testing Pr (H0 | data x) reflects real expected error rates: P-values do not. A default formula exists for all situations: ( 1)

* * * f ( x , ) f ( x , )

f ( x , ) dx d 0 , Pr( H 0 | data x ) 1 *

f ( x , ) f ( x , ) d 0

where x * is independent (unobserved) data of the smallest size such that the above integrals exist. Posterior probabilities allow for incorporation of personal opinion, if desired. Indeed, if the published default posterior probability of H0 is P*, and your prior probability of H0 is P0, then your posterior probability of H0 is ( 1) 1 1 Pr( H 0 | data x ) 1 1 * 1

P P 0 * Example: In the binomial example, recall P 0.52. A "skeptic" has P0 01 . ; hence Pr( H 0 | data x ) 011 . . A " believer" has P0 0.9; hence Pr( H 0 | data x ) 0.91. Posterior probabilities are not affected by the

reason for stopping experimentation, and hence do not require rigid experimental designs (as do classical testing measures). Posterior probabilities can be used for multiple models or hypotheses. Example: H 0 : pollutant A has no effect on species B H1 : pollutant A decreases abundance of species B H 2 : pollutant A increases abundance of species B Pr( H 0 | data ) .30, Pr( H1 | data ) .68, Pr( H 2 | data ) .02 An aside: integrating science and statistics via the Bayesian paradigm Any scientific question can be asked (e.g., What is the probability that switching to management plan A will increase species abundance by 20% more than will plan B?) Models can be built that simultaneously incorporate

known science and statistics. If desired, expert opinion can be built into the analysis. Conclusions Hypothesis testing is overutilized while (Bayesian) statistics is underutilized. Hypothesis testing is needed only when testing a plausible hypothesis (and this may be a rare occurrence in wildlife studies). The Bayesian approach to hypothesis testing has considerable advantages in terms of interpretability (actual error rates), general applicability, and flexible experimentation.