Binary Logistic Regression
  "To be or not to be, that is the question." (William Shakespeare, Hamlet)

Binary Logistic Regression
  Also known as logistic or, sometimes, logit regression
  Foundation from which more complex models are derived (e.g., multinomial regression and ordinal logistic regression)

Dichotomous Variables
  Two categories indicating whether an event has occurred or some characteristic is present
  Sometimes called binary or binomial variables

Dichotomous DVs
  Placed in foster care or not
  Diagnosed with a disease or not
  Abused or not
  Pregnant or not
  Service provided or not

Single (Dichotomous) IV Example
  DV = continue fostering, 0 = no, 1 = yes
    Customary to code the category of interest 1 and the other category 0
  IV = married, 0 = not married, 1 = married
  N = 131 foster families
  Are two-parent families more likely to continue fostering than one-parent families?

Crosstabulation
  Table 2.1
  Relationship between marital status and continuation is statistically significant [χ²(1, N = 131) = 5.65, p = .017]
  A higher percentage of two-parent families (62.20%) than one-parent families (40.82%) planned to continue fostering

Strength & Direction of Relationships
  Different ways to quantify the relationship between IV(s) and DV
    Probabilities
    Odds
    Odds ratio (OR)

      Also abbreviated as e^B, Exp(B) (on SPSS output), or exp(B)
    % change

Roadmap to Computations
  Probabilities
  Odds: p / (1 - p)
  Odds ratios: Odds(1) / Odds(0)
  % change: 100(OR - 1)

Probabilities
  Percentages in Table 2.1 expressed as probabilities (e.g., 62.20% as .6220)
  p: probability that the event will occur (continue)
    e.g., probability that one-parent families plan to continue is .4082
  1 - p: probability that the event will not occur (not continue)
    e.g., probability that one-parent families do not plan to continue is .5918 (1 - .4082)

Odds
  Ratio of the probability that the event will occur to the probability that it will not
    e.g., odds of continuation for one-parent families are .69 (.4082 / .5918)
  odds = p / (1 - p)
  Can range from 0 to positive infinity

Probabilities and Odds
  Table 2.2
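As a minimal sketch of the probability-to-odds step, the probabilities quoted from Table 2.1 can be converted to odds as follows. Python is used only for illustration, and the function name odds_from_probability is hypothetical rather than part of the lecture materials.

```python
def odds_from_probability(p: float) -> float:
    """Return the odds p / (1 - p) for a probability p with 0 <= p < 1."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1.0 - p)


# Probabilities of continuing, as quoted from Table 2.1 in the slides.
p_one_parent = 0.4082
p_two_parent = 0.6220

odds_one_parent = odds_from_probability(p_one_parent)
odds_two_parent = odds_from_probability(p_two_parent)

print(f"one-parent odds of continuing: {odds_one_parent:.4f}")
print(f"two-parent odds of continuing: {odds_two_parent:.4f}")
```

With the slide values this gives odds of about .69 for one-parent families and about 1.65 for two-parent families. A probability of .50 would give odds of exactly 1, the point at which both outcomes are equally likely.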

Odds
  Odds = 1: both outcomes equally likely
  Odds > 1: probability that the event will occur is greater than the probability that it will not
  Odds < 1: probability that the event will occur is less than the probability that it will not

Odds Ratio (OR)
  Odds of the event for one value of the IV (two-parent families) divided by the odds for a different value of the IV, usually a value one unit lower (one-parent families)
    e.g., odds of continuing for two-parent families are more than double the odds for one-parent families
    OR = 1.6455 / .6898 = 2.39

OR (contd)
  Plays a central role in quantifying the strength and direction of relationships between IVs and DVs in binary, multinomial, and ordinal logistic regression
  OR < 1 indicates a negative relationship
  OR > 1 indicates a positive relationship
  OR = 1 indicates no linear relationship

ORs > 1
  e.g., OR of 2.39
    A one-unit increase in the independent variable increases the odds of continuing by a factor of 2.39
    The odds of continuing are 2.39 times higher for two-parent compared to one-parent families

ORs < 1
  e.g., OR = .50
    A one-unit increase in the independent variable decreases the odds of continuing by a factor of .50
    The odds that two-parent families will continue are .50 (or one-half) of the odds that one-parent families will continue

ORs < 1 (contd)
  Compute the reciprocal (i.e., 1 / .50 = 2.00)
  Express the relationship in terms of the opposite of the event of interest (e.g., discontinuing)
    A one-unit increase in the independent variable increases the odds of discontinuing by a factor of 2.00
    The odds that two-parent families will discontinue are 2.00 times (or twice) the odds of one-parent families

OR to Percentage Change
  % change = 100(OR - 1)
  Alternative way to express the OR
    e.g., a one-unit increase in the independent variable increases the odds of continuing by 139.00%: 100(2.39 - 1) = 139.00
    e.g., a one-unit increase in the independent variable decreases the odds of continuing by 50.00%: 100(.50 - 1) = -50.00

Comparing OR > 1 and OR < 1
  Compute the reciprocal of one of the ORs
    e.g., an OR of 2.00 and an OR of .50
    Reciprocal of .50 is 2.00 (1 / .50 = 2.00)
    ORs are equal in size (but not in direction of the relationship)

Qualitative Descriptors for OR
  Table 2.3
  Use cautiously with IVs that aren't dichotomous

Question & Answer
  Are two-parent families more likely to continue fostering than one-parent families?
  Yes. The odds of continuing are 2.39 times (139%) higher for two-parent compared to one-parent families. The probability of continuing is .41 for one-parent families and .62 for two-parent families.

Binary Logistic Regression Example
  DV = continue fostering, 0 = no, 1 = yes
    Customary to code the category of interest 1 and the other category 0

  IV = married, 0 = not married, 1 = married
  N = 131 foster families
  Are two-parent families more likely to continue fostering than one-parent families?

Statistical Significance
  Table 2.4
  Relationship between marital status and continuation is statistically significant (Wald χ² = 5.544, p = .019)

Direction of Relationship
  B = slope
  Positive slope: positive relationship, OR > 1
  Negative slope: negative relationship, OR < 1
  0 slope: no linear relationship, OR = 1
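The link between the odds ratio, the percentage change, and the sign of the slope can be sketched in a few lines. This is an illustration only: the helper names (odds_ratio, percent_change, direction) are hypothetical, the odds are the values quoted in the slides, and the slope is recovered from the relation Exp(B) = OR.

```python
import math


def odds_ratio(odds_group: float, odds_reference: float) -> float:
    """Odds for one value of the IV divided by the odds for the reference value."""
    return odds_group / odds_reference


def percent_change(or_value: float) -> float:
    """Percentage change in the odds implied by an odds ratio: 100(OR - 1)."""
    return 100.0 * (or_value - 1.0)


def direction(or_value: float) -> str:
    """Direction of the relationship implied by an odds ratio."""
    if or_value > 1.0:
        return "positive"
    if or_value < 1.0:
        return "negative"
    return "no linear relationship"


# Odds of continuing quoted in the slides: two-parent, then one-parent families.
or_married = odds_ratio(1.6455, 0.6898)
slope_married = math.log(or_married)  # since Exp(B) = OR, B is the natural log of the OR

print(f"OR = {or_married:.2f}, B = {slope_married:.3f}, direction: {direction(or_married)}")
print(f"% change for an OR of 2.39: {percent_change(2.39):.2f}")
print(f"% change for an OR of .50: {percent_change(0.50):.2f}")
print(f"reciprocal of an OR of .50: {1 / 0.50:.2f}")
```

A positive slope always corresponds to an OR above 1 and a negative slope to an OR below 1, because the exponential function maps 0 to 1 and preserves order.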

Direction/Strength of Relationship
  Positive relationship between marital status and continuation
    Two-parent families are more likely to continue
  B = .869
  Exp(B) = OR = 2.385
  % change = 100(2.385 - 1) = 139%
  The odds of continuing are 2.39 times (139%) higher for two-parent compared to one-parent families

Roadmap to Computations
  Logits: ln(p / (1 - p)) = L
    L is short for ln(p / (1 - p))
  Odds: e^L
  Odds ratios: Odds(1) / Odds(0)
  % change: 100(OR - 1)
  Probabilities: e^L / (1 + e^L)

Binary Logistic Regression Model
  ln(π / (1 - π)) = α + β1X1 + β2X2 + ... + βkXk, or
  ln(π / (1 - π)) = η
  π is the probability of the event
  η (eta) is the abbreviation for the linear predictor (right-hand side of this equation)
  k = number of independent variables

Logit Link
  ln(π / (1 - π))
  Log of the odds that the DV equals 1 (event occurs)
  Connects (i.e., links) the DV to the linear combination of IVs

Estimated Logits (L)
  ln(p / (1 - p)) = a + B1X1 + B2X2 + ... + BkXk
  ln(p / (1 - p))
    Log of the odds that the DV equals 1 (event occurs)
  Estimated logit, L
    Does not have intuitive or substantive meaning

    Useful for examining curvilinear relationships and interaction effects
    Primarily useful for estimating probabilities, odds, and ORs

Estimated Logits (L)
  L(Continue) = a + (B_Married)(X_Married)
  L(Continue) = -.372 + (.869)(X_Married)
  a = intercept
  B = slope

Logit to Odds
  If L = 0: Odds = e^L = e^0 = 1.00
  If L = .50: Odds = e^L = e^.50 = 1.65
  If L = 1.00: Odds = e^L = e^1.00 = 2.72

Logits to Odds (contd)
  Table 2.4
  One-parent families
    L(Continue) = -.372 = -.372 + (.869)(0)
    Odds of continuing = e^-.372 = .69
  Two-parent families
    L(Continue) = .497 = -.372 + (.869)(1)
    Odds of continuing = e^.497 = 1.65

Odds to OR
  OR = 1.65 / .69 = 2.39, or
  e^.869 = 2.39, labeled Exp(B)
  Table 2.4

OR to Percentage Change
  % change = 100(OR - 1)
    e.g., a one-unit increase in the independent variable increases the odds of continuing by 139.00%: 100(2.39 - 1) = 139.00
    e.g., a one-unit increase in the independent variable decreases the odds of continuing by 50.00%: 100(.50 - 1) = -50.00
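The chain from estimated logits to odds, odds ratio, percentage change, and probabilities can be sketched as below. This is an illustrative sketch, not SPSS output: the coefficients are the rounded values quoted from Table 2.4, and the function names are hypothetical. Because the coefficients are rounded, the results may differ from the slide values in the last decimal place.

```python
import math

# Coefficients quoted from Table 2.4 in the slides (rounded to three decimals).
INTERCEPT = -0.372
SLOPE_MARRIED = 0.869


def estimated_logit(married: int) -> float:
    """L(Continue) = a + B * X, with X = 0 (one-parent) or 1 (two-parent)."""
    return INTERCEPT + SLOPE_MARRIED * married


def odds_from_logit(logit: float) -> float:
    """Odds = e^L."""
    return math.exp(logit)


def probability_from_logit(logit: float) -> float:
    """p = e^L / (1 + e^L)."""
    odds = math.exp(logit)
    return odds / (1.0 + odds)


results = {}
for label, married in (("one-parent", 0), ("two-parent", 1)):
    logit = estimated_logit(married)
    results[label] = (logit, odds_from_logit(logit), probability_from_logit(logit))
    print(f"{label}: L = {logit:.3f}, odds = {results[label][1]:.2f}, p = {results[label][2]:.2f}")

# The ratio of the two odds equals e^B, which SPSS labels Exp(B).
odds_ratio = results["two-parent"][1] / results["one-parent"][1]
print(f"OR = {odds_ratio:.3f}, % change = {100 * (odds_ratio - 1):.1f}")
```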

Logits to Probabilities
  p(Continue) = e^L / (1 + e^L)
  One-parent families, L(Continue) = -.372
    p(Continue) = e^-.372 / (1 + e^-.372) = .69 / 1.69 = .41
  Two-parent families, L(Continue) = .497
    p(Continue) = e^.497 / (1 + e^.497) = 1.65 / 2.65 = .62

Question & Answer
  Are two-parent families more likely to continue fostering than one-parent families?
  Yes. The odds of continuing are 2.39 times (139%) higher for two-parent compared to one-parent families. The probability of continuing is .41 for one-parent families and .62 for two-parent families.

Single (Quantitative) IV Example

  DV = continue fostering, 0 = no, 1 = yes
    Customary to code the category of interest 1 and the other category 0
  IV = number of resources
  N = 131 foster families
  Are foster families with more resources more likely to continue fostering?

Statistical Significance
  Table 2.5
  Relationship between resources and continuation is statistically significant (Wald χ² = 4.924, p = .026)
  H0: β = 0, β ≤ 0, β ≥ 0, same as H0: OR = 1, OR ≤ 1, OR ≥ 1
  Likelihood ratio χ² better than Wald

Direction/Strength of Relationship
  Positive relationship between resources and continuation
    Families with more resources are more likely to continue
  B = .212
  Exp(B) = OR = 1.237
  % change = 100(1.237 - 1) = 24%
  The odds of continuing are 1.24 times (24%) higher for each additional resource

Estimated Logits
  L(Continue) = -1.227 + (.212)(X)

Figures
  Resources.xls
  Effect of Resources on Continuation (Logits)
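The values plotted in the Resources.xls figures follow from the estimated logit equation. A minimal sketch, assuming the rounded coefficients quoted above (a = -1.227, B = .212) and a hypothetical helper named predict, tabulates the logit, odds, and probability for 1 to 11 resources:

```python
import math

# Rounded coefficients quoted from Table 2.5 in the slides.
INTERCEPT = -1.227
SLOPE_RESOURCES = 0.212


def predict(resources: float) -> tuple:
    """Return (logit, odds, probability) of continuing for a number of resources."""
    logit = INTERCEPT + SLOPE_RESOURCES * resources
    odds = math.exp(logit)
    probability = odds / (1.0 + odds)
    return logit, odds, probability


# One row per number of resources, 1 through 11.
predictions = {x: predict(x) for x in range(1, 12)}

print(f"{'Resources':>9} {'Logit':>7} {'Odds':>6} {'p':>6}")
for x, (logit, odds, probability) in predictions.items():
    print(f"{x:>9} {logit:>7.2f} {odds:>6.2f} {probability:>6.2f}")
```

The logit rises by the same amount (B) for each additional resource, the odds are multiplied by the same factor (e^B), and the probability changes by a different amount at each step. Because the coefficients are rounded, individual values may differ from the plotted ones in the second decimal place.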

Effect of Resources on Continuation (Logits, Odds, and Probabilities)
  Values plotted in the three Resources.xls figures, by number of resources:

  Resources   Logits   Odds   Probabilities
      1       -1.01    0.36       0.27
      2       -0.80    0.45       0.31
      3       -0.59    0.55       0.36
      4       -0.38    0.69       0.41
      5       -0.16    0.85       0.46
      6        0.05    1.05       0.51
      7        0.26    1.30       0.56
      8        0.47    1.60       0.62
      9        0.68    1.98       0.66
     10        0.90    2.45       0.71
     11        1.11    3.03       0.75

Question & Answer
  Are foster families with more resources more likely to continue fostering?
  Yes. The odds of continuing are 1.24 times (24%) higher for each additional resource. The probability of continuing is .31 for families with 2 resources, .51 for families with 6 resources, and .71 for families with 10 resources.

Relationship of Linear Predictor to Logits, Odds & p
  Relationship between linear predictor and logits is linear
  Relationship between linear predictor and odds is non-linear
  Relationship between linear predictor and p is non-linear
  Challenge is to summarize changes in odds and probabilities associated with changes in IVs in the most meaningful and parsimonious way

Logit as Function of Linear Predictor
  [Figure: logit (vertical axis, -3.00 to 3.00) against the linear predictor (horizontal axis, -3.00 to 3.00)]

Odds as Function of Linear Predictor
  [Figure: odds (vertical axis, .00 to 21.00) against the linear predictor (horizontal axis, -3.00 to 3.00)]

Probabilities as Function of Linear Predictor
  [Figure: probability (vertical axis, .00 to 1.00) against the linear predictor (horizontal axis, -3.00 to 3.00)]

IVs to z-scores
  z-scores (standard scores)
  Only the IV (not the DV) is standardized: semi-standardized slopes
  One-unit increase in the IV refers to a one-standard-deviation increase
  OR interpreted as the expected change in the odds associated with a one-standard-deviation increase in the IV
  Conversion to z-scores changes the intercept, slope, and OR, but not the associated test statistics
  Table 2.6 (compare to Table 2.5)

Figures
  zResources.xls
  Effect of zResources on Continuation (Probabilities)

  Standardized resources   Probabilities
           -3                  0.26
           -2                  0.34
           -1                  0.44
            0                  0.54
            1                  0.64
            2                  0.73
            3                  0.80

Question & Answer
  Are foster families with more resources more likely to continue fostering?
  Yes. The odds of continuing are 1.51 times (51%) higher for each one-standard-deviation (1.93) increase in resources. The probability of continuing is .34 for families with resources two standard deviations below the mean, .54 for families with the mean number of resources (6.60), and .73 for families with resources two standard deviations above the mean.

IVs Centered
  Centering
    Typically center on the mean
    Useful when testing interactions or curvilinear relationships, or when there is no meaningful 0 point (e.g., no family with 0 resources)
  Centering doesn't change the slope, OR, or associated test statistics, but does change the intercept
  Table 2.7 (compare to Table 2.5)

Figures
  cResources.xls
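The effect of z-scoring and centering on the coefficients can be checked algebraically. The sketch below is an illustration under stated assumptions: it starts from the rounded raw-score coefficients quoted from Table 2.5 (a = -1.227, B = .212) and the mean (6.60) and standard deviation (1.93) quoted in the slides, and derives the transformed coefficients by substitution rather than reading them from Tables 2.6 and 2.7. All variable and function names are hypothetical.

```python
import math

# Rounded raw-score values quoted in the slides.
RAW_INTERCEPT = -1.227
RAW_SLOPE = 0.212
MEAN_RESOURCES = 6.60
SD_RESOURCES = 1.93


def probability(logit: float) -> float:
    """p = e^L / (1 + e^L), written in the equivalent form 1 / (1 + e^-L)."""
    return 1.0 / (1.0 + math.exp(-logit))


# Centering: X = mean + c, so L = (a + B * mean) + B * c.
# Only the intercept changes; the slope (and therefore the OR) stays the same.
centered_intercept = RAW_INTERCEPT + RAW_SLOPE * MEAN_RESOURCES
centered_slope = RAW_SLOPE

# z-scores: X = mean + SD * z, so L = (a + B * mean) + (B * SD) * z.
# The intercept and the slope both change, so the OR changes too.
z_intercept = RAW_INTERCEPT + RAW_SLOPE * MEAN_RESOURCES
z_slope = RAW_SLOPE * SD_RESOURCES

print(f"raw:      a = {RAW_INTERCEPT:.3f}, B = {RAW_SLOPE:.3f}, OR = {math.exp(RAW_SLOPE):.2f}")
print(f"centered: a = {centered_intercept:.3f}, B = {centered_slope:.3f}, OR = {math.exp(centered_slope):.2f}")
print(f"z-score:  a = {z_intercept:.3f}, B = {z_slope:.3f}, OR = {math.exp(z_slope):.2f}")
print(f"p at the mean: {probability(z_intercept):.2f}")
print(f"p at 2 SD above the mean: {probability(z_intercept + 2 * z_slope):.2f}")
```

In both transformed models the intercept becomes the logit for a family with the mean number of resources, which is why the probability at 0 is the same in the centered and standardized versions.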

  Effect of cResources on Continuation (Probabilities)

  Centered resources   Probabilities
         -5                0.29
         -4                0.34
         -3                0.39
         -2                0.44
         -1                0.49
          0                0.54
          1                0.60
          2                0.65
          3                0.69
          4                0.74
          5                0.77

Question & Answer
  Are foster families with more resources more likely to continue fostering?
  Yes. The odds of continuing are 1.24 times (24%) higher for each additional resource. The probability of continuing is .34 for families with 4 resources below the mean, .54 for families with the mean number of resources (6.60), and .74 for families with 4 resources above the mean.

Multiple IV Example
  DV = continue fostering, 0 = no, 1 = yes
    Customary to code the category of interest as 1 and the other category as 0
  IV = married, 0 = not married, 1 = married
  IV = number of resources (z-scores)
  N = 131 foster families
  Are foster families with more resources more likely to continue fostering, controlling for marital status?

Statistical Significance
  Table 2.12
  Relationship between set of IVs and continuation is statistically significant (χ² = 6.58, p = .037)

  H0: β1 = β2 = ... = βk = 0, same as H0: ψ1 = ψ2 = ... = ψk = 1
  ψ (psi) is the symbol for the population value of the OR

Statistical Significance (contd)
  Table 2.13
  Relationship between resources and continuation is not statistically significant, controlling for marital status (χ² = .92, p = .338)
  Relationship between marital status and continuation is not statistically significant, controlling for resources (χ² = 1.42, p = .234)
  H0: β = 0, β ≤ 0, β ≥ 0, same as H0: ψ = 1, ψ ≤ 1, ψ ≥ 1
  ψ (psi) is the symbol for the population value of the OR
  Likelihood ratio χ² better than Wald

Statistical Significance (contd)
  Table 2.9
  Relationship between resources and continuation is not statistically significant, controlling for marital status (χ² = .91, p = .340)
  Relationship between marital status and continuation is not statistically significant, controlling for resources (χ² = 1.41, p = .235)
  H0: β = 0, β ≤ 0, β ≥ 0, same as H0: ψ = 1, ψ ≤ 1, ψ ≥ 1
  ψ (psi) is the symbol for the population value of the OR
  Wald χ², but likelihood ratio χ² better

Estimated Logits
  L(Continue) = -.183 + (.228)(X_zResources) + (.570)(X_Married)

ORs & Percentage Change
  OR_zResources = 1.256 (ns)
    The odds of continuing are 1.26 times (26%) higher for each one-standard-deviation (1.93) increase in resources, controlling for marital status
  OR_Married = 1.769 (ns)
    The odds of continuing are 1.77 times (77%) higher for two-parent compared to one-parent families, controlling for resources

Figures
  Married & zResources.xls
  Effect of Resources and Marital Status on Plans to Continue Fostering (Odds)

  Standardized resources   One-Parent   Two-Parent
           -3                 0.42         0.74
           -2                 0.53         0.93
           -1                 0.66         1.17
            0                 0.83         1.47
            1                 1.05         1.85
            2                 1.31         2.32
            3                 1.65         2.92

  Effect of Resources and Marital Status on Plans to Continue Fostering (Probabilities)

  Standardized resources   One-Parent   Two-Parent
           -3                 0.30         0.43
           -2                 0.35         0.48
           -1                 0.40         0.54
            0                 0.45         0.60
            1                 0.51         0.65
            2                 0.57         0.70
            3                 0.62         0.74

Presenting Odds and Probabilities in Tables
  Tables 2.10 and 2.11

Question & Answer
  Are foster families with more resources more likely to continue fostering, controlling for marital status?
  No (ns). The odds of continuing are 1.26 times (26%) higher for each one-standard-deviation (1.93) increase in resources, controlling for marital status.

Question & Answer (contd)
  For one-parent families the probability of continuing is .35 for families with resources two standard deviations below the mean, .45 for families with the mean number of resources, and .57 for families with resources two standard deviations above the mean.
  For two-parent families the probability of continuing is .48 for families with resources two standard deviations below the mean, .60 for families with the mean number of resources, and .70 for families with resources two standard deviations above the mean.

Comparing the Relative Strength of IVs
  Size of slope and OR depend on how the IV is measured
  When IVs are measured the same way (e.g., two dichotomous IVs or two continuous IVs transformed to z-scores), relative strength can be compared
  Nothing comparable to the standardized slope (Beta)

Nested Models
  Full model: IV1, IV2, IV3
  Two-IV models nested within the full model, each with its own nested one-IV models:
    IV1, IV2 (nests IV1; IV2)
    IV2, IV3 (nests IV2; IV3)
    IV1, IV3 (nests IV1; IV3)

Nested Models (contd)
  One regression model is nested within another if it contains a subset of the variables included in the model within which it's nested, and the same cases are analyzed in both models
  The more complex model is called the full model
  The nested model is called the reduced model
  Comparison of full and reduced models allows you to examine whether one or more variable(s) in the full model contribute to explanation of the DV

Sequential Entry of IVs
  Used to compare full and reduced models
    e.g., family resources entered first, and then marital status
  F change used in linear regression

Sequential Entry of IVs (contd)
  SPSS GZLM doesn't allow sequential entry of IVs
  Estimate models separately and compare omnibus likelihood ratio χ² values
    Reduced model χ²(1) = 5.168
    Full model χ²(2) = 6.585
    χ² difference = 6.585 - 5.168 = 1.417
    df difference = 2 - 1 = 1
    p = .234
  Chi-square Difference.xls

Assumptions Necessary for Testing Hypotheses
  No assumptions unique to binary logistic regression other than the ones discussed in the GZLM lecture

Model Evaluation
  Evaluate your model before you test hypotheses or interpret substantive results

  Outliers
  Analogs of R²

Outliers
  Atypical cases
  Can lead to flawed conclusions
  Can provide theoretical insights
  Common causes
    Data entry errors
    Model misspecification
    Rare events

Outliers (contd)
  Leverage
  Residuals
    Standardized or unstandardized deviance residuals
  Influence
    Cook's D

Leverage
  Think of a seesaw
  Leverage value for each case
  Cases with greater leverage can exert a disproportionately large influence
  No clear benchmarks
    Identify cases with substantially different leverage values than those of other cases

Residuals
  Difference between actual and estimated values of the DV for a case
  Residual for each case
  Large residual indicates a case for which the model fits poorly

Residuals (contd)
  Standardized or unstandardized deviance residuals
  Not normally distributed
  Values less than -2 or greater than +2 warrant some concern
  Values less than -3 or greater than +3 merit close inspection

Influence
  Cases whose deletion results in substantial changes to regression coefficients
  Cook's D for each case
    Approximate aggregate change in regression parameters resulting from deletion of a case
    Values of 1.0 or more indicate a problematic degree of influence for an individual case
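The residual and Cook's D benchmarks above can be applied mechanically once the diagnostics have been saved for each case. The following is a minimal sketch with made-up values: the CaseDiagnostics class, the flag_case function, and the three example cases are hypothetical and do not come from the foster family data. Leverage is left out because the slides give no clear benchmark for it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CaseDiagnostics:
    case_id: int
    deviance_residual: float  # standardized deviance residual for the case
    cooks_d: float            # Cook's D for the case


def flag_case(case: CaseDiagnostics) -> List[str]:
    """Return the benchmark-based warnings that apply to one case."""
    flags = []
    size = abs(case.deviance_residual)
    if size > 3.0:
        flags.append("residual beyond +/-3: inspect closely")
    elif size > 2.0:
        flags.append("residual beyond +/-2: some concern")
    if case.cooks_d >= 1.0:
        flags.append("Cook's D of 1.0 or more: problematic influence")
    return flags


# Hypothetical diagnostics, for illustration only.
cases = [
    CaseDiagnostics(case_id=1, deviance_residual=0.8, cooks_d=0.02),
    CaseDiagnostics(case_id=2, deviance_residual=-2.4, cooks_d=0.10),
    CaseDiagnostics(case_id=3, deviance_residual=3.2, cooks_d=1.30),
]

for case in cases:
    warnings = flag_case(case)
    print(case.case_id, warnings if warnings else "no flags")
```

Flags of this kind only point to cases worth a closer look; whether a flagged case reflects a data entry error, model misspecification, or a rare event still has to be judged case by case.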

Index Plot
  Scatterplot
  Horizontal axis (X): case id
  Vertical axis (Y): leverage values, residuals, or Cook's D

Index Plot: Leverage Values

Index Plot: Standardized Deviance Residuals

Index Plot: Cook's D

Analogs of R²
  None in standard use, and each may give different results
  Typically much smaller than R² values in linear regression
  Difficult to interpret

Multicollinearity
  SPSS GZLM doesn't compute multicollinearity statistics
  Use SPSS linear regression
  Problematic levels
    Tolerance < .10 or VIF > 10

Additional Topics
  Polychotomous IVs
  Curvilinear relationships
  Interactions

Overview of the Process
  Select IVs and decide whether to test curvilinear relationships or interactions
  Carefully screen and clean data
  Transform and code variables as needed
  Estimate regression model
  Examine assumptions necessary to estimate the binary regression model, examine model fit, and revise the model as needed

Overview of the Process (contd)
  Test hypotheses about the overall model and specific model parameters, such as ORs
  Create tables and graphs to present results in the most meaningful and parsimonious way
  Interpret results of the estimated model in terms of logits, probabilities, odds, and odds ratios, as appropriate

Additional Regression Models for Dichotomous DVs
  Binary probit regression
    Substantive results essentially indistinguishable from binary logistic regression
    Choice between this and binary logistic regression is largely one of convenience and discipline-specific convention
    Many researchers prefer binary logistic regression because it provides odds ratios, whereas probit regression does not, and because it comes with a wider variety of fit statistics

Additional Regression Models for Dichotomous DVs (contd)
  Complementary log-log (clog-log) and log-log models
    Used when the probability of the event is very small or large
  Loglinear regression
    Limited to categorical IVs
  Discriminant analysis
    Limited to continuous IVs