Understanding Logistic Regression Output from SAS
This post details the terms obtained in SAS output for logistic regression. The definitions are generic and referenced from other great posts on this topic. The aim is to provide a summary of definitions and a statistical explanation of the output obtained from logistic regression code in SAS.
This covers binary classification, not multi-class classification.
Logistic regression code:
ods graphics on;
proc logistic data=lib_cbl.data_pd_&circle. namelen=100 plots=ALL descending
              outest=estimates_pd_&circle. outmodel=model_pd_&circle.;
   class &classvars.;
   model dv = &variables. / selection=stepwise ctable scale=none
                            clparm=wald clodds=pl rsquare lackfit;
   output out=pred_pd_&circle. p=PROB;
run;
Decile Preparation Code:
/* Calculate Deciles for Lift Chart and KS */
data decile_pd;
   set lib_cbl.pred_pd_&circle.;
   predicted_trn = PROB + (ranuni(12345)/10000000); /* tiny random jitter to break ties */
run;

proc rank data=decile_pd out=rank_pd groups=10;
   var predicted_trn;
   ranks predicted_rank_trn;
run;

proc sql;
   create table lib_cbl.final_decile_pd_&circle. as
   select predicted_rank_trn,
          count(*) as count,
          min(predicted_trn)*1000 as min,
          max(predicted_trn)*1000 as max,
          mean(dv) as actual,
          mean(predicted_trn) as pred,
          sum(dv) as bad
   from rank_pd
   group by predicted_rank_trn
   order by actual desc;
quit;
Interpretation of Terms in the Output
1. Model Information
Data Set : Data set used by the model
Response Variable : Name of the dependent variable (the one with 0/1 target values)
Number of Response Levels : 2 for binary classification, >2 for multinomial logistic regression
Model : binary logit (for two class binary classification)
Optimization Technique (Fisher’s scoring):
Scoring algorithm, also known as Fisher's scoring, is a form of Newton's method used in statistics to solve maximum likelihood equations numerically, named after Ronald Fisher. It is conceptually similar to the gradient descent algorithm, except that Fisher's scoring climbs to the peak of the likelihood rather than descending to the bottom of a cost surface; equivalently, maximizing the likelihood minimizes the negative log-likelihood, which is the cost function of logistic regression.
Fisher scoring is a hill-climbing algorithm for getting results: it maximizes the likelihood by getting successively closer to the maximum, one step (iteration) at a time. It knows it has reached the top of the hill when taking another step does not increase the likelihood. It is known to be an efficient procedure - not many steps are usually needed - and it generally converges to an answer. When this is the case you do not need to be concerned about it - accept what you have got. Changing metaphors, you can drive the car without knowing how the internal combustion engine works.
Occasionally, however, the likelihood surface you are trying to climb is not a sharp peak: there are multiple peaks and not much depth to the landscape. There is a series of bumps, each giving potentially different answers in terms of the estimates but with pretty similar likelihoods. When this is the case, good software will warn you, and you will see a large number of iterations and a failure to converge. This is usually due to a lack of information (small sample size) or a failure to meet the assumptions of the model (e.g. multicollinearity between predictors). When this happens the solution is usually not a different algorithm but improved data and/or an improved model. Some software allows you to profile the likelihood to see a map of the surface in which you are trying to find the peak.
In much software for the logistic model, the Fisher scoring method (which is equivalent to iteratively reweighted least squares) is the default; an alternative is the Newton-Raphson algorithm. Both should give the same parameter estimates, but the standard errors may differ slightly: Fisher scoring is based on the expected information matrix, the Newton-Raphson method on the observed information matrix. In fact, in a binary logit model the observed and expected information matrices are identical, resulting in identical estimated standard errors.
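To make the algorithm concrete, here is a minimal sketch of Fisher scoring for a binary logit, written in Python with synthetic data rather than SAS (all data and names here are hypothetical, purely for illustration). Because the expected and observed information matrices coincide for the binary logit, this single loop is simultaneously Fisher scoring and Newton-Raphson:

```python
import numpy as np

def fisher_scoring_logistic(X, y, max_iter=25, tol=1e-8):
    """Fit a binary logit by Fisher scoring (iteratively reweighted least squares).

    X must include an intercept column. For the binary logit the expected
    and observed information matrices are identical, so this is also
    Newton-Raphson.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))        # current predicted probabilities
        score = X.T @ (y - p)                        # gradient of the log-likelihood
        info = X.T @ (X * (p * (1.0 - p))[:, None])  # expected information matrix
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:               # a further step changes ~nothing
            break
    return beta

# Synthetic check against a known data-generating process
rng = np.random.default_rng(0)
x = rng.normal(size=500)
true_prob = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x)))
y = (rng.uniform(size=500) < true_prob).astype(float)
X = np.column_stack([np.ones_like(x), x])
beta_hat = fisher_scoring_logistic(X, y)
print(beta_hat)  # estimates should land near the true values (0.5, 2.0)
```

The stopping rule mirrors the description above: the algorithm declares convergence when taking another step no longer moves the estimates (and so no longer increases the likelihood).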
If you have rare events (e.g. many people but few of them have died - lots of zeroes, not many ones), you may wish to use other procedures to get less biased results, such as the Firth method, a penalized likelihood approach to reducing small-sample bias in maximum likelihood estimation; see https://www3.nd.edu/~rwilliam/stats3/RareEvents.pdf
Number of Observations Read & Number of Observations Used : These will be the same if no missing values are present; rows with missing values are removed.
Response Profile : Frequency split of the Dependent Variable (target Variable) - how many rows (observations) have value 1 and how many have a value 0
Probability modeled is dv=1 : This statement confirms the target variable name (here the target variable is named 'dv') and that the logistic model is built to predict the probability that this variable equals 1.
Stepwise Selection Procedure
Stepwise regression is a method of fitting regression models in which the choice of predictive variables is carried out by an automatic procedure. In each step, a variable is considered for addition to or subtraction from the set of explanatory variables based on some prespecified criterion. Usually this takes the form of a sequence of F-tests or t-tests, but other techniques are possible, such as adjusted R², Akaike information criterion (AIC), Bayesian information criterion (BIC), Mallows's Cp, PRESS, or false discovery rate.
SAS Code and Explanation:
The following invocation of PROC LOGISTIC illustrates the use of stepwise selection to identify the prognostic factors for cancer remission. A significance level of 0.3 is required to allow a variable into the model (SLENTRY= 0.3), and a significance level of 0.35 is required for a variable to stay in the model (SLSTAY= 0.35). A detailed account of the variable selection process is requested by specifying the DETAILS option. The Hosmer and Lemeshow goodness-of-fit test for the final selected model is requested by specifying the LACKFIT option. The OUTEST= and COVOUT options in the PROC LOGISTIC statement create a data set that contains parameter estimates and their covariances for the final selected model. The response variable option EVENT= chooses remiss=1 (remission) as the event so that the probability of remission is modeled. The OUTPUT statement creates a data set that contains the cumulative predicted probabilities and the corresponding confidence limits, and the individual and cross validated predicted probabilities for each observation. The ODS OUTPUT statement writes the “Association” table from each selection step to a SAS data set.
title 'Stepwise Regression on Cancer Remission Data';
proc logistic data=Remission outest=betas covout;
   model remiss(event='1')=cell smear infil li blast temp
         / selection=stepwise slentry=0.3 slstay=0.35 details lackfit;
   output out=pred p=phat lower=lcl upper=ucl
          predprob=(individual crossvalidate);
   ods output Association=Association;
run;

proc print data=betas;
   title2 'Parameter Estimates and Covariance Matrix';
run;

proc print data=pred;
   title2 'Predicted Probabilities and 95% Confidence Limits';
run;
In stepwise selection, an attempt is made to remove any insignificant variables from the model before adding a significant variable to the model. Each addition or deletion of a variable to or from a model is listed as a separate step in the displayed output, and at each step a new model is fitted.
Finally, when none of the remaining variables outside the model meet the entry criterion, the stepwise selection is terminated.
A different variable selection method is used to select prognostic factors for cancer remission, and an efficient algorithm is employed to eliminate insignificant variables from the model. The following statements invoke PROC LOGISTIC to perform the backward elimination analysis:
title 'Backward Elimination on Cancer Remission Data';
proc logistic data=Remission;
   model remiss(event='1')=temp cell li smear blast
         / selection=backward fast slstay=0.2 ctable;
run;
The backward elimination analysis (SELECTION=BACKWARD) starts with a model that contains all explanatory variables given in the MODEL statement. By specifying the FAST option, PROC LOGISTIC eliminates insignificant variables without refitting the model repeatedly. This analysis uses a significance level of 0.2 to retain variables in the model (SLSTAY=0.2), which is different from the previous stepwise analysis where SLSTAY=0.35. The CTABLE option is specified to produce classifications of input observations based on the final selected model.
Results of the fast elimination analysis are shown in Output 51.1.9 and Output 51.1.10. Initially, a full model containing all the risk factors listed in the MODEL statement is fit to the data (Output 51.1.9). In the next step (Output 51.1.10), PROC LOGISTIC removes blast, smear, cell, and temp from the model all at once. This leaves li and the intercept as the only variables in the final model. Note that in this analysis, only parameter estimates for the final model are displayed because the DETAILS option has not been specified.
See the outputs at the link provided above.
The Akaike information criterion (AIC) is an estimator of the relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Thus, AIC provides a means for model selection.
AIC is founded on information theory: it offers an estimate of the relative information lost when a given model is used to represent the process that generated the data. (In doing so, it deals with the trade-off between the goodness of fit of the model and the simplicity of the model.)
AIC does not provide a test of a model in the sense of testing a null hypothesis. It tells nothing about the absolute quality of a model, only the quality relative to other models. Thus, if all the candidate models fit poorly, AIC will not give any warning of that.
Suppose that we have a statistical model of some data. Let k be the number of estimated parameters in the model, and let L̂ be the maximum value of the likelihood function for the model. Then the AIC value of the model is:

AIC = 2k − 2·ln(L̂)
Given a set of candidate models for the data, the preferred model is the one with the minimum AIC value. Thus, AIC rewards goodness of fit (as assessed by the likelihood function), but it also includes a penalty that is an increasing function of the number of estimated parameters. The penalty discourages overfitting, because increasing the number of parameters in the model almost always improves the goodness of the fit.
BIC (Bayesian Information Criterion): your logistic regression output will give you the -2 Log Likelihood, so it is very easy to calculate both AIC and BIC.
BIC = ln(number of observations) × (number of parameters in your model) + (−2 Log Likelihood)
AIC = 2 × (number of parameters in your model) + (−2 Log Likelihood)

Here the parameter count includes the intercept, and "−2 Log Likelihood" is the (positive) value reported by SAS, i.e. −2·ln(L̂).
AIC is a bit more liberal and often favours a more complex, wrong model over a simpler, true model. BIC, on the contrary, tries to find the true model among the set of candidates and penalizes each additional parameter/variable in the model more heavily. Most of the time they will agree on the preferred model, but when they don't, you just have to exercise your judgement.
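To make the bookkeeping concrete, here is a minimal Python sketch (not SAS) that turns the "-2 Log Likelihood" value reported by PROC LOGISTIC into AIC and BIC; the numeric inputs below are hypothetical, purely for illustration:

```python
import math

def aic_bic(minus2_loglik, k, n):
    """AIC and BIC from the '-2 Log Likelihood' value that SAS reports.

    k = number of estimated parameters (including the intercept),
    n = number of observations used.
    """
    aic = 2 * k + minus2_loglik
    bic = math.log(n) * k + minus2_loglik
    return aic, bic

# Hypothetical values, for illustration only
aic, bic = aic_bic(minus2_loglik=300.0, k=5, n=1000)
print(aic, bic)  # AIC = 310.0, BIC ≈ 334.54
```

Note that BIC's ln(n) multiplier exceeds AIC's constant 2 as soon as n ≥ 8, which is why BIC penalizes extra parameters more heavily on any realistically sized dataset.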
The Hosmer–Lemeshow test
The Hosmer–Lemeshow test is a statistical test for goodness of fit for logistic regression models. It is used frequently in risk prediction models. The test assesses whether or not the observed event rates match expected event rates in subgroups of the model population. The Hosmer–Lemeshow test specifically identifies subgroups as the deciles of fitted risk values. Models for which expected and observed event rates in subgroups are similar are called well calibrated.
The Hosmer-Lemeshow goodness of fit test is based on dividing the sample up according to their predicted probabilities, or risks. Specifically, based on the estimated parameter values β̂0, β̂1, ..., β̂p, for each observation in the sample the probability that Y=1 is calculated from that observation's covariate values X1, ..., Xp:

π̂ = exp(β̂0 + β̂1·X1 + ... + β̂p·Xp) / (1 + exp(β̂0 + β̂1·X1 + ... + β̂p·Xp))
The observations in the sample are then split into g groups (we come back to the choice of g later) according to their predicted probabilities. Suppose (as is commonly done) that g=10. Then the first group consists of the observations with the lowest 10% predicted probabilities, the second group consists of the 10% of the sample whose predicted probabilities are next smallest, and so on.
Suppose for the moment, artificially, that all of the observations in the first group had a predicted probability of 0.1. Then, if our model is correctly specified, we would expect the proportion of these observations who have Y=1 to be 10%. Of course, even if the model is correctly specified, the observed proportion will deviate to some extent from 10%, but not by too much. If the proportion of observations with Y=1 in the group were instead 90%, this is suggestive that our model is not accurately predicting probability (risk), i.e. an indication that our model is not fitting the data well.
In practice, as soon as some of our model covariates are continuous, each observation will have a different predicted probability, and so the predicted probabilities will vary in each of the groups we have formed. To calculate how many Y=1 observations we would expect, the Hosmer-Lemeshow test takes the average of the predicted probabilities in the group, and multiplies this by the number of observations in the group. The test performs the same calculation for Y=0, and then calculates the Pearson goodness of fit statistic

H = Σ (l=1 to g) [ (o0l − e0l)² / e0l + (o1l − e1l)² / e1l ]

where o0l denotes the number of observed Y=0 observations in the lth group, o1l the number of observed Y=1 observations in the lth group, and e0l and e1l the corresponding expected numbers of zeros and ones under the fitted model.
In a 1980 paper, Hosmer and Lemeshow showed by simulation that (provided p+1&lt;g) their test statistic approximately follows a chi-squared distribution on g−2 degrees of freedom when the model is correctly specified. This means that, given our fitted model, the p-value can be calculated as the right-hand tail probability of the corresponding chi-squared distribution using the calculated test statistic. If the p-value is small, this is indicative of poor fit.
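The grouping and the statistic can be sketched in a few lines of Python (an illustrative sketch, not SAS's implementation, using synthetic well-calibrated data; the resulting H would then be compared against a chi-squared distribution with g−2 degrees of freedom):

```python
import numpy as np

def hosmer_lemeshow(y, p, g=10):
    """Hosmer-Lemeshow statistic: group by deciles of predicted risk and
    compare observed vs expected counts of events (Y=1) and non-events (Y=0)."""
    order = np.argsort(p)
    y_sorted, p_sorted = np.asarray(y)[order], np.asarray(p)[order]
    H = 0.0
    for idx in np.array_split(np.arange(len(y_sorted)), g):
        n_l = len(idx)
        o1 = y_sorted[idx].sum()      # observed events in group l
        e1 = p_sorted[idx].sum()      # expected events = sum of predicted probs
        o0, e0 = n_l - o1, n_l - e1   # observed / expected non-events
        H += (o1 - e1) ** 2 / e1 + (o0 - e0) ** 2 / e0
    return H  # compare against chi-squared with g-2 degrees of freedom

# Synthetic, well-calibrated data: outcomes really are drawn from p
rng = np.random.default_rng(1)
p = rng.uniform(0.05, 0.95, size=2000)
y = (rng.uniform(size=2000) < p).astype(float)
H = hosmer_lemeshow(y, p)
print(H)  # a modest value is expected when the model is well calibrated
```

Because the data above are generated exactly from the stated probabilities, H should wander around its expected value of roughly g−2; a miscalibrated model would push it far into the right tail.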
It should be emphasized that a large p-value does not mean the model fits well, since lack of evidence against a null hypothesis is not equivalent to evidence in favour of the alternative hypothesis. In particular, if our sample size is small, a high p-value from the test may simply be a consequence of the test having lower power to detect mis-specification, rather than being indicative of good fit.
Choosing the number of groups

As far as I have seen, there is little guidance as to how to choose the number of groups g. Hosmer and Lemeshow's conclusions from simulations were based on using g>p+1, suggesting that if we have 10 covariates in the model we should choose g>11, although this doesn't appear to be mentioned in textbooks or software packages.
Intuitively, using a small value of g gives less opportunity to detect mis-specification. However, if we choose g too large, the numbers in each group may be so small that it is difficult to determine whether differences between observed and expected are due to chance or indicative of model mis-specification.
A further problem, highlighted by many others (e.g. Paul Allison) is that, for a given dataset, if one changes g, sometimes one obtains a quite different p-value, such that with one choice of g we might conclude our model does not fit well, yet with another we conclude there is no evidence of poor fit. This is indeed a troubling aspect of the test.
Concordance & Discordance
Concordant pairs and discordant pairs refer to comparing two pairs of data points to see if they “match.” The meaning is slightly different depending on if you are finding these pairs from various coefficients (like Kendall’s Tau) or if you are performing experimental studies and clinical trials.
Let’s say you had two interviewers rate a group of twelve job applicants:
Note that in the first column, interviewer 1’s choices have been ordered from smallest to greatest. That way, a comparison can be made between the choices for interviewer 1 and 2. With concordant or discordant pairs, you’re basically answering the question: did the judges/raters rank the pairs in the same order? You aren’t necessarily looking for the exact same rank, but rather if one job seeker was consistently ranked higher by both interviewers.
Three scenarios are possible for these ordered pairs:
- Tied pairs: both interviewers agree. For example, candidate A was marked as a 1st choice for both interviewers, so they are tied.
- Concordant pairs: both interviewers rank both applicants in the same order — that is, they both move in the same direction. While they aren’t the same rank (i.e. both 1st or both 2nd), each pair is ordered equally higher or equally lower. Interviewer 1 ranked F as 6th and G as 7th, while interviewer 2 ranked F as 5th and G as 8th. F and G are concordant because F was consistently ranked higher than G.
- Discordant pairs: Candidates E and F are discordant because the interviewers ranked them in opposite directions (one said E ranked higher than F, while the other said F ranked higher than E).
Steps to calculate concordance / discordance and AUC
Calculate the predicted probability in logistic regression model.
Divide the data into two datasets. One dataset contains observations having actual value of dependent variable with value 1 (i.e. event) and corresponding predicted probability values. And the other dataset contains observations having actual value of dependent variable 0 (non-event) against their predicted probability scores.
Compare each predicted value in first dataset with each predicted value in second dataset.
Total Number of pairs to compare = x * y
x: Number of observations in first dataset (actual values of 1 in dependent variable)
y: Number of observations in second dataset (actual values of 0 in dependent variable).
In this step, we are performing cartesian product (cross join) of events and non-events. For example, you have 100 events and 1000 non-events. It would create 100k (100*1000) pairs for comparison.
- A pair is concordant if 1 (observation with the desired outcome i.e. event) has a higher predicted probability than 0 (observation without the outcome i.e. non-event).
- A pair is discordant if 0 (observation without the desired outcome i.e. non-event) has a higher predicted probability than 1 (observation with the outcome i.e. event).
- A pair is tied if 1 (observation with the desired outcome i.e. event) has the same predicted probability as 0 (observation without the outcome i.e. non-event).
- The final percent values are calculated using the formula below -
Percent Concordant = (Number of concordant pairs)/Total number of pairs
Percent Discordant = (Number of discordant pairs)/Total number of pairs
Percent Tied = (Number of tied pairs)/Total number of pairs
Area under curve (c statistic) = Percent Concordant + 0.5 * Percent Tied
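The steps above can be sketched directly in Python (an illustrative, naive O(x·y) implementation over the cartesian product of events and non-events; fine for small data, slow for large). It also returns Somers' D, which SAS reports alongside the c statistic and which equals Percent Concordant minus Percent Discordant, i.e. 2·AUC − 1:

```python
def concordance(actuals, probs):
    """Concordant/discordant/tied percentages over all event x non-event pairs."""
    events = [p for a, p in zip(actuals, probs) if a == 1]
    nonevents = [p for a, p in zip(actuals, probs) if a == 0]
    conc = disc = tied = 0
    for pe in events:               # cartesian product: every event vs every non-event
        for pn in nonevents:
            if pe > pn:
                conc += 1
            elif pe < pn:
                disc += 1
            else:
                tied += 1
    total = conc + disc + tied
    pct_c, pct_d, pct_t = conc / total, disc / total, tied / total
    c_stat = pct_c + 0.5 * pct_t    # area under the ROC curve
    somers_d = pct_c - pct_d        # equals 2*AUC - 1
    return pct_c, pct_d, pct_t, c_stat, somers_d

# Hypothetical scored observations: 2 events, 3 non-events -> 6 pairs
actuals = [1, 1, 0, 0, 0]
probs = [0.9, 0.4, 0.6, 0.4, 0.2]
print(concordance(actuals, probs))  # c statistic = 0.75, Somers' D = 0.5
```

In the toy data, 4 of the 6 pairs are concordant, 1 is discordant, and 1 is tied, giving c = 4/6 + 0.5·(1/6) = 0.75.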
Somers’ D (Somers’ Delta), sometimes incorrectly referred to as Somer’s D, is a measure of ordinal association between two possibly dependent random variables X and Y. Somers’ D takes values between -1 when all pairs of the variables disagree and 1 when all pairs of the variables agree.
Somers' D is an ordinal alternative to Pearson's correlation coefficient. Like Pearson's r, the range for Somers' D is -1 to 1:
-1 = all pairs disagree,
1 = all pairs agree.
Large values for Somers' D (tending towards -1 or 1) suggest the model has good predictive ability. Smaller values (tending towards zero in either direction) indicate the model is a poor predictor. Let's say you had a Somers' D of .549 in the friendly sales staff/customer satisfaction scenario. Customer satisfaction is the dependent variable, so you can say that a pair of customers is 54.9 percentage points more likely to be concordant than discordant - that is, higher friendliness ratings tend to go with higher satisfaction.
Decile Table, Lift Chart, and KS
Decile Table: The predicted probabilities are sorted in descending order and then divided into 10 equal groups (deciles). This gives a uniform distribution of customers (equal numbers in each decile), with the first decile containing exactly 10% of the population.
The idea is to see the distribution of probabilities of classification. Say good/positive is represented by 1 and bad/negative by 0. If the model is random (no model), the distribution of 1s and 0s in each decile will be the same as in the population, i.e. the first decile (containing 10% of the entire population) would contain 10% of the goods. For a very good model, the first decile (10% of the population) might contain, say, 40% of the good customers, the 2nd decile 15% of the goods, the 3rd 8% of the goods, and so on. This way the classifier segregates the population: the top 3 deciles in this example would contain 63% of the good population, so by targeting this 30% of the population you would capture 63% of the goods.
A decile table looks like:
Lift Chart: This is a decile-wise representation of cumulative goods and cumulative bads; the gap between the two lines shows the discriminating power of the model.
Kolmogorov–Smirnov statistic (KS): KS is the maximum difference between the cumulative good and cumulative bad populations. This is a metric by which the performance of the model can be measured; for a good model, the KS should be at least 35-40 and should occur in the 3rd or 4th decile.
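The decile table and KS can be sketched in Python as follows (an illustrative sketch with synthetic data, mirroring the PROC RANK + PROC SQL logic shown earlier; KS is expressed on a 0-100 scale, and the jitter trick from the SAS code is unnecessary here because continuous scores rarely tie):

```python
import numpy as np

def decile_ks(actuals, probs, n_groups=10):
    """Sort by predicted probability (descending), split into equal groups,
    and compute KS as the max gap between cumulative good% and cumulative bad%."""
    y = np.asarray(actuals)[np.argsort(-np.asarray(probs))]
    deciles = np.array_split(y, n_groups)
    goods = np.array([d.sum() for d in deciles])           # events per decile
    bads = np.array([len(d) - d.sum() for d in deciles])   # non-events per decile
    cum_good = 100.0 * np.cumsum(goods) / goods.sum()
    cum_bad = 100.0 * np.cumsum(bads) / bads.sum()
    return cum_good, cum_bad, np.max(np.abs(cum_good - cum_bad))

# Synthetic scores: higher predicted probability really does mean more events
rng = np.random.default_rng(2)
p = rng.uniform(size=5000)
y = (rng.uniform(size=5000) < p).astype(float)
cum_good, cum_bad, ks = decile_ks(y, p)
print(round(ks, 1))  # a discriminating model pulls the two cumulative curves apart
```

Plotting cum_good and cum_bad against the decile number gives the lift chart described above; KS is simply the widest vertical gap between the two curves.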