Logistic Regression – Residual Analysis

I have been working with a client who needs to model a process that generates attribute data ranging from 0% to 100%. The wide range was not a problem in itself; it is simply part of their process.  The process was targeted at 50%, but different product lines performed differently, and the client could not predict the output well enough to price their products properly. By the way, the attribute was not a quality measure; it was another type of count data.

They identified three factors that should drive the output.  One was a locational issue, one was a type issue and one was supplier based.  This calls for a logistic ANOVA, which was a good choice.  But the nature of the data, where each combination of factors had results ranging from 0% to 100%, almost ensures that the goodness-of-fit tests will all fail.  To apply some type of test to the logistic model, we chose to take the model's EPRO (estimated probability) for each product and compare it to actual performance.  We used the actual-minus-predicted delta as the prediction residual.  Plotting this value should show a symmetric histogram, which could even be bell shaped.  It is not expected to be normally distributed, but it could be close.
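The delta calculation itself is simple, and a minimal sketch may help. The numbers below are made up for illustration and are not the client's data; the function names and the 0.25 bin width are likewise assumptions chosen only to keep the example small.

```python
import math
from collections import Counter

def prediction_residuals(actual, predicted):
    """Return actual-minus-predicted deltas for paired proportions in [0, 1]."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return [a - p for a, p in zip(actual, predicted)]

def bin_counts(deltas, bin_width=0.25):
    """Count deltas per bin; each bin is keyed by its lower edge."""
    counts = Counter(math.floor(d / bin_width) for d in deltas)
    return {k * bin_width: counts[k] for k in sorted(counts)}

# Made-up values: observed rate and model EPRO for six hypothetical products.
actual = [0.50, 0.75, 0.25, 1.00, 0.00, 0.50]
epro = [0.50, 0.50, 0.50, 0.75, 0.25, 0.25]

deltas = prediction_residuals(actual, epro)

# Crude text histogram: one '#' per product in each bin.
for edge, n in bin_counts(deltas).items():
    print(f"{edge:+.2f} {'#' * n}")
```

With real data, a finer bin width and a proper plotting tool would be the natural choice; the point here is only that the delta lives between -1 and +1 and is centered on zero when the model is unbiased.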

The first plots of these delta values, using single-factor models, showed a roughly uniform histogram, indicating that the one-factor fits were not enough.  When two factors were used, the histogram was tri-modal (three peaks), but it did have tails (lower occurrence rates at the extremes).  The center peak was taller than the two side peaks.

When the third factor was added to the analysis, one side peak disappeared and the other was significantly reduced.  The way the histogram of the delta values shifts with each improving fit provides insight into the process and indicates that the final three-factor fit is good.  The remaining minor peak in the delta histogram, however, indicates that there is another factor influencing the output, but the client does not have any other factor to fit at this time.
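A toy illustration of why a missing factor shows up as extra peaks in the delta histogram. Everything here is hypothetical: the products, the rates and the factor names are invented. The sketch also uses each group's mean rate as the fitted value, which matches a logistic fit only when the model has one free parameter per group and every product has the same number of trials, so it is a stand-in for the real model, not a reproduction of it.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical products: (location, type, observed rate). Invented values.
products = [
    ("north", "A", 0.75), ("north", "A", 0.85),
    ("north", "B", 0.35), ("north", "B", 0.45),
    ("south", "A", 0.55), ("south", "A", 0.65),
    ("south", "B", 0.15), ("south", "B", 0.25),
]

def group_mean_residuals(rows, key):
    """Actual-minus-fitted deltas, where the fitted value is the group mean."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row[2])
    fitted = {g: mean(rates) for g, rates in groups.items()}
    return [row[2] - fitted[key(row)] for row in rows]

# Location only: the ignored type effect splits the deltas into two clusters.
one_factor = group_mean_residuals(products, key=lambda r: r[0])

# Location and type: the clusters collapse toward zero.
two_factor = group_mean_residuals(products, key=lambda r: (r[0], r[1]))

print("one factor :", sorted(round(d, 2) for d in one_factor))
print("two factors:", sorted(round(d, 2) for d in two_factor))
```

In this toy case the one-factor deltas sit in two clusters on either side of zero, and adding the second factor pulls them all in close to zero. A leftover side peak in real data suggests the same thing: some grouping the model does not yet know about.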

We took the three-factor model and mapped its coefficients into an Excel worksheet for use in estimating the output for each new product.
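For readers who have not done this mapping before, here is a rough sketch of the arithmetic the worksheet has to perform: add up the intercept and the coefficient for each factor level to get the log-odds, then convert the log-odds to a probability. The coefficient values and level names below are hypothetical, not the client's, and the sketch assumes a main-effects model with one reference level per factor (coefficient 0.0).

```python
import math

# Hypothetical coefficients on the log-odds scale; NOT the client's values.
INTERCEPT = 0.0
COEFFS = {
    "location": {"site_1": 0.0, "site_2": 0.4},
    "type": {"type_a": 0.0, "type_b": -0.3},
    "supplier": {"supplier_x": 0.0, "supplier_y": 0.2},
}

def estimated_probability(location, type_, supplier):
    """Sum the coefficients into log-odds, then apply the inverse logit."""
    log_odds = (
        INTERCEPT
        + COEFFS["location"][location]
        + COEFFS["type"][type_]
        + COEFFS["supplier"][supplier]
    )
    return 1.0 / (1.0 + math.exp(-log_odds))

# All reference levels with a zero intercept gives the 50% target.
print(estimated_probability("site_1", "type_a", "supplier_x"))

# A different combination shifts the estimate away from 50%.
print(estimated_probability("site_2", "type_b", "supplier_y"))
```

In a worksheet the same conversion can be written as =1/(1+EXP(-x)), where x is the cell holding the summed log-odds.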

The lesson to take away from this post: statistical tests are flexible in their use if you understand how they work and what is behind the menus.  Bringing the definition of a residual into the logistic regression process is not necessarily the best thing to do, but it did verify the benefit of the logistic prediction.  By the way, the R-squared was 36%.  Not great, but an improvement from zero.
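The post does not say which R-squared definition was used, so the following is only one plausible reading: the share of variation in the actual rates that the estimated probabilities account for, computed as 1 - SSE/SST. The numbers are invented for illustration.

```python
def r_squared(actual, predicted):
    """1 - SSE/SST, treating the estimated probabilities as predictions."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    avg = sum(actual) / len(actual)
    ss_total = sum((a - avg) ** 2 for a in actual)
    ss_resid = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1.0 - ss_resid / ss_total

# Made-up rates for four hypothetical products.
actual = [0.2, 0.4, 0.6, 0.8]
epro = [0.3, 0.4, 0.6, 0.7]

print(round(r_squared(actual, epro), 3))

# Predicting the overall average for every product is the "zero" baseline.
baseline = [sum(actual) / len(actual)] * len(actual)
print(round(r_squared(actual, baseline), 3))
```

Under this definition, a model that predicts the overall average for every product scores zero, which is the baseline the 36% is being compared against.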
