Wrong Time for Regression

Teaching regression in a Lean Six Sigma course or a statistics course is always enjoyable because the students recognize what a great tool they have. It is a tool that they all learned about in school as Y = mX + b, or Y = slope * X + y-intercept. In class the students realize how powerful the regression concept can be when they are looking at a continuous data set.

Using regression, students show that call volume does affect cycle times, or that complexity does explain long lead times. Even better, students can show that things they have accepted as truths are not really true. They may find that greater job experience does not lead to faster job execution or fewer mistakes, or that orders with a greater value do not cost more money to process. Finding that a believed relationship does not exist may be more important than finding a new relationship.

Nearly all work in Lean Six Sigma is about identifying new relationships or proving or disproving existing beliefs. For many of us, it is the evaluation of beliefs and the finding of new relationships that provides the most joy.

Regression used improperly

With all of the good that can be achieved with the regression tool, there is one simple use that leads to the highest number of false conclusions: the use of clock time as the X in a regression. I found this example in a blog in early March.

[Figure: percent of the Great Lakes covered with ice, from Steven Goddard's blog, March 1, 2014]

This chart was provided in a blog by Steven Goddard on 1 Mar 2014 based on data from coastwatch.glerl.noaa.gov/webdata/cwops/webdata/statistic/dat/g2013_2014_ice.dat

The error is the trend line that was drawn on the chart. Notice that the x-axis consists of sequential days with the y-axis being the percent coverage. The line predicts that there will be 100% coverage by March 5th. I am writing my blog on the 4th of March and the current estimate is 90%. It appears that the lakes have never been recorded to have achieved 100% ice cover, but this chart says it will be there on the 5th.

What will happen on the 6th and 7th? The regression line indicates that we will achieve close to 110% coverage, which is ridiculous.
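Here is a minimal sketch of that failure. The numbers are hypothetical values that I made up to resemble a rising ice-cover series; they are not the NOAA data, and the helper names `fit_line` and `predict` are my own. The point is only that a straight line fitted to a quantity that cannot exceed 100% will cross 100% once you extend it far enough.

```python
# Hypothetical daily ice-cover percentages (illustrative values, not the NOAA data).
days = list(range(10))
cover = [60, 64, 67, 72, 75, 80, 83, 86, 88, 90]


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(slope, intercept, x):
    """Value of the fitted line at x; nothing keeps it inside 0-100."""
    return intercept + slope * x


slope, intercept = fit_line(days, cover)

# Days 0-9 were observed; days 10-13 are extrapolation.
for day in range(8, 14):
    label = "observed range" if day <= 9 else "extrapolated"
    print(f"day {day:2d}: {predict(slope, intercept, day):6.1f}%  ({label})")
```

Inside the observed days the line stays below 100%, but a few days past the data it predicts more ice than there is lake.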

I have read through a few of Steven’s blogs, and most of his charts and discussions seem to use statistics properly, whether you believe or disbelieve his position on climate change. It turns out that there are many “Scientists” who like to plot regression or trend lines on time-series data, which is always the wrong thing to do.

Why regression should not be used with time-series data

When regression is taught at a basic level, the student is shown a set of x-y data and taught how to describe a line showing the relationship. This is all good. What the student is rarely told is an important assumption that must be met before the regression line can be trusted: each pair of data must be independent of the others. Independence is the basis of the uncertainty estimates and of the predictions a regression makes within the range of X values used in the analysis.

When you collect data over time, and the collection time is the X, the data are no longer independent. Each y value is auto-correlated with the prior value. In this example, the percent ice coverage on March 4th is not an independent value between 0 and 100; it will be close to the March 3rd value. The regression algorithm will still produce a line with this type of data, but the line and its uncertainty estimates cannot be trusted.
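One hedged way to see the difference is to fit a line and then check whether each residual resembles the one before it. The sketch below uses simulated data only (a fixed random seed, made-up slopes and noise levels), and the lag-1 autocorrelation of the residuals is my choice of diagnostic for illustration; other checks exist.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(xs, ys):
    """Vertical distance of each point from the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def lag1_autocorrelation(values):
    """Correlation between each value and the value just before it."""
    n = len(values)
    mean = sum(values) / n
    num = sum((values[i] - mean) * (values[i - 1] - mean) for i in range(1, n))
    den = sum((v - mean) ** 2 for v in values)
    return num / den


rng = random.Random(0)  # fixed seed so the simulation is repeatable
t = list(range(200))

# Case 1: independent scatter around a straight line.
independent = [2.0 + 0.1 * x + rng.gauss(0, 1) for x in t]

# Case 2: each value is the previous value plus a small change,
# the way today's ice cover starts from yesterday's.
carried = []
level = 2.0
for _ in t:
    level += 0.1 + rng.gauss(0, 1)
    carried.append(level)

r_independent = lag1_autocorrelation(residuals(t, independent))
r_carried = lag1_autocorrelation(residuals(t, carried))
print(f"independent data, lag-1 residual autocorrelation: {r_independent:.2f}")
print(f"carried-over data, lag-1 residual autocorrelation: {r_carried:.2f}")
```

A value near zero is what the independence assumption expects. A value near one says the residuals are carrying information from one day to the next, which is the situation with data collected over time.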

Even if you do not believe me, there is a standard rule for regression: a regression equation should only be used to predict values within the range of X values used in the analysis. Therefore, even if you choose to put time on the x-axis, you should not use the line to predict tomorrow, as the prior graphic does.

Examples of time as a predictor (fails)

The global warming topic has provided a few examples; here is one.

[Figure: future global warming]

This chart is probably the worst example that I could find. It does not matter which side you are on, belief or non-belief in climate change; we should all accept that this chart is ridiculous. We should treat these data as we would stock data: past performance does not guarantee future results.

In another case, I had a black belt student from a technology company who examined IRS divisions revenue vs. month and found a positive slope. Revenue was going up over time. He then used a regression on this data to show that the revenue would reach XX level within 12 months, achieving the goal set by the company. In this example, we should clearly see the problem. Revenue in one month is not only dependent on the revenue of the month before, it is also influenced by many things other than the month. A bad choice.

What is the right tool to use with data over time?

Using historical data to predict future performance can be easy or it can be nearly impossible.

The easy case is when you are observing a process that is consistent, or stable, over time, as demonstrated with a control chart. It is fair to say that such a process is predictable and will produce similar performance in the future, until the process changes.
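As a rough sketch of the easy case, the code below computes limits for an individuals (XmR) control chart, which is one common type of control chart and my assumption here, since any suitable chart would do. The baseline values are hypothetical, the function names are my own, and 2.66 is the usual constant for converting the average moving range into three-sigma limits on an individuals chart.

```python
def xmr_limits(values):
    """Individuals-chart limits: mean +/- 2.66 * average moving range."""
    center = sum(values) / len(values)
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    return center - 2.66 * mr_bar, center, center + 2.66 * mr_bar


def points_outside(values, lcl, ucl):
    """Indices of the values that fall outside the control limits."""
    return [i for i, v in enumerate(values) if v < lcl or v > ucl]


# Hypothetical baseline measurements from a stable process.
baseline = [50, 52, 49, 51, 50, 48, 53, 50, 51, 49]
lcl, center, ucl = xmr_limits(baseline)
print(f"expect future values between {lcl:.1f} and {ucl:.1f} (center {center:.1f})")

# Hypothetical new measurements checked against the baseline limits.
new_points = [51, 60, 49]
print("new points outside the limits:", points_outside(new_points, lcl, ucl))
```

While new points keep falling inside the limits, the limits themselves are the prediction. A point outside them is a signal that the process may have changed and the old prediction no longer applies.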

The difficult case is found when the data are auto-correlated, as in the time examples above or in the stock market. In this case you must use a different set of tools based specifically on time-series analysis and forecasting, topics that are not included in your typical statistics or Lean Six Sigma course. With these methods, your ability to predict future performance is reasonable over the next time period or two, but the uncertainty of a prediction grows rapidly the further into the future you predict. Most methods show you the prediction but fail to include the corresponding uncertainty or confidence interval. If the uncertainty were included, most of us would never have used a regression over time.
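To give a feel for how quickly the uncertainty grows, here is a small sketch using the simplest time-series model I know, the random-walk (naive) forecast. This is my choice for illustration, not a recommendation for any particular data set. Under that model the forecast for every future period is the last observed value, and the forecast standard error grows with the square root of the number of periods ahead. The series is hypothetical, and the 1.96 multiplier assumes roughly normal changes and gives an approximate 95% interval.

```python
import math


def naive_forecast_intervals(series, horizons, z=1.96):
    """Random-walk forecast: the point forecast is the last value and the
    interval half-width is z * sigma * sqrt(h) for h periods ahead."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    # Root-mean-square of the one-step changes, used as the one-step error.
    sigma = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    last = series[-1]
    intervals = {}
    for h in horizons:
        half_width = z * sigma * math.sqrt(h)
        intervals[h] = (last - half_width, last + half_width)
    return intervals


# Hypothetical monthly values (illustrative only).
series = [100, 102, 101, 104, 103, 105, 107, 106, 108, 110]
intervals = naive_forecast_intervals(series, [1, 2, 4, 12])

for h, (low, high) in intervals.items():
    print(f"{h:2d} periods ahead: {low:6.1f} to {high:6.1f}")
```

The interval four periods ahead is twice as wide as the interval one period ahead, and it keeps widening from there. A trend line drawn through the same points shows none of this.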

Final Message

Regression is a great tool as long as every pair of data is independent. If the pairs of data are not independent, then you should not use regression to fit the data.