Automated out-of-control identification for control charts

Forrest Breyfogle and I were talking yesterday about a client discussion.

The client has, and wants to keep, an automated system that detects an out-of-control process and signals management to take action.

Neither Forrest nor I believe in automatic out-of-control actions.  Why?

What does out-of-control mean?

Control charts are created to identify out-of-control conditions.  The out-of-control conditions are identified by one or more rules applied to the pattern of the data points.  These rules are often called the Western Electric run rules or the zone tests.  They include the following:

  1. One point outside of limits
  2. Two of three points in zone A or beyond
  3. Four of five points in zone B or beyond
  4. Six points in a row increasing or decreasing
  5. Nine points in a row on the same side of the center line

[Figure: control chart zones used by the out-of-control rules]

The probability of each rule being triggered by chance can be calculated from the probability of a data point falling in that zone, using the normal distribution CDF.
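
As a rough sketch of that calculation, the snippet below works out the chance of a few of the rules firing on purely random, normally distributed data. The helper names are made up for illustration, and the one-side versus both-sides counting is an assumption: published values differ depending on which convention is used.

```python
from math import comb
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sigma 1


def tail_beyond(z):
    """Probability that one point falls more than z sigma above the center line."""
    return 1.0 - STD_NORMAL.cdf(z)


def at_least_k_of_n(k, n, p):
    """Probability that at least k of n independent points each hit an event of probability p."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


# Rule 1: one point beyond 3 sigma, counting both sides of the chart.
rule1 = 2 * tail_beyond(3)

# Zone A rule: at least 2 of 3 points beyond 2 sigma, on one given side (assumed convention).
zone_a_one_side = at_least_k_of_n(2, 3, tail_beyond(2))

# Same-side rule: 9 points in a row above the center line (one given side).
nine_one_side = 0.5**9

print(f"rule 1, both sides:        {rule1:.4f}")
print(f"2 of 3 in zone A, 1 side:  {zone_a_one_side:.4f}")
print(f"9 in a row, 1 side:        {nine_one_side:.4f}")
```

The same helper covers the zone B rule by changing k, n, and the sigma distance passed to tail_beyond.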

So what does out-of-control mean?  It means that there is a pattern in the data that is improbable in truly random data.  But such patterns can occur in a stable random process, albeit not very often.  This is why Dr. Shewhart did not write that we should call a process out-of-control based only on a rule violation; his guidance was to investigate.

A process should not be considered out-of-control until a rule has been triggered and an investigation has determined that the cause is a true change in the process.

If a run rule was triggered and no cause was identified, it could have been a random event, so the process can still be considered stable and in-control.

Reasons to investigate out-of-control conditions

The prior section showed how an out-of-control condition is identified through a probabilistic analysis.  Each of the run rules has a probability of random occurrence of less than 0.3%.  This means each one occurs randomly roughly once in every 300 to 500 data points collected.  This seems infrequent enough to allow us to treat each event as a true special cause, but it is not.

If we apply the three run rules calculated above on a single chart, and treat them as independent, the probability of a random event triggering at least one of the rules becomes 1-(1-0.0027)*(1-0.0019)*(1-0.0028) = 0.0074 or 0.74%, which is about 1 in 135 data points.

If we applied all five rules, assuming each has a 0.002 probability of a random occurrence, then the probability of at least one of the five rules being triggered is 1-(1-0.002)^5 = 0.00996, or about 1%.  This would mean that 1 out of every 100 points could be a random event that triggers a false indication of a process change.  Most of us would consider this too frequent for a false indication of a change.
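
The arithmetic in the last two paragraphs can be checked with a few lines of Python. This is only a sketch: the function name is made up, and it treats the rules as independent, which is the same simplification the formula above makes.

```python
from math import prod


def any_rule_false_alarm(rule_probs):
    """Chance that at least one rule fires by chance, treating the rules as independent."""
    return 1.0 - prod(1.0 - p for p in rule_probs)


# The three per-rule probabilities quoted in the text.
three_rules = any_rule_false_alarm([0.0027, 0.0019, 0.0028])

# Five rules, each assumed to have a 0.002 chance of a random occurrence.
five_rules = any_rule_false_alarm([0.002] * 5)

for label, p in [("three rules", three_rules), ("five rules", five_rules)]:
    print(f"{label}: {p:.4f}  (about 1 in {1 / p:.0f} points)")
```

Adding more rules to the list shows how quickly the combined false-alarm rate grows.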

Business processes and the run rules

Every calculation above assumes that the process data are random draws from a normal distribution, which is what makes the run-rule probabilities, and even the probability of a point outside the control limits, quite small. This assumption is nearly always WRONG.

To have a random process output, there can be no changes in the process inputs or execution. That would mean only one person executes the process, and their competency and skill are unchanging hour-to-hour and day-to-day. The raw materials (or transactions) are identical and unchanging hour-to-hour and day-to-day. All of the equipment and systems used in the process have no wear-out or degradation hour-to-hour and day-to-day. These conditions do not exist in actual business.

Since these perfect conditions do not exist, there will be more false indications of change (out-of-control signals) than you would expect if everything were random. So the probability of an out-of-control signal calculated above is probably the best-case value. So what does this mean to us? Do not set a lot of run-rules on your control charts.
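
A small simulation can make this concrete. The sketch below is an illustration under assumptions of my own, not a model of any real process: it compares purely random data with a hypothetical process in which half of each deviation carries over to the next point (standing in for gradual wear or changing inputs), and counts how often rule 1 fires on an individuals chart with moving-range limits.

```python
import random


def rule1_rate(values):
    """Fraction of points outside individuals-chart limits (center +/- 2.66 * average moving range)."""
    center = sum(values) / len(values)
    mr_bar = sum(abs(b - a) for a, b in zip(values, values[1:])) / (len(values) - 1)
    half_width = 2.66 * mr_bar
    return sum(abs(x - center) > half_width for x in values) / len(values)


def simulate(n, carryover, rng):
    """Hypothetical process: each value keeps a fraction `carryover` of the previous deviation."""
    values, prev = [], 0.0
    for _ in range(n):
        prev = carryover * prev + rng.gauss(0.0, 1.0)
        values.append(prev)
    return values


rng = random.Random(1)  # fixed seed so the run is repeatable
independent_rate = rule1_rate(simulate(20_000, 0.0, rng))
carryover_rate = rule1_rate(simulate(20_000, 0.5, rng))

print(f"purely random data:    {independent_rate:.4f}")
print(f"data with carryover:   {carryover_rate:.4f}")
```

In this sketch there is no process change in either series, yet the series with carryover trips rule 1 far more often than the purely random one.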

My recommendation to all people using control charts has three parts.

  1. Only use an individuals control chart to monitor a business process.  This guideline will make your process management more successful because a moving range gives a better estimate of the true common cause variation than a subgroup range does.  The moving range brings more than just short-term variation into the control limits.  This is one of the precepts of the 30,000-foot-level methods documented by Forrest and taught here at Smarter Solutions.
  2. Only apply run-rules that are sensitive to the most probable failure modes in the process.  If a process failure will cause the process mean to start trending, use the trending rules or the points on one side of the mean rule.  If the process will fail with a shift in the mean, use the rules that quickly detect a shift such as 2-of-3 in zone A.  Of course, you should always use rule-1, which is one point outside of the control limits.
  3. Always investigate an out-of-control signal before declaring it a special cause.  The chance that an out-of-control signal is a random event is too high to just trust the rules through automation.  An out-of-control signal is somewhat like a criminal indictment and the investigation is like the trial to determine guilt.
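
The three recommendations can be sketched together in code. This is a minimal illustration, not the 30,000-foot-level method itself: the function names and the cycle-time data are made up, only rule 1 and the same-side rule are checked, and every signal is returned as a prompt for a person to investigate rather than as a verdict.

```python
def individuals_chart(values):
    """Return (LCL, center, UCL) using the average moving range of consecutive points."""
    center = sum(values) / len(values)
    mr_bar = sum(abs(b - a) for a, b in zip(values, values[1:])) / (len(values) - 1)
    # 2.66 = 3 / 1.128, where 1.128 is the d2 constant for ranges of two points.
    return center - 2.66 * mr_bar, center, center + 2.66 * mr_bar


def signals_to_investigate(values, run_length=9):
    """Return (index, reason) pairs; each one is a question for a person, not a conclusion."""
    lcl, center, ucl = individuals_chart(values)
    signals = []
    run_side, run_count = 0, 0
    for i, x in enumerate(values):
        if x < lcl or x > ucl:
            signals.append((i, "rule 1: point outside control limits"))
        side = (x > center) - (x < center)  # +1 above, -1 below, 0 on the center line
        if side != 0 and side == run_side:
            run_count += 1
        else:
            run_side, run_count = side, (1 if side != 0 else 0)
        if run_count == run_length:  # flagged once, when the run first reaches run_length
            signals.append((i, f"{run_length} points in a row on one side of the center line"))
    return signals


# Hypothetical cycle times with one unusually long value.
cycle_times = [10, 11, 9, 10, 12, 9, 11, 10, 9, 11, 10, 25, 10, 9, 11, 10]
for index, reason in signals_to_investigate(cycle_times):
    print(f"investigate point {index}: {reason}")
```

Which rules go into signals_to_investigate should follow recommendation 2: add only the ones that match how the process is likely to fail.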

Bottom line: do not automatically trust the run-rules to identify special causes.  They identify only non-random patterns that must be investigated to determine if they are special causes.

Dependence on Normality

Many authors write that control charts do not require normality. Although it is true that the data does not need to be normally distributed to create a control chart, all of the run-rules are based on the probabilities of the normal distribution. If the original data reports the time to execute a process, it usually follows a lognormal distribution. This distribution is skewed toward longer times, which puts more high values beyond the upper control limit and creates even more false indications of a process change.

Dealing with non-normality involves averaging or transforming the data prior to assessing stability with an individuals chart. This is beyond the scope of this post.
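
Without going into the full treatment, a short simulation illustrates the effect. It is a sketch under assumed conditions: the process times are drawn from a lognormal distribution with made-up parameters, and a log transform stands in for whichever transformation suits the real data.

```python
import math
import random


def fraction_outside(values):
    """Fraction of points outside individuals-chart limits (center +/- 2.66 * average moving range)."""
    center = sum(values) / len(values)
    mr_bar = sum(abs(b - a) for a, b in zip(values, values[1:])) / (len(values) - 1)
    half_width = 2.66 * mr_bar
    return sum(abs(x - center) > half_width for x in values) / len(values)


rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical process times: stable, random, and skewed toward long values.
times = [rng.lognormvariate(0.0, 0.8) for _ in range(10_000)]

raw_rate = fraction_outside(times)
log_rate = fraction_outside([math.log(t) for t in times])

print(f"raw times:        {raw_rate:.4f} of points outside the limits")
print(f"log of the times: {log_rate:.4f} of points outside the limits")
```

Nothing changes in this simulated process, so every point outside the limits is a false signal; charting the raw, skewed times produces many more of them than charting the transformed values.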

Bottom Line

  1. Only use an individuals chart for business process management
  2. Apply only the run-rules related to known process change modes
  3. If the data is significantly non-normal, transform or average the data.
  4. Every out-of-control signal should be investigated by a person before it is considered as a special cause event.