Bad data collection or a special case in time data?

A common issue with transactional projects arises when collecting data from the data warehouse. When you request a year of data, you are given all the jobs that were worked in the year, which to a programmer may mean the jobs that both started and stopped in the year requested. This is a bad assumption.

The problem is that if the data collection requires both the start and stop dates to fall in a single year, you will miss the jobs that started in the prior year and finished in the requested year. As a result, the early part of the year contains only short-duration jobs.

I have included charts to show what I am trying to describe. They compare two data collections.

1) Collect all the jobs that stopped in 2008, no matter when they started.
2) Collect only the jobs that both started and stopped in 2008.
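As a minimal sketch of the two collections, assume each job is a tuple of (job id, start date, stop date). The records and the function names below are hypothetical and only illustrate the two filters; a real warehouse query would express the same conditions in its own language.

```python
from datetime import date

# Hypothetical job records: (job_id, start_date, stop_date)
jobs = [
    ("A", date(2007, 11, 20), date(2008, 1, 15)),  # started in prior year
    ("B", date(2008, 1, 5), date(2008, 1, 12)),
    ("C", date(2008, 3, 1), date(2008, 6, 30)),
    ("D", date(2008, 12, 1), date(2009, 2, 1)),    # stopped in following year
]

def started_and_stopped_in(jobs, year):
    """Jobs whose start AND stop dates both fall in the given year."""
    return [j for j in jobs if j[1].year == year and j[2].year == year]

def stopped_in(jobs, year):
    """Jobs whose stop date falls in the given year, whatever the start date."""
    return [j for j in jobs if j[2].year == year]

both = [j[0] for j in started_and_stopped_in(jobs, 2008)]
stopped = [j[0] for j in stopped_in(jobs, 2008)]
print(both)     # ['B', 'C']
print(stopped)  # ['A', 'B', 'C']
```

Job A is the kind of record that goes missing: it closed in January 2008 but started in 2007, so the stricter filter drops it.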

For case 1 I plotted an i-chart showing the data by transaction. This is clearly a poor choice of chart due to the skewed nature of the data.

Then I plotted it with a lambda=0 (natural log) transformation and all looks better. This is a reasonable transform since the data are time durations, which are commonly modeled with a lognormal distribution.
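For readers who want to reproduce the transformed chart, here is a rough sketch. It assumes the usual individuals chart limits (center line plus or minus 2.66 times the average moving range) and uses a handful of made-up durations; the function name i_chart_limits is hypothetical and not taken from any particular package.

```python
import math

def i_chart_limits(values):
    """Return (lower, center, upper) limits for an individuals chart."""
    center = sum(values) / len(values)
    # Moving ranges: absolute differences between consecutive points
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    # 2.66 is the standard individuals chart constant (3 / 1.128)
    return center - 2.66 * mr_bar, center, center + 2.66 * mr_bar

# Hypothetical job durations in days, in order of completion
durations = [12, 45, 30, 8, 150, 22, 60, 35, 18, 90]

# lambda=0 transform: take the natural log before charting
log_durations = [math.log(d) for d in durations]
lcl, cl, ucl = i_chart_limits(log_durations)
print(round(lcl, 2), round(cl, 2), round(ucl, 2))
```

The limits are in log units; to read them in days, take the exponential of each value.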

In the second case, I again plotted an i-chart showing the data by transaction. You can see the wedge shape in the early months, but you can also see that it is the wrong chart for data this skewed.

Next I created an i-chart with a lambda=0 transform, as in the earlier case. You can still see the wedge in the early part of the year.

Maybe this is difficult to see in these charts, so I also plotted the mean and standard deviation charts for each case. This is the procedure recommended in our courses, and all it does is make the second case appear to have an early-year trend.

Compare that with the i-charts of the mean and standard deviation computed from all the jobs that closed in 2008. All looks fine.

So was there a special cause change in the process, or was it poor data collection?

You now know what to look for!

For this simulation I used a lognormal(3.5, 0.8) distribution, rounded to an integer, for the duration of the jobs. I simulated 2000 jobs with 2008 finish dates spread randomly throughout the entire 365-day year. One analysis includes only the jobs with a start date in 2008, while the full set includes all the data.
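The simulation can be approximated with the sketch below. It is not the original code: the seed, the minimum duration of one day, the 30-day "months" and the helper name mean_log_by_month are all assumptions made for illustration.

```python
import math
import random
from statistics import mean

random.seed(2008)  # arbitrary seed, used only so the run is repeatable

jobs = []
for _ in range(2000):
    # Duration in days; the floor of 1 day is an assumption to keep log() defined
    duration = max(1, round(random.lognormvariate(3.5, 0.8)))
    finish_day = random.randint(1, 365)  # day of the year the job stopped
    start_day = finish_day - duration    # values below 1 fall in the prior year
    jobs.append((start_day, finish_day, duration))

all_stopped = jobs                                     # every job that stopped in the year
started_and_stopped = [j for j in jobs if j[0] >= 1]   # start date also in the year

def mean_log_by_month(jobs):
    """Mean of ln(duration), grouped by approximate 30-day month of the finish day."""
    groups = {}
    for _, finish_day, duration in jobs:
        month = min((finish_day - 1) // 30 + 1, 12)  # last 'month' absorbs days 331-365
        groups.setdefault(month, []).append(math.log(duration))
    return {m: mean(v) for m, v in sorted(groups.items())}

full = mean_log_by_month(all_stopped)
truncated = mean_log_by_month(started_and_stopped)

print("month  all_stopped  started_and_stopped")
for m in full:
    print(m, round(full[m], 2), round(truncated.get(m, float("nan")), 2))
```

In the full set the monthly means should hover around 3.5 all year. In the restricted set the early months sit well below that, because a job finishing in January can only be included if it lasted less than a month.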