Do I have numbers, data, or information?

Tim Akerman
Categories:   Data Analysis   Statistical Analysis  

Nearly everything we do generates numbers. Everyone feels comfortable with numbers because they give a perceived absolute measure of what is “right” and what is “wrong”. But what do the numbers really mean?

Numbers generated electronically also seem to generate automatic trust. If it comes from a calculator or computer, it must be right. We must learn to question numbers and information regardless of source, the computer or calculator is only as good as its programming.

Caveat emptor, buyer beware, has never been more relevant than when dealing with electronically generated numbers.

There is a difference between having numbers, having data, and having information.

Numbers can feel secure but may not be useful. For example, if the number that is used is either not related to the core process or is measured inaccurately, then the numbers may be misleading.

Numbers are generated from a process, perhaps a test. It is important to understand how the sample was selected and if the sample size is appropriate. If a test method is used, it is vital that we measure what is relevant and that the results don’t depend on the person doing the test or when it is done. Finally, we must ensure that we understand how much variability is acceptable. Without knowing this, we can’t judge the risk we are taking when using the number to make a decision.

One example that has been experienced in print is coefficient of friction measurements from a supplier. The numbers presented appeared to be very consistent and in good process control, however when the method was examined using Gage r&R techniques it was found that the measurements could not even produce a consistent number for one batch of material. The confidence interval was so wide that there was a single category with category limits wider than the specification tolerance.

The result was random system tampering based on whether the number generated was inside or outside of the specification limits. There was no recognition that the test was inadequate for the specification limits applied.

It is critical to question the basis of numbers; failing to do this can result in taking the wrong action for the right reasons. All numbers we deal with should be checked to ensure that they are valid and reliable;

  • Valid means that the test measures a relevant parameter.
    If we are interested in liquid viscosity, taking a measurement set against an arbitrary reference point won’t work. It is necessary to measure the flow rate of the liquid, but liquids can have different reactions to how they are moved, usually referred to as shear. Most liquids get thinner, some don’t change, but a few liquids get thicker when you try to move them. One example of a viscosity measurement is the use of Zahn cups. If you have never heard of this, it is a small cup with a hole in the bottom and a handle on top. The time taken for liquid to flow out of the hole is related to liquid viscosity. They give an indication of flow, but absolute figures are hard to specify, since the tolerance of a cup is ±5%, and the method may not reflect the shear behaviour of the process. The viscosity also varies with temperature, so their use is more difficult than it may appear.
  • Reliable means that the test delivers the same number regardless of the operator or time of day.
    Taking the Zahn cup example again, different operators will time the stream differently, one will stop at the first appearance of drops in the stream, another will wait until there are drops from the cup itself.

Failing to adequately understand the methodology generating the numbers and the sources of variance (error) can lead to poor decisions and mistakes. For example, am I measuring real changes that I am interested in, or changes in operator or the test method, which are more likely to mislead me?

Once the basis of the numbers is understood, the stream of numbers then becomes reliable. The next step is to create data. This can be dangerous ground if numbers are used out of context. Distilling numbers to a single number, for example a compound statistic such as average and making judgements on that number can lead to problems, if the degree of variation is not understood.

William Scherkenbach has observed that

“We live in a world filled with variation – and yet there is very little recognition or understanding of variation”

Another way to look at it is this; if my head is in the oven and my feet are in the freezer, my average temperature may well be 20°C, but am I happy?

This is what makes single numbers dangerous.  A cosmetic product recently advertised 80% approval based on a sample of 51 users. Using statistical analysis, the true range of approval is somewhere between 67% and 90%. That’s not quite the same message, is it?

If there is no concept of variation, it is possible to either make an incorrect decision based on inadequate data or waste massive amounts of time and resource trying to address perceived “good” or “bad” numbers by looking for one off causes that are just normal variation. One way of understanding if the variation we observe is common cause or special cause is to use a control chart. There are several statistical signals that alert a knowledgeable operator that something is changing in the process.

There is a significant risk in chasing special cause variation with common cause approaches and vice-versa. Most often a number will be deemed too high or too low arbitrarily, because someone “knows what good looks like”. The danger when using common cause solutions for special cause variation is that the system will vary randomly and without warning. This will create confusion and time will be lost in trying to recreate the conditions as a controlled part of the system. Over time the conditions giving rise to the observed variation will conflict, since the test parameters do not control the variation.

Using special cause solutions for common cause variation will result in arbitrary attribution of cause because no rational cause can be found. Therefore, the first change that shows any sign of improvement becomes the root cause.

What needs to be done to give numbers and data meaning?

Numbers only have meaning in context. This means that it is necessary to consider two features of any data – location and dispersion. The representation of these two measure most commonly used are average and standard deviation. Even her it is important to have clarity about what the average is intended to demonstrate and why we are interested in the average.

We all know what the average is don’t we? Most people will add all the numbers up divide by the count of the numbers. This is called the mean. There are however two other versions of the average, the median – the middle number of a series, and the mode – the most common number in a series.

Why does it matter if I use mean instead of median for example? What difference does it make?

Mean is fine if the data is normally distributed, however if the data is not normally distributed there can be a significant difference between the arithmetic mean and the median, or middle number. Extreme values can significantly skew the average and mislead the user about the true position of the average. It is for this reason that one must understand the data distribution before deciding whether to use mean or median data. If one is looking for the most common occurrence, then clearly the mode is relevant.

How much variation is there?

The standard deviation is the average difference between the data and the average of all the data. In other words, how close to the average is each data point. The bigger the standard deviation, the more uncertain we are about the average. The standard deviation uses a root mean square function, so as the number of data points increases the standard deviation will reduce. If your result is too variable, take a larger sample.

Using only the mean, it could be tempting to move the mean of a process to the specification limit, perhaps to maximise profit. This can be fine, if the distribution is narrow and tall, moving closer to the limit will not result in failures. However, if the distribution is broader and flatter, it may be that the distribution only just fits in the tolerance window. If this is the case, moving the mean to the limit will result in a significant proportion of failures. It is possible that more than half of the product could fail, particularly if the sample size means the window for the true mean is large. If the true mean could fall outside of the specification, you are taking a huge risk!

This is because the average is not an absolute. The only way to have an absolute average is to have all possible data points – the population. This is often impossible, for example if the testing is destructive so we take a sample. We need to understand how close our sample standard deviation is to the population standard deviation. From the mean and standard deviation information, we can calculate a third parameter, the confidence interval for the average. The confidence interval expresses the range within which we would expect the true population mean to lie. Most statistics use a 95% confidence level, this means that 95% of the samples taken under the same conditions would give a result within this range.

These three pieces of information convert the number to information by considering in context of all the data.

How does this help us to understand?

Now it is possible to see

  • The location of the average
  • How closely can we predict the population average from the sample average
  • The location of a data point
    • how close is each number to the average?
  • The dispersion of a data point
    • how variable are the numbers that make up the data point?

With this information, it is possible to use the data to create information which can be used to make decisions.

For example, it is possible to determine if a number is within the normal spread of values for a measurement parameter. If it is, why search for some special meaning, if it is not, we can ask what happened?

In conclusion then using numerically based information to make decisions is a good thing to do. However, before we use the numbers we must be certain the numbers are valid and reliable. Then we can provide a framework of context and we can provide limits inferred from the data itself.

The final step is to apply logic and reason to the patterns revealed to create actionable information from the data. This involves application of the scientific method. Management theory teaches that data driven decision making is the only rational way to make decisions.

A quote from W. Edwards Deming seems appropriate at this point

“If you do not know how to ask the right question, you discover nothing.”

It doesn’t matter how good your numbers, how well those numbers have been converted to data, you will not gain information if you don’t know what question to ask.

Hypothesis testing

Demonstration with a confidence level of 95% that a hypothesis is true or false is not perfect, but it does provide a better mechanism for decision making than superstitious belief or gut feel.

W.E.B. Du Bois has probably stated the truth of data based decision making most eloquently

“When you have mastered numbers, you will in fact no longer be reading numbers, any more than you read words when reading books. You will be reading meanings.”

Next time you are looking at information to decide a course of action, make sure it demonstrates what you intended. Is your decision-making process based on numbers, data, or information? If your decision-making process uses numerical information, do you know what questions to ask? Ask how the data was sampled, what was the sample size? Is it appropriate?

There is an interaction here with one of my previous blog posts, Three simple questions.

Next time you have a decision to make define your objective clearly, question wisely, obtain relevant valid and reliable information, and use this as a basis for making sound decisions.