## STATISTICS AT THE PINNACLE-PART 2

I’d talk about the lottery and expected values.  People play the lottery because eventually somebody wins.  We can predict quite accurately the probability that somebody will win.  “You see,” I’d say, “low probability events happen; they just happen with low probability.  Take a lottery with a 1 in 110 million chance of winning.  If 330 million tickets have been bought, the expected number of jackpot winners is 3.  That doesn’t mean that 3 will win, but it is what we expect on average.  We can easily, and I mean easily, calculate the probability of 0, 1, 2, 3, and 4 winners with a calculator and a few keystrokes.  Three people nationwide might win.  Three people, in the entire country.  Yes, it has to be somebody.  But do you think it is going to be you?”
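For anyone who would rather use a few lines of Python than a calculator, here is a minimal sketch of that calculation.  It assumes every ticket is an independent random pick, so the number of winners is roughly Poisson with a mean of 3; the function name `poisson_pmf` is mine, not a library call.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected count is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

tickets_sold = 330_000_000
odds_against = 110_000_000          # 1 in 110 million chance per ticket
expected_winners = tickets_sold / odds_against   # 3.0

# Assumption: tickets are independent random picks (Poisson approximation).
probs = {k: poisson_pmf(k, expected_winners) for k in range(5)}
for k, p in probs.items():
    print(f"P({k} winners) = {p:.3f}")
```

Under those assumptions there is about a 5% chance that nobody wins at all, and exactly 2 or exactly 3 winners are equally likely, at a bit over 22% each.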

If you have a disease, you have certain symptoms.  Medicine studies people who have certain symptoms and tries to figure out the probability that they have a disease.  Physicians and others would do well to understand that not all who test positive for a disease have the disease.  “Suppose a disease has a 0.1% prevalence in the population; that is, 1 in 1000 people has it.  (We would do well to teach percentages early in math, and often, too.)  Suppose that if you have the disease, you test positive for it 98% of the time.  If you don’t have the disease, you test negative for it 99% of the time.  You test positive.  What is the likelihood you have the disease?

“What is important here is the background frequency of the disease.  Of 100,000 people, 100 have the disease and 98 of them test positive; of the 99,900 who don’t, about 999 test positive anyway.  So only 98 of roughly 1,097 positive tests belong to people who are actually sick.  The fact the disease occurs in only 1 of 1000 means that it is unlikely somebody who tests positive will have the disease: only about 9%.”
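The 9% figure comes from Bayes’ rule, and it can be checked in a few lines.  This is only a sketch using the numbers from the example above; the function and parameter names are my own labels for prevalence, the chance of a positive test when sick (sensitivity), and the chance of a negative test when healthy (specificity).

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Probability of having the disease given a positive test (Bayes' rule)."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)

# Numbers from the example: 0.1% prevalence, 98% sensitivity, 99% specificity.
ppv = positive_predictive_value(prevalence=0.001,
                                sensitivity=0.98,
                                specificity=0.99)
print(f"Chance a positive test means disease: {ppv:.1%}")
```

Try raising the prevalence to 10% and the answer jumps above 90%, which is the whole point: the same test means very different things in different populations.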

Anybody remember W. Edwards Deming?  He was ignored here but found the Japanese receptive to his ideas about data analysis and optimizing systems.  The Japanese cleaned our clocks in the automotive industry before the Big Three caught on, not because Japanese cars were fancy, but because they worked.  There is an apocryphal story about how a Japanese company was told by an American buyer that no more than 4% of ball bearings should be faulty.  In the next shipment, 4 at the top of every box were faulty.  When asked why they were there, the company spokesman said, ‘you didn’t want more than 4% faulty.  Here they are, on top.  The rest are perfect.’

“Deming taught that variability could be classified as ‘common cause’ (noise) and ‘special cause’ (signal, important).  It was he who said that treating every variation as significant was not only wasteful; such ‘tinkering’ made the process worse.  How often do we hear that, say, a city’s murder count is higher than last year’s, and then hear somebody pontificate an explanation?  Have you ever heard that this is common cause variability, and that if you want to lower the murder number, you need to address the entire system?
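One common way to separate noise from signal for yearly counts is a Shewhart c-chart, which puts limits at the mean plus or minus 3 times the square root of the mean.  The sketch below is my illustration, not necessarily the exact procedure Deming would have used; it assumes the counts behave roughly like a Poisson process, and the yearly numbers are made up purely to show the mechanics.

```python
import math

def c_chart_limits(counts):
    """Return (lower, center, upper) 3-sigma limits for count data."""
    center = sum(counts) / len(counts)
    sigma = math.sqrt(center)            # Poisson assumption: variance = mean
    return max(0.0, center - 3 * sigma), center, center + 3 * sigma

def special_causes(counts):
    """Counts that fall outside the control limits (candidate signals)."""
    lower, _, upper = c_chart_limits(counts)
    return [c for c in counts if c < lower or c > upper]

# Hypothetical yearly counts, invented for illustration only.
yearly_counts = [48, 55, 51, 60, 47, 53, 58, 50]
lower, center, upper = c_chart_limits(yearly_counts)
print(f"limits: {lower:.1f} to {upper:.1f}, center {center:.2f}")
print("points needing an explanation:", special_causes(yearly_counts))
```

With these invented numbers, a year of 60 following a year of 51 sits comfortably inside the limits.  It is noise, and no explanation is required.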

Samples have to be random, which is a way of saying everybody in the population, the group of people one is studying, has a definable non-zero chance of being chosen.  That doesn’t mean, ‘They didn’t ask me, so the sample is no good.’  It’s no good if the sample is taken in the Deep South and the sampler wants to extrapolate it to the whole country.  Somebody living in New York or Ohio never had a chance of being sampled.  Most people think large samples mean more useful results, but bias in a sample of 200 continues to be bias in a sample of 200,000 if the methodology doesn’t change.  The mathematics of sampling is not difficult to understand.  If one is willing to be a little less confident, 90% rather than 95%, and to let the margin of error for a dichotomous (yes-no) question rise to 8 or 9%, rather than 1-2%, the sample size needed decreases dramatically.
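To put numbers on that, here is a minimal sketch of the usual sample-size formula for a yes-no question, n = z² p(1−p) / E².  It assumes a simple random sample from a large population and uses the worst case p = 0.5; the function name `sample_size` is mine.

```python
import math
from statistics import NormalDist

def sample_size(confidence, margin, p=0.5):
    """Respondents needed for a yes-no question at a given margin of error."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z**2 * p * (1 - p) / margin**2)

print(sample_size(0.95, 0.02))   # 95% confidence, 2% margin
print(sample_size(0.90, 0.08))   # 90% confidence, 8% margin
```

The first call asks for 2,401 people; the second asks for 106.  Neither number fixes a biased sampling method, of course.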

“I’ve worn out my welcome, but let me finish by mentioning the concept of 2-3 standard deviations from the mean, which most people take as marking a significant outlier.  That all depends on whether the curve is bell-shaped.  If it is, then the probability of something more than 2 standard deviations from the mean is about 5%.  But it is possible, depending upon the distribution of the data, to have up to 25% of items more than 2 standard deviations from the mean, hardly a significant outlier.  For 3 standard deviations, it is about a 3 in 1000 chance with a bell-shaped curve, but with some distributions, up to 11% of the observations.  I wonder how many who have been 2-3 standard deviations on the wrong side of the curve have been punished unjustly.  Five standard deviations?  At most 4%.  It is 1/25, or 1 over 5 squared.  This is known as Chebyshev’s Inequality.
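The two sets of numbers can be put side by side.  This sketch compares the Chebyshev bound, 1/k², which holds for any distribution with a finite standard deviation, against the two-sided tail of a normal (bell-shaped) curve; the function names are my own.

```python
from statistics import NormalDist

def chebyshev_bound(k):
    """Largest possible share of data more than k SDs from the mean."""
    return 1 / k**2

def normal_two_sided_tail(k):
    """Share of a normal distribution more than k SDs from the mean."""
    return 2 * (1 - NormalDist().cdf(k))

for k in (2, 3, 5):
    print(f"{k} SD: bell curve {normal_two_sided_tail(k):.5%}, "
          f"worst case {chebyshev_bound(k):.1%}")
```

The gap between the two columns is the reason to look at the shape of the data before calling anything an outlier.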

“Finally, I would like to see students learn how to make good graphs instead of the ones I see today.  I would make Edward Tufte’s books required reading.  I would like to see more line graphs, dot plots, and box-and-whisker graphs, with fewer multi-color pie charts.  I said I could go on for three more pages about statistics.  I have.  Statistics turns up in many day-to-day encounters, and it is often used poorly, both by those who don’t know it and, worse, by those who want to fool you.  It’s not lies, damned lies, and statistics but rather lies and damned people who lie using statistics.”
