What's lurking in your Big Data: Bias

It’s a sad, hard truth: Big Data is biased. If this seems hard to accept, consider how bias lurks in data and analysis.

The most common bias in data is selection bias. This occurs when the data is collected incorrectly, when a sample of the data is drawn incorrectly, or both. The most common bias in processing is interpretation bias: letting subjective predispositions shape how we use, process, and understand data.

While there are dozens of named data and analytical biases, these two root causes explain most of them. Three examples can clearly illustrate the point: airplanes, smoking, and political polling.

Selection Bias: Airplanes

Nearly all data biases are some form of selection bias, though there are special names for some selection biases. A good example is survivorship bias.

According to internet lore, during World War II, the Navy wanted to understand which locations on its aircraft might benefit from additional armor. It was tempting to look at battle damage and add armor to the places where airplanes were full of holes.

Abraham Wald, whose work is summarized here, argued for something else. Wald’s equations are not an easy read, but his argument is simple. Assume that enemy bullets hit our aircraft at random, provided we consider enough bullets hitting enough airplanes. Now look at the density of bullet holes in the aircraft that come back. The highest density shows us where we need armor the least, because those airplanes survived without it. In contrast, the airplanes that didn’t return had holes we can’t count.

So, the surviving airplanes create a tempting bias in the data, which might lead us in exactly the wrong direction. The survivors give us lots of data about where bullets hit, and unless we understand bias, we’d be tempted to protect those areas. Returning aircraft also show where we find the fewest holes, and therefore where we most desperately need armor protection.
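The survivorship argument can be sketched as a small simulation. This is a minimal illustration, not Wald’s actual method: the aircraft sections, the lethality numbers, and the function name simulate are all invented for the example. Hits land uniformly at random, but only the planes that survive every hit get counted on the ground.

```python
import random

# Hypothetical chance that a single hit in each section downs the plane.
# These numbers are invented for illustration only.
LETHALITY = {"engine": 0.6, "cockpit": 0.4, "fuselage": 0.05, "wings": 0.05}


def simulate(num_planes=10_000, hits_per_plane=3, seed=0):
    """Return (all_hits, survivor_hits): hit counts per section."""
    rng = random.Random(seed)
    sections = list(LETHALITY)
    all_hits = dict.fromkeys(sections, 0)
    survivor_hits = dict.fromkeys(sections, 0)
    for _ in range(num_planes):
        # Each bullet lands in a section chosen uniformly at random.
        hits = [rng.choice(sections) for _ in range(hits_per_plane)]
        # The plane returns only if it survives every one of its hits.
        survived = all(rng.random() >= LETHALITY[s] for s in hits)
        for s in hits:
            all_hits[s] += 1
            if survived:
                survivor_hits[s] += 1
    return all_hits, survivor_hits


if __name__ == "__main__":
    all_hits, survivor_hits = simulate()
    print("Hits on every plane:     ", all_hits)
    print("Hits on returning planes:", survivor_hits)
    fewest = min(survivor_hits, key=survivor_hits.get)
    print("Fewest holes among survivors:", fewest)
```

Under these assumptions, the true hits are spread about evenly, yet the returning planes show the fewest holes in the section where a hit is most lethal. Counting holes on survivors alone would point the armor at the wrong places.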

A caution on the story of Wald: the internet versions of the tale are probably mostly myth, with only germs of truth, according to Bill Casselman, writing for the American Mathematical Society. The real genius of Wald’s work went far beyond saving airplanes. It means that we can sometimes (when the conditions are right) estimate the degree of selection bias in our data.

Interpretation Bias: Smoking

In some cases, correct data processing can’t save us from our own biases. Today, if you do a web search on “cancer causes smoking”, you will get hundreds of results “correcting” the search input. The results will say smoking -> cancer, not cancer -> smoking, which is what we meant to write.

But R.A. Fisher, who was perhaps the greatest statistician of the early 20th century, believed a predisposition for cancer caused people to smoke. As one scholar said, “His views may also have been influenced by personal and professional conflicts, by his work as a consultant to the tobacco industry, and by the fact that he was himself a smoker.”

Fisher collected his works on this topic into a 1959 pamphlet. Although flawed, it makes some good points on the difficulty of collecting unbiased data, and doing unbiased analysis on anything related to humans. We might devise controlled experiments on the toughness of steel, or the genome of tulips, but humans are difficult. Ethics about how we treat our fellow beings, and the fibs they tell us when we ask about their habits are just two of the problems we face.

Knowing about bias was not enough to help Fisher fully overcome his own biases. He had the best data and the best analysis of his day, but he interpreted them incorrectly. While it’s easy in hindsight to criticize Fisher, it’s better to take this as a warning. Everyone is vulnerable to interpretation bias.

1936 Political Polling: Blending the Biases

In 1936, the Great Depression was dragging on. Franklin Roosevelt had been elected to a first term in 1932 following more than two years of economic decline. But consumer spending and investment continued to drop, and by 1933 nearly half of U.S. banks had failed. Roosevelt’s “New Deal” was energetic and controversial. The federal government became a massive employer. Even today, historians, legal experts and economists don’t agree about what worked, what was legal, and what was misguided. The political context was volatile. Radical ideas competed with established traditions in America and the rest of the world. Violent protests had begun before the 1932 elections and continued across the country after Roosevelt took office.

Could Roosevelt be elected to a second term? A recovery had begun, but the prosperity of the 1920s had not been restored. Government spending seemed reckless to many voters.

The ‘36 election seemed unique. So, one of the most respected publications of the day, Literary Digest, conducted the most ambitious political survey ever attempted: it mailed ten million mock ballots. The sheer size of this poll was impressive. It represented about 25% of the total voters in the nation.

Literary Digest had been accurately predicting presidential elections since 1916, and expected to be right again when it predicted a landslide for Landon with 57% of the vote, to 43% for Roosevelt. Landon was a moderate who accepted most of the New Deal, but wanted to rein in waste, corruption and inefficiency. The projected win for Landon was published in October 1936. The cover of that edition is shown above.

Instead of a dramatic win for Landon, the actual election was even more dramatic for Roosevelt, who received 61% of the popular vote, carried 46 of the 48 states, and received more than 90% of the electoral college.

How could such a big data set from such a reputable institution be wrong? Literary Digest blended selection bias with interpretation bias.

Selection bias in the poll was dramatic. While large and expensive, the poll depended on mailing addresses from telephone books, clubs, associations, and magazine subscriptions. This excluded anyone who could not afford club dues, a telephone, or a magazine. With record unemployment, many people lacked food, so these discretionary expenses were out of the question. Further, evictions of renters and homeowners made mailing lists inaccurate.

Interpretation bias could have been avoided. But Literary Digest was driven by a strong bias: its readership. Like the echo chamber of today’s social media, its readers were an audience with a perspective. They were the people who were getting by, or even prospering, during the Great Depression. The magazine reflected the view among these readers that the New Deal had gone too far. Landon’s call for moderation appealed to these readers, but not to the sharecropper forced off his farm, the worker who couldn’t find a job, or the school teacher who had gone months without pay.

George Gallup, a polling pioneer and founder of the Gallup poll, correctly forecast the election with a sample of about 50,000 voters, a fraction of Literary Digest’s response. Unlike the magazine, Gallup had a well-constructed (i.e., unbiased) sample.
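A short simulation shows why a small unbiased sample can beat a huge biased one. This is a sketch under invented assumptions, not a reconstruction of either poll: the share of voters reachable through mailing lists (LISTED_SHARE), the support levels in each group, and the sample sizes are hypothetical values chosen so that overall support lands near the 61% Roosevelt actually received.

```python
import random

# All three values are assumptions made up for this illustration.
LISTED_SHARE = 0.30      # share of voters who appear on the mailing lists
SUPPORT_LISTED = 0.43    # Roosevelt support among listed voters
SUPPORT_UNLISTED = 0.69  # Roosevelt support among everyone else

# Overall support implied by the assumptions above (about 61%).
TRUE_SUPPORT = (LISTED_SHARE * SUPPORT_LISTED
                + (1 - LISTED_SHARE) * SUPPORT_UNLISTED)


def draw_vote(rng, listed_only):
    """Return True if one sampled voter supports Roosevelt."""
    # A list-based poll can only ever reach listed voters.
    listed = True if listed_only else rng.random() < LISTED_SHARE
    support = SUPPORT_LISTED if listed else SUPPORT_UNLISTED
    return rng.random() < support


def poll(sample_size, listed_only, seed=0):
    """Estimate Roosevelt's share of the vote from a simulated sample."""
    rng = random.Random(seed)
    votes = sum(draw_vote(rng, listed_only) for _ in range(sample_size))
    return votes / sample_size


if __name__ == "__main__":
    big_biased = poll(500_000, listed_only=True)
    small_random = poll(50_000, listed_only=False)
    print(f"Assumed true support:         {TRUE_SUPPORT:.3f}")
    print(f"Large list-based sample:      {big_biased:.3f}")
    print(f"Small random sample (50,000): {small_random:.3f}")
```

Adding more ballots to the list-based poll only makes it more precise about the wrong population. Its estimate settles on the support among listed voters, while the much smaller random sample lands close to the true figure.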

By the end of November, the cover of Literary Digest read “Is Our Face Red!” and by 1938, the proud old title was gone.

About Lone Star Analysis

Lone Star Analysis enables customers to make insightful decisions faster than their competitors. We are a predictive guide bridging the gap between data and action. Prescient insights support confident decisions for customers in Oil & Gas, Transportation & Logistics, Industrial Products & Services, Aerospace & Defense, and the Public Sector.

Lone Star delivers fast time to value supporting customers’ planning and ongoing management needs. Utilizing our TruNavigator® software platform, Lone Star brings proven modeling tools and analysis that improve customers’ top line, by winning more business, and improve the bottom line, by quickly enabling operational efficiency, cost reduction, and performance improvement. Our trusted AnalyticsOS℠ software solutions support our customers’ real-time predictive analytics needs when continuous operational performance optimization, cost minimization, safety improvement, and risk reduction are important.

Headquartered in Dallas, Texas, Lone Star is found on the web at http://www.Lone-Star.com.
