A classic example comes from politics. In 1936 the Literary Digest, a well-respected publication, set out to predict the winner of the US presidential election. After the magazine had conducted an enormous survey with 2.4 million respondents, the winner seemed evident: the Republican candidate, Alfred Landon, would win an overwhelming victory. At the same time, a new and then unknown research company, the American Institute of Public Opinion, conducted a much smaller survey of only 50,000 people. It indicated the opposite outcome: a big win for the other candidate, F. D. Roosevelt. The company’s founder, a certain George Gallup (who gave his name to the Gallup poll), was of course correct. Roosevelt won easily.
What had gone wrong? How can 48 times more data be worse? The answer lies in a false assumption that is often made with Big Data. We assume that the data we have is all the data (often expressed as “N = All”), which is very rarely the case. The fact is that if all ten million people who were sent a ballot in the Literary Digest survey had chosen to reply, the magazine would at least have predicted the right candidate. Selecting the right data (sampling) and investigating what data you do not have are preconditions for an analysis that produces a fair result. Working with small but representative data sets also has other benefits, such as speed, cost-efficiency and manageability.
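The effect can be illustrated with a small simulation. This is only a sketch with made-up numbers: the electorate size, the 60% true support and the response probabilities are assumptions chosen for illustration, not the actual 1936 figures. It compares a large poll in which one candidate’s supporters are less likely to reply with a much smaller simple random sample.

```python
import random


def share(votes):
    """Fraction of the votes in a sample that go to candidate A."""
    return sum(votes) / len(votes)


def simulate(seed=1936):
    rng = random.Random(seed)

    # Hypothetical electorate: True means "votes for candidate A".
    true_support = 0.60
    population = [rng.random() < true_support for _ in range(500_000)]

    # Large but skewed poll: everyone is asked, but we ASSUME that
    # A's supporters reply less often than the other side's supporters.
    respond_prob = {True: 0.15, False: 0.35}
    big_biased = [v for v in population if rng.random() < respond_prob[v]]

    # Small representative poll: a simple random sample of 2,000 voters.
    small_random = rng.sample(population, 2_000)

    return share(population), share(big_biased), share(small_random), len(big_biased)


truth, big, small, n_big = simulate()
print(f"True support for A:           {truth:.1%}")
print(f"Large skewed poll (n={n_big:,}): {big:.1%}")
print(f"Small random poll (n=2,000):  {small:.1%}")
```

Under these assumptions the large poll points to the wrong winner no matter how many replies it collects, because more replies only reproduce the same skew. The small random sample lands close to the true figure.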
There is another assumption made in the field of Big Data, especially in sub-segments such as predictive analysis and machine learning, where people look for patterns: that correlation is the same as causality. This is a recurring problem behind poor research, poor base data and poor decisions. Correlation (how closely the variation in two things corresponds) does not in itself say anything about causality (whether one thing causes another). If you have no logical explanation for why one pattern indicates that another will occur, the predictive ability is only good until it is not good any more. Predictive models can therefore break down at any time, for reasons nobody understands. One example of this is Google Flu Trends, which tracked the number of searches for terms such as “flu symptoms” and was thus able to predict outbreaks of flu in the USA on a geographical basis. Having initially worked with an accuracy rate of 97%, its accuracy plummeted in 2011. As there was no clear analytical model apart from a scaled-down mathematical formula, it was impossible either to predict that this would happen or to find out why it happened. Subsequent studies indicated, among other things, that TV programmes that had been broadcast affected the number of searches for flu symptoms.
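How a purely correlation-based predictor can stop working is easy to show with a toy example. The sketch below is not Google’s model; all numbers, variable names and the linear relationship are assumptions made up for illustration. In the first period searches follow the number of flu cases, while in the second period they follow TV coverage that has nothing to do with the cases.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


rng = random.Random(2011)
weeks = 1_000  # hypothetical number of observations per period

# Period 1: searches are driven by actual flu cases (plus noise).
cases_before = [rng.uniform(0, 100) for _ in range(weeks)]
searches_before = [2 * c + rng.gauss(0, 10) for c in cases_before]

# Period 2: searches are driven by TV coverage, independent of the cases.
cases_after = [rng.uniform(0, 100) for _ in range(weeks)]
tv_coverage = [rng.uniform(0, 100) for _ in range(weeks)]
searches_after = [2 * tv + rng.gauss(0, 10) for tv in tv_coverage]

r_before = pearson(cases_before, searches_before)
r_after = pearson(cases_after, searches_after)
print(f"Correlation while cases drive searches: {r_before:.2f}")
print(f"Correlation once TV drives searches:    {r_after:.2f}")
```

A model that only knows “searches go up when flu goes up” has no way of noticing that the mechanism behind the searches has changed, which is why the breakdown arrives without warning.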
There’s actually no clear or established definition of what Big Data means, other than that there is such a vast volume of information that it is difficult to process in a simple, time-efficient way. So where is the boundary between Big Data and Not-So-Big Data? Of course, there’s very much a sliding scale over time. What was an amazing amount of data only a few years ago is no longer so: if 20 terabytes of storage in 2010 felt like something that should be on a computer rack in a server hall, today it fits in a small box on your desk.
For example, many people might think that the Internet, or more specifically the World Wide Web, is an example of Big Data. Google’s primary task is to help us search the web, and they’re a Big Data company, aren’t they? They have published many articles as well as presentations on YouTube about how everything relating to search technology fits together. According to them, they index around 30 trillion individual pages. How accurate this is can be difficult to determine without having insight into the company, but there are of course other figures you can use for comparison. According to VeriSign, which has overall responsibility for all domains, there are around 300 million registered domain names in the world. Let’s say that half of them are actively used, are not spam and don’t point somewhere else. For Google’s figure to hold, each of these 150 million domains (everything from 0-0-0-0.com to ockelbonytt.se) would need to have an average of 200,000 sub-pages. By way of comparison, Sweden’s most-visited site, aftonbladet.se, which is updated with lots of material around the clock, has around 500,000 to 1 million sub-pages, depending a little on whom you ask. There are also other sources, such as the voluntary project Common Crawl, whose downloadable scans of the web have around two billion pages, or WorldWideWebSize, which uses its own method to estimate the web to have at least 4.7 billion pages, although possibly up to 45 billion.
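The back-of-the-envelope comparison can be checked in a few lines. The sketch below uses only the figures quoted above; the 50% share of active domains is the text’s own assumption, and the variable names are made up for illustration.

```python
# Figures quoted in the text.
indexed_pages = 30_000_000_000_000   # Google's stated index: 30 trillion pages
registered_domains = 300_000_000     # registered domain names worldwide
active_share = 0.5                   # assumption: half the domains are in real use

active_domains = int(registered_domains * active_share)


def pages_per_domain(total_pages):
    """Average number of pages each active domain would need to hold."""
    return total_pages / active_domains


estimates = {
    "Google (stated)": indexed_pages,
    "Common Crawl": 2_000_000_000,
    "WorldWideWebSize (low)": 4_700_000_000,
    "WorldWideWebSize (high)": 45_000_000_000,
}

for source, pages in estimates.items():
    print(f"{source:26s} {pages_per_domain(pages):>12,.0f} pages per active domain")
```

The independent estimates imply somewhere between about a dozen and a few hundred pages per active domain, several orders of magnitude below the 200,000 that the 30 trillion figure would require.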
So it is probable that the web in total is below the limit of what might be called Big Data in 2015. Analyses of a few billion documents in a database are not much to shout about at a data analytics conference nowadays. What could it be instead that enables us to call Google a Big Data operator?
Every month more than 100 billion Google searches are performed. We can assume that everything about these is registered. Google almost certainly saves your IP address, the name of your operating system, the version of your web browser, the coordinates of every page you clicked on, the resolution of your screen, the exact time of all events and much, much more. It is not at all unreasonable to assume that Google saves more than 100 data points per search and then associates this information with a cookie, so that they get a complete profile of you. This profile is then used to show you adverts that are adapted to you as far as possible. They thus need to analyse data that is growing by more than ten trillion new data points per month (100 billion searches times 100 data points each).
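To get a feel for the scale, the two figures above (100 billion searches per month and 100 data points per search) can be multiplied out. This is only a rough sketch: both inputs are the text’s estimates, and the 30-day month is an assumption added here.

```python
searches_per_month = 100_000_000_000   # "more than 100 billion" searches
data_points_per_search = 100           # assumed number of data points per search

new_points_per_month = searches_per_month * data_points_per_search

seconds_per_month = 30 * 24 * 60 * 60  # assuming a 30-day month
new_points_per_second = new_points_per_month / seconds_per_month

print(f"{new_points_per_month:,} new data points per month")
print(f"{new_points_per_second:,.0f} new data points per second")
```

That is on the order of ten trillion new data points every month, or a few million every second, which is a very different scale from a few billion indexed documents.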
The closest you come to Big Data when you’re running a search engine is thus to store and analyse our behaviour and personal preferences in order to show adverts that are as appropriate to us as possible, rather than to save and search in the actual web pages. (They have not, however, posted any presentations about this on YouTube.)
Josef Falk, Business Intelligence project manager, Enfo Pointer