BIG DATA
For Linda,
who would have made this a better book,
were you here to read it.
BIG DATA
DOES SIZE MATTER?
Timandra Harkness
'What is this book?' asked my stepmother, Juliet.
'Is it for people like me, who keep hearing the phrase "big data" and want to be able to talk about it at dinner parties?'
'Yes,' I said, 'that's exactly what it is.'
Not only for Juliet, and not just at dinner parties: it's a book for anyone who gets the feeling big data is interesting and important, and should be talked about, but doesn't want to study mathematics or computer programming.
In 10 chapters I aim to get you from the most basic ideas to some of the thorniest issues we need to be arguing about.
On the way, you'll meet some of the people, ideas and projects I've been lucky enough to encounter around the world. Much of this book is in other people's words, telling their own stories or introducing concepts that help me understand why big data matters. I've tried to structure it so each new idea builds naturally on what's gone before.
That means it's written to be read in order. You can dip in and out if you prefer, of course. Hey, it's your book, you can wallpaper your bathroom with it if you like. But I think you'll get more out of it if you read from beginning to end.
Big data is a huge subject, and changing so fast I sometimes felt I was running as fast as I could just to stand still. The subject matter of any one of these chapters could fill an entire book. So there are things I only touch upon, or miss out altogether. It doesn't mean they're not important or interesting. I hope I will give you enough of an overview that you will be able to go and find out more for yourself.
I have my own opinions on what is great, and not so great, about big data. I don't want you to accept them. I want you to make up your own mind. That's kind of the point of the whole book.
But just as important to me is that you enjoy reading it. I hope you do.
Note
Unless you got it out of the library.
What is data?
Thirty thousand years ago, in central Europe, somebody scratched 57 notches into a wolf bone. Those 57 notches, grouped into fives just as you might tally something today, are the earliest known recorded data.
We don't know anything more about who scratched them, or even if the notches were all made by the same person. We have no idea what they denote, only that they were a record of something. Which may not seem like much to you, but it represented a breakthrough in how our ancestors were able to keep track of things.
Imagine for a moment that the fact it's a wolf bone is significant, and knowing how many wolves you'd killed was important for some reason. Perhaps you wanted to see if the local wolf pack was getting bigger or smaller, or whether the new flint-tipped arrows were more efficient than the old wooden ones, or just to win an argument about which member of your tribe was the best wolf-killer and got to sit nearest the fire.
You could hang on to a trophy from each wolf, and just see which pile of skulls is biggest, but that takes up room, and is vulnerable to being eaten by dogs. If you can represent each wolf with a notch, all you have to do is compare bones and see which has more notches.
Somebody in an Ice Age cave, in what is now the Czech Republic, had invented digital data.
Today, you can download Wild Wolf Data from the comfort of your own computer. The International Wolf Center in Minnesota, USA, fits wild wolves with tracking collars: radio collars since 1968, and more recently GPS collars that use satellite links to track the wolf's position. This has allowed them to locate individual wolves at any given time, but also to study patterns of wolf movement and behaviour, and even to predict likely conflicts between the wolves and their human neighbours.
The technology is more advanced, but the basic principle is the same: turn your information into numbers, and record it in a form that's easy to use and share. GPS data, tracking wolves through the forests of America, is digital information, by which we simply mean that it comes in numbers that you could, in theory, count on your fingers: your digits.
You'd need a lot of fingers, but that's where computers come in handy.
Today, computing technology is so cheap, compact and powerful that domestic washing machines use computers to control laundry cycles. Impressive. And yet it's still easier to keep track of wild wolves in Minnesota than of your own socks.
Without computers, big data would be impossible, so let's take a quick look at their unstoppable rise.
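To make the principle concrete, here is a minimal Python sketch of the wolf-bone tally: it turns a count into notches grouped in fives, and compares two hunters by counting notches rather than piling up skulls. The function names, the hunters and the choice of a vertical bar to stand for a notch are all invented for illustration.

```python
def group_notches(count, group_size=5):
    """Split a count into full groups of notches plus any left over."""
    return divmod(count, group_size)


def draw_bone(count, group_size=5):
    """Draw a tally as text, one vertical bar per notch (an assumed notation)."""
    full_groups, leftover = group_notches(count, group_size)
    groups = ["|" * group_size] * full_groups
    if leftover:
        groups.append("|" * leftover)
    return " ".join(groups)


def better_hunter(tallies):
    """Return the name with the most notches; ties go to the first name seen."""
    return max(tallies, key=tallies.get)


# The 57 notches on the wolf bone: full groups of five, plus the remainder.
print(group_notches(57))
print(draw_bone(7))

# Hypothetical hunters and counts, made up for this example.
print(better_hunter({"Ug": 12, "Og": 9}))
```

Once each wolf is a number rather than a skull, comparing, storing and sharing the record all become a matter of simple arithmetic.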
A century of computers
The earliest computers wore petticoats. Until the twentieth century, 'computer' was a job title, and people, mainly women, were paid to do mathematics, with the aid of primitive technology such as log tables and slide rules, both of which were still in use well into the space age.
The first computer in the modern sense was built by IBM in 1944, in partnership with Harvard University. The Automatic Sequence Controlled Calculator, affectionately known as the Mark 1, was 2.4m (8ft) high and more than 15m (50ft) long. It weighed nearly 4,535kg (5 tons) and worked by a combination of electrical and mechanical parts, relay switches, rods and wheels. Computer historian John Kopplin described it as sounding like 'a roomful of ladies knitting'.
The Mark 1 could add together 23-digit numbers in under a second. Multiplication took around five seconds, and division over 10 seconds. It received its program and data in the form of holes punched into paper tape and cards.
Mathematician Grace Hopper became the chief programmer. Her work was central to the development of computer programming, but you may be more entertained to learn that she was the first person to debug a computer: she removed a moth that got stuck in the mechanism.
For tasks such as predicting the path of artillery shells, you put numbers in, and you got numbers out. But after the war, both business and government wanted to use the computer for a wider range of tasks. Human beings don't naturally converse in a string of ones and zeroes. Hopper went on to develop programming languages using words and structures from the English language, so that non-specialists could work more easily with computers.
The development of what today we'd call software was a step towards introducing the power of computing into non-mathematical areas of human life. But the hardware was unwieldy and expensive. When Harvard's Howard Aiken, the inventor of the Mark 1, was asked in 1947 to estimate how many computers the US might buy, he said six. It would take a transformation in how they worked, and how they were built, to get us to the present day.