Finding Wisdom in Big Data
Think of what you do online each day: search Google, maybe “like” a post on Facebook, browse a few sites, watch some YouTube videos and tweet about an upcoming vacation. Now think about the personal profile those digital crumbs add up to. Welcome to big data, where companies mine those crumbs into a startlingly accurate picture of your life.
Whether you know it or not, you’re a data-generating machine. Your daily mouse clicks create huge reams of data that companies are slicing and dicing into insights on how you work and live. Or, as Bob Evans, Oracle’s vice president of communications, asked: “If you think we’ve got big data problems now — with ‘only’ about nine billion devices connected to the Internet — what’s the situation going to be like when that number soars to 50 billion at the end of the decade?”
Who’s Collecting the Data?
It’s a mind-boggling statistic: 90 percent of the world’s data was created in the past two years alone, according to an IBM study, and the pace is only accelerating. Everything from a mouse click to a page view is a goldmine to companies, and they’re developing ways to glean insight into your habits and make predictions about your future.
Entertainment — Netflix, for example, aims to become the HBO of streaming video. But how it plans to get there is a bit different. Rather than producing shows based on audience reception to pilot episodes, as traditional networks do, Netflix looks to your viewing habits and preferences. It already knows what shows you watch, when you watch them and even when you pause to take a break. By feeding that data into an algorithm, it can predict what you’ll like and what will keep you watching. When Netflix ordered two full seasons of the political drama “House of Cards,” for example, executives didn’t watch a single scene. They didn’t need to: they already knew it would be a hit. How? From data mined from similar shows.
“We know what people watch on Netflix and we’re able — with a high degree of confidence — to understand how big a likely audience is for a given show,” said Jonathan Friedland, Netflix’s communications head. “As time goes on, we get better at selecting what gets high engagement.”
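Netflix’s actual recommendation engine is proprietary, but the general idea of predicting taste from viewing history can be sketched with a simple collaborative-filtering toy. In the hypothetical Python example below, the viewers, shows and hours watched are all made up; it simply illustrates scoring unseen shows by the habits of similar viewers.

```python
# A minimal, illustrative sketch of viewing-history-based recommendation.
# This is NOT Netflix's actual algorithm; viewers, shows and hours are made up.
from math import sqrt

# Hypothetical data: each viewer's hours watched per show.
viewing = {
    "alice": {"House of Cards": 12, "The West Wing": 9, "Sherlock": 2},
    "bob":   {"House of Cards": 10, "The West Wing": 8},
    "carol": {"Sherlock": 11, "Doctor Who": 9},
}

def cosine(a, b):
    """Cosine similarity between two {show: hours} profiles."""
    shared = set(a) & set(b)
    dot = sum(a[s] * b[s] for s in shared)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def recommend(user, data):
    """Rank shows the user hasn't watched, weighted by similar viewers' hours."""
    scores = {}
    for other, profile in data.items():
        if other == user:
            continue
        sim = cosine(data[user], profile)
        for show, hours in profile.items():
            if show not in data[user]:
                scores[show] = scores.get(show, 0.0) + sim * hours
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(recommend("alice", viewing))  # e.g. [('Doctor Who', ...)]
```

Real systems fold in many more signals (pause points, time of day, completion rates), but the scoring-by-similarity idea is the same.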
In the quest for blockbusters, Hollywood has also dabbled in sophisticated data mining techniques to greenlight movie productions. For example, Working Title Films, the company behind the movie adaptation of “Les Miserables,” used computer simulations to predict its box office success.
Stage musicals turned into films are a risky genre, but by comparing “Les Miserables” with 5,000 similar movies, including those with the same lead actors, producers can glean insight into what works and what to avoid. In fact, Working Title Films chiefs Eric Fellner and Tim Bevan say they’ve mined production research for over a decade.
“What’s the point of spending $1 million of development money on something that will never get made?” Fellner said. “We use our experience and knowledge, and everything else gets generated on an algorithm model.”
Whether those predictions are correct or not, “Les Miserables” is headed to theaters.
Healthcare/Medical Innovations — Meanwhile, medical agencies like the Centers for Disease Control are plotting and predicting epidemics — using everything from over-the-counter sales to social media posts mentioning flu symptoms — in hopes of stemming outbreaks. Researchers are also improving medical recovery times by looking for patterns of patient behavior in big data. But perhaps the most ambitious effort is using data to personalize cancer treatments, creating repositories of genomes to track changes in tumor cells.
The government is opening its vast troves of data to push healthcare innovation as well. The Institute of Medicine and the U.S. Department of Health and Human Services, for example, released data on hospital performance, community health and FDA recalls, giving tech titans like Google and Microsoft the raw material to develop mobile healthcare software through hackathons and developer challenges. And the collaborations are working, producing apps to manage chronic diseases, locate clinical trials and find providers — all using open data — in a fraction of the time it took in the past.
What will 2013 hold for big data? With agencies releasing even more information, tech companies and developers alike will release innovative tools to further improve our health.
Advertisers — Ad agencies are using big data to target buyers, and that is both a convenience and an unwelcome bombardment. You probably know companies are interested in your personal information, but what you may not know is that data brokers are ramping up collection of what you read, watch and do online — so they can sell that lucrative data to interested groups, like credit agencies and political campaigns.
Corporations — Meanwhile, employers also want data on prospective hires. In the pursuit of as much information as possible, it’s not uncommon for companies to ask candidates for their passwords. That trend, disturbing to some, is leading to legislation to make sure your privacy is protected.
What Will This Mean?
How will businesses, agencies and people respond to the mounting avalanche of data? As millions more devices log on to the Internet in the coming years, you’ll leave an ever-growing trail of data. Companies are building systems to better harness this burgeoning data, resulting in more jobs, as well as a need for storage and security solutions.
Big Data Jobs — Companies understand big data is changing their industries, and savvy businesses are already looking for data scientists to interpret the results and give them a competitive edge. Big data is a booming business because the increase isn’t just more of the same stuff — it’s entirely new data streams that need to be sliced, diced and understood. By 2015, companies will create four million jobs to support big data, according to Gartner, with half of them in the U.S. Moreover, each big data-related job will create work for three more people outside of tech, pushing the employment potential to six million in the U.S.
“Public and private education systems are failing us. Therefore, only one-third of the IT jobs will be filled,” said Peter Sondergaard, Gartner’s global head of research. He predicts a shortage of data experts, adding “these jobs are the future of the new information economy.”
If you’re looking for one of the hottest job markets to come, big data is it.
More Storage — Data is piling up. By some estimates, it’s growing by 50 percent each year, and companies are grappling with what to keep and what to delete — and for how long, and under what kind of security. One rising trend is the prominence of technologies, like virtualization, developed specifically to help organizations store more data on existing infrastructure, saving costs and getting them up to speed. Privacy and security will also shape storage guidelines, and will frame the debate between Big Data proponents and privacy advocates. The sheer volume of data presents a physical challenge, one which new innovations, like cloud technology, can help manage. But every step forward is riddled with security concerns.
In This Year’s Election, Big Data Wins
If you love following politics, Nate Silver’s popular FiveThirtyEight formula and its accurate predictions of voting outcomes could bring changes to how we participate in and follow elections.
Silver, a statistical enthusiast, poker player and baseball nerd, turned his mathematical model to political forecasting in 2007, creating a blog to provide data-driven analysis of political elections with unusual accuracy. In 2010, the New York Times brought Silver’s FiveThirtyEight blog, named after the 538 votes in the Electoral College, under its banner. In the days before the election, it accounted for 20 percent of the traffic of one of the biggest newspaper websites in the world.
After the 2012 results validated his election predictions, matching the Electoral College votes to a tee, Twitter exploded with his name, sales of his book “The Signal and the Noise: The Art and Science of Prediction” surged to second place on the bestseller charts, and nerds everywhere rejoiced. The day after the election, the mild-mannered Silver went on The Daily Show, where Jon Stewart saluted him as “Nate Silver! The lord and god of the algorithm.”
But Silver will be the first to say it’s the algorithm, not him, that deserves the credit for getting his election forecasts right. Algorithms like Silver’s apply math and computer formulas to vast amounts of data, part of the larger “Big Data” trend.
Some find it disheartening that meaningful debate over issues can be boiled down to a math equation that slices and dices a population’s interests, values and opinions with massive amounts of data to analyze questions, solve problems and predict race winners.
To others, it is the natural progression of big data’s predictive powers — which have also played a role in sussing out national health crises, compiling weather forecasts and assembling baseball teams with the best talent for the price, to name a few.
Still, regardless of what the “Voterball” trend heralds, applying big data techniques to politics will affect the way the media cover upcoming elections and how campaigns target voters, but it still doesn’t completely rule out the “X-Factor” for all candidates.
The Proof in the Pundit Pudding
Silver’s approach will most visibly change the circus around politics and elections: rather than relying on bluster and thunder, political pundits are going to have to do their homework — their math homework — before offering their often partisan proclamations in future campaigns.
In the days leading up to the election, Silver suggested that President Obama had an excellent chance of winning a second term. Plenty of pundits from both sides pushed back, arguing that the race between Obama and Romney remained amazingly close. Several of them heavily criticized Silver’s formula, while others attacked him personally, mocking his sexual orientation and his mathematical inclination right up until Election Day.
Times columnist David Brooks wrote, “If there’s one thing we know, it’s that even experts with fancy computer models are terrible at predicting human behavior.”
Joe Scarborough from MSNBC took a more strident tone, saying the race was tight and “anybody that thinks that this race is anything but a tossup right now is such an ideologue, they should be kept away from typewriters, computers, laptops and microphones for the next 10 days, because they’re jokes.”
On the eve of the elections, Washington Post columnist Michael Gerson critiqued the use of statistical analysis in politics. “An election is not a mathematical equation; it is a nation making a decision,” he wrote. “People are weighing the priorities of their society and the quality of their leaders. Those views, at any given moment, can be roughly measured. But spreadsheets don’t add up to a political community.”
Silver dismissed his critics because to him, the election always was an equation. The Michigan native explained in a television interview, “Our model basically averages the polls and then simulates the Electoral College. I get a lot of grief because I frustrate narratives that are told by pundits and journalists that don’t have a lot of grounding in objective reality, frankly.”
Still, Silver’s model continued to churn together national and state polls, with an added dash of historical data from previous elections and various weighted factors. By the morning of November 6, it calculated Obama had a 90.9 percent chance of victory.
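Silver’s full model is far more elaborate, but the basic mechanic he describes (average the polls, then simulate the Electoral College many times) can be sketched in a few lines of Python. The states, margins, error terms and locked-up electoral votes below are illustrative placeholders, not his actual inputs.

```python
# A minimal sketch of "average the polls, then simulate the Electoral College."
# This is not Silver's model; states, margins and error assumptions are hypothetical.
import random

# Each swing state's electoral votes, the candidate's polling-average margin
# (percentage points) and an assumed polling error (standard deviation).
states = {
    "Ohio":     {"ev": 18, "margin": 2.0, "error": 3.0},
    "Florida":  {"ev": 29, "margin": 0.5, "error": 3.0},
    "Virginia": {"ev": 13, "margin": 1.5, "error": 3.0},
}
SAFE_EV = 237   # electoral votes assumed already locked up (hypothetical)
NEEDED = 270    # votes needed to win

def simulate_once():
    """Draw one plausible election outcome around the polling averages."""
    ev = SAFE_EV
    for s in states.values():
        outcome = random.gauss(s["margin"], s["error"])
        if outcome > 0:
            ev += s["ev"]
    return ev

def win_probability(trials=20000):
    """Fraction of simulated elections the candidate wins."""
    wins = sum(simulate_once() >= NEEDED for _ in range(trials))
    return wins / trials

print(f"Estimated chance of victory: {win_probability():.1%}")
```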
Punditry Has Its Limits
Up until the very end, television commentators described what they “felt” was going to happen, offering tantalizing “scoops” from places like Dublin, Ohio and dispatching hunches based on their years of experience at Georgetown cocktail parties and conversations with lobbyists — all of which the public was to take as a barometer for the outcome. Prognostication isn’t an exact science, but the crew addressing the U.S. public for weeks turned out to miss by quite a bit.
The pundits have a vested interest in promoting their conclusions and using cherry-picked polling data to give them the patina of expertise. Take polling data with a certain result, pair it with a report from a “campaign strategist” and lo and behold, instant credibility! In this light, their own techniques and intuition would be boosted if Silver’s formula failed.
Still, punditry has its limits, and more people were getting turned off by the shrill national conversation that often passes for debate during election cycles. Silver’s formula promised the flipside of this — what could be less biased and more objective than an equation? The idea of calculating the probability of an outcome not only made innate sense, it provided a much-needed respite from the incessant shouting heads on television and radio.
And with the election results, Silver’s model was another notch in the belt for the growing Big Data movement. Going forward, calculations like Silver’s can wrangle sprawling amounts of data into simpler, easier-to-digest packages. When it comes to complex events with a lot of factors in play, we’ll expect more Big Data models to help make sense of sometimes inchoate experiences.
But there’s a limit to them as well. Silver’s algorithm may feel like rocket science when it works, but at its core it promises only to be as accurate as the data fed into it, and future elections will place a premium on that data quality.
Targeting You to Bring Better Results
Silver isn’t the only one doing this, or even the only one doing it well. Aggregating polling data isn’t complicated; it’s the “secret sauce” — the unique factors a forecaster folds in, or how the various parts are weighted — that distinguishes these models and can be tested over time.
Before the last Election Day confetti had drifted to the floor, both Republican and Democratic campaign officials were compiling their “lessons learned” lists, and the idea of better data to help calculate and predict outcomes was foremost among them.
Both campaigns made digital inroads this year, but analysts give the edge to the Obama campaign, which hired a “chief scientist,” according to his campaign manager Jim Messina, to “measure everything.”
And in the future, that means everything about you that can be targeted and teased out to provide an edge in this new data game. The quest for votes and money will be driven by numbers produced by increasingly secretive and massive political data-mining systems. The numbers will determine what political ads you view — and sometimes you won’t even recognize them as ads — as campaigns employ ever more sophisticated ways of targeting voter segments.
You probably won’t be able to seek refuge in social media like Facebook, either, as these platforms expand to host political activity, voter registration and debate.
Both parties are now busy assembling their data analysis infrastructure by expanding the number of programmers, coders and engineers they welcome onto their staffs. Stay tuned, because their efforts will likely creep across a screen near you very soon.
Is All the Mystery Gone?
Nate Silver’s formula, aggregation and big data will bring more science to elections, but that doesn’t mean there will be no surprises.
In the past, it was assumed that “Big Money” could pretty much call the election winner — the candidate with the most cash got the most votes. Well, not always. And so it is with Big Data: the campaign with the best formula will have a greater probability of winning, but as the saying goes, the only two certainties in life are death and taxes.
A candidate with an “x-factor” like compelling charisma or the ability to inspire lasting confidence still has room to disrupt the narrative. And unexpected outside events, by their nature, can upend calculations that aren’t designed to predict outcomes, just their likelihood.
Silver touches on this in a recent interview with The Guardian. “I’ve become invested with this symbolic power. And I’d be the first to say you want diversity of opinion. You don’t want to treat any one person as oracular.”
“I guarantee that we are going to start getting some things wrong,” the 34-year-old said. “We are confronted with all this data, but oftentimes we screw it up. Having more information doesn’t necessarily make us any better at predicting what will happen in the world.”
Anything can happen, and big data won’t replace individuals and groups, though it might invade their privacy. The change will be gradual, but election reporting will be more informed by algorithms and statistical analysis, gaining steam over time as new formulas best previous ones.
For his part, Silver used a football analogy to highlight one of the finer points in the conversation: the difference between a prediction and a prophecy. “If the Giants lead the Redskins 24-21 in the fourth quarter, it’s a close game that either team could win. But it’s also not a ‘toss-up’: the Giants are favored. It’s the same principle here.”
Using DNA to Store Big Data
Scientists at Harvard successfully coded a book into DNA, demonstrating how even large texts can fit into tiny strands of DNA and how the molecule could aid big data storage in the future.
Algorithms can analyze big data to glean information and predict trends. For example, big data is how OKCupid makes accurate predictions about its users based on the information it has amassed, or how Wall Street firms decide which stocks to champion. Many researchers and technology companies are excited by the possibilities of big data, which compiles large amounts of small, discrete bits of information. The combination yields finely tuned insights about behavior, and could have large implications for advertising, for instance.
Storing large amounts of these small bits of data, however, could prove onerous. DNA storage may help companies that mine silos of data drastically reduce the space needed to hold information, and keep data secure for thousands of years.
Harvard biologists built the DNA storage in several steps, eventually reducing the original words to DNA’s simple nucleotide structure. The process could help big data companies determine the most efficient and affordable ways to keep information organized, analyzed and easily accessible.
For now, high costs keep publishers and other companies from adopting the technology. However, the cost of DNA sequencing equipment is dropping dramatically, so DNA storage may yet offer a workable solution for companies overwhelmed by their growing data: it vastly reduces the space needed to keep information stable.
DNA, as a stable biological molecule, is a durable storage medium that can stay preserved for millions of years. And while other technologies go out of date, scientists point to DNA’s staying power as the building block of life. It’s also far more compact than flash drives or other storage. Businesses trading in data could drastically reduce their storage costs, which may help DNA storage become a staple of big data.
So How Did They Do It?
First, the researchers translated the 53,000-word science book into HTML code. From there, they converted the HTML into binary code, the 0s and 1s computers read. They then turned the binary code into a DNA sequence by assigning the value 1 to two of the nucleotide bases (adenine and guanine) and the value 0 to the other two (cytosine and thymine). To change the book back into regular text, they used sequencing equipment to read the DNA and reverse the procedure.
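As a rough illustration of the mapping described above (and not the Harvard team’s actual pipeline), the Python sketch below turns a sentence into bits, assigns each bit to a nucleotide, with 1 becoming adenine or guanine and 0 becoming cytosine or thymine, and then reverses the process.

```python
# Illustrative sketch of the bit-to-nucleotide mapping described above.
# Not the Harvard team's actual encoding pipeline; details here are simplified.
import random

ONE_BASES = ("A", "G")   # bases standing in for binary 1 (per the description)
ZERO_BASES = ("C", "T")  # bases standing in for binary 0

def text_to_bits(text):
    """Encode text as a string of 0s and 1s, eight bits per byte."""
    return "".join(f"{byte:08b}" for byte in text.encode("utf-8"))

def bits_to_dna(bits):
    """Map each bit to one nucleotide; which of the two valid bases is arbitrary here."""
    return "".join(random.choice(ONE_BASES if b == "1" else ZERO_BASES) for b in bits)

def dna_to_text(dna):
    """Reverse the procedure: nucleotides back to bits, bits back to text."""
    bits = "".join("1" if base in ONE_BASES else "0" for base in dna)
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")

sentence = "Finding wisdom in big data"
strand = bits_to_dna(text_to_bits(sentence))
assert dna_to_text(strand) == sentence   # round-trips back to the original text
print(strand[:40], "...")
```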
What’s Next?
This innovation comes on the heels of a talk by genome savant Dr. J. Craig Venter describing DNA’s exciting future as all-natural software. Many revere Venter’s contributions to genetics, but he does not shy away from suggesting radical uses for the genome.
Venter outlined his plan to make a machine that turns digital DNA code sent over e-mail into synthesized biological creatures, a far-fetched-sounding project he insists can be done in the near future.
The Harvard DNA book storage project supports Venter’s vision of the genome as a future storage unit, though its goal is more modest: it looks to store information, not to create original biological entities. But if the Harvard researchers’ data storage breakthrough takes off, it will only help Venter’s bigger dreams come true more quickly.
Finding a Balance
Since personal data is a security risk, organizations must handle information with the utmost care; otherwise hackers, for example, can use the data for identity theft. To combat the problem, companies are stripping out identifiable information to anonymize their data sets, but regulators are pushing for increased protection.
By using big data to find patterns and trends, innovations in healthcare can help us live longer lives, but the same information can also be used to target us. Like most technologies, it’s a double-edged sword. And like it or not, big data is coming, and it’s going to transform the way we think and live in the years to come.