Book Review: The Signal and The Noise, by Nate Silver

The Review of The Signal And The Noise: Why So Many Reviews Are Lame – But Some Don't. Uh, I mean Aren't.

Nate Silver is a geek. Specifically a numbers geek — in the olden days he'd be called a statistician, but these days the hot term is "Data Scientist" — he's one of those folks who were doing data science before it was recognized as a distinct discipline. I've always been a fan of his, as I'm of the same ilk — the only thing that stopped me from majoring in Data Science was that it didn't exist back when I was in college (in 1830). Silver first came to national attention as the founder and primary contributor of the blog FiveThirtyEight, where he applied statistical forecasting to predict the outcome of political races and spanked pretty much all the pundits in the 2008 elections.

As his stock rose, Silver wrote this book to distill his worldview and bring statistics to the masses — this book is essentially a Love Letter to Probability and Statistics. His star later lost a little luster after apparently underestimating the Rise of Trump in the 2016 election (though to be fair, most pundits were surprised, and his forecast gave Trump better odds than most). But today FiveThirtyEight remains a mainstay of political prognostication (I checked it almost religiously leading up to the 2020 election) and has branched out into engaging articles on sports, entertainment, and other topics — a "data journalism" site, as Silver calls it. Now that the book is closing in on a decade old, and all things data have seeped into the zeitgeist, how much of it has turned out to be Signal (and not Noise)?

A lot of this book appeals to the numbers-geek side of me (which admittedly is most of me — 87.3%), but I find it frustrating in parts. While Silver has a knack for conveying statistical concepts to laypeople and he's grown a nice little media fiefdom around FiveThirtyEight, he's not an expert statistician. I admire his goal of lifting the curtain on professional applied statistics so we can all get a peek, but I feel like he might not be sufficiently qualified to comment on some of these statistical stories. While there's no crime in over-simplifying a scientific concept for the sake of popularization (at least I hope not, for the sake of this blog), he's reached a level of stature in the zeitgeist such that when he draws a conclusion without doing his homework, he can do some real damage.

Some Predictions That Fail, and Some That Don't

Silver gives us a tour of a dozen statistical case studies from a wide range of disciplines, including baseball (where statistical modeling has essentially taken over from grizzled scouts), weather forecasting (which has quietly become much more accurate than most people realize), and financial market modeling (where accurate prediction remains notoriously difficult). Many fields are undergoing a data revolution, sweeping aside unreliable traditions and intuition in favor of hard-nosed mathematical analysis. Along the way he gives an entertaining lesson in various useful statistical concepts, like false positives vs. false negatives, Bayesian reasoning, the Winner's Curse, and others. And he's able to give us an insider's view, having directly participated in at least three of these data revolutions himself.

Albert Pujols doesn’t care what you think of his name, he’s hit 37,500 home runs in his career (so far)

While political and financial predictions are probably the topics we most want to hear from the data geeks, Silver's most entertaining case studies focus on more light-hearted topics. For example, baseball — if you haven't paid attention to MLB statistics since you snickered about Albert Pujols' name, you might be shocked at the level of statistical sophistication used in baseball reporting these days. We can now quantify how much each player contributes to their team's winning record (e.g. Mike Trout is worth an additional 8-9 wins to his team per year), and community sites like Fangraphs contain just as many articles debating the appropriateness of statistical estimators as homespun interviews with old-timers. Baseball is juuuust predictable enough, and has enough money at stake, that squeezing out a few percentage points of efficiency through number crunching is worth the effort. In contrast, earthquake prediction remains essentially intractable, despite decades of research and effort. About the only concrete predictive capability we have for earthquakes is the discovery that their frequency of occurrence vs. intensity fits a power law well (which is a straight line on a log-log graph); otherwise, the history of earthquake prediction is one of overconfidence (and overfitting to noise). This "power law" discovery is useful but still wouldn't help you escape San Andreas before the big one hits. We also hear about predicting weather, which turns out to be one of the easier problems considered in this book, as it "only" depends on well-understood principles of physics, and as such has demonstrated measurable improvement in forecasting accuracy over the last few decades. In contrast, predicting the stock market is a mess — if there are any underlying principles below the surface, they're so horrendously complicated and contingent on random events that it's essentially hopeless (as Silver hints to us).
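To make the "straight line on a log-log graph" idea concrete, here's a minimal sketch of that kind of fit. It's my own illustration, not from the book: the earthquake counts are made up (roughly in the spirit of the Gutenberg-Richter relation), and since magnitude is already a logarithmic scale, a linear fit of log-counts against magnitude is the log-log line in question.

```python
import numpy as np

# Made-up counts: quakes per year at or above each magnitude (illustrative only).
magnitudes = np.array([4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0])
counts = np.array([13000, 4100, 1300, 410, 130, 41, 13])

# A power-law relationship becomes a straight line once we take logs, so an
# ordinary least-squares fit on log10(counts) recovers the slope (the "b-value").
slope, intercept = np.polyfit(magnitudes, np.log10(counts), deg=1)
print(f"b-value: {-slope:.2f}")  # about 1.0 for the synthetic numbers above

# The fitted line tells you how often very big quakes occur on average...
rate_m8 = 10 ** (intercept + slope * 8.0)
print(f"expected M8+ quakes per year: {rate_m8:.2f}")
# ...but it says nothing about *when* the next one will hit, which is exactly
# the limitation of earthquake prediction described above.
```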

I most appreciate Silver's style when he shows us the behind-the-scenes personalities — he interviewed quite a few luminaries for this book, and hearing their opinions on the problem (and on each other) humanizes these case studies. Particularly interesting was the chapter on AI chess — he shares the story of the chess program Deep Blue's design and growth as it catches up to and ultimately defeats world chess champion Garry Kasparov. While this section had somewhat less to do with data forecasting per se, it was astounding to hear how Kasparov reacted as the program's ability improved. At one point Kasparov was spooked, and ultimately demoralized, by an unconventional move that seemed to imply Deep Blue's understanding of chess was indeed quite deep, but which was actually the result of a simple software bug. Silver uses these real-world examples of trying (and often failing) to build a predictive model to illustrate the (sadly) many pitfalls of this job — overfitting by failing to distinguish signal from noise, failing to remove your own subjective biases, the "winner's curse" of a prediction model that got lucky. There are many ways to fail when you're building a predictive model. These are all good lessons for an aspiring data geek — better to learn from someone else's pain than your own. But often in these popularization books, I like reading about the personalities even more than the science.

Some of his case studies, however, rub me the wrong way. I enjoy hearing his firsthand accounts of making a living playing poker, but I’m not sure I welcome his perspective on such serious subjects as terrorism (why did we fail to predict 9/11?) or climate change (how accurate is the science?). Here I feel like Silver makes an error common to many smart people, particularly smart people who’ve had success in multiple fields — assuming they’re equally qualified to comment on all fields. While I have to applaud his goal to point us readers toward a statistical approach to these hard problems, his conclusions sometimes smell a bit undercooked. Particularly when he evangelizes for a particular style of statistics, called Bayesian statistics.

Bayes? Who says?

Like any field of study, the world of data science has a range of different subdisciplines, each with their own goals and values. Silver is on safest ground in this book when he writes about scenarios similar to where he's made his own living. His education is in economics; he later developed a famous statistical prediction system for Major League Baseball in his spare time outside his day job as an economic consultant, and later still quit that day job to make a living for a few years playing online poker. These are all domains where your goal is to forecast the future, using as precise a probabilistic model of the scenario as you can. You're constantly updating your estimates of probability, trying to scrape out small bits of improvement wherever you can. The end goal is, of course, to have the most accurate predictor around.

“Am I gonna be on ESPN?”

These are also competitive fields — you're going up against other data nerds (or poker players, or baseball GMs) and their models. It's not your absolute accuracy that's important, but your performance relative to your peers. In fact Silver points out that poker depends on having a pool of "suckers" (people whose predictive accuracy is poor), who subsidize the rest by continually losing — without such generous losers, poker becomes a difficult game to make a living in. (Similarly, the stock market wouldn't exist if we all had optimal systems, because we would all make the same decisions and there would be nobody to buy the stock we want to sell.) Silver's book is most convincing in these case studies — he makes a compelling argument for the need for probability and statistics, and walks through the lessons he learned to become adept in these fields. If anything, these chapters suffer from too much detail and a failure to "dumb down" the subject area for us novices. (To pick one example, he never defines the term "tilt" in his chapter on poker — probably obvious to some, but I spend more time playing Chutes and Ladders than poker these days.) But I trust his expertise in these forecast-driven fields.

Contrast this with how statistics are used in science. Statistics are central to modern medical and biological research, for example, because there's often so much noise in the results that we have to prove a discovery statistically. Say we set up an experiment to find which genes are responsible for a particular disease — these days, all the easy genes have been discovered, and the remaining work is to measure the small influences a particular gene might have; e.g. we might find some gene increases susceptibility to cancer by 15%. For this application of statistics, prediction of future events is a tool we use, but it is not the main goal. Rather, we care most about the model itself — we're trying to find an explanation for what's causing the phenomenon. Measuring the predictive power of a candidate model is important (e.g. as good evidence that it's the right model), but we can't just stop there. Predicting cancer is of course a worthy goal, but we really want to understand why cancer occurs so we can stop it. In contrast, if I told you I had a mathematical model that can predict stock prices but I don't know why it works, you might not worry too much about the "whys" and the "hows". In my opinion, the statistical tools and "frameworks" (the entire perspective you bring to solving statistical problems) are somewhat different for these two scenarios. And I think Silver makes the mistake of asserting that the right tools for forecasting are therefore also the right tools for scenarios where we care more about the model.

This is where we get into "Bayesian" and "Frequentist", the two competing intellectual frameworks for applying statistics. Silver is a proponent of the Bayesian approach, where any statistical judgment must start with a "prior" probability, which is a probability you assign to the events you care about before you make any measurements. For example, for every coin I come across, I have a prior assumption that the probability of getting heads is 50%; I can then refine my estimate by actually flipping the coin a bunch of times. An important example is medical testing — cancer tests aren't perfect, and if you don't account for the rarity of the disease you're testing for, a seemingly decent test can become misleading. Silver walks through an excellent example of this — say a test for cancer has a 10% false positive rate, which doesn't sound terrible, but the cancer itself is extremely rare. In this scenario, the vast majority of people who test positive are false positives — getting a positive result doesn't actually tell you much. That means there's not much point in taking the test in the first place. (And for a test like a CT scan, where the X-rays can increase your chances of cancer, the test can actually do more harm than good.) If you're calculating the probability that you actually have the disease given your test outcome, you need to include the "prior" (the prevalence in the population); otherwise the results are misleading. Strong proponents of the Bayesian approach see their work as an ongoing process of refining probabilities — start with your best guess at a "prior", then update after every experiment. The "posterior" probability from each experiment becomes the prior for the next one.
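To put numbers on that example, here's a minimal worked sketch of the Bayes' theorem calculation. The figures are my own illustrative assumptions (a disease that strikes 1 in 1,000 people and a test that catches 90% of true cases), not Silver's exact numbers; only the 10% false positive rate comes from the example above.

```python
# Bayes' theorem for the rare-disease test (illustrative numbers, not from the book)
prevalence = 0.001          # P(disease): the Bayesian "prior"
sensitivity = 0.90          # P(positive | disease)
false_positive_rate = 0.10  # P(positive | no disease)

# P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
p_disease_given_positive = sensitivity * prevalence / p_positive

print(f"P(disease | positive test) = {p_disease_given_positive:.1%}")  # about 0.9%
```

Even with a decent-sounding test, fewer than one percent of the positive results are real cases, precisely because the prior (the prevalence) is so low.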

The Frequentists, on the other hand, worry about the subjectiveness of choosing a prior probability. If we must pick a prior before computing any probability, what's to stop us from picking a prior that works in our favor? The prior is by definition chosen before you run the experiment, so why not pick one that will help give you the results you're hoping for? We might pick a best-guess prior based on what we think is "common sense", but how do we judge whether it's accurate? And if we pick a very strong prior (close to 0% or 100% probability), then an experiment that is less than absolutely conclusive might not budge us from the prior much at all. In that case, why did we bother to do the experiment?
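To see that last worry in miniature, here's a toy example of my own (not from the book) using the standard Beta-Binomial update from Bayesian statistics: the same data moves a weak prior a lot and a strong prior barely at all.

```python
def posterior_mean(prior_heads, prior_tails, observed_heads, observed_tails):
    """Mean of the Beta posterior for a coin's heads-probability (conjugate update)."""
    return (prior_heads + observed_heads) / (
        prior_heads + prior_tails + observed_heads + observed_tails
    )

# We observe 7 heads in 10 flips.
# Weak prior (as if we'd already seen 1 head and 1 tail): the data moves us a lot.
print(posterior_mean(1, 1, 7, 3))      # ~0.67, close to the observed 70%

# Very strong prior (as if we'd already seen 500 heads and 500 tails):
# the same experiment barely budges the estimate off 50%.
print(posterior_mean(500, 500, 7, 3))  # ~0.50
```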

Silver evangelizes for the Bayesian approach throughout — for many of the case studies in the book, he walks us through the actual math behind calculating probabilities the Bayesian way (which is the only actual math in the book, by the way). I'm willing to grant Silver that the Bayesian approach is the right way to go for forecast-y, zero-sum-games-y, compete-against-other-suckers-y situations like poker, financial markets, and optimizing sports teams. But I'm not convinced it's the right way to go for scientific research, as he contends. And it's not just a philosophical debate of no practical significance — Silver recommends Bayesian stats for use in medical research, and there lives are on the line. Choosing the wrong tool (or using a tool incorrectly) could actually cost lives. I doubt the thousands of PhD researchers who have devoted their lives to curing cancer will take Silver's suggestion as anything except well-meaning but ignorant interference from a statistics semi-celebrity. A researcher looking for the genes responsible for Alzheimer's (for example) might choose a Bayesian prior probability that gives their favorite genes a head start. The Frequentist approach, on the other hand, could be seen as a way to estimate what probabilities we can without assuming any prior, which might not be the full story but at least avoids the subjectivity of picking a correct prior. In other words, the leaders in these fields have already thought carefully about the issues Silver raises, and have very good reasons for not converting wholesale to Bayesian frameworks. I find it telling that Silver nags the medical research community but not the AI chess community — does he perhaps feel bolder chastising medical researchers (who probably were bio majors and didn't take statistics in college) than computer science experts working on AI?

You Can’t Put Too Much Water in a Nuclear Reactor

The state of the art in America’s nuclear energy program

There's an old Saturday Night Live sketch that will help me explain another complaint I have about parts of the book. (Bear with me — I know there's not much more painful than reading a description of a comedy sketch in a blog post.) The sketch stars Ed Asner as a retiring nuclear power plant manager who imparts this wisdom to his staff, then immediately leaves: "You can't put too much water in a nuclear reactor!" The staff must quickly decide if he meant "be careful not to add more water than a certain limit", or "there is literally no such thing as 'too much' when it comes to adding water — you can add all you want". What at first sounds like a crucial warning turns out to have so little information content that it's essentially useless in helping the hapless coworkers diagnose a reactor problem. In the SNL sketch the problem is of course compounded by ambiguous wording (which Silver is not guilty of), but the situation unravels because none of the underlings thinks, or dares, to question the statements of the expert.

Silver's transgression here is to pass off semi-tautological statements as important observations. The first obvious example (IMO) is when he explains Philip Tetlock's research into the psychological tendencies of people who make predictions. Tetlock analyzed the accuracy of experts in many fields, and found they fall into one of two groups — the "hedgehogs", who tend to think dogmatically and believe systems are guided by strong underlying principles (often based on theories they developed themselves), and the "foxes", who are more willing to embrace complexity and nuance and hence are less likely to shoehorn observations into their own pet theory. Just from the choice of names, you'd perhaps not be surprised that the foxes tended to outperform the hedgehogs in prediction accuracy, even though hedgehogs often garner more media attention. On the surface the comparison seems surprising and novel — looking through the table Silver provides to contrast foxes with hedgehogs, a few of the traits are somewhat surprising (such as that hedgehogs tend to be specialized in one field, while foxes have multidisciplinary training). But many others seem to be simply a case of restating "foxes tend to be better predictors" in clever ways. For example, foxes "find a new approach – or pursue multiple approaches at the same time – if they aren't sure the original one is working." That's good advice to follow, but Silver frames it as a novel discovery — these traits of "foxes" are presented as ones we wouldn't have guessed would correlate with better accuracy. It's like the clickbait internet ads that promise "you won't believe what simple habit of yours will lead to a heart attack!" where it turns out to be "sitting on the couch all day". So Silver's assertions are partly (albeit not completely) tautological — what makes an expert good at making predictions? Having the ability to make their predictions better.

I should perhaps concede that this semi-tautology is really Tetlock's to answer for, but there are other occasions where Silver provides some bit of advice that, on closer inspection, isn't really "actionable" (in the sense of being immediately implementable). An example is the recurring discussion that gives the book its name, the problem of distinguishing Signal from Noise. We of course don't want to include too much noisy data in our predictions, because noise (by definition) is anything that isn't helping to predict what we want to predict. But we don't know ahead of time which data sources will be noise, and which will be signal. So if you're too selective, you'll throw valuable signal out with the bathwater. Silver makes this point a recurring theme, but the problem is that it's not prescriptive. In retrospect I can tell that I let too much noise in when I made my predictor, or that I threw away too many sources of signal, but I need to know how to make that call while I'm building the predictor. Silver is essentially saying "don't let in too much noise, but don't be too conservative either." That's a long way of saying "let the right amount of signal into your predictor". No sh*t! I'd *love* to do that — you don't need to keep telling me it's a good idea, I'd rather you tell me how to get there. For a book named "The Signal and The Noise", it's surprising that Silver spends so much time explaining that we should separate signal from noise, but not how.

A high-level summary of the statistical concept of “feature selection”. We omitted some details that would just confuse you

Statisticians face this all the time, of course. There are lots of tools available for separating the wheat from the chaff, for example the area of "feature selection" in machine learning. Let's say you're trying to predict the future health of the economy, and you've scraped together a few hundred economic indicators, like the current inflation rate, unemployment, housing prices, etc. These data streams from which you build a predictor are called "features" in machine learning lingo. There are well-proven techniques for deciding which of these features actually help, and which are just noise. Silver never gets into these techniques; rather, he spends most of the book trying to draw our attention to the fact that there is noise. Maybe I'm being too optimistic about Silver's intended audience, and most people really do need to be convinced of this. But it makes Silver's book a bit less useful for the practicing data nerd.
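As a flavor of what such techniques look like, here's a minimal sketch of one common approach, lasso regression, which keeps only the features whose coefficients survive an L1 penalty. It's my own illustration with synthetic data (and an arbitrary choice of method), not anything Silver describes.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic data: 200 candidate "economic indicators" over 300 quarters,
# of which only 5 actually carry signal about the quantity we want to predict.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 200))
true_signal = [3, 41, 77, 120, 188]
y = X[:, true_signal] @ np.array([2.0, -1.5, 1.0, 0.8, -0.6]) + rng.normal(size=300)

# LassoCV chooses the penalty strength by cross-validation; features whose
# coefficients are driven to zero are treated as noise and dropped.
model = LassoCV(cv=5).fit(X, y)
kept = np.flatnonzero(model.coef_)
print("features kept:", kept)  # should recover most of true_signal, with few false alarms
```

The point isn't that lasso is the answer (it has its own failure modes), just that "how much noise to let in" is a question with real, testable machinery behind it.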

Trust The Process

But my review isn't going to be all complaints — despite the book being aimed perhaps more at the layperson or novice data nerd than I'd like, I still found a lot to learn in it. Probably the most important lesson I took away is to trust your process, and not focus as much on the results. This is critically important in situations where random chance plays a role, where sometimes the breaks just don't go your way and you have to resist abandoning your plans or resorting to superstition. He makes this most concrete with examples from poker. When you're facing down your opponent, trying to decide what cards they hold so you can gauge how aggressive to be, you have no choice but to make a probabilistic decision. Despite what movies would have you believe, you can rarely whittle down the possibilities enough to know exactly what cards others are holding. Let's say you conclude there's a 90% chance you have better cards than your opponent, so you decide to go "all in". Turns out your opponent did have a straight flush nine-of-a-kind on the flop (I don't know poker terminology, can you tell?), and they win the pot. The poker sportscasters on ESPN 12 might imply that the "right decision" would have been for you to fold, and might praise a player who does indeed fold as some sort of poker genius. But the fact is that going all-in was the right decision, even though it leads to you losing the pot in this case. We should judge ourselves on our ability to accurately judge the probability (is 90% the right likelihood?), not on those inevitable times when the rarer outcome happens. If you play optimal poker, you'll go all-in on any hand with a 90% win likelihood, and over the long haul you'll succeed. Silver points out how rare it is for observers (say sportscasters, or your boss judging a failed project) to resist framing plain bad luck as a failure of process.
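A quick simulation makes the point. This is my own toy sketch, not from the book, with made-up pot sizes and the assumption that your 90% read really is accurate:

```python
import random

def play_hand(win_probability=0.90, pot=1000, stake=1000):
    """Net result of going all-in as a 90% favorite: win the pot or lose your stake."""
    return pot if random.random() < win_probability else -stake

results = [play_hand() for _ in range(10_000)]
print("hands lost:", sum(r < 0 for r in results))                # roughly 1,000 of 10,000
print("average profit per hand:", sum(results) / len(results))   # about +800, despite the losses
```

Judged hand by hand, you lose about a thousand of those pots; judged as a process, the decision is clearly the right one.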

This is an important point — it is by no means the main theme of Silver's book, but it echoes others in science and engineering who point out how dangerous it is to try to "learn" from luck. Let's say you're a middle manager at your job — you lead a team, you're trying to meet a deadline, and so you've cut corners on your usual processes. ("Processes" might be peer review of code in software, or safety regulations in manufacturing, or careful vetting of investment risk in finance. Or simply properly disabling the industrial machines when a technician is inside trying to fix them.) If you succeed despite cutting corners, you may (incorrectly) learn that cutting corners is fine — the success is really accidental (processes are meant to catch mistakes, and maybe your team just didn't happen to make any serious ones), but it can fool you into believing you did the right thing. So it's counterintuitively better to judge the process you follow, and judge it over the long term, rather than trying to learn from individual results. Of course this isn't true in all endeavors, but it's critical for ones where randomness plays a large role. And it's hard to establish the discipline to follow this at businesses, where progress is measured by quarterly financial growth.

A quite relevant example right now would be evaluating our response to emerging infectious diseases. We're not doing so hot responding to Covid (at least at the time of this writing, January 2021) — so far, it looks like a massive failure to prepare and respond, both by governments and by many individuals. Hopefully we'll course-correct after this, and devote more resources to planning for massive global pandemics. But what if we had gotten lucky and stopped those first few Covid cases early on? Did we become complacent after watching potential pandemics like SARS, Ebola, and Zika virus not play out as dramatically as the initial hype suggested? Ironically, successful early responses to a possible pandemic might make society less likely to prepare for the next one. Pandemics are inherently random — not just the infectiousness of the microbe itself, but also the behavior of those first few infected people (do they fly from LA to London, or do they stay in their house?) can affect the eventual course of the outbreak. Events early on have a particularly large influence, due to the exponential growth of the spread. I think we're suffering now from overestimating our own success at handling the last few major health scares.
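That point about early randomness can be made concrete with a tiny branching-process simulation. This is my own toy model, not anything from the book, and the settings (a reproduction number of 1.5, a single initial case) are arbitrary:

```python
import math
import random

def poisson(mean):
    """Knuth's simple Poisson sampler (fine for the small means used here)."""
    limit, k, product = math.exp(-mean), 0, random.random()
    while product > limit:
        k += 1
        product *= random.random()
    return k

def outbreak_size(r0=1.5, max_cases=100_000):
    """One outbreak: each active case infects a Poisson(r0) number of new people."""
    active = total = 1
    while active and total < max_cases:
        new_cases = sum(poisson(r0) for _ in range(active))
        active, total = new_cases, total + new_cases
    return total

sizes = [outbreak_size() for _ in range(1_000)]
fizzled = sum(size < 100 for size in sizes)
print(f"outbreaks that fizzled out on their own: about {fizzled / 10:.0f}%")
# With these settings, a sizable fraction die out by pure luck while the rest
# grow exponentially to the cap: identical starting conditions, wildly
# different outcomes, all down to chance in the first few generations.
```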

Final Predictions

So while I'm not sure I completely trust Silver's opinions on matters of global safety such as pandemics, terrorism, or climate change, I think he's written an entertaining tour of how probability and statistics are actually used, and he makes some good points about how important statistical literacy is these days (for making money, building better baseball teams, or making the world a safer place). The book is really aimed more at the layperson, and to be fair some of my criticisms come from wanting Silver to get into details that just aren't going to fit in this book. And for those already in the data science revolution (or planning to join), Silver's book reassures us all that yes, this job really is hard and frustrating for everyone. Even the famous data scientist with the really popular blog and talking-head appearances on CNN has gotten a lot wrong. But to put any of this into practice, you'll need to ditch the popularizations and head for the textbooks to get your hands dirty with statistical significance, sample sizes and study power, and the "feature selection" approaches I mentioned earlier, and then simply try your hand at making some predictors. And when you do, inevitably leveraging your newfound statistical expertise into millions in earnings at the poker table, remember the little blog post that encouraged you to get started! We need funding for our earthquake-prediction research — I'm onto something, I can feel it!
