Some people say that we now live in a “data rich” world. You can find a quantitative measurement for practically anything you want. This has been the case at least since the availability of Google search trends, and it continues with the advent of things like Twitter searches (including so-called “sentiment analysis”). You can look up many national demographic and economic variables going back at least to the turn of the century, which is potentially very helpful in assessing things like medicine, the justice system, voting patterns in presidential elections, and so on.
Coming along with “data rich”, though, can be “information poor”. The idea that just any quantitative variable tells you something meaningful is ludicrous. And it gets worse: even variables that are obviously meaningful can be so noisy that they end up being useless. An example is the way unemployment or gas prices affect whether an incumbent president is re-elected. These things are bound to make voters unhappy with the guy in office; but with so many other things happening in the world, who honestly thinks you can understand all that much about the election from a single variable? (Answer: nobody who is well-informed.)
How can we assess the voting outcome of American Idol? Well, sit and think about what you would base your vote on. How good was the singing? Naturally. Was the person in some other way annoying? Certainly that would make a difference. Was this week’s performance a dud, but last week’s really good? Maybe you’ll vote for them anyway on the strength of that. Now try to think of how other people might vote. Hawaiians might vote for this guy because he’s from Hawaii. Young girls will like that cute guy even though he sucks. Country fans will like this one. Worsters (the Vote for the Worst crowd) are pulling for that guy. The list is endless.
For all the crap they take, weather forecasts and presidential polling forecasts are extremely accurate. The fact that you can take such a chaotic system, reduce it down to a few variables, measure those variables, and build a model that is accurate even a fair fraction of the time is astounding. And weather forecasts do a good deal better than a fair fraction of the time. Presidential polls have been accurate in the final days of every election since Dewey lost to Truman. That’s pretty incredible.
What accounts for the success of these predictions? In weather, the advantage is massive amounts of data. You have every possible indicator for every day over the past hundred years. Given that, you are bound to be able to say with considerable precision how likely something is to happen. Suppose the forecast calls for a 30% chance of rain. That means that, in all of the historical data, when the indicators looked like they do today, it rained 30% of the time. If you only collected data for a month, that estimate would be pretty crappy: barometric pressure, humidity, and time of year all vary too much for one month to tell you anything. But tens of variables over all those years? That will be very good.
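To make “when the indicators looked like that” concrete, here is a toy sketch in R. Every name in it is invented, and real forecasting is vastly more sophisticated, but the underlying idea is just a conditional historical frequency:

```r
# Toy sketch: "30% chance of rain" as a historical frequency.
# history and today are hypothetical data frames; thresholds are arbitrary.
similar <- abs(history$pressure - today$pressure) < 2 &
  abs(history$humidity - today$humidity) < 5 &
  abs(history$day.of.year - today$day.of.year) < 15
mean(history$rained[similar])  # fraction of similar past days with rain, e.g. 0.30
```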
In political polling the advantage is that you’re measuring the actual thing that’s happening: you straight up ask people who they are going to vote for. Of course, this can still be faulty, and it changes a lot over time. But in the days before an election you can see the numbers sort of “lock in” as people make up their minds, and predictions based on these data become nearly bulletproof. This was most staggeringly clear in 2008, when Nate Silver predicted the electoral vote count almost exactly.
These two strategies are useless for American Idol. The weather approach fails because there have only been 10 full seasons of the show. If the show ran for another 100 years (not likely), then you would start to be able to make some damn good predictions. The polling approach fails because there is no way to poll people during the two-hour voting window that opens right after the show. Even if you had the resources (money) to do it, it’s not practicable. You can do something like exit polls, which is how Votefair works, but this is fraught with sampling error: the people who vote on Votefair are not representative of the people who vote overall.
Here are some variables that we could try to collect to help us predict:
- Approval rating of performance by WhatNotToSing.com
- Dialidol score
- Votefair votes
- Performance position (when did the performance take place?)
- Aggregate blogger predictions
- Pre-exposure time
- Gender
- Other polls (such as Ricky.org or IdolBlogLive)
Then we could think of some other attendant or derived variables. Maybe not just this week’s WNTS approval rating, but last week’s too. Maybe not just Dialidol score, but Dialidol raw votes, or Dialidol ranking. Maybe we take the performance position and extract a “performance order quotient” that fits the observed elimination rates. Again, the possibilities are endless.
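As a sketch of what deriving these might look like in R (the data frame and column names here are my own invention, not anything from the sites themselves):

```r
# Hypothetical column names throughout. Rank the Dialidol raw votes within
# each week: 1 = most votes recorded, 13 = fewest.
idol$DIVRank <- ave(-idol$Dialidol.Votes, idol$Season, idol$Week,
                    FUN = function(v) rank(v, ties.method = "first"))

# Last week's WNTS rating for each contestant
# (assumes the rows are sorted by week within each season).
idol$WNTS.Last <- ave(idol$WNTS.Rating, idol$Season, idol$Contestant,
                      FUN = function(x) c(NA, head(x, -1)))
```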
Here’s the thing: it’s obvious that all of these variables are indicative of the outcome. If everyone says that Jeremy Rosado is gone, then he’s probably gone, and indeed he was. But the indicators can also whiff completely: the only sign I saw that Elise Testone was the lowest vote-getter among the women was that MJ from mjsbigblog thought it was going to happen, and that’s just one person’s feeling, not any kind of measurement.
Let’s dig into this a little more. Here are the relevant variables from last week:
| Contestant | Order | Song | WNTS rating | WNTS std. dev. | Result | Dialidol score | Dialidol votes (thousands) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Joshua Ledet | 1 | I Wish | 70 | 14 | Bottom Group | 1.641 | 2.37 |
| Elise Testone | 2 | I’m Your Baby Tonight | 33 | 17 | Bottom Group | 2.994 | 1.74 |
| Jermaine Jones | 3 | Knocks Me Off My Feet | 40 | 19 | Bottom Group | 0.82 | 1.31 |
| Erika Van Pelt | 4 | I Believe In You And Me | 73 | 11 | Bottom Group | 1.377 | 5.19 |
| Colton Dixon | 5 | Lately | 51 | 17 | Safe | 2.719 | 4.26 |
| Shannon Magrane | 6 | I Have Nothing | 23 | 17 | Bottom Group | 1.836 | 1.54 |
| Deandre Brackensick | 7 | Master Blaster | 56 | 22 | Safe | 4.247 | 1.48 |
| Skylar Laine | 8 | Where Do Broken Hearts Go | 78 | 16 | Safe | 3.572 | 2.65 |
| Heejun Han | 9 | All In Love Is Fair | 45 | 20 | Safe | 3.055 | 1.38 |
| Hollie Cavanagh | 10 | All The Man That I Need | 88 | 9 | Safe | 3.987 | 3.86 |
| Jeremy Rosado | 11 | Ribbon In The Sky | 32 | 19 | Eliminated | 3.135 | 0.53 |
| Jessica Sanchez | 12 | I Will Always Love You | 91 | 7 | Safe | 4.041 | 4.28 |
| Phillip Phillips | 13 | Superstition | 60 | 25 | Safe | 4.66 | 4.08 |
So, the picture is very muddy. Elise Testone ranks higher than Shannon Magrane on WNTS (though within the margin of error), and Dialidol records higher numbers for Elise, both in raw votes and in computed score. And yet Elise ends up being the lowest vote-getter. Maybe we just say that it was too close to call. Fine. But now look at the 3rd-lowest vote-getter among the women, Erika Van Pelt. She has the 4th-highest WNTS score among the six women, which is to say the 3rd-lowest, exactly where she ended up. But her Dialidol score is lower than either Shannon’s or Elise’s, while her raw number of Dialidol votes is the highest of anyone in the field!
Now let’s move on to the men. The lowest WNTS score among the men was Jeremy’s. Good. Next lowest is Jermaine. Good! After that comes … Heejun Han, Colton Dixon, Deandre Brackensick, and Phillip Phillips. Only after those four guys do we get to the actual third member of the bottom 3 men, Joshua Ledet, whom WNTS scored highest of all the men. Dialidol, meanwhile, had Joshua at the second-lowest score among the men. Exactly what Dialidol said!!!
This isn’t very surprising. Both WNTS and Dialidol are sampling, and they sample different subsets of the people who vote on American Idol. The people who blog about Idol and get polled by WNTS are different, both in actual identity and in demographics, from the people who install Dialidol and use their modems to power-vote. We should expect the two to diverge, and we should expect each to be right in certain respects. Maybe Dialidol is simply a better gauge of the men’s performances. Maybe more votes get cast for men overall, so the sampling error there is somewhat smaller.
Or, most plausible of all, maybe the data is just fricken noisy.
Now, to try to assess the noise, you do a regression analysis. That’s just a term that means you take events that happened previously, look at the outcome versus the variables you’re interested in, and guess at a shape for that dependence. Then you try to fit the shape to the data. Finally you plug in your new data (what happened this week) and see what the outcome should be according to the curve you fit. Suppose I load all the data from the Top 12 and Top 13 weeks of seasons 2 through 10 (season 1 started with a Top 10). I can look at the outcomes (either eliminated or not eliminated) versus some of these variables. First up is plain old WNTS score.
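For the record, the fit itself is one line in R. This is a sketch under my own assumptions about storage: a data frame (call it idol) with one row per performance and Result coded 1 for eliminated, 0 for safe.

```r
# Logistic regression of elimination on WNTS rating.
# idol and its column layout are assumptions, not the author's actual script.
fit.wnts <- glm(Result ~ WNTS.Rating, data = idol,
                family = binomial(link = "logit"))
summary(fit.wnts)
```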
```
Call:
glm(formula = Result ~ WNTS.Rating, family = binomial(link = "logit"))

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.29195  -0.42097  -0.22532  -0.07682   2.38317

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.66594    0.76767   0.867  0.38568
WNTS.Rating -0.08013    0.02445  -3.277  0.00105 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 66.828  on 108  degrees of freedom
Residual deviance: 48.937  on 107  degrees of freedom
AIC: 52.937

Number of Fisher Scoring iterations: 7
```
The most important thing here is the significance, under “Pr(>|z|)”, which tells you roughly how likely it is that you would see an effect this strong by chance alone if the variable actually had no influence. In this case, my stats program is telling me that for WNTS.Rating that probability is 0.00105, only about a 0.1% chance. OK, yes, song quality affects whether or not people are eliminated. Duh.
Now, suppose that we try Dialidol. Why Dialidol? Well, I said that to do the regression analysis you need historical data, and Dialidol has been around quite a long time; Votefair has not, and Twitter has not. Dialidol also has a decent track record. In any case, there’s no harm in trying. Let’s use the Dialidol ranking, where 1 is the highest number of votes recorded and 13 is the lowest.
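The call is the same shape as before, just with the rank swapped in as the predictor (still a sketch against my assumed data frame; DIVRank is the name that shows up in the output below):

```r
# Same logistic fit, with the within-week Dialidol vote rank as predictor.
fit.div <- glm(Result ~ DIVRank, data = idol,
               family = binomial(link = "logit"))
summary(fit.div)
```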
```
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -7.7901     2.7310  -2.852  0.00434 **
DIVRank       0.6085     0.2529   2.406  0.01613 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```
So vote ranking by Dialidol is also statistically significant. Again, of course it is: you’re measuring while people are actually voting!
Now, we can actually fit both of these variables at the same time, so that we have a model that takes both things into account.
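In R’s formula syntax that just means adding a term on the right-hand side (same sketch assumptions as before):

```r
# Two-predictor logistic model: WNTS rating and Dialidol rank together.
fit.both <- glm(Result ~ WNTS.Rating + DIVRank, data = idol,
                family = binomial(link = "logit"))
summary(fit.both)
```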
```
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.26410    3.21264  -1.639   0.1013
WNTS.Rating -0.07096    0.03809  -1.863   0.0625 .
DIVRank      0.59675    0.29605   2.016   0.0438 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```
This is good. Both variables are at least marginally significant: WNTS.Rating slips to p ≈ 0.06 once the Dialidol rank is in the model, but that’s about as well as can be hoped. So now let’s plug in the WNTS rating and the Dialidol voting rank for last week’s contestants, and use the regression we just fit to say how likely each contestant’s elimination was:
```
            Contestant Sex WNTS.Rating DIVRank  Prob
1        Jeremy Rosado   M          32      13 44.70
2       Jermaine Jones   M          40      12 22.58
5      Shannon Magrane   F          23       9 14.38
3           Heejun Han   M          45      11 10.53
6        Elise Testone   F          33       8  4.48
4  Deandre Brackensick   M          56      10  2.94
7         Joshua Ledet   M          70       7  0.19
11        Colton Dixon   M          51       3  0.07
10    Phillip Phillips   M          60       4  0.06
8         Skylar Laine   F          78       6  0.06
9      Hollie Cavanagh   F          88       5  0.02
13      Erika Van Pelt   F          73       1  0.00
12     Jessica Sanchez   F          91       2  0.00
```
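For the record, the Prob column is just predict() applied to this week’s numbers (a sketch; thisweek is a hypothetical data frame holding this week’s values of the two predictors):

```r
# Predicted elimination probability (in percent) for this week's contestants,
# from the two-variable fit above. thisweek is an assumed data frame with
# Contestant, Sex, WNTS.Rating, and DIVRank columns.
thisweek$Prob <- round(100 * predict(fit.both, newdata = thisweek,
                                     type = "response"), 2)
thisweek[order(-thisweek$Prob), ]
```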
Looking at this, you might say it isn’t too bad a projection: four of the five highest-probability contestants were indeed in the bottom groups. However, Joshua Ledet is out of order, and Erika Van Pelt is way, way out of order. The result is a bad projection at that end, assigning a 0.00% chance to an event that actually happened: Erika was in the bottom 3 girls. That should not have happened.
So what went wrong? Nothing went wrong! All of this is totally intellectually sound. It was the data that were funky. A variable which was a good predictor in past years wasn’t this year, at least as far as Erika is concerned.
It is very tempting to start fiddling with the model. Erika is clearly being oversampled by Dialidol, presumably on account of a small number of users who power-vote only for her. You could just build in a mechanism to dock Erika some number of Dialidol votes; likewise, Heejun should get a bonus. Then you can get everything to line up nicely. But on what intellectual basis have you done this? Why adjust some values and not others? Why not adjust the WNTS score instead? Maybe the bloggers are the ones who rated her performance too highly. The answer is that there is no real intellectual basis. Since we can’t observe any of these voting subgroups separately, we can’t tell which adjustment, if any, is justified.
So, I’m not going to screw with the model. It is just going to get some things very wrong, and seeing the underlying data may help explain why. This is certainly a situation of being “information poor”, and it doesn’t look to get any richer anytime soon. Not adjusting the model, though, actually makes it more robust, not less: the model stays simple and doesn’t accumulate a bunch of ad hoc assumptions about what is going on.