The peril of model fiddling

Some people say that we now live in a “data rich” world. You can find a quantitative measurement for practically anything you want. This has been the case at least since the availability of Google search trends, and continues with the advent of things like Twitter searches (including so-called “sentiment analysis”). You can look up many national demographic and economic variables going back at least to the turn of the century, which potentially is very helpful in assessing things like medicine, the justice system, voting patterns in presidential elections, and so on.

Coming along with “data rich”, though, can be “information poor”. The idea that any quantitative variable tells you something meaningful is ludicrous. And it gets worse: even variables which are obviously meaningful can be so noisy that they end up being useless. An example would be the way that unemployment or gas prices affect whether an incumbent president is re-elected. These variables are bound to make voters unhappy with the guy in office; but with so many other things happening in the world, who honestly thinks you can understand all that much about it from any single one of them? (Answer: nobody who is well-informed.)

How can we assess the voting outcome of American Idol? Well, sit and think of what you would vote based on. How good was the singing? Naturally. Was the person in some other way annoying? Certainly that would make a difference. Was this week’s performance a dud, but last week’s really good? Maybe you’ll vote anyway based on that. Now sit and try to think of how other people might vote. Hawaiians might vote for this guy because he’s from Hawaii. Young girls would like this cute guy even though he sucks. Country fans will like this. Worsters are pulling for this guy. The list is endless.

For all the crap they take, weather forecasts and presidential polling forecasts are extremely accurate. The fact that you can take such a chaotic system, reduce it down to a few variables, measure those variables, and build a model that is accurate even a fair fraction of the time is astounding. And weather forecasts are right a good deal more often than that. Presidential polls have been accurate in the final days of an election in all years since Dewey lost to Truman. That’s pretty incredible.

What accounts for the success of these predictions? In weather, the advantage is in massive amounts of data. You have every possible indicator on every day for the past hundred years. Given that, you are bound to be able to say with considerable precision whether something will happen. Suppose the weather forecast calls for a 30% chance of rain. That means that in all of the data, when the indicators looked like that, it rained 30% of the time. If you collect data for a month, that will be pretty crappy. Barometric pressure, humidity, time of year, all of those vary too much to tell you. But looking at tens of variables over all years? That will be very good.
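
To put a rough number on why a month of data is so much worse than a century, here is a minimal sketch of the standard error of an observed frequency. It assumes, generously, that every day in the record had matching indicators and that days are independent of each other; neither is really true of weather, and the function name is mine, so treat the output as an order-of-magnitude illustration only.

```python
import math

def frequency_standard_error(p, n_days):
    """Standard error of an observed frequency p estimated from n_days
    independent yes/no observations (binomial approximation)."""
    return math.sqrt(p * (1.0 - p) / n_days)

# ASSUMPTION: a true rain frequency of 30% for days that "look like this".
one_month = frequency_standard_error(0.30, 30)
one_century = frequency_standard_error(0.30, 365 * 100)

print(f"30 days:   30% give or take {100 * one_month:.1f} points")
print(f"100 years: 30% give or take {100 * one_century:.2f} points")
```

With a month of data the "30%" is uncertain by several percentage points; with a century it is pinned down to a fraction of a point.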

In political polling the advantage is that you’re measuring the actual thing that’s happening. You straight up ask people who they are going to vote for. Of course, this could still be faulty, and it changes a lot over time. But in the days before an election you can see that the numbers sort of “lock in”, as people make up their minds, and the predictions become nearly bulletproof based on these data. This was most staggeringly clear when Nate Silver predicted the number of electoral votes in 2008 within a very small fraction of votes.

These two strategies are useless for American Idol. The weather approach fails because there have only been 10 full seasons of the show. If the show continued for another 100 years (not likely), then you would start to be able to make some damn good predictions. The polling approach fails because there is no way to poll how people are voting during the 2-hour voting window that comes directly after the show. Even if you had the resources (money) to do it, it’s not practicable. You can do something like exit polls, which is how Votefair works, but this is fraught with sampling bias, meaning that the people who vote on Votefair are not representative of those who vote in total.

Here are some variables that we could try to collect to help us predict:

  • Approval rating of performance by WNTS
  • Dialidol score
  • Votefair votes
  • Performance position (when did the performance take place?)
  • Aggregate blogger predictions
  • Pre-exposure time
  • Gender
  • Other polls (such as or IdolBlogLive)

Then we could think of some other attendant or derived variables. Maybe this week’s WNTS approval rating and last week’s approval rating. Maybe not just Dialidol score, but Dialidol raw votes, or Dialidol ranking. Maybe we take the performance position and extract a “performance order quotient” that fits the observed elimination rates. Again, the possibilities are endless.

Here’s the thing: it’s obvious that all of these variables are indicative of the outcome. If everyone says that Jeremy Rosado is gone, then he’s probably gone. But the indicators don’t always line up that neatly! The only indicator I saw that Elise Testone was the lowest vote getter among the women was that MJ from mjsbigblog thought that it was going to happen, and that’s just one person’s feeling, not any kind of measurement.

Let’s dig into this a little more. Here are the relevant variables from last week:

Contestant           Order  Song                       WNTS Rating  WNTS stdev  Result        Dialidol score  Dialidol votes (thousands)
Joshua Ledet         1      I Wish                     70           14          Bottom Group  1.641           2.37
Elise Testone        2      I’m Your Baby Tonight      33           17          Bottom Group  2.994           1.74
Jermaine Jones       3      Knocks Me Off My Feet      40           19          Bottom Group  0.82            1.31
Erika Van Pelt       4      I Believe In You And Me    73           11          Bottom Group  1.377           5.19
Colton Dixon         5      Lately                     51           17          Safe          2.719           4.26
Shannon Magrane      6      I Have Nothing             23           17          Bottom Group  1.836           1.54
Deandre Brackensick  7      Master Blaster             56           22          Safe          4.247           1.48
Skylar Laine         8      Where Do Broken Hearts Go  78           16          Safe          3.572           2.65
Heejun Han           9      All In Love Is Fair        45           20          Safe          3.055           1.38
Hollie Cavanagh      10     All The Man That I Need    88           9           Safe          3.987           3.86
Jeremy Rosado        11     Ribbon In The Sky          32           19          Eliminated    3.135           0.53
Jessica Sanchez      12     I Will Always Love You     91           7           Safe          4.041           4.28
Phillip Phillips     13     Superstition               60           25          Safe          4.66            4.08

So, the picture is very muddy. Elise Testone ranks higher than (though within the margin of error of) Shannon Magrane on WNTS, and Dialidol has her higher too, both in votes and in computed score. And yet she ends up being the lowest, but maybe we just say that it was too close to call. Fine. But let’s now look at the 3rd lowest vote getter among the women, Erika Van Pelt. She has the 4th highest WNTS score among the women, exactly where she should have been. Her Dialidol score is lower than either Shannon’s or Elise’s. But her raw number of Dialidol votes is the highest of all!

Now let’s move on to the men. The lowest WNTS among men was Jeremy. Good. Next lowest is Jermaine. Good! After that come … Heejun Han, Colton Dixon, Deandre Brackensick, and Phillip Phillips. Only after those 4 guys do we see the actual third member of the bottom 3 men, Joshua Ledet. Yet Joshua landing in the bottom group is exactly what Dialidol said: its computed score had him second lowest among the men!!!
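
The comparisons in the last two paragraphs are easy to check mechanically. The following is a small sketch that re-keys the table above by hand and lists the three lowest contestants of each sex under each measure; the helper name `lowest` and the tuple layout are mine, not part of any tool mentioned here.

```python
# (name, sex, WNTS rating, Dialidol score, Dialidol votes in thousands),
# copied from the table above; sex as described in the text.
rows = [
    ("Joshua Ledet", "M", 70, 1.641, 2.37),
    ("Elise Testone", "F", 33, 2.994, 1.74),
    ("Jermaine Jones", "M", 40, 0.82, 1.31),
    ("Erika Van Pelt", "F", 73, 1.377, 5.19),
    ("Colton Dixon", "M", 51, 2.719, 4.26),
    ("Shannon Magrane", "F", 23, 1.836, 1.54),
    ("Deandre Brackensick", "M", 56, 4.247, 1.48),
    ("Skylar Laine", "F", 78, 3.572, 2.65),
    ("Heejun Han", "M", 45, 3.055, 1.38),
    ("Hollie Cavanagh", "F", 88, 3.987, 3.86),
    ("Jeremy Rosado", "M", 32, 3.135, 0.53),
    ("Jessica Sanchez", "F", 91, 4.041, 4.28),
    ("Phillip Phillips", "M", 60, 4.66, 4.08),
]
MEASURES = {"WNTS rating": 2, "Dialidol score": 3, "Dialidol votes": 4}

def lowest(sex, measure, n=3):
    """Names of the n contestants of the given sex with the lowest value."""
    column = MEASURES[measure]
    group = [r for r in rows if r[1] == sex]
    return [r[0] for r in sorted(group, key=lambda r: r[column])[:n]]

for sex in ("F", "M"):
    for measure in MEASURES:
        print(sex, measure, lowest(sex, measure))
```

No single measure picks out both bottom groups: WNTS gets the three women right but misses Joshua, while the Dialidol score catches Joshua but misses Jeremy.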

This isn’t very surprising. Both WNTS and Dialidol are sampling, and they sample different subsets of the people who vote on American Idol. The people who blog about Idol and get polled by WNTS are different both in actual identity and in demographics than the people who install Dialidol and use their modem to power-vote. We should expect them to diverge and we should expect that they are both right in certain respects. Maybe Dialidol is simply a better gauge of men’s performances. Maybe it’s that there tend to be more overall votes for men anyway, so the sampling error is somewhat minimized.

Or, most plausible of all, maybe the data is just fricken noisy.

Now, to try to assess the noise, you do a regression analysis. That’s just a term that means that you take events that happened previously, you look at the outcome versus the variables you’re interested in, and then you guess at a shape of that dependence. Then you try to fit the shape to the data. You then plug your new data in (what happened this week) and see what the outcome should be based on this curve that you fit. Suppose I load all the data from the Top 12 and Top 13 from seasons 2 to 10 (season 1 started with Top 10). I can look at the outcomes (either eliminated or not eliminated) versus some of these variables. First is plain old WNTS score.

glm(formula = Result ~ WNTS.Rating, family = binomial(link = "logit"))

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.29195  -0.42097  -0.22532  -0.07682   2.38317  

            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  0.66594    0.76767   0.867  0.38568   
WNTS.Rating -0.08013    0.02445  -3.277  0.00105 **

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 66.828  on 108  degrees of freedom
Residual deviance: 48.937  on 107  degrees of freedom
AIC: 52.937

Number of Fisher Scoring iterations: 7
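
For anyone curious what a fit like this does under the hood, here is a bare-bones sketch in plain Python of the same kind of iteration that the “Fisher Scoring iterations” line refers to (for a logit link it coincides with Newton’s method). The ratings and outcomes in it are made up purely for illustration and are NOT the historical Idol data; the function name is mine, and R’s glm does a great deal more (standard errors, deviances, convergence checks).

```python
import math

def fit_logistic(xs, ys, iterations=25):
    """Fit P(y=1) = 1 / (1 + exp(-(b0 + b1*x))) by Newton's method."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p            # gradient of the log-likelihood
            g1 += (y - p) * x
            h00 += w               # Fisher information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information) * step = gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# HYPOTHETICAL toy data: approval ratings, and 1 = eliminated, 0 = not.
ratings = [20, 25, 30, 35, 40, 45, 50, 55, 60, 70, 80, 90]
eliminated = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0]

b0, b1 = fit_logistic(ratings, eliminated)
print(f"intercept = {b0:.3f}, slope per rating point = {b1:.4f}")
```

The slope comes out negative for this toy data, as it does for the real fit: the higher the rating, the lower the fitted chance of elimination.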

The most important thing in the glm summary is the “significance”, under “Pr(>|z|)”, which tells you how likely it is that you would see an effect at least this strong by chance alone if the variable actually had no bearing on the outcome. In this case, my stats program is telling me that for WNTS.Rating that probability is 0.00105, so there is only about a 0.1% chance of getting a result like this from a meaningless variable. OK, yes, song quality affects whether or not people are eliminated. Duh.
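
The “z value” and “Pr(>|z|)” columns can be rebuilt from the first two columns, which helps demystify them. This sketch uses the standard two-sided normal tail formula with the rounded numbers printed in the summary; the function name is mine.

```python
import math

def wald_test(estimate, std_error):
    """z statistic and two-sided normal p-value for a single coefficient."""
    z = estimate / std_error
    # P(|Z| > |z|) for a standard normal Z
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# Estimate and Std. Error for WNTS.Rating, as printed in the summary.
z, p = wald_test(-0.08013, 0.02445)
print(f"z = {z:.3f}, p = {p:.5f}")
```

The result agrees with the printed row: a z of about -3.28 and a p-value of about 0.001, that is, roughly a tenth of a percent.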

Now, suppose that we try Dialidol. Why Dialidol? Well, I said that to do the regression analysis you need historical data, and Dialidol has been around quite a long time. Votefair has not. Twitter has not. It also has a decent track record. In any case, there’s no harm in trying it. Let’s try with the Dialidol ranking (where 1 is the highest number of votes recorded and 13 is the lowest)

            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  -7.7901     2.7310  -2.852  0.00434 **
DIVRank       0.6085     0.2529   2.406  0.01613 * 
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

So vote ranking by Dialidol is also statistically significant. Again, of course it is: you’re measuring while people are actually voting!
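
The estimates themselves are on the log-odds scale, which is hard to read directly. Exponentiating them gives odds ratios, as in this short sketch using the two single-variable fits above (variable names are mine).

```python
import math

# Slope estimates from the two single-variable fits above.
wnts_slope = -0.08013     # per point of WNTS rating
divrank_slope = 0.6085    # per step down the Dialidol vote ranking

wnts_odds_ratio = math.exp(wnts_slope)
divrank_odds_ratio = math.exp(divrank_slope)

print(f"One more WNTS point multiplies the odds of elimination by {wnts_odds_ratio:.3f}")
print(f"One rank worse on Dialidol multiplies them by {divrank_odds_ratio:.3f}")
```

So each extra WNTS point trims the odds of elimination by about 8%, while each step down the Dialidol ranking raises them by a bit over 80%. The two numbers are not directly comparable, since a WNTS point and a ranking step are very different units.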

Now, we can actually fit both of these variables at the same time, so that we have a model that takes both things into account:

            Estimate Std. Error z value Pr(>|z|)  
(Intercept) -5.26410    3.21264  -1.639   0.1013  
WNTS.Rating -0.07096    0.03809  -1.863   0.0625 .
DIVRank      0.59675    0.29605   2.016   0.0438 *
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

This is good. Both variables are significant, at least as well as can be hoped: DIVRank clears the usual 5% threshold and WNTS.Rating just misses it (p = 0.0625). So now let’s plug in the WNTS Rating and the Dialidol voting rank for last week’s contestants, and then use the regression that we just fit to say how likely elimination was:

            Contestant Sex WNTS.Rating DIVRank  Prob
1        Jeremy Rosado   M          32      13 44.70
2       Jermaine Jones   M          40      12 22.58
5      Shannon Magrane   F          23       9 14.38
3           Heejun Han   M          45      11 10.53
6        Elise Testone   F          33       8  4.48
4  Deandre Brackensick   M          56      10  2.94
7         Joshua Ledet   M          70       7  0.19
11        Colton Dixon   M          51       3  0.07
10    Phillip Phillips   M          60       4  0.06
8         Skylar Laine   F          78       6  0.06
9      Hollie Cavanagh   F          88       5  0.02
13      Erika Van Pelt   F          73       1  0.00
12     Jessica Sanchez   F          91       2  0.00
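
For reference, the Prob column can be reproduced from the fitted coefficients with a few lines of Python. One caveat: the numbers only match if each contestant’s fitted probability is rescaled so that the column sums to 100. That rescaling is my inference from the figures, not something stated here, so treat it as an assumption.

```python
import math

# Coefficients of the two-variable fit reported above.
INTERCEPT, B_WNTS, B_DIVRANK = -5.26410, -0.07096, 0.59675

# (contestant, WNTS rating, Dialidol vote rank) from the table above.
contestants = [
    ("Jeremy Rosado", 32, 13), ("Jermaine Jones", 40, 12),
    ("Shannon Magrane", 23, 9), ("Heejun Han", 45, 11),
    ("Elise Testone", 33, 8), ("Deandre Brackensick", 56, 10),
    ("Joshua Ledet", 70, 7), ("Colton Dixon", 51, 3),
    ("Phillip Phillips", 60, 4), ("Skylar Laine", 78, 6),
    ("Hollie Cavanagh", 88, 5), ("Erika Van Pelt", 73, 1),
    ("Jessica Sanchez", 91, 2),
]

def fitted_probability(wnts, divrank):
    """Inverse logit of the linear predictor."""
    logit = INTERCEPT + B_WNTS * wnts + B_DIVRANK * divrank
    return 1.0 / (1.0 + math.exp(-logit))

raw = {name: fitted_probability(w, r) for name, w, r in contestants}
total = sum(raw.values())
# ASSUMPTION: rescale so the percentages across all contestants sum to 100.
prob = {name: 100.0 * p / total for name, p in raw.items()}

for name, p in sorted(prob.items(), key=lambda item: -item[1]):
    print(f"{name:<20}{p:6.2f}")
```

Because of the logistic shape, a contestant with a top Dialidol rank and a decent WNTS rating ends up with a probability that rounds to zero, which is exactly what happens to Erika.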

Looking at this, you might say that this isn’t too bad a projection. 4 out of the 5 highest probability contestants were indeed in the bottom 3. However, Joshua Ledet is out of order and Erika Van Pelt is way way out of order. The result is a bad projection on that end, assigning a 0.00% chance to something that in fact happened: Erika was in the bottom 3 girls. That should not have happened.

So what went wrong? Nothing went wrong! All of this is totally intellectually sound. It was the data that were funky. A variable which was a good predictor in past years wasn’t this year, at least as far as Erika is concerned.

It is very tempting to start fiddling with the model. Erika is clearly being oversampled by Dialidol, on account of a small number of users who power vote only for her. You could just build in a mechanism to dock Erika some number of Dialidol votes. Likewise Heejun should get a bonus. Then you can get everything to line up nice. However, on what intellectual basis have you done this? Why adjust some values but not others? Why not adjust the WNTS score? Maybe the bloggers are the ones that rated it too highly. The answer is that there is no real intellectual basis. Since we don’t have any of these variables separated, we can’t tell anything about them.

So, I’m not going to screw with the model. It is just going to get some things very wrong. Seeing the underlying data may help explain to you why this is. This is certainly a situation of “information poor”, and it doesn’t look to get any richer anytime soon. Leaving the model alone, though, actually makes it more robust, not less, since the model stays simple and doesn’t start making a bunch of ad hoc assumptions about what is going on.
