The peril of model fiddling

Some people say that we now live in a “data rich” world. You can find a quantitative measurement for practically anything you want. This has been the case at least since the availability of Google search trends, and continues with the advent of things like Twitter searches (including so-called “sentiment analysis”). You can look up many national demographic and economic variables going back at least to the turn of the century, which potentially is very helpful in assessing things like medicine, the justice system, voting patterns in presidential elections, and so on.

Coming along with “data rich”, though, can be “information poor”. The idea that any old quantitative variable tells you something meaningful is ludicrous. And it gets worse: even variables which are obviously meaningful can be so noisy that they end up being useless. An example is the way that unemployment or gas prices affect whether an incumbent president is re-elected. High unemployment and gas prices are bound to make voters unhappy with the guy in office; but with so many other things happening in the world, who honestly thinks you can understand all that much about the election from a single variable like that? (Answer: nobody who is well informed.)

How can we predict the voting outcome of American Idol? Well, sit and think about what you would base your vote on. How good was the singing? Naturally. Was the person in some other way annoying? Certainly that would make a difference. Was this week’s performance a dud, but last week’s really good? Maybe you’ll vote for them anyway on that basis. Now try to think of how other people might vote. Hawaiians might vote for this guy because he’s from Hawaii. Young girls will like that cute guy even though he sucks. Country fans will like this one. Worsters (the Vote for the Worst crowd) are pulling for that guy. The list is endless.

For all the crap they take, weather forecasts and presidential polling forecasts are extremely accurate. The fact that you can take such a chaotic system, reduce it down to a few variables, measure those variables, and build a model that is accurate even a fair fraction of the time is astounding. And weather forecasts are a good deal more accurate than even a fair fraction of the time. Presidential polls have been accurate in the final days of an election in all years since Dewey lost to Truman. That’s pretty incredible.

What accounts for the success of these predictions? In weather, the advantage is massive amounts of data. You have every possible indicator on every day for the past hundred years. Given that, you are bound to be able to say with considerable precision whether something will happen. Suppose the weather forecast calls for a 30% chance of rain. That means that across all of the historical data, when the indicators looked like they do today, it rained 30% of the time. If you only collected data for a month, that estimate would be pretty crappy: barometric pressure, humidity, and time of year all vary too much over a single month to tell you anything. But tens of variables over a hundred years? That will be very good.
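
To put the same idea in code, here’s a toy sketch with made-up data frame and column names (not anything a real forecaster uses): find the historical days whose indicators look like today’s, and ask how often it rained.

# 'wx' holds one row per historical day; 'today' holds today's readings
similar <- abs(wx$pressure - today$pressure) < 2 &
           abs(wx$humidity - today$humidity) < 5 &
           abs(wx$doy      - today$doy)      < 15
mean(wx$rained[similar])   # e.g. 0.30, which you'd report as a "30% chance of rain"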

In political polling the advantage is that you’re measuring the actual thing that’s happening: you straight up ask people who they are going to vote for. Of course, this can still be faulty, and it changes a lot over time. But in the days before an election you can see the numbers sort of “lock in” as people make up their minds, and the predictions become nearly bulletproof based on these data. This was most staggeringly clear when Nate Silver predicted the 2008 electoral vote count to within a very small margin.

These two strategies are useless for American Idol. The weather example fails because there have only been 10 full seasons of the show. If the show continued for another 100 years (not likely), then you would start to be able to make some damn good predictions. The polling example fails because there is no way to poll people during the two-hour voting window that opens right after the show. Even if you had the resources (money) to do it, it’s not practicable. You can do something like exit polls, which is how Votefair works, but those are fraught with sampling bias, meaning that the people who vote on Votefair are not representative of the people who vote in total.

Here are some variables that we could try to collect to help us predict:

  • Approval rating of performance by WhatNotToSing.com
  • Dialidol score
  • Votefair votes
  • Performance position (when did the performance take place?)
  • Aggregate blogger predictions
  • Pre-exposure time
  • Gender
  • Other polls (such as Ricky.org or IdolBlogLive)

Then we could think of some other attendant or derived variables. Maybe not just this week’s WNTS approval rating, but also last week’s. Maybe not just the Dialidol score, but the raw Dialidol vote count, or the Dialidol ranking. Maybe we take the performance position and extract a “performance order quotient” that fits the observed elimination rates. Again, the possibilities are endless.
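
For instance, deriving a within-week Dialidol vote ranking from the raw vote counts might look like this (just a sketch; the data frame and column names are mine):

# 'week' is a hypothetical data frame with one row per contestant for a given week
week$DIVRank <- rank(-week$DIV.Votes, ties.method = "min")   # 1 = most Dialidol votes, 13 = fewest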

Here’s the thing: it’s obvious that all of these variables are indicative of the outcome. If everyone says that Jeremy Rosado is gone, then he’s probably gone. But that isn’t all that happened! The only indicator I saw that Elise Testone was the lowest vote-getter among the women was that MJ from mjsbigblog thought that it was going to happen, and that’s just one person’s feeling, not any kind of measurement.

Let’s dig into this a little more. Here are the relevant variables from last week:

Contestant            Order  Song                       WNTS Rating  WNTS stdev  Result        Dialidol score  Dialidol votes (thousands)
Joshua Ledet              1  I Wish                              70          14  Bottom Group           1.641  2.37
Elise Testone             2  I’m Your Baby Tonight               33          17  Bottom Group           2.994  1.74
Jermaine Jones            3  Knocks Me Off My Feet               40          19  Bottom Group            0.82  1.31
Erika Van Pelt            4  I Believe In You And Me             73          11  Bottom Group           1.377  5.19
Colton Dixon              5  Lately                              51          17  Safe                   2.719  4.26
Shannon Magrane           6  I Have Nothing                      23          17  Bottom Group           1.836  1.54
Deandre Brackensick       7  Master Blaster                      56          22  Safe                   4.247  1.48
Skylar Laine              8  Where Do Broken Hearts Go           78          16  Safe                   3.572  2.65
Heejun Han                9  All In Love Is Fair                 45          20  Safe                   3.055  1.38
Hollie Cavanagh          10  All The Man That I Need             88           9  Safe                   3.987  3.86
Jeremy Rosado            11  Ribbon In The Sky                   32          19  Eliminated             3.135  0.53
Jessica Sanchez          12  I Will Always Love You              91           7  Safe                   4.041  4.28
Phillip Phillips         13  Superstition                        60          25  Safe                    4.66  4.08

So, the picture is very muddy. Elise Testone ranks higher than Shannon Magrane on WNTS (though within the margin of error), and Dialidol records a higher score for her, both in votes and in computed score. And yet Elise ends up being the lowest. Maybe we just say that it was too close to call. Fine. But now look at the 3rd lowest vote-getter among the women, Erika Van Pelt. She has the 4th highest WNTS score among the women, exactly where she should have been. Her Dialidol score is lower than either Shannon’s or Elise’s. But her raw number of Dialidol votes is the highest of all!

Now let’s move on to the men. The lowest WNTS score among the men was Jeremy’s. Good. Next lowest was Jermaine’s. Good! After that come … Heejun Han, Colton Dixon, Deandre Brackensick, and Phillip Phillips. Only after those 4 guys do we see the actual third member of the bottom 3 men, Joshua Ledet. Exactly what Dialidol said!!!

This isn’t very surprising. Both WNTS and Dialidol are sampling, and they sample different subsets of the people who vote on American Idol. The people who blog about Idol and get polled by WNTS are different, both in actual identity and in demographics, from the people who install Dialidol and use their modem to power-vote. We should expect them to diverge, and we should expect that they are both right in certain respects. Maybe Dialidol is simply a better gauge of the men’s performances. Maybe it’s that there tend to be more overall votes for men anyway, so the sampling error is somewhat smaller.

Or, most plausible of all, maybe the data is just fricken noisy.

Now, to try to assess the noise, you do a regression analysis. That’s just a term that means you take events that happened previously, look at the outcome versus the variables you’re interested in, and guess at a shape for that dependence. Then you try to fit that shape to the data. Finally, you plug in your new data (what happened this week) and see what the outcome should be according to the curve you fit. Suppose I load all the data from the Top 12 and Top 13 weeks from seasons 2 to 10 (season 1 started with a Top 10). I can look at the outcomes (either eliminated or not eliminated) versus some of these variables. First up is plain old WNTS score.
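
For concreteness, the fit itself is a one-liner in R. Here’s roughly what it looks like; the file name and the attach() housekeeping are just illustrative, and Result is coded 1 for eliminated, 0 for safe:

idol <- read.csv("idol_top12_top13_s2_to_s10.csv")   # hypothetical file of historical results
attach(idol)                                          # puts Result, WNTS.Rating, etc. in scope
fit.wnts <- glm(Result ~ WNTS.Rating, family = binomial(link = "logit"))
summary(fit.wnts)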

Call:
glm(formula = Result ~ WNTS.Rating, family = binomial(link = "logit"))

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.29195  -0.42097  -0.22532  -0.07682   2.38317  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  0.66594    0.76767   0.867  0.38568   
WNTS.Rating -0.08013    0.02445  -3.277  0.00105 **

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 66.828  on 108  degrees of freedom
Residual deviance: 48.937  on 107  degrees of freedom
AIC: 52.937

Number of Fisher Scoring iterations: 7

The most important thing here is the “significance”, under “Pr(>|z|)”, which tells you (roughly) the probability of seeing an effect this strong if the variable actually had no relationship to the outcome. In this case, my stats program is telling me that for WNTS.Rating that probability is 0.00105, or about 0.1%. OK, yes, song quality affects whether or not people are eliminated. Duh.

Now, suppose that we try Dialidol. Why Dialidol? Well, I said that to do the regression analysis you need historical data, and Dialidol has been around quite a long time. Votefair has not. Twitter has not. It also has a decent track record. In any case, there’s no harm in trying it. Let’s try the Dialidol ranking (where 1 is the highest number of votes recorded and 13 is the lowest):
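
Same recipe as before, just swapping in the ranking (DIVRank here is my name for that within-week Dialidol vote rank):

fit.div <- glm(Result ~ DIVRank, family = binomial(link = "logit"))
summary(fit.div)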

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  -7.7901     2.7310  -2.852  0.00434 **
DIVRank       0.6085     0.2529   2.406  0.01613 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

So vote ranking by Dialidol is also statistically significant. Again, of course it is: you’re measuring while people are actually voting!

Now, we can actually fit both of these variables at the same time, so that we have a model that takes both things into account:

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept) -5.26410    3.21264  -1.639   0.1013  
WNTS.Rating -0.07096    0.03809  -1.863   0.0625 .
DIVRank      0.59675    0.29605   2.016   0.0438 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

This is good. Both variables are at least marginally significant, which is as much as can be hoped. So now let’s plug in the WNTS Rating and the Dialidol voting rank for last week’s contestants, and then use the regression that we just fit to say how likely elimination was:

            Contestant Sex WNTS.Rating DIVRank  Prob
1        Jeremy Rosado   M          32      13 44.70
2       Jermaine Jones   M          40      12 22.58
5      Shannon Magrane   F          23       9 14.38
3           Heejun Han   M          45      11 10.53
6        Elise Testone   F          33       8  4.48
4  Deandre Brackensick   M          56      10  2.94
7         Joshua Ledet   M          70       7  0.19
11        Colton Dixon   M          51       3  0.07
10    Phillip Phillips   M          60       4  0.06
8         Skylar Laine   F          78       6  0.06
9      Hollie Cavanagh   F          88       5  0.02
13      Erika Van Pelt   F          73       1  0.00
12     Jessica Sanchez   F          91       2  0.00
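
For the curious, the combined fit and the probabilities in that table come from something like the following (the object names are mine; 'thisweek' is a hypothetical data frame holding this week’s WNTS.Rating and DIVRank for each contestant):

fit.both <- glm(Result ~ WNTS.Rating + DIVRank, family = binomial(link = "logit"))
thisweek$Prob <- round(100 * predict(fit.both, newdata = thisweek, type = "response"), 2)
thisweek[order(-thisweek$Prob), ]   # sort from most to least likely to be eliminated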

Looking at this, you might say it isn’t too bad a projection: 4 out of the 5 highest-probability contestants were indeed in a bottom group. However, Joshua Ledet is out of order and Erika Van Pelt is way, way out of order. The result is a bad projection on that end, assigning a 0.00% chance to an event that actually happened: Erika was in the bottom 3 girls. That should not have happened.

So what went wrong? Nothing went wrong! All of this is totally intellectually sound. It was the data that were funky. A variable which was a good predictor in past years wasn’t this year, at least as far as Erika is concerned.

It is very tempting to start fiddling with the model. Erika, it seems, is being oversampled by Dialidol, on account of a small number of users who power-vote only for her. You could just build in a mechanism to dock Erika some number of Dialidol votes. Likewise, Heejun should get a bonus. Then you can get everything to line up nicely. But on what intellectual basis have you done this? Why adjust some values but not others? Why not adjust the WNTS score instead? Maybe the bloggers are the ones who rated her too highly. The answer is that there is no real intellectual basis. Since we don’t have any of these variables separated, we can’t tell anything about them.

So, I’m not going to screw with the model. It is just going to get some things very wrong. Seeing the underlying data may help explain why. This is certainly a situation of “information poor”, and it doesn’t look to get any richer anytime soon. Not adjusting the model, though, actually makes it more robust, not less, since the model stays simple and doesn’t start piling up ad hoc assumptions about what is going on.

Huh…

So remember Haley Johnsen, the super-pretty ex-barista (or maybe she’s a barista again, who knows) who sang a truly execrable rendition of Sweet Dreams, failing to make it into the Top 13? Well apparently she wanted to do a remixed version that actually doesn’t suck:

(I really don’t recommend going back and watching her actual performance for comparison; it’s genuinely painful to watch.)

I wonder why they wouldn’t let her do it like that? Maybe the musicians couldn’t or wouldn’t learn a new version of the song with twenty-four other people to accommodate? I don’t know, but maybe if Haley had done that version…?

Then again, maybe not. In any case she never would have won; Hollie Cavanagh and Jessica Sanchez are much better prospects, not to mention my personal favorite, Erika Van Pelt. And that’s just the girls; as we’ve previously demonstrated, there is definitely a statistically significant gender bias in American Idol.

And credit where credit’s due: I first saw this linked from mjsbigblog.

And Now We All Feel Old*

Oh boy. So next week’s theme is going to be songs from the years the Idols were born. I am now officially older than all of the contestants, which I sort of intellectually knew, but all the same, what can prepare me for this? Two thirds of those young whippersnappers were born in the nineties. Crazy.

Via MJ:

Colton Dixon – October 19, 1991
Deandre Brackensick – October 21, 1994
Elise Testone – July 29, 1983
Erika Van Pelt – December 12, 1985
Heejun Han – April 20, 1989
Hollie Cavanagh – July 5, 1993
Jermaine Jones – November 3, 1986
Jessica Sanchez – August 4, 1995
Joshua Ledet – April 9, 1992
Phillip Phillips – September 20, 1990
Shannon Magrane – October 21, 1995
Skylar Laine – February 1, 1994

I have no idea what any of them should sing; Erika needs to sing something awesome, because she’s awesome, and Hollie needs to sing something that was not included in a Disney movie. I need for no one to sing any Nirvana songs (Colton Dixon, I’m looking at you: just don’t).

What do you guys think?

*and if you’re as young as or younger than these kids, get off my lawn!

On Hollie Cavanagh’s Speech

I feel really bad that a few of you dear readers have arrived here by searching for some variation on “does Hollie Cavanagh have a speech impediment.” I myself pondered this in our Women of the Top 24 liveblog.

The answer is no, she does not. She’s merely Liverpudlian in origin (i.e., from Liverpool, England, the city arguably most famous as the place where The Beatles formed). She mentioned in her interview for the Top 24 show that she was born in England, and her Wikipedia entry further clarifies her origins. Evidently that accent is what you get when you sprinkle a little Texas on top of Liverpool.

Actually it will be really interesting to hear her speak in additional interviews throughout the season; she was particularly disadvantaged by being virtually unseen in the Hollywood and Vegas weeks, and for some reason it wasn’t mentioned at all that she had made it to Hollywood last year. Quite odd, if you ask me.

In any case, Hollie is a really excellent singer, and I hope she goes far in the competition!

Would sentiment analysis work for Idol projections?

I was recently made aware of the following video interview with Richard Foley, who presented results of using “sentiment analysis” to predict American Idol outcomes:

Put simply, sentiment analysis attempts to determine what the public thinks about a certain topic by scouring data sources like Twitter and trying to classify each post as positive or negative. As far as I can tell, the results were not published outside of the conference.
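
To make that concrete, a toy version of the idea (emphatically not Foley’s actual method, which I haven’t seen) would score each tweet against lists of positive and negative words:

positive <- c("love", "amazing", "awesome", "great", "perfect")
negative <- c("hate", "awful", "terrible", "boring", "pitchy")
score_tweet <- function(tweet) {
  words <- strsplit(tolower(tweet), "[^a-z']+")[[1]]
  sum(words %in% positive) - sum(words %in% negative)
}
score_tweet("Jessica was amazing tonight, I love her")   # returns  2
score_tweet("that song choice was awful and boring")     # returns -2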

I’m highly skeptical of this method for several reasons.