I really hope that you take the time to read Nate Silver’s blog FiveThirtyEight, now part of the New York Times. Silver made his name building numerical models of baseball, and has since turned the same methods to politics. What I like best about his analysis is how carefully hedged his predictions are: he chooses his variables thoughtfully, always gives an error estimate, and rightly criticizes the practice of overfitting.

When I say overfitting, I mean the following. Suppose that I want to model a physical phenomenon like air flow over an airfoil. The flow rate at every point can be modeled, essentially completely, as a function of things like the shape of the airfoil, the density of the air, the moisture concentration, and at most a few other variables. That is, in principle it is possible to find a relationship that leaves no unpredictability: no aspect of the observed data goes uncaptured by the model. Just collect data until your noise is small, and do a fit.

By contrast, something like an election has intangibles and a lot of noise. Think of how many there could be. Beyond the quantifiable things, like the amount of money a candidate raises, how many field offices he has set up, and how many endorsements he’s won from prominent newspapers, he may have several intangibles that aren’t represented in those numbers but which greatly affect his chances of being elected. Does he look good? How high-pitched is his voice? Were there other news events when he announced his candidacy? And there are sources of noise: voting and polling conditions change from day to day, as do the people interested in the race, the actual turnout, registration, and so on.

If you want to overfit, you simply add variables to a model until every observation lies on your regression line. You then get, of course, a perfect regression that fits all of the data. But your model has absolutely no predictive power, because the fit was purely ad hoc: it wasn’t based on any real underlying causes; you just rigged it up so that there were no exceptions. Future data points will almost certainly not lie on the curve. A clear-cut example of this is Harry Enten’s supposed model of the number of congressional seats won in US elections, built from 6 variables. Extrapolate just one year outside the data that was fit, and the prediction is terrible.

In physics, this kind of curve fitting can be legitimate, since the underlying causes don’t vary from one trial to the next (though overfitting is still a hazard). The gulf between these two situations is what gives rise to the term physics envy.

What exactly broke down here? Well, there are only 15 elections and 6 variables. With that ratio of data points to parameters, the fit deserves no confidence. What you’ve modeled isn’t the underlying phenomenon, because 15 data points can’t pin it down; you’ve modeled the noise!
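To see the mechanics of this, here is a small Python sketch (the numbers are invented for illustration, not Enten’s actual data): seven points whose true trend is a straight line, fit once with a sensible one-variable model and once with a 6th-degree polynomial, which has seven parameters and so passes through every point exactly.

```python
import numpy as np

# Seven hypothetical "elections": the true trend is linear (y = 2x + 5),
# plus alternating +/-1 "noise". All numbers are invented for illustration.
x = np.arange(7, dtype=float)
y = 2 * x + 5 + np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0])

# A 6th-degree polynomial has 7 parameters, so it passes through all 7
# points exactly, "explaining" the noise right along with the trend.
six = np.polyfit(x, y, deg=6)
lin = np.polyfit(x, y, deg=1)

in_sample_err = np.abs(np.polyval(six, x) - y).max()  # essentially zero

# Now extrapolate one step outside the data, where the truth is 2*7 + 5 = 19.
pred_six = np.polyval(six, 7.0)  # lands near 146, wildly off
pred_lin = np.polyval(lin, 7.0)  # lands near 19.1, close to the truth
print(f"degree-6 prediction at x=7: {pred_six:.1f} (truth 19)")
print(f"degree-1 prediction at x=7: {pred_lin:.1f} (truth 19)")
```

The perfect in-sample fit is exactly what makes the out-of-sample prediction useless: the seven parameters were spent memorizing the ±1 wiggles, and the memorized wiggles explode outside the range of the data.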

Now, I bring this up because American Idol, as a contest, is much more like politics than it is like physics. The demographic makeup of those voting changes a lot from episode to episode. Some people will vote many times, and some will vote only a few. And just think about the number of variables that could swing these votes! The contestant may be attractive, or have been introduced with a touching or funny intro. He may have a handicap, or have been disadvantaged. Perhaps he had a good week previously, but is off his game because he was sick. Maybe he plays an instrument well, or selects a song nobody has yet sung on the show. Perhaps a certain contestant sings better on nights when she only sings one song, and others perform better on nights with multiple songs. And, for good measure, maybe the contestant is just plain more likable than the others.

It’s easy to think of instances of all of these things and more.

Let’s start with attractiveness. Ryan Starr in S1 was practically a model. Did that help or hurt? It’s hard to say. Perhaps she got further than she should have. Perhaps, on the other hand, she wasn’t taken as seriously as some of the other women, like Kelly Clarkson, who weren’t quite as attractive and were hence considered more “serious”. Being an attractive woman didn’t seem to help Julie Zorrilla much in S10, but Haley Reinhart likely gained some from it, as did Katharine McPhee in S5, and Haley Scarnato the following year. And what of the men? Would Kris Allen have won S8 if he looked, instead, like Scott Savol? Could Tim Urban have advanced as far as he did?

The question is whether you can credibly incorporate this into a numerical model to help with predictions. To do this, you would need some way of assessing the attractiveness of each contestant, which is rather difficult to do. One could simply assign a 1 for “generally attractive” and a 0 for “generally unattractive”, but opinions on which value to assign could differ greatly between the sexes, between age groups, and possibly based on how the person was styled for the show.

How about likeability? Scotty McCreery was plainly likeable. He was featured heavily in the Hollywood round episodes. During that, he was seen self-flagellating after his group booted Jacee Badeaux, a very nice kid with a heart of gold. McCreery felt terrible about it, and the producers made a point of showing just how bad he felt: he apologized on stage before his audition. This is a huge amount of airtime, considering that there were, at this point, over 100 contestants. Is it any wonder that his scores seemed completely disconnected from the voting results? America may not have loved his singing, but they did love *him*.

One can think of a ton of other factors that affect the voting other than the variables I usually take into account in projections (gender, approval rating, and performance order). Jasmine Trias was able to secure an entire state’s votes (Hawaii). Sanjaya Malakar was probably at least somewhat helped by Vote For the Worst and Howard Stern. Finally, I would say that the fact that the judges seem to change every 1-2 years has greatly influenced the voting.

American Idol has had 10 full seasons. Judging the Top 3 against the Top 10 doesn’t make any sense, as the dynamics change a lot from round to round. So, for any given prediction, you have at most 10 data points, one for each season. Including even two variables in a regression model may already be overfitting! Nonetheless, as I’ve said before, the variables that I personally think are the most important are:

**Quality of performance**, as measured by the approval rating on WhatNotToSing, a site that polls a large number of blogs to determine how well-liked each performance was. A logistic regression with this variable alone shows excellent statistical significance.

And

**Gender**. Having a high score on WNTS of course helps, but during the first 4-6 rounds being a woman is a major handicap, as I’ve shown several times.

This is complicated by the fact that the only response variable the show consistently reveals is the lowest vote-getter. In an election you get an actual tally of votes, but on Idol you only know whether or not someone was eliminated. The only real regression model at our disposal, then, is logistic regression, which models the probability of a binary event (here, elimination) as a number between 0 and 1.
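As a sketch of what that regression looks like mechanically, here is a minimal logistic fit in Python with numpy. The approval ratings and elimination outcomes below are invented stand-ins, not real WNTS data, and the gradient-descent fit is just one simple way to estimate the coefficients:

```python
import numpy as np

# Hypothetical data: WhatNotToSing-style approval ratings (0-100) and
# whether the contestant was eliminated that week (1 = eliminated).
# These numbers are invented for illustration, not real show data.
approval = np.array([22, 35, 41, 48, 55, 60, 67, 72, 80, 88], dtype=float)
eliminated = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fit P(eliminated) = sigmoid(b0 + b1 * approval/100) by gradient descent
# on the logistic log-likelihood.
X = np.column_stack([np.ones_like(approval), approval / 100.0])
w = np.zeros(2)
for _ in range(20000):
    p = sigmoid(X @ w)
    w -= 0.5 * X.T @ (p - eliminated) / len(eliminated)

# Higher approval should mean a lower elimination probability (b1 < 0).
p_low = sigmoid(np.array([1.0, 0.30]) @ w)   # contestant at 30% approval
p_high = sigmoid(np.array([1.0, 0.80]) @ w)  # contestant at 80% approval
print(f"P(eliminated | 30%) = {p_low:.2f}, P(eliminated | 80%) = {p_high:.2f}")
```

The output is always a probability between 0 and 1, which is exactly what a yes/no response variable like “eliminated or not” calls for, and why an ordinary linear regression on vote share isn’t available here.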

Someone wrote to me over the break saying that he wanted to look at the effect of the judges’ comments on the outcome. Though I never heard from him again, I’ve always been curious myself about whether the judges’ comments sway the vote (my guess is that they do, a lot). Unfortunately, there are two problems with this. First, the data is neither readily available nor already quantified. You would have to go through each and every performance (if you can find them) and decide, subjectively, whether the judges gave a good, bad, or mixed review. Needless to say, this is imprecise and *very* labor-intensive. I’d like to do it, and may. The second problem is that the judges are not the same! Paula Abdul left and was replaced by Kara DioGuardi and Ellen DeGeneres. Then both of those women left, and Steven Tyler and Jennifer Lopez were added. Their personalities are very different from their predecessors’, they are somewhat sycophantic (in my estimation), and it’s just not an apples-to-apples comparison.

In any case, the point of this 1500-word essay is this: American Idol is a very complex system, and predicting it through numerical methods (rather than by feel) will always be quite imprecise, and often *very* imprecise. The number of data points is tiny and the number of variables huge; that the model held its own against (and even outperformed) the experts last year is impressive to me all by itself.