When it comes to actual predictions of American Idol outcomes week-to-week, I’ve always been interested but skeptical. Idol is a shifting and surprising thing. So I did something reasonable but overly cautious: I compared a couple of variables to the outcomes of comparable shows. For instance, the Top 8 could be directly compared to all other Top 8 rounds. Then, using a simple logistic regression, I let R dictate the coefficients and left it at that. It was a toy model, not claimed to be worth much.
The old model ignored all of the personal observations I’ve made over the years about how the voting plays out, and was therefore not representative of the show as I see it. Worse, the data for comparable rounds gets so sparse that it’s hard to draw much of a conclusion. As a result, it wasn’t very good. But this year, I expended a lot of effort and a lot of time thinking about the problem in more detail.
What I recognized is that a fairly good statistical model could be built from some reasonable assumptions, sound observations, and careful methodology. I will outline the result in this post. The model is not necessarily predictive of next year, but it is a quite good description of years past, is not overfit, and hence stands a good chance of being an accurate description of the next. I’ve characterized the confidence well and ensured that the predicted probabilities match the actual observations reasonably closely.
Hit the jump for the details of the new IdolAnalytics forecast model. This is going to get decently technical, but that shouldn’t put you off.
Throughout this discussion, I’m going to refer to the bottom group as the revealed bottom 3, bottom 2, or bottom 1, depending on how many people are left in the contest. Until the Top 6, this group is normally the bottom 3. Past that, it has fluctuated a bit, but normally there is a bottom 2 until the Top 4, and past that it’s just the bottom 1.
In order to make the most of the information available, we have to consider what we really know about the voting tendencies. I narrow my attention to one question: what is the probability that someone is in the bottom group rather than safe?
What do we know about the answer to this from experience? Good performances play into it. A good singer is less likely to be the lowest vote-getter than a poor singer is. Popularity must also factor in, based on a number of intangible things such as charisma, good looks, or whatever. And finally, the ability to get votes from the audience must be factored in. That is, someone may be popular and favored among a demographic that doesn’t vote very much, and that in itself will cut against him. Or, for instance, if someone is eliminated that is most similar to another contestant, the one that was not eliminated stands to gain.
None of these things seems to dictate the results by itself. As such, a model, if it is to get the basics, must include something that measures singing quality, something that measures popularity, and something that registers actual voting patterns. Fortunately, there are measurements available to us that cover each of these. Since the following variables have only been available since season 5, seasons prior to that are not considered. This model predicts the probability of being “not safe” (that is, in the bottom group), and only in finals rounds (the semi-finals require their own model).
The most robust data set available is the approval rating measured by WhatNotToSing.com (a very good site whose name is a faulty premise). These approval ratings are calculated, the site runners say, by a random sample of author reactions from blogs. The result is a number between 0 and 100, with a given standard deviation, and data goes back even to season 1 (though season 1’s data was collected after the season was over, whereas all subsequent measurements were made on the actual night of the show). This is broken down by song in the event that multiple songs were sung on the same night. Once per year the site renormalizes all their measurements, though I only record the score they gave on the night of the performance (where I had access to it, which is as far back as season 10). As I’ll explain, this adjustment shouldn’t matter, since each record is only compared to its cohorts’ records.
The maintainer of a site called VoteFair.org runs, among others, an American Idol poll asking a simple question: who is your favorite contestant (and also second favorite, third favorite, etc.)? He publishes the number of votes on a weekly basis. The poll is open before that week’s performance and runs until after the performance. As such, we can expect it to measure not necessarily how good the performance is (since much of the voting takes place before the song is even sung), but on how much someone likes the contestant. Data goes back to the beginning of season 5, although sample sizes were quite limited for the first year.
While Votefair is a valuable resource, by its nature it is an unscientific poll, which limits its usefulness. A scientific poll would ask a random sample of viewers about this choice, or alternatively could poll a fixed set of viewers throughout the competition. However, by having the vote be totally opt-in, it is not anything like a representative sample (it likely suffers from extreme response bias). That does not, however, make it useless. Its usefulness is to be determined by how well it has performed historically. (This isn’t a criticism of Votefair per se, as doing a scientific poll requires much more time, effort, and money than any organization would put into something as trivial as American Idol, but it is the reality of the situation.)
Dialidol.com is a site that measures two things:
1. the number of votes being cast by phone by its users, and
2. the proportion of calls to the contestant’s number that result in a busy signal.
Those numbers are put into an empirical formula weighting the proportion of busy signals in each time zone and calculating a score based on those proportions. As such, it is a direct measurement of voting, rather than popularity or performance quality. The data goes back to the Top 6 of season 4. Like Votefair, this is unscientific, but I would wager that the people using the dialing program (which is how the measurement is made) are relatively consistent between episodes, and hence it is still an OK, though skewed, sample.
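Dialidol’s actual formula is unpublished, so this is only a toy sketch of the kind of per-time-zone weighting described above; the function name, data shapes, and zone weights are entirely my own invention:

```python
def busy_signal_score(busy_fraction_by_zone, zone_weights):
    """Toy stand-in for Dialidol's unpublished empirical formula:
    a weighted average of per-time-zone busy-signal proportions."""
    total = sum(zone_weights[z] for z in busy_fraction_by_zone)
    return sum(zone_weights[z] * busy_fraction_by_zone[z]
               for z in busy_fraction_by_zone) / total
```

For example, busy fractions of 0.4 in the East and 0.2 in the West, weighted equally, would yield a score of 0.3. The real formula presumably also folds in raw vote counts from its users.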
Year to year comparison
The first matter to deal with is that to leverage all of our data set we need to make sure we are making an apples-to-apples comparison. We should ask ourselves, for instance, what relationship the relative magnitude of variables for the Top 3 of season 5 has to the Top 10 of season 11. There are a couple considerations. First, if the average of one set is higher than the other, then a given score doesn’t mean the same thing. Suppose that in the Top 10 the average score was 50 and in the Top 3 the average score is 70. Then, having a score of 50 in the former situation is average, but having a 50 in the latter means being well below average, and the numbers need to be indexed to this. That is, all of the data needs to be normalized.
There are several reasonable normalization schemes that could work here, and none is a priori better than the others. I found empirically that the most predictive method was a linear renormalization. Suppose we have data xi for a given week. Then we can apply a transformation yi = m xi + b, where the constants m and b are determined on a weekly basis so that the mean of yi is forced to be 0 and the maximum is 100. (Some simple algebra gives m = 100/(max xi - mean xi) and b = -m * mean xi.) One reason this works well is that while the maximum is fixed at 100, the minimum can fall below -100, emphasizing very below-average performances that tend to result in bottom-group status. Also, at a glance one can rate how good a performance was: positive numbers are above average and negative numbers are below average, with 100 always being the #1 slot for that episode.
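As a concrete sketch of that weekly transformation (the function name and the guard for a degenerate week of identical scores are my own choices, not from any published code):

```python
def normalize(scores):
    """Linear renormalization for one week's scores: the mean is
    forced to 0 and the maximum to 100, i.e. y = m*x + b with
    m = 100 / (max - mean) and b = -m * mean."""
    mean = sum(scores) / len(scores)
    peak = max(scores)
    if peak == mean:                 # degenerate week: everyone identical
        return [0.0] * len(scores)
    m = 100.0 / (peak - mean)
    return [m * (x - mean) for x in scores]
```

For example, raw scores of 50, 70, and 90 normalize to -100, 0, and 100, while a skewed week like 10, 80, and 90 normalizes to about -167, 67, and 100, illustrating how the minimum can fall well below -100.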
I also draw no distinction between people who are in the bottom group and those who are actually eliminated; hence the model is, strictly speaking, only trying to determine who is not safe. The implicit assumption here is that the bottom vote-getter is the one most likely to be in the bottom group, which I think is pretty reasonable as assumptions go. This is altogether more informative than just looking at the eliminations, since otherwise you’re ignoring a large part of the results as they are revealed to us by the show. Another way of handling this would be to create a response variable that isn’t just binary (yes or no) but categorical with three possible outcomes (safe, bottom group but not eliminated, and eliminated). However, this is unnecessarily complicated given the reasonableness of the above assumption, and infeasible given that as the rounds proceed there are fewer and fewer people in the “bottom group but not eliminated” category, making that group hard to detect.
Weighing the variables
A numerical model should not include variables that are highly correlated: including two variables that say the same thing inflates the weight given to that one underlying factor. Thus, we need to check whether our variables (WNTS approval rating, Votefair vote proportion, and Dialidol score) are highly correlated (and, as such, should be averaged together into one variable) or not very correlated (in which case they each represent a separate dimension and should be incorporated independently).
The first thing to look at is Dialidol versus the WNTS rating, where both scores are normalized. The horizontal axis has adjusted WNTS scores and the vertical axis Dialidol adjusted scores for contestants in all final rounds since season 5 (0 is the mean, so all negative numbers are below average, and all positive are above average).
A linear fit has a coefficient of determination R2 of only about 0.1, so these variables are fairly weakly correlated. This means that Dialidol provides mainly complementary information to WNTS (though the association is positive, as one might expect). There are many instances of people having negative scores along one dimension and positive scores along the other (in fact, there are several instances of having the highest score on one scale and being below average on the other).
The same procedure, carried out for Votefair versus both other variables, shows a similar result. Thus, it’s justified to include all three in a model.
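For the record, the R2 quoted above is just the squared correlation of a simple one-variable linear fit; a minimal version of that computation (my own sketch, not the code behind the plots) is:

```python
def r_squared(xs, ys):
    """Coefficient of determination for a simple linear fit y = a + b*x,
    which in the one-variable case equals the squared Pearson correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)
```

Perfectly linear data gives 1.0; noisy, weakly related data like the Dialidol/WNTS scatter lands near 0.1.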
I fit all three normalized quantities for all final rounds using a logistic regression. The natural log of the odds is fit to a linear relationship, log(p/(1-p)) = β0 + β1 x1 + β2 x2 + β3 x3, to obtain a first estimate of the time-averaged effect of these variables (the process is done iteratively). That is, the initial fit pays no attention to when the result happened, and determines the coefficients βi as if there were no time dependence.
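In place of R’s fitting routines, a bare-bones version of that logistic fit can be sketched with gradient descent on the log-loss (the learning rate, step count, and names here are arbitrary choices of mine, not the actual fitting procedure used):

```python
import math

def fit_logit(X, y, lr=0.1, steps=5000):
    """Fit log(p/(1-p)) = b[0] + b[1]*x1 + ... + b[k]*xk by
    batch gradient descent on the logistic log-loss.
    X is a list of rows of normalized scores; y is 1 for not-safe."""
    n, k = len(X), len(X[0])
    beta = [0.0] * (k + 1)                    # intercept + one slope each
    for _ in range(steps):
        grad = [0.0] * (k + 1)
        for row, target in zip(X, y):
            z = beta[0] + sum(b * x for b, x in zip(beta[1:], row))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - target                  # d(log-loss)/dz
            grad[0] += err
            for j, x in enumerate(row):
                grad[j + 1] += err * x
        beta = [b - lr * g / n for b, g in zip(beta, grad)]
    return beta
```

On data where higher normalized scores go with safety, the fitted slope comes out negative, as one would expect for a not-safe response.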
The variables given above have different meaning depending on the time in the season. This is logical, since people choose a favorite and vote for her consistently past some point. As such, we ought to expect some of our variables to have less weight as the contest goes on.
My observation is that WNTS and Votefair become less predictive, and Dialidol more predictive, as the year goes on. One could credibly rationalize that this is because low-volume voters (people who vote at most a few times) who are badly sampled by Dialidol become less important, and those voting upwards of 20 times become more important. This observation is borne out in the data as well.
As such, one can substantially improve on this model by taking the time dependence into account. The relative weight in the calculation for each variable is shown below. The most quickly decaying indicator is WNTS performance approval rating. Votefair hangs on a bit longer. And Dialidol steadily grows in time. The relative weights given to each indicator are plotted below, where the horizontal axis is time in episodes (2 being the first finals episode, since I count 1 as the semi-finals. This makes no difference).
As such, we adjust the above to include time, log(p/(1-p)) = β0 + Σi fi(t) βi xi, where the functions fi(t) are those shown in the graph. These functions are linear but saturate at 0.2, a value that is arbitrary but empirically verified to be reasonable. That means the model never totally discounts WNTS or Votefair, but weights them at only 0.2 of their full value in the final weeks.
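A sketch of one such weight function, with the 0.2 floor from above but a slope and starting episode that are purely illustrative (the actual fitted shapes are those in the graph):

```python
def time_weight(t, start=2, slope=0.1, floor=0.2):
    """Linearly decaying weight (as for WNTS and Votefair) that
    saturates at `floor`. t is the episode number, with t = 2
    being the first finals episode. Dialidol's weight grows
    as these shrink."""
    return max(floor, 1.0 - slope * (t - start))
```

At the first finals episode the weight is 1.0; late in the season it has bottomed out at the 0.2 floor.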
Applying the time dependence and weighting factors, we can calculate the probabilities from the logit given above: p = 1/(1 + e^(-logit)).
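Putting the pieces together, the not-safe probability for one contestant in a given week is just the logistic function of the time-weighted sum. All coefficient and weight values below are placeholders, not the fitted ones:

```python
import math

def not_safe_prob(x, beta, weights):
    """P(not-safe) = 1 / (1 + exp(-z)), where
    z = beta[0] + sum_i weights[i] * beta[i+1] * x[i],
    x holding the normalized WNTS, Votefair, and Dialidol scores."""
    z = beta[0] + sum(w * b * xi for w, b, xi in zip(weights, beta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))
```

A contestant sitting exactly at the weekly average on all three indicators (x = 0 everywhere) lands at p = 1/(1 + e^(-β0)); with a zero intercept that is 0.5, and above-average scores push it lower when the slopes are negative.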
The final step in the model fit is to optimize the parameters, which is done by checking the accuracy of calls. A call is considered accurate if, for instance, when there is a bottom 3, a person with one of the three highest not-safe probabilities is actually revealed to be in the bottom 3; otherwise the call is declared a miss. The parameters βi are found iteratively by minimizing both the number of misses and the degree of each miss (how far off the result was from the prediction).
In reality, this designation is not strictly speaking correct: an outcome with probability strictly between 0 and 1 is not a “miss” simply because it didn’t happen. However, the proper values for the coefficients should reasonably be those that minimize the number of such ranking errors. The noise inherent in the system (due to variables that cannot be taken into account) prevents ranking errors from being eliminated entirely, but we can reasonably demand that they be minimized over time. For simplicity, I will call any disparity between rank and result “wrong”, though it’s imprecise.
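The hit-or-miss bookkeeping above can be sketched as follows; the data shapes and names are my own, not from the actual fitting code:

```python
def count_misses(episodes):
    """Each episode is a pair (probs, bottom): a dict mapping
    contestant name -> not-safe probability, and the set of names
    revealed in the bottom group that night. A revealed bottom-group
    member outside the top-k ranked probabilities (k = bottom-group
    size) counts as one miss."""
    misses = 0
    for probs, bottom in episodes:
        k = len(bottom)
        predicted = sorted(probs, key=probs.get, reverse=True)[:k]
        misses += sum(1 for name in bottom if name not in predicted)
    return misses
```

The parameter search then amounts to picking the βi that drive this count (and the sizes of the individual misses) down across all past episodes.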
If the model predicts that a given contestant has a 30% chance of being not-safe, then 30% of those that had the same assignment should be not-safe over the long run. If we go through every assigned probability and calculate the rate that someone was in the bottom group, we see the following dependence (data are binned with bins of 2 percentage points):
The data reads out the assigned not-safe probability on the horizontal axis and the actual rate of being in the bottom group on the vertical axis. The red line is the ideal curve, and the black line is a linear fit to the data itself (with zero intercept). The black line is reasonably close to the red line, indicating that the model is fairly accurate quantitatively (and not just accurate at assigning people safe or not-safe by rank). The slope is 1.27, compared with the ideal value of 1.
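The binning behind that plot is straightforward; a sketch using the same 2-point bins (the function name is mine):

```python
def calibration_curve(probs, in_bottom, width=0.02):
    """Group predicted not-safe probabilities into bins of `width`
    and return, for each bin's lower edge, the observed rate of
    actually landing in the bottom group."""
    bins = {}
    for p, hit in zip(probs, in_bottom):
        key = int(p / width)
        total, count = bins.get(key, (0, 0))
        bins[key] = (total + hit, count + 1)
    return {key * width: total / count
            for key, (total, count) in sorted(bins.items())}
```

A well-calibrated model produces a curve hugging the diagonal, i.e. the observed rate in each bin is close to the bin’s probability.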
This indicates that the model is in some respect “correct”. The points at the extremes of the scale (< 0.1 and > 0.7) have very few instances associated with them, simply because the model has rarely been that sure. No one predicted safe with at least 90% confidence (not-safe probability < 0.1) has ever been in the bottom group, and no one assigned a not-safe probability of 0.7 or higher has ever been safe. This is expected to change as more data is included in the model. For reference, here is the data for the extremes of the scale:
The lowest probabilities ever projected (P < 0.1)
|Name||Round||WNTS Adj.||Dialidol Adj.||Votefair Adj.||Bottom|
These tend to come from the early rounds (when even the chance of randomly being in the bottom group is low). Curiously, the list includes no winners except Taylor Hicks and Scotty McCreery, which just goes to show that most years the winner emerges sometime into the contest.
The highest probabilities ever projected (P > 0.7):
|Name||Round||WNTS Adj.||Dialidol Adj.||Votefair Adj.||Bottom|
3 of these were for the finale (Katharine McPhee, Blake Lewis, and Lauren Alaina). The model considered Matt Giraud in the Top 5 of season 8 to be basically assured to be in the bottom group (he was, in fact, eliminated that night).
Confidence and error estimation
The model has made predictions on 78 episodes comprising 185 bottom-group results. It made 50 wrong calls, people ranked among the most likely to be not-safe who were actually safe. That works out to roughly 0.64 wrong calls per episode (50/78). Of the 185 bottom-group results, 135 calls were correct, an accuracy of about 73%. However, this assumes no margin of error, taking every assigned probability seriously without any hedging. Thus, to be confident, we need to analyze the times when the model was wrong.
Since there were 50 wrong bottom-group calls, that gives a total of 100 wrong calls, since for every person you wrongly include in your projected bottom group, you also wrongly exclude someone. Taking as the true bottom-group threshold the midpoint between threshold probabilities, we can plot the frequency of a miss as a function of the magnitude of that miss:
The magnitude of the misses are relatively tightly clustered around 0. Positive values mean that a contestant was predicted to be in the bottom group but wasn’t, and negative values mean that the contestant wasn’t predicted to be in the bottom group but actually was. In terms of percentages:
50% of the errors were within 3 points
80% of the errors were within 8 points
90% of the errors were within 11 points
Thus, so long as you give a margin of error of +/- 3 points, you immediately increase the model accuracy to about 87%. If you’re willing to have a margin of error of +/- 8 points, you have about 95% accuracy. To me, this is a little bit too careful, and I’m comfortable with a +/- 3 point error hedge.
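That arithmetic is easy to check; the function below is just bookkeeping (names are mine), and the numbers in the usage note are the ones quoted above:

```python
def hedged_accuracy(n_correct, n_results, miss_magnitudes, margin):
    """Accuracy after forgiving misses whose magnitude (distance from
    the bottom-group threshold, in probability points) falls within
    +/- margin."""
    forgiven = sum(1 for m in miss_magnitudes if abs(m) <= margin)
    return (n_correct + forgiven) / n_results
```

With 135 correct out of 185 and half of the 50 misses within 3 points, accuracy rises to 160/185, about 86.5%; forgiving the 80% of misses within 8 points gives 175/185, about 94.6%.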
Summary and thoughts
Ultimately, I want the model to produce outcomes that match conventional wisdom. My intuition about the season 11 finale was that Phillip Phillips led Jessica Sanchez 60 to 40 (which I said at the time). The model says 59/41, and I paid no attention to that result when devising the model. That’s kind of amazing.
And when the model has been shocked, I was shocked too. Here are the 10 biggest over-ratings ever (where someone was not predicted to be in the bottom group but was):
|Season||Name||Round||Bottom||P(not-safe)||Miss|
|10||Thia Megia||Top 11 (ii)||Yes||0.116||-0.249|
|6||Melinda Doolittle||Top 3||Yes||0.168||-0.248|
|10||Pia Toscano||Top 9||Yes||0.231||-0.193|
|11||Jessica Sanchez||Top 7 (i)||Yes||0.281||-0.159|
|9||Siobhan Magnus||Top 6||Yes||0.359||-0.147|
|8||Adam Lambert||Top 5||Yes||0.334||-0.145|
|11||Hollie Cavanagh||Top 9||Yes||0.294||-0.132|
|8||Allison Iraheta||Top 11||Yes||0.259||-0.127|
|6||Blake Lewis||Top 4||Yes||0.344||-0.122|
|6||Blake Lewis||Top 7||Yes||0.348||-0.100|
These were pretty surprising outcomes. Adam Lambert in the bottom 3? Melinda Doolittle out instead of Blake Lewis? Jessica Sanchez and Pia Toscano the lowest vote-getters in early rounds? The model presents them as surprises by missing them hugely; that’s a feature, not a bug.
I want to repeat this for clarity: the model represents conventional wisdom. When violations of that wisdom happen, and they will happen frequently, the model will miss. In 2010, the Seattle Seahawks went into their NFC Wild Card playoff game against the New Orleans Saints with a record of 7 wins and 9 losses. The Saints were 11-point favorites, but the Seahawks won. Were the bettors wrong? No, they were making a rational and well-founded assessment about a terrible team. That makes the Seahawks’ win that much more significant, I think.
We like watching sporting events to see people defy the odds. If everything worked out the way odds-makers said, why would anybody watch the games? But the odds inform our view of the contest, give us a baseline to judge what’s happening. And this model does that.
Obviously, the first thing to be considered for the future is whether this model is even worth a damn this year. Only time will tell. The model has not actually made any predictions yet.
There is one potential issue with the normalization process I use. When the contest comes to the finale, the explanatory variables become effectively categorical instead of continuous. This is because the mean is constrained to be 0, so one contestant gets 100 and the other must get -100 even when the raw scores are very close. I can’t decide whether this is an actual drawback or not. If you drop this scheme and use something like a multiplicative normalization instead, the model starts getting very wishy-washy about its calls, and I don’t like it. On a fundamental level the voting does end up being categorical: even one single vote can flip the result. So I think this method is good enough, but it might have room for improvement.
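The two-contestant degeneracy is easy to see numerically. This reuses the weekly renormalization described earlier (the implementation is my own sketch): with the mean pinned at 0 and the maximum at 100, any gap at all becomes plus or minus 100.

```python
def normalize(scores):
    """Weekly linear renormalization: mean -> 0, max -> 100."""
    mean = sum(scores) / len(scores)
    m = 100.0 / (max(scores) - mean)
    return [m * (x - mean) for x in scores]

# A finale decided by a hair still reads as a blowout:
finale = normalize([50.1, 49.9])   # roughly [100.0, -100.0]
```

A multiplicative scheme would keep the two values close together, at the cost of the wishy-washy calls mentioned above.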
It’s interesting to see what doesn’t end up improving this statistical snapshot. One might think that past performance ratings would serve as another indicator of someone’s disposition toward the bottom group, but they don’t. Any effect of momentum is already priced into the variables that are measured, at least insofar as I can tell. Attempts to explicitly incorporate the overall bias in an indicator (such as whether someone was a Vote For The Worst pick, and therefore had lower than usual WNTS scores for someone who was safe) yielded no improvement: just as often as it would correct a wrong call, it would cause a different error somewhere else. Similarly, there is no particular bias toward women or men that isn’t already included in the variables; any disparity due to gender, age, or race appears to be priced in already, and doesn’t benefit from a further correction. This was surprising to me, but after a significant amount of fruitless work I find it undeniable.
A more sophisticated model would account for prior probability in a Bayesian sense, and retain some skepticism for large swings in not-safe probability. This would likely have to be done using Markov chain Monte Carlo methods, something I am not at present familiar enough with to implement, though something I am currently investigating. My sense of it, though, is that there isn’t enough data for something like that to chew on.
Of course, this is no substitute for more reliable statistical variables. You can only wring so much accuracy out of unscientific poll results. I would certainly love to see a polling site that randomly selected subjects and got them to report weekly what they were thinking. This is similar to the RAND tracking poll used for the most recent presidential election, which was conducted online, and which was quite accurate. I don’t foresee this happening, but it’s not impossible.