Singing order

In the methodology post, I mentioned that contestant order was an important variable. In fact, if you correct for the other variables, you are significantly better off performing near the end of the show than at the beginning:

[Figure: song order vs. probability of being safe.] Points at the bottom are people who were not “safe” (eliminated, bottom 3, or saved) in Finals rounds without multiple performances. Points at the top are people who were safe. The two curves show the probability of an average contestant being safe, one curve for men and one for women. All other things being equal, a man goes from a 65% chance of being safe when singing first to a 95% chance at the end of the show. A woman goes from a 50% chance to about 85% in the pimp spot.

I’ve idly speculated in the past about why this is. It could be that the producers put the better contestants at the end of the show for ratings purposes, or it could be that most people vote at the end and only remember the most recent singers. Most likely it’s a combination of both. The forecast model from previous years didn’t include singing order explicitly as a variable, because doing so was no better than letting Dialidol account for it. But now that Dialidol is kaput, it’s worth considering explicitly.

Sarina-Joi’s elimination was not shocking (Finals model methodology)

I’m seeing a lot of surprise around the Idol blogosphere about Sarina-Joi’s elimination, and while I kind of understand it, I disagree. If you read the Top 12 forecast, you saw that I had Sarina-Joi second most likely to be eliminated, within a hair’s breadth (1 percentage point) of Daniel Seavey. After the jump, I’ll explain how I got that.


Top 16 model methodology

Last year saw the end of an era: the practical demise of Dialidol. Dialidol was a look inside actual voting patterns, because it measured how many busy signals each contestant’s phone line produced. Combine that with the number of calls actually being made by the power-dialing software, and you had a direct measurement of votes. It was still, by itself, not a perfect predictor, but it was damn good a lot of the time. However, it’s not hard to understand why Dialidol stopped working: when was the last time you made a land-line call to vote for the show? Probably not for a while.

So now we have to build a model from the quantities we do have. We still don’t know the number of votes cast for each contestant, or even the contestants’ relative ranking in the vote. The only response variable we have is whether or not the contestant was safe (the alternative being eliminated). In the finals, we also have a third category: in the bottom group, but not eliminated. However, we don’t have that in the semifinals.

Prior to this season, there were 600 performances in semifinal rounds. My database tracks a number of quantitative variables for these contestants, including WhatNotToSing rating, Votefair percentage, the order in which the contestant sang, details about the song he or she sang, and a few others. When building a model of who was safe in the semifinals without Dialidol, the variables that turn out significant are popularity (Votefair), approval (WhatNotToSing), and singing order. The probability can be estimated by

P = 1/(1 + e^(−x))

where x is the log of the odds ratio. In the linear logistic model, x depends linearly on the variables:

x = β0 + β1x1 + β2x2 + β3x3

where the xi are the variables I listed above. Of course, you don’t have to assume linearity. We could just as easily let x go as the square of performance order instead of the first power, but linearity is usually a good place to start.
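To make the link function concrete, here is a minimal sketch in Python (rather than the R used for the actual fit); the helper names are mine, not part of the model:

```python
import math

def logistic(x):
    """Map a log-odds value x to a probability: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def logit(betas, xs):
    """Linear logit: beta0 + beta1*x1 + beta2*x2 + ... for predictors xs."""
    return betas[0] + sum(b * x for b, x in zip(betas[1:], xs))

# A logit of 0 is even odds (probability 0.5); larger logits push the
# probability toward 1, smaller (more negative) ones toward 0.
p = logistic(logit([0.0, 1.0], [0.0]))  # 0.5
```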

To estimate the parameters, a software package like R looks at how many people were safe versus eliminated for a given set of these scores, and treats that as the odds at those values. It then determines the values of βi that most closely reproduce those odds. It does this for all the usable data points, of which there are 359, since the fit requires complete records (and Votefair did not exist for the first few seasons of Idol). Starting with the data (click here to download as a csv), the parameter estimates are

glm(formula = Safe ~ WNTS.Rating + VFPercent + Order, family = "binomial", 
    data = SemiFinals)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.3947  -0.9279   0.2182   0.8571   1.6938  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.984820   0.381815  -5.198 2.01e-07 ***
WNTS.Rating  0.024835   0.007103   3.496 0.000472 ***
VFPercent    0.181024   0.037073   4.883 1.05e-06 ***
Order        0.093890   0.045729   2.053 0.040055 *

The Estimate column contains the values of β, the Std. Error column gives the uncertainty in each estimate, and the column on the right is a test of how significant the fit is. It estimates (via an appropriate statistical test) the probability that the response variable (in this case, whether someone was safe or not) varies with that variable purely by chance. The values in this column are low, indicating these variables are likely significant.

Note that each of these estimates is positive, meaning that your chance of being safe increases with WNTS rating, with Votefair percentage, and with performance order. So order of performance is indeed important.
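Plugging the fitted coefficients into the link function shows the order effect directly. A Python sketch (the contestant here, with a WNTS rating of 60 and a Votefair percentage of 10, is hypothetical):

```python
import math

# Coefficients from the R glm fit above
B0, B_WNTS, B_VF, B_ORDER = -1.984820, 0.024835, 0.181024, 0.093890

def p_safe(wnts, vf, order):
    """Predicted probability of being safe, given WNTS rating,
    Votefair percent, and singing order."""
    x = B0 + B_WNTS * wnts + B_VF * vf + B_ORDER * order
    return 1.0 / (1.0 + math.exp(-x))

# Same hypothetical contestant singing 1st versus 12th:
first = p_safe(60, 10, 1)   # roughly 0.80
last = p_safe(60, 10, 12)   # roughly 0.92
```

With everything else held fixed, moving from the first slot to the pimp spot buys this contestant roughly 12 percentage points.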

The figure above shows the data points for all the records. A point near the top represents someone who was safe, a point at the bottom someone who was eliminated, and the curve shows the predicted probability (where the top is 1 and the bottom is 0) of advancing. By performing at the end of the show, you have a higher probability of advancing for the same WNTS score. The same is also true of Votefair popularity.


Pimp spot indeed.

Now that we have a serviceable model, we need to quantify our uncertainty about its predictions. To do this, I apply the following criterion, assuming 8 people advance. I want to know who had the 8 highest probabilities. Anybody who advanced but was not among those with the top 8 probabilities I record as a false negative: the model effectively ranked that person eliminated, but in reality he or she advanced. Similarly, anyone ranked in the top 8 probabilities who was eliminated I record as a false positive. To quantify how bad a misranking was, we take the value halfway between the 8th- and 9th-place probabilities and call that the “cutoff”, then look at how far the model’s assigned probability was from that cutoff value.

The figure above shows the histogram of misses. Most are right around 0, meaning the mistakes were small; there are a few with larger differences at the fringes. The shaded portion shows a fit to the histogram using a normal distribution, which has a standard deviation of about 11.3 percentage points.

Conclusion: if we want to be reasonably sure (so that we account for 95% of our previous errors) about our calls, we should only call someone safe if she is more than about 22 percentage points above the cutoff, and only call someone eliminated if she is more than 22 points below the cutoff. There will still be some people slipping through, but probably not too many.

Note that I apply this criterion to the unnormalized probabilities. Since the sum of all probabilities in a round where 8 people advance should be 8, the probabilities all get scaled up, but I carry out this procedure before applying that scaling.
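The cutoff bookkeeping can be sketched as follows (Python; the twelve probabilities and outcomes are invented for illustration, not real model output):

```python
# Unnormalized model probabilities and actual outcomes (True = advanced)
probs    = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.55, 0.50, 0.45, 0.30, 0.20, 0.10]
advanced = [True, True, True, True, True, False, True, True, True, False, False, False]

SPOTS = 8  # number of contestants who advance

# Cutoff: halfway between the 8th- and 9th-highest probabilities
ranked = sorted(probs, reverse=True)
cutoff = (ranked[SPOTS - 1] + ranked[SPOTS]) / 2

# Misses: anyone above the cutoff who was eliminated (false positive),
# or below it who advanced (false negative); record each miss's distance
# from the cutoff
misses = [p - cutoff for p, adv in zip(probs, advanced)
          if (p > cutoff) != adv]
```

Collecting these distances over all the historical rounds is what produces the histogram of misses described above.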

And that’s it! There’s no magic here, and it’s not even particularly sophisticated. Fit the old data as best we can, hope that future events are similar, quantify how unsure we are, generate the probabilities. Easy peasy.

Why and how modeling American Idol kinda sucks

Suppose you wanted to make a mathematical model for predicting a given event. The usual way to go about this is to collect a bunch of information about how the system behaves and identify factors that are meaningfully correlated with the outcomes. Somebody wanting to predict the outcome of a football game, for instance, would look at things like the team’s win record, the quarterback’s passing record that year, and how many points the team scored on average under certain conditions.

To be a bit more specific, suppose that we’re at halftime in a football game, and one team is up 7 points on the other. Then you can assign a probability that the team that’s ahead will win by looking at how often teams win games when they are ahead 7 points at halftime. If you wanted a better measure, you could look at how often teams like this one won when they were ahead 7 points. In fact, given enough data, you could predict the win percentage for being ahead by any number of points, as a function f(x).

Now suppose that all of a sudden the NFL decides to drastically change its rules. Touchdowns are now worth a different number of points, the second half has less time than the first, the teams no longer have separate offensive and defensive lines, and the same 11 players are on the field for the whole game. All your data are now much more difficult to interpret. Being ahead by 7 points at the half doesn’t mean the same thing it did before, though it’s still a positive indicator. Your ability to quantify and predict is significantly hindered.

To take a somewhat more relevant example, what if we are predicting the outcome of elections? We can look at a lot of different things about the past to try to predict the future: how someone’s opinion polls looked just before the election, how much money the candidate raised, how many newspaper endorsements the candidate got. Then you look at how many votes the person got, and you fit a function to the data. That function, to a certain degree of accuracy, should predict future elections. But what if we never knew how many votes the candidate got? And what if the number of votes allowed for each citizen changed from year to year? And what if several of the prominent pollsters went out of business, so that you didn’t have good data for the present years, whereas you did for the past?

Such is the state of American Idol. There have been good indicators in the past. There are websites that track how well-liked a performance is (WhatNotToSing is one such aggregator, although one could easily compile the same stats by visiting a few dozen blogs and tabulating the results). There are websites that track how popular a given contestant is (Votefair was one prominent example). There was even a website that tracked how many votes it seemed like someone was getting (Dialidol).


As of this writing, Votefair has registered 56 votes for last night’s competitors. In season 11, there were more than 500 votes for the same round. As you might imagine, when the sample size is decreased that markedly, the accuracy of the poll goes way down (it’s about 3 times less accurate, which is a lot). WhatNotToSing has yet to publish results, but the runners of that site have said in the past that it is getting harder to find 100 blogs to poll for approval ratings. And Dialidol hasn’t registered a meaningful score in over a year.
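That “about 3 times less accurate” figure follows from the fact that a poll’s margin of error scales as 1/√n, where n is the sample size. A quick check:

```python
import math

# A poll's margin of error scales as 1 / sqrt(sample size), so shrinking
# the sample from ~500 votes (season 11) to 56 widens the error bars:
old_n, new_n = 500, 56
ratio = math.sqrt(old_n / new_n)  # roughly 3x less accurate
```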

The reasons for the death of these indicators are quite obvious. American Idol’s viewership has been ravaged by falling network ratings, competition from other shows, lack of novelty, and the loss of Simon Cowell. Fewer voters mean fewer rabid fans showing up on sites like Votefair. The institution of online voting has made Dialidol, really just a measurer of busy signals on phone lines, irrelevant. Who still dials a land-line to vote for a contestant? Well, probably nobody.

The rule changes are manifold. First there was the institution of text voting, way back in the early days. Then came Facebook voting, and voting through Google. The supervote was instituted, worth 50 votes at one time (now only 20). Voting through any given method is now limited to 20 votes, whereas in the past it had been unlimited (you could call as many times as you wanted within the 2-hour window). The voting period has been extended from a few hours after the show ended, to the following evening, and now until the following morning. Previously the bottom 3 was always revealed through the Top 5 or 6; now it is only sometimes shown, even in the early rounds.

So it’s basically a mess.

Add to all of these factors that the demographics of the voters may have changed substantially, and it’s hard to say much about what the future holds. In the end, I’m just going to model this year as a simple regression. Since we are predicting a categorical outcome (safe, bottom 3, eliminated), we need a link function suited to that, and the most convenient is the logistic link, 1/(1 + e^(−x)). The logit, x, is fit using a linear relationship to the variables. Since Dialidol is no longer an input variable, we just use approval rating and poll popularity:

x = β0 + β1 × (Approval rating) + β2 × (relative popularity)

where the betas are determined from historical data. What shall we choose as our dependent variable? We could choose “being safe”. Then a software package will try to determine what fraction of people were safe for a given set of values of approval rating and relative popularity. In this case, we can fit the first voting round in the software package R. Including the order of singing, the WhatNotToSing approval rating, and the Votefair popularity percentage as input variables, R estimates the betas:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.098799   0.079361   1.245 0.214635    
Order       0.023111   0.009908   2.333 0.020684 *  
WNTS.Rating 0.005474   0.001630   3.358 0.000943 ***
VFPercent   0.009912   0.004019   2.466 0.014499 *

What R is telling us is that each of these variables is predictive of whether someone was safe or not. The later in the episode you sang (Order), the higher your chances of being safe are. The higher your WNTS rating was, the more often you are safe. And the more popular you are on Votefair (VFPercent) the more likely you are to be safe. The estimate column tells us the values for the betas. The rightmost column tells us how significant the variable is, or the probability that the variable is actually not meaningful at all. For what we’re doing, we can accept anything below 0.05 as being OK.

Now, one thing you can object to is that it isn’t good practice to compare a weak season to a strong one. If somebody got an 83 WNTS Rating in season 5, when there were a bunch of good singers, that might not be as significant as getting an 83 during a weak season, when the people had much lower scores on average. Fair enough, but now you’re faced with how to normalize these from year to year, and in fact from season to season there hasn’t been a markedly varying mean WNTS:


This would suggest that normalizing the data will probably not lead to a much better prediction.

I’m haunted by the fact that the model treats these events as independent, when in fact they are not. However, it’s not easy for me to see a way out of the problem. Of course, the probability of someone being safe must affect the probability of a cohort (someone in the same year in the same group). If everyone loves Nick Fradiani, that surely reduces the chances of someone else like Quentin Alexander. But how? I don’t think there’s a good answer other than to normalize the probabilities within a given group, which I always do. (That is, the sum of probabilities must be the number of people who will be safe, in this case 8). If we are calculating the number who are not safe (four, in this semifinal round), then the not-safe probabilities should add up to 4. We multiply each probability by a factor to get the sum to equal four. But arguably that still doesn’t really solve the problem.
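The normalization I describe can be sketched like this (Python; the raw probabilities are invented for illustration):

```python
# Raw model probabilities for 12 semifinalists (invented numbers)
raw = [0.9, 0.85, 0.8, 0.75, 0.7, 0.6, 0.5, 0.4, 0.35, 0.3, 0.2, 0.15]

SAFE_SPOTS = 8

# Scale so the "safe" probabilities sum to the number of safe spots
scale = SAFE_SPOTS / sum(raw)
normalized = [p * scale for p in raw]

# Note the flaw: simple scaling can push a front-runner's "probability"
# above 1, which is part of why this doesn't really solve the
# dependence problem.
```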

We also have a problem that Votefair alone doesn’t seem to have a large enough sample set. To correct for this, I will probably be taking several internet polls and grouping them together. So, if there is a poll at MJsBigBlog, and one at TVLine, and one at Votefair, I’ll take all of these and combine them into one large poll, and use the results of that as a stand-in for Votefair. This has many potential problems, such as double-counting many voters, but nothing else seems to be viable.
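Combining the polls can be sketched as pooling raw counts into one large poll (Python; the vote counts here are hypothetical):

```python
# Hypothetical results for one contestant from three polls:
# (poll, votes for this contestant, total votes cast in the poll)
polls = [
    ("MJsBigBlog", 120, 800),
    ("TVLine", 45, 300),
    ("Votefair", 9, 56),
]

# Pool the raw counts into one large sample; this risks double-counting
# voters who vote in more than one poll, but it beats a 56-vote sample
votes = sum(v for _, v, _ in polls)
total = sum(t for _, _, t in polls)
combined_pct = 100.0 * votes / total
```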

The amount, and state, of data regarding American Idol is frankly pitiful. While it has always had its problems, it is almost as bad today as it has ever been. This is one reason why I’ve started to track Twitter sentiment, as Idol is still dominant in so-called “second screen” engagement. However, the reliability of this data is unknown without historical data to compare against. It is possible that it can be incorporated as the year progresses and we get a feeling for how well it’s doing. Until then, I’ll just present it to you, the reader, and let you make up your own mind.

How this year’s Idol is like the financial crisis

The inclusion of Jena Irene in the bottom 3 tonight was a big shock. Though Jena scored a relatively lousy 43 on WhatNotToSing, she was number 1 on both Votefair and Dialidol.

It is not that people who were highest on Votefair were never eliminated, but so early in the contest it is extremely rare. Siobhan Magnus, Adam Lambert, Thia Megia, and Jessica Sanchez are the only contestants ever to be in the bottom group while most popular on that site during the initial weeks (Siobhan was eliminated and Jessica had to be saved). Normally Votefair is pretty damn predictive so early in the contest. But none of those people was also highest on Dialidol. The closest historical example we have to tonight’s result is Adam Lambert, but that was in the Top 5, near when Votefair becomes irrelevant. All of which is to say that tonight may be the most surprising result ever with regard to Votefair.