Last year saw the end of an era: the practical demise of Dialidol. Dialidol offered a look inside actual voting patterns, because it measured how many busy signals each contestant’s phone line returned. Combine that with the number of phone calls actually being made by the power-dialing software, and you had a direct measurement of votes. It was still, by itself, not a perfect predictor, but it was damn good a lot of the time. However, it’s not hard to understand why Dialidol stopped working: when was the last time you made a land-line call to vote for the show? Probably not for a while.
So now we have to build a model from the quantities we still have. We still don’t know the number of votes cast for each contestant, or even the relative ranking of the contestants in the vote. The only response variable we have is whether or not the contestant was safe (the alternative being eliminated). In the finals, we also have a third category: bottom group, but not eliminated. However, we don’t have that in the semifinals.
Prior to this season, there were 600 performances made in semifinal rounds. My database tracks a number of quantitative variables for these contestants, including WhatNotToSing rating, Votefair percentage, the order in which the contestant sang, details about the song he or she sang, and a few others. When I build a model of who was Safe in the semifinals without including Dialidol, the variables that turn out to be significant are popularity (Votefair), WhatNotToSing, and singing order. The probability can be estimated by
$$P = \frac{1}{1 + e^{-x}}$$
where $x$ is the log of the odds of being safe. In the linear logistic model, $x$ depends linearly on the predictor variables
$$x = \beta_0 + x_1 \beta_1 + x_2 \beta_2 + x_3 \beta_3$$
where I have named the variables I listed above $x_i$. Of course, you don’t have to assume linearity. We could just as easily assume that $x$ goes as Order of performance squared instead of to the first power, but linearity is usually a good place to start.
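To make the two formulas concrete, here is a minimal Python sketch (Python rather than R, purely for illustration). The coefficient values are hypothetical placeholders, not fitted values, and the function names are my own. It also shows that swapping Order for Order squared changes only the feature that gets fed in, not the form of the model.

```python
import math

def logistic(x):
    """Convert log-odds x into a probability P = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def log_odds(intercept, coefficients, features):
    """Linear predictor: x = b0 + sum of b_i * x_i."""
    return intercept + sum(b * v for b, v in zip(coefficients, features))

# Hypothetical coefficients, chosen only to illustrate the mechanics.
b0, betas = -1.0, [0.02, 0.1, 0.05]

order = 6
linear_terms = [50.0, 4.0, order]        # WNTS rating, Votefair %, Order
squared_terms = [50.0, 4.0, order ** 2]  # same form, but Order enters squared

p_linear = logistic(log_odds(b0, betas, linear_terms))
p_squared = logistic(log_odds(b0, betas, squared_terms))
print(p_linear, p_squared)
```

Log-odds of zero map to a probability of exactly one half, and larger log-odds always mean a larger probability.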
To estimate the parameters, a software package like R looks at who was safe and who was eliminated at each set of these scores, and finds the values of $\beta_i$ that make the observed outcomes most likely (a maximum-likelihood fit). It does this for all the usable data points, of which there are 359, since it can only do the fit where the record is complete (and Votefair did not exist for the first few seasons of Idol). Starting with the data (downloadable as a CSV), the parameter estimates are
    Call:
    glm(formula = Safe ~ WNTS.Rating + VFPercent + Order, family = "binomial",
        data = SemiFinals)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -2.3947  -0.9279   0.2182   0.8571   1.6938

    Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
    (Intercept) -1.984820   0.381815  -5.198 2.01e-07 ***
    WNTS.Rating  0.024835   0.007103   3.496 0.000472 ***
    VFPercent    0.181024   0.037073   4.883 1.05e-06 ***
    Order        0.093890   0.045729   2.053 0.040055 *
The Estimate column contains the values for $\beta_i$, the Std. Error column gives the estimated uncertainty in each of those values, and the column on the right is a test of how significant each variable is. It estimates (via an appropriate statistical test) the probability of seeing an association at least this strong purely by chance, if the response variable (in this case, whether someone was Safe or not) did not really depend on that variable. The values in this column are low, indicating that these variables are likely significant.
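For the curious, the fitting step can be sketched in a few lines. This is a minimal Python illustration of maximum-likelihood logistic regression with a single predictor, solved by Newton's method. It is not the code behind the output above; R's `glm` uses an iteratively reweighted least squares routine to the same end. The eight data points are made up for the example and are not from the Idol database.

```python
import math

def fit_logistic_1d(xs, ys, iterations=25):
    """Fit P(y=1) = 1 / (1 + exp(-(b0 + b1*x))) by Newton's method."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        # Gradient (g) of the log-likelihood and its negative Hessian (h).
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Newton step: solve the 2x2 system h * delta = g by hand.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Made-up toy data: singing order and whether the singer was safe (1) or not (0).
toy_order = [1, 2, 3, 4, 5, 6, 7, 8]
toy_safe = [0, 0, 1, 0, 1, 1, 0, 1]

b0, b1 = fit_logistic_1d(toy_order, toy_safe)
print(b0, b1)
```

At the fitted values, the predicted probabilities sum to the observed number of safe contestants, which is a quick sanity check that the fit converged.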
Note that the coefficient on each of the three variables is positive, meaning that your chance of being safe increases with WNTS Rating, with Votefair, and with performance order. This means that order of performance is indeed important.
The above curve shows the data points for all the records. A point near the top represents someone who was safe, a point at the bottom someone who was eliminated, and the curve shows the predicted probability (where the top is 1 and the bottom is 0) of advancing. By performing at the end of the show, you have a higher probability of advancing for the same WNTS Score. The same is also true of Votefair popularity.
Pimp spot indeed.
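To put a rough number on the pimp-spot effect, here is a small Python sketch that plugs the fitted coefficients from the R output into the logistic formula. The contestant is hypothetical: a WNTS rating of 50, a Votefair share of 5, and a 12-singer round are all assumptions for illustration, as is my reading that VFPercent is measured in percentage points.

```python
import math

# Coefficients copied from the glm summary.
INTERCEPT = -1.984820
B_WNTS = 0.024835
B_VOTEFAIR = 0.181024
B_ORDER = 0.093890

def prob_safe(wnts_rating, votefair_pct, order):
    """Unnormalized probability of being safe under the fitted model."""
    x = (INTERCEPT
         + B_WNTS * wnts_rating
         + B_VOTEFAIR * votefair_pct
         + B_ORDER * order)
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical contestant: same scores, singing first versus twelfth.
p_first = prob_safe(50, 5.0, 1)
p_last = prob_safe(50, 5.0, 12)

# Each later slot multiplies the odds of being safe by exp(B_ORDER).
odds_multiplier_per_slot = math.exp(B_ORDER)
print(p_first, p_last, odds_multiplier_per_slot)
```

With these made-up inputs the model gives roughly 0.56 for the opening slot and roughly 0.78 for the closing one, and each later slot multiplies the odds of being safe by about 1.10.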
Now that we have a serviceable model, we need to quantify our uncertainty about the predictions. To do this, I apply the following criterion. In what follows, I’ll assume 8 people advance. I want to know who had the 8 highest probabilities. Anybody who advanced but was not among those with the top 8 probabilities, I record as a false negative: the model ranked that person as eliminated, but in reality he or she advanced. Similarly, anyone ranked in the top 8 probabilities who was eliminated, I record as a false positive. To quantify how bad the misranking was, we’ll take the value halfway between the 8th and 9th highest probabilities and call that the “cutoff”. Then we’ll look at how far the model’s assigned probability was from that cutoff value.
The histogram above shows the misses. Most are right around 0, meaning the mistakes were small. There are a number with larger differences at the fringes. The shaded portion shows a fit to the histogram using a normal distribution. This distribution has a standard deviation of about 11.3 percentage points.
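The miss bookkeeping described above can be sketched as follows. This is a minimal Python illustration, not the code used for the histogram. The four contestants, their probabilities, and the two-advance round are made up to keep the example short, and recording the miss as a signed distance (probability minus cutoff) is my assumption.

```python
def score_round(probs, advanced, n_advance=8):
    """Return the cutoff and the misses for one round.

    probs:    dict mapping contestant name to model probability
    advanced: set of names who actually advanced
    """
    ranked = sorted(probs, key=probs.get, reverse=True)
    # Cutoff sits halfway between the last "in" and the first "out" probability.
    cutoff = (probs[ranked[n_advance - 1]] + probs[ranked[n_advance]]) / 2.0
    predicted_safe = set(ranked[:n_advance])

    misses = {}
    for name in ranked:
        if name in advanced and name not in predicted_safe:
            misses[name] = ("false negative", probs[name] - cutoff)
        elif name not in advanced and name in predicted_safe:
            misses[name] = ("false positive", probs[name] - cutoff)
    return cutoff, misses

# Made-up round: four singers, two advance.
example_probs = {"A": 0.9, "B": 0.6, "C": 0.4, "D": 0.1}
example_advanced = {"A", "C"}

cutoff, misses = score_round(example_probs, example_advanced, n_advance=2)
print(cutoff, misses)
```

Here B is a false positive sitting 10 points above the cutoff and C is a false negative sitting 10 points below it. Collecting those signed distances over every past round gives the values that go into the histogram.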
Conclusion: if we want to be reasonably sure about our calls (so that we account for 95% of our previous errors, which is roughly two standard deviations), we should only call someone safe if she is more than about 22 percentage points above the cutoff, and only call someone eliminated if she is more than 22 points below the cutoff. There will still be some people slipping through, but probably not too many.
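The 22-point figure follows directly from the normal fit. Here is a short Python sketch of that arithmetic and of the resulting decision rule. The function names and the "too close to call" label are my own, and the example probabilities are hypothetical.

```python
from statistics import NormalDist

SIGMA = 0.113  # standard deviation of past misses, from the histogram fit

def confidence_band(sigma=SIGMA, coverage=0.95):
    """Half-width around the cutoff that covers `coverage` of a normal fit."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)  # about 1.96 for 95%
    return z * sigma

def make_call(prob, cutoff, band):
    """Only commit to a call when the probability clears the band."""
    margin = prob - cutoff
    if margin > band:
        return "safe"
    if margin < -band:
        return "eliminated"
    return "too close to call"

band = confidence_band()
print(round(band, 3))
print(make_call(0.80, 0.50, band))
print(make_call(0.60, 0.50, band))
print(make_call(0.20, 0.50, band))
```

The band comes out at about 0.22, so a contestant 30 points above the cutoff gets called safe, while one only 10 points above stays too close to call.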
Note that I apply this criterion to the unnormalized probabilities. Since the sum of all probabilities in a round where 8 people advance should be 8, the probabilities all get rescaled, but I apply the criterion before that rescaling.
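One simple way to do that rescaling is to multiply every probability by the same factor so the total equals the number advancing. The Python sketch below assumes this proportional approach, which is my guess and not necessarily the method used here. The twelve raw probabilities are made up.

```python
def normalize(raw_probs, n_advance=8):
    """Scale probabilities by a common factor so they sum to n_advance.

    Assumption: plain proportional scaling. It can push a value above 1,
    which this sketch does not try to correct.
    """
    total = sum(raw_probs.values())
    scale = n_advance / total
    return {name: p * scale for name, p in raw_probs.items()}

# Made-up raw model outputs for a 12-singer round.
raw_values = [0.78, 0.70, 0.66, 0.61, 0.55, 0.52,
              0.48, 0.41, 0.35, 0.30, 0.22, 0.15]
raw = {f"contestant_{i + 1}": p for i, p in enumerate(raw_values)}

scaled = normalize(raw, n_advance=8)
print(sum(scaled.values()))
```

The ordering of contestants is unchanged by this step, which is why the cutoff criterion can be applied to the raw values first.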
And that’s it! There’s no magic here, and it’s not even particularly sophisticated. Fit the old data as best we can, hope that future events are similar, quantify how unsure we are, generate the probabilities. Easy peasy.