Why and how modeling American Idol kinda sucks

Suppose you wanted to build a mathematical model to predict a given event. The usual way to go about this is to collect a bunch of information about how that system behaves and to identify factors that are meaningfully correlated with the outcome. If you wanted to predict the outcome of a football game, for instance, you would look at things like the team’s win record, the quarterback’s passing record that year, and how many points the team scored on average under certain conditions.

To be a bit more specific, suppose that we’re at halftime in a football game, and one team is up 7 points on the other team. Then you can assign a probability that the team that’s ahead will win by looking at how often teams win games when they are ahead 7 points at halftime. If you wanted a better measure, you could look at how often teams like this team won when they were ahead 7 points. In fact, given enough data, you could predict the win percentage for being ahead any number of points, as a function f(x).
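
To make that concrete, here is a minimal sketch in R of how such a function could be estimated by simple aggregation (the games table is made up for illustration):

games <- data.frame(
  halftime_lead = c(7, 7, 7, 3, 3, -4, -4, 0, 10, 10),
  won           = c(1, 1, 0, 1, 0,  0,  1, 1,  1,  1)
)
# Empirical win fraction for each observed halftime lead, i.e. f(x)
f <- aggregate(won ~ halftime_lead, data = games, FUN = mean)
f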

Now suppose that all of a sudden the NFL decides to drastically change its rules. Touchdowns are now worth a different number of points, the second half has less time than the first half, the teams no longer have separate offensive and defensive lines, and the same 11 players are on the field for the whole game. All your data are much more difficult to interpret. Being ahead by 7 points at the half doesn’t mean the same thing as it did before, though it’s still a positive indicator. However, your ability to quantify and predict is significantly hindered.

To take a somewhat more relevant example, what if we are predicting the outcome of elections? We can look at a lot of different things about the past to try to predict the future: how someone’s opinion polls looked just before the election, how much money the candidate raised, how many newspaper endorsements the candidate got. Then you look at how many votes the person got, and you fit a function to the data. That function, to a certain degree of accuracy, should predict future elections. But what if we never knew how many votes the candidate got? What if the number of votes allowed to each citizen changed from year to year? And what if several of the prominent pollsters went out of business, so that you didn’t have good data for the present, whereas you did have good data for the past?

Such is the state of American Idol. There have been good indicators in the past. There are websites that track how well-liked a performance is (WhatNotToSing is one such aggregator, although one could easily compile the same stats by visiting a few dozen blogs and tabulating the results). There are websites that track how popular a given contestant is (Votefair was one prominent example). There was even a website that tracked how many votes it seemed like someone was getting (Dialidol).

[Figure: Votefair vote totals]

As of this writing, Votefair has registered 56 votes for last night’s competitors. In season 11, there were more than 500 votes for the same round. As you might imagine, when the sample size is decreased that markedly, the accuracy of the poll goes way down (it’s about 3 times less accurate, which is a lot). WhatNotToSing has yet to publish results, but the runners of that site have said in the past that it is getting harder to find 100 blogs to poll for approval ratings. And Dialidol hasn’t registered a meaningful score in over a year.
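
Where does that factor of three come from? A poll’s margin of error scales roughly as 1/√n, so shrinking the sample from about 500 responses to 56 multiplies the error by roughly √(500/56) ≈ 3.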

The reasons for the death of these indicators are quite obvious. American Idol’s viewership has been ravaged by falling network ratings, competition from other shows, lack of novelty, and the loss of Simon Cowell. Fewer voters means fewer rabid fans showing up on sites like Votefair. The institution of online voting has made Dialidol, really just a measure of busy signals on phone lines, irrelevant. Who still dials a landline to vote for a contestant? Well, probably nobody.

The rule changes are manifold. First, there was the introduction of text voting way back in the early days. Then there was Facebook voting, and voting through Google. The supervote was introduced, at one time worth 50 votes (now only 20). Voting through any given method is now limited to 20 votes, whereas in the past it was unlimited (you could call as many times as you wanted within the two-hour window). The voting period has been extended from a few hours after the show ended, to the following evening, and now to the following morning. Previously the bottom 3 was always revealed down to the Top 5 or 6; now the bottom 3 is only sometimes shown, even in the early rounds.

So it’s basically a mess.

Add to all of these factors that the demographics of the voters may have changed substantially, and it’s hard to say much about what the future holds. In the end, I’m just going to model this year as a simple regression. Since we are predicting a categorical or qualitative outcome (safe, bottom 3, eliminated), we need a link function suited to that, and the most convenient is the logistic link, 1/(1 + e^(-x)). The logit, x, is fit as a linear function of the variables. Since Dialidol is no longer an input variable, we just use approval rating and poll popularity:

x = β0 + β1 × (Approval rating) + β2 × (relative popularity)

where the betas are determined by historical data. What shall we choose as our dependent variable? We could choose “being safe” as the outcome. A software package will then try to determine what fraction of people were safe for a given set of values of approval rating and relative popularity. In this case, we can fit the first voting round in the software package R. If we include the order of singing, the WhatNotToSing approval rating, and the Votefair popularity percentage as input variables, R can estimate the values for the betas:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.098799   0.079361   1.245 0.214635    
Order       0.023111   0.009908   2.333 0.020684 *  
WNTS.Rating 0.005474   0.001630   3.358 0.000943 ***
VFPercent   0.009912   0.004019   2.466 0.014499 *

What R is telling us is that each of these variables is predictive of whether someone was safe. The later in the episode you sang (Order), the higher your chances of being safe. The higher your WNTS rating, the more often you were safe. And the more popular you were on Votefair (VFPercent), the more likely you were to be safe. The Estimate column gives the values for the betas. The rightmost column tells us how significant each variable is: roughly, the probability of seeing an effect that strong by chance if the variable actually had no predictive value. For what we’re doing, we can accept anything below 0.05 as being OK.
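
For concreteness, here is a minimal sketch of how a fit like this can be run in R. It assumes a data frame called idol with one row per performance and columns Safe (1 if the contestant was safe, 0 if not), Order, WNTS.Rating, and VFPercent; the exact call behind the table above may differ.

# Logistic fit of "safe" against singing order, WNTS rating, and Votefair popularity
fit <- glm(Safe ~ Order + WNTS.Rating + VFPercent,
           family = binomial, data = idol)
summary(fit)   # estimated betas and their significance

# Predicted probability of being safe for a hypothetical contestant who
# sings 8th, scores 60 on WhatNotToSing, and polls 15% on Votefair
predict(fit,
        newdata = data.frame(Order = 8, WNTS.Rating = 60, VFPercent = 15),
        type = "response")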

Now, one thing you can object to is that it isn’t good practice to compare a weak season to a strong one. If somebody got an 83 WNTS rating in season 5, when there were a bunch of good singers, that might not be as significant as getting an 83 during a weak season, when people had much lower scores on average. Fair enough, but then you’re faced with how to normalize these ratings from year to year, and in fact the mean WNTS rating hasn’t varied markedly from season to season:

[Figure: mean WNTS rating by season]

This would suggest that normalizing the data will probably not lead to a much better prediction.

I’m haunted by the fact that the model treats these events as independent, when in fact they are not. However, it’s not easy for me to see a way out of the problem. Of course, the probability of someone being safe must affect the probabilities of the rest of the cohort (the people in the same group in the same year). If everyone loves Nick Fradiani, that surely reduces the chances of someone else like Quentin Alexander. But how? I don’t think there’s a good answer other than to normalize the probabilities within a given group, which I always do. (That is, the sum of the probabilities must equal the number of people who will be safe, in this case 8.) If we are instead calculating the number who are not safe (four, in this semifinal round), then the not-safe probabilities should add up to 4. We multiply each probability by a common factor to get the sum to equal four. But arguably that still doesn’t really solve the problem.
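
Here is a minimal sketch of that renormalization in R; the probabilities are made up for illustration.

# 12 semifinalists, 8 of whom will be safe; p_safe are the raw model outputs
p_safe <- c(0.88, 0.85, 0.80, 0.74, 0.70, 0.66,
            0.61, 0.55, 0.48, 0.40, 0.33, 0.26)
n_safe <- 8
# Scale every probability by the same factor so the group total equals 8
p_adj <- p_safe * (n_safe / sum(p_safe))
sum(p_adj)   # equals 8 by construction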

We also have a problem that Votefair alone doesn’t seem to have a large enough sample set. To correct for this, I will probably be taking several internet polls and grouping them together. So, if there is a poll at MJsBigBlog, and one at TVLine, and one at Votefair, I’ll take all of these and combine them into one large poll, and use the results of that as a stand-in for Votefair. This has many potential problems, such as double-counting many voters, but nothing else seems to be viable.
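
A sketch of what that pooling might look like in R; the respondent counts and percentages are invented, and only the poll names come from above.

# Convert each poll's percentages back to approximate vote counts,
# add them up, and recompute percentages over the combined pool
mjsbigblog <- c(A = 30, B = 20, C = 50) / 100 * 400   # ~400 respondents
tvline     <- c(A = 35, B = 25, C = 40) / 100 * 250   # ~250 respondents
votefair   <- c(A = 40, B = 22, C = 38) / 100 * 56    #   56 respondents
combined   <- mjsbigblog + tvline + votefair
round(100 * combined / sum(combined), 1)   # pooled popularity percentages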

The amount and state of the data on American Idol are frankly pitiful. While the data have always had their problems, they are almost as bad today as they have ever been. This is one reason why I’ve started to track Twitter sentiment, as Idol is still dominant in so-called “second screen” engagement. However, the reliability of this data is still unknown, since there is no historical record to compare against. It may be possible to incorporate it this year as the season progresses and we get a feeling for how well it’s doing. Until then, I’ll just present it to you, the reader, to make up your own mind.
