# How the model works (and doesn’t work): Part 1

This week I had a Twitter exchange with a site called Zabasearch, in which they directed me to their own prediction site. Their system for prediction is totally opaque, whereas I have always disclosed how mine works. However, I realized that it’s been some time since I’ve done this, and it’s past time that I did an assessment and considered improvements. This is going to get fairly technical, so read on only if you are really interested.

In Part 1, the present article, I will explore how I have approached this topic. In Part 2, coming next week, I will do an assessment of how good the model was compared to other sites on the internet, principally Dialidol, Votefair, and Zabasearch, and try to explore why the model hit and why it missed.

### Regressions

I first want to clarify what constitutes a “model”. In principle, a given phenomenon can be expressed as a function relating independent variables to a dependent variable, such that there is a formula

y = f(x1, x2, x3, …)

where y is the thing you are trying to predict (the dependent variable) and x1, x2, x3, etc. are variables that you think affect this outcome (independent variables). What you are starting from is a list of data that you’ve collected (sets of y for given inputs x). Discovering the form of this function may be possible, but in many cases it is quite difficult.

There are a couple of ways to attack this problem. One is to assume the form of the function to be something reasonable, and then fit that form to the available data by adjusting the free parameters in the function. The other is to start with the underlying phenomenon and construct a rule-based simulation of it. This produces an answer without having to discover the form of f.

For American Idol analysis, I take the first option: assume a form for the function and do a regression analysis. This is the only plausible approach I know of given the constraints of the problem. A regression analysis simply means fitting the assumed function as well as possible to the data you already have. The most well-known example is a linear regression, whereby you assume that a given variable depends on another in a straight-line way, so that the data fits the function

y = mx + b

I have used this in many cases on this site. For instance, I said in an earlier article that the number of episodes a wild card pick survives is roughly linearly related to their score on the WhatNotToSing performance approval rating. I’ll reprint the graph here:

The dots correspond to actual observations made in the past (how many rounds a wild card pick lasted and what their average approval rating was in the semifinal and wild card round). The equation has two free parameters (slope m and y-intercept b), and an algorithm determines the best values for those parameters using the data points that you give it. “Best” means that the total squared difference between the line and the data points is minimized over the entire data set; the quality of the resulting fit is summarized by the R2 value, displayed above. The closer R2 is to 1, the better the fit, and the better the model. If R2 is 1.000 or very close to it, then you have probably stumbled upon the correct relationship, and if it is significantly lower than 1 then you have only captured part of the relationship. This does not mean that the linear relationship is “wrong”, it just means it’s only part of the picture.
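To make the fitting procedure concrete, here is a minimal sketch of a linear least-squares fit in Python, computing the slope m, intercept b, and R2 by hand. The data points are invented for illustration; they are not the actual wild card observations.

```python
# Minimal sketch of a linear least-squares fit, computing slope m,
# intercept b, and R^2 by hand. Data points are invented for
# illustration; they are not the actual wild-card observations.
def linear_fit(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope and intercept that minimize the sum of squared residuals.
    m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - m * mean_x
    # R^2 compares the leftover (residual) error to the total variance in y.
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return m, b, 1 - ss_res / ss_tot

# Hypothetical (approval rating, rounds survived) pairs:
xs = [30, 45, 55, 67, 80]
ys = [1, 2, 4, 6, 7]
m, b, r2 = linear_fit(xs, ys)
```

Any spreadsheet or statistics package does the same minimization when you ask for a trendline; the point is only that “best fit” has a precise meaning.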

With the above fit, I projected that Erika Van Pelt would be expected to last 6 rounds, given that her approval rating was about 67. This is done by putting her score in for x in the formula. However, she only lasted 3 episodes, making this projection quite wrong. Given that R2 was only 0.5, this is not shocking at all; if R2 had been 0.999, it would have been. Clearly there are more variables involved, and so the failure of this crude measurement is not unexpected. In this case, Erika made song choices that were less shrewd than the average contestant’s, and that was in no way considered in this relationship.

What kind of things make this situation unpredictable? Although the actual phenomenon is entirely deterministic (meaning that for sure the person with the lowest votes goes home), the details of the phenomenon are so complex that it appears to have a lot of randomness in it. All the secondary effects (the quality of other contestants, the demographics of the voters, song choice, judge comments, gender of contestant, etc.) are included in the vote count. The overall effect of all these variables looks like randomness or noise in the data. This accounts for why the dots in the graph don’t appear to fit any curve you can imagine.

What we could do is improve a given fit by expanding the number of free parameters in it. If I were to allow for 7 free parameters, by fitting to a sixth-order polynomial, it would give us the following:

The fit is indeed much better (R2 = 0.874). However, what we have done is totally meaningless. To demonstrate, let me add the three new data points from this year, Erika, Deandre, and Jeremy (red dots):

You can see that our fit had no predictive power at all. As people say, we have “overfit” the data, or “fit the noise”. Missing by this much might have been acceptable for a fit with an R2 of 0.5, but when you claim a significantly higher fit accuracy, you would not expect points to lie so far off the curve.

This demonstrates that you can only allow a certain number of free parameters given the quantity of data you have. A rough rule of thumb is one new parameter for perhaps every 10 data points you collect; anything beyond that is just fitting the noise, and if the data is noisy enough, even that can be a bad idea. In this particular case, the model is incredibly crude, taking into account almost nothing except how good the person was in the first week, and nothing about how they did down the line.
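The overfitting failure above can be reproduced in miniature. With as many free parameters as data points, a polynomial passes through every observation exactly, yet gives nonsense for a new input. The data here is invented for illustration:

```python
# Sketch of "fitting the noise": with as many free parameters as data
# points, a polynomial hits every observation exactly, yet predicts
# wildly for a new input. The data is invented for illustration.
def lagrange_interp(xs, ys, x):
    # Evaluate the unique degree-(n-1) polynomial through all n points.
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = float(yi)
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

xs = [10, 30, 50, 70, 90]   # hypothetical approval ratings
ys = [2, 1, 4, 3, 7]        # rounds survived (noisy)

# The curve hits every known point exactly -- a "perfect" fit...
assert all(abs(lagrange_interp(xs, ys, x) - y) < 1e-9
           for x, y in zip(xs, ys))
# ...but extrapolating to a rating of 100 predicts ~18 rounds survived,
# impossible in a season with far fewer rounds.
wild = lagrange_interp(xs, ys, 100)
```

The fit is flawless on the data it was built from and absurd everywhere else, which is exactly the trade the sixth-order polynomial made above.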

### Probabilities in Idol

The discussion of independent and dependent variables in modeling has slightly different nomenclature in statistics; they tend to be called explanatory and response variables. Now, the response variable in Idol (the dependent variable, the thing you want to predict) isn’t the same as in the previous example of wild card survival. In that case the result was effectively a continuous variable, namely the number of rounds survived.

But this isn’t the thing I’m mainly interested in. I want to figure out who’s going home and who’s safe in any given week. And that response variable is categorical, by which I mean that somebody is put into one category or another: either safe or not safe, bottom 3 or not bottom 3, eliminated or not eliminated. Thus, a function like a linear regression will not work. We need a function suited to an outcome that is binary in nature, 0 in some cases and 1 in others, and that is also multivariate, so that we can take into account multiple things at once.

On the other hand, we still want the model to be a linear model in any given variable. If I were to look at, say, the total percentage of people who were “Safe” in a final round vs WNTS Approval rating, you get an incredibly good fit overall (data are binned in units of 10, “finals” are defined by rounds in between Top 13 and Top 2):

We want to know “eliminated” or “not eliminated” in a given round. To do that, we need a function which returns a number between 0 and 1 for all possible values of explanatory variable. If I were to plot the observed data as a function of WNTS approval rating for, let’s say, the Top 10 round, it would look like this, where 0 means “was not eliminated” and 1 means “was eliminated”:

All but one of the people eliminated (shown as a dot at the top of the graph) had a WNTS approval rating of less than 50. We might be tempted to say that the probability of being eliminated approaches zero as WNTS Rating approaches 100. On the other hand, the probability quite clearly does not approach 1 as WNTS Rating goes to zero. Plenty of people over on the left hand side of the graph were not eliminated. We might therefore expect the function to look something like this (here the data is histogrammed for clarity):

The curve shown in red is a fit to the data using a single-variable logistic regression. This is a fit to the function

P = 1 / (1 + e^−(a0 + a1·x1 + a2·x2 + …))

where e is the base of the natural logarithm (sometimes called Euler’s number). This function has the virtue that the linear part, a0 + a1·x1 + …, comes out as the log-odds of the event happening (the natural logarithm of the odds), while the function itself stays between 0 and 1 for all possible input values. In this particular case, there is only one variable (x1), the WNTS Rating, and the coefficients are determined by a statistics program called R as a0 = -0.09280315 and a1 = -0.04954404. Substituting a contestant’s score in a subsequent season for x1 gives a probability, based on this one indicator alone, of elimination that week.
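Evaluating that fitted function is a one-liner. This sketch (in Python rather than R, purely for illustration) uses the two coefficients quoted above:

```python
import math

# The single-variable logistic fit described above: probability of
# elimination in the Top 10 round as a function of WNTS approval
# rating, using the coefficients quoted in the text (fit in R).
A0 = -0.09280315
A1 = -0.04954404

def p_eliminated(wnts_rating):
    # a0 + a1*x is the log-odds; the logistic maps it into (0, 1).
    z = A0 + A1 * wnts_rating
    return 1.0 / (1.0 + math.exp(-z))

# A poorly reviewed performance carries a much higher elimination risk:
low = p_eliminated(20)    # about 0.25
high = p_eliminated(80)   # under 0.02
```

So by this one indicator, a 20-rated Top 10 performance carries roughly a 25% elimination risk, while an 80-rated one is nearly safe.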

Though this method clearly allows us to factor multiple variables into a prediction, it has some drawbacks.

Firstly, this model treats each contestant’s outcome as an event separate from everyone else’s results, but this is not actually the case: someone with a score of 60 is more likely to be eliminated if everyone else scored an 80 that night. There is also the issue that WNTS Ratings appear to my eyes to have undergone some inflation over the years, so events that happened in Season 1 are perhaps not directly comparable to Season 10. That said, I don’t think this is a fundamental conceptual problem, just a technical one: in reality, I’m a bit cautious about declaring that there are “epochs” in American Idol and that every few seasons all the rules go out the window. I’m inclined to think that if patterns are contradicted by the current season, somebody was a bit presumptuous in declaring something a pattern in the first place.

Secondly, while the number of statistical observations increases every year, it decreases with each round as the season progresses. After all, in the Top 10 you have 10 separate cases over (now) 11 seasons, or 110 observations. But in the Top 3, you have only 33 observations, and the statistics start to get pretty ragged.

Finally, and linked with the second point, this method is at the mercy of a few statistical indicators, whose reliability is not assured. With 100 well-behaved data points with reasonable variance, you could include several indicators, but with 30 incredibly noisy data points, you are lucky to see statistical significance with even one or two variables. This means that you have to take what you perceive to be the most important explanatory variables and hope that they aren’t flawed.

How could we address these points? To the first, we could, instead of going with the absolute score, invent a differential score. So, whoever has the lowest score on a given night is assigned 0, and all other people’s scores are marked relative to that score. The problem with that is that when all the scores are really high (as happened this and last year) it’s tough to say that someone with a differential score of 10 in a previous year is just like someone with a differential score of 10 from this year. One could also imagine going to a ranking system instead of an absolute score, but this would lose granularity, as a distribution of scores from a certain night like (4,51,54,61,90) would simply be (5,4,3,2,1) and there’s no way to know that the guy who was ranked fifth was way behind the rest of the pack. I am still thinking about this issue.
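The two transformations just discussed can be sketched quickly. A differential score preserves the gaps between contestants; a rank discards them. The example night here is the one from the text, (4, 51, 54, 61, 90):

```python
# Sketch of the two transformations discussed above. A differential
# score preserves the gaps between contestants; a rank discards them.
def differential(scores):
    low = min(scores)
    return [s - low for s in scores]

def ranks(scores):
    # 1 = highest score ... n = lowest score (assumes no ties).
    order = sorted(scores, reverse=True)
    return [order.index(s) + 1 for s in scores]

night = [4, 51, 54, 61, 90]
diffs = differential(night)   # [0, 47, 50, 57, 86]
rks = ranks(night)            # [5, 4, 3, 2, 1]
# In rank form, nothing indicates that the 5th-place singer was 47
# points behind everyone else.
```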

To the second point, there is nothing to be done. If you want better statistics then you need more observations, period.

For the final point, I am entirely cognizant of the deficiencies of the various statistics I have used. Gender has historically played a part, but in a more subtle way than it would seem. WhatNotToSing, as a site that collects approval ratings, is generally well run, but favors the things that bloggers like rather than everyday voters, which should be considered. And Dialidol, well, they just have an occasional hiccup due to the demographic makeup of their sampled voters. This doesn’t help much with the actual numerical modeling, but it does inform how I think about the show as a whole.

The approval ratings published by WhatNotToSing are quite good, but they are not the only thing that matters to the voting, and they are just a sample, and not necessarily representative of what the total viewership thinks. For instance, we might take into account gender in the voting by doing the logistic fit described earlier with two variables: the approval rating and the gender simultaneously. The data is shown below in a plot as before, with x-axis as the WNTS Rating, but now the fit is different depending on which gender is specified:

Because there have been 6 women and only 5 men eliminated in the Top 10 round, and because more men overall have been safe (since there has been an imbalance of men and women over the years, favoring men), this model now rates men as less likely to be eliminated for the same approval rating. So, a woman with a score of 20 has about a 40% chance of elimination, whereas a man has only about a 20% chance.

The only problem with this is that gender is probably not statistically significant. One can estimate the probability of seeing an apparent gender effect this large purely by chance, even if gender actually made no difference (a test against the null hypothesis), and in this case that probability is about 30%. With odds like that, the above fit may be of little use. With more data, gender may yet prove significant, but we just don’t know right now on the basis of 11 seasons. Contrast that with the same statistic for WNTS Rating, where the chance of the effect being a fluke is a mere 0.4%.
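The form of the two-variable fit is the same logistic function with a 0/1 gender indicator added. The coefficients below are hypothetical, chosen only to reproduce the rough 40%-versus-20% split described above; they are not the actual fitted values:

```python
import math

# Form of the two-variable logistic fit described above: WNTS rating
# plus a 0/1 gender indicator. These coefficients are HYPOTHETICAL,
# chosen only to mimic the rough 40%-vs-20% split in the text; they
# are not the real fitted values.
def p_elim(wnts, is_male, a0=0.6, a1=-0.05, a2=-0.9):
    z = a0 + a1 * wnts + a2 * (1 if is_male else 0)
    return 1.0 / (1.0 + math.exp(-z))

woman = p_elim(20, is_male=False)   # roughly 0.40
man = p_elim(20, is_male=True)      # roughly 0.21
```

Each extra variable like this costs a coefficient, which is exactly why the free-parameters-per-data-point budget discussed earlier matters so much here.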

Overall, I’m nearly sure that there is a gender imbalance, induced mainly by societal preferences for musicians. However, when it comes to the Top 10, it’s hard to say just how significant that is. With such a messy fit, the model is basically silent on this issue until more observations are made.

Now let’s add in another possible indicator: Dialidol. There are a few ways we could do this: we could include the raw number of Dialidol votes cast for a given contestant, the Dialidol score (which is generated by a formula taking into account votes and the number of measured busy signals), or the rank that Dialidol assigns: 1st, 2nd, 3rd, etc. Any of these may work, and we can simply test them. I’ve nearly always found Dialidol rank to be the best indicator, with the possible exception of the semifinal rounds.

You can see that 3 of the 11 people who were eliminated in the Top 10 were ranked lowest (10th) by Dialidol. None eliminated in the Top 10 were the highest (1st) on Dialidol. The model’s algorithm computes the probability for all possible ranks and all possible WNTS Ratings, but to display the curves for all 10 ranks would clutter this graph up quite a bit. Suffice it to say that the projected elimination probability is far higher for 10th ranked contestants (red curve) than the probability for 1st ranked (blue curve). The significance here is better, though not stellar, with around a 12% chance of not being statistically significant. However, I take what I can get.

The variables that matter for one round don’t necessarily have such an impact on further rounds. WNTS approval rating tends to have declining importance in the later stages, while Dialidol rank increases. By doing a fit separately for each round, these temporal factors are taken into account, at the expense of statistical confidence. This is a choice I’ve made, and it may not be the right one. However, I don’t know of a better way of doing it.

As a matter of fact, this year I chose to fit the model not to eliminations but to who was in the bottom group in previous years. That way, what I was computing was technically “probability of being in the bottom 3” instead of “probability of being eliminated”. The inherent assumption is that the same factors that land a contestant in the bottom 3 also make them more likely to be eliminated. While the accuracy was a bit better, this didn’t make a huge difference one way or another. In principle, the statistics are improved, since we are using more of what we know about each week (a model fit only to eliminations ignores the bottom 3 entirely, which is kind of silly).

### Future directions

The obvious question is what new variables could potentially be added to improve the accuracy of the predictions. The one I am most interested in these days is the VoteFair total vote count. The accuracy of this index has been quite good this year (I will address this in Part 2). However, you have to realize that such a factor cannot be weighed by a model until there is significant historical precedent. How do you weigh two variables when they disagree? Only by seeing which was more accurate in the past, and by how much. If you know that, you can give a quantitative measure of how much you should consider it, and that will in principle produce a better prediction. Minimally, Votefair will have to be around for another year before I could include it, but the early results are promising.

Some other things of note: It may be that certain songs are more apt to garner a WNTS score incommensurate with the result. I’m thinking specifically of Tina Turner songs, which get gonzo ratings on WNTS, but then the contestant finds herself in the Bottom 3 or off the show (Hollie Cavanagh and Pia Toscano, respectively) despite this. I think there’s something there, but I haven’t investigated in detail. I also think that genre can make an impact, though country music appears not to have been the vote-garnering panacea that it was last year. Judges’ comments? It’s hard to say. For the foreseeable future, with the horrendous judging of the current panel, there could not be much impact. But if we see some kind of return to normalcy in judging, that could amount to something.

### To be continued…

Next week I’ll do a roundup of all the season 11 predictions: how good or (mostly) bad they were, why that happened, and how the model’s accuracy compared to experts and other prediction websites.