How the model works (and doesn’t work): Part 2

This is a continuation of a thorough explanation of the forecasting model I’ve used. The present article is a wrap-up of the season and an assessment of the model’s accuracy. Please see Part 1 of this series for a somewhat detailed explanation of the analytical model.

Overview

Most people’s perception of the model this season is that it’s really terrible, but in fact it made a ton of very good calls early in the season. Many people did not start reading this blog until later, when its foibles became more evident. The semifinals, Top 13, and Top 11 went swimmingly, with great accuracy. Then the wheels came off.

Of the 13 lowest vote-getters, 6 were either the bottom contestant or the next-to-bottom contestant in the model:

This histogram (blue bars) shows where the model ranked the person who was actually eliminated (1 being most likely to be eliminated, 2 being second-most likely, and so on). 3 correct calls, 3 nearly correct, 1 almost nearly correct, and 6 that were way the hell off. The red line shows the cumulative percentage (right axis). Not terrible, not great.
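If you want to see how a chart like that is put together, here is a minimal sketch in Python (numpy/matplotlib). The ranks list is invented to match the shape described above (three rank-1 calls, three rank-2, one rank-3, six misses); the exact ranks of the misses are made up.

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented ranks: where the model placed the actual eliminee each week
# (1 = the model's most-likely elimination). Shaped to match the text above;
# the exact values of the misses are made up.
ranks = [1, 1, 1, 2, 2, 2, 3, 5, 6, 7, 8, 9, 4]

max_rank = max(ranks)
counts = np.bincount(ranks, minlength=max_rank + 1)[1:]   # weeks at each rank 1..max
cumulative = 100 * np.cumsum(counts) / len(ranks)          # running share of weeks covered

fig, ax1 = plt.subplots()
ax1.bar(range(1, max_rank + 1), counts, color="steelblue")
ax1.set_xlabel("Model rank of the eliminated contestant")
ax1.set_ylabel("Number of weeks")

ax2 = ax1.twinx()                                          # cumulative % on the right axis
ax2.plot(range(1, max_rank + 1), cumulative, color="red", marker="o")
ax2.set_ylabel("Cumulative %")
ax2.set_ylim(0, 100)

plt.show()
```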

My ultimate assessment of the logistic regression model of Idol is that it still basically sucks after the Top 8. Every week I collect the data, do the fit, check the significance, and generate the projection. And then I sit and fret about it, because I normally don’t agree. In the Top 4 I said it was crazy that the results didn’t reflect that Hollie was going to be eliminated (I was correct). Skylar was similarly a bit obvious, and the input variables just didn’t reflect it. I knew; I felt. But the model doesn’t know or feel. It’s just a formula.
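For readers curious what “do the fit, check the significance, generate the projection” amounts to in practice, here is a minimal sketch of the weekly routine using Python with pandas and statsmodels. Everything in it is invented for illustration: the column names, the simulated history, and the three contestants are placeholders, not the real spreadsheet.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Hypothetical history: one row per contestant-week from earlier rounds/seasons.
# 'eliminated' is simulated here purely so the example runs end to end.
n = 200
history = pd.DataFrame({
    "dialidol": rng.uniform(0, 4, n),     # DialIdol busy-signal score (invented scale)
    "wnts":     rng.uniform(20, 90, n),   # WhatNotToSing approval rating
    "female":   rng.integers(0, 2, n),
})
true_logit = -1.0 - 0.8 * history["dialidol"] - 0.04 * (history["wnts"] - 50)
history["eliminated"] = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

# Weekly routine: fit, check significance, project.
X = sm.add_constant(history[["dialidol", "wnts", "female"]])
fit = sm.Logit(history["eliminated"], X).fit(disp=0)
print(fit.summary())  # inspect coefficients and p-values before trusting the fit

# Project this week's contestants (names and scores invented).
this_week = pd.DataFrame(
    {"dialidol": [2.8, 0.5, 1.4], "wnts": [66, 31, 52], "female": [0, 1, 1]},
    index=["Contestant A", "Contestant B", "Contestant C"],
)
probs = pd.Series(
    fit.predict(sm.add_constant(this_week, has_constant="add")),
    index=this_week.index,
)
print(probs.sort_values(ascending=False))  # top of list = projected lowest vote-getter
```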

Why does it suck so bad? Because analytics requires lots of data, and Idol isn’t data rich. The voting results are never revealed except for the Bottom Group, and even that isn’t always shown (last week, for instance, we only know that Joshua Ledet received the lowest number of votes). Predicting presidential elections is tricky enough, and for that you have something like 16 modern elections to work with, vote totals for each candidate, primary votes, and poll results leading up to the election. For Idol, I have none of that. I have the result and the outcome of two polls (WhatNotToSing and DialIdol).

Is it time to move on from Dialidol? Perhaps so. Votefair has been far better this year, and since it’s been around long enough, there probably is enough historical data to factor it in. I will point out some examples of Dialidol misses later on, and there have been some biggies. I’m becoming convinced, slowly, that Dialidol is having a rough time because of Facebook voting. I’m intrigued by Zabasearch, but I have no idea how it works or how sound its methodology is. Again, someone working with presidential politics has many results for each pollster. I have only a few, and there are many “candidates” in each “election”, rather than just 2 or 3.

Can the model be improved? Sure. More data is helpful. Weighting things like Dialidol, which appears to be having some glitches, lower in the forecasts is tempting, though it would just be me putting my finger on the scale to get the result I want. (I think I’ve been accused of being biased enough to even contemplate that option.)

Semifinals

For the semifinals this year we had 25 contestants, 10 of whom were elected by the public vote. There were 13 men and 12 women, and 5 were selected from each group.

On the women’s side, everything was roses. The model predicted Jessica, Elise, Hollie, Skylar, and Shannon, for 100% accuracy. There was no effect of pre-exposure that I could see, so it’s a good thing that I couldn’t factor it in (otherwise Jen Hirsh would have been called a shoo-in, which obviously didn’t happen). Point of interest: including other variables can reduce the estimated significance of a given variable. Once Dialidol and WNTS were factored in, pre-exposure just didn’t matter as much.
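That point about significance is the usual story of correlated predictors: a variable can look important on its own and lose significance once the covariates carrying the same information enter the model. A toy demonstration, with data invented so that pre-exposure works mostly through the Dialidol and WNTS scores:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 300

# Invented data: exposed contestants tend to score better, and the scores
# (not exposure itself) drive who gets eliminated.
pre_exposure = rng.integers(0, 2, n)
dialidol = rng.normal(1.5 + 0.8 * pre_exposure, 0.7, n)
wnts = rng.normal(50 + 10 * pre_exposure, 12, n)
p_elim = 1 / (1 + np.exp(1.0 + 0.9 * (dialidol - 1.5) + 0.05 * (wnts - 50)))
eliminated = rng.binomial(1, p_elim)

df = pd.DataFrame({"pre_exposure": pre_exposure, "dialidol": dialidol,
                   "wnts": wnts, "eliminated": eliminated})

# Pre-exposure alone vs. pre-exposure alongside the two poll scores.
alone = sm.Logit(df["eliminated"], sm.add_constant(df[["pre_exposure"]])).fit(disp=0)
full = sm.Logit(df["eliminated"],
                sm.add_constant(df[["pre_exposure", "dialidol", "wnts"]])).fit(disp=0)

print(alone.pvalues["pre_exposure"])  # typically small: looks important on its own
print(full.pvalues["pre_exposure"])   # typically much larger once the scores are in
```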

For the men, though, the results were ugly. Eben, Chase, Adam, Jermaine, and Joshua were called as the most likely, and I found these selections woeful but believable. A precocious kid, a country guy, a soul-singing white guy: those sounded like Idol picks. But they weren’t. Not at all. Jermaine and Joshua were good calls, but the other three were not.

Why the hell did the model miss Phil Phillips? His WNTS score for “In The Air Tonight” was 65, not the highest of the night (by a long shot), but second-best. No, in the end Phil missed the cut because of the inclusion of Adam Brock. Brock’s Dialidol score was quite a bit higher than Phil’s. Couple that with Eben Frankewitz’s astronomical score, and Phil just got pushed out. Damn.

I think it’s fair to say that Dialidol should never be used for the semifinals again. Where it used to be good, it just isn’t anymore. Votefair got 4/5 correct on the men’s side. There’s a very good chance it will be the standard next year.

The overall accuracy was 70% for the semifinals. I think it could have been improved to 80%, but I can’t say I’m too disappointed.

Top 13

The Top 13 this year was basically an extension of the semifinals, in that men and women were treated separately. Thus, there were 6 calls for the bottom group and 2 for the lowest vote-getter. The model called Jeremy, Jermaine, and Heejun for the bottom group, with Jeremy the lowest. The model was right, and Jeremy was sent home. Dialidol and Votefair both called Jermaine. (racists!) (I’m kidding.) For the girls, the model’s bottom group was Elise, Shannon, and Erika, with Shannon projected to go home: 3/3 on the bottom group, but it missed Elise as the actual lowest. Overall, 5/6 good calls on the bottom group, one good call on the lowest vote-getter and one bad.

Is the model just better at predicting the result when the contestants are segregated by gender? Yes, for sure. The gender-related sampling bias in both Dialidol and WNTS is mostly wiped out by treating the two groups separately, and there’s no doubt in my mind that this helps the model’s predictive ability.
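Concretely, separate treatment just means each contestant is only ranked against same-gender rivals, so a systematic gender skew in Dialidol or WNTS can’t push anyone across the line. A small sketch of that grouping in pandas, with made-up names and probabilities standing in for the weekly fit’s output:

```python
import pandas as pd

# Hypothetical week during the semifinals/Top 13; probabilities invented.
week = pd.DataFrame({
    "name":      ["A", "B", "C", "D", "E", "F"],
    "female":    [0, 0, 0, 1, 1, 1],
    "elim_prob": [0.41, 0.18, 0.25, 0.37, 0.52, 0.11],  # from the weekly fit
})

# Rank within each gender, so a contestant is compared only with same-gender rivals.
week["rank_in_group"] = (week.groupby("female")["elim_prob"]
                             .rank(ascending=False, method="first"))
print(week.sort_values(["female", "rank_in_group"]))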

Top 11: looking good

Things started out pretty well. After Jermaine’s disqualification, the model set its sights on Shannon, Erika, and Deandre. The Shannon and Erika calls were a beautiful example of the model doing what it’s supposed to do. It picked out Erika for having a low Dialidol score, a middle-of-the-road WNTS score, and being a woman. Shannon had a much worse WNTS score but a much better Dialidol score; the model weighed those unequally and correctly called her the lowest vote-getter, in effect judging that Dialidol just isn’t that accurate early on. The model missed Elise, primarily because she had a rather high WNTS score (71) for her rendition of “Let’s Stay Together.” I don’t know what was going on with that.
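To make that trade-off concrete, here is the arithmetic with made-up coefficients (these are not the fitted values, which change from week to week): when the fit puts little weight on Dialidol, a weak WNTS score can outweigh a decent Dialidol score.

```python
import math

# Hypothetical coefficients: early in the season the fit puts little weight
# on Dialidol and more on WNTS and gender. Illustrative numbers only.
b0, b_dialidol, b_wnts, b_female = 1.0, -0.3, -0.06, 0.5

def elim_prob(dialidol, wnts, female):
    z = b0 + b_dialidol * dialidol + b_wnts * wnts + b_female * female
    return 1 / (1 + math.exp(-z))

# Shannon-like profile: weak WNTS, decent Dialidol. Erika-like: the reverse.
print(elim_prob(dialidol=2.5, wnts=30, female=1))  # ~0.26: higher risk despite the better Dialidol
print(elim_prob(dialidol=0.8, wnts=55, female=1))  # ~0.12: lower, because WNTS carries more weight
```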

Top 10 through Top 7

In the next 5 rounds there were 15 instances of someone being sent to the bottom 3, and the model called 8 of them correctly. It made no correct calls on the person being eliminated. It missed in the Top 10 when Erika Van Pelt was ROBBED by Heejun. Then it turned around and missed Heejun’s swan song, predicting Hollie instead. Ah, my beloved Hollie, how many times people said you’d be gone, and how many times you beat the odds.

The model persisted in thinking Hollie would be gone. At this point it was highly biased against people who had already been in the bottom 3. So in the Top 8 it stuck with Hollie and Deandre, but considered the former more likely to go home based on her lower WNTS score. I’d like to point out that if the model took Dialidol more seriously, Deandre would have been predicted gone.

Then the Top 7 happened. Jessica Sanchez would have been voted off. No service or expert that I saw got this one correct, nor the Top 7 redux choice of Colton Dixon. (Maybe somebody thought Colton would go, but nobody that I saw. No letters, please.)

Top 6 through Top 3: how a bunch of newcomers came to think I’m an idiot

So, a funny thing happened around the time of the Top 6, just as my horrifying slump reached its nadir: a lot of people started coming to my website. I went from having 50 people on a good day to having around 700 people on a slow day. I don’t know why.

It was a ghastly and embarrassing thing to have happen. Each week I felt I knew what was happening, but was unable to find any evidence for it. In the Top 6 Elise was called out (duh), which the model got correct. But in the Top 5 Dialidol voters abandoned Phil, and my model went along with it. Shucks.

The Top 4 saw Hollie Cavanagh finally eliminated, which no other major service caught. The Top 3 saw Joshua Ledet go home, a result which was, in retrospect, too close to call. I’d like feedback from commenters about whether they think I should abstain from calls that are marginal, or keep staking my name on them as I do. Go ahead and weigh in below. It’s the section at the bottom of the article with 15 people bitching about how big of a moron I am (WHY DO THEY READ THE SITE???).

Other sites

Regarding the cumulative percentage chart at the top of this article, we can look at my model vs Dialidol and Votefair. This is not a straight comparison, because those services actually sample, whereas mine is just a model. Nevertheless, I think such a comparison is interesting.

The model I use is behind the other two, but it’s not awful.

Here are the comparisons for Bottom Group projections for the three services. I’ve also included numbers for Zabasearch, but note that they made significantly fewer projections than the other three.

            IdolAnalytics   Votefair   Dialidol   Zabasearch
Correct          17            20         16          10
Total            30            30         30          18
Percentage       57%           67%        53%         56%

A two- or three-parameter regression is not yet as good at projecting the Bottom 3 as a sampling system like Votefair, and it beat out Dialidol by only one call. Zabasearch is a "wait and see." I hope they’re back next year, and I hope they start at the beginning (where, arguably, high accuracy is harder to achieve, since with so many contestants a random guess is less likely to be right).

Finally, here are the elimination assessments (this ain’t pretty):

            IdolAnalytics   Votefair   Dialidol   Zabasearch
Correct           3             5          4           3
Total            12            12         12           8
Percentage       25%           42%        33%         38%

It’s also interesting to see how the model stacks up against someone who knows the game and calls from the gut. MJ, from mjsbigblog, correctly guessed 45%, better than any single service reviewed here. On the other hand, DJ Slim, proprietor of IdolBlogLive, scored only 17%.

Will the model be back?

I don’t know if it’s worth continuing with a model like this. It’s one thing to subject a topic to mathematical scrutiny, but quite another to task yourself with hitting a moving target inside a black box.

I’m quite happy with the beginning of the season and quite unhappy with the ending of the season. What started out as a side project on this site (I mainly just wanted to do analysis and provide context) has become my prime obsession, and I don’t like that very much. If the model comes back next year (assuming I still write for this site), it will have to have substantial improvements built into it before I publish anything it predicts. Weigh in with a comment if you’d like to see the model return.

This will likely be the last article I write this year for this site. All that’s left is the final liveblog (perhaps of both the performances and the results), a final projection (which, yes, I’m sure will be wrong), and possibly some closing thoughts. If you made it all the way here, well, thanks for reading.

  • v

    I’ve found the process of reading your analytics fascinating. Clearly I’m not a stats person except in the most general way but the play between the numbers and feelings is interesting to watch.

    I don’t really care which of the contestants win as I never purchase any of their music ever, nor do I believe in the title American Idol. Most time I find the second place person is who I liked better – in a contest that isn’t ever my taste in music.

    I also find the idiotic comments hilarious as it’s a matter of taste and preference and a penchant for defending their own choice simply because someone else doesn’t agree with their obviously stellar perception. It’s quite entertaining really.

    The curious mix of entertainment and science is fun. Maybe you should have a disclaimer at the top of every post that says: here is what the numbers say – limited data – limited results. Here is my editorial opinion which may or may not agree with the data.

  • jeff duong

    how about a discrete time survival model incorporating time-constant and time-varying covariates?

    perhaps you could use a long formed model to model parameters that include data from previous seasons?

    i’m not a super strong statistician, but i feel like this might be possible…