Season 12 model accuracy assessment roundup

As we head into the finale, it’s time to look at how predictable the season has been, particularly with regard to the three prediction models on this site. The first is the semi-finals model, which predicted who would be in the finals. The second is the finals model, a week-to-week guess as to who would be safe, in the bottom group, or eliminated. The third is the Top 3 Tracker, which tried to guess who the Top 3 would be.


Here is the semi-finals projection as posted before the results were announced.

Name    Probability of advancing    Advanced?
Lazaro Arbos 0.893 Yes
Devin Velez 0.856 Yes
Burnell Taylor 0.804 Yes
Curtis Finch 0.730 Yes
Paul Jolley 0.591 Yes
Charlie Askew 0.575 No
Vince Powell 0.403 No
Nick Boddington 0.126 No
Cortez Shaw 0.012 No
Elijah Liu 0.010 No
Angela Miller 0.936 Yes
Candice Glover 0.830 Yes
Kree Harrison 0.650 Yes
Amber Holcomb 0.447 Yes
Janelle Arthur 0.446 Yes
Adriana Latonio 0.418 No
Breanna Steer 0.412 No
Tenna Torres 0.350 No
Zoanette Johnson 0.343 No
Aubrey Cleland 0.169 No

Green means the model predicted the person would advance to the finals. Red means the opposite. Yellow means too close to call.

The model was confident that 7 contestants would advance, and all of them did. It was confident that 7 others would not advance, and none of them did. It was unsure of 6, and half of those advanced. The five highest-ranked contestants in each group of ten (10 in all) advanced, which is really just dumb luck. Because it was so accurate, I’m led to believe that the margin of error I allowed for was perhaps too large, and more names should have been in green or red instead of yellow.
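The ranking claim can be checked directly from the table. The following is a minimal sketch, assuming the first ten rows form one voting group and the last ten the other, with five advancing from each; the names `men`, `women`, `top_n_matches`, and `closest_gap` are illustrative and not part of the original model.

```python
# (name, model probability of advancing, actually advanced), copied from the table.
# Assumption: rows 1-10 are one voting group and rows 11-20 the other.
men = [
    ("Lazaro Arbos", 0.893, True), ("Devin Velez", 0.856, True),
    ("Burnell Taylor", 0.804, True), ("Curtis Finch", 0.730, True),
    ("Paul Jolley", 0.591, True), ("Charlie Askew", 0.575, False),
    ("Vince Powell", 0.403, False), ("Nick Boddington", 0.126, False),
    ("Cortez Shaw", 0.012, False), ("Elijah Liu", 0.010, False),
]
women = [
    ("Angela Miller", 0.936, True), ("Candice Glover", 0.830, True),
    ("Kree Harrison", 0.650, True), ("Amber Holcomb", 0.447, True),
    ("Janelle Arthur", 0.446, True), ("Adriana Latonio", 0.418, False),
    ("Breanna Steer", 0.412, False), ("Tenna Torres", 0.350, False),
    ("Zoanette Johnson", 0.343, False), ("Aubrey Cleland", 0.169, False),
]


def top_n_matches(group, n=5):
    """True if the n highest-probability names are exactly the ones who advanced."""
    ranked = sorted(group, key=lambda row: row[1], reverse=True)
    predicted = {name for name, _, _ in ranked[:n]}
    actual = {name for name, _, advanced in group if advanced}
    return predicted == actual


def closest_gap(group):
    """Gap between the lowest probability that advanced and the highest that did not."""
    lowest_in = min(p for _, p, advanced in group if advanced)
    highest_out = max(p for _, p, advanced in group if not advanced)
    return round(lowest_in - highest_out, 3)


for label, group in (("first group", men), ("second group", women)):
    print(label, top_n_matches(group), closest_gap(group))
```

The gap between the last name in and the first name out is under three percentage points in both groups, which is consistent with calling the perfect ranking partly luck.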


Here is a listing of all the model calls for the Top 10 through the Top 3. Green means safe, yellow means unsure. Red means one of two things: either the model was confident that the person would be in the bottom group, or it is not confident, but the person is ranked within the bottom 3 or 2 names, so I forced it to be red. I’ve noted when this happened. (I forced there to be a projected bottom 3 for obvious reasons; people come to the website to see who is the likeliest bottom 3.)

Contestant    Not-safe probability    Result    Forced?
Top 10
Curtis Finch 0.407 Eliminated No
Janelle Arthur 0.383 6th Yes
Paul Jolley 0.367 Bottom 3 (8th) Yes
Burnell Taylor 0.358 7th
Devin Velez 0.336 Bottom 3 (9th)
Lazaro Arbos 0.326 4th
Amber Holcomb 0.276 5th
Kree Harrison 0.244 Top 3
Candice Glover 0.226 Top 3
Angela Miller 0.078 Top 3
Top 9
Paul Jolley 0.481 Eliminated No
Burnell Taylor 0.445 Safe Yes
Lazaro Arbos 0.432 Safe Yes
Devin Velez 0.420 Bottom 3
Janelle Arthur 0.363 Safe
Amber Holcomb 0.352 Bottom 3
Kree Harrison 0.217 Safe
Candice Glover 0.201 Safe
Angie Miller 0.089 Safe
Top 8
Devin Velez 0.602 Eliminated No
Burnell Taylor 0.567 Bottom 3 No
Lazaro Arbos 0.536 Bottom 3 No
Janelle Arthur 0.380 Safe
Amber Holcomb 0.302 Safe
Kree Harrison 0.235 Safe
Candice Glover 0.228 Safe
Angie Miller 0.150 Safe
Top 7
Burnell Taylor 0.725 Eliminated No
Lazaro Arbos 0.658 Top 3 No
Janelle Arthur 0.491 Bottom 2
Candice Glover 0.340 Safe
Amber Holcomb 0.335 Safe
Kree Harrison 0.313 Top 3
Angie Miller 0.138 Top 3
Top 6
Janelle Arthur 0.570 Middle 2 No
Lazaro Arbos 0.467 Eliminated Yes
Amber Holcomb 0.433 Bottom 2
Kree Harrison 0.244 Top 2
Angie Miller 0.166 Middle 2
Candice Glover 0.119 Top 2
Top 5
Janelle Arthur 0.633 Eliminated No
Amber Holcomb 0.401 Safe Yes
Kree Harrison 0.376 Bottom 2
Candice Glover 0.347 Safe
Angie Miller 0.244 Safe
Top 4 (ii)
Candice Glover 0.565 Safe No
Kree Harrison 0.485 Safe
Angie Miller 0.479 Safe
Amber Holcomb 0.472 Eliminated
Top 3
Angie Miller 0.382 Eliminated No
Candice Glover 0.309 Safe
Kree Harrison 0.309 Safe

I’ll break the assessment off into several categories.

Safe calls

There were 27 safe calls (in green), and 25 of those were safe (not in the bottom group or eliminated). Janelle in the Top 7 and Amber in the Top 9 were the only exceptions. That is an accuracy of about 93%. Of the people called safe, none was eliminated.

Bottom group calls

Taken as a whole (all names in red, forced or not), the projected bottom group was not particularly accurate, nor was it designed to be. If I ignore the margin of error, there were 17 bottom-group calls, and only 10 of those were correct, a rate of about 59%. One reason this is a little lower than it would otherwise be is that the show revealed far less about the bottom group this year. In addition to cutting the Top 12 and Top 11 rounds, we were told only the bottom 2 in the Top 6 and Top 7, which is not normal.

Now, taking into account the margin of error, there were 11 people who were declared red confidently (not forced by me), and 8 of those were in the bottom group, a respectable rate of 73%. Only 3 people who were confidently red ended up being safe. Note that this is lower than in previous years (when the rate was more like 87%), so the bottom group has been less predictable than average this year.
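These two rates can be recomputed from the finals table. The sketch below assumes that every row with an entry in the Forced? column is a red call, and that a call counts as correct when the result is Eliminated, Bottom 3, or Bottom 2; the names `red_calls` and `hit_rate` are illustrative.

```python
# (round, contestant, forced by hand, landed in the bottom group or was eliminated)
# Assumption: rows with a Forced? entry in the table are exactly the red calls.
red_calls = [
    ("Top 10", "Curtis Finch", False, True),
    ("Top 10", "Janelle Arthur", True, False),
    ("Top 10", "Paul Jolley", True, True),
    ("Top 9", "Paul Jolley", False, True),
    ("Top 9", "Burnell Taylor", True, False),
    ("Top 9", "Lazaro Arbos", True, False),
    ("Top 8", "Devin Velez", False, True),
    ("Top 8", "Burnell Taylor", False, True),
    ("Top 8", "Lazaro Arbos", False, True),
    ("Top 7", "Burnell Taylor", False, True),
    ("Top 7", "Lazaro Arbos", False, False),
    ("Top 6", "Janelle Arthur", False, False),
    ("Top 6", "Lazaro Arbos", True, True),
    ("Top 5", "Janelle Arthur", False, True),
    ("Top 5", "Amber Holcomb", True, False),
    ("Top 4 (ii)", "Candice Glover", False, False),
    ("Top 3", "Angie Miller", False, True),
]


def hit_rate(calls):
    """Return (correct, total, fraction correct) for a list of red calls."""
    total = len(calls)
    correct = sum(1 for _, _, _, in_bottom in calls if in_bottom)
    return correct, total, correct / total


all_red = hit_rate(red_calls)
confident_red = hit_rate([call for call in red_calls if not call[2]])
forced_red = hit_rate([call for call in red_calls if call[2]])

for label, (correct, total, rate) in (
    ("all red calls", all_red),
    ("confident red calls", confident_red),
    ("forced red calls", forced_red),
):
    print(f"{label}: {correct} of {total} ({rate:.0%})")
```

Splitting out the forced calls shows where most of the misses come from: only 2 of the 6 forced calls landed in the bottom group.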


Out of 8 eliminations, 7 of the people eliminated were in red (88% accuracy). The person with the highest not-safe probability was eliminated 6 out of 8 times (75% accuracy). The person eliminated had either the highest or second-highest not-safe probability 7 out of 8 times (88% accuracy). The person eliminated never appeared in green.
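The ranking figures follow from where each eliminated contestant sat in that week's not-safe list. Here is a minimal sketch using the probabilities from the finals table, with first names only for brevity; `rounds` and `elimination_rank` are illustrative names, not part of the original model.

```python
# (round, {contestant: not-safe probability}, eliminated contestant), from the finals table.
rounds = [
    ("Top 10", {"Curtis": 0.407, "Janelle": 0.383, "Paul": 0.367, "Burnell": 0.358,
                "Devin": 0.336, "Lazaro": 0.326, "Amber": 0.276, "Kree": 0.244,
                "Candice": 0.226, "Angie": 0.078}, "Curtis"),
    ("Top 9", {"Paul": 0.481, "Burnell": 0.445, "Lazaro": 0.432, "Devin": 0.420,
               "Janelle": 0.363, "Amber": 0.352, "Kree": 0.217, "Candice": 0.201,
               "Angie": 0.089}, "Paul"),
    ("Top 8", {"Devin": 0.602, "Burnell": 0.567, "Lazaro": 0.536, "Janelle": 0.380,
               "Amber": 0.302, "Kree": 0.235, "Candice": 0.228, "Angie": 0.150}, "Devin"),
    ("Top 7", {"Burnell": 0.725, "Lazaro": 0.658, "Janelle": 0.491, "Candice": 0.340,
               "Amber": 0.335, "Kree": 0.313, "Angie": 0.138}, "Burnell"),
    ("Top 6", {"Janelle": 0.570, "Lazaro": 0.467, "Amber": 0.433, "Kree": 0.244,
               "Angie": 0.166, "Candice": 0.119}, "Lazaro"),
    ("Top 5", {"Janelle": 0.633, "Amber": 0.401, "Kree": 0.376, "Candice": 0.347,
               "Angie": 0.244}, "Janelle"),
    ("Top 4 (ii)", {"Candice": 0.565, "Kree": 0.485, "Angie": 0.479,
                    "Amber": 0.472}, "Amber"),
    ("Top 3", {"Angie": 0.382, "Candice": 0.309, "Kree": 0.309}, "Angie"),
]


def elimination_rank(probs, eliminated):
    """Rank of the eliminated contestant, where 1 is the highest not-safe probability."""
    return 1 + sum(1 for p in probs.values() if p > probs[eliminated])


ranks = [elimination_rank(probs, out) for _, probs, out in rounds]
top_one = sum(1 for r in ranks if r == 1) / len(ranks)
top_two = sum(1 for r in ranks if r <= 2) / len(ranks)

print(ranks)
print(f"highest not-safe eliminated: {top_one:.1%}")
print(f"highest or second highest eliminated: {top_two:.1%}")
```

The one miss outside the top two is Amber in the second Top 4 round, where all four not-safe probabilities were within ten percentage points of each other.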

Most surprising results

As measured by the ranking error (in percentage points), the biggest misses (in descending order) for the model were the following.

  • In the Top 7, the model was quite surprised that Lazaro was safe and Janelle was in the bottom 2.
  • Amber being in the bottom 3 in the Top 9 was also very surprising.
  • In the Top 6, the model was pretty surprised that Janelle was not in the bottom 2.

Top 3 Tracker

Here is the time series for all rounds showing the Top 3 Tracker’s assigned probability of making the Top 3.

The eventual Top 3 (Angie, Kree, and Candice) were always the three highest rated. This is obviously the most important measure.

The contestant eliminated in each round was the lowest ranked on the T3T 6 of 8 times (75% accurate). The person eliminated was either lowest or second-lowest rated 100% of the time, meaning that for elimination predictions the T3T was actually better than the finals model. For the bottom group, the accuracy of the T3T was about 69%, again better than the finals model itself. This is because the T3T uses an averaging scheme to smooth out week-to-week swings. However, it is also partly luck. In some seasons, averaging actually hurt prediction accuracy.
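To illustrate what smoothing by averaging does, here is a minimal sketch. The post does not describe the T3T's actual averaging scheme, so a plain cumulative mean is assumed purely for illustration, applied to Janelle's weekly not-safe probabilities from the finals table; `cumulative_mean` and `max_weekly_jump` are hypothetical helper names.

```python
# Janelle Arthur's not-safe probability, Top 10 through Top 5, from the finals table.
weekly = [0.383, 0.363, 0.380, 0.491, 0.570, 0.633]


def cumulative_mean(values):
    """Average of all values seen so far, one entry per week.

    Assumption: this is NOT the T3T's documented method, only a simple
    stand-in for "an averaging scheme".
    """
    smoothed = []
    running_total = 0.0
    for week, value in enumerate(values, start=1):
        running_total += value
        smoothed.append(running_total / week)
    return smoothed


def max_weekly_jump(values):
    """Largest absolute change between consecutive weeks."""
    return max(abs(b - a) for a, b in zip(values, values[1:]))


smoothed = cumulative_mean(weekly)
print([round(v, 3) for v in smoothed])
print(round(max_weekly_jump(weekly), 3), round(max_weekly_jump(smoothed), 3))
```

The smoothed series moves much less from week to week than the raw one. That damping is the benefit of averaging, and it is also the cost: a contestant whose standing really has changed is slow to show it.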

Reuben is the statistician and lead writer at IdolAnalytics. His personal site (run all year round) is IdleAnalytics.

  • Robert Henry Eller

    Reuben, you don’t say that Angie’s elimination from the top 3 was a surprise, although she was consistently the highest probability top 3 contestant, according to the model. Does that mean that Angie’s elimination was not within the model’s prediction capabilities? Were you able to measure the probability of Angie’s elimination? Was it statistically surprising?

  • Robert Henry Eller

    Reuben, I meant to say, thank you for running this site. Being a Nate Silver fan, my enjoyment of AI was greatly enhanced by your analytics.

    Looking forward to your final prediction on who wins.