## Song choice revisited

As regular readers of this blog will know, I don’t assign as much weight to song choice as other sites do. The popular site “What Not To Sing” seems to rest on the idea that song choice is critical, but I just don’t see it. As people have said many times: if you can sing, you can sing. Contestants have pulled out great performances on strange songs like Hemorrhage (In My Hands) (Chris Daughtry, a 94 WNTS rating in the Men’s Semifinals of Season 5), and contestants have crashed and burned with what are ostensibly good choices (how about Paige Miles’ performance of Against All Odds (Take a Look at Me Now) in Season 9’s Top 11?).

But those are cherry-picked examples. Suppose we look back over all seasons and try to suss out how much song choice actually matters. Unfortunately, we run into a problem right away: any given song has only been sung a handful of times in all of Idol history. The most-sung is I Have Nothing by Whitney Houston, with only 8 performances. I’m not going to draw any conclusions from a sample of 8.

The obvious analytic way out is to group songs by common factors, or otherwise quantify songs along some dimension. One way to do this is with the “Whitburn score” (my term, don’t bother googling it). We define the score as the sum of 101 minus the Hot 100 position for each week the song charts. So a song that charts for one week at position 100 gets a Whitburn score of 1, and a song that never charts gets a Whitburn score of 0. Imagine Dragons’ Radioactive charted for 74 weeks and has a Whitburn score of almost 6000. I’m Yours, Rolling in the Deep, Smooth, and Somebody That I Used to Know also rate near the top.
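As a concrete sketch of the definition (in Python, with a made-up chart run; nothing here is real chart data), the score is just a sum over the weekly positions:

```python
def whitburn_score(weekly_positions):
    """Sum of (101 - Hot 100 position) over each week the song charted.

    A song that never charts scores 0; one week at position 100
    scores 1; a week at #1 contributes the maximum of 100 points.
    """
    return sum(101 - pos for pos in weekly_positions)

print(whitburn_score([100]))       # one week at #100 -> 1
print(whitburn_score([]))          # never charted -> 0
print(whitburn_score([1, 5, 20]))  # hypothetical run: 100 + 96 + 81 = 277
```

A 74-week run near the top of the chart, like Radioactive’s, racks up points fast, which is how the scores reach into the thousands.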

So the first question I have to ask is: does the frequency with which someone is safe in the contest depend at all on a metric like this? If the answer is yes, we ask an even more interesting question: is the more familiar song safer or less safe?

Let’s fit a simple logistic regression (a linear model through the logit link) to the historical data. We model whether a performance was safe against a bunch of potentially relevant variables: the order of the performance, the WNTS rating, whether the contestant was previously in the bottom group, and several metrics that rate the popularity of the song.

```
Call:
glm(formula = Safe ~ Season + Order + TotPerfs + WNTS.Rating +
Bottom.Prev + VFPercent + YearOfRanking + YearlyRank + WeeksCharted +
Charted40OrBetter + Charted10OrBetter + HighestPosition +
RadioPlays + WhitburnScore, family = binomial, data = SongData)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-2.6702  -0.9666   0.4720   0.8418   1.8900

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)       -1.501e+01  1.415e+01  -1.061   0.2887
Season            -7.699e-02  3.333e-02  -2.310   0.0209 *
Order              1.134e-01  2.760e-02   4.108 3.98e-05 ***
TotPerfs           4.324e-02  4.411e-02   0.980   0.3270
WNTS.Rating        3.007e-02  4.157e-03   7.233 4.73e-13 ***
Bottom.Prev       -8.484e-01  1.730e-01  -4.903 9.44e-07 ***
VFPercent          4.024e-02  7.135e-03   5.640 1.70e-08 ***
YearOfRanking      7.042e-03  7.218e-03   0.976   0.3292
YearlyRank         1.626e-03  1.906e-03   0.853   0.3936
WeeksCharted       2.216e-02  2.172e-02   1.020   0.3077
Charted40OrBetter  9.316e-03  3.306e-02   0.282   0.7781
Charted10OrBetter -3.449e-02  2.450e-02  -1.408   0.1592
HighestPosition   -1.717e-02  9.983e-03  -1.720   0.0854 .
WhitburnScore     -4.273e-04  4.336e-04  -0.985   0.3244
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1296.6  on 1007  degrees of freedom
Residual deviance: 1058.4  on  993  degrees of freedom
(823 observations deleted due to missingness)
AIC: 1088.4

Number of Fisher Scoring iterations: 5
```

The initial findings are not promising. By far the most significant variables are the WNTS rating, Votefair popularity, and whether or not the person had previously been in the bottom group. Many of the song variables are not significant. For instance, radio play, which you might think would matter, quite likely makes no difference (at least on the full data set).

Now let’s limit the model to just variables that seem to have a snowball’s chance in hell of mattering:

```
Call:
glm(formula = Safe ~ Order + WNTS.Rating + Bottom.Prev + VFPercent +
WeeksCharted + HighestPosition + WhitburnScore, family = binomial,
data = SongData)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-2.6673  -0.9666   0.4831   0.8441   1.8820

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)     -1.2971209  0.2918839  -4.444 8.83e-06 ***
Order            0.1263050  0.0258266   4.891 1.01e-06 ***
WNTS.Rating      0.0279386  0.0039534   7.067 1.58e-12 ***
Bottom.Prev     -0.9486016  0.1578972  -6.008 1.88e-09 ***
VFPercent        0.0395262  0.0066878   5.910 3.42e-09 ***
WeeksCharted     0.0403195  0.0182330   2.211  0.02701 *
HighestPosition -0.0089462  0.0037305  -2.398  0.01648 *
WhitburnScore   -0.0006699  0.0002277  -2.942  0.00326 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1333.5  on 1037  degrees of freedom
Residual deviance: 1097.9  on 1030  degrees of freedom
(793 observations deleted due to missingness)
AIC: 1113.9

Number of Fisher Scoring iterations: 5
```

We can see that safety still depends mostly on how popular the contestant is and how well they sang the song. I am fairly confident that “song choice is everything” is simply not the case.
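To put the reduced model in more tangible terms: each coefficient is the change in the log-odds of being safe per unit of the predictor, so exponentiating gives an odds ratio. A quick sketch in Python, with the point estimates copied from the R output above:

```python
import math

# Point estimates from the reduced logistic fit above
coefs = {
    "Order":            0.1263050,
    "WNTS.Rating":      0.0279386,
    "Bottom.Prev":     -0.9486016,
    "VFPercent":        0.0395262,
    "WeeksCharted":     0.0403195,
    "HighestPosition": -0.0089462,
    "WhitburnScore":   -0.0006699,
}

# exp(coefficient) = multiplicative change in the odds of being safe
# for a one-unit increase in that predictor
for name, b in coefs.items():
    print(f"{name:16s} odds ratio: {math.exp(b):.4f}")
```

So, for example, a previous bottom-group appearance cuts a contestant’s odds of being safe by more than half, while each extra point of WNTS rating raises them by roughly 3%.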

But this article is focused on song choice, and we do see some effect of how well the song did on the Billboard Hot 100, as captured by the Whitburn score. If we wanted to model the contest on song choice alone, we could fit only the song-related variables:

```
Call:
glm(formula = Safe ~ WhitburnScore, family = binomial, data = SongData)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-1.5172  -1.3885   0.8725   0.9611   1.3168

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)    0.7706726  0.0756893  10.182  < 2e-16 ***
WhitburnScore -0.0002011  0.0000519  -3.875 0.000107 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 2405  on 1830  degrees of freedom
Residual deviance: 2390  on 1829  degrees of freedom
AIC: 2394

Number of Fisher Scoring iterations: 4
```

Of course, the Whitburn score seems much more important when the other variables are not taken into account, since it’s easier to pass the significance test. (Some of that is because the number of data points is higher: not all variables, such as Votefair, are known for all performances.) In fact, it’s quite likely that the WNTS rating is related to the Whitburn score. We can test that with a linear regression, but first look at the sign of the parameter estimate R produced. Safe is being fit against Whitburn score, and the relationship is significant … but you are less safe the more popular the song! The probability of being safe actually goes down when you sing a song that charted heavily.
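To see how big that effect is, we can plug R’s point estimates from the one-variable fit into the logistic curve (a sketch; the coefficients are copied from the output above):

```python
import math

def p_safe(whitburn_score):
    """Predicted probability of being safe under the one-variable fit above."""
    log_odds = 0.7706726 - 0.0002011 * whitburn_score
    return 1 / (1 + math.exp(-log_odds))

print(p_safe(0))     # never-charted song: ~0.68
print(p_safe(6000))  # Radioactive-sized hit: ~0.39
```

The predicted probability of being safe drops from roughly 68% for a song that never charted to roughly 39% for a monster hit at the top of the Whitburn scale.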

To see how these are related, let’s do that linear regression now:

```
Call:
lm(formula = WNTS.Rating ~ WhitburnScore, data = SongData)

Residuals:
Min      1Q  Median      3Q     Max
-53.237 -16.237  -0.292  17.131  45.543

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   55.2371786  0.7801521  70.803  < 2e-16 ***
WhitburnScore -0.0036882  0.0005484  -6.726 2.33e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 21.78 on 1829 degrees of freedom
Multiple R-squared:  0.02414,   Adjusted R-squared:  0.0236
F-statistic: 45.24 on 1 and 1829 DF,  p-value: 2.327e-11
```

Indeed, the WNTS rating decreases with increasing Whitburn score: less popular songs are rated more highly. Audiences actually rate unfamiliar songs better than familiar ones!
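Plugging the fitted line into the two extremes makes the size of the effect concrete (coefficients copied from the regression output above):

```python
def predicted_wnts(whitburn_score):
    """Predicted WNTS rating from the linear fit above."""
    return 55.2371786 - 0.0036882 * whitburn_score

print(predicted_wnts(0))     # never-charted song: ~55.2
print(predicted_wnts(6000))  # Radioactive-sized hit: ~33.1
```

That’s a drop of about 22 rating points from one end of the scale to the other, though with an R-squared of only 0.024 the scatter around the line is enormous.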

What conclusions can we draw from this? A contestant’s overall chances depend mostly on how well they sang and how popular they are, not on what song they chose. But to the degree that song choice does matter, you are better off choosing a song the audience is only marginally familiar with, or one they’ve never heard.

Here’s the data in csv form.

## Handicapping the Top 2 is a fool’s errand

This is an extremely contentious season, with no clear favorite to win. Let’s dive into some historical data to see if we can calibrate our expectations.

The first and most obvious thing to look at is the season-wide approval rating. However, we find no association between this and winning the finale. On average the winner leads the runner-up by just 0.13 points, which is not statistically significant. Kelly Clarkson, David Cook, and Candice Glover all had much higher averages than their opponents, but Bo Bice also had a huge lead on Carrie Underwood, Adam Lambert was comfortably higher than Kris Allen, and Lauren Alaina soundly beat Scotty McCreery by this measure. If we look at only recent scores (from the Top 6 to the Top 3), there is again no real discernible difference.

Second, we could look at how up-and-down the contestant is. Someone with a lot of low scores and a lot of high scores could be more likeable than one with consistently good, but not amazing, scores. Here we find that the winners are more consistent than the runners-up, by about 2 percentage points, measured as the standard deviation of all their scores.
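As a toy illustration of that consistency metric (the score trajectories below are invented, not real contestants):

```python
from statistics import pstdev

# Hypothetical WNTS score trajectories
consistent = [62, 58, 65, 60, 63, 59]  # steady, good-not-great
streaky    = [85, 40, 90, 35, 80, 45]  # big highs, big lows

# Standard deviation of all scores, the consistency measure used here
print(pstdev(consistent))  # ~2.4
print(pstdev(streaky))     # ~22.9
```

By this measure the steady contestant looks much more like a historical winner, even though the streaky one owns the bigger individual moments.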

What about how many times the contestants were in the bottom group? Here there is something: the winners have been in the bottom group on average 0.75 fewer times than the runners-up, and the winner has never been in the bottom group more times than the runner-up. But even this comes with a caveat: Idol has revealed much less information this year about who is in the bottom group. In a past year we might have known that, say, Jena had been in the bottom 2 of the Top 5, but this year we were never told who was the second-lowest vote-getter.

On Votefair, the winner is on average 7.1 points more popular than the runner-up. However, Votefair has had the winner leading only half of the time! It predicted double-digit wins for David Archuleta and Adam Lambert, neither of which came to pass.

What about the number of great moments the contestant has had? We can tally the performances with a WNTS rating greater than, say, 80. What we find is that the winner actually has 1.25 fewer such hits than the runner-up, which is not what we would expect. The numbers are small enough that this may just be noise, since a 79 doesn’t get counted even though it’s very close.

Nothing can be judged from gender either. In male-female finales, men have beaten women 4 times and women have beaten men 3 times (4 years had single-gender finales). I would note, though, that a woman hasn’t beaten a man in a finale since Season 6.

At this point we have to take a step back and ask ourselves whether there is any real season-wide indicator that we trust on who will win. I don’t think there is.

As for the week-to-week variables, by far the most predictive has been Dialidol. But with the rule changes rendering Dialidol mostly irrelevant (and possibly misleading), it’s very difficult to say who is going to win.

My feeling (for what it’s worth) is that this is going to be as close a call as it gets. I’m inclined to call the Season 13 finale a coin flip. And anyone (including Bing) who claims to know who the winner will be is blowing smoke.

## How this year’s Idol is like the financial crisis

The inclusion of Jena Irene in the bottom 3 tonight was a big shock. Though Jena scored a relatively lousy 43 on WhatNotToSing, she was number 1 on Votefair and on Dialidol.

It is not that people who were highest on Votefair have never been eliminated, but this early in the contest it is extremely rare. Siobhan Magnus, Adam Lambert, Thia Megia, and Jessica Sanchez are the only contestants ever to land in the bottom group while most popular on that site during the initial weeks (Siobhan was eliminated and Jessica had to be saved). Normally Votefair is pretty damn predictive this early in the contest. But none of those contestants was also highest on Dialidol. The closest historical parallel to tonight’s result is Adam Lambert, but that was in the Top 5, near the point where Votefair becomes irrelevant. All of which is to say that tonight may be the most surprising result ever with regard to Votefair.

## Top 13 result was the 9th most surprising ever

Given the abysmal performance of the model in ranking the contestants in the Top 13, and given that M.K.’s inclusion was not foreseen at all (at least in my opinion), it’s worth asking how surprising this week’s result was compared to all other Idol episodes. There have been 89 finals episodes over 13 seasons, and this week’s was the 9th most surprising, putting it in the top 10% of episodes in surprising…ness.

## Wildcard picks last fewer rounds in finals, but gender is a factor

Season 12 was unexpectedly short, just 10 weeks of finals, and it had no wildcard picks—all 10 people were voted in. But this year the wildcard was back, and it promoted C.J. Harris, Jena Irene, and Kristen O’Connor to the finals. How long should we expect these contestants to last, given that they didn’t get enough votes to advance from the semifinals?

It’s worth mentioning that we only have 6 seasons to compare against. Seasons 1, 2, 3, 8, 10, and 11 had wildcard picks: 10 women and 7 men. The first season had R.J. Helton; Season 2 had Kimberly Caldwell, Trenyce, and Carmen Rasmusen; Season 3 had Jennifer Hudson (!), George Huff, and Leah LaBelle. Season 8 had four picks: Anoop Desai, Matt Giraud, Jasmine Murray, and Megan Joy. In Season 10, Naima Adedapo, Stefano Langone, and Ashthon Jones were picked, and in Season 11 it was Deandre Brackensick, Erika Van Pelt, and Jeremy Rosado.

To see how long they last, we can first look at the distribution of the number of finals rounds the wildcard contestants competed in before being eliminated. It’s more uniform than you’d expect: