# Song choice revisited

As regular readers of this blog will know, I don’t ascribe as much weight to the idea of song choice as other sites do. The popular site “What Not To Sing” seems to rest on the premise that song choice is critical, but I just don’t see it. As people have said many times, if you can sing, you can sing. People have pulled out great performances on strange songs like Hemorrhage (In My Hands) (Chris Daughtry, a 94 WNTS rating in the Men’s Semifinals of Season 5), and people have crashed and burned with what are ostensibly good choices (how about Paige Miles’ performance of Against All Odds (Take a Look at Me Now) in Season 9’s Top 11?).

But those are just cherry-picked examples. Suppose we look back at all seasons and try to suss out what song choice is actually worth. Unfortunately, we run into a problem right away: any given song has only been sung a few times in all of Idol history. The most-sung is I Have Nothing by Whitney Houston, with only 8 performances. I’m not going to draw any conclusions from a sample of 8.

The obvious analytic way out of this is to group songs by common factors, or otherwise quantify songs along some dimension. One way to do this is with the “Whitburn score” (my term, don’t bother googling it). We define the score as the sum, over every week the song charts on the Billboard Hot 100, of 101 minus its chart position that week. So a song that charts for one week at position 100 gets a Whitburn score of 1, and a song that never charts gets a score of 0. Imagine Dragons’ Radioactive charted for 74 weeks and has a Whitburn score of almost 6000. I’m Yours, Rolling in the Deep, Smooth, and Somebody That I Used to Know all rate near the top as well.
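The definition is simple enough to compute directly from a song’s weekly chart history. Here is a minimal Python sketch (the function name and the sample chart runs are my own illustration, not drawn from the actual chart data):

```python
def whitburn_score(weekly_positions):
    """Sum of (101 - Hot 100 position) over every week the song charted.

    A song that never charts has an empty history, so its score is 0.
    """
    return sum(101 - pos for pos in weekly_positions)

# One week at position 100 gives the minimum nonzero score of 1.
print(whitburn_score([100]))  # 1

# A song that never charted scores 0.
print(whitburn_score([]))     # 0

# A run near the top of the chart piles up points quickly, which is
# how a 74-week hit like Radioactive ends up near 6000.
print(whitburn_score([5, 3, 1, 1, 2, 4, 10]))  # 681
```

Weeks spent near #1 contribute close to 100 points each, so long-charting hits dominate the top of the list.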

So the first question to ask is: does the frequency with which someone is safe in the contest depend at all on a metric like this? If the answer is yes, there’s an even more interesting follow-up: does a more familiar song make you safer or less safe?

Let’s fit a logistic regression (a generalized linear model with a logit link) to the historical data. We model whether a performance was safe or not against a set of potentially relevant variables, such as the order of the performance, the WNTS rating, whether or not the contestant was previously in the bottom group, and several metrics of song popularity.

```
Call:
glm(formula = Safe ~ Season + Order + TotPerfs + WNTS.Rating +
Bottom.Prev + VFPercent + YearOfRanking + YearlyRank + WeeksCharted +
Charted40OrBetter + Charted10OrBetter + HighestPosition +
RadioPlays + WhitburnScore, family = binomial, data = SongData)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-2.6702  -0.9666   0.4720   0.8418   1.8900

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)       -1.501e+01  1.415e+01  -1.061   0.2887
Season            -7.699e-02  3.333e-02  -2.310   0.0209 *
Order              1.134e-01  2.760e-02   4.108 3.98e-05 ***
TotPerfs           4.324e-02  4.411e-02   0.980   0.3270
WNTS.Rating        3.007e-02  4.157e-03   7.233 4.73e-13 ***
Bottom.Prev       -8.484e-01  1.730e-01  -4.903 9.44e-07 ***
VFPercent          4.024e-02  7.135e-03   5.640 1.70e-08 ***
YearOfRanking      7.042e-03  7.218e-03   0.976   0.3292
YearlyRank         1.626e-03  1.906e-03   0.853   0.3936
WeeksCharted       2.216e-02  2.172e-02   1.020   0.3077
Charted40OrBetter  9.316e-03  3.306e-02   0.282   0.7781
Charted10OrBetter -3.449e-02  2.450e-02  -1.408   0.1592
HighestPosition   -1.717e-02  9.983e-03  -1.720   0.0854 .
WhitburnScore     -4.273e-04  4.336e-04  -0.985   0.3244
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1296.6  on 1007  degrees of freedom
Residual deviance: 1058.4  on  993  degrees of freedom
(823 observations deleted due to missingness)
AIC: 1088.4

Number of Fisher Scoring iterations: 5
```

The initial findings are not promising for song choice. By far the most significant variables are the WNTS rating, Votefair popularity, and whether or not the person had previously been in the bottom group. Many of the variables are not significant at all. Take RadioPlays, for instance: you might think whether the song actually gets played on the radio would matter, but it quite likely makes no difference (at least on the full data set).

Now let’s limit the model to just variables that seem to have a snowball’s chance in hell of mattering:

```
Call:
glm(formula = Safe ~ Order + WNTS.Rating + Bottom.Prev + VFPercent +
WeeksCharted + HighestPosition + WhitburnScore, family = binomial,
data = SongData)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-2.6673  -0.9666   0.4831   0.8441   1.8820

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)     -1.2971209  0.2918839  -4.444 8.83e-06 ***
Order            0.1263050  0.0258266   4.891 1.01e-06 ***
WNTS.Rating      0.0279386  0.0039534   7.067 1.58e-12 ***
Bottom.Prev     -0.9486016  0.1578972  -6.008 1.88e-09 ***
VFPercent        0.0395262  0.0066878   5.910 3.42e-09 ***
WeeksCharted     0.0403195  0.0182330   2.211  0.02701 *
HighestPosition -0.0089462  0.0037305  -2.398  0.01648 *
WhitburnScore   -0.0006699  0.0002277  -2.942  0.00326 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1333.5  on 1037  degrees of freedom
Residual deviance: 1097.9  on 1030  degrees of freedom
(793 observations deleted due to missingness)
AIC: 1113.9

Number of Fisher Scoring iterations: 5
```
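The coefficients above are on the log-odds scale, so turning them into a probability of being safe means applying the inverse logit. As a sketch, here is the reduced model’s linear predictor evaluated for a hypothetical performance (the coefficients come from the R output above; the input values are made up purely for illustration):

```python
import math

# Coefficients from the reduced glm fit above.
coef = {
    "(Intercept)":     -1.2971209,
    "Order":            0.1263050,
    "WNTS.Rating":      0.0279386,
    "Bottom.Prev":     -0.9486016,
    "VFPercent":        0.0395262,
    "WeeksCharted":     0.0403195,
    "HighestPosition": -0.0089462,
    "WhitburnScore":   -0.0006699,
}

# Hypothetical performance: fifth in the running order, a decent WNTS
# rating, never previously in the bottom group, moderate Votefair
# support, and a modestly successful song.
x = {
    "Order": 5, "WNTS.Rating": 60, "Bottom.Prev": 0, "VFPercent": 10,
    "WeeksCharted": 10, "HighestPosition": 20, "WhitburnScore": 500,
}

eta = coef["(Intercept)"] + sum(coef[k] * v for k, v in x.items())
p_safe = 1 / (1 + math.exp(-eta))  # inverse logit
print(round(p_safe, 3))            # about 0.785
```

Flipping Bottom.Prev to 1 knocks almost a full point off the log-odds, which is why previous bottom-group membership is such a strong signal.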

We can see that safety is still mostly dependent on how popular the contestant is and how well they sang the song. I am fairly confident that “song choice is everything” is really not the case.

But this article is focused on song choice, and we do see some effect from how well a song did on the Billboard Hot 100, such as the Whitburn score. If we wanted to model the contest that way alone, we could fit based only on song-related variables:

```
Call:
glm(formula = Safe ~ WhitburnScore, family = binomial, data = SongData)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-1.5172  -1.3885   0.8725   0.9611   1.3168

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)    0.7706726  0.0756893  10.182  < 2e-16 ***
WhitburnScore -0.0002011  0.0000519  -3.875 0.000107 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 2405  on 1830  degrees of freedom
Residual deviance: 2390  on 1829  degrees of freedom
AIC: 2394

Number of Fisher Scoring iterations: 4
```

Of course, the Whitburn score seems much more important when the other variables are not taken into account, as it’s easier to pass the significance test. (Some of that is because the number of data points is higher: not all variables, such as the Votefair percentage, are known for all performances.) In fact, it’s quite likely that the WNTS rating is related to the Whitburn score. We can test this with a linear regression, but before we do, look at the sign of the parameter estimate R generated. Safe is being fit versus Whitburn score, and the coefficient is significant … but negative! You are less safe the more popular the song was: the probability of being safe is actually reduced by singing a song that charted a lot.
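To put that negative coefficient in perspective, here is what the single-variable fit above implies for two songs at opposite ends of the scale (the intercept and slope come from the R output; the score values themselves are just illustrative):

```python
import math

def p_safe(score):
    # Single-variable logistic fit from above: Safe ~ WhitburnScore.
    eta = 0.7706726 - 0.0002011 * score
    return 1 / (1 + math.exp(-eta))  # inverse logit

# A song that never charted.
print(round(p_safe(0), 3))     # about 0.684

# A major, long-charting hit.
print(round(p_safe(3000), 3))  # about 0.542
```

So, with everything else ignored, this crude model says singing a monster hit instead of a non-charting song costs on the order of 14 percentage points of safety.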

To see how these are related, let’s do that linear regression now:

```
Call:
lm(formula = WNTS.Rating ~ WhitburnScore, data = SongData)

Residuals:
Min      1Q  Median      3Q     Max
-53.237 -16.237  -0.292  17.131  45.543

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   55.2371786  0.7801521  70.803  < 2e-16 ***
WhitburnScore -0.0036882  0.0005484  -6.726 2.33e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 21.78 on 1829 degrees of freedom
Multiple R-squared:  0.02414,   Adjusted R-squared:  0.0236
F-statistic: 45.24 on 1 and 1829 DF,  p-value: 2.327e-11
```

Indeed, the WNTS rating decreases as the Whitburn score increases: less popular songs are rated more highly. People actually rate performances of unfamiliar songs better than familiar ones!
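For scale, the fitted line above predicts roughly how many WNTS points a heavily charting song costs (coefficients from the R output; the score values are illustrative):

```python
def predicted_wnts(score):
    # Linear fit from above: WNTS.Rating ~ WhitburnScore.
    return 55.2371786 - 0.0036882 * score

print(round(predicted_wnts(0), 1))     # 55.2: a song that never charted
print(round(predicted_wnts(5000), 1))  # 36.8: a top-of-the-charts staple
```

That said, with an R-squared of only about 0.024, the Whitburn score explains a small sliver of the variation in WNTS ratings; the trend is real but weak.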

What conclusions can we draw from this? A contestant’s chances are mostly dependent on how well they sang and how popular they are, not on what song they chose. But to the degree that song choice does matter, you are better off choosing a song the audience is only marginally familiar with, or one they’ve never heard.

Here’s the data in csv form.