As regular readers of this blog will know, I don’t ascribe as much weight to song choice as other sites do. The popular site “What Not To Sing” seems to rest on the idea that song choice is critical, but I just don’t see it. As people have said many times, if you can sing, you can sing. People have pulled out great performances on strange songs like Hemorrhage (In My Hands) (Chris Daughtry, a 94 WNTS rating in the Men’s Semifinals of Season 5), and people have crashed and burned with ostensibly good choices (how about Paige Miles’ performance of Against All Odds (Take a Look at Me Now) in Season 9’s Top 11?).
But those are just cherry-picked examples. Suppose we look back over all seasons and try to suss out how much song choice actually matters. Unfortunately, we run into a problem right away: any given song will only have been sung a few times in all of Idol history. The most-performed song is I Have Nothing by Whitney Houston, with only 8 performances. I’m not going to draw any conclusions from a sample of 8.
The obvious analytic way out is to group songs by common factors, or otherwise quantify songs along some dimension. One way to do this is with what I’ll call the “Whitburn Score” (my term, don’t bother googling it): the sum, over every week a song spends on the Billboard Hot 100, of 101 minus its chart position that week. A song that charts for one week at position 100 gets a Whitburn Score of 1; a song that never charts gets 0. Imagine Dragons’ Radioactive charted for 74 weeks, and has a Whitburn Score of almost 6000. I’m Yours, Rolling in the Deep, Smooth, and Somebody That I Used to Know all rate near the top.
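As a quick sketch of the definition (the function name is mine, not from any library), the score can be computed directly from a song’s weekly Hot 100 positions:

```python
def whitburn_score(weekly_positions):
    """Sum of (101 - position) over every week the song charted.

    A one-week run at position 100 scores 1; a song that never
    charted (an empty list) scores 0.
    """
    return sum(101 - pos for pos in weekly_positions)

# One week at the very bottom of the chart:
print(whitburn_score([100]))         # 1
# A song that never charted:
print(whitburn_score([]))            # 0
# A run near the top piles up points quickly:
print(whitburn_score([3, 1, 1, 2]))  # 98 + 100 + 100 + 99 = 397
```

A long run near #1, like Radioactive’s, accumulates close to 100 points per week, which is how a score approaches 6000.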
So the first question I have to ask is: does the frequency with which someone is safe in the contest depend at all on a metric like this? If the answer is yes, we ask an even more interesting question: is the more familiar song safer or less safe?
Let’s fit a generalized linear model with a logit link function (a logistic regression) to the historical data. We fit whether a performance was safe or not against a set of potentially relevant variables: the order of the performance, the WNTS Rating, whether or not the contestant had previously been in the bottom group, and several metrics that rate the popularity of the song:
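For reference, the logit link is what lets a linear model predict a probability: it maps the probability of being safe onto the unbounded log-odds scale on which the linear predictor lives. A minimal sketch (not the R fit itself):

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Logistic function: maps log-odds back to a probability."""
    return 1 / (1 + math.exp(-x))

print(inv_logit(0))            # 0.5: zero log-odds is a coin flip
print(round(logit(0.5), 4))    # 0.0
print(round(inv_logit(logit(0.8)), 4))  # 0.8: the two are inverses
```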
Call:
glm(formula = Safe ~ Season + Order + TotPerfs + WNTS.Rating +
    Bottom.Prev + VFPercent + YearOfRanking + YearlyRank + WeeksCharted +
    Charted40OrBetter + Charted10OrBetter + HighestPosition +
    RadioPlays + WhitburnScore, family = binomial, data = SongData)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-2.6702 -0.9666  0.4720  0.8418  1.8900

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)       -1.501e+01  1.415e+01  -1.061   0.2887
Season            -7.699e-02  3.333e-02  -2.310   0.0209 *
Order              1.134e-01  2.760e-02   4.108 3.98e-05 ***
TotPerfs           4.324e-02  4.411e-02   0.980   0.3270
WNTS.Rating        3.007e-02  4.157e-03   7.233 4.73e-13 ***
Bottom.Prev       -8.484e-01  1.730e-01  -4.903 9.44e-07 ***
VFPercent          4.024e-02  7.135e-03   5.640 1.70e-08 ***
YearOfRanking      7.042e-03  7.218e-03   0.976   0.3292
YearlyRank         1.626e-03  1.906e-03   0.853   0.3936
WeeksCharted       2.216e-02  2.172e-02   1.020   0.3077
Charted40OrBetter  9.316e-03  3.306e-02   0.282   0.7781
Charted10OrBetter -3.449e-02  2.450e-02  -1.408   0.1592
HighestPosition   -1.717e-02  9.983e-03  -1.720   0.0854 .
RadioPlays        -1.142e-05  7.494e-05  -0.152   0.8789
WhitburnScore     -4.273e-04  4.336e-04  -0.985   0.3244
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1296.6  on 1007  degrees of freedom
Residual deviance: 1058.4  on  993  degrees of freedom
  (823 observations deleted due to missingness)
AIC: 1088.4

Number of Fisher Scoring iterations: 5
The initial findings are not promising. By far the most significant variables are the WNTS Rating, Votefair popularity, and whether or not the person had previously been in the Bottom Group. Many of the variables are not significant at all. For instance, whether the song is actually played on the radio, which you might think would matter, quite likely makes no difference (at least on the full data set).
Now let’s limit the model to just variables that seem to have a snowball’s chance in hell of mattering:
Call:
glm(formula = Safe ~ Order + WNTS.Rating + Bottom.Prev + VFPercent +
    WeeksCharted + HighestPosition + WhitburnScore, family = binomial,
    data = SongData)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-2.6673 -0.9666  0.4831  0.8441  1.8820

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)     -1.2971209  0.2918839  -4.444 8.83e-06 ***
Order            0.1263050  0.0258266   4.891 1.01e-06 ***
WNTS.Rating      0.0279386  0.0039534   7.067 1.58e-12 ***
Bottom.Prev     -0.9486016  0.1578972  -6.008 1.88e-09 ***
VFPercent        0.0395262  0.0066878   5.910 3.42e-09 ***
WeeksCharted     0.0403195  0.0182330   2.211  0.02701 *
HighestPosition -0.0089462  0.0037305  -2.398  0.01648 *
WhitburnScore   -0.0006699  0.0002277  -2.942  0.00326 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1333.5  on 1037  degrees of freedom
Residual deviance: 1097.9  on 1030  degrees of freedom
  (793 observations deleted due to missingness)
AIC: 1113.9

Number of Fisher Scoring iterations: 5
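Because the model works on the log-odds scale, exponentiating a coefficient gives its multiplicative effect on the odds of being safe. A quick illustration, using two coefficients copied from the summary above:

```python
import math

# Coefficients from the reduced model summary above
bottom_prev = -0.9486016  # was in the bottom group previously
wnts_rating =  0.0279386  # per point of WNTS Rating

# Having been in the bottom group before multiplies the
# odds of being safe by about 0.39 (i.e., cuts them ~61%):
print(round(math.exp(bottom_prev), 3))  # 0.387

# Each additional WNTS point multiplies the odds by ~1.028:
print(round(math.exp(wnts_rating), 3))  # 1.028
```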
We can see that safety still depends mostly on how popular the contestant is and how well they sang the song. I am fairly confident that the idea that “song choice is everything” is really not the case.
But anyway, this article is focusing on song choice, and we do see some effect from how well the song did on the Billboard Hot 100 charts, as captured by metrics like the Whitburn Score. If we wanted to model the contest that way alone, we could fit based on only the song-related variables:
Call:
glm(formula = Safe ~ WhitburnScore, family = binomial, data = SongData)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.5172 -1.3885  0.8725  0.9611  1.3168

Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)    0.7706726  0.0756893  10.182  < 2e-16 ***
WhitburnScore -0.0002011  0.0000519  -3.875 0.000107 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2405  on 1830  degrees of freedom
Residual deviance: 2390  on 1829  degrees of freedom
AIC: 2394

Number of Fisher Scoring iterations: 4
Of course, the Whitburn Score now seems much more important when the other variables are not taken into account, as it’s easier to pass the significance test. (NB: some of that is because the number of data points is now higher, since not all variables, such as Votefair popularity, are known for every performance.) It’s also quite likely that the WNTS Rating is related to the Whitburn Score, which we can test with a linear regression. But before we do, look at the parameter estimate generated by R. Safe is being fit versus Whitburn Score, which is significant … but you are less safe the more popular the song was! The probability of being safe is actually reduced by singing a song that charted heavily.
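To put that in concrete terms, here is a sketch of the fitted probability of being safe, using the intercept and slope copied from the single-variable model summary, at a Whitburn Score of 0 versus one near Radioactive’s roughly 6000:

```python
import math

def p_safe(score, intercept=0.7706726, slope=-0.0002011):
    """Fitted P(safe) from the Safe ~ WhitburnScore logistic model:
    inverse-logit of (intercept + slope * score)."""
    return 1 / (1 + math.exp(-(intercept + slope * score)))

# A song that never charted:
print(round(p_safe(0), 3))     # 0.684
# A massive, long-running chart hit:
print(round(p_safe(6000), 3))  # 0.393
```

On this fit alone, singing a huge hit drops the expected probability of being safe from roughly two-thirds to under 40%.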
To see how these are related, let’s do that linear regression now:
Call:
lm(formula = WNTS.Rating ~ WhitburnScore, data = SongData)

Residuals:
    Min      1Q  Median      3Q     Max
-53.237 -16.237  -0.292  17.131  45.543

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)   55.2371786  0.7801521  70.803  < 2e-16 ***
WhitburnScore -0.0036882  0.0005484  -6.726 2.33e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 21.78 on 1829 degrees of freedom
Multiple R-squared:  0.02414,	Adjusted R-squared:  0.0236
F-statistic: 45.24 on 1 and 1829 DF,  p-value: 2.327e-11
Indeed, the WNTS Rating decreases as the Whitburn Score increases: less popular songs are rated more highly. People actually rate unfamiliar songs better than familiar ones!
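Plugging a couple of scores into the fitted line (coefficients copied from the regression output above) makes the size of the effect concrete:

```python
# Coefficients from the WNTS.Rating ~ WhitburnScore fit
INTERCEPT = 55.2371786
SLOPE = -0.0036882

def predicted_wnts(score):
    """Predicted WNTS Rating for a given Whitburn Score."""
    return INTERCEPT + SLOPE * score

# A song that never charted:
print(round(predicted_wnts(0), 1))     # 55.2
# A score near Radioactive's ~6000:
print(round(predicted_wnts(6000), 1))  # 33.1
```

That is a drop of more than 20 rating points between an uncharted song and a megahit, albeit with a very small R-squared, so the Whitburn Score explains only a sliver of the variation in ratings.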
What conclusions can we draw from this? A contestant’s chances depend mostly on how well they sang and how popular they are, not on what song they chose. But to the degree that song choice does matter, you are better off choosing a song the audience is only marginally familiar with, or one they’ve never heard.
Here’s the data in csv form.