Initial Introduction to SpotifyR with KMeans

15 Aug 2020

In 2015, Spotify changed the game. The Sweden-based music streaming service gave its users an inside look at their listening habits for the previous year. The data dump, now called “Spotify Wrapped”, is a major content release on social media; it is nearly impossible to go a day in December without seeing a beautiful graphic describing someone’s favorite artists.

Spotify Wrapped, along with my recent obsession with music and constant quest to find new data, led me to look for a way to further interact with my Spotify data. Using the company’s API and spotifyR, a package in R to assist with using the API, I recently used the huge amounts of personal, artist, and track data to create another summary of my listening habits as well as analyze one of my favorite playlists.

When looking at Spotify’s API and the spotifyR package, it is immediately easy to see how applicable and interesting the data can be. The API and package offer a number of interesting applications, including song-specific metadata, playlist recommendations, and a slew of “biographical” information on each song and album. Additionally, users can investigate their own listening habits, finding their top 50 artists and top 50 songs over a span of four weeks, six months, or since the creation of their Spotify account.

After playing around with the data for a little bit, I wanted to show my friends my listening habits in an aesthetic fashion. Simple tables are fine, but they are not pretty to look at or easy to share on social media. Luckily, spotifyR includes a link to each artist’s or song’s cover art. The images are artist- or album-specific, allowing for identification of each artist in my top artists and the album affiliation of each song in my top songs. Each image itself is interesting, of course, but a better, more concise picture of my listening habits would include all of my top 50 artists or songs. This motivation led me to create a collage of the cover art of my top 49 artists (a perfect 7x7 square!). The result: a beautiful summary of my favorite artists in an easy-to-understand, easy-to-share image.

Artist Cover Art Collage

This image is a summary of my top 49 artists in April 2020. As you can see, The Weeknd was my top artist throughout that time, followed by Tame Impala, Lil Uzi Vert, and Caribou, each of whom had released an album in the previous two months.

After finishing this visualization project, I wanted to do something a little more analytical. I had recently started a playlist titled “elán vital”, which I was attempting to give a certain “mood” while incorporating a number of different genres. This playlist was taking me out of my comfort zone; it included songs ranging from Michael Jackson’s jazzy “Baby Be Mine” to rapper JID’s mellow trap R&B song “Hereditary”. Because of the multiple genres in the playlist, I wanted to see if songs from similar genres would be clustered together, or if some other factor would influence their grouping. To accomplish this, I used Spotify’s audio features, which attempt to describe songs using a number of different data points.

The Spotify audio features I used fall into seven categories: danceability, energy, speechiness, acousticness, instrumentalness, liveness, and valence (it is important to note that lyrics and thematic elements of songs are not included in this analysis, at least not yet). To find these clusters, I used an unsupervised machine learning technique called KMeans clustering. KMeans clustering is among the most used clustering techniques in the analytics world; it partitions the data into a chosen number of groups so that each observation sits in the group whose center (mean) is nearest to it. The technique will be implemented with the cluster package in R.

To begin, I prepared my playlist’s data with spotifyr::get_playlist_audio_features() and created a dataframe containing only each song’s audio features. Next, I arbitrarily selected the number of centers, or number of clusters, I wanted the algorithm to search for in my data. In this case, I selected three clusters.
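As a minimal sketch of this step: the post does not show its exact call, so the example below assumes base R's kmeans() and uses a synthetic stand-in for the playlist dataframe so that it runs without Spotify credentials. The column names follow the audio features listed above; the values, the number of songs, and the nstart setting are all assumptions made for illustration.

```r
# Synthetic stand-in for the audio-feature columns returned by
# spotifyr::get_playlist_audio_features(); values are random, NOT real data.
set.seed(42)
n_songs <- 40  # hypothetical playlist length
features <- data.frame(
  danceability     = runif(n_songs),
  energy           = runif(n_songs),
  speechiness      = runif(n_songs),
  acousticness     = runif(n_songs),
  instrumentalness = runif(n_songs),
  liveness         = runif(n_songs),
  valence          = runif(n_songs)
)

# Fit k-means with three centers. nstart > 1 reruns the algorithm from
# several random starting points and keeps the best solution.
fit <- kmeans(features, centers = 3, nstart = 25)

# One cluster label (1, 2, or 3) per song, and how many songs fall in each.
cluster_sizes <- table(fit$cluster)
print(cluster_sizes)
```

With real data, features would instead be the seven audio-feature columns selected from the playlist dataframe.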

Along with KMeans, I performed principal component analysis, a dimensionality reduction technique, on my data. KMeans and principal component analysis (PCA) have a deep relationship. By performing PCA on our data, we now have seven new variables (the principal components), with the top two accounting for over 50% of the variance in the data. Next, we can plot the transformed data, labeled by song title, on the top two principal components. The plot can also be drawn with the loadings of the original variables shown as vectors. Unfortunately, we cannot easily see the clusters in this plot, an issue that the factoextra package in R easily solves.
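The PCA step might look roughly like the sketch below. It assumes base R's prcomp() with centering and scaling, which the post does not confirm, and it again runs on random stand-in data rather than the real playlist.

```r
# Random stand-in for the seven audio features (NOT real playlist data).
set.seed(42)
features <- as.data.frame(matrix(runif(40 * 7), ncol = 7))
names(features) <- c("danceability", "energy", "speechiness", "acousticness",
                     "instrumentalness", "liveness", "valence")

# PCA on centered, scaled features (scaling is an assumption here).
pca <- prcomp(features, center = TRUE, scale. = TRUE)

# Proportion of total variance captured by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(cumsum(var_explained), 2))

# Coordinates of each song on the first two principal components;
# these are the x and y positions used in a two-dimensional cluster plot.
scores <- pca$x[, 1:2]
```

The first two entries of var_explained are what determine how much of the data a two-dimensional plot can show; pca$rotation holds the loadings of the original variables.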

PCA and KMeans Plots

As you can see, the two plots are essentially the same: the axes and the relative location of each song are identical, but the plot on the right makes the clusters easy to see.

While that looks pretty good, let’s see if we can do any better. If you recall, I arbitrarily chose the number of clusters for our initial investigation; let’s try to fine-tune that and rely on math to give us the optimal number of clusters.

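To start, it is important to know that there are three main methods of fine-tuning KMeans models: the elbow method, the silhouette method, and the gap statistic method. The elbow method plots the total within-cluster sum of squares against the number of clusters and looks for the “elbow”, the point after which adding another cluster yields only a small further reduction. The silhouette method attempts to determine how well each object lies within its cluster by comparing its average distance to the other members of its own cluster with its average distance to the members of the nearest neighboring cluster; the number of clusters with the highest average silhouette width is preferred. The gap statistic method compares the intracluster variation to what would be expected under a null reference distribution with no clustering, searching for the number of clusters that provides the largest “improvement” over that null. Each plot can be seen below.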
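A rough sketch of how the three diagnostics can be computed by hand is shown below. It assumes base R's kmeans() together with silhouette() and clusGap() from the cluster package, and it runs on random stand-in data; the range of k, nstart, and the number of bootstrap samples B are illustrative choices, not the settings used for the plots in this post.

```r
library(cluster)  # provides silhouette() and clusGap()

# Random stand-in for the seven audio features (NOT real playlist data).
set.seed(42)
features <- matrix(runif(40 * 7), ncol = 7)
ks <- 2:8  # candidate numbers of clusters (illustrative range)

# Elbow method: total within-cluster sum of squares for each k.
wss <- sapply(ks, function(k) {
  kmeans(features, centers = k, nstart = 25)$tot.withinss
})

# Silhouette method: average silhouette width for each k.
d <- dist(features)
avg_sil <- sapply(ks, function(k) {
  cl <- kmeans(features, centers = k, nstart = 25)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})

# Gap statistic: compares log(within-cluster dispersion) with its
# expectation under a uniform reference distribution, for k = 1..8.
gap <- clusGap(features, FUNcluster = kmeans, K.max = 8, B = 50,
               nstart = 25, verbose = FALSE)

best_sil_k <- ks[which.max(avg_sil)]        # k with the widest silhouettes
best_gap_k <- which.max(gap$Tab[, "gap"])   # k with the largest gap
```

Because the stand-in data is uniform noise, the suggested values of k here carry no meaning; the point is only the shape of each calculation. The factoextra package wraps similar calculations in a plotting helper.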

Tuning Plots

According to our plots, the within-cluster sum of squares in the elbow plot decreases continuously with no sharp elbow, meaning it favors more clusters rather than fewer. The silhouette method has its maximum value at five clusters. The gap statistic method, on the other hand, tells us that just one cluster is optimal, suggesting that the clustering structure is no stronger than what a null distribution would produce. Because the silhouette method suggests five clusters and the elbow method supports it as a good selection, we will continue with five centers in our model.

After rerunning the model, we get a new plot with new clusters:

KMeans Plots

In cluster one, in the top right, we have songs ranging from iconic artists such as Michael Jackson and The Isley Brothers to more current, lowkey artists such as Yves Tumor and KAYTRANADA. The second cluster is made up of a number of recent songs with a notable instrumental section, giving this cluster the highest mean acousticness and mean instrumentalness of our five clusters. The third cluster contains the most danceable songs of any of the five clusters, including songs from Bobby Caldwell to Clairo. In cluster four, the four songs are by artists better known for their production and instrumentals than their lyrics, and the cluster has the highest mean valence and mean energy of any cluster. The final cluster has the highest mean speechiness along with the lowest mean valence of the five groups, containing odes to failed relationships and the beginnings of new ones.

As you can see in the above plot, two of our clusters overlap. While this seems incorrect, it is important to remember that the current plot uses only two of the seven principal components created by PCA. These two components account for 54% of the variance in the data, so the remaining 46% is not shown in the plot. The KMeans analysis of elán vital resulted in a within-cluster sum of squares of 0.43 for cluster one, 0.56 for cluster two, 0.66 for cluster three, 0.36 for cluster four, and 0.44 for cluster five, with 69.4% of the variation in the data explained by the clusters.
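For readers wondering where numbers like these come from, here is a small sketch assuming the model is a base R kmeans() fit. The data is a random stand-in, so the values it prints will not match the ones reported above.

```r
# Random stand-in for the seven audio features (NOT real playlist data).
set.seed(42)
features <- matrix(runif(40 * 7), ncol = 7)

fit <- kmeans(features, centers = 5, nstart = 25)

# Within-cluster sum of squares, one value per cluster.
print(round(fit$withinss, 2))

# Share of total variation explained by the clustering:
# between-cluster sum of squares divided by total sum of squares.
pct_explained <- 100 * fit$betweenss / fit$totss
print(round(pct_explained, 1))
```

The total sum of squares always splits into the within-cluster and between-cluster parts, so a higher percentage means tighter clusters relative to the overall spread of the songs.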

Overall, these strong measures show that even though elán vital has a consistent “mood” throughout the playlist, there are still specific groupings within it. In addition to the information gleaned from this analysis, I could use the clusters as an additional variable in future analysis. Altogether, I am only scratching the surface of this data’s potential. In later posts, I will analyze the lyrics and themes present in albums, combine lyrical themes with each song’s audio features, further investigate my listening habits, and more.