Neo4j Playlist Builder: Organize your Music with Graph Data Science

September 23, 2020 - Neo4j Projects - 12 minute read


Not too long ago my “Liked songs” list on Spotify surpassed 1900 items. I was a little daunted by the big pile of uncategorized songs, and obviously far too lazy to organize them manually. So I wondered: why not build a tool to do it for me?

This post describes how I built a tool to organize my songs into playlists using Neo4j and (graph) data science:

  1. Using Python, I load data directly into Neo4j using the Spotify API.
  2. I run (graph) algorithms to build playlists with similar genres and sound.
  3. I use keywords to dynamically name playlists.
  4. Finally, I use the resulting graph to create playlists in Spotify.

As always, my code is available on GitHub. Instructions for building your own playlists are also there. Let’s dig in.

1. Building the Spotify music graph

First, I’m building a graph that contains all the info on my music, including tracks, albums, artists and genres. These features will eventually be used to generate the playlists.

My initial graph model is shown below:

[Figure: initial graph model]

Using the Spotify API

Using the Spotify API and the Neo4j Python driver, I’ll load my music details into a graph. The Python library spotipy is a neat wrapper that lets you read from and write to the API using plain Python objects:

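import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
# client and secret are the credentials of your Spotify developer app
spotify = spotipy.Spotify(client_credentials_manager=
    SpotifyClientCredentials(client_id=client, client_secret=secret))
results = spotify.playlist(playlist_uri)['tracks']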

This call returns the tracks as Python dict objects. After some basic preprocessing, I create a fresh Neo4j database and use the Neo4j Python driver to create my nodes and relationships:

# session is assumed to be an open Neo4j driver session
session.run("UNWIND $tracks AS track CREATE (t:Track {id: track.id}) SET t = track",
            parameters={'tracks': list(tracks.values())})
session.run("UNWIND $albums AS album CREATE (a:Album {id: album.id}) SET a = album",
            parameters={'albums': list(albums.values())})
# Link tracks to albums
session.run("MATCH (t:Track), (a:Album {id: t.album}) CREATE (t)-[:IN_ALBUM]->(a);")
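
The “basic preprocessing” step is not shown here. As a rough sketch, assuming each playlist item has the shape the Spotify Web API documents (a `track` dict with a nested `album` and a list of `artists`), it could flatten the response into the id-keyed `tracks` and `albums` dicts that the queries read. The helper name `flatten_items` and the choice of fields are assumptions for illustration, not necessarily what my script does:

```python
def flatten_items(items):
    """Flatten Spotify playlist items into id-keyed track and album dicts.

    Hypothetical helper: only flat values are kept, because Neo4j node
    properties cannot hold nested maps.
    """
    tracks, albums = {}, {}
    for item in items:
        track = item.get("track")
        if not track or not track.get("id"):
            continue  # skip entries without a usable track id
        album = track["album"]
        tracks[track["id"]] = {
            "id": track["id"],
            "name": track["name"],
            "album": album["id"],  # used later to link Track -> Album
            "artists": [artist["id"] for artist in track["artists"]],
        }
        albums[album["id"]] = {"id": album["id"], "name": album["name"]}
    return tracks, albums


# Made-up sample data in the assumed shape
sample = [
    {"track": {"id": "t1", "name": "Song A",
               "album": {"id": "a1", "name": "Album X"},
               "artists": [{"id": "r1", "name": "Artist 1"}]}},
    {"track": None},
]
tracks, albums = flatten_items(sample)
print(tracks["t1"]["album"], list(albums))
```

Keeping the album id on the track is what makes the “Link tracks to albums” query possible.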

As a bonus, we can use the Spotify API to load the audio features of each track into our ‘Track’ nodes, which tell us how each song sounds:

  "duration_ms" : 255349,
  "key" : 5,
  "mode" : 0,
  "time_signature" : 4,
  "acousticness" : 0.514,
  "danceability" : 0.735,
  "energy" : 0.578,
  "instrumentalness" : 0.0902,
  "liveness" : 0.159,
  "loudness" : -11.840,
  "speechiness" : 0.0461,
  "valence" : 0.624,
  "tempo" : 98.002,
  "id" : "06AKEBrKUckW0KREUWRnvT",
  "uri" : "spotify:track:06AKEBrKUckW0KREUWRnvT",
  "track_href" : "",
  "analysis_url" : "",
  "type" : "audio_features"

Here’s what the resulting graph looks like:

[Figure: the resulting music graph]
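
To get the audio features shown above onto the track dicts before loading them, something like the following could work. It is a minimal sketch: `merge_audio_features` and the subset of keys in `AUDIO_KEYS` are my own illustrative choices. With spotipy, the `features` list would come from `spotify.audio_features(ids)`, requested in batches because the endpoint only accepts a limited number of ids per call.

```python
AUDIO_KEYS = ("energy", "valence", "danceability", "tempo")  # illustrative subset


def merge_audio_features(tracks, features):
    """Copy selected audio features onto the matching id-keyed track dicts."""
    for feature in features:
        if feature is None:  # no audio features available for this track
            continue
        track = tracks.get(feature["id"])
        if track is not None:
            track.update({key: feature[key] for key in AUDIO_KEYS})
    return tracks


tracks = {"06AKEBrKUckW0KREUWRnvT": {"id": "06AKEBrKUckW0KREUWRnvT"}}
features = [
    {"id": "06AKEBrKUckW0KREUWRnvT", "energy": 0.578, "valence": 0.624,
     "danceability": 0.735, "tempo": 98.002},
    None,
]
merge_audio_features(tracks, features)
print(tracks["06AKEBrKUckW0KREUWRnvT"]["valence"])
```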

2. Generating the playlists

My ultimate goal is to build playlists with similar-sounding music. My first instinct was to group songs by genre, so let’s see how many genres we have:


The result? My 1900 songs are spread out over a total of 740 genres. (Fun fact: Spotify has over 1000 genres in total.) Obviously we don’t want 740 playlists, so let’s try something else.

Genre clustering with Graph Data Science

My next step was to group the 740 genres into clusters, so I can create a playlist for each cluster. From my graph I see that artists often have multiple genres associated with them, so I can use the artist-genre relationship to build clusters of similar genres. With the Neo4j Graph Data Science library, we can run the node similarity algorithm to calculate similarity scores between genres:

CALL gds.graph.create.cypher('genre-has-artist',
    'MATCH (p) WHERE p:Artist OR p:Genre RETURN id(p) as id',
    'MATCH (t:Artist)-[:HAS_GENRE]->(g:Genre) 
     RETURN id(g) AS source, id(t) AS target')
CALL gds.nodeSimilarity.write('genre-has-artist', 
    {writeRelationshipType: 'SIMILAR_TO', writeProperty: 'score'})

We now have an interconnected (weighted) graph of genres:

[Figure: weighted graph of similar genres]
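
To make the similarity score less of a black box: by default, node similarity scores two nodes by the Jaccard similarity of their neighbor sets. For genres, that is the number of artists two genres share, divided by the number of artists that have either genre. The sketch below recomputes that score in plain Python on made-up data. It ignores the library's tuning options, and `genre_similarity` is a name I picked for illustration.

```python
from itertools import combinations


def genre_similarity(artist_genres):
    """Jaccard similarity between genres, based on the artists they share."""
    artists_by_genre = {}
    for artist, genres in artist_genres.items():
        for genre in genres:
            artists_by_genre.setdefault(genre, set()).add(artist)
    scores = {}
    for a, b in combinations(sorted(artists_by_genre), 2):
        shared = artists_by_genre[a] & artists_by_genre[b]
        if shared:  # only genre pairs with at least one common artist
            union = artists_by_genre[a] | artists_by_genre[b]
            scores[(a, b)] = len(shared) / len(union)
    return scores


artist_genres = {
    "artist_1": ["classic rock", "folk rock"],
    "artist_2": ["classic rock", "folk rock", "soft rock"],
    "artist_3": ["soft rock"],
}
scores = genre_similarity(artist_genres)
for pair, score in scores.items():
    print(pair, round(score, 2))
```

Genres that always appear on the same artists score 1.0, while genres that share only one of three artists score about 0.33.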

Next, we can cluster genres using the Louvain algorithm:

CALL gds.graph.create('genre-similar-to-genre', 'Genre',
    {SIMILAR_TO: {orientation: 'NATURAL'}}, 
    { relationshipProperties: 'score'})
CALL gds.louvain.write('genre-similar-to-genre', 
    { relationshipWeightProperty: 'score', 
    writeProperty: 'community'})
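
Louvain works by repeatedly moving nodes between communities whenever that increases a score called modularity. The sketch below is not the algorithm itself, only the weighted modularity score it tries to maximize, written in plain Python. It treats the similarity edges as undirected, which is a simplifying assumption on my part, and `modularity` is an illustrative helper rather than a library call.

```python
def modularity(edges, community):
    """Weighted modularity of a partition of an undirected graph.

    edges: list of (node_a, node_b, weight); community: dict node -> label.
    """
    total = sum(weight for _, _, weight in edges)  # total edge weight
    degree = {}    # summed weighted degree per community
    internal = {}  # weight of edges that stay inside a community
    for a, b, weight in edges:
        degree[community[a]] = degree.get(community[a], 0.0) + weight
        degree[community[b]] = degree.get(community[b], 0.0) + weight
        if community[a] == community[b]:
            internal[community[a]] = internal.get(community[a], 0.0) + weight
    return sum(
        internal.get(label, 0.0) / total - (deg / (2 * total)) ** 2
        for label, deg in degree.items()
    )


# Two tight groups of genres joined by a single weak link (made-up data)
edges = [
    ("rock", "folk rock", 1.0), ("folk rock", "soft rock", 1.0),
    ("rock", "soft rock", 1.0), ("house", "disco", 1.0),
    ("disco", "nu disco", 1.0), ("house", "nu disco", 1.0),
    ("soft rock", "disco", 1.0),
]
split = {"rock": 0, "folk rock": 0, "soft rock": 0,
         "house": 1, "disco": 1, "nu disco": 1}
lumped = {genre: 0 for genre in split}
print(modularity(edges, split) > modularity(edges, lumped))
```

A partition that follows the two groups scores higher than lumping everything together, which is the kind of split Louvain searches for.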

This gives me a set of 15 genre clusters. Each cluster essentially represents a ‘super genre’: an overall grouping of genres that appear together. I create these ‘super genres’ as nodes in my graph:

MATCH (g:Genre)
WITH DISTINCT g.community AS community
CREATE (s:SuperGenre {id: community})
WITH s
MATCH (g:Genre {community: s.id})
CREATE (g)-[:PART_OF]->(s)

The result - each super genre groups similar genres together:

| Supergenre | Genres |
| 591 | [“psychedelic rock”, “british folk”, “folk rock”, “glam rock”, “blues rock”, “traditional folk”, “country rock”, “mellow gold”, “scottish singer-songwriter”, “british soul”, “glam metal”, “metal”, “hard rock”, “canadian psychedelic”, “women’s music”, “canadian singer-songwriter”, “symphonic rock”, “lilith”, “heartland rock”, “classic rock”, “rock drums”, “supergroup”, “yacht rock”, “american folk revival”, “singer-songwriter”, “beatlesque”, “soft rock”] |
| 260 | [“vocal jazz”, “swing”, “lounge”, “vocal harmony group”, “soul jazz”, “adult standards”, “ballroom”, “easy listening”, “torch song”, “jazz trombone”, “jazz blues”] |
| 510 | [“indie soul”, “filter house”, “vocal house”, “new french touch”, “uk contemporary r&b”, “electrofox”, “nu disco”, “deep house”, “disco house”, “bboy”, “tribal house”] |
| 217 | [“early modern classical”, “italian opera”, “post-romantic era”, “american modern classical”, “early avant garde”, “impressionism”, “classical”] |

Splitting the super genres by audio properties

We’re getting there. Creating a playlist for each super genre is a possible way forward, but there’s more we can do. Some of the super genres are still a little big for my taste, as they can contain up to 400 different tracks. Personally, I like my playlists to have roughly 100-200 songs, so we’ll have to divide the super genres up into smaller pieces.

My next step is to use the track’s audio properties (energy, valence, danceability, etc.) to break up each super genre. There has been a lot of research on categorizing music based on valence & energy, so I’ll take a similar approach.

[Figure: music moods arranged by valence and energy] (Image credit: Patrick Helmholz, Technische Universität Braunschweig)

The Spotify API returns values between 0.0 and 1.0 for audio properties:

  • Valence: 0.0 is sad and depressed, 1.0 is cheerful and happy.
  • Energy: 0.0 is very calm, 1.0 is highly energetic.

As an example, I load all tracks in my biggest super genre. Then, I can plot each song as a dot on an energy/valence scale using matplotlib:


Using the k-means clustering algorithm, we can split these tracks into clusters across the valence and energy dimensions, four in this example (the code below uses roughly one cluster per 100 tracks):

import pandas as pd
from sklearn.cluster import KMeans

# session is assumed to be an open Neo4j driver session
result = session.run("""
MATCH (s:SuperGenre{id: $superGenre})--(t:Track)
RETURN t.id AS id, t.valence AS valence, t.energy AS energy""",
    parameters={'superGenre': super_genre_id}).data()

x = pd.DataFrame.from_records(result)
kmeans = KMeans(n_clusters=int(len(result) / 100) + 1, random_state=0).fit(x[['energy', 'valence']])
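
If you’re curious what k-means does under the hood, here is a dependency-free sketch of the basic idea (Lloyd’s algorithm) on (energy, valence) pairs. It is not scikit-learn’s implementation, which picks its starting centroids more carefully. In this sketch the starting centroids are passed in explicitly so the result is reproducible, and both the function signature and the sample points are made up for illustration.

```python
import math


def kmeans(points, centroids, iterations=20):
    """Plain Lloyd's algorithm on 2D points, starting from given centroids."""
    labels = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]))
            for point in points
        ]
        # Update step: move each centroid to the mean of its points.
        updated = []
        for i, centroid in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                centroid = tuple(sum(axis) / len(members)
                                 for axis in zip(*members))
            updated.append(centroid)
        if updated == centroids:
            break  # nothing moved, so the clustering has converged
        centroids = updated
    return labels, centroids


# (energy, valence) pairs: three calm, sad tracks and three energetic, happy ones
points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15),
          (0.8, 0.9), (0.9, 0.8), (0.85, 0.85)]
labels, centroids = kmeans(points, [(0.0, 0.0), (1.0, 1.0)])
print(labels)
```

Each track ends up with the index of its nearest centroid, which is the cluster label used to split a super genre.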

Now, we can group our songs based on both the super genre and their sound. I then create a playlist node for each (super genre, k-means cluster) combination. Here’s the final data model after creating the playlists:


The result: we’ve organized 1900 tracks into 24 playlists.

3. Dynamic playlist names

The last thing I want to do is give my playlists some nice names. I want each name to be based on the songs’ genres, as well as their audio features.

Starting with the genres, I use a keyword-based method to generate a name. Using Cypher, I query Neo4j to get all the song genres in the playlist. I then count how often each keyword occurs in this list of genre names.

MATCH (g:Genre)<-[:HAS_GENRE]-(a:Artist)<-[:HAS_ARTIST]-(t:Track)-[:IN_PLAYLIST]->(p:Playlist)
WITH p, collect(split(g.name, ' ')) as names
WITH p, reduce(allwords = [], n IN names | allwords + n) AS keywords
UNWIND keywords as keyword
WITH p, keyword, count(*) as wordcount order by wordcount desc
RETURN id(p), reduce(name = '', n IN collect(keyword)[0..5]| name + ' ' + n) AS name

This approach allows me to find the common words among genres. The result is the five most common keywords for the playlists:

| playlist | name                                           | genres                                                                                                                                                                                                                                                                                                               |
| 5737     | "rock pop classic mellow gold"                | ["classic uk pop", "glam rock", "soft rock", "new wave", "art rock", "new wave pop", "post-punk", "new romantic", "uk post-punk", "rock", "dance rock", "madchester", "britpop", "synthpop", "piano rock", "singer-songwriter", "folk rock", "mellow gold", "classic rock", "album rock", "blues", "blues rock", "metal", "hard rock", "roots rock", "country rock", "heartland rock", "pop rock", "punk", "permanent wave", "europop", "electric blues", "uk dance", "deep groove house", "indie rock", "indie pop", "modern alternative rock", "modern rock", "dance pop", "ska", "ska revival", "new orleans blues", "soul", "funk", "louisiana blues", "jazz blues", "piano blues", "merseybeat", "bubblegum pop", "yacht rock", "adult standards", "alternative dance", "new rave", "alternative rock", "dance-punk", "electronica", "indietronica", "electronic rock", "traditional folk", "folk", "classic swedish pop", "swedish pop", "irish rock", "british invasion", "vintage french electronic", "laboratorio", "big room", "rock drums", "neo mellow", "modern blues rock", "punk blues", "garage rock", "motown", "disco", "northern irish indie", "baroque pop", "doo-wop", "rockabilly", ...]          |
| 5734     | "rock pop house indie modern"                 | ["future funk", "synthwave", "filter house", "pop", "neo mellow", "pop rock", "indie poptimism", "modern alternative rock", "modern rock", "edm", "electro house", "big room", "progressive house", "dutch house", "deep big room", "progressive electro house", "dance pop", "tropical house", "rock", "piano rock", "indie rock", "indie pop", "la indie", "swedish indie pop", "brostep", "moombahton", "complextro", "austrian pop", "salzburg indie", "quebec indie", "permanent wave", "canadian indie", "baroque pop", "dance-punk", "new rave", "alternative dance", "alternative rock", "indietronica", "trance", "dutch trance", "canadian pop", "classic swedish pop", "australian dance", "electropop", "metropopolis", "norwegian pop", "new french touch", "nu gaze", "swedish alternative rock", "deep house", "deep tropical house", "indie soul", "nu disco", "shimmer pop", "pop edm", "hardcore techno", "big beat", "rave", "drum and bass", "intelligent dance music", "electronica", "breakbeat", "future bass", "vapor twitch", "electronic trap", "grunge", "alternative metal", "leicester indie", "britpop", "garage rock", "popwave", "nightrun", "zolo", "art pop", "pub rock", ...]        |
| 5746     | "rock adult standards lounge jazz"            | ["yacht rock", "soft rock", "mellow gold", "quiet storm", "new wave pop", "adult standards", "jazz quartet", "jazz", "cool jazz", "contemporary post-bop", "bebop", "lounge", "easy listening", "vocal harmony group", "swing", "vocal jazz", "ballroom", "exotica", "album rock", "art rock", "folk rock", "singer-songwriter", "country rock", "folk", "classic rock", "symphonic rock", "progressive rock", "brill building pop", "rock-and-roll", "soul", "rockabilly", "harlem renaissance", "jazz piano", "stride", "big band", "classic country pop", "bossa nova", "space age pop", "jazz organ", "disco", "bubblegum pop", "australian dance", "motown", "classic soul", "funk", "traditional folk", "doo-wop", "merseybeat", "jazz trombone", "vintage schlager", "chanson", "torch song", "british invasion", "soul jazz", "jazz blues", "new orleans jazz", "dixieland", "jazz trumpet", "acid house", "electronica", "new rave", "chicago house", "alternative dance", "electroclash", "new romantic", "vintage french electronic", "laboratorio", "comic", "moog", "vintage jazz", "jazz clarinet", "jazz orchestra", "northern soul", "new orleans blues", "blues", "electric blues", "blues rock", ...] |
| 5747     | "electro swing nu jazz house"                 | ["rebel blues", "steampunk", "electro swing", "nu jazz", "electro house", "complextro", "edm", "filthstep", "gaming edm", "bass trap", "downtempo", "chillstep", "gamecore", "big beat", "disco house", "tropical house", "dance pop", "pop", "liquid funk", "drum and bass", "jump up", "uk dance", "electronica", "ninja", "jazztronica", "turntablism", "trip hop", "south african pop", "cape town indie", "australian hip hop", "australian dance", "escape room", "australian indie", "australian electropop", "jazz boom bap", "slovenian electronic", "livetronica", "uk dnb", "soundtrack", "hardcore techno", "rave", "intelligent dance music", "breakbeat", "uk funky", "jungle", "indietronica", "dream pop", "deep tropical house", "post-rock", "compositional ambient", "neo-classical", "ambient", "nantes indie", "rock-and-roll", "rockabilly", "dutch pop", "balkan brass", "essex indie", "deep house", "filter house", "bboy", "indie soul", "new french touch", "nu disco", "vocaloid"]                                                                                                                                                                                                          |

Next, looking at the audio features, I calculate the average energy and valence for each playlist. Then, I pick a label for each based on the range the value falls in:

| Value Range  |  Energy   | Valence       |
| 0.00-0.25    | serene    | heavy-hearted |
| 0.25-0.50    | calm      | low           |
| 0.50-0.75    | active    | lively        |
| 0.75-1.00    | energetic | cheerful      |
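
The lookup in this table can be sketched as a small function. I’m assuming here that a value exactly on a boundary (such as 0.25) belongs to the higher range, which the table itself leaves open. The names `bucket` and `mood` are illustrative.

```python
ENERGY_LABELS = ("serene", "calm", "active", "energetic")
VALENCE_LABELS = ("heavy-hearted", "low", "lively", "cheerful")


def bucket(value):
    """Map a value in [0.0, 1.0] to a quarter index 0..3."""
    return min(int(value * 4), 3)  # 1.0 falls into the top quarter


def mood(energy, valence):
    """Combine the energy and valence labels for a playlist."""
    return f"{ENERGY_LABELS[bucket(energy)]}, {VALENCE_LABELS[bucket(valence)]}"


print(mood(0.7587, 0.7573))
print(mood(0.4566, 0.2043))
```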

Combining these with the top three keywords from the genres, we get the following playlist names:

| playlist | energy          | valence         | name                                             |
| 1        | 0.758713414634  | 0.757304878048  | “Indie House Alternative - energetic, cheerful” |
| 2        | 0.789559523809  | 0.3747351190476 | “House Indie Modern - energetic, low”           |
| 3        | 0.4565880000000 | 0.204347272727  | “House Indie French - calm, heavy-hearted”      |
| 4        | 0.4158924731182 | 0.322329032258  | “Classic Gold Soft - calm, low”                 |
| 5        | 0.735782805429  | 0.793076923076  | “Classic Gold Soft - active, cheerful”          |

Note: As it turns out, the keywords ‘pop’, ‘rock’, and ‘new’ were so common in my genres that they showed up in almost all my playlists. As a result, my final naming function ignores these common keywords.
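
A naming function along these lines could look roughly as follows. This is a sketch under assumptions, not my exact function: it takes a plain list of genre names (one entry per occurrence), counts keywords with `collections.Counter`, skips the ignored words, and appends a mood label that is passed in as a string. The genre list is made up.

```python
from collections import Counter

IGNORED = {"pop", "rock", "new"}  # keywords that showed up almost everywhere


def top_keywords(genres, count=3):
    """Return the most frequent genre keywords, skipping ignored ones."""
    words = Counter(
        word
        for genre in genres
        for word in genre.split(" ")
        if word not in IGNORED
    )
    return [word for word, _ in words.most_common(count)]


def playlist_name(genres, mood_label):
    """Build a name such as 'House Indie Modern - energetic, low'."""
    keywords = " ".join(top_keywords(genres)).title()
    return f"{keywords} - {mood_label}"


genres = ["deep house", "filter house", "disco house", "house",
          "indie pop", "indie rock", "indie soul",
          "modern rock", "modern alternative rock"]
print(playlist_name(genres, "energetic, low"))
```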

4. Creating the playlists in Spotify

Now that we’ve created and named our playlists, all that’s left to do is create them in Spotify. Here’s what to do:

  1. Connect to Spotify using SpotifyOAuth, which allows you to manage playlists.
  2. Create the empty playlists with our generated names. I added an [NPB] prefix to make sure I can distinguish them from my existing playlists.
  3. Add the tracks to our playlists. Because the Spotify API accepts at most 100 tracks per request, I cut them up into pages.
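
# session is assumed to be an open Neo4j driver session; Playlist nodes are
# assumed to carry a name property and Track nodes an id property
result = session.run("""
MATCH (n:Playlist)<-[:IN_PLAYLIST]-(t:Track)
RETURN n.name AS name, id(n) AS playlist, collect(t.id) AS tracks
""").data()

for item in result:
    playlist = spotify.user_playlist_create(user_id, item['name'])
    pages = cut_into_pages(item['tracks'], page_size)
    for page in pages:
        spotify.user_playlist_add_tracks(user_id, playlist['id'], page)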
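
The `cut_into_pages` helper is not shown above. One possible implementation, assuming it receives a plain list of track ids, is a simple slicing loop:

```python
def cut_into_pages(items, page_size=100):
    """Split a list into consecutive pages of at most page_size items."""
    return [items[start:start + page_size]
            for start in range(0, len(items), page_size)]


track_ids = [f"track_{n}" for n in range(250)]
pages = cut_into_pages(track_ids, 100)
print([len(page) for page in pages])
```

The last page simply holds whatever is left over, so no track is dropped.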

If all has worked correctly, you’re now able to enjoy your brand new playlists. Below is a screenshot of my results. Happy listening!


Next Steps

Take a look at the entire Python script here if you’re ready to dig deeper into the code. The Spotify API is super extensive, so there is plenty of data available to build smarter playlists. Check out the Neo4j graph data science library for more on graph algorithms.

Now that your music is organized, why don’t you use your brand new music graph to impress your boss?