Python: What song was that?
09 Mar 2022
I was working out to Minnesota Public Radio’s indie station The Current sometime during daylight hours this winter, and I heard a song with an ascending arpeggio that reminded me of one in Anna Stine’s Threshold of You, a song produced by the guitarist whose web site I’ve helped with, Robert Bell.
I knew he was listening to a lot of indie music when he produced that album, so I was curious if the song I’d just worked out to had inspired him or if he’d put something out into the world that later went big for something else. He has a tendency to be prescient with trends – painting his house colors that later grace every interior design magazine, thinking of people for the first time in 10 years right before they contact him, etc.
It turns out that “Threshold of You” was released before the song I heard, so it was a fun moment of “Say Robert, what is up with your ESP?”
Thing is, I’ve since forgotten what song I heard. Data science to the rescue?
I know I started working out to the current this winter sometime after Halloween, and it’s been long enough to forget the song, so I scraped all songs between 1/1/21 and 3/8/22 from The Current’s website.
Here are the baseline global variables:
import datetime
import json
import os

import pandas
import requests
from bs4 import BeautifulSoup
from time import sleep  # Used by the MusicBrainz rate limiter later on

thefolder = 'c:\\example\\thecurrent\\'
baseurl = 'https://www.thecurrent.org/playlist/the-current/'
startdate = datetime.datetime.strptime('2021-11-01', '%Y-%m-%d') # I started the workout no earlier than 11/8/21 -- that's when I downloaded it. Earliest playlist they have is 12/22/05.
enddate = datetime.datetime.strptime('2022-03-08', '%Y-%m-%d') #datetime.datetime.today()
lastrequesttimestamp = datetime.datetime.min # Timestamp of the most recent MusicBrainz request
dates = (startdate + datetime.timedelta(days=x) for x in range(0, (enddate-startdate).days + 1))
urlsbydatestring = {(x.strftime('%Y-%m-%d')):(baseurl + x.strftime('%Y-%m-%d')) for x in dates}
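As a sanity check on the date arithmetic, here is a minimal stand-alone sketch (with a shortened three-day range) of how the generator plus dict comprehension yields exactly one playlist URL per calendar day, both endpoints included:

```python
import datetime

baseurl = 'https://www.thecurrent.org/playlist/the-current/'
startdate = datetime.datetime.strptime('2021-11-01', '%Y-%m-%d')
enddate = datetime.datetime.strptime('2021-11-03', '%Y-%m-%d')

# One entry per day, inclusive of both endpoints
dates = (startdate + datetime.timedelta(days=x) for x in range((enddate - startdate).days + 1))
urlsbydatestring = {x.strftime('%Y-%m-%d'): baseurl + x.strftime('%Y-%m-%d') for x in dates}

print(len(urlsbydatestring))           # 3
print(urlsbydatestring['2021-11-02'])  # https://www.thecurrent.org/playlist/the-current/2021-11-02
```

One caveat worth knowing: dates is a generator, not a list, so it can only be iterated once — fine here, since the dict comprehension is its sole consumer.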
I wrote a function that would download a file to my computer for every playlist page on The Current’s website during that time range:
# Download HTML files
def downloadhtmlfiles():
    for datestring, url in urlsbydatestring.items():
        filepath = os.path.join(thefolder, datestring + '.html')
        if os.path.exists(filepath):
            continue # We must have already downloaded this one
        r = requests.get(url)
        with open(filepath, 'w', encoding=r.encoding) as f:
            f.write(r.text)
    print('Downloading of HTML done.')

#downloadhtmlfiles() # Comment this out except for the first run of my script, to avoid burdening MPR
I ran downloadhtmlfiles() just once, to be a good web-scraping citizen (no need to hit MPR’s servers over and over).
Lucky me, MPR is using Next.js, so I didn’t have to parse a lot of HTML to get the data I wanted. It was all already sitting in a <script> tag as JSON, with a list of songs under props.pageProps.data.songs. Thanks, MPR!
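To illustrate why chained .get() calls with {} defaults are handy here, a stand-alone sketch against a made-up miniature of that payload shape (the song data below is invented, not from a real page):

```python
import json

# A made-up miniature of the __NEXT_DATA__ payload shape described above
nextdata = json.loads('''
{"props": {"pageProps": {"data": {"songs": [
    {"title": "Threshold of You", "artist": "Anna Stine"}
]}}}}
''')

# Chained .get() calls with {} defaults never raise KeyError,
# so a page missing any level of the hierarchy just yields [].
songs = nextdata.get('props', {}).get('pageProps', {}).get('data', {}).get('songs', [])
print(songs[0]['artist'])  # Anna Stine

empty = {}.get('props', {}).get('pageProps', {}).get('data', {}).get('songs', [])
print(empty)  # []
```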
I extracted each day’s songs list and concatenated it with that of the others, writing out one big file, 2021-11-01_2022-03-08.json, onto my hard drive with every song played over the last 4.5 months.
# Parse HTML files, write JSON file
def htmltojson():
    all_songs = []
    for datestring, url in urlsbydatestring.items():
        with open(os.path.join(thefolder, datestring + '.html'), 'r', encoding='utf-8') as f:
            x = f.read()
        soup = BeautifulSoup(x, 'html.parser') # Name the parser explicitly to avoid bs4's "no parser specified" warning
        scripttag = soup.find("script", {"id": "__NEXT_DATA__"})
        pagedata = json.loads(scripttag.string)
        songs = pagedata.get('props', {}).get('pageProps', {}).get('data', {}).get('songs', [])
        all_songs.extend(songs)
    with open(os.path.join(thefolder, startdate.strftime('%Y-%m-%d') + '_' + enddate.strftime('%Y-%m-%d') + '.json'), 'w', encoding='utf-8') as f:
        f.write(json.dumps(all_songs))
    print('HTML to JSON done.')

#htmltojson() # Comment this out once JSON has been built
Then it was just a matter of loading that data into Pandas like this and figuring out what to put under the comment # DO SOMETHING COOL HERE:
# Parse JSON file, analyze songs
def jsontoanalysis():
    with open(os.path.join(thefolder, startdate.strftime('%Y-%m-%d') + '_' + enddate.strftime('%Y-%m-%d') + '.json'), 'r', encoding='utf-8') as f:
        songs = json.load(f) # json.load() no longer accepts an encoding argument; open() handles that
    songsdf = pandas.DataFrame(songs)
    songsdf.dropna(how='all', axis='columns', inplace=True) # Drop empty columns
    songsdf.drop(columns=['service_id', 'record_co', 'broadcast', 'art_url'], inplace=True)
    # DO SOMETHING COOL HERE

jsontoanalysis() # Here's where the fun happens
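That dropna(how='all', axis='columns') call is what clears out columns that never held a value across the whole playlist. A toy DataFrame (hypothetical column values, echoing the playlist's never-populated classical-music fields) shows the effect:

```python
import pandas

df = pandas.DataFrame({
    'title': ['Song A', 'Song B'],
    'artist': ['X', 'Y'],
    'conductor': [None, None],  # Never populated on an indie station
})
# how='all' drops a column only if EVERY value in it is missing
df.dropna(how='all', axis='columns', inplace=True)
print(list(df.columns))  # ['title', 'artist']
```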
Here’s what I’ve come up with so far: I’ve removed every song that wasn’t played during daylight. The code says “morning,” but only because I never renamed the label after realizing I might sometimes have worked out in the afternoon.
# Parse JSON file, analyze songs
def jsontoanalysis():
    def timeofday(played_at):
        hour = pandas.to_datetime(played_at).hour
        if (hour > 5) and (hour <= 18):
            return 'morning'
        else:
            return None
    with open(os.path.join(thefolder, startdate.strftime('%Y-%m-%d') + '_' + enddate.strftime('%Y-%m-%d') + '.json'), 'r', encoding='utf-8') as f:
        songs = json.load(f) # json.load() no longer accepts an encoding argument; open() handles that
    releaseyears = {}
    releaseyearspath = os.path.join(thefolder, 'releaseyears.ndjson')
    if os.path.exists(releaseyearspath):
        with open(releaseyearspath, 'r', encoding='utf-8') as f:
            lines = f.readlines()
        if (lines is not None and len(lines) > 0):
            for line in lines:
                releaseyears.update(json.loads(line))
    def get_rate_limited(url):
        global lastrequesttimestamp
        if (datetime.datetime.now() < (lastrequesttimestamp + datetime.timedelta(seconds=1.3))):
            sleep(1)
        lastrequesttimestamp = datetime.datetime.now()
        response = requests.get(url)
        return response
    def firstreleased(album_mbid):
        if (pandas.isnull(album_mbid)):
            return None
        elif (album_mbid in releaseyears.keys()):
            return releaseyears.get(album_mbid)
        else:
            #if (len(releaseyears) > 30): # DEBUG LINE ONLY
            #    return None # DEBUG LINE ONLY
            r = get_rate_limited('https://musicbrainz.org/ws/2/release-group/' + album_mbid + '?inc=aliases%2Bartist-credits%2Breleases&fmt=json')
            if (r is None):
                releaseyears[album_mbid] = None
            else:
                data = r.json()
                if (data is None):
                    releaseyears[album_mbid] = None
                elif ('first-release-date' in data.keys()):
                    releaseyears[album_mbid] = data['first-release-date']
                else:
                    releaseyears[album_mbid] = None
            with open(releaseyearspath, 'a', encoding='utf-8') as f:
                f.write(json.dumps({album_mbid: releaseyears.get(album_mbid)}) + '\n')
            return releaseyears.get(album_mbid)
    #print(len(songs))
    songsdf = pandas.DataFrame(songs)
    songsdf.dropna(how='all', axis='columns', inplace=True) # Drop empty columns
    songsdf.drop(columns=['service_id', 'record_co', 'broadcast', 'art_url'], inplace=True)
    songsdf['timeofday'] = songsdf['played_at'].apply(timeofday)
    songsdf['releaseyear'] = songsdf['album_mbid'].apply(firstreleased)
    #print(songsdf.head())
    #print(list(songsdf.columns))
    # ['title', 'artist', 'album', 'played_at', 'duration', 'service_id', 'song_id', 'play_id', 'composer', 'conductor', 'orch_ensemble',
    #  'soloist_1', 'soloist_2', 'soloist_3', 'soloist_4', 'soloist_5', 'soloist_6', 'record_co', 'record_id', 'addl_text', 'broadcast', 'rating_info',
    #  'songs_on_album', 'songs_by_artist', 'album_mbid', 'art_url']
    #print(songsdf['songs_by_artist'].head().dropna()) # Empty -- this is what I first wrote that led me to decide to drop empty columns above
    #print(songsdf['artist'].unique()) # Not very interesting
    songsdf.dropna(axis='rows', subset=['timeofday'], inplace=True) # Drop rows that weren't in the morning
    songsdf.drop(columns=['timeofday'], inplace=True)
    songsdf.sort_values(by='played_at', inplace=True)
    songsdf.to_excel(os.path.join(thefolder, startdate.strftime('%Y-%m-%d') + '_' + enddate.strftime('%Y-%m-%d') + '.xlsx'), index=False)
    songsdf.to_csv(os.path.join(thefolder, startdate.strftime('%Y-%m-%d') + '_' + enddate.strftime('%Y-%m-%d') + '.csv'), index=False)
Defining a function timeofday() helped me temporarily add a timeofday column to my Pandas DataFrame with values that either read “morning” or were empty. I threw away all rows where timeofday was empty, then threw away the timeofday column itself once it was no longer needed.
I also defined a function called firstreleased() that queries the MusicBrainz database to find out when the album a song appears on was first released. It in turn uses a function called get_rate_limited() that helps ensure I’m not exceeding MusicBrainz’s published rate limits.
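MusicBrainz asks anonymous clients to stay at roughly one request per second. Here is a stand-alone sketch of the same throttling idea, using time.monotonic() instead of datetime and sleeping exactly the remaining interval; the lambda is a stand-in fetch so the sketch runs without network access:

```python
import time

lastrequest = float('-inf')  # Monotonic timestamp of the previous request
MIN_INTERVAL = 1.0           # Seconds between requests, per MusicBrainz's guidance

def get_rate_limited(fetch, url):
    """Call fetch(url), sleeping first if the previous call was under MIN_INTERVAL ago."""
    global lastrequest
    wait = MIN_INTERVAL - (time.monotonic() - lastrequest)
    if wait > 0:
        time.sleep(wait)  # Sleep only the remainder, not a fixed amount
    lastrequest = time.monotonic()
    return fetch(url)

# Three back-to-back calls must span at least two full intervals
start = time.monotonic()
for _ in range(3):
    get_rate_limited(lambda u: u, 'https://musicbrainz.org/ws/2/release-group/...')
elapsed = time.monotonic() - start
print(elapsed >= 2.0)  # True
```

time.monotonic() is the safer clock for interval measurement, since it can never jump backward the way wall-clock datetime.now() can.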
Adding this information to my spreadsheet should help me highlight all songs released after “Threshold of You”, so I don’t bother with older songs.
(Although I think I’m still going to leave the other songs in the output spreadsheet so that if I see a sequence of songs I remember hearing while working out, I can pay special attention to the highlighted songs near it.)
In 4.5 months, 23,121 songs got played, so downloading data from MusicBrainz is taking a while.
Note that my firstreleased() logic always defers to the contents of releaseyears.ndjson and to the in-memory releaseyears dict before bothering to fetch things over the internet, and that it writes to disk immediately after fetching data from the internet, so that prior MusicBrainz requests won’t have to be performed again if the Python script errors out or I stop it mid-execution during its many-hour run (I’m thinking it’ll probably take about 10 hours).
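That append-as-you-go NDJSON pattern is what makes the run resumable. A stand-alone sketch (using a temp file rather than releaseyears.ndjson, and invented MBIDs) of writing one single-key JSON object per line and rebuilding the dict on the next run:

```python
import json
import os
import tempfile

cachepath = os.path.join(tempfile.mkdtemp(), 'cache.ndjson')

# First "run": append one single-key JSON object per fetched MBID
with open(cachepath, 'a', encoding='utf-8') as f:
    f.write(json.dumps({'mbid-1': '2019-06-21'}) + '\n')
    f.write(json.dumps({'mbid-2': None}) + '\n')  # Cache the misses too, so they're never re-fetched

# Next "run": rebuild the in-memory dict before making any network calls
cache = {}
with open(cachepath, 'r', encoding='utf-8') as f:
    for line in f:
        cache.update(json.loads(line))

print(cache)  # {'mbid-1': '2019-06-21', 'mbid-2': None}
```

Appending a complete line per result (rather than rewriting one big JSON file each time) means an interrupted run loses at most the request in flight.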
If I were to run jsontoanalysis() after fully building out releaseyears.ndjson so that it includes every MBID found in 2021-11-01_2022-03-08.json, it shouldn’t send any requests over the internet to MusicBrainz at all.