This post is the most involved one yet - covering data engineering, NLP, AutoEncoders for Dimensionality Reduction, kNN, a Dash/Flask/Heroku based web UI and a FastAPI/AWS based ML API.
The product is a web app that takes free form text input (containing partial or complete names of the track and/or artists) identifying a track and does the following:
- Display a dropdown list of possible matches, highlighting the most likely match.
- Based on the chosen matching track, display a table listing tracks that are similar.
The web app is built using the Plotly Dash/Flask framework and deployed on Heroku - available at https://nsriniva-spotifinder.herokuapp.com/.
This app relies on 2 ML RESTful JSON APIs provided by a FastAPI based server running on AWS EC2.
The work uses 3 datasets (tracks.csv and artists.csv were hosted on Kaggle but the original location that hosted them has vanished):
- tracks.csv - a Spotify dataset of tracks (not just English) from 1921-2020, with id and name to identify the track, along with artists listing the names of the artists responsible for the track. It also contains the following properties associated with each track:
  - acousticness (Ranges from 0 to 1)
  - danceability (Ranges from 0 to 1)
  - energy (Ranges from 0 to 1)
  - duration_ms (Integer typically ranging from 200k to 300k)
  - instrumentalness (Ranges from 0 to 1)
  - valence (Ranges from 0 to 1)
  - popularity (Ranges from 0 to 100)
  - tempo (Float typically ranging from 50 to 150)
  - liveness (Ranges from 0 to 1)
  - loudness (Float typically ranging from -60 to 0)
  - speechiness (Ranges from 0 to 1)
  - mode (0 = Minor, 1 = Major)
  - explicit (0 = No explicit content, 1 = Explicit content)
  - key (All keys on octave encoded as values ranging from 0 to 11, starting on C as 0, C# as 1 and so on…)
- artists.csv - another Spotify dataset containing the name of artists and the genres associated with each artist.
- spotify_songs.csv from here on Kaggle - a much smaller dataset of tracks, with track_id and track_name to identify the track, along with the song properties listed above plus an additional lyrics column that contains the lyrics.
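To make the starting point concrete, here is a minimal sketch of loading the three datasets with pandas (the ../data/ directory is an assumption based on the paths used later in the notebooks):
import pandas as pd
DATA_DIR = '../data/'
tracks_df = pd.read_csv(DATA_DIR + 'tracks.csv')          # id, name, artists, audio features, ...
artists_df = pd.read_csv(DATA_DIR + 'artists.csv')        # artist name, genres, ...
spotify_df = pd.read_csv(DATA_DIR + 'spotify_songs.csv')  # track_id, track_name, lyrics, ...
print(tracks_df.shape, artists_df.shape, spotify_df.shape)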
Data Transformation
The data transformation steps involved:
- removing entries for tracks that have no name information
- adding the genres information from artists.csv and lyrics information from spotify_songs.csv to the tracks.csv dataset
- determining the language of the various tracks using the langdetect module's detect function
- extracting the English language tracks
- dropping the tracks with no genres information
All the data transformation work is detailed in the Data Cleanup notebook
Due to the size of the dataset, we use the pandarallel package's parallel_apply method, which needs to be initialized as follows:
from pandarallel import pandarallel
pandarallel.initialize()
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Remove entries with invalid name
The first step is to remove entries without a valid name
field.
# Filter out rows with null values for the 'name' column
tracks_df = tracks_df[tracks_df['name'].notna()]
Adding lyrics
The next step is to add a lyrics
feature.
To do this, we create a new dataset with just the track_id
and lyrics
fields from spotify_songs.csv
, with track_id
renamed to id
.
spotify_lyrics_df = spotify_df.filter(items=['track_id', 'lyrics']).rename(columns={'track_id':'id'})
and then merge it with the tracks_df
dataset
lyrics_df = tracks_df.merge(spotify_lyrics_df, how='left', on='id')
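Since this is a left merge, most tracks will not pick up lyrics; a quick sanity check (a hedged sketch using the columns above):
# Fraction of tracks that picked up lyrics from the smaller dataset
print(lyrics_df.lyrics.notna().mean())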
Adding genres
We then add a genres
feature by combining the genres information associated with each of the artists for the track.
def get_genres(x):
    # x is the id_artists field - a comma separated list of artist ids
    ret = []
    for artist in x.split(','):
        try:
            ret += artists_df.loc[artist].genres
        except KeyError:
            # Skip artist ids that are missing from artists_df
            pass
    # Deduplicate and return as a comma separated string
    return ','.join(set(ret))
lyrics_df['genres'] = lyrics_df.id_artists.parallel_apply(get_genres)
Identifying language
Adding a lang feature, using the detect function (from the langdetect module) on the track's name field:
from langdetect import detect, LangDetectException
def lang_detect(x):
    try:
        ret = detect(x)
    except LangDetectException:
        # detect() raises LangDetectException for strings it cannot classify
        ret = 'unknown'
    return ret
lyrics_df['lang'] = lyrics_df.name.parallel_apply(lang_detect)
Bringing it all together
We finally create (and export as csv) a new dataset with just English language tracks and valid genres information:
lyrics_en_df = lyrics_df[lyrics_df.lang == 'en']
lyrics_en_df = lyrics_en_df[lyrics_en_df.genres.isna() == False]
lyrics_en_df.to_csv('../data/tracks_genres_lyrics_en.csv')
This is the dataset that is used for the machine learning and deployment stages.
Machine Learning
There are 2 aspects to the ML involved in this project:
- Given partial free form textual information about the name of the song and/or its artists, how is the track in the dataset identified?
- Given the track, how do we identify “similar” songs that can be recommended to the user?
In both cases, the approach is to convert the relevant text valued features of the tracks in the dataset into vectors, reduce their dimensionality using an AutoEncoder and then fit a NearestNeighbors model on the reduced vectors.
The Transcript Vectorization and Dimensionality reduction using AutoEncoder sections of Scribble Stadium — Bridging the Analog and Digital should help with understanding how textual data can be converted into numerical vectors, why dimensionality reduction is needed and how it can be accomplished using AutoEncoders.
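As a concrete illustration of this pattern (a toy sketch, not the project's code - the corpus, layer sizes and neighbor counts are invented for the example):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
corpus = ['dancing queen abba', 'bohemian rhapsody queen', 'hey jude the beatles']
# Text -> tf-idf vectors
tfidf = TfidfVectorizer()
dtm = tfidf.fit_transform(corpus).todense()
# AutoEncoder: train to reconstruct the input, keep the encoder half
inp = Input(shape=(dtm.shape[1],))
encoded = Dense(2)(inp)                                      # bottleneck
decoded = Dense(dtm.shape[1], activation='sigmoid')(encoded)
autoencoder = Model(inp, decoded)
autoencoder.compile(loss='mse', optimizer='adam')
autoencoder.fit(dtm, dtm, epochs=10, verbose=0)
encoder = Model(inp, encoded)
# Fit NearestNeighbors on the reduced vectors and query with new text
nn = NearestNeighbors(n_neighbors=2).fit(encoder.predict(dtm))
query = encoder.predict(tfidf.transform(['queen']).todense())
print(nn.kneighbors(query)[1])  # indices of the closest tracks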
Identifying/Selecting a song
All of the data engineering/modeling work associated with this section is detailed in the Data Modeling Notebook.
Since we want to identify a song based on potentially incomplete information on the name and artists of a song, the first step is to engineer a new name_cmplx_tokens feature combining the name and artists features.
from re import compile as rcompile
# Combine the name and artists fields
tracks_df['name_cmplx'] = tracks_df.name + tracks_df.artists.apply(lambda x: ' '+x+' ')
# Lowercase, replace separators with spaces and strip the remaining non alphanumeric characters
rex = rcompile('[^a-zA-Z 0-9]')
tokenize = lambda x: rex.sub('', x.lower().replace(',', ' ').replace('-',' '))
tracks_df['name_cmplx_tokens'] = tracks_df.name_cmplx.apply(tokenize)
Compute a DTM (Document Term Matrix) using the name_cmplx_tokens feature:
from sklearn.feature_extraction.text import TfidfVectorizer
data = tracks_df.name_cmplx_tokens.to_list()
# Instantiate vectorizer object
tfidf = TfidfVectorizer(stop_words='english',
                        min_df=7,
                        )
# Learn the vocabulary and compute tf-idf weights per document
dtm = tfidf.fit_transform(data)
# Get feature names to use as dataframe column headers
features = tfidf.get_feature_names()
# Create DTM with dense form of the vectors
dtm = pd.DataFrame(dtm.todense(), columns=features)
The computed vectors have 12426 dimensions and we build/train an AutoEncoder to reduce the dimensionality down to 64:
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
ishape = dtm.shape[1]
# Create Model
input_img = Input(shape=(ishape, ))
x = Dense(1024)(input_img)
x = Dense(256)(x)
x = Dense(128)(x)
encoded = Dense(64)(x)
x = Dense(128)(encoded)
x = Dense(256)(x)
x = Dense(1024, activation='sigmoid')(x)
decoded = Dense(ishape, activation='sigmoid')(x)
rmodel = Model(input_img, decoded)
rmodel.compile(loss='mse', optimizer=Adam(learning_rate=0.01))
rmodel.fit(dtm, dtm, batch_size=512, epochs=2)
# The encoder half maps inputs to the 64 dimensional bottleneck
encoder = Model(input_img, encoded)
encoded_dtm = encoder.predict(dtm)
We now have a trained vectorizer tfidf and dimensionality reducer encoder, along with the dimensionality reduced DTM encoded_dtm.
The final step is to pickle these models for deployment.
from joblib import dump
MODELS_DIR = '../models/'
TFIDF = MODELS_DIR+'tfidf.pkl'
ENCODED_DTM = MODELS_DIR+'encoded_dtm.pkl'
ENCODER = MODELS_DIR+'encoder.h5'
dump(tfidf, TFIDF)
encoder.save(ENCODER)
dump(encoded_dtm, ENCODED_DTM)
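As a quick check that the artifacts round-trip (a hedged sketch; the path constants are the ones defined above):
from joblib import load
from tensorflow.keras.models import load_model
tfidf = load(TFIDF)               # fitted TfidfVectorizer
encoder = load_model(ENCODER)     # trained Keras encoder
encoded_dtm = load(ENCODED_DTM)   # dimensionality reduced DTM
print(encoded_dtm.shape)          # expected: (number of tracks, 64)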
Recommending songs based on song selection
All of the data engineering/modeling work associated with this section is detailed in the Recommendation Modeling Notebook.
The first step is to engineer a new genres_tokens feature from genres:
from re import compile as rcompile
rex = rcompile('[^a-zA-Z 0-9]')
tokenize = lambda x: rex.sub('', x.lower().replace(',', ' '))
tracks_df['genres_tokens'] = tracks_df.genres.apply(tokenize)
The genres_tokens feature is then used to fit a vectorizer and generate a DTM:
genres_data = tracks_df.genres_tokens.to_list()
# Instantiate vectorizer object
genres_tfidf = TfidfVectorizer(stop_words='english',
ngram_range=(1,2),
min_df=3,
)
# Learn the vocabulary and compute tf-idf weights per document
genres_dtm = genres_tfidf.fit_transform(genres_data)
# Get feature names to use as dataframe column headers
genres = genres_tfidf.get_feature_names()
genres_dtm = pd.DataFrame(genres_dtm.todense(), columns=genres)
We then scale the relevant numerical features associated with each track and combine them with genres_dtm to create a new array of vectors.
from sklearn.preprocessing import MinMaxScaler
features = ['popularity', 'duration_ms', 'explicit', 'danceability', 'energy', 'key',
'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness',
'liveness', 'valence', 'tempo', 'time_signature']
features_df = tracks_df[features]
scaler = MinMaxScaler()
features_scaled_df = pd.DataFrame(scaler.fit_transform(features_df), columns=features)
fg_df = pd.concat([features_scaled_df, genres_dtm], axis=1)
The computed vectors have 10902 dimensions and we build/train an AutoEncoder to reduce the dimensionality down to 64:
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
ishape = fg_df.shape[1] #10902
# Create Model
input_img = Input(shape=(ishape, ))
x = Dense(1024)(input_img)
x = Dense(256)(x)
x = Dense(128)(x)
fg_encoded = Dense(64)(x)
x = Dense(128)(fg_encoded)
x = Dense(256)(x)
x = Dense(1024, activation='sigmoid')(x)
fg_decoded = Dense(ishape, activation='sigmoid')(x)
fg_rmodel = Model(input_img, fg_decoded)
fg_rmodel.compile(loss='mse', optimizer=Adam(learning_rate=0.01))
fg_rmodel.fit(fg_df, fg_df, batch_size=512, epochs=6)
fg_encoder = Model(input_img, fg_encoded)
fg_encoded_df = fg_encoder.predict(fg_df)
We now have a trained vectorizer genres_tfidf, MinMax scaler scaler, and dimensionality reducer fg_encoder, along with the dimensionality reduced vector array fg_encoded_df.
As before, the final step is to pickle these models for deployment.
from joblib import dump
MODELS_DIR = '../models/'
GENRES_TFIDF = MODELS_DIR+'genres_tfidf.pkl'
SCALER = MODELS_DIR+'scaler.pkl'
FG_ENCODED_DF = MODELS_DIR+'fg_encoded_df.pkl'
FG_ENCODER = MODELS_DIR+'fg_encoder.h5'
dump(genres_tfidf, GENRES_TFIDF)
dump(scaler, SCALER)
dump(fg_encoded_df, FG_ENCODED_DF)
fg_encoder.save(FG_ENCODER)
ML deployment API
This API is implemented in find_songs.py.
FindSongs
This class is a facade over the APIs that are used in the deployed apps and is used for testing them.
- .find_song_entry(sugg_str, best_choice=True)
  Given sugg_str (a string containing part/whole of the song's name and/or artist), returns either
  - a dataframe of song entries that are the closest matches - when best_choice is set to False, or
  - a single song entry - when best_choice is set to True
- .find_song_entries(sugg_str)
  This is equivalent to find_song_entry(sugg_str, best_choice=False)
- .get_recommendations(song_entry)
  Given a song entry song_entry, returns a dataframe of similar songs.
All these APIs are tested by executing
python -m app.data_model_usage > app/data_model_usage.out
from the repository root.
Since app/data_model_usage.out is a git tracked file, it's easy to verify that the behavior of the API has not changed.
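For illustration, a hedged usage sketch of the facade (the module path and hint string are assumptions):
from app.find_songs import FindSongs  # module path assumed from find_songs.py
fs = FindSongs()
# Best single match for a free form hint
entry = fs.find_song_entry('dancing queen abba')
# All of the close matches instead of just the best one
matches = fs.find_song_entries('dancing queen abba')
# Tracks similar to the selected entry
recommendations = fs.get_recommendations(entry)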
Deployment APIs
The deployment APIs are provided via the `FindSongEntries`, `FindSongData` and `FindSongRecommendations` classes, as well as the
standalone `getBestChoice` function.
Since the imports from the `tensorflow`/`tensorflow-cpu` and `sklearn` packages are only required by the `FindSongEntries` and `FindSongRecommendations` classes, they are dynamically imported using the `import_from_sk_tf` function.
from importlib import import_module

load_model = None
NearestNeighbors = None

def import_from_sk_tf():
    global load_model, NearestNeighbors
    if load_model is None:
        load_model = import_module('tensorflow.keras.models').load_model
        NearestNeighbors = import_module('sklearn.neighbors').NearestNeighbors
FindSongEntries
The __init__ method loads in all the pickled model files from Identifying/Selecting a song and builds a NearestNeighbors model:
import_from_sk_tf()
# Extract encoder.h5 from encoder.h5.zip
with ZipFile(ENCODER_PATH, 'r') as zipObj:
zipObj.extractall()
# Load the model saved in ../../models/encoder.h5
self.encoder = load_model(ENCODER)
# Load the TfIDF vectorizer saved in tfidf.pkl
self.tfidf = load(TFIDF)
# Load the encoded DTM saved in encoded_dtm.pkl
encoded_dtm = load(ENCODED_DTM)
# Fit NearestNeighbors on encoded DTM
self.nn = NearestNeighbors(n_neighbors=10, algorithm='ball_tree')
self.nn.fit(encoded_dtm)
.find_matching_songs(hint)
Given hint (a string containing part/whole of the song's name and/or artist), returns a list of indices of tracks that are the closest matches.
# Vectorize the hint by running it through tfidf
vec = self.tfidf.transform([tokenize(hint)]).todense()
# Reduce dimensionality by running through encoder
encoded_vec = self.encoder.predict(vec)
# Get list of indices of entries that are closest to sugg_str
entries = self.nn.kneighbors(encoded_vec)[1][0].tolist()
FindSongData
The __init__ method loads in the tracks dataset from the zipped csv file.
# Load tracks_df from zipped csv file tracks_genres_lyrics_en.csv.zip
self.tracks_df = pd.read_csv(TRACKS)
.get_song_entries_data(entries, sorted=False)
Given entries, a list of track indices, returns a dataframe of tracks corresponding to those indices.
The dataframe is sorted by the value of the popularity field (in descending order) if sorted is set to True.
if sorted:
entries = self.tracks_df.iloc[entries].popularity.\
sort_values(ascending=False).index.tolist()
# Return a dataframe containing the sorted selection of entries
return self.tracks_df.loc[entries]
# Return a dataframe containing the selected entries
return self.tracks_df.iloc[entries]
.get_df_entry(idx)
Given the index idx of a track, returns the track entry.
return self.tracks_df.loc[idx]
getBestChoice(hint, df)
Given hint and a dataframe df of tracks that are the closest matches, returns the index of the entry that is the best match.
# Convert the hint to a set of tokens
sugg_set = set(tokenize(hint).split())
# Get the list of index values for the dataframe
choice = df.index.tolist()
# Given index value of a song entry row, returns a set of
# tokens from the combined name and artists columns.
# The array syntax ['name'] is used in place of the dot
# syntax .name because .name returns the value from the index
# column
name_artists = lambda x: set(tokenize(df.loc[x]['name']+' '+
df.loc[x].artists).split())
# Given a set of tokens, it returns the length of its
# intersection with the sugg_set
# This is used as a measure how similar the input is to the
# sugg_set - the larger the return value, the greater the
# similarity
score_func = lambda x: len(sugg_set.intersection(x))
choices = [(y, name_artists(y)) for y in choice]
best_idx = 0
best_score = score_func(choices[0][1])
for idx, nm_art in enumerate(choices[1:]):
    score = score_func(nm_art[1])
    if score > best_score:
        best_score = score
        best_idx = idx+1
# Return the index of the best matching entry
return choices[best_idx][0]
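To make the scoring concrete, a tiny worked example (token sets invented for illustration):
sugg_set = {'dancing', 'queen', 'abba'}
candidates = [{'dancing', 'queen', 'abba'}, {'bohemian', 'rhapsody', 'queen'}]
scores = [len(sugg_set & c) for c in candidates]
print(scores)  # [3, 1] - the first candidate wins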
FindSongRecommendations
The __init__ method loads in all the pickled model files from Recommending songs based on song selection and builds a NearestNeighbors model:
import_from_sk_tf()
# Numerical features associated with a song entry
self.features = [
'popularity', 'duration_ms', 'explicit', 'danceability',
'energy', 'key', 'loudness', 'mode', 'speechiness',
'acousticness', 'instrumentalness', 'liveness', 'valence',
'tempo', 'time_signature'
]
# Load the model saved in fg_encoder.h5
self.fg_encoder = load_model(FG_ENCODER_PATH)
# Load the TfIDF vectorizer for genres data saved in genres_tfidf.pkl
self.genres_tfidf = load(GENRES_TFIDF)
# The original DF is DTM generated by genres_tfidf from genres data
# in the dataset + Numerical features
# Load the encoded DF from fg_encoded_df.pkl
fg_encoded_df = load(FG_ENCODED_DF)
# Load the MinMaxScaler saved at scaler.pkl
self.scaler = load(SCALER)
# Fit NearestNeighbors on encoded DF
self.fg_nn = NearestNeighbors(n_neighbors=10, algorithm='ball_tree')
self.fg_nn.fit(fg_encoded_df)
.get_recommended_songs_json(entry_data)
Given entry_data (the JSON form of a track entry), returns a list of indices of similar tracks.
x = pd.read_json(entry_data, typ='series')
# Convert the genres feature to a vector
gvec = self.genres_tfidf.transform([tokenize(x.genres)]).todense()
# Standardize the numerical features
fvec = self.scaler.transform([x[self.features]])
# Combine both vectors to create a single features vector
vec = [fvec.tolist()[0] + gvec.tolist()[0]]
# Perform dimensionality reduction by running through fg_encoder
encoded_vec = self.fg_encoder.predict(vec)
# Get the list of indices of entries that are closest to
# the input entry
entries = self.fg_nn.kneighbors(encoded_vec)[1][0].tolist()
Deployment
The original plan was to deploy everything via Heroku but its very tight storage and runtime memory budgets made that impossible. The solution was to retain the UI/HCI functionality on Heroku and move the ML functionality over to an AWS EC2 instance, exposing it to the UI backend via a JSON RESTful API.
The ML API - FastAPI/AWS
The ML API server is implemented in find_song_api.py using the FastAPI framework.
The FastAPI server is deployed on AWS EC2 - Developing and Deploying a COMPLETE Project Using FastAPI, Jinja2, SQLAlchemy, Docker, and AWS was a useful resource.
The matching_songs endpoint
This endpoint is a wrapper around the FindSongEntries API and returns a list of indices of tracks that are possible matches for the hint parameter of type str.
findSongEntries = FindSongEntries()
@app.post('/matching_songs')
async def find_matching_songs(hint:str):
return findSongEntries.find_matching_songs(hint)
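A hedged client-side sketch of calling this endpoint (the host/port are placeholders; hint is sent as a query parameter, matching how the UI backend calls it below):
from requests import post
API_URL = 'http://<ec2-host>:8000'  # placeholder host and port
resp = post(API_URL + '/matching_songs', params={'hint': 'dancing queen abba'})
print(resp.json())  # a list of track indices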
The recommended_songs endpoint
This endpoint is a wrapper around the FindSongRecommendations API and returns a list of indices of tracks that are similar to the song parameter of type SongEntry (the definition of SongEntry was generated using the online JSON to Pydantic Converter tool).
class SongEntry(BaseModel):
id: str
name: str
popularity: int
duration_ms: int
explicit: int
artists: str
id_artists: str
release_date: str
danceability: float
energy: float
key: int
loudness: float
mode: int
speechiness: float
acousticness: float
instrumentalness: float
liveness: float
valence: float
tempo: float
time_signature: int
lyrics: Any
genres: str
lang: str
findSongRecommendations = FindSongRecommendations()
@app.post('/recommended_songs')
async def get_recommended_songs(song:SongEntry):
return findSongRecommendations.get_recommended_songs_json(song.json())
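A corresponding hedged sketch for this endpoint, posting a track entry as JSON (selected_song is assumed to be a pandas Series row from the tracks dataframe, as returned by FindSongData.get_df_entry):
resp = post(API_URL + '/recommended_songs', data=selected_song.to_json())
print(resp.json())  # indices of similar tracks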
The HCI (Human Computer Interface) - Dash/Flask/Heroku
The UI app server is implemented in app.py.
The Plotly Dash framework is used to implement the UI.
The Dash app is deployed on Heroku - Heroku for Sharing Public Dash apps for Free was a useful resource.
Initializing the app server
app = dash.Dash(__name__, external_stylesheets=external_stylesheets)
server = app.server
Dealing with user text input
We use the dash_core_components module's Input component to read user text input:
dcc.Input(
id='Hint',
type = 'text',
placeholder = 'Song Name and/or Artist(s)',
debounce=True,
style={
'width':'30%',
'text-align':'left',
'display':'inline-block'
}
)
The debounce attribute specifies how we want user input to be passed back to the server:
- A value of True will result in user input only being transmitted to the server after the user hits Enter (or the input loses focus).
- A value of False (the default) will result in data being sent to the server after every keypress.
Using the user supplied hint to identify the track
The user supplied hint is used to identify the list of indices of possible matches using the matching_songs endpoint of the ML RESTful API.
The list of matching indices is used to generate a dataframe df of tracks using the FindSongData.get_song_entries_data API, which is then used (along with the user supplied hint) by the getBestChoice API to identify the index of the best choice among the matches.
FEATURES = ['name', 'artists']
get_song_info = lambda x: ' '.join(x[FEATURES].to_list())
findSongData = FindSongData()
def find_matching_songs(hint):
ret = post(FIND_MATCHING_SONGS, params={'hint':hint})
return extract_id_list(ret.text)
@app.callback(
Output('Songs', 'options'),
Output('Songs', 'value'),
[Input('Hint', 'value')]
)
def set_options(hint):
if hint is None:
raise PreventUpdate
entries = find_matching_songs(hint)
df = findSongData.get_song_entries_data(entries, sorted=True)
best_idx = getBestChoice(hint, df)
dicosongs = [{'label': get_song_info(row), 'value': idx} for idx,row in df.iterrows()]
return dicosongs, best_idx
The output from the set_options callback is wired to the dash_core_components module's Dropdown component:
dcc.Dropdown(id='Songs',
multi=False,
style={
'width':'70%',
'vertical-align':'middle',
'display':'inline-block'
}
)
The multi attribute specifies whether multiple selections are allowed:
- only one choice is allowed, if multi is set to False
- multiple choices are allowed, if multi is set to True
Using the identified track to make recommendations
The user chosen track index song is used to retrieve the track entry using the FindSongData.get_df_entry API, which is then used with the recommended_songs endpoint to get a list of indices of tracks that are similar to the user chosen song.
A dataframe with the similar tracks is generated using the FindSongData.get_song_entries_data API and then the song name and artists fields are returned as a list of dictionary entries.
def get_recommended_songs(selected_song):
ret = post(GET_RECOMMENDED_SONGS,data=selected_song.to_json())
return extract_id_list(ret.text)
@app.callback(
Output('rec-table', 'data'),
[Input('Songs', 'value')],
)
def predict(song):
if song is None:
raise PreventUpdate
selected_song = findSongData.get_df_entry(song)
entries = get_recommended_songs(selected_song)
result = findSongData.get_song_entries_data(entries)
return result[FEATURES].to_dict('records')
The output from the predict callback is wired to the dash_table module's DataTable component.
dt.DataTable(
id='rec-table',
columns=[{"name": i.upper(), "id": i} for i in FEATURES],
style_cell={'textAlign': 'center'},
style_table={'minWidth': '360px','width': '360px','maxWidth': '360px', 'marginLeft':'auto', 'marginRight':'auto'}
)