This post is the most involved one yet - covering data engineering, NLP, AutoEncoders for dimensionality reduction, kNN, a Dash/Flask/Heroku based web UI and a FastAPI/AWS based ML API.

The product is a web app that takes free-form text input (containing partial or complete names of the track and/or artists) identifying a track and does the following:

  1. Displays a dropdown list of possible matches, highlighting the most likely match.
  2. Based on the chosen track, displays a table listing similar tracks.

The web app is built using the Plotly Dash/Flask framework and deployed on Heroku - available at https://nsriniva-spotifinder.herokuapp.com/

This app relies on 2 ML RESTful JSON APIs provided by a FastAPI based server running on AWS EC2.

The work uses 3 datasets (tracks.csv and artists.csv were hosted on Kaggle, but the original location that hosted them has vanished):

  • tracks.csv - a Spotify dataset of tracks (not just English) from 1921-2020, with id and name to identify the track, along with artists listing the names of the artists responsible for the track. It also contains the following properties associated with each track:
    • acousticness (Ranges from 0 to 1)
    • danceability (Ranges from 0 to 1)
    • energy (Ranges from 0 to 1)
    • duration_ms (Integer typically ranging from 200k to 300k)
    • instrumentalness (Ranges from 0 to 1)
    • valence (Ranges from 0 to 1)
    • popularity (Ranges from 0 to 100)
    • tempo (Float typically ranging from 50 to 150)
    • liveness (Ranges from 0 to 1)
    • loudness (Float typically ranging from -60 to 0)
    • speechiness (Ranges from 0 to 1)
    • mode (0 = Minor, 1 = Major)
    • explicit (0 = No explicit content, 1 = Explicit content)
    • key (All keys on octave encoded as values ranging from 0 to 11, starting on C as 0, C# as 1 and so on…)
  • artists.csv - another Spotify dataset containing the name of artists and the genres associated with that artist.
  • spotify_songs.csv, hosted on Kaggle - a much smaller dataset of tracks with track_id and track_name to identify the track, along with the song properties listed above plus an additional lyrics column.
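For orientation, here is a minimal sketch of loading the three datasets with pandas (the ../data/ locations and the index_col choice are assumptions; adjust to wherever the CSVs live):

import pandas as pd

tracks_df = pd.read_csv('../data/tracks.csv')
# Index by artist id so genres can be looked up via artists_df.loc[<id>]
artists_df = pd.read_csv('../data/artists.csv', index_col='id')
spotify_df = pd.read_csv('../data/spotify_songs.csv')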

Data Transformation

The data transformation steps involved

  • removing entries for tracks that have no name information
  • adding the genres information from artists.csv and lyrics information from spotify_songs.csv to the tracks.csv dataset.
  • determining the language for the various tracks using the langdetect module’s detect function
  • extracting the English-language tracks
  • dropping the tracks with no genres information

All the data transformation work is detailed in the Data Cleanup notebook.

Due to the size of the dataset, we use pandarallel's parallel_apply method, which needs to be initialized as follows

from pandarallel import pandarallel
pandarallel.initialize()

# Output:
# INFO: Pandarallel will run on 16 workers.
# INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
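Once initialized, parallel_apply is a drop-in replacement for pandas' apply that fans the work out across the workers. A trivial, hypothetical illustration:

# Standard pandas apply - runs in a single process
scaled = tracks_df.popularity.apply(lambda p: p / 100)

# Same computation, spread across all 16 workers
scaled = tracks_df.popularity.parallel_apply(lambda p: p / 100)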

Remove entries with invalid name

The first step is to remove entries without a valid name field.

# Filter out rows with null values for the 'name' column
tracks_df = tracks_df[tracks_df.name.notna()]

Adding lyrics

The next step is to add a lyrics feature. To do this, we create a new dataset with just the track_id and lyrics fields from spotify_songs.csv, with track_id renamed to id.

spotify_lyrics_df = spotify_df.filter(items=['track_id', 'lyrics']).rename(columns={'track_id':'id'})

and then merge it with the tracks_df dataset

lyrics_df = tracks_df.merge(spotify_lyrics_df, how='left', on='id')

Adding genres

We then add a genres feature by combining the genres information associated with each of the artists for the track.

def get_genres(x):
    # x is the id_artists field: a comma separated string of artist ids
    ret = []
    for artist in x.split(','):
        try:
            ret += artists_df.loc[artist].genres
        except KeyError:
            # Skip artist ids that are missing from artists_df
            pass
    # Deduplicate and rejoin into a comma separated string
    ret = ','.join(set(ret))

    return ret

lyrics_df['genres'] = lyrics_df.id_artists.parallel_apply(get_genres)

Identifying language

We add a lang feature by applying the detect function (from the langdetect module) to the track's name field.

from langdetect import detect

def lang_detect(x):
    try:
        ret = detect(x)
    except Exception:
        # detect() raises on strings it cannot classify
        # (e.g. empty, numeric or symbol-only names)
        ret = 'unknown'
    return ret

lyrics_df['lang'] = lyrics_df.name.parallel_apply(lang_detect)

Bringing it all together

We finally create (and export as csv) a new dataset with just English-language tracks and valid genres information.

lyrics_en_df = lyrics_df[lyrics_df.lang == 'en']
lyrics_en_df = lyrics_en_df[lyrics_en_df.genres.notna()]
lyrics_en_df.to_csv('../data/tracks_genres_lyrics_en.csv')

This is the dataset that is used for the machine learning and deployment stages.

Machine Learning

There are 2 aspects to the ML involved in this project

  1. Given partial free form textual information about the name of the song and/or its artists, how is the track in the dataset identified?
  2. Given the track, how do we identify “similar” songs that can be recommended to the user?

In both cases, the approach is to convert the relevant text-valued features of the tracks into vectors, reduce their dimensionality using an AutoEncoder, and use the result to fit a NearestNeighbors model.

The Transcript Vectorization and Dimensionality reduction using AutoEncoder sections of Scribble Stadium — Bridging the Analog and Digital should help explain how textual data can be converted into a numerical vector, why dimensionality reduction is needed and how it can be accomplished using AutoEncoders.
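Schematically, both pipelines follow the same pattern; here is a minimal sketch on toy data (the real notebooks insert an AutoEncoder between vectorization and the neighbor search):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

corpus = ['hey jude the beatles',
          'let it be the beatles',
          'purple haze jimi hendrix']

# Text -> high dimensional numerical vectors
vectors = TfidfVectorizer().fit_transform(corpus).toarray()

# (dimensionality reduction via an AutoEncoder would happen here)

# Fit a nearest neighbor model on the vectors
nn = NearestNeighbors(n_neighbors=2).fit(vectors)

# Distances to and indices of the 2 entries closest to the first track
distances, indices = nn.kneighbors(vectors[:1])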

Identifying/Selecting a song

All of the data engineering/modeling work associated with this section is detailed in the Data Modeling Notebook.

Since we want to identify a song based on potentially incomplete information about its name and artists, the first step is to engineer a new name_cmplx_tokens feature combining the name and artists features.

from re import compile as rcompile

# Combine the name and artists fields into a single text field
tracks_df['name_cmplx'] = tracks_df.name + tracks_df.artists.apply(lambda x: ' '+x+' ')

# Lowercase, replace separators with spaces and strip everything
# except letters, digits and spaces
rex = rcompile('[^a-zA-Z 0-9]')

tokenize = lambda x: rex.sub('', x.lower().replace(',', ' ').replace('-',' '))

tracks_df['name_cmplx_tokens'] = tracks_df.name_cmplx.apply(tokenize)
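For example, a hypothetical track with name 'Hey Jude' and artists "['The Beatles']" tokenizes as:

tokenize("Hey Jude ['The Beatles'] ")
# -> 'hey jude the beatles '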

Compute a DTM (Document Term Matrix) using the name_cmplx_tokens feature

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

data = tracks_df.name_cmplx_tokens.to_list()

# Instantiate vectorizer object
tfidf = TfidfVectorizer(stop_words='english',
                        min_df=7,
                       )
# Create a vocabulary and get word counts per document
dtm = tfidf.fit_transform(data)

# Get feature names to use as dataframe column headers
# (use get_feature_names_out() on scikit-learn >= 1.0)
features = tfidf.get_feature_names()

# Create DTM with dense form of the vectors
dtm = pd.DataFrame(dtm.todense(), columns=features)

The computed vectors have 12,426 dimensions, so we build/train an AutoEncoder to reduce the dimensionality down to 64

from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

ishape = dtm.shape[1]

# Encoder: 12426 -> 1024 -> 256 -> 128 -> 64
input_img = Input(shape=(ishape, ))
x = Dense(1024)(input_img)
x = Dense(256)(x)
x = Dense(128)(x)
encoded = Dense(64)(x)

# Decoder: 64 -> 128 -> 256 -> 1024 -> 12426
x = Dense(128)(encoded)
x = Dense(256)(x)
x = Dense(1024, activation='sigmoid')(x)
decoded = Dense(ishape, activation='sigmoid')(x)

# Train the full AutoEncoder to reconstruct its input
rmodel = Model(input_img, decoded)
rmodel.compile(loss='mse', optimizer=Adam(learning_rate=0.01))
rmodel.fit(dtm, dtm, batch_size=512, epochs=2)

# Keep just the encoder half for dimensionality reduction
encoder = Model(input_img, encoded)
encoded_dtm = encoder.predict(dtm)

We now have a trained vectorizer tfidf and dimensionality reducer encoder, along with the dimensionality-reduced DTM encoded_dtm.

The final step is to pickle these models for deployment.

from joblib import dump
MODELS_DIR = '../models/'

TFIDF = MODELS_DIR+'tfidf.pkl'
ENCODED_DTM = MODELS_DIR+'encoded_dtm.pkl'
ENCODER = MODELS_DIR+'encoder.h5'

dump(tfidf, TFIDF)
encoder.save(ENCODER)
dump(encoded_dtm, ENCODED_DTM)

Recommending songs based on song selection

All of the data engineering/modeling work associated with this section is detailed in the Recommendation Modeling Notebook.

The first step is to engineer a new genres_tokens feature from genres

from re import compile as rcompile

rex = rcompile('[^a-zA-Z 0-9]')

tokenize = lambda x: rex.sub('', x.lower().replace(',', ' '))

tracks_df['genres_tokens'] = tracks_df.genres.apply(tokenize)

The genres_tokens feature is then used to fit a vectorizer and generate a DTM

genres_data = tracks_df.genres_tokens.to_list()

# Instantiate vectorizer object
genres_tfidf = TfidfVectorizer(stop_words='english',
                               ngram_range=(1,2),
                               min_df=3,
                              )
# Create a vocabulary and get word counts per document
genres_dtm = genres_tfidf.fit_transform(genres_data)

# Get feature names to use as dataframe column headers
genres = genres_tfidf.get_feature_names()

genres_dtm = pd.DataFrame(genres_dtm.todense(), columns=genres)

We then scale the relevant numerical features associated with each track and combine with genres_dtm to create a new array of vectors.

from sklearn.preprocessing import MinMaxScaler

features = ['popularity', 'duration_ms', 'explicit', 'danceability', 'energy', 'key',
       'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness',
       'liveness', 'valence', 'tempo', 'time_signature']

features_df = tracks_df[features]
scaler = MinMaxScaler()

features_scaled_df = pd.DataFrame(scaler.fit_transform(features_df), columns=features)

fg_df = pd.concat([features_scaled_df, genres_dtm], axis=1)

The computed vectors have 10,902 dimensions, so we build/train an AutoEncoder to reduce the dimensionality down to 64

from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

ishape = fg_df.shape[1] #10902

# Encoder: 10902 -> 1024 -> 256 -> 128 -> 64
input_img = Input(shape=(ishape, ))
x = Dense(1024)(input_img)
x = Dense(256)(x)
x = Dense(128)(x)
fg_encoded = Dense(64)(x)

# Decoder: 64 -> 128 -> 256 -> 1024 -> 10902
x = Dense(128)(fg_encoded)
x = Dense(256)(x)
x = Dense(1024, activation='sigmoid')(x)
fg_decoded = Dense(ishape, activation='sigmoid')(x)

# Train the full AutoEncoder to reconstruct its input
fg_rmodel = Model(input_img, fg_decoded)
fg_rmodel.compile(loss='mse', optimizer=Adam(learning_rate=0.01))
fg_rmodel.fit(fg_df, fg_df, batch_size=512, epochs=6)

# Keep just the encoder half for dimensionality reduction
fg_encoder = Model(input_img, fg_encoded)
fg_encoded_df = fg_encoder.predict(fg_df)

We now have a trained vectorizer genres_tfidf, MinMaxScaler scaler and dimensionality reducer fg_encoder, along with the dimensionality-reduced vector array fg_encoded_df.

As before, the final step is to pickle these models for deployment.

from joblib import dump

MODELS_DIR = '../models/'
GENRES_TFIDF = MODELS_DIR+'genres_tfidf.pkl'
SCALER = MODELS_DIR+'scaler.pkl'
FG_ENCODED_DF = MODELS_DIR+'fg_encoded_df.pkl'
FG_ENCODER = MODELS_DIR+'fg_encoder.h5'

dump(genres_tfidf, GENRES_TFIDF)
dump(scaler, SCALER)
dump(fg_encoded_df, FG_ENCODED_DF)
fg_encoder.save(FG_ENCODER)

ML deployment API

This API is implemented in find_songs.py.

FindSongs

This class is a facade over the APIs that are used in the deployed apps and is used for testing them.

  • .find_song_entry(sugg_str, best_choice=True)

    Given sugg_str (a string containing part/whole of the song's name and/or artist) returns either

    • a dataframe of song entries that are the closest matches - when best_choice is set to False

    or

    • a single song entry - when best_choice is set to True
  • .find_song_entries(sugg_str)

    This is equivalent to find_song_entry(sugg_str, best_choice=False)

  • .get_recommendations(song_entry)

    Given a song entry song_entry, returns a dataframe of similar songs.

All these APIs are tested by executing

python -m app.data_model_usage > app/data_model_usage.out

from the repository root. Since app/data_model_usage.out is a git-tracked file, it's easy to verify that the behavior of the API has not changed.
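As an illustration of how the facade methods fit together, here is a hypothetical session (the app.find_songs import path is an assumption based on the find_songs.py file name):

from app.find_songs import FindSongs

fs = FindSongs()

# Best single match for a free-form hint
entry = fs.find_song_entry('bohemian rhapsody queen')

# All close matches, as a dataframe
matches = fs.find_song_entries('bohemian rhapsody queen')

# Tracks similar to the chosen entry
recommendations = fs.get_recommendations(entry)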


Deployment APIs

The deployment APIs are provided via the FindSongEntries, FindSongData and FindSongRecommendations classes, as well as the standalone getBestChoice function.

Since the imports from the tensorflow/tensorflow-cpu and sklearn packages are only required by the FindSongEntries and FindSongRecommendations classes, they are dynamically imported using the import_from_sk_tf function.

from importlib import import_module

load_model = None
NearestNeighbors = None

def import_from_sk_tf():
    global load_model, NearestNeighbors

    if load_model is None:
        load_model = import_module('tensorflow.keras.models').load_model
        NearestNeighbors = import_module('sklearn.neighbors').NearestNeighbors

FindSongEntries

The __init__ method loads in all the pickled model files from Identifying/Selecting a song and builds a NearestNeighbors model

import_from_sk_tf()

# Extract encoder.h5 from encoder.h5.zip
with ZipFile(ENCODER_PATH, 'r') as zipObj:
    zipObj.extractall()

# Load the model saved in ../../models/encoder.h5
self.encoder = load_model(ENCODER)

# Load the TfIDF vectorizer saved in tfidf.pkl
self.tfidf = load(TFIDF)

# Load the encoded DTM saved in encoded_dtm.pkl
encoded_dtm = load(ENCODED_DTM)

# Fit NearestNeighbors on encoded DTM
self.nn = NearestNeighbors(n_neighbors=10, algorithm='ball_tree')
self.nn.fit(encoded_dtm)
    
.find_matching_songs(hint)

Given hint (a string containing part/whole of the song's name and/or artist), returns a list of indices of tracks that are the closest matches.

# Vectorize the hint by running it through tfidf
vec = self.tfidf.transform([tokenize(hint)]).todense()

# Reduce dimensionality by running it through the encoder
encoded_vec = self.encoder.predict(vec)

# Get list of indices of entries that are closest to the hint
entries = self.nn.kneighbors(encoded_vec)[1][0].tolist()

FindSongData

The __init__ method loads in the tracks dataset from the zipped csv file.

# Load tracks_df from zipped csv file tracks_genres_lyrics_en.csv.zip
self.tracks_df = pd.read_csv(TRACKS)

.get_song_entries_data(entries, sorted=False)

Given entries, a list of track indices, returns a dataframe of tracks corresponding to those indices. The dataframe is sorted according to the value of the popularity field (in descending order) if sorted is set to True.


if sorted:
    # Reorder the indices by descending popularity
    entries = self.tracks_df.iloc[entries].popularity.\
        sort_values(ascending=False).index.tolist()

    # Return a dataframe containing the sorted selection of entries
    return self.tracks_df.loc[entries]

# Return a dataframe containing the selected entries
return self.tracks_df.iloc[entries]

.get_df_entry(idx)

Given the index idx of a track, returns the track entry.

        return self.tracks_df.loc[idx]

getBestChoice(hint, df)

Given hint and a dataframe of tracks that are the closest matches, returns the index of the entry that is the best match.

    # Convert the hint to a set of tokens
    sugg_set = set(tokenize(hint).split())
    
    # Get the list of index values for the dataframe
    choice = df.index.tolist()
    
    
    # Given index value of a song entry row, returns a set of
    # tokens from the combined name and artists columns.
    # The array syntax ['name'] is used in place of the dot
    # syntax .name because .name returns the value from the index
    # column
    name_artists = lambda x: set(tokenize(df.loc[x]['name']+' '+
                                          df.loc[x].artists).split())
    
    # Given a set of tokens, returns the length of its
    # intersection with sugg_set.
    # This is used as a measure of how similar the input is to
    # sugg_set - the larger the return value, the greater the
    # similarity
    score_func = lambda x: len(sugg_set.intersection(x))
    
    choices = [(y, name_artists(y)) for y in choice]
    best_idx = 0
    best_score = score_func(choices[0][1])
    for idx, nm_art in enumerate(choices[1:]):
        score = score_func(nm_art[1])
        if score > best_score:
            best_score = score
            best_idx = idx+1

    return choices[best_idx][0]

FindSongRecommendations

The __init__ method loads in all the pickled model files from Recommending songs based on song selection and builds a NearestNeighbors model

import_from_sk_tf()

# Numerical features associated with a song entry
self.features = [
    'popularity', 'duration_ms', 'explicit', 'danceability',
    'energy', 'key', 'loudness', 'mode', 'speechiness',
    'acousticness', 'instrumentalness', 'liveness', 'valence',
    'tempo', 'time_signature'
]

# Load the model saved in fg_encoder.h5
self.fg_encoder = load_model(FG_ENCODER_PATH)

# Load the TfIDF vectorizer for genres data saved in genres_tfidf.pkl
self.genres_tfidf = load(GENRES_TFIDF)

# The original DF is the DTM generated by genres_tfidf from the genres
# data in the dataset + the numerical features
# Load the encoded DF from fg_encoded_df.pkl
fg_encoded_df = load(FG_ENCODED_DF)

# Load the MinMaxScaler saved at scaler.pkl
self.scaler = load(SCALER)

# Fit NearestNeighbors on encoded DF
self.fg_nn = NearestNeighbors(n_neighbors=10, algorithm='ball_tree')
self.fg_nn.fit(fg_encoded_df)

.get_recommended_songs_json(entry_data)

Given entry_data (the JSON form of a track entry), returns a list of indices of similar tracks.

x = pd.read_json(entry_data, typ='series')

# Convert the genres feature to a vector
gvec = self.genres_tfidf.transform([tokenize(x.genres)]).todense()

# Scale the numerical features with the MinMaxScaler
fvec = self.scaler.transform([x[self.features]])

# Combine both vectors to create a single features vector
vec = [fvec.tolist()[0] + gvec.tolist()[0]]

# Perform dimensionality reduction by running through fg_encoder
encoded_vec = self.fg_encoder.predict(vec)

# Get the list of indices of entries that are closest to
# the input entry
entries = self.fg_nn.kneighbors(encoded_vec)[1][0].tolist()

Deployment

The original plan was to deploy everything via Heroku, but its very tight storage and runtime memory budgets made that impossible. The solution was to retain the UI/HCI functionality on Heroku, move the ML functionality over to an AWS EC2 instance, and expose it to the UI backend via a JSON RESTful API.
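In the UI code, this split amounts to the Heroku app calling the EC2-hosted endpoints over HTTP. A hypothetical configuration sketch (the ML_API_BASE environment variable is an assumption; FIND_MATCHING_SONGS and GET_RECOMMENDED_SONGS are the constants used by the callbacks shown later):

from os import getenv

# Base URL of the FastAPI server on EC2
ML_API_BASE = getenv('ML_API_BASE', 'http://localhost:8000')

FIND_MATCHING_SONGS = ML_API_BASE + '/matching_songs'
GET_RECOMMENDED_SONGS = ML_API_BASE + '/recommended_songs'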

The ML API - FastAPI/AWS

The ML API server is implemented in find_song_api.py using the FastAPI framework.

The FastAPI server is deployed on AWS EC2 - Developing and Deploying a COMPLETE Project Using FastAPI, Jinja2, SQLAlchemy, Docker, and AWS was a useful resource.

The matching_songs endpoint

This endpoint is a wrapper around the FindSongEntries API and returns a list of indices of tracks that are possible matches for the hint parameter of type str.

findSongEntries = FindSongEntries()

@app.post('/matching_songs')
async def find_matching_songs(hint:str):
    return findSongEntries.find_matching_songs(hint)
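Since hint is declared as a plain str, FastAPI treats it as a query parameter. A hypothetical client call (the host and the returned indices are illustrative):

from requests import post

resp = post('http://<ec2-host>:8000/matching_songs',
            params={'hint': 'bohemian rhapsody queen'})
entries = resp.json()  # e.g. [31415, 27182, ...] - indices into the tracks dataset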

The recommended_songs endpoint

This endpoint is a wrapper around the FindSongRecommendations API and returns a list of indices of tracks that are similar to the song parameter of type SongEntry (the definition of SongEntry was generated using the online JSON to Pydantic Converter tool)

from typing import Any

from pydantic import BaseModel

class SongEntry(BaseModel):
    id: str
    name: str
    popularity: int
    duration_ms: int
    explicit: int
    artists: str
    id_artists: str
    release_date: str
    danceability: float
    energy: float
    key: int
    loudness: float
    mode: int
    speechiness: float
    acousticness: float
    instrumentalness: float
    liveness: float
    valence: float
    tempo: float
    time_signature: int
    lyrics: Any
    genres: str
    lang: str
    
findSongRecommendations = FindSongRecommendations()

@app.post('/recommended_songs')
async def get_recommended_songs(song:SongEntry):
    return findSongRecommendations.get_recommended_songs_json(song.json())
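Because song is a Pydantic model, FastAPI expects it as a JSON request body. A hypothetical client call, mirroring how the UI invokes this endpoint below (the host is illustrative):

from requests import post

# selected_song is a pandas Series holding one full track entry
resp = post('http://<ec2-host>:8000/recommended_songs',
            data=selected_song.to_json())
entries = resp.json()  # indices of similar tracks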

The HCI(Human Computer Interface) - Dash/Flask/Heroku

The UI app server is implemented in app.py.

The Plotly Dash framework is used to implement the UI.

The Dash app is deployed on Heroku - Heroku for Sharing Public Dash apps for Free was a useful resource.

Initializing the app server

app = dash.Dash(__name__, external_stylesheets=external_stylesheets)
server = app.server

Dealing with user text input

We use the dash_core_components module’s Input object to read user text input

       dcc.Input(
            id='Hint',
            type = 'text',
            placeholder = 'Song Name and/or Artist(s)',
            debounce=True,
            style={
                'width':'30%',
                'text-align':'left',
                'display':'inline-block'
            }
        )           

The debounce attribute specifies how we want user input to be passed back to the server:

  • A value of True will result in user input only being transmitted back to the server after the user hits Enter (or the input loses focus).
  • A value of False (the default) will result in data being sent back to the server after every key press.

Using the user supplied hint to identify the track

The user supplied hint is used to identify the list of indices of possible matches using the matching_songs endpoint of the ML RESTful API. The list of matching indices is used to generate a dataframe df of tracks using the FindSongData.get_song_entries_data API, which is then used (along with the user supplied hint) by the getBestChoice API to identify the index of the best choice among the matches.

FEATURES = ['name', 'artists']
 
get_song_info = lambda x:  ' '.join(x[FEATURES].to_list())
 
findSongData = FindSongData()

def find_matching_songs(hint):
    ret = post(FIND_MATCHING_SONGS, params={'hint':hint})
    return extract_id_list(ret.text)

@app.callback(
    Output('Songs', 'options'),
    Output('Songs', 'value'),
    [Input('Hint', 'value')]
)
def set_options(hint):
    if hint is None:
        raise PreventUpdate
    entries = find_matching_songs(hint)
    df = findSongData.get_song_entries_data(entries, sorted=True)
    best_idx = getBestChoice(hint, df)
    dicosongs = [{'label': get_song_info(row), 'value': idx} for idx,row in df.iterrows()]
    return dicosongs, best_idx

The output from the set_options callback is wired to the dash_core_components module's Dropdown component

        dcc.Dropdown(id='Songs',
                     multi=False,
                     style={
                         'width':'70%',
                         'vertical-align':'middle',
                         'display':'inline-block'
                     }
        )

The multi attribute specifies

  • only one choice is allowed, if multi is set to False
  • multiple choices are allowed, if multi is set to True

Using the identified track to make recommendations

The user chosen track index song is used to retrieve the track entry using the FindSongData.get_df_entry API, which is then used with the recommended_songs endpoint to get a list of indices of tracks that are similar to the user chosen song. A dataframe with the similar tracks is generated using the FindSongData.get_song_entries_data API and then the song name and artists fields are returned as a list of dictionary entries.

def get_recommended_songs(selected_song):
    ret = post(GET_RECOMMENDED_SONGS,data=selected_song.to_json())
    return extract_id_list(ret.text)

@app.callback(
    Output('rec-table', 'data'),
    [Input('Songs', 'value')],
)
def predict(song):
    if song is None:
        raise PreventUpdate
    selected_song = findSongData.get_df_entry(song)
    entries = get_recommended_songs(selected_song)
    result = findSongData.get_song_entries_data(entries)

    return result[FEATURES].to_dict('records')

The output from the predict callback is wired to the dash_table’s DataTable component.

        dt.DataTable(
            id='rec-table',
            columns=[{"name": i.upper(), "id": i} for i in FEATURES],
            style_cell={'textAlign': 'center'},
            style_table={'minWidth': '360px','width': '360px','maxWidth': '360px', 'marginLeft':'auto', 'marginRight':'auto'}
        )