Music library clustering
Clustering a music library into playlists using only extracted audio features. Dimensionality reduction and clustering methods such as PCA, autoencoders, and k-Means are evaluated.
Dynamic playlists, “radio modes”, or music recommendation systems often rely on listening habits or human annotation (artists, genres). However, querying a huge listener database might not be possible, and tag labels might be misleading: the latter fall short, for example, for unusually “soft/hard” pieces of music from otherwise different artists/genres that are still perceived as similar.
This is an attempt to explore whether pure music information retrieval yields audio properties that correlate to a similar “mood”, while explicitly excluding metadata from the whole process.
Samples
For demonstration purposes, we pick 50 random songs from 20 genres as a dataset that does not push browser bandwidth and SVG rendering limits.
In the following, the genre information (as obtained by mutagen) is never encoded as feature dimension and only used for clustering evaluation and plotting.
Categorical Sorting
Note that the genre list is not semantically sorted – nearby colors are not necessarily “related”. This turns out not to be an easy problem to solve. (Side-Quest.)
Re-using the end result, or at least some string-based metric such as the Levenshtein distance or the ratio from difflib, would give a similarity matrix between genres. Finding the optimal ordering that minimizes the sum of individual distances can then be reduced to the Travelling Salesman Problem, which is (un-)surprisingly vastly out of scope: for example, 100 genres already result in about 2^524 permutations (100!) to try.
Thus, the local search solver from python-tsp is used as heuristic instead. This approach might however still be slightly better than assigning random (hash-based) or sorted/discrete (predefined colormap) colors.
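The idea can be sketched with stdlib tools only; here difflib provides the string similarity, and a greedy nearest-neighbour tour stands in for the python-tsp local search solver (the greedy heuristic is an illustrative simplification, not the solver actually used):

```python
from difflib import SequenceMatcher

def genre_distance(a: str, b: str) -> float:
    """String dissimilarity in [0, 1] based on difflib's match ratio."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()

def order_genres(genres: list[str]) -> list[str]:
    """Greedy nearest-neighbour tour: a cheap stand-in for a TSP solver.
    Starts at the first genre and always hops to the closest remaining one."""
    remaining = list(genres)
    tour = [remaining.pop(0)]
    while remaining:
        nearest = min(remaining, key=lambda g: genre_distance(tour[-1], g))
        remaining.remove(nearest)
        tour.append(nearest)
    return tour

print(order_genres(["Hard Rock", "Jazz", "Rock", "Acid Jazz", "Pop"]))
```

On this toy input, “Rock” ends up adjacent to “Hard Rock”, which is exactly the kind of neighborhood the color assignment benefits from.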
Feature Extraction
Various audio features are extracted using librosa. In order to avoid encoding timeseries, each file is assigned single values per feature, namely the overall average, median, and standard deviation – after clamping the most extreme 1% of outliers. For example, the spectral centroid, sorted by median and colored by genre:
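The per-file reduction can be sketched as follows; the input stands in for any per-frame librosa timeseries (e.g. from librosa.feature.spectral_centroid), and the symmetric 1%/99% percentile clamp is an assumption of how the outlier removal works:

```python
import numpy as np

def summarize(frames: np.ndarray, clamp_pct: float = 1.0) -> dict:
    """Collapse a per-frame feature timeseries into three scalars,
    after clamping the most extreme values on both ends."""
    lo, hi = np.percentile(frames, [clamp_pct, 100.0 - clamp_pct])
    clamped = np.clip(frames, lo, hi)
    return {
        "mean": float(np.mean(clamped)),
        "median": float(np.median(clamped)),
        "std": float(np.std(clamped)),
    }
```

Clamping first keeps a single extreme frame (a click, a silent intro) from dominating the mean and standard deviation.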
Especially when re-sorting by genre, a pattern indeed seems to emerge that could help telling the genres apart:
- Centroid
- The mean frequency of the magnitude spectrogram. As there is little sense in averaging over per-frame frequencies, the results are converted to a mel scale, which behaves more linearly as it represents the “perceived” frequency rather than a logarithmic measurement.
- Bandwidth
- Spectral bandwidth as a measure of how narrow or wide the signal is around the centroid, or how much this varies over time.
- Flatness
- Spectral flatness as additional measure of noisiness.
- Zero Crossings
- How often the signal crosses zero, i.e., flips signs.
- Roll-On/-Off
- Where the lowest 10% and top 10% energy is contained.
- Spectrum
- The absolute magnitude for six bark band bins (100, 510, 1080, 2000, 3700, and 15500 Hz).
- Tempogram
- The overall detected BPM value alone does not seem to tell much. Instead, something like the “flatness” of the self-recurrence beat detection is used, which indicates a more varying or constant tempo – a simpler replacement for high frequencies in the linear tempogram itself.
- HPSS
- RMS difference of the decomposition into harmonic and percussive components.
As feature vector, the 16 most promising values are picked. After independent normalization, the full matrix of all features for all input songs can be visualized as:
Note that some values are negated: flipping a dimension simply moves the data to another quadrant and should not matter for encoding or clustering. However, it avoids negative PCA factors later on, which could be interpreted as “contradicting axes”.
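The normalization step is not spelled out above; a minimal sketch, assuming independent per-feature min-max scaling with the sign flip expressed as mirroring within [0, 1]:

```python
import numpy as np

def normalize_features(X, flip=()):
    """Independently min-max scale each feature column into [0, 1],
    then mirror the columns listed in `flip`. Mirroring a dimension
    does not change the cluster geometry, it only reorients the axis."""
    X = np.asarray(X, dtype=float)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    # guard against constant columns to avoid division by zero
    scaled = (X - mins) / np.where(maxs > mins, maxs - mins, 1.0)
    cols = list(flip)
    scaled[:, cols] = 1.0 - scaled[:, cols]
    return scaled
```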
Encoding
Treating each feature as independent dimension is very likely to skew results due to redundancy, correlation, or differing variance. Also, the contained information (or entropy) in the dataset is most likely less than the full 16D input. Such dependencies inside the dataset can be accounted for by extracting and encoding the most significant information bits instead.
Principal Component Analysis
The well-understood PCA serves as a baseline for the dimensionality reduction from 16D into 1D, 2D, 3D, and 8D.
Here, further scaling is recommended. The input features thus additionally pass a StandardScaler, which centers each feature at zero and divides by its standard deviation.
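Using scikit-learn, the scaling and projection chain might look like this (the pipeline helper is an illustrative choice; the explained-variance read-out is what the percentages below refer to):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def pca_embed(X, dims):
    """Standardize each feature (zero mean, unit standard deviation),
    then project onto the `dims` leading principal components.
    Returns the embedding and the fraction of variance it covers."""
    pipeline = make_pipeline(StandardScaler(), PCA(n_components=dims))
    embedded = pipeline.fit_transform(X)
    covered = pipeline.named_steps["pca"].explained_variance_ratio_.sum()
    return embedded, covered
```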
Sorting by the resulting major component already shows some grouping by genre. This one-dimensional ordering could even serve as a crude linear playlist.
Two dimensions reveal even more interesting clusters (same data, flipped axes for different aspects and histograms):
Plotting the first three dimensions, again shown from different angles, yields:
While these three components together cover 65% of the original variance, the (not shown) 8D representation usually accounts for over 90%.
Autoencoder
As an alternative to PCA, the normalized feature vectors are fed into an autoencoder. One of the most straightforward 16-to-3-dimensional neural networks (consisting of sigmoid-activated dense layers) would look like:
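A minimal numpy sketch of such a network; the actual implementation likely uses a deep-learning framework, and the single bottleneck layer, weight initialization, and learning rate here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyAutoencoder:
    """16 -> 3 -> 16 dense autoencoder with sigmoid activations,
    trained by plain gradient descent on the reconstruction MSE."""

    def __init__(self, n_in=16, n_code=3, lr=0.5):
        self.W1 = rng.normal(0.0, 0.5, (n_in, n_code))
        self.b1 = np.zeros(n_code)
        self.W2 = rng.normal(0.0, 0.5, (n_code, n_in))
        self.b2 = np.zeros(n_in)
        self.lr = lr

    def encode(self, X):
        return sigmoid(X @ self.W1 + self.b1)

    def forward(self, X):
        H = self.encode(X)
        return H, sigmoid(H @ self.W2 + self.b2)

    def train_step(self, X):
        H, Y = self.forward(X)
        # backprop through both sigmoid layers (gradients scaled by batch size)
        dY = (Y - X) * Y * (1.0 - Y) / len(X)
        dH = (dY @ self.W2.T) * H * (1.0 - H)
        self.W2 -= self.lr * H.T @ dY
        self.b2 -= self.lr * dY.sum(axis=0)
        self.W1 -= self.lr * X.T @ dH
        self.b1 -= self.lr * dH.sum(axis=0)
        return float(((Y - X) ** 2).mean())
```

Since the output layer is sigmoid-activated, this expects the input features normalized into [0, 1], matching the preprocessing above; `encode` then yields the 3D embedding used for plotting and clustering.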
Using only a single encoding node for a linear ordering yields:
Again, two dimensions already show even more promising patterns:
With the plot again colored by genre – information the training itself never sees – the three-dimensional encoding produces:
Overall, the results look comparable to PCA at first glance. This is not overly surprising, though, given that the autoencoder implicitly optimizes for a similar goal.
Clustering
Even without clustering, the original features or any reduced dimension embedding can already serve as basis for recommendation systems, e.g., using local neighborhood search by simple euclidean distance. For grouping into different meta-genres, moods, or playlists, discrete cluster identifiers need to be assigned for each measurement.
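A local neighborhood lookup of this kind fits in a few lines; this sketch assumes the embedding is a plain numpy matrix with one row per song:

```python
import numpy as np

def nearest_songs(embeddings, index, k=5):
    """Indices of the k nearest songs (euclidean distance) to the
    song at `index`, excluding the song itself."""
    dists = np.linalg.norm(embeddings - embeddings[index], axis=1)
    order = np.argsort(dists)
    return order[order != index][:k]
```

Fed with either the raw 16D features or any reduced embedding, this already yields a “more like this” recommendation without any clustering step.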
Note that while previous plots seemingly already produced good clustering, the “external” coloring might have been misleading: the nearby values barely form the typical convex shapes of separated groups with varying density.
k-Means
For example, splitting the three-dimensional PCA space into 16 clusters using the well-established k-Means clustering method gives:
Note that these are the same plots from the encoding step, but colored by assigned cluster identifier and not genre.
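With scikit-learn, the assignment is a one-liner around KMeans (the explicit n_init and random_state are illustrative choices for reproducibility):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_playlists(embeddings, n_playlists=16):
    """Assign each song a discrete playlist id via k-Means."""
    kmeans = KMeans(n_clusters=n_playlists, n_init=10, random_state=0)
    return kmeans.fit_predict(embeddings)
```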
Hierarchical
Alternatively, Agglomerative as hierarchical method on the same data (targeting 16 clusters of the three-dimensional PCA) produces:
The advantage of hierarchical approaches is that they can be extended to use the original genres as additional categorical information, without resorting to one-hot or linear encoding. Also, irregular non-euclidean spaces are supported, as only pairwise distances are needed – for example introducing a per-category bias. This more “supervised” approach is not further explored here, though.
Evaluation Metrics
For clustering performance evaluation, several readily available metrics are depicted, namely: Adjusted Rand Index, Adjusted Mutual Information, Homogeneity, Completeness, and the Fowlkes-Mallows Index. However, exact reconstruction of the genre labeling was not the goal, but instead bringing similar songs with different genres together. A large overlap should be a good sign, still.
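All five metrics are readily available in sklearn.metrics and invariant to a permutation of the cluster ids, so a relabeled but otherwise perfect match still scores 1.0:

```python
from sklearn import metrics

def score_clustering(genres, labels):
    """Compare cluster assignments against genre tags using
    label-permutation-invariant metrics."""
    return {
        "ARI": metrics.adjusted_rand_score(genres, labels),
        "AMI": metrics.adjusted_mutual_info_score(genres, labels),
        "homogeneity": metrics.homogeneity_score(genres, labels),
        "completeness": metrics.completeness_score(genres, labels),
        "FMI": metrics.fowlkes_mallows_score(genres, labels),
    }
```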
Evaluated are the combinations of the two different clustering methods on each of the six encoded spaces: 3D PCA and autoencoder, 8D PCA and autoencoder, full 16D normalized autoencoder input features, and full 16D PCA input features scaled by deviation. The results from splitting into 2–20 different clusters are taken into account.
Eyeballing the results, k-Means on the eight-dimensional autoencoded values seems to perform best. The corresponding playlists (clusters) can be mapped to genre tags as follows:
- The generated playlists as end result seem not perfectly convincing, but could become viable after a few manual tweaks.
- The intermediate one-dimensional encodings can already serve as a crude linear playlist for themselves.
- The intermediate raw or embedded data as distance/similarity matrix could be used for recommendation systems, such as dynamic playlists for MPD.
- The supervised approach could be expanded on, making use of available categorical information such as genres, artists, or folders.
Installation & Usage
python3 -m venv venv
./venv/bin/pip install --require-virtualenv -U -r requirements.txt
./venv/bin/python3 -m mir_cluster_playlist --extract-in-playlist my-library.m3u --out-dir out/
usage: __main__.py [-h] --extract-in-playlist M3U --out-dir DIR [--debug]
[--extract-cache 0|1] [--extract-limit NUM] [--epoch-limit NUM]
Extract audio features and metadata, encode using PCA or Autoencoder, run clustering, and plot results.
options:
-h, --help show this help message and exit
--extract-in-playlist M3U main input of files to process (default: None)
--out-dir DIR output directory for various files (default: None)
--debug enable debug logging, extra plotting, and additional checks (default: False)
--concurrency NUM number of threads for audio processing (default: 1)
--extract-cache 0|1 force re-usage of audio feature extraction (default: auto)
--extract-limit NUM process at most this number of input audio files (default: None)
--epoch-limit NUM maximum number of autoencoder epochs, might stop earlier (default: 2000)