Music library clustering

Clustering a music library into playlists using only extracted audio features. Dimensionality-reduction and clustering methods such as PCA, autoencoders, and k-Means are evaluated.

Dynamic playlists, “radio modes”, or music recommendation systems often rely on listening habits or human annotation (artists, genres). Information retrieval on a huge listener database might not be feasible, though, and tag labels might be misleading. For example, the latter fall short when it comes to unusually “soft”/“hard” pieces of music from otherwise different artists/genres that are still perceived as similar.

This is an attempt to explore whether pure music information retrieval yields audio properties that correlate to a similar “mood”, while explicitly excluding metadata from the whole process.

Samples

For demonstration purposes, 50 random songs from 20 genres are picked as a dataset that does not push browser bandwidth and SVG rendering limits.

Input genre sorting and colorization

In the following, the genre information (as obtained by mutagen) is never encoded as feature dimension and only used for clustering evaluation and plotting.

Categorical Sorting

Note that the genre list is not semantically sorted – nearby colors are not necessarily “related”. This turns out not to be an easy problem to solve. (Side quest.)

Re-using the end result, or at least some string-based metric such as the Levenshtein distance or the ratio from difflib, would give a similarity matrix between genres. Finding the optimal ordering that minimizes the sum of individual distances can then be reduced to the Travelling Salesman Problem, which is (un-)surprisingly vastly out of scope: for example, 100 genres already result in about 2^524 permutations (100!) to try.

Thus, the local search solver from python-tsp is used as heuristic instead. This approach might however still be slightly better than assigning random (hash-based) or sorted/discrete (predefined colormap) colors.
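The idea can be sketched with the standard library alone. The genre names below are hypothetical, and a greedy nearest-neighbour tour stands in for the python-tsp local-search solver used in the actual code:

```python
import difflib

# Hypothetical genre list; the real input comes from the library's tags.
genres = ["Rock", "Hard Rock", "Pop Rock", "Pop", "Synthpop", "Techno"]

# Pairwise string distance: 1 minus the difflib similarity ratio.
n = len(genres)
dist = [[1.0 - difflib.SequenceMatcher(None, a, b).ratio() for b in genres]
        for a in genres]

# Greedy nearest-neighbour tour as a simple stand-in heuristic:
# always hop to the closest not-yet-visited genre.
order = [0]
while len(order) < n:
    last = order[-1]
    nxt = min((i for i in range(n) if i not in order),
              key=lambda i: dist[last][i])
    order.append(nxt)

print([genres[i] for i in order])
```

The resulting permutation can then be mapped onto a continuous colormap so that string-similar genres receive similar colors.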

Feature Extraction

Various audio features are extracted using librosa. In order to avoid encoding time series, each file is assigned single values per feature, namely the overall average, median, and standard deviation – after clamping the 1% most extreme outliers. For example, the spectral centroid, sorted by median and colored by genre:

Per-file centroids, sorted by median
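The per-file summarization step can be sketched in NumPy; the time series below is a random stand-in for an actual per-frame centroid, and clamping at the 1st/99th percentiles is an assumption about how the 1% outliers are handled:

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-in for a per-frame spectral-centroid time series of one file.
centroid = rng.normal(2000.0, 400.0, size=4096)

# Clamp the most extreme 1% of values on both ends before summarizing.
lo, hi = np.percentile(centroid, [1, 99])
clamped = np.clip(centroid, lo, hi)

# One scalar triple per feature and file instead of a full time series.
features = {
    "mean": float(clamped.mean()),
    "median": float(np.median(clamped)),
    "std": float(clamped.std()),
}
print(features)
```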

Especially when re-sorting by genre, a pattern indeed seems to emerge that could help tell them apart:

Per-file centroids, sorted by genre

Centroid
The mean frequency of the magnitude spectrogram. As there is little sense in averaging over per-frame frequencies, the results are converted to a mel scale, which behaves more linearly as it represents the “perceived” frequency rather than a logarithmic measurement.
Bandwidth
Spectral bandwidth as a measure of how narrow or wide the signal is around the centroid, or how much this varies over time.
Flatness
Spectral flatness as additional measure of noisiness.
Zero Crossings
How often the signal crosses zero, i.e., flips signs.
Roll-On/-Off
Where the lowest 10% and top 10% energy is contained.
Spectrum
The absolute magnitude for six Bark band bins (100, 510, 1080, 2000, 3700, and 15500 Hz).
Tempogram
The overall detected BPM value does not seem to tell much. Instead, something like the “flatness” of the self-recurrence beat detection is used, which indicates a more varying or constant tempo – a simpler replacement for high frequencies in the linear tempogram itself.
HPSS
RMS difference of the decomposition into harmonic and percussive components.
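The Bark-band binning from the spectrum feature can be sketched as follows, assuming the listed frequencies are the upper bin edges and using a random stand-in magnitude spectrum:

```python
import numpy as np

# Assumed upper edges (Hz) of the six coarse Bark-scale bins from the text.
edges = np.array([0, 100, 510, 1080, 2000, 3700, 15500])

sr, n_fft = 44100, 2048
freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)   # FFT bin centre frequencies
rng = np.random.default_rng(0)
magnitude = rng.random(freqs.shape)          # stand-in magnitude spectrum

# Sum absolute magnitudes per Bark bin; lines above 15500 Hz are dropped.
bins = np.digitize(freqs, edges) - 1         # bin index per FFT line
bark = np.array([magnitude[bins == i].sum() for i in range(6)])
print(bark)
```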

As feature vector, the 16 most promising values are picked. After independent normalization, the full matrix of all features for all input songs can be visualized as:

16 features per input file, sorted by genre

Note that some values are negated: flipping a dimension, and thereby simply moving to another quadrant, should not matter for encoding or clustering. However, this avoids negative PCA factors later on, which could be interpreted as “contradicting axes”.

Encoding

Treating each feature as independent dimension is very likely to skew results due to redundancy, correlation, or differing variance. Also, the contained information (or entropy) in the dataset is most likely less than the full 16D input. Such dependencies inside the dataset can be accounted for by extracting and encoding the most significant information bits instead.

Principal Component Analysis

The well-defined PCA serves as a baseline for the dimensionality reduction from 16D into 1D, 2D, 3D, and 8D.

Here, further scaling is recommended: the input features additionally pass StandardScaler adjustments, which center each feature at zero and scale it to unit variance.
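The standardization and decomposition can be sketched in plain NumPy via SVD (the actual code presumably uses sklearn's StandardScaler and PCA; the input matrix here is a random stand-in with one deliberately correlated column):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 16))        # stand-in for 16 features per song
X[:, 1] = X[:, 0] * 2 + rng.normal(scale=0.1, size=1000)  # correlated column

# StandardScaler equivalent: centre each feature, divide by its std.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD of the standardized matrix.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
explained = S**2 / (S**2).sum()        # explained-variance ratio per component
coords3 = Z @ Vt[:3].T                 # 3D embedding as used for the plots

print(explained[:3], coords3.shape)
```

The cumulative sum of `explained` over the first components is what the variance-coverage figures below refer to.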

One-dimensional PCA, colored by genre

Sorting by the resulting major component already shows some grouping by genre. This one-dimensional ordering could even serve as a crude linear playlist.

Two dimensions reveal even more interesting clusters (same data, flipped axes for different aspects and histograms):

Two-dimensional PCA, colored by genre

Plotting the first three dimensions, again from different angles, yields:

Three-dimensional PCA, colored by genre

While these three components together cover 65% of the original variance, the (not shown) 8D representation usually accounts for over 90%.

Autoencoder

As an alternative to PCA, the normalized feature vectors are fed into an autoencoder. One of the most straightforward 16-to-3-dimensional neural networks (consisting of sigmoid-activated dense layers) would look like:

Typical three-dimensional autoencoder layout
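Such a minimal 16→3→16 sigmoid autoencoder can be sketched in plain NumPy (the actual implementation presumably uses a deep-learning framework; the training data here is a random stand-in):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(X, W1, b1, W2, b2):
    H = sigmoid(X @ W1 + b1)
    return float(((sigmoid(H @ W2 + b2) - X) ** 2).mean())

rng = np.random.default_rng(0)
X = rng.random((200, 16))              # stand-in for normalized feature vectors

W1 = rng.normal(0, 0.5, (16, 3)); b1 = np.zeros(3)    # encoder weights
W2 = rng.normal(0, 0.5, (3, 16)); b2 = np.zeros(16)   # decoder weights

loss_before = loss(X, W1, b1, W2, b2)
lr = 0.5
for _ in range(2000):
    H = sigmoid(X @ W1 + b1)           # 3D bottleneck activations
    Y = sigmoid(H @ W2 + b2)           # reconstruction
    dY = (Y - X) * Y * (1 - Y)         # backprop through output sigmoid
    dH = (dY @ W2.T) * H * (1 - H)     # backprop through bottleneck
    W2 -= lr * H.T @ dY / len(X); b2 -= lr * dY.mean(axis=0)
    W1 -= lr * X.T @ dH / len(X); b1 -= lr * dH.mean(axis=0)

codes = sigmoid(X @ W1 + b1)           # per-song 3D embedding
print(loss_before, loss(X, W1, b1, W2, b2))
```

The bottleneck activations `codes` take the role of the PCA coordinates in the plots below.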

Using only a single encoding node for a linear ordering yields:

One-dimensional autoencoded values, colored by genre

Again, two dimensions already show even more promising patterns:

Two-dimensional autoencoded values, colored by genre

With the same genre coloring, to which the training itself is oblivious, training for three dimensions produces:

Three-dimensional autoencoded values, colored by genre

Overall, the results look comparable to PCA at first glance. This is not overly surprising, though, given that the autoencoder implicitly optimizes for a similar goal.

Clustering

Even without clustering, the original features or any reduced dimension embedding can already serve as basis for recommendation systems, e.g., using local neighborhood search by simple euclidean distance. For grouping into different meta-genres, moods, or playlists, discrete cluster identifiers need to be assigned for each measurement.
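Such a neighborhood-based recommendation can be sketched as a one-liner over any of the embeddings; the 3D coordinates below are a random stand-in:

```python
import numpy as np

rng = np.random.default_rng(7)
embedding = rng.normal(size=(1000, 3))   # e.g. the 3D PCA coordinates per song

def recommend(song_idx, k=5):
    """Return the k nearest songs by Euclidean distance, excluding itself."""
    d = np.linalg.norm(embedding - embedding[song_idx], axis=1)
    d[song_idx] = np.inf                 # never recommend the query song
    return np.argsort(d)[:k]

print(recommend(0))
```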

Note that while previous plots seemingly already produced good clustering, the “external” coloring might have been misleading: the nearby values barely form the typical well-separated convex shapes, and their density varies.

k-Means

For example, splitting the three-dimensional PCA space into 16 clusters using the well-established k-Means clustering method gives:

16 k-Means clusters on 3D PCA

Note that these are the same plots from the encoding step, but colored by assigned cluster identifier and not genre.
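The clustering step can be sketched as plain Lloyd's-algorithm k-Means in NumPy (sklearn's KMeans would be the practical choice; the 3D points here are a random stand-in for the PCA embedding):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal Lloyd's algorithm: assign to nearest centre, recompute centres."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centre.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # Move each centre to the mean of its points (keep empty clusters put).
        new = np.array([X[labels == i].mean(0) if np.any(labels == i)
                        else centers[i] for i in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3))            # stand-in for the 3D PCA embedding
labels, centers = kmeans(X, k=16)
print(np.bincount(labels, minlength=16))
```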

Hierarchical

Alternatively, agglomerative clustering as a hierarchical method on the same data (targeting 16 clusters of the three-dimensional PCA) produces:

16 agglomerative clusters on 3D PCA

The advantage of hierarchical approaches is that they can be extended to use the original genres as additional categorical information, without resorting to one-hot or linear encoding. Also, irregular non-euclidean spaces are supported, as only pairwise distances are needed – for example introducing a per-category bias. This more “supervised” approach is not further explored here, though.

Evaluation Metrics

For clustering performance evaluation, several readily available metrics are depicted, namely: Adjusted Rand Index, Adjusted Mutual Information, Homogeneity, Completeness, and the Fowlkes-Mallows Index. However, exact reconstruction of the genre labeling was not the goal, but instead bringing similar songs with different genres together. A large overlap should be a good sign, still.

Evaluated are the combinations of the two different clustering methods on each of the six encoded spaces: 3D PCA and autoencoder, 8D PCA and autoencoder, full 16D normalized autoencoder input features, and full 16D PCA input features scaled by deviation. The results from splitting into 2–20 different clusters are taken into account.

Evaluation metrics for 2-20 clusters using different methods

By eyeballing, it seems that k-Means on the eight-dimensional autoencoded values performs best. The corresponding playlists (clusters) can be mapped to genre tags as follows:

Genre distribution in selected clusters

Installation & Usage

python3 -m venv venv
./venv/bin/pip install --require-virtualenv -U -r requirements.txt
./venv/bin/python3 -m mir_cluster_playlist --extract-in-playlist my-library.m3u --out-dir out/
usage: __main__.py [-h] --extract-in-playlist M3U --out-dir DIR [--debug]
                   [--concurrency NUM] [--extract-cache 0|1]
                   [--extract-limit NUM] [--epoch-limit NUM]

Extract audio features and metadata, encode using PCA or Autoencoder, run clustering, and plot results.

options:
  -h, --help                 show this help message and exit
  --extract-in-playlist M3U  main input of files to process (default: None)
  --out-dir DIR              output directory for various files (default: None)
  --debug                    enable debug logging, extra plotting, and additional checks (default: False)
  --concurrency NUM          number of threads for audio processing (default: 1)
  --extract-cache 0|1        force re-usage of audio feature extraction (default: auto)
  --extract-limit NUM        process at most this number of input audio files (default: None)
  --epoch-limit NUM          maximum number of autoencoder epochs, might stop earlier (default: 2000)

Example output screenshot

Code & Download