Controllable Embedding Transformation
for Mood-Guided Music Retrieval


Abstract

Music representations are the backbone of modern recommendation systems, powering playlist generation, similarity search, and personalized discovery. Yet most embeddings offer little control for adjusting a single musical attribute, e.g., changing only the mood of a track while preserving its genre or instrumentation. In this work, we address the problem of controllable music retrieval through embedding-based transformation, where the objective is to retrieve songs that remain similar to a seed track but are modified along one chosen dimension. We propose a novel framework for mood-guided music embedding transformation, which learns a mapping from a seed audio embedding to a target embedding guided by mood labels, while preserving other musical attributes. Because mood cannot be directly altered in the seed audio, we introduce a sampling mechanism that retrieves proxy targets to balance diversity with similarity to the seed. We train a lightweight translation model using this sampling strategy and introduce a novel joint objective that encourages transformation and information preservation. Extensive experiments on two datasets show strong mood transformation performance while retaining genre and instrumentation far better than training-free baselines, establishing controllable embedding transformation as a promising paradigm for personalized music retrieval.

Music representations, audio embeddings, embedding transformation, music retrieval, music recommendation

1 Introduction↩︎

Music consumption behavior on streaming platforms can range from passive background listening to active playlist creation and explicit recommendation feedback [1]. A promising direction within this continuum targets the discovery of music that shares many of the underlying musical properties of a seed track (or tracks) but differs in one or two targeted dimensions. While most modern music recommendation systems are built upon learned music representations that capture high-level semantic properties and enable efficient retrieval across millions of tracks [1]–[4], these representations lack mechanisms for fine-grained control over specific properties such as mood or genre. In this study, we focus specifically on musical mood, addressing a novel use case that enables listeners to guide the retrieval of content that is, e.g., “similar, but happier,” or “similar, but more energetic”.

Music style transfer has been well explored in generative frameworks, where models learn to alter a song given a guidance signal (e.g., emotion or instrumentation) while preserving core musical content [5]–[8]. These approaches are useful for small-scale creative applications [9], [10], but less so in a large-scale retrieval setting due to the costly computation associated with generating new audio. Recent works have begun to bridge the gap between generative and representation learning approaches for style transfer and retrieval tasks: [11] leverages an audio-text embedding space to manipulate audio effects using natural language prompts, and [12] uses diffusion to generate audio queries conditioned on text for text-music retrieval.

There is little prior work on manipulating music embeddings in an audio-only latent space for semantically guided retrieval tasks. Disentanglement-based approaches learn separate latent subspaces for distinct musical attributes to use in targeted downstream applications [13]–[16], but do not allow for manipulation of an input audio embedding along a specific axis within these subspaces. The most closely related work is [17], where the model learns a tempo-controlled latent transformation that can be used to retrieve tracks that are similar but of a different tempo. In this approach, translated target embeddings are generated by directly modifying the tempo of the input audio signal. While this demonstrates the potential of embedding space manipulations for guided musical retrieval, for more abstract musical attributes such as mood, directly transforming the input signal to generate positive training pairs is non-trivial.

To this end, in this work we introduce the task of mood-guided embedding transformation for music retrieval. Our contributions are:

  • We present a controllable music embedding transformation framework that translates a seed track to an embedding aligned with a target mood, enabling track retrieval in the new mood while preserving other musical qualities (e.g., genre and instrumentation).

  • We introduce a nearest-neighbor sampling scheme that yields seed–target pairs differing in mood but otherwise similar, for cases where signal-based augmentations are infeasible.

  • We design complementary loss functions for training a lightweight transformation model and provide an extensive ablation study highlighting their individual contributions.

  • We show that our approach significantly outperforms baselines in mood transformation and information preservation on both a large-scale proprietary dataset and MTG-Jamendo [18].

Figure 1: A MULE embedding of a seed song is transformed, guided by a target mood label, into an embedding that can retrieve similar songs modified along the mood dimension. We evaluate in terms of mood transformation and seed information preservation.

2 Method↩︎

We propose a novel framework for controllable music embedding transformation. The goal of our system is to learn a transformation purely in the embedding space that shifts a single, controllable attribute of an input audio track while preserving other musical attributes. We use mood as the transformation attribute and genre and instrumentation as measurable attributes that should be preserved. We use “seed” throughout to refer to the original song and its associated embedding and musical attributes, and “target” to refer to the desired mood and an associated song embedding used as the transformation goal. We operate within the open-source MULE [19] embedding space for this study, which has shown state-of-the-art self-supervised performance on musical representation learning tasks, including mood and genre prediction, and thus provides a suitable foundation for embedding-based transformation.

2.1 Mood-Guided Embedding Transformation Framework↩︎

Our method learns a mapping \(f(\cdot)\) from a MULE embedding of a seed audio track to the embedding of a target audio track of a different mood, while preserving attributes besides mood in the seed track. To this end, we must select target embeddings that represent a mood-adjusted version of the seed. Because we cannot directly manipulate the mood of the input track to generate the true mood-adjusted embedding, we introduce a novel data sampling mechanism to retrieve proxy target embeddings to use in training:

Let \(\mathbf{x_s} \in \mathbb{R}^d\) denote the MULE embedding of a seed audio track with mood label \(\mathbf{y_s} \in \mathcal{M}^{m}\), where \(d\) is the embedding dimension, \(\mathcal{M}\) is the set of possible moods, and \(m\) is the dimension of the one-hot mood label vectors. For every seed, we compute the top-\(100\) most similar tracks per mood via cosine similarity in the MULE embedding space. We store this mapping and use it to draw target tracks at training time. For each seed \(\mathbf{x_s}\), we first select a target mood \(\mathbf{y_t} \in \mathcal{M}^m\) at random. Given \(\mathbf{y_t}\) and \(\mathbf{x_s}\), we sample a target track embedding \(\mathbf{x_t}\) from the similarity map, drawing uniformly at random from the seed's top-\(100\) most similar tracks of that mood to maximize the diversity of included tracks and to avoid embedding hubs. When the seed and target moods match, the seed embedding is used as the target, incentivizing the model to learn an identity mapping.
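A minimal sketch of this sampling mechanism is given below, assuming L2-normalized MULE embeddings and integer mood labels stored in NumPy arrays; the function and variable names are ours, and a production system would replace the brute-force similarity computation with an approximate nearest-neighbor index.

    import numpy as np

    def build_similarity_map(embeddings, moods, k=100):
        """For each seed, store the top-k most similar tracks per mood.

        embeddings: (N, d) L2-normalized MULE embeddings.
        moods:      (N,) integer mood labels.
        Returns sim_map[seed_idx][mood] -> array of up to k candidate indices.
        """
        sims = embeddings @ embeddings.T          # cosine similarity (inputs normalized)
        np.fill_diagonal(sims, -np.inf)           # exclude the seed itself
        sim_map = {}
        for s in range(len(embeddings)):
            sim_map[s] = {}
            for m in np.unique(moods):
                cand = np.where(moods == m)[0]
                sim_map[s][m] = cand[np.argsort(-sims[s, cand])][:k]
        return sim_map

    def sample_training_pair(seed_idx, seed_mood, sim_map, moods, rng):
        """Pick a target mood at random; return (target_idx, target_mood)."""
        target_mood = rng.choice(np.unique(moods))
        if target_mood == seed_mood:
            return seed_idx, target_mood          # identity-mapping case: seed is its own target
        return rng.choice(sim_map[seed_idx][target_mood]), target_mood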

The full training framework is shown in Figure 1. We use the sampling mechanism to gather three inputs to the model for training: (1) the seed track embedding (\(\mathbf{x_s}\)), (2) the seed mood (\(\mathbf{y_s}\)), encoded as a one-hot label vector, and (3) the target mood (\(\mathbf{y_t}\)), similarly encoded. The MULE embeddings (\(\mathbf{x_s}\), \(\mathbf{x_t}\), \(\mathbf{\hat{x}_t}\)) have dimension \(d\) = \(1728\), and the one-hot encoded mood label vectors have dimension \(m\) = \(4\). The seed embedding is projected through an MLP (\(p_s(\cdot)\)) to a \(512\)-dimensional vector. As a guidance signal, we subtract the seed mood vector from the target mood vector and project the difference through a separate MLP \(p_y(\cdot)\) to a \(128\)-dimensional vector; this increases the capacity of the guidance signal so that the dimensionality of the seed embedding does not dominate. The projected seed and label-difference vectors are concatenated and passed through \(p_f(\cdot)\) to map back to the target dimension of \(1728\). The model output is the transformed embedding, \(\mathbf{\hat{x}_t} = f(\mathbf{x_s}, \mathbf{y_s}, \mathbf{y_t})\), which is compared to the target mood embedding (\(\mathbf{x_t}\)) through a joint objective described next.

2.2 Objective Design↩︎

To ensure that the transformed embedding is aligned with the target mood while preserving properties of the seed, we use three complementary loss terms. Cosine similarity: we encourage the predicted embedding \(\mathbf{\hat{x}_t}\) to be similar to the target embedding \(\mathbf{x_t}\) by maximizing their cosine similarity, thus minimizing \(1\) minus this quantity:

\[\mathcal{L}_{\text{cosine}} = \frac{1}{B}\sum_{i=1}^{B}\left(1-\cos(\hat{\mathbf{x}}_t^{(i)}, \mathbf{x}_t^{(i)})\right),\] where \(B\) is the batch size and \(\hat{\mathbf{x}}_t\) and \(\mathbf{x}_t\) are normalized.

Triplet loss: to ensure that the transformed embedding is close to the target but distinct from the seed embedding, we use a triplet-style hinge loss [14], [20]. For each sample, the transformed embedding \(\hat{\mathbf{x}}_t\) is the “anchor”, the target embedding \(\mathbf{x}_t\) is the positive, and the seed embedding \(\mathbf{x}_s\) is the negative:

\[\mathcal{L}_{\text{triplet}} = \frac{1}{B} \sum_{i=1}^B \max \Bigg( 0, \; \alpha + \cos(\hat{\mathbf{x}}_t^{(i)}, \mathbf{x}_s^{(i)}) - \cos(\hat{\mathbf{x}}_t^{(i)}, \mathbf{x}_t^{(i)}) \Bigg),\] where \(\alpha\) is a margin hyperparameter. This loss penalizes similarity between the transformed and seed embeddings while rewarding similarity between the transformed and target embeddings, thereby encouraging movement (i.e., transformation) in the embedding space.

Cosine BCE: we use a contrastive-style loss [21] that emphasizes the difference between transformations in which the seed and target moods differ vs. match. This is a more nuanced version of the vanilla \(\mathcal{L}_{\text{cosine}}\) above. If \(\mathbf{y_s} = \mathbf{y_t}\), the similarity between the predicted and target embedding should be \(1\), strongly enforcing the identity mapping. If \(\mathbf{y_s} \neq \mathbf{y_t}\), we allow the predicted and target embedding more freedom by setting the target signal to \(0.5\):

\[\mathcal{L}_{\text{cosBCE}} = \frac{1}{B} \sum_{i=1}^{B} \text{BCE}\Big( \sigma \big( \gamma \cdot \cos(\hat{\mathbf{x}}_t^{(i)}, \mathbf{x}_t^{(i)}) \big), \; t^{(i)} \Big),\] where \(\sigma\) is the sigmoid function, \(\gamma\) is a scaling factor, and \(t^{(i)} \in \{1, 0.5 \}\) is the target signal depending on mood match.

The full training objective is the weighted sum of the above: \(\mathcal{L} = \lambda_{\text{cosine}} \, \mathcal{L}_{\text{cosine}} \;+\; \lambda_{\text{triplet}} \, \mathcal{L}_{\text{triplet}} \;+\; \lambda_{\text{cosBCE}} \, \mathcal{L}_{\text{cosBCE}}\).
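A minimal PyTorch sketch of this joint objective is shown below, with the \(\lambda\) weights, \(\alpha\), and \(\gamma\) set to the values of our final configuration (Sec. 3.2); the function name and the boolean same_mood mask are illustrative choices on our part.

    import torch
    import torch.nn.functional as F

    def joint_objective(x_hat, x_t, x_s, same_mood, alpha=0.3, gamma=3.0,
                        w_cos=1.0, w_tri=1.0, w_bce=1.0):
        """Weighted sum of the cosine, triplet, and cosine-BCE terms.

        x_hat, x_t, x_s: (B, d) transformed, target, and seed embeddings.
        same_mood:       (B,) bool tensor, True where seed and target moods match.
        """
        x_hat_n = F.normalize(x_hat, dim=-1)
        x_t_n = F.normalize(x_t, dim=-1)
        x_s_n = F.normalize(x_s, dim=-1)
        cos_ht = (x_hat_n * x_t_n).sum(-1)        # cos(x_hat_t, x_t)
        cos_hs = (x_hat_n * x_s_n).sum(-1)        # cos(x_hat_t, x_s)

        l_cosine = (1.0 - cos_ht).mean()
        l_triplet = torch.clamp(alpha + cos_hs - cos_ht, min=0.0).mean()

        # Cosine BCE: target 1.0 when seed and target moods match, 0.5 otherwise.
        bce_target = torch.where(same_mood,
                                 torch.ones_like(cos_ht),
                                 torch.full_like(cos_ht, 0.5))
        l_cosbce = F.binary_cross_entropy(torch.sigmoid(gamma * cos_ht), bce_target)

        return w_cos * l_cosine + w_tri * l_triplet + w_bce * l_cosbce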

3 Experimental Design↩︎

3.1 Datasets↩︎

We use a large-scale proprietary music dataset for our study that contains \(1.3\) M songs with high-quality mood and genre annotations. This dataset contains songs from a set of four moods pertaining to high/low energy and positive/negative sentiment, which approximately align with the principal axes of Russell’s valence-arousal model [22] rather than the more widely used quadrants. The dataset also has genre annotations spanning \(20\) classes. As a secondary, publicly available dataset, we use MTG-Jamendo [18]. In the absence of an exact match in mood taxonomies, we use the subset of the best-matching mood labels, namely “energetic”, “calm”, “happy”, and “sad”, totaling \(4\) k full-length songs. MTG-Jamendo provides good-quality audio recordings, but its annotations are derived from user tags, which are inherently noisy. In addition to mood and \(94\)-class genre labels, MTG-Jamendo has multi-label instrumentation labels across \(40\) instrument categories, allowing us to examine an additional property of the transformed embeddings, independent of mood and genre.

For both datasets, if tracks are multi-labeled in mood or genre, we choose a single label at random per track. For the training splits, we enforce disjointness at the artist level to prevent leakage and stratify by mood across splits, using an \(80/10/10\) training/validation/test split. We compute the MULE [19] embeddings as per the open-source implementation, taking \(3\) s windows, each comprising \(300\) frames of a \(96\)-band Mel-spectrogram, with window centers spaced \(2\) s apart, resulting in overlapping context windows. The resulting embedding timeline is then averaged to form a single embedding per song. For more details on the audio analysis parameters, see [19] and its accompanying codebase.
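A minimal sketch of the artist-disjoint splitting and per-song embedding pooling is given below, assuming NumPy arrays of track and artist identifiers; the helper names are ours, and mood stratification would be verified (or the split re-drawn) in a fuller implementation.

    import numpy as np
    from sklearn.model_selection import GroupShuffleSplit

    def artist_disjoint_split(track_ids, artist_ids, seed=0):
        """Approximate 80/10/10 artist-disjoint split (NumPy arrays in, index arrays out)."""
        gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
        train_idx, heldout_idx = next(gss.split(track_ids, groups=artist_ids))
        gss2 = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=seed)
        val_rel, test_rel = next(gss2.split(heldout_idx, groups=artist_ids[heldout_idx]))
        return train_idx, heldout_idx[val_rel], heldout_idx[test_rel]

    def song_embedding(window_embeddings):
        """Average the MULE window timeline (T, 1728) into one embedding per song."""
        return window_embeddings.mean(axis=0)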

3.2 Training Setup↩︎

The projection modules used in training are 2-layer MLPs with ReLU activations. The seed projector \(p_s(\cdot)\) has a hidden layer of dimension \(1024\) and an output layer of dimension \(512\), and the label projector \(p_y(\cdot)\) has a hidden dimension of \(64\) and an output dimension of \(128\). For both \(p_s\) and \(p_y\), we apply dropout after the hidden layer with rates of \(0.3\) and \(0.4\), respectively. The final projector \(p_f(\cdot)\), applied to the concatenation of \(p_s(x_s)\) and \(p_y(y_t - y_s)\), consists of a single linear layer with dropout \(0.3\) before the output projection back to the original MULE dimensionality of \(1728\).
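For concreteness, a PyTorch sketch of this architecture with the stated dimensions follows; the class name and the exact placement of activations and dropout are our assumptions where the text leaves them unspecified.

    import torch
    import torch.nn as nn

    class MoodTransform(nn.Module):
        """f(x_s, y_s, y_t): transform a seed MULE embedding toward a target mood."""

        def __init__(self, d=1728, m=4):
            super().__init__()
            # Seed projector p_s: 1728 -> 1024 -> 512, dropout 0.3 after the hidden layer.
            self.p_s = nn.Sequential(nn.Linear(d, 1024), nn.ReLU(), nn.Dropout(0.3),
                                     nn.Linear(1024, 512))
            # Label projector p_y on (y_t - y_s): 4 -> 64 -> 128, dropout 0.4.
            self.p_y = nn.Sequential(nn.Linear(m, 64), nn.ReLU(), nn.Dropout(0.4),
                                     nn.Linear(64, 128))
            # Final projector p_f: concat(512 + 128) -> 1728, dropout 0.3 before the output.
            self.p_f = nn.Sequential(nn.Dropout(0.3), nn.Linear(512 + 128, d))

        def forward(self, x_s, y_s, y_t):
            z = torch.cat([self.p_s(x_s), self.p_y(y_t - y_s)], dim=-1)
            return self.p_f(z)                    # transformed embedding x_hat_t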

To understand the contribution of each loss component and select the optimal training configuration, we conduct an ablation study across both datasets for the three objectives described in Sec. 2.2. We perform full training and evaluation runs on both datasets for each loss term individually and in combination with each other term, setting each \(\lambda\) weight to \(1\) if the corresponding loss is included and \(0\) otherwise. We choose the best loss configuration based on a weighted average of the validation mood and genre evaluation metrics described in Sec. 3.3, placing a slightly higher weight on the mood component, and use this as our best model per dataset. In our final training configuration all \(\lambda\) weights are set to \(1\), \(\alpha = 0.3\) in \(\mathcal{L}_{\text{triplet}}\), and \(\gamma=3\) in \(\mathcal{L}_{\text{cosBCE}}\).

Models on the large-scale dataset are trained for \(100\) epochs with a batch size of \(1024\), AdamW optimization, and a linearly scheduled learning rate of \(1e{-5}\). MTG-Jamendo models are trained for \(500\) epochs with a batch size of \(1024\), AdamW, and a linearly scheduled learning rate of \(5e{-4}\). Learning rates are determined through hyperparameter tuning during the loss ablation. For the MTG-Jamendo models, due to the small scale of the dataset, we use \(3\)-fold cross-validation (random \(8\):\(1\):\(1\) splits within each fold) and report metrics averaged over the fold test sets for all results, including baselines. We perform model selection based on the best validation mood precision score.
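A minimal training loop matching this setup is sketched below, reusing the joint_objective sketch from Sec. 2.2; the linear decay-to-zero schedule and the batch layout yielded by the loader are assumptions on our part.

    import torch

    def train(model, loader, epochs=100, lr=1e-5):
        """AdamW with a linear learning-rate schedule (decay to zero is our assumption)."""
        opt = torch.optim.AdamW(model.parameters(), lr=lr)
        sched = torch.optim.lr_scheduler.LinearLR(opt, start_factor=1.0,
                                                  end_factor=0.0, total_iters=epochs)
        model.train()
        for _ in range(epochs):
            for x_s, y_s, y_t, x_t, same_mood in loader:   # assumed batch layout
                loss = joint_objective(model(x_s, y_s, y_t), x_t, x_s, same_mood)
                opt.zero_grad()
                loss.backward()
                opt.step()
            sched.step()
        return model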

3.3 Evaluation Metrics↩︎

The two core qualities we measure are (1) mood transformation and (2) seed information preservation, evaluated via genre and instrumentation in this study. Because the transformed embedding has no associated labels, we examine the labels of its nearest-neighbor embeddings. For mood transformation, we compare the mood label of the transformed embedding’s nearest neighbor to the target mood label and measure precision at 1, denoted Mood P@1. For genre preservation, we compare the nearest-neighbor genre label of the transformed embedding to the seed genre label to measure seed-genre precision at 1, denoted Genre P@1. For the multi-label instrumentation consistency evaluation, we compute the Jaccard score between the nearest-neighbor instrumentation labels of the transformed embedding and the seed labels, denoted Inst. J@1.
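The sketch below illustrates how these nearest-neighbor metrics could be computed for a single transformed embedding, assuming L2-normalized catalog embeddings and binary multi-hot instrument vectors; how the retrieval catalog is constructed (e.g., whether the seed track itself is excluded) is left unspecified here.

    import numpy as np

    def nn_metrics(x_hat, catalog_emb, catalog_mood, catalog_genre, catalog_inst,
                   target_mood, seed_genre, seed_inst):
        """Nearest-neighbor metrics for one transformed embedding (embeddings L2-normalized)."""
        nn_idx = int(np.argmax(catalog_emb @ x_hat))       # cosine nearest neighbor
        mood_p1 = float(catalog_mood[nn_idx] == target_mood)
        genre_p1 = float(catalog_genre[nn_idx] == seed_genre)
        inter = np.logical_and(catalog_inst[nn_idx], seed_inst).sum()
        union = np.logical_or(catalog_inst[nn_idx], seed_inst).sum()
        inst_j1 = inter / union if union > 0 else 1.0      # Jaccard over multi-hot labels
        return mood_p1, genre_p1, inst_j1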

Figure 2: Loss ablation using the large-scale dataset and MTG-Jamendo. Results are shown as differences from the random baseline, expressed in percentage points (pp), i.e., metric differences scaled by 100. For both datasets, the random baseline for Mood P@1 is 0.25. For Genre P@1, the baseline is 0.05 for the large-scale dataset and 0.011 for MTG-Jamendo.


Table 1: Main results on the large-scale dataset and MTG-Jamendo against baselines. We measure Mood P@1 to evaluate mood transformation and Genre P@1 and Inst. J@1 to ensure information from the seed embedding is preserved through the mood transformation.
Large-scale Dataset | Mood P@1 \((N=4)\) | Genre P@1 \((N=20)\)
Ours: \(\mathcal{L}_{\text{cosine}} + \mathcal{L}_{\text{triplet}} + \mathcal{L}_{\text{cosBCE}}\) | 0.96 | 0.32
Baseline: Random | 0.25 | 0.05
Baseline: Average mood vector | 1.0 | 0.10
Baseline: Oracle (top-\(1\)) | 1.0 | 0.54
Baseline: Oracle (top-\(100\)) | 1.0 | 0.38

MTG-Jamendo [18] | Mood P@1 \((N=4)\) | Genre P@1 \((N=94)\) | Inst. J@1 \((N=40)\)
Ours: \(\mathcal{L}_{\text{cosine}} + \mathcal{L}_{\text{triplet}} + \mathcal{L}_{\text{cosBCE}}\) | 0.83 | 0.29 | 0.45
Zero-shot from large-scale model | 0.66 | 0.30 | 0.45
Fine-tuned on large-scale model | 0.68 | 0.29 | 0.41
Baseline: Random | 0.25 | 0.01 | 0.04
Baseline: Average mood vector | 0.82 | 0.05 | 0.18
Baseline: Oracle (top-\(1\)) | 1.0 | 0.16 | 0.48
Baseline: Oracle (top-\(100\)) | 1.0 | 0.07 | 0.24

3.4 Baseline Methods↩︎

We compare our system against the following training-free methods:

Random: as a lower bound, we provide a random baseline for Mood P@1 and Genre P@1 based on class cardinality. The Mood P@1 baseline is the same for both datasets, as both contain \(4\) moods, but MTG-Jamendo has a larger genre taxonomy than the large-scale dataset, so its genre baseline is lower. We compute the random baseline for the multi-label instrumentation Jaccard score in expectation, using an average of \(2.77\) labels per sample in a Bernoulli formulation.
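As a sanity check on that expectation, the sketch below estimates the expected Jaccard score by Monte Carlo under the same Bernoulli assumption (each of the \(40\) instrument classes active independently with probability \(2.77/40\)); this is an illustrative estimator, not the exact analytical computation used here.

    import numpy as np

    def random_jaccard_estimate(n_classes=40, mean_labels=2.77, trials=100_000, seed=0):
        """Monte Carlo estimate of E[Jaccard] between two random multi-hot vectors,
        each class drawn i.i.d. Bernoulli(mean_labels / n_classes)."""
        rng = np.random.default_rng(seed)
        p = mean_labels / n_classes
        a = rng.random((trials, n_classes)) < p
        b = rng.random((trials, n_classes)) < p
        inter = np.logical_and(a, b).sum(axis=1)
        union = np.logical_or(a, b).sum(axis=1)
        valid = union > 0                        # ignore trials where both label sets are empty
        return (inter[valid] / union[valid]).mean()   # lands near the random baseline in Table 1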

Average mood embedding: we compute the average vector (i.e., centroid) per mood using all embeddings in the test set. Given a seed embedding and a target mood, we use the target-mood average vector as our “generated” embedding for evaluation. By design, this baseline should be near-perfect for “transformed” Mood P@1, but it lacks any notion of preserving seed information, so its genre and instrumentation scores should be low.
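A short sketch of this baseline follows, assuming test-set embeddings and integer mood labels in NumPy arrays; the function name is ours.

    import numpy as np

    def mood_centroids(embeddings, moods):
        """Per-mood centroid of the test-set embeddings."""
        return {m: embeddings[moods == m].mean(axis=0) for m in np.unique(moods)}

    # Usage: the centroid of the target mood serves as the "generated" embedding.
    # x_hat = mood_centroids(test_embeddings, test_moods)[target_mood]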

Oracle: the upper-bound Oracle baseline exploits our sampling mechanism directly: we use the sampled target embedding itself for evaluation instead of a learned transformed embedding that approximates the target. We provide the Oracle baseline in two settings: the first uses the top-1 most similar sample to the seed, and the second uses a random sample from the top-100 most similar tracks to the seed. The former is an optimistic setting, while the latter matches the sampling mechanism we use in training.

Because the Oracle samples are explicitly chosen from the target mood, they trivially achieve a Mood P@1 score of \(1\). The genre and instrumentation scores should also be well above the random and average-mood baselines (especially for top-1), since the chosen target is known to be close to the seed through the sampling mechanism.

4 Results and Discussion↩︎

4.1 Core Results↩︎

Our key results, shown in Table 1, demonstrate that our method consistently outperforms random baselines by a wide margin, achieving high mood transformation accuracy while simultaneously preserving genre and instrumentation.

On the large-scale dataset, our approach reaches a Mood P@1 of \(0.96\) and a Genre P@1 of \(0.32\), far exceeding random in both mood transformation and seed preservation. Against the average mood vector baseline, we show over a \(3\)x improvement in Genre P@1, indicating that our model learns more nuanced transformations rather than simply converging to the mood centroids. While the overly optimistic Oracle (top-1) outperforms our system, our best method closely approximates the Oracle (top-100) in both metrics, illustrating that for a dataset with high-quality labeling, the nearest-neighbor sampling mechanism for training data pairs effectively provides an upper bound with regard to attribute (e.g., genre) preservation. Further, the Oracle requires all labels at inference and the computation of mood similarity scores across all embeddings for a given target mood, whereas our model operates without labeled targets or similarity information at inference.

For MTG-Jamendo, in addition to improving upon the random baseline significantly across all metrics, our best model performs on par with the average mood baseline in terms of Mood P@1 while outperforming it in Genre P@1 and Inst. J@1 by \(25\)pp and \(27\)pp, respectively. Our strong results for instrumentation validate that our system preserves nuanced qualities of the input that are independent of the transformation axis. Surprisingly, our method improves upon both Oracle baselines in Genre P@1; we hypothesize that this may be due to the noisier labeling, and that the small yet diverse nature of the dataset leads to sparser embeddings in the MULE space and thus lower similarity between embeddings within a given mood. This would indicate that, despite an inherently weaker sampling mechanism for training with this dataset, our model is still able to learn a robust transformation.

We also show results for the large-scale model evaluated in a zero-shot manner on MTG-Jamendo, as well as a version fine-tuned on MTG-Jamendo. The zero-shot result is strong (\(0.66\) vs. our best model at \(0.83\) for Mood P@1), suggesting both that our large-scale model generalizes relatively well to unseen data and that MULE provides a robust backbone despite the data shift. The lack of a significant improvement from fine-tuning over zero-shot implies a domain gap between the datasets in terms of audio and labeling, but may also be due to the small size of MTG-Jamendo.

4.2 Loss Ablation Study↩︎

Figure 2 shows the outcome of our loss ablation, with results presented as percentage-point (pp) differences relative to the random baseline for each task. We observe generally consistent trends across both datasets: relying on any single loss leads to relatively imbalanced outcomes. Using \(\mathcal{L}_{\text{cosine}}\) alone provides moderate gains in genre preservation, but is limited in mood transformation. In contrast, \(\mathcal{L}_{\text{triplet}}\) drives improvements in mood transformation, as the only term explicitly incentivizing movement in the embedding space; yet this clearly comes at the expense of seed preservation, with metrics close to random. \(\mathcal{L}_{\text{cosBCE}}\) behaves in the opposite way, achieving the highest Genre P@1 scores (\(80.4\)pp and \(94.2\)pp improvements over random on the large-scale and MTG-Jamendo datasets, respectively) but with negligible mood transformation.

Pairwise combinations of the losses partially mitigate these trade-offs, but the full combination of all three loss terms yields the largest improvement in mood transformation while still providing a large gain in genre preservation over random. On the large-scale dataset, the combined loss yields \(70.8\)pp and \(27.5\)pp improvements in Mood P@1 and Genre P@1, respectively, over the random baseline, while on MTG-Jamendo it achieves \(55.7\)pp and \(26.7\)pp improvements in the same metrics. These results confirm that the objectives are complementary: cosine similarity encourages alignment, triplet loss enforces separation across moods, and cosine BCE acts as a stabilizer between the two. Used together, they provide robust mood transformation while preserving seed information.

5 Conclusion↩︎

In this work, we introduce a framework for controllable music embedding transformation, enabling retrieval of tracks of a different mood but similar in other musical dimensions such as genre and instrumentation. We utilize a novel nearest-neighbor data sampling scheme to create seed-target embedding pairs to train our transformation model, and demonstrate in our evaluation on both a large-scale proprietary dataset and MTG-Jamendo that we are able to significantly outperform training-free baselines using this strategy. We show through an extensive loss ablation that success in both mood transformation and information preservation hinges on the design of complementary objective functions. In the future, we plan to investigate the interaction between multiple mood dimensions simultaneously (i.e., where a track can be both energetic and happy), and explore using an audio-text embedding space in our framework, where text embeddings could act as guidance instead of encoded labels.

References↩︎

[1]
M. Schedl, P. Knees, B. McFee, and D. Bogdanov, “Music recommendation systems: Techniques, use cases, and challenges,” in Recommender systems handbook, pp. 927–971. Springer, 2021.
[2]
M. Schedl, H. Zamani, C.-W. Chen, Y. Deldjoo, and M. Elahi, “Current challenges and visions in music recommender systems research,” International Journal of Multimedia Information Retrieval, vol. 7, no. 2, pp. 95–116, 2018.
[3]
B. Amiri, N. Shahverdi, A. Haddadi, and Y. Ghahremani, “Beyond the trends: Evolution and future directions in music recommender systems research,” IEEE Access, vol. 12, pp. 51500–51522, 2024.
[4]
Y. Deldjoo, M. Schedl, and P. Knees, “Content-driven music recommendation: Evolution, state of the art, and challenges,” Computer Science Review, vol. 51, pp. 100618, 2024.
[5]
S. Dai, Z. Zhang, and G. G. Xia, “Music style transfer: A position paper,” arXiv preprint arXiv:1803.06841, 2018.
[6]
S. Li, Y. Zhang, F. Tang, C. Ma, W. Dong, and C. Xu, “Music style transfer with time-varying inversion of diffusion models,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2024, vol. 38, pp. 547–555.
[7]
Z. Hu, Y. Liu, G. Chen, S.-h. Zhong, and A. Zhang, “Make your favorite music curative: Music style transfer for anxiety reduction,” in Proceedings of the 28th ACM international conference on multimedia, 2020, pp. 1189–1197.
[8]
J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y. Adi, and A. Défossez, “Simple and controllable music generation,” Advances in Neural Information Processing Systems, vol. 36, pp. 47704–47720, 2023.
[9]
J. Nistal, M. Pasini, C. Aouameur, M. Grachten, and S. Lattner, “Diff-a-riff: Musical accompaniment co-creation via latent diffusion models,” in Proceedings of the 25th International Society for Music Information Retrieval Conference. Nov. 2024, pp. 272–280, ISMIR.
[10]
O. Cífka, U. Şimşekli, and G. Richard, “Groove2groove: One-shot music style transfer with supervision from synthetic data,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2638–2650, 2020.
[11]
A. Chu, P. O’Reilly, J. Barnett, and B. Pardo, “Text2FX: Harnessing CLAP embeddings for text-guided audio effects,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5.
[12]
J. Guinot, E. Quinton, and G. Fazekas, “GD-Retriever: Controllable generative text-music retrieval with diffusion models,” in Proceedings of the 26th International Society for Music Information Retrieval Conference (ISMIR), 2025.
[13]
J. Guinot, E. Quinton, and G. Fazekas, “Leave-one-equivariant: Alleviating invariance-related information loss in contrastive music representations,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5.
[14]
J. Lee, N. J. Bryan, J. Salamon, Z. Jin, and J. Nam, “Disentangled multidimensional metric learning for music similarity,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6–10.
[15]
J. Wilkins, S. Ding, M. Fuentes, and J. P. Bello, “Balancing information preservation and disentanglement in self-supervised music representation learning,” arXiv preprint arXiv:2507.22995, 2025.
[16]
K. Tanaka, K. Yoshii, S. Dixon, and S. Morishima, “Unsupervised pitch-timbre-variation disentanglement of monophonic music signals based on random perturbation and re-entry training,” APSIPA Transactions on Signal and Information Processing, 2025.
[17]
M. C. McCallum, F. Henkel, J. Kim, S. E. Sandberg, and M. E. P. Davies, “Similar but faster: Manipulation of tempo in music audio embeddings for tempo prediction and search,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 686–690.
[18]
D. Bogdanov, M. Won, P. Tovstogan, A. Porter, and X. Serra, “The MTG-Jamendo dataset for automatic music tagging,” in Machine Learning for Music Discovery Workshop, International Conference on Machine Learning (ICML 2019), Long Beach, CA, United States, 2019.
[19]
M. C. McCallum, F. Korzeniowski, S. Oramas, F. Gouyon, and A. Ehmann, “Supervised and unsupervised learning of audio representations for music understanding,” in Proceedings of the 23rd International Society for Music Information Retrieval Conference. Dec. 2022, pp. 256–263, ISMIR.
[20]
F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 815–823.
[21]
T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International conference on machine learning. PMLR, 2020, pp. 1597–1607.
[22]
J. A. Russell, “A circumplex model of affect,” Journal of personality and social psychology, vol. 39, no. 6, pp. 1161–1178, 1980.