HingeNet: A Harmonic-Aware Fine-Tuning Approach for Beat Tracking


Abstract

Fine-tuning pre-trained foundation models has made significant progress in music information retrieval. However, applying these models to beat tracking tasks remains unexplored, as the limited annotated data renders conventional fine-tuning methods ineffective. To address this challenge, we propose HingeNet, a novel and general parameter-efficient fine-tuning method specifically designed for beat tracking tasks. HingeNet is a lightweight and separable network, visually resembling a hinge, designed to tightly interface with pre-trained foundation models by using their intermediate feature representations as input. This unique architecture grants HingeNet broad generalizability, enabling effective integration with various pre-trained foundation models. Furthermore, considering the significance of harmonics in beat tracking, we introduce a harmonic-aware mechanism during the fine-tuning process to better capture and emphasize the harmonic structures in musical signals. Experiments on benchmark datasets demonstrate that HingeNet achieves state-of-the-art performance in beat and downbeat tracking.

beat tracking, music foundation model, music information retrieval, parameter-efficient fine-tuning

1 Introduction

As digital music continues to evolve, accurately identifying rhythmic patterns within compositions has become a key focus in Music Information Retrieval (MIR). Beat tracking is a critical component of this field due to its foundational role in defining the temporal framework for analyzing musical content. The goal of beat tracking is to detect the temporal positions of beats within a music signal. Accurate beat tracking is essential for advanced tasks such as music transcription [1], [2] and music structure analysis [3], [4], among others [5], [6]. However, beat tracking remains a challenging task, primarily due to the complex and varied nature of rhythmic structures in different musical genres.

Figure 1: Comparison between the proposed HingeNet and traditional fine-tuning methods, such as Adapter and LoRA. The blue blocks represent the frozen encoder of the pre-trained music foundation model, while the red blocks indicate the trainable fine-tuning modules.

With the rise of deep learning, beat tracking methods have evolved significantly. Recurrent Neural Networks (RNNs) [7], [8] are used to capture temporal dependencies, while Convolutional Neural Networks (CNNs) [9], [10] are employed to learn local patterns from spectrograms, improving beat detection. Temporal Convolutional Networks (TCNs) [11], [12] handle long-range dependencies better than RNNs. More recently, Transformer-based models [13]–[15] have emerged, leveraging self-attention to capture long-range rhythmic patterns and achieving state-of-the-art results in beat tracking.

While these methods have led to some performance improvements, they inevitably reach a performance plateau due to limited representation capacity. Fine-tuning pre-trained foundation models allows for the transfer of their extensive knowledge to specific downstream tasks, which has already proven effective across various domains [16]. Therefore, extending this strategy to beat tracking tasks to overcome current performance limitations is a natural and promising approach. As shown in Fig. 1(a) and 1(b), traditional fine-tuning methods, such as Adapter [17] and LoRA [18], typically insert trainable fine-tuning modules into frozen foundation models, which alters the model’s architecture and representation space. Previous research [19] has shown that these methods are less effective in tasks with limited annotated data, as they can lead to overfitting and disrupt the generalizability of the pre-trained model’s features.

To address these issues, we propose a parameter-efficient fine-tuning method that visually resembles a hinge, which we therefore name HingeNet. As shown in Fig. 1(c), HingeNet is designed as an independent architecture. This architecture tightly integrates with various pre-trained foundation models by taking intermediate feature representations as input, ensuring that fine-tuning fully leverages the robust feature representations learned by the pre-trained model. Furthermore, according to music theory, harmonic shifts are more likely to occur at beat positions than at non-beat positions. To capture these shifts effectively, we introduce a harmonic-aware mechanism within HingeNet, which enhances the model’s ability to distinguish beat-related features, further improving beat and downbeat tracking accuracy. In summary, our contributions are as follows:

  • We propose HingeNet, a lightweight fine-tuning method specifically designed for beat tracking tasks. Its separable architecture enables effective integration with various pre-trained foundation models, ensuring strong generalization and superior performance.

  • Considering the significance of harmonics, we introduce a harmonic-aware mechanism designed to capture harmonic shifts at beat positions and emphasize harmonic structures, thereby enhancing the accuracy of beat and downbeat tracking.

  • Extensive experiments on multiple benchmark datasets demonstrate that HingeNet achieves state-of-the-art performance in both beat and downbeat tracking.

Figure 2: Overview of our proposed model, consisting of two parts: the pre-trained foundation model and the HingeNet.

2 Related Work

2.1 Beat and Downbeat Tracking

Early methods, including RNNs [7], [8], CNNs [9], [20], and TCNs [11], [12], were used to capture temporal and spectral patterns. However, these models struggled with long-range dependencies and complex rhythmic structures, especially in music with varying tempos.

More recently, Huang et al. [13] introduced Transformer-based models for beat tracking, significantly improving the accuracy of the model. Following this, Zhao et al. [14] and Cheng et al. [15] proposed further improvements to Transformer-based models, optimizing them for better handling of complex rhythms and varying tempos. In addition, Heydari et al. [21] explored using pre-trained self-supervised speech representation models as feature extractors for singing beat tracking. Desblancs et al. [22] investigated unsupervised beat tracking methods, aiming to eliminate the need for labeled data in training. Meanwhile, Chiu et al. [23] and Foscarin et al. [24] explored improvements to, and even removal of, the DBN post-processing step, seeking more efficient alternatives for refining beat predictions. Despite these innovations, challenges remain in handling diverse musical styles and achieving accurate beat tracking in complex audio sources.

The emergence of music foundation models has brought new hope in addressing these challenges. Pre-training on large-scale music data endows these models with robust semantic representations and strong generalization capabilities. Fine-tuning for specific downstream tasks can further enhance the performance of these models. Motivated by the unique characteristics of beat tracking, we propose HingeNet, a novel harmonic-aware fine-tuning method. This approach significantly improves the performance of foundation models in beat tracking tasks by focusing on harmonic shifts, enhancing both accuracy and efficiency.

2.2 Foundation Models in MIR

The exploration of foundation models within MIR is still in its early stages, but some groundbreaking work has already been done. Music2Vec [25] employs a student-teacher framework to enhance its performance in music understanding tasks. In this setup, the student network learns from the output of the teacher network, allowing it to effectively capture complex musical patterns and improve generalization across diverse musical genres. Building on this approach, MERT [26] utilizes a Residual Vector Quantization - Variational AutoEncoder (RVQ-VAE) as the acoustic teacher and a Constant-Q Transform (CQT) as the musical teacher to guide the model in jointly learning both acoustic and musical knowledge. This combination enables MERT to effectively capture the tonal and pitched characteristics inherent in music, making it well-suited for local frame-level sequence labeling tasks.

Additionally, MusicFM [27] draws inspiration from the speech recognition model BEST-RQ [28], using a random projection quantizer during the tokenization phase. This approach significantly improves the performance of frame-level classification tasks that require long-term contextual understanding. To fully harness the potential of foundation models for beat tracking tasks, we propose a novel harmonic-aware fine-tuning method specifically designed for this purpose.

3 Method

3.1 The Overall Architecture of HingeNet

As shown in Fig. 2, our proposed model consists of two parts: the upper part illustrates the pre-trained foundation model with frozen parameters, from which intermediate feature representations are extracted at each encoder layer. The lower part shows HingeNet, a harmonic-aware fine-tuning network specifically designed for beat tracking and adaptable to various pre-trained foundation models. Specifically, HingeNet utilizes the intermediate feature representations from each encoder layer of the foundation model as inputs, enabling it to make predictions without altering the architecture of the foundation model. Additionally, HingeNet integrates harmonic-aware modules specifically designed for beat tracking, which directly address the unique challenges of this task by refining and emphasizing harmonic structures in musical signals. This focus on both complete preservation of the pre-trained model and task-specific adaptation distinguishes HingeNet from other similar approaches.

Music foundation models typically consist of a stem module \(\mathcal{S}\), which is usually composed of convolutional layers to convert input audio into time-frequency features. These features are then passed through \(N\) Transformer encoders, denoted as \(\{\mathcal{F}_i\}^N_{i=1}\), where each encoder captures different levels of musical semantics in the audio signal. After each encoder, we attach a corresponding HingeNet core layer, represented as \(\{\mathcal{P}_i, \mathcal{H}_i\}^N_{i=1}\). Here, \(\mathcal{P}\) represents the projection layer, which reduces the dimensionality of the output features to remove redundant information, and \(\mathcal{H}\) denotes the harmonic-aware module, which captures harmonic shifts in the music signal and improves the model’s ability to track beats and downbeats.

Given input audio \(x\), we obtain its corresponding semantic feature representation as: \[\begin{align} h_0^\mathcal{F} = \mathcal{S}(x), \quad h_i^\mathcal{F} = \mathcal{F}_i(h_{i-1}^\mathcal{F}), \quad i = 1, \dots, N \end{align}\]
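
For concreteness, a minimal PyTorch sketch of this feature-extraction step is given below; the attribute names `foundation.stem` and `foundation.encoders` are placeholders, since each foundation model (e.g., MERT, MusicFM) exposes its stem and encoder layers differently.

```python
import torch

@torch.no_grad()
def extract_intermediate_features(foundation, audio):
    """Collect h_1^F, ..., h_N^F from the frozen foundation model."""
    h = foundation.stem(audio)            # h_0^F: time-frequency features
    features = []
    for encoder in foundation.encoders:   # frozen encoders F_1, ..., F_N
        h = encoder(h)                    # h_i^F = F_i(h_{i-1}^F)
        features.append(h)
    return features                       # one representation per encoder layer
```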

For the \(i\)-th HingeNet core layer, we first feed the \(i\)-th encoder’s output \(h_i^\mathcal{F} \in \mathbb{R}^{b\times h\times t}\), where \(b\), \(h\), and \(t\) denote the batch size, hidden dimension, and number of time frames, into the \(i\)-th projection layer \(\mathcal{P}_i\) to obtain the projected features \(h_i^\mathcal{P} \in \mathbb{R}^{b\times \frac{h}{r}\times t}\): \[\begin{align} h_i^\mathcal{P} = \mathcal{P}_i(h_i^\mathcal{F}) \end{align}\]

Next, the projected features \(h_i^\mathcal{P}\) are fused with the output of the previous HingeNet core layer \(h_{i-1} \in \mathbb{R}^{b\times \frac{h}{r}\times t}\) through a learnable gating mechanism, producing the input for the harmonic-aware module: \[\begin{align} \tilde{h}_i = \mu_i \cdot h_{i-1} + (1-\mu_i)\cdot h_i^\mathcal{P}, \quad \forall i \in \{2, \dots, N\} \end{align}\] where \(\mu_i = \mathrm{sigmoid}(\alpha_i)\), with \(\alpha_i\) a learnable scalar initialized to zero. We also experimented with other fusion methods, such as element-wise addition and cross-attention, but found that this design works best. For the first layer (\(i = 1\)), there is no prior core-layer output, so \(\tilde{h}_i\) is set directly to the projected features: \(\tilde{h}_1 = h_1^\mathcal{P}\).
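
The projection and gating steps can be implemented as follows; this is an illustrative sketch in which a 1×1 convolution stands in for \(\mathcal{P}_i\), one possible instantiation of the dimensionality reduction.

```python
import torch
import torch.nn as nn

class ProjectionGate(nn.Module):
    """Projection layer P_i plus the learnable gate (Equations 2 and 3)."""

    def __init__(self, hidden: int, r: int = 6):
        super().__init__()
        self.proj = nn.Conv1d(hidden, hidden // r, kernel_size=1)  # P_i
        self.alpha = nn.Parameter(torch.zeros(1))                  # alpha_i = 0

    def forward(self, h_enc, h_prev=None):
        h_p = self.proj(h_enc)                 # h_i^P with shape (b, h/r, t)
        if h_prev is None:                     # first core layer (i = 1)
            return h_p
        mu = torch.sigmoid(self.alpha)         # mu_i, initially 0.5
        return mu * h_prev + (1 - mu) * h_p    # gated fusion
```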

The fused features \(\tilde{h}_i\) are processed by the harmonic-aware module, which consists of \(M\) parallel 1D convolutional layers with different dilation rates. The resulting features are then concatenated and passed through an MLP layer, which maps them back to the original dimensional space, yielding the output of the \(i\)-th HingeNet core layer, \(h_i \in \mathbb{R}^{b\times \frac{h}{r}\times t}\):

\[\begin{align} h_i = \mathrm{MLP}(\mathrm{Concat}[\mathcal{H}_i^1(\tilde{h}_i), \mathcal{H}_i^2(\tilde{h}_i), \dots, \mathcal{H}_i^M(\tilde{h}_i)]) \end{align}\]
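
An illustrative implementation of the harmonic-aware module is sketched below; the dilation rates follow the harmonic intervals derived in Section 3.2, while the kernel size and the single-layer MLP are example choices.

```python
import torch
import torch.nn as nn

class HarmonicAwareModule(nn.Module):
    """M parallel dilated 1D convolutions, concatenated and mapped back by an MLP (Equation 4)."""

    def __init__(self, channels: int, dilations=(12, 7, 5, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=3, dilation=d, padding=d)
            for d in dilations                 # padding=d preserves the time length
        )
        self.mlp = nn.Linear(channels * len(dilations), channels)

    def forward(self, x):                      # x: (b, h/r, t)
        y = torch.cat([branch(x) for branch in self.branches], dim=1)
        y = self.mlp(y.transpose(1, 2))        # map back to h/r channels
        return y.transpose(1, 2)               # (b, h/r, t)
```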

Finally, the output of the last core layer, \(h_N\), passes through a linear layer with sigmoid activation, followed by DBN post-processing [29], to produce the final beat and downbeat tracking results.
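
The prediction head itself is small; the sketch below uses illustrative shapes, with DBN decoding (e.g., via madmom's DBN beat and downbeat trackers) applied afterwards.

```python
import torch
import torch.nn as nn

channels, frames = 128, 3000                  # h/r channels, t frames (example values)
h_N = torch.randn(1, channels, frames)        # output of the last core layer
head = nn.Sequential(nn.Linear(channels, 2), nn.Sigmoid())
activations = head(h_N.transpose(1, 2))       # (1, frames, 2): beat and downbeat activations
# DBN post-processing [29] then converts these activations into beat times.
```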

3.2 Lightweight HingeNet Core Layer

The HingeNet core layer is designed to efficiently integrate harmonic-aware information while maintaining a lightweight and flexible structure. The core idea behind this design is to retain essential musical patterns while reducing unnecessary parameterization to minimize the reliance on large amounts of annotated data. The layer consists of three main components: the projection layer, the learnable gating mechanism, and the harmonic-aware module.

Projection Layer. Music signals contain rich semantic information, some of which, such as harmonics, helps identify beats, while other elements may be irrelevant or even introduce interference, making beat detection more difficult. The projection layer simplifies this complexity by introducing a projection factor \(r\) (e.g., \(r\) = 2, 4, 6, 8), which reduces the dimensionality of the input features to \(\tfrac{1}{r}\) of the original. This reduction is particularly crucial for beat tracking tasks with limited annotated data, as it helps control the number of trainable parameters and enhances the model’s scalability.

Learnable Gating Mechanism. The learnable gating mechanism introduces dynamic control over the integration of the outputs from the other two components: the projection layer and the previous harmonic-aware module. Initially, the gating parameter assigns equal weight to both components, ensuring that the contributions are balanced. However, as training progresses, the gating parameter adapts, allowing the model to dynamically adjust the relative importance of each component based on the task and data characteristics. This flexibility enables the model to focus more on harmonic features or projection features as needed, optimizing performance for downstream beat and downbeat tracking tasks.

Harmonic-Aware Module. Harmonics play a crucial role in music perception, as harmonic shifts are more likely to occur at beat positions than at non-beat positions, providing valuable cues for beat detection. In addition, harmonics follow predictable patterns in the time-frequency representation of music, which can be exploited to capture these harmonic shifts more effectively. For example, in the CQT spectrum, adjacent harmonics of a fundamental frequency \(f_0\) are separated by intervals that do not depend on \(f_0\) when \(Q\) is appropriately chosen. The interval \(d_k\) between adjacent harmonics in the harmonic series can be calculated as: \[\begin{align} d_k &= \log_{2^{1/Q}}(f_0 \cdot (k + 1)) - \log_{2^{1/Q}}(f_0 \cdot k), \\ &= Q \cdot \log_2\left(\frac{k + 1}{k}\right) \end{align}\] where \(Q\) denotes the number of bins per octave, \(d_k\) the interval between adjacent harmonics, and \(k\) the index within the harmonic series. When \(Q\) = 12 and the first five harmonics are considered, the intervals are independent of \(f_0\) and can be rounded to the integer values [12, 7, 5, 4]. By using multiple parallel 1D convolutions with dilation rates set to these harmonic intervals, the harmonic-aware module effectively captures harmonic patterns and detects beat-related shifts. Equation 4 describes how these features are further processed to obtain the final output.
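
The dilation rates follow directly from this formula, as the short check below illustrates.

```python
import math

Q = 12                                         # CQT bins per octave
intervals = [Q * math.log2((k + 1) / k) for k in range(1, 5)]
print([round(d, 2) for d in intervals])        # [12.0, 7.02, 4.98, 3.86]
print([round(d) for d in intervals])           # [12, 7, 5, 4] -> dilation rates
```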

4 Experiments

4.1 Datasets and Metrics

Following previous work [15], [20], HingeNet is trained and evaluated on seven standard music datasets. Specifically, the Beatles [30], RWC Popular [31], and Harmonix [34] datasets are used exclusively for training. The Ballroom [33], Hainsworth [32], and SMC [35] datasets are used for both training and testing in an 8-fold cross-validation setup. The GTZAN [36] dataset is reserved solely for testing, serving as an independent benchmark to evaluate the model’s generalization ability. These datasets cover diverse musical styles and genres, offering a robust benchmark for beat and downbeat tracking.

To evaluate beat and downbeat tracking performance, we adopt three widely used metrics: F-measure, CMLt (Correct Metric Level), and AMLt (Allowed Metric Levels, which also accepts off-beat and double- or half-tempo interpretations), with a tolerance window of ±70 ms [30]. The latter two metrics primarily evaluate the proportion of correctly predicted beat sequences that match the ground truth, highlighting the model’s ability to consistently predict rhythmic patterns.
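
These metrics are implemented in standard toolkits; the sketch below uses mir_eval (one possible choice) with purely synthetic beat times in seconds.

```python
import numpy as np
import mir_eval

reference = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])        # ground-truth beat times
estimated = np.array([0.51, 1.02, 1.49, 2.01, 2.52, 3.01])  # predicted beat times

f1 = mir_eval.beat.f_measure(reference, estimated, f_measure_threshold=0.07)
cmlc, cmlt, amlc, amlt = mir_eval.beat.continuity(reference, estimated)
print(f"F1={f1:.3f}  CMLt={cmlt:.3f}  AMLt={amlt:.3f}")
```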

4.2 Experimental Setup

Our model is based on a multi-task learning framework in which beat and downbeat tracking are predicted simultaneously. Binary cross-entropy loss is used to supervise the training process. Following [12], we apply the same label broadening technique to the annotations. Specifically, frames adjacent to the annotated beat frames are also marked as beats, with lower weights of 0.5 at ±1 frame and 0.25 at ±2 frames.
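
A small sketch of this label-broadening scheme, under our reading of the ±1/±2-frame weights, is shown below.

```python
import numpy as np

def broaden_labels(beat_frames, num_frames):
    """Soft per-frame beat targets: 1.0 at the beat, 0.5 at +-1 frame, 0.25 at +-2 frames."""
    target = np.zeros(num_frames, dtype=np.float32)
    for f in beat_frames:
        for offset, weight in [(0, 1.0), (-1, 0.5), (1, 0.5), (-2, 0.25), (2, 0.25)]:
            if 0 <= f + offset < num_frames:
                target[f + offset] = max(target[f + offset], weight)
    return target

# Example: a beat annotated at frame 5 yields 0.25 at frames 3/7, 0.5 at 4/6, 1.0 at 5.
print(broaden_labels([5], 11))
```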

During training, we use time stretching without altering pitch as the data augmentation technique to enhance the robustness and generalization of our model. The validation and test data are left untouched. We train our model using the Adam optimizer with a learning rate of 1e-3 and a batch size of 16. The training is stopped when the validation loss does not decrease for 20 epochs. The total number of trainable parameters in our model is 4.74M, which constitutes only 1.4% of the total parameters in the pre-trained model.
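
A condensed sketch of this training loop is given below; the model, data loader, and validation routine are supplied by the caller and stand in for our actual pipeline.

```python
import torch

def finetune(model, train_loader, validate, lr=1e-3, patience=20):
    """Adam with lr 1e-3, early stopping after 20 epochs without validation improvement."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.BCELoss()                   # beat/downbeat soft targets
    best, stale = float("inf"), 0
    while stale < patience:
        for features, targets in train_loader:       # batches of 16 excerpts
            optimizer.zero_grad()
            loss = criterion(model(features), targets)
            loss.backward()
            optimizer.step()
        val_loss = validate(model)
        best, stale = (val_loss, 0) if val_loss < best else (best, stale + 1)
    return model
```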

4.3 Results and Analysis

We compare our model with several state-of-the-art (SOTA) models, including Beat trans [14], LH trans [15], and Beat This [24]. Additionally, we establish two baseline models based on pre-trained music foundation models, MERT [26] and MusicFM [27], which attach only a beat/downbeat classifier and DBN post-processing on top of the frozen pre-trained models. To ensure a fair comparison, we reproduced MERT’s results on the GTZAN dataset, as the originally reported results did not adhere to the standard conventions of the beat tracking community. Likewise, for the Beat This model, which was originally trained with additional data, we report its results on the GTZAN dataset under the standard training setup.

Table 1: Comparison with other state-of-the-art beat tracking models and two baseline models on the GTZAN dataset.

Model            | Beat F-Measure | Beat CMLt | Beat AMLt | Downbeat F-Measure | Downbeat CMLt | Downbeat AMLt
Beat trans [14]  | 88.5 | 80.0 | 92.2 | 71.4 | 66.5 | 84.4
LH trans [15]    | 88.4 | 80.8 | 94.0 | -    | -    | -
Beat This [24]   | 88.9 | 79.9 | 89.4 | 75.5 | 60.8 | 75.5
MERT [26]        | 87.3 | 78.4 | 90.7 | 74.8 | 69.3 | 86.1
MusicFM [27]     | 86.1 | -    | -    | 78.5 | -    | -
MERT+HingeNet    | 89.7 | 81.4 | 94.3 | 77.4 | 71.6 | 87.2
MusicFM+HingeNet | 89.2 | 80.9 | 93.7 | 79.8 | 73.2 | 89.5
Table 2: Comparison with other state-of-the-art beat tracking models on datasets used in an 8-fold cross-validation setup.

Dataset    | Model            | Beat F1 | Downbeat F1
Ballroom   | Beat trans [14]  | 96.8 | 94.1
Ballroom   | LH trans [15]    | 95.0 | -
Ballroom   | MERT [26]        | 95.7 | 93.2
Ballroom   | MusicFM [27]     | 95.1 | 94.3
Ballroom   | MERT+HingeNet    | 97.3 | 94.5
Ballroom   | MusicFM+HingeNet | 97.0 | 95.8
Hainsworth | Beat trans [14]  | 90.2 | 74.8
Hainsworth | LH trans [15]    | 87.0 | -
Hainsworth | MERT [26]        | 89.6 | 74.5
Hainsworth | MusicFM [27]     | 89.2 | 75.7
Hainsworth | MERT+HingeNet    | 91.4 | 76.7
Hainsworth | MusicFM+HingeNet | 90.8 | 78.3
SMC        | Beat trans [14]  | 59.6 | -
SMC        | LH trans [15]    | 55.4 | -
SMC        | MERT [26]        | 60.1 | -
SMC        | MusicFM [27]     | 59.2 | -
SMC        | MERT+HingeNet    | 61.7 | -
SMC        | MusicFM+HingeNet | 60.6 | -
Table 3: Ablation study on beat tracking with different fine-tuning methods on the GTZAN dataset.

Model   | Fine-tuning method | Beat F1 | Downbeat F1
MERT    | Adapter  | 73.4 | 64.9
MERT    | LoRA     | 81.2 | 69.8
MERT    | HingeNet | 89.7 | 77.4
MusicFM | Adapter  | 77.3 | 65.5
MusicFM | LoRA     | 80.6 | 68.7
MusicFM | HingeNet | 89.2 | 79.8

Table 1 shows the results of HingeNet compared to several SOTA models and two baseline models on the GTZAN dataset.

Stable Improvement with HingeNet. Compared to the baseline models, fine-tuning with HingeNet consistently results in significant improvements across all metrics. This demonstrates the effectiveness of our harmonic-aware fine-tuning method, positioning HingeNet as a robust method for enhancing pre-trained foundation models in downstream beat and downbeat tracking tasks.

Powerful Potential of Foundation Models. Our proposed method outperforms the latest SOTA model, Beat This [24], in both beat and downbeat accuracy. Specifically, in downbeat F-Measure, our method achieves 79.8%, surpassing it by 4.3 percentage points. This highlights the significant potential of fine-tuning music foundation models, which can substantially improve performance in downstream tasks.

Table 2 compares the performance of HingeNet with several SOTA models on the Ballroom, Hainsworth, and SMC datasets. Our proposed HingeNet outperforms previous SOTA models in both beat and downbeat tracking across all datasets.

Additionally, we observed that MERT performs better in beat tracking, while MusicFM excels in downbeat tracking. This difference can be attributed to the distinct strengths of each model: MERT’s unique music teacher design allows it to capture local musical details well, while MusicFM, with its random projection-based tokenization method, captures long-term contextual dependencies effectively.

Figure 3: Ablation study on beat tracking with different projection factor r and HAM on the GTZAN dataset.

4.4 Ablation Study

We conduct two ablation studies to analyze the effectiveness of our proposed method: one focuses on the harmonic-aware module (HAM) and projection factor \(r\), and the other evaluates different fine-tuning methods (Adapter, LoRA, and HingeNet).

Effect of HAM and Projection Factor. From Fig. 3, we can draw three key conclusions: 1) The inclusion of HAM consistently improves performance across all projection factors; 2) The model with HAM achieves the highest beat F-measure when the projection factor \(r\) is set to 6; 3) Interestingly, the performance gains from HAM gradually diminish as the projection factor \(r\) increases. This diminishing effect is likely because the dimensionality reduction distorts harmonic structures, and the distortion becomes more pronounced as \(r\) increases, limiting the effectiveness of the HAM in capturing harmonic information. Conversely, with a lower \(r\), the feature space remains highly complex, making it challenging to fine-tune the model effectively with limited annotated data. A projection factor of 6 strikes the optimal balance, simplifying the feature space while preserving the harmonic detail necessary for accurate beat tracking. It is worth mentioning that experiments on other datasets also yielded the best results when the projection factor \(r\) was set to 6.

Comparison of Fine-Tuning Methods. Table 3 presents the results of our ablation study, comparing different fine-tuning methods on two foundation models for beat and downbeat tracking. The results clearly demonstrate that HingeNet consistently outperforms other fine-tuning methods. During the experiments, we observed that both Adapter and LoRA exhibited varying degrees of overfitting, which led to their suboptimal performance. In contrast, HingeNet leverages its lightweight and independent structure to fully exploit the capabilities of pre-trained foundation models.

5 Conclusion

In this paper, we propose HingeNet, a novel and general fine-tuning method specifically designed for beat tracking tasks. HingeNet is a lightweight and separable network with a unique architecture that grants it broad generalizability, enabling effective integration with various pre-trained foundation models. Furthermore, considering the significance of harmonics in beat tracking, we design the harmonic-aware modules based on harmonic principles to refine and emphasize the harmonic structures in musical signals. Experiments on benchmark datasets demonstrate that HingeNet achieves state-of-the-art performance in both beat and downbeat tracking.

6 Acknowledgement

This work was supported by the NSFC (62171138).

References

[1]
Kazuki Ochiai, Hirokazu Kameoka, and Shigeki Sagayama, “Explicit beat structure modeling for non-negative matrix factorization-based multipitch analysis,” in ICASSP. IEEE, 2012, pp. 133–136.
[2]
Ryo Nishikimi, Eita Nakamura, Masataka Goto, Katsutoshi Itoyama, and Kazuyoshi Yoshii, “Bayesian singing transcription based on a hierarchical generative model of keys, musical notes, and f0 trajectories,” IEEE/ACM TASLP, vol. 28, pp. 1678–1691, 2020.
[3]
Go Shibata, Ryo Nishikimi, and Kazuyoshi Yoshii, “Music structure analysis based on an lstm-hsmm hybrid model.,” in ISMIR, 2020, pp. 23–29.
[4]
Oriol Nieto, Gautham J. Mysore, Cheng-i Wang, Jordan B. L. Smith, Jan Schlüter, Thomas Grill, and Brian McFee, “Audio-based music structure analysis: Current trends, open challenges, and applications,” Transactions of the International Society for Music Information Retrieval, p. 246–263, Dec 2020.
[5]
Ganghui Ru, Xulong Zhang, Jianzong Wang, Ning Cheng, and Jing Xiao, “Improving music genre classification from multi-modal properties of music and genre correlations perspective,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[6]
Wei Duan, Yi Yu, Xulong Zhang, Suhua Tang, Wei Li, and Keizo Oyama, “Melody generation from lyrics with local interpretability,” ACM Transactions on Multimedia Computing, Communications and Applications, vol. 19, no. 3, pp. 1–21, 2023.
[7]
Sebastian Böck, Florian Krebs, and Gerhard Widmer, “Joint beat and downbeat tracking with recurrent neural networks.,” in ISMIR. New York City, 2016, pp. 255–261.
[8]
Yi-Chin Chuang and Li Su, “Beat and downbeat tracking of symbolic music data using deep recurrent neural networks,” in APSIPA ASC. IEEE, 2020, pp. 346–352.
[9]
Simon Durand, Juan Pablo Bello, Bertrand David, and Gael Richard, “Robust downbeat tracking using an ensemble of convolutional networks,” IEEE/ACM TASLP, p. 76–89, Jan 2017.
[10]
Tian Cheng, Satoru Fukayama, and Masataka Goto, “Joint beat and downbeat tracking based on crnn models and a comparison of using different context ranges in convolutional layers,” in Proc. ICMC, 2020.
[11]
Sebastian Böck, Matthew EP Davies, and Peter Knees, “Multi-task learning of tempo and beat: Learning one to improve the other.,” in ISMIR, 2019, pp. 486–493.
[12]
Sebastian Böck and Matthew EP Davies, “Deconstruct, analyse, reconstruct: How to improve tempo, beat, and downbeat estimation.,” in ISMIR, 2020, pp. 574–582.
[13]
Yun-Ning Hung, Ju-Chiang Wang, Xuchen Song, Wei-Tsung Lu, and Minz Won, “Modeling beats and downbeats with a time-frequency transformer,” in ICASSP. IEEE, 2022, pp. 401–405.
[14]
Jingwei Zhao, Gus Xia, and Ye Wang, “Beat transformer: Demixed beat and downbeat tracking with dilated self-attention,” arXiv preprint arXiv:2209.07140, 2022.
[15]
Tian Cheng and Masataka Goto, “Transformer-based beat tracking with low-resolution encoder and high-resolution decoder.,” in ISMIR, 2023, pp. 466–473.
[16]
Jeffrey O Zhang, Alexander Sax, Amir Zamir, Leonidas Guibas, and Jitendra Malik, “Side-tuning: a baseline for network adaptation via additive side networks,” in ECCV. Springer, 2020, pp. 698–714.
[17]
Rabeeh Karimi Mahabadi, James Henderson, and Sebastian Ruder, “Compacter: Efficient low-rank hypercomplex adapter layers,” Advances in Neural Information Processing Systems, vol. 34, pp. 1022–1035, 2021.
[18]
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen, “Lora: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685, 2021.
[19]
Yang Lin, Xinyu Ma, Xu Chu, Yujie Jin, Zhibang Yang, Yasha Wang, and Hong Mei, “Lora dropout as a sparsity regularizer for overfitting control,” arXiv preprint arXiv:2404.09610, 2024.
[20]
Tian Cheng and Masataka Goto, “U-beat: A multi-scale beat tracking model based on wave-u-net,” in ICASSP. IEEE, 2023, pp. 1–5.
[21]
Mojtaba Heydari and Zhiyao Duan, “Singing beat tracking with self-supervised front-end and linear transformers,” arXiv preprint arXiv:2208.14578, 2022.
[22]
Dorian Desblancs, Vincent Lostanlen, and Romain Hennequin, “Zero-note samba: Self-supervised beat tracking,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
[23]
Ching-Yu Chiu, Meinard Müller, Matthew EP Davies, Alvin Wen-Yu Su, and Yi-Hsuan Yang, “Local periodicity-based beat tracking for expressive classical piano music,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
[24]
Francesco Foscarin, Jan Schlüter, and Gerhard Widmer, “Beat this! accurate beat tracking without dbn postprocessing,” arXiv preprint arXiv:2407.21658, 2024.
[25]
Yizhi Li, Ruibin Yuan, Ge Zhang, Yinghao Ma, Chenghua Lin, Xingran Chen, Anton Ragni, Hanzhi Yin, Zhijie Hu, Haoyu He, et al., “Map-music2vec: A simple and effective baseline for self-supervised music audio representation learning,” arXiv preprint arXiv:2212.02508, 2022.
[26]
LI Yizhi, Ruibin Yuan, Ge Zhang, Yinghao Ma, Xingran Chen, Hanzhi Yin, Chenghao Xiao, Chenghua Lin, Anton Ragni, Emmanouil Benetos, et al., “Mert: Acoustic music understanding model with large-scale self-supervised training,” in ICLR, 2023.
[27]
Minz Won, Yun-Ning Hung, and Duc Le, “A foundation model for music informatics,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 1226–1230.
[28]
Chung-Cheng Chiu, James Qin, Yu Zhang, Jiahui Yu, and Yonghui Wu, “Self-supervised learning with random-projection quantizer for speech recognition,” in International Conference on Machine Learning. PMLR, 2022, pp. 3915–3924.
[29]
Florian Krebs, Sebastian Böck, and Gerhard Widmer, “An efficient state-space model for joint tempo and meter tracking.,” in ISMIR, 2015, pp. 72–78.
[30]
Matthew EP Davies, Norberto Degara, and Mark D Plumbley, “Evaluation methods for musical audio beat tracking algorithms,” Queen Mary University of London, Centre for Digital Music, Tech. Rep. C4DM-TR-09-06, 2009.
[31]
Masataka Goto, Hiroki Hashiguchi, Takuichi Nishimura, and Ryuichi Oka, “Rwc music database: Popular, classical and jazz music databases.,” in ISMIR, 2002, vol. 2, pp. 287–288.
[32]
Stephen W. Hainsworth and Malcolm D. Macleod, “Particle filtering applied to musical tempo tracking,” EURASIP Journal on Advances in Signal Processing, vol. 2004, no. 15, Nov 2004.
[33]
Florian Krebs, Sebastian Böck, and Gerhard Widmer, “Rhythmic pattern modeling for beat and downbeat tracking in musical audio.,” in ISMIR. Citeseer, 2013, pp. 227–232.
[34]
Oriol Nieto, Matthew C McCallum, Matthew EP Davies, Andrew Robertson, Adam M Stark, and Eran Egozy, “The harmonix set: Beats, downbeats, and functional segment annotations of western popular music.,” in ISMIR, 2019, pp. 565–572.
[35]
André Holzapfel, Matthew E. P. Davies, José R. Zapata, João Lobato Oliveira, and Fabien Gouyon, “Selective sampling for beat tracking evaluation,” IEEE TASLP, vol. 20, no. 9, pp. 2539–2548, Nov 2012.
[36]
Ugo Marchand and Geoffroy Peeters, “Swing ratio estimation,” Nov 2015.