BACHI: Boundary-Aware Symbolic Chord Recognition Through Masked Iterative Decoding on Pop and Classical Music

University of California San Diego
BACHI architecture overview

BACHI introduces a boundary-aware approach to symbolic chord recognition that mirrors human ear-training practices. The model decomposes chord recognition into explicit steps: detecting chord boundaries, then iteratively predicting root, quality, and bass in confidence order.

Abstract

Automatic chord recognition (ACR) via deep learning models has gradually achieved promising recognition accuracy, yet two key challenges remain. First, prior work has primarily focused on audio-domain ACR, while symbolic music (e.g., score) ACR has received limited attention due to data scarcity. Second, existing methods still overlook strategies that are aligned with human music-analytical practices. To address these challenges, we make two contributions: (1) we introduce POP909-CL, an enhanced version of the POP909 dataset with tempo-aligned content and human-corrected labels of chords, beats, keys, and time signatures; and (2) we propose BACHI, a symbolic chord recognition model that decomposes the task into distinct decision steps, namely boundary detection and iterative ranking of chord root, quality, and bass (inversion). This mechanism mirrors human ear-training practices. Experiments demonstrate that BACHI achieves state-of-the-art chord recognition performance on both classical and pop music benchmarks, with ablation studies validating the effectiveness of each module.

Method

BACHI performs symbolic chord recognition in two boundary-aware stages that mirror human ear-training.

  • Patch Embedding & Transformer Encoder: Beat-synchronous piano-roll tokens are embedded on both piano-roll frame and time dimensions and processed by six transformer encoder layers.
  • Boundary-Conditioned Context: A supervised chord boundary detector modulates encoder hidden states through FiLM and a local context window.
  • Confidence-Ordered Decoding: A masked transformer decoder iteratively fills root, quality, and bass in confidence order, providing robust chord labeling.
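
The boundary-conditioned step can be illustrated with a small sketch of FiLM (feature-wise linear modulation), in which each hidden vector is scaled and shifted by values computed from the boundary probabilities in a local window. This is a toy illustration, not the BACHI implementation: the function names, the zero padding at the edges, the `1 + ...` parameterisation of the scale, and the weight matrices `w_gamma` and `w_beta` are all assumptions made for this example.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def local_window(boundary_probs: Vector, t: int, radius: int) -> Vector:
    """Boundary probabilities in [t - radius, t + radius], zero-padded at the edges."""
    n = len(boundary_probs)
    return [boundary_probs[i] if 0 <= i < n else 0.0
            for i in range(t - radius, t + radius + 1)]


def film(hidden: Matrix, boundary_probs: Vector,
         w_gamma: Matrix, w_beta: Matrix, radius: int = 1) -> Matrix:
    """Apply h' = gamma * h + beta at every time step.

    gamma and beta are linear functions of the local boundary window.
    w_gamma and w_beta are hypothetical weights of shape
    [hidden_dim][2 * radius + 1]; gamma is offset by 1 so that zero
    weights leave the hidden state unchanged (an assumption of this sketch).
    """
    out = []
    for t, h in enumerate(hidden):
        ctx = local_window(boundary_probs, t, radius)
        gamma = [1.0 + sum(w * c for w, c in zip(row, ctx)) for row in w_gamma]
        beta = [sum(w * c for w, c in zip(row, ctx)) for row in w_beta]
        out.append([g * x + b for g, x, b in zip(gamma, h, beta)])
    return out


# Two time steps, hidden size 2; a boundary is detected at step 1 only.
hidden = [[1.0, 2.0], [3.0, 4.0]]
boundary_probs = [0.0, 1.0]
w_gamma = [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]  # scales feature 0 at a boundary
w_beta = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # shifts feature 1 at a boundary
modulated = film(hidden, boundary_probs, w_gamma, w_beta)
```

In this toy setup the first time step passes through unchanged, while the step that coincides with a boundary is rescaled and shifted, which is the sense in which boundary information modulates the encoder states.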

Training Data: POP909-CL with human-corrected annotations, and a curated classical corpus combining When-in-Rome and DCML with deduplication.

Results

Qualitative Examples

Example predictions from BACHI on classical music (top) and POP909-CL (bottom), comparing ground truth annotations with model predictions.

Ablation on Classical & Pop

We ablate BACHI on both the classical and POP909-CL benchmarks and report per-piece macro accuracy (%). The ablated variants are:

  • BACHI w/o BD & ID: We use the patch embedding and transformer encoder modules only (no boundary-aware detection and iterative decoding).
  • BACHI w/o ID: We use the patch embedding, transformer encoder and boundary-aware detection modules only (no iterative decoding).
  • BACHI + Key Detection: On top of the complete BACHI, we add a key detection module and use its output to guide the iterative decoding, in the same way as boundary detection.
  • BACHI (Full): We use the full BACHI model (patch embedding, transformer encoder, boundary-aware detection, and iterative decoding).

While boundary detection alone has an unstable effect on performance, iterative decoding has a larger influence, indicating that iterative decoding across chord elements contributes significantly to overall accuracy. Interestingly, adding key detection as an additional conditioning signal slightly decreases full-chord accuracy compared to the full BACHI, likely because errors in key prediction propagate to chord recognition.
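
The per-piece macro accuracy used in this ablation can be sketched as follows: accuracy is computed inside each piece first and then averaged over pieces, so a long piece counts no more than a short one. The sketch assumes that predictions and labels are aligned per time step as (root, quality, bass) tuples and that a full chord counts as correct only when all three fields match; both are assumptions of this example, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

Chord = Tuple[str, str, str]  # (root, quality, bass)
FIELD_NAMES = ("root", "quality", "bass")


def piece_accuracy(pred: List[Chord], gold: List[Chord]) -> Dict[str, float]:
    """Fraction of time steps with a correct root, quality, bass, and full chord."""
    if len(pred) != len(gold) or not gold:
        raise ValueError("pred and gold must be non-empty and equally long")
    n = len(gold)
    acc = {}
    for i, name in enumerate(FIELD_NAMES):
        acc[name] = sum(p[i] == g[i] for p, g in zip(pred, gold)) / n
    # Assumption: a full chord is correct only if every field is correct.
    acc["full"] = sum(p == g for p, g in zip(pred, gold)) / n
    return acc


def macro_accuracy(pieces: List[Tuple[List[Chord], List[Chord]]]) -> Dict[str, float]:
    """Average the per-piece accuracies over pieces and express them in percent."""
    per_piece = [piece_accuracy(pred, gold) for pred, gold in pieces]
    return {key: 100.0 * sum(a[key] for a in per_piece) / len(per_piece)
            for key in per_piece[0]}


# Toy data: a short piece with one quality error and a longer, fully correct piece.
piece_a = ([("C", "maj", "C"), ("G", "min", "G")],
           [("C", "maj", "C"), ("G", "maj", "G")])
piece_b = ([("A", "min", "A")] * 4, [("A", "min", "A")] * 4)
scores = macro_accuracy([piece_a, piece_b])
```

With these toy pieces the quality and full-chord scores are the mean of 50% and 100%, even though five of the six time steps are correct, which shows how macro averaging differs from pooling all time steps.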

Benchmark Comparison

                           Classical Corpus                    POP909-CL
#  Model                   Root     Quality  Bass     Full     Root     Quality  Bass     Full
1  Rule-based              54.6     45.8     50.5     28.4     85.9     69.7     85.8     65.0
2  AugmentedNet            73.9     74.2     72.3     57.2     88.6     84.5     90.5     78.7
3  ChordGNN                73.0     73.7     71.0     58.5     80.7     82.0     82.7     71.6
4  Harmony Transformer v2  76.1     76.8     75.2     62.1     90.5     86.9     92.1     82.2
5  BACHI (ours)            77.8     79.0     77.0     68.1     89.6     86.8     91.3     82.4

For classical music, BACHI achieves the highest scores on all four metrics, with a notable improvement in full-chord accuracy (68.1%, against 62.1% for the strongest baseline). For pop music, BACHI achieves a state-of-the-art full-chord accuracy of 82.4%. Most deep learning methods perform well on pop music, and although BACHI does not rank first on the individual root, quality, and bass metrics, its full-chord accuracy is still the best. Overall, these results demonstrate the strong performance of BACHI on both classical and pop music, with a particular advantage on the more complex classical repertoire.
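
The comparison can be checked with a few lines of arithmetic over the benchmark table. The sketch below copies the table values and computes, for each metric, the gap between BACHI and the strongest baseline on that metric; the helper name `margin_over_best_baseline` is hypothetical.

```python
from typing import Dict, List

METRICS = ["root", "quality", "bass", "full"]

# Accuracy values copied from the benchmark table (Root, Quality, Bass, Full).
CLASSICAL: Dict[str, List[float]] = {
    "Rule-based": [54.6, 45.8, 50.5, 28.4],
    "AugmentedNet": [73.9, 74.2, 72.3, 57.2],
    "ChordGNN": [73.0, 73.7, 71.0, 58.5],
    "Harmony Transformer v2": [76.1, 76.8, 75.2, 62.1],
    "BACHI (ours)": [77.8, 79.0, 77.0, 68.1],
}
POP909_CL: Dict[str, List[float]] = {
    "Rule-based": [85.9, 69.7, 85.8, 65.0],
    "AugmentedNet": [88.6, 84.5, 90.5, 78.7],
    "ChordGNN": [80.7, 82.0, 82.7, 71.6],
    "Harmony Transformer v2": [90.5, 86.9, 92.1, 82.2],
    "BACHI (ours)": [89.6, 86.8, 91.3, 82.4],
}


def margin_over_best_baseline(table: Dict[str, List[float]],
                              ours: str = "BACHI (ours)") -> Dict[str, float]:
    """Per-metric difference between `ours` and the best other model, in points."""
    margins = {}
    for i, metric in enumerate(METRICS):
        best_baseline = max(s[i] for name, s in table.items() if name != ours)
        margins[metric] = round(table[ours][i] - best_baseline, 1)
    return margins


classical_margins = margin_over_best_baseline(CLASSICAL)
pop_margins = margin_over_best_baseline(POP909_CL)
```

A positive margin means BACHI leads on that metric. On the classical corpus every margin is positive; on POP909-CL only the full-chord margin is positive, while the per-field margins are slightly negative.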

Confidence-Ordered Decoding

  • Classical: The most frequent first-decoded field is Quality (40.8%), and the top decoding sequence is Quality → Root → Bass (32.2%). Interpretation: Classical harmony emphasizes voice-leading and functional quality cues.
  • POP909-CL: The most frequent first-decoded field is Bass (66.9%), and the top decoding sequence is Bass → Root → Quality (56.4%). Interpretation: Pop arrangements highlight bass-led cues for chord identity.

These genre-variant orders indicate that the model internalizes musician-like heuristics, supporting our hypothesis that human-mimicking decision paths benefit symbolic ACR and improve upon fixed-order decoding.
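
How a decoding order emerges from confidence-ordered filling can be sketched with a toy loop: at each step the still-masked fields are scored, the single most confident field is committed, and the remaining fields are re-scored given what is already filled. This is an illustrative sketch only. The scoring function `toy_scores` stands in for the masked transformer decoder, and its probabilities are invented for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple

FIELDS = ("root", "quality", "bass")
Filled = Dict[str, Optional[str]]
Scores = Dict[str, Dict[str, float]]  # field -> {label: probability}


def decode_in_confidence_order(
    score_fn: Callable[[Filled], Scores]
) -> Tuple[Filled, List[str]]:
    """Fill root, quality, and bass one at a time, most confident field first."""
    filled: Filled = {f: None for f in FIELDS}
    order: List[str] = []
    while len(order) < len(FIELDS):
        scores = score_fn(filled)  # re-score with the current partial chord
        best_field, best_label, best_prob = "", "", -1.0
        for f in FIELDS:
            if filled[f] is not None:
                continue  # already committed
            label, prob = max(scores[f].items(), key=lambda kv: kv[1])
            if prob > best_prob:
                best_field, best_label, best_prob = f, label, prob
        filled[best_field] = best_label
        order.append(best_field)
    return filled, order


def toy_scores(filled: Filled) -> Scores:
    """Hypothetical scorer: knowing the bass sharpens the root, and knowing
    the root sharpens the quality. All numbers are made up."""
    return {
        "root": ({"C": 0.6, "A": 0.4} if filled["bass"] is None
                 else {"C": 0.9, "A": 0.1}),
        "quality": ({"maj": 0.55, "min": 0.45} if filled["root"] is None
                    else {"maj": 0.8, "min": 0.2}),
        "bass": {"C": 0.7, "E": 0.3},
    }


chord, order = decode_in_confidence_order(toy_scores)
```

With this toy scorer the bass is committed first, then the root, then the quality, which matches the shape of the most common sequence reported for POP909-CL. A different scorer would yield a different order, which is the point of letting confidence decide rather than fixing the order in advance.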

Main Contributions

BibTeX

@article{yao2025bachi,
  author    = {Mingyang Yao and Ke Chen and Shlomo Dubnov and Taylor Berg-Kirkpatrick},
  title     = {BACHI: Boundary-Aware Symbolic Chord Recognition Through Masked Iterative Decoding on Pop and Classical Music},
  journal   = {arXiv},
  year      = {2025},
}