Background
Speaker diarization has become increasingly valuable in applications designed for high-noise environments, where robust audio processing is essential. Early systems relied on audio data alone, but there is a growing shift toward incorporating multimodal information, such as visual and textual cues, since multimodal integration can significantly improve performance. Recent advances in NLP have introduced powerful tools such as BERT and large language models (LLMs). BERT is a pre-trained transformer model that processes bidirectional context, enabling it to capture relationships between words in a sentence more effectively; it has been widely adopted for tasks requiring fine-grained contextual understanding, such as text classification and question answering. LLMs extend these capabilities by scaling up model size and training on diverse datasets, enabling them to generate coherent text and handle complex reasoning tasks. Because these models capture semantic nuances, leveraging contextual information can mitigate speaker ID errors and improve overall diarization accuracy. This paper was presented at the ASJ (Acoustical Society of Japan).
Methods
1. Speaker Diarization with BERT
The system takes a combination of input features: the first-pass speaker ID sequence, contextual embeddings for the current and next sentences derived with BERT, and speaker embeddings for the current and next sentences obtained from the diarization model. These features are fed into a 3-layer LSTM network that captures sequential dependencies across sentence boundaries, improving the robustness of speaker identification. The output is a refined sequence of speaker IDs, referred to as the Second Pass Speaker ID Sequence. The model is trained with cross-entropy loss on the predicted speaker assignments.
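The second-pass model described above can be sketched as follows. This is a minimal illustrative implementation, not the paper's exact code: the feature dimensions (BERT 768, speaker embedding 256, two speakers, LSTM hidden size 512) are assumptions.

```python
import torch
import torch.nn as nn

class SecondPassRefiner(nn.Module):
    """Hypothetical sketch of the BERT-based second-pass refinement model.

    Per-sentence input: first-pass speaker ID (one-hot) + BERT embeddings for
    the current and next sentences + speaker embeddings for the current and
    next sentences. All dimensions below are assumed, not from the paper.
    """

    def __init__(self, bert_dim=768, spk_dim=256, n_speakers=2, hidden=512):
        super().__init__()
        in_dim = n_speakers + 2 * bert_dim + 2 * spk_dim
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=3, batch_first=True)
        self.head = nn.Linear(hidden, n_speakers)

    def forward(self, feats):
        # feats: (batch, seq_len, in_dim) -> logits over speaker IDs per sentence
        out, _ = self.lstm(feats)
        return self.head(out)

model = SecondPassRefiner()
feats = torch.randn(1, 10, 2 + 2 * 768 + 2 * 256)  # 10 sentences
logits = model(feats)  # (1, 10, 2): refined speaker ID logits
# Training objective: cross-entropy against the reference speaker IDs.
targets = torch.zeros(10, dtype=torch.long)
loss = nn.CrossEntropyLoss()(logits.view(-1, 2), targets)
```

The LSTM lets each prediction depend on surrounding sentences, which is what allows contextually inconsistent first-pass labels to be overridden.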

2. Speaker Diarization with Large Language Models (LLMs)
The second approach leverages the capabilities of an LLM to refine speaker diarization results through contextual understanding. The process begins with the outputs of the initial diarization step, i.e., sequences of speaker utterances and their speaker IDs, which are potentially error-prone. The LLM is given instructions guiding it to correct speaker ID assignments based on the contextual information in the text: it takes as input the utterances with their initially assigned speaker IDs and generates corrected speaker ID assignments as output. This integration of textual context and speaker IDs yields more accurate speaker diarization.
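The correction step above amounts to assembling a prompt from the first-pass output. A minimal sketch of such prompt construction is shown below; the function name and the instruction wording are illustrative assumptions, not the exact prompt used in the paper.

```python
def build_correction_prompt(utterances):
    """Assemble an LLM prompt from first-pass diarization output.

    utterances: list of (first_pass_speaker_id, text) pairs.
    The instruction text is a hypothetical example, not the paper's prompt.
    """
    instruction = (
        "The speaker labels below come from an automatic diarization system "
        "and may contain errors. Using the conversational context, output "
        "the same utterances with corrected speaker labels, one per line."
    )
    lines = [f"Speaker {sid}: {text}" for sid, text in utterances]
    return instruction + "\n\n" + "\n".join(lines)

prompt = build_correction_prompt([
    ("A", "How was the meeting yesterday?"),
    ("A", "It went well, thanks for asking."),  # likely a first-pass error
])
```

The LLM's reply is then parsed back into a speaker ID sequence, replacing the first-pass labels.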

Experiments
1. Experiment Setting
- Dataset
- RevComm Dataset: The RevComm dataset, a Japanese meeting dataset, consists of 237 two-speaker conversations extracted from Zoom meetings. Each conversation includes channel information, enabling the precise extraction and labeling of individual speakers. The dataset comprises approximately 189 hours of audio, divided into 133 hours for training (165 dialogues) and 56 hours for testing (72 dialogues). It reflects real-world conversational scenarios, including variations in speech quality, overlapping speech, and environmental noise.
- AMI Meeting Corpus: The AMI Meeting Corpus is a widely used dataset for speaker diarization research, consisting of recorded multi-speaker meetings. The corpus features multiple audio sources, including individual headset microphones (IHM) and single distant microphones (SDM), allowing diarization models to be tested across varied acoustic conditions. For this study, the standard train-test split provided with the dataset was used to evaluate the effectiveness of the proposed approaches.
- Pretrained model
- rinna/llama-3-youko-8b for RevComm dataset
- Llama3.1-8b for the AMI dataset
2. Experiment Results
This section presents the experimental results for both the BERT-based and LLM-based speaker diarization approaches, evaluated using the character-level diarization error rate (CDER). Insights into the impact of window sizes, speaker embeddings (SE), and performance trade-offs are discussed.
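As a reference for how the metric behaves, the sketch below computes a character-level diarization error rate as the fraction of characters attributed to the wrong speaker, assuming reference and hypothesis utterances are already aligned with identical texts. This is an illustrative definition, not necessarily the exact CDER implementation used in the experiments.

```python
def character_der(ref, hyp):
    """Illustrative character-level diarization error rate.

    ref, hyp: lists of (speaker_id, text) pairs with identical, aligned texts.
    Returns the fraction of characters carrying a wrong speaker label.
    This simplified definition is an assumption about the metric.
    """
    total = sum(len(text) for _, text in ref)
    wrong = sum(len(r_text)
                for (r_spk, r_text), (h_spk, _) in zip(ref, hyp)
                if r_spk != h_spk)
    return wrong / total

ref = [("A", "こんにちは"), ("B", "はい")]
hyp = [("A", "こんにちは"), ("A", "はい")]  # second utterance mislabeled
rate = character_der(ref, hyp)  # 2 of 7 characters wrong
```

Weighting errors by character count makes the metric less sensitive to short backchannel utterances than a segment-level error rate, which suits conversational Japanese data.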
RevComm Dataset (BERT): The BERT-based approach refines first-pass diarization results using contextual embeddings and SE. Table 1 summarizes the CDER results for different window sizes and configurations. The results demonstrate that increasing the window size improves performance, as larger windows provide additional context, enabling the model to make better-informed speaker attribution decisions. Additionally, integrating SE further reduces CDER, particularly with larger window sizes. The best performance of 5.79% CDER is achieved with a window size of 32 and SE, marking a significant improvement over the baseline CDER of 7.62%.
Table 1: CDER for BERT-based diarization on the RevComm dataset, evaluated with varying window sizes and with/without SE

| Configuration | Character DER |
|---|---|
| PyAnnote Baseline | 7.62% |
| window 8 | 6.31% |
| window 16 | 6.29% |
| window 16 + SE | 6.27% |
| window 32 | 6.24% |
| window 32 + SE | 5.79% |
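The window sizes above refer to how many consecutive utterances the model sees at once. A minimal sketch of such windowing is shown below; splitting the dialogue into non-overlapping windows is an assumption about the implementation, not a detail stated in the paper.

```python
def make_windows(utterances, size):
    """Split a dialogue into non-overlapping windows of up to `size` utterances.

    Non-overlapping chunking is an assumed implementation detail used here
    only to illustrate the window-size hyperparameter.
    """
    return [utterances[i:i + size] for i in range(0, len(utterances), size)]

# A 70-utterance dialogue with window size 32 yields chunks of 32, 32, and 6.
windows = make_windows(list(range(70)), 32)
```

Larger windows expose more surrounding context per prediction, which matches the observed trend that CDER drops as the window grows from 8 to 32.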
RevComm Dataset (LLM): The LLM-based approach builds upon the output of the BERT-based model by leveraging the advanced contextual understanding of a large language model. Table 2 shows the CDER results for PyAnnote Baseline, BERT, and LLM configurations with a window size of 32. The LLM-based approach achieves the best overall performance, with a CDER of 5.00%. This result highlights the ability of LLMs to refine speaker labels further by effectively incorporating global contextual information.
Table 2: CDER comparison on the RevComm dataset

| Configuration | Character DER |
|---|---|
| PyAnnote Baseline | 7.62% |
| BERT + SE | 5.79% |
| LLM (window size 32) | 5.00% |
AMI Meeting Corpus (LLM): The LLM-based approach was also evaluated on the AMI dataset. Table 3 presents the CDER results for the PyAnnote Baseline and the LLM-based approach. The results show significant improvements with the LLM-based method, reducing CDER by 5.6 percentage points for IHM and 7.2 points for SDM, demonstrating the effectiveness and generalizability of the method.
Table 3: CDER comparison on the AMI Meeting Corpus

| Configuration | AMI (IHM) | AMI (SDM) |
|---|---|---|
| PyAnnote Baseline | 18.8% | 22.4% |
| LLM | 13.2% | 15.2% |
Conclusion
In this study, we investigated the use of language models to improve speaker diarization accuracy through post-processing, evaluating both BERT-based and LLM-based approaches. The BERT-based method is effective, and the LLM-based approach further improves performance at higher computational cost. These findings highlight the trade-off between accuracy and computational efficiency, emphasizing the need for practical optimizations in real-world applications. Speaker diarization with language models is a purely NLP-based post-processing module: functioning as an ASR post-processor, it boosts accuracy and expands the potential for further analysis in products such as Recpod.