RevComm Tech Blog

Reinventing communication to create a society where people think of one another

IPSJ 266th Natural Language Processing & 158th Speech Language Information Processing Joint Research Presentation Meeting - Presentation and Participation Report

Introduction

This is Santoso from RevComm Research. I participated in a research meeting held in mid-December, where I presented our ongoing research under the theme "Generative Error Correction for Product Names with Phonemic and Lexical Constraints." I had meaningful discussions with many experts, and it was a very fruitful experience.

Conference Overview

https://www.ipsj.or.jp/kenkyukai/event/nl266slp158.html

The conference was hosted by the Information Processing Society of Japan (IPSJ), the largest IT-related academic organization in Japan. This event was a joint meeting of two special interest groups: Natural Language (NL) and Speech and Language Processing (SLP). These groups hold joint meetings several times a year to facilitate interdisciplinary dialogue, primarily through oral research presentations. Further details are posted on the page linked above.

Conference Period

December 15-17, 2025

Venue

Kyoto Terrsa (Kyoto City) (https://www.kyoto-terrsa.or.jp/)

Target Fields

Mainly targeting Natural Language Processing (NL) and Speech Language Processing (SLP). In particular, the advancement of speech recognition using Large Language Models (LLM) and domain-specific language processing were major themes.

Number of Presentations

  • Oral Presentations: 28
  • Invited Lectures: 2
  • International Conference Participation Reports: 2 (INTERSPEECH 2025 and ACL2025)

In total, there were 32 presentations across 13 sessions.

Presentation statistics

Number of Participants

Approximately 200 people (including online participants)

The atmosphere of the presentation venue

Welcome to the conference

Invited lecture session

About Our Presentation

Title

Generative Error Correction for Product Names with Phonemic and Lexical Constraints

Background and Motivation

In this study, we proposed a method to post-process and correct product names and model numbers, which are terms that ASR systems often struggle with, using an LLM-based framework. Specifically, we integrate phonetic information extracted from the audio with a product dictionary into the LLM prompt. This allows for accurate correction of out-of-vocabulary (OOV) terms without the need to retrain the core ASR model.

Experimental results on both English and Japanese datasets showed significant improvements in product name recognition accuracy. This development is expected to greatly enhance the quality of automated meeting minutes in business settings where specialized terminology is frequent.

Our Approach

To address this problem, our work makes two main contributions:

  • Provision of a dataset construction protocol: We proposed a method for constructing conversational speech data specialized for product names and model numbers.

Fig. 1. Dataset construction protocol

  • Proposal of a new error correction framework: We developed a zero-shot error correction method (GEC) that combines phonemic information (Phoneme), a product dictionary (Lexical), and conversational context, without retraining the ASR.

Fig. 2. Flow of the generative error correction (GEC)

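The prompt-assembly step of the framework can be sketched as follows. This is an illustrative reconstruction only: the field names and prompt wording are hypothetical, not the exact format used in the paper.

```python
# Hypothetical sketch of assembling a zero-shot GEC prompt that combines the
# ASR hypothesis, its phoneme sequence, product dictionary candidates, and
# conversational context. The prompt wording is illustrative, not the paper's.

def build_gec_prompt(hypothesis: str, phonemes: str,
                     candidates: list[str], context: str) -> str:
    """Merge all correction cues into a single prompt for the LLM."""
    candidate_block = "\n".join(f"- {c}" for c in candidates)
    return (
        "Correct any misrecognized product names in the transcript below.\n\n"
        f"Conversation context:\n{context}\n\n"
        f"ASR hypothesis: {hypothesis}\n"
        f"Phoneme sequence: {phonemes}\n"
        "Product dictionary candidates:\n"
        f"{candidate_block}\n\n"
        "Corrected transcript:"
    )

# Example with made-up product names:
prompt = build_gec_prompt(
    hypothesis="I'd like to order the X R twenty twenty pro",
    phonemes="AY D L AY K T UW AO R D ER ...",
    candidates=["XR-2020 Pro", "XR-2010"],
    context="Customer is asking about camera models.",
)
print(prompt)
```

Because all cues live in the prompt, the core ASR model never needs retraining; new products are supported by updating the dictionary alone.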

Q&A Highlights

The 20-minute presentation was followed by a lively 5-minute Q&A session.

Q1: What exact threshold did you use for the fuzzy matching, and why?

A: We empirically tested thresholds from 0.6 to 0.9 in 0.05 increments. The final value was determined during validation to balance retrieval success (recall) against the cleanliness of the context (precision) provided to the LLM.
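A threshold sweep of this kind can be sketched as below. Here `difflib.SequenceMatcher` stands in for whatever similarity measure the paper actually used, and the product names are invented for illustration.

```python
# Illustrative sketch of fuzzy-matching an ASR hypothesis against a product
# dictionary while sweeping thresholds from 0.6 to 0.9 in 0.05 steps.
# difflib's ratio() is a stand-in for the paper's actual similarity measure.
from difflib import SequenceMatcher

def retrieve_candidates(term: str, dictionary: list[str],
                        threshold: float) -> list[str]:
    """Return dictionary entries whose similarity to `term` meets the threshold."""
    return [
        entry for entry in dictionary
        if SequenceMatcher(None, term.lower(), entry.lower()).ratio() >= threshold
    ]

dictionary = ["XR-2020 Pro", "XR-2010", "ZV-100"]  # hypothetical product names
for i in range(7):  # thresholds 0.60, 0.65, ..., 0.90
    threshold = 0.6 + 0.05 * i
    hits = retrieve_candidates("xr 2020 pro", dictionary, threshold)
    print(f"threshold={threshold:.2f}: {hits}")
```

A lower threshold retrieves more candidates (higher recall) but risks cluttering the LLM's context with near-misses; a higher threshold keeps the context clean at the cost of missing valid matches.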

Q2: The number of speakers in the dataset seems small. Is this a common setting for testing?

A: For this study, we used text-to-speech (TTS) to create synthetic data with two male and two female voices for each language. In this domain, the diversity of product names is often more critical than the number of speakers. We believe this is a valid testing setting as it ensures enough variation in how product names are recognized.

Q3: Can a summary be used as context instead of the full conversation sequence?

A: The goal of our research is to clarify business insights by improving product name recognition; accurate summaries actually depend on correct product names. If a summary is generated from misrecognized transcripts, key information might be lost or the quality degraded. This could prevent the LLM from focusing on the correct product identifiers, thereby reducing its correction capability. Thus, using the raw conversation sequence as direct context is more appropriate.

Notable Sessions

Invited Lecture 1: Multimodal Language Understanding in LLM

This inspiring lecture examined the degree to which LLMs "understand" linguistic and visual information from four perspectives. In an application to consecutive interpretation, prompt control overcame data shortages and achieved accuracy surpassing conventional methods. However, experiments such as articulatory estimation from visual information demonstrated that genuine visual understanding remains a challenge.

Invited Lecture 2: The Past, Present, and Future of Speech Corpora

This lecture presented recommendations regarding speech data in the AI era. It emphasized the importance of international collaboration to expand corpora for Japanese and other Asian languages, and the need to balance the "quantity" of data for LLM development with the "quality" of data for rigorous experiments. It also stressed that conversational speech, especially recordings that include facial data, is sensitive personal information requiring strict ethical management and consent.

International Conference Reports

  • INTERSPEECH 2025: It was reported that the fusion of ASR, LLM, and self-supervised learning (SSL) is the mainstream trend. The rise of "Speech LMs" (Speech Language Models) is expanding the field into paralinguistic understanding, full-duplex dialogue, and zero-shot TTS.
  • ACL 2025: The speaker shared insights into the ARR (ACL Rolling Review) system. Discussions focused on the importance of re-submissions for improvement and the necessity of robust rebuttals. It also provided tips for creating visually minimal, high-impact posters to attract attention in crowded venues.

Summary

It was a privilege to share our latest research at the conference this December. With nearly 200 participants, the event was full of energy, featuring heated discussions about the possibilities of speech processing with LLMs, along with invited lectures and international conference reports.

Speaking directly with other experts provided incredibly helpful feedback. While our team worked hard to prepare a strong presentation, these live discussions were essential for identifying new challenges and shaping the future direction of our work.

At RevComm Research, we plan to advance research on multilingual support and applications to specific domains based on these research findings. We are currently looking for new colleagues to join us in solving these exciting technical challenges.