Prototype: A Keyword Spotting-Based Intelligent Audio SoC for IoT

Huihong Liang1, Dongxuan Jia1, Youquan Wang1, Longtao Huang1, Shida Zhong1,*,
Luping Xiang2, Member, IEEE, Lei Huang1, Senior Member, IEEE, and Tao Yuan1, Member, IEEE


Abstract

In this demo, we present a compact intelligent audio system-on-chip (SoC) integrated with a keyword spotting accelerator, enabling ultra-low-latency, low-power, and low-cost voice interaction in Internet of Things (IoT) devices. Through algorithm-hardware co-design, the system achieves high energy efficiency. We demonstrate the system’s capabilities through a live FPGA-based prototype, showcasing stable performance and real-time voice interaction for edge intelligence applications.

1 Introduction

The rapid evolution of IoT and edge computing has driven a surge in demand for intelligent voice systems. In particular, keyword spotting (KWS) plays a vital role in enabling speech-based interaction with smart devices. Because KWS runs in real time and always-on, it must operate within a stringent power budget while maintaining low latency. Consequently, there is a growing trend toward implementing KWS on FPGAs or ASICs to achieve higher energy efficiency. Some KWS accelerators leverage DNNs to push accuracy higher [1], [2], but this comes at the cost of increased latency, power, and chip area. Integrated smart system-on-chips (SoCs) have also emerged as a key area of research: [3] combines a CPU, a DSP, and an AI engine to handle KWS and image classification, but suffers from higher latency and chip cost (in terms of area and power).

To address this challenge, we propose a compact SoC featuring an energy-efficient KWS accelerator that enables ultra-low-latency, low-power, and low-cost voice processing at the edge. Our main contribution lies in algorithm-hardware co-design, combining aggressive algorithm compression with custom hardware computation. In the live demo, we present a fully functional FPGA-based prototype showcasing the system’s real-time voice interaction capabilities.

2 System Design

Figure 1: Architecture of the proposed KWS accelerator.
Figure 2: Architecture of the proposed audio SoC.
Figure 3: Evaluation prototype (top left), FPGA board (top right), illustration of demo (bottom left), and FPGA post-implementation layout (bottom right).

2.1 Algorithm & Optimization

Unlike approaches bound to DNN models with large parameter counts, this work adopts a lightweight pipeline combining Mel-frequency cepstral coefficients (MFCC), vector quantization (VQ), and dynamic time warping (DTW). Speech features are extracted through the multi-stage MFCC pipeline, a codebook representing the keywords is trained with VQ, and template matching is performed with DTW. To compress the algorithm, three strategies are applied: data downsampling, algorithm simplification, and offline parameter optimization, making it suitable for deployment on edge devices.
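As a rough, software-level illustration of this pipeline, the Python sketch below computes MFCC features and matches an utterance against keyword templates with DTW. The parameter values (8 kHz sampling, 128-sample frames, 16 Mel filters, 8 cepstral coefficients) are illustrative assumptions rather than the chip’s actual configuration, and the VQ codebook training step is omitted for brevity.

```python
# Minimal sketch of the MFCC + DTW matching pipeline; all parameter
# values below are illustrative assumptions, not the on-chip settings.
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=8000, frame_len=128, hop=64, n_mels=16, n_ceps=8):
    """Frame the signal, apply a Hann window, take a 128-point FFT,
    apply a triangular Mel filterbank, log, and DCT: one feature
    vector per frame."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hanning(frame_len)
    # Power spectrum (the 128-point FFT matches the accelerator's FFT size)
    power = np.abs(np.fft.rfft(frames, n=frame_len)) ** 2
    # Triangular Mel filterbank
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2), n_mels + 2))
    bins = np.floor((frame_len + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, frame_len // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)
    return dct(logmel, type=2, axis=1, norm="ortho")[:, :n_ceps]

def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between feature sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)

# Usage sketch: match a test utterance against per-keyword templates
# (templates would come from VQ-trained codebooks in the full system).
# templates = {"wake_up": mfcc(ref_wave), "light_on": mfcc(ref_wave2)}
# query = mfcc(test_wave)
# best = min(templates, key=lambda k: dtw_distance(query, templates[k]))
```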

2.2 KWS Accelerator

Based on the optimized algorithm, a dedicated KWS accelerator is designed, as shown in Fig. 1. It consists of two main processing units: the Feature Extraction Unit (FEU) and the Template Classification Unit (TCU), which execute the MFCC and DTW algorithms, respectively. Our architectural improvements include a 7-stage pipeline that resolves the computational bottleneck of the 128-point FFT, sparsity-based Mel-filter optimizations, and dataflow enhancements for the DCT stages. In the TCU, the dynamic time warping is simplified to a fixed diagonal distance computation (sketched below), reducing area by 99.2% and power consumption by 84.2%.
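A minimal sketch of this simplification, under our reading of the fixed diagonal distance: rather than searching the full DTW lattice, query and template frames are compared in lockstep along the diagonal, so only a linear number of distances is accumulated and no cost matrix needs to be stored. This is a behavioral model, not the hardware datapath.

```python
import numpy as np

def diagonal_distance(query, template):
    """Fixed-diagonal matching: compare frames in lockstep instead of
    exploring the full DTW warping lattice. Assumes both sequences are
    aligned to a common length, as a fixed-frame front end would ensure."""
    n = min(len(query), len(template))
    return sum(np.linalg.norm(query[i] - template[i]) for i in range(n)) / n
```

Relative to the O(NM) time and storage of the full DTW recurrence, this diagonal-only form needs O(N) time and O(1) state, which is consistent with the large area and power savings reported above.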

2.3 Audio SoC

The KWS accelerator is integrated into the SoC to enable more sophisticated applications (Fig. 2). The Nuclei E203 RISC-V core is responsible for flexible control, scheduling, and data preprocessing, while the KWS accelerator, serving as a coprocessor, performs speech processing tasks with high parallelism. A custom Audio module drives the Analog Frontend (AFE) for speech input/output. All components collaborate efficiently through an interrupt mechanism, as sketched below.
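The collaboration can be pictured as the event loop below. This is a behavioral sketch in software, not the actual firmware; the signal and method names (Irq, read_frame, start, result) are illustrative assumptions.

```python
from enum import Enum, auto

class Irq(Enum):
    FRAME_READY = auto()  # raised by the Audio module once the AFE buffers a frame
    KWS_DONE = auto()     # raised by the KWS coprocessor when matching finishes

def control_loop(irq_fifo, audio, kws, on_keyword, max_events=100):
    """Stand-in for the E203's interrupt service flow: the CPU handles
    control and scheduling, and offloads speech processing to the
    coprocessor."""
    for _ in range(max_events):
        irq = irq_fifo.get()               # block until an interrupt arrives
        if irq is Irq.FRAME_READY:
            kws.start(audio.read_frame())  # hand the buffered frame to the accelerator
        elif irq is Irq.KWS_DONE:
            on_keyword(kws.result())       # dispatch the recognized command
```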

3 Evaluation and Demonstration

Deployed on the E203 RISC-V processor, the optimized algorithm reduces computation time by 98.2% and memory usage by 98.3%. Implemented in a 40 nm technology, the SoC operates at 50 MHz and the KWS accelerator at 400 kHz, consuming 12.4 mW and 28.3 \(\mu\)W, respectively. Compared with DNN-based KWS designs [1], [2], our accelerator achieves a frame latency of 2.98 ms, a 10× speedup over [1], within an area of 0.28 mm\(^2\), one sixth of the normalized area of [2]. The SoC has a compact core area of 1.34 mm\(^2\) with 128 kB of on-chip memory, further reducing chip cost. Compared with other IoT-oriented audio SoCs [3], [4], our design achieves the smallest area, the lowest latency, and only 47.7% of the power consumption of [3].

An FPGA-based prototype and evaluation platform are built (Fig. 3). The system demonstrates stable performance across more than 300 one-second test cases, correctly recognizing up to 89% of commands, with average processing times of 0.4 ms for the KWS accelerator and 0.5 s for the full SoC. In the live demo, we will showcase the SoC prototype on-site through multiple commands in either English or Chinese, including system wake-up, home-appliance control, and robotic-hand gestures. During the demo, the AFE captures and encodes speech, which is processed on the FPGA for command recognition. Results are displayed on the host computer, with responses delivered through the audio player and executed by the robotic hand and LED. Components within the prototype communicate via Bluetooth. Through this real-time interactive demo, we hope to provide valuable insights into prototyping edge-intelligent systems and to inspire further innovation.

Acknowledgment

This work was supported by the National Key Research and Development Program, China, under Project No. 2023YFB4403805.

References

[1] W. Shan et al., “AAD-KWS: A Sub-\(\mu\)W Keyword Spotting Chip With an Acoustic Activity Detector Embedded in MFCC and a Tunable Detection Window in 28-nm CMOS,” IEEE J. Solid-State Circuits, vol. 58, no. 3, pp. 867–876, Mar. 2023.
[2] H. Yang et al., “A 1.5-\(\mu\)W Fully-Integrated Keyword Spotting SoC in 28-nm CMOS With Skip-RNN and Fast-Settling Analog Frontend for Adaptive Frame Skipping,” IEEE J. Solid-State Circuits, vol. 59, no. 1, pp. 29–39, Jan. 2024.
[3] Y. Dong et al., “A Model-Specific End-to-End Design Methodology for Resource-Constrained TinyML Hardware,” in Proc. 60th ACM/IEEE Design Automation Conference (DAC), Jul. 2023, pp. 1–6.
[4] Z. Fan et al., “AIMMI: Audio and Image Multi-Modal Intelligence via a Low-Power SoC With 2-MByte On-Chip MRAM for IoT Devices,” IEEE J. Solid-State Circuits, vol. 59, no. 10, pp. 3488–3501, Oct. 2024.