Lean in to listen:A Real-Time Speech Recognition APP (with Source Code)

Origin

During a conversation with my hearing-impaired mother, I realized I needed to speak very loudly for her to hear me. This inspired me to develop software that could convert my speech into large, real-time text display.

Features

Real-time speech recognition (Chinese/English); No internet required; Only needs microphone and overlay permissions
Floating window display; Adjustable font size; Manual save; Auto-save (when exceeding 300 characters)
Built with WeNet open-source model
Demo Video

After two months of development, the UI interactions still need improvement. Potential expansions like translation or local audio file processing were shelved to focus on core objectives.

WeNet Introduction

While chatting with DeepSeek, I learned about open-source models available in China and ultimately chose WeNet due to its end-to-end architecture, industry-ready deployment capabilities, and streaming recognition support.

Chinese model: wenet_aishell1
ModelScope address: https://modelscope.cn/models/wenet/aishell
Trained on the Aishell1 dataset, 209MB; required files: final.zip, train.yaml, units.txt.
English model: wenet_librispeech
ModelScope address: https://modelscope.cn/models/wenet/librispeech
Trained on the Librispeech dataset, 213MB; required files: final.zip, train.yaml, units.txt.

WeNet Speech Recognition Toolkit (version 3.1.0)

Repository address: https://github.com/wenet-e2e/wenet

1. Main directories:

.github/: Contains GitHub workflows and issue templates
docs/: Project documentation and image resources
examples/: Training examples for various speech datasets (aishell, librispeech, etc.)
runtime/: Runtime components
wenet/: Core source code
tools/: Auxiliary tools We only need files from the runtime/android directory.

2. Android directory structure:

\wenet-3.1.0\runtime\android
├── .gitignore
├── .gradle\
│   ├── 7.5\
│   │   ├── checksums\
│   │   │   └── checksums.lock
│   │   ├── dependencies-accessors\
│   │   │   ├── dependencies-accessors.lock
│   │   │   └── gc.properties
│   │   ├── fileChanges\
│   │   │   └── last-build.bin
│   │   ├── fileHashes\
│   │   │   └── fileHashes.lock
│   │   ├── gc.properties
│   │   └── vcsMetadata\
├── buildOutputCleanup\
│   │   ├── buildOutputCleanup.lock
│   │   └── cache.properties
│   └── vcs-1\
│       └── gc.properties
├── README.md
├── app\
│   ├── .gitignore
│   ├── build.gradle
│   ├── proguard-rules.pro
│   ├── src\
│   │   ├── androidTest\
│   │   │   └── java\
│   │   ├── main\
│   │   │   ├── AndroidManifest.xml
│   │   │   ├── assets\
│   │   │   ├── cpp\
│   │   │   ├── java\
│   │   │   └── res\
│   │   └── test\
│   │       └── java\
│   └── wenet.keystore
├── build.gradle
├── gradle\
│   └── wrapper\
│       ├── gradle-wrapper.jar
│       └── gradle-wrapper.properties
├── gradle.properties
├── gradlew
├── gradlew.bat
└── settings.gradle

The android\app\src\main\cpp folder contains only CMakeLists.txt and wenet.cc, along with 8 shortcuts that point to folders in wenet-3.1.0\runtime\core. You need to copy these folders into your project: bin, cmake, decoder, frontend, kaldi, patch, post_processor, and utils.

3. Build error: Missing WeTextProcessing

WeTextProcessing is an open-source toolkit focused on Text Normalization (TN) and Inverse Text Normalization (ITN), developed and maintained by the wenet-e2e team. It’s primarily used for text preprocessing and postprocessing in NLP tasks, with wide applications in speech recognition (ASR), machine translation, information retrieval, chatbots, etc.

Text Normalization (TN): Converts non-standard text (e.g., colloquial expressions) into standard written form. Examples:
Input: “二点五平方电线” → Output: “2.5平方电线”
Input: “你好 WeTextProcessing 1.0 船新版本儿” → Output: “你好 WeTextProcessing 1.0 全新版本”.
Inverse Text Normalization (ITN): Restores standardized text to natural language expressions.
Examples:
Input: “2.5平方电线” → Output: “二点五平方电线”
Input: “1:05” → Output: “一点零五分”.
You need to include this library file or download and extract it directly to the \main\cpp directory
Repository address:
https://github.com/wenet-e2e/WeTextProcessing

APP Source Code Structure

Contains two modules:

1. Main module: app

The main module contains the app’s interface and logic, including:

(1) MainActivity.kt:

Speech recognition functionality:
Uses RecognizerService for speech recognition; manages audio recording (AudioRecord) and recognition status; supports Chinese/English recognition switching
Interface management:
Switching between main interface and floating window (FloatingWindowManager); font size adjustment; text display area management
File operations:
Saving recognition results to files; automatic save option; clearing current results
Permission management:
Handling floating window permission requests; audio recording permission checks

(2) FloatingWindowManager.kt:

Floating window management:
Creating and managing floating window interface; handling display, hiding, and removal; managing layout parameters
Interface interaction:
Implementing drag functionality; handling click events to show/hide button bar; providing close button to return to main interface
Font size control:
Switching between large/small font display; synchronizing font size settings between main interface and floating window
Screen orientation lock:
Providing orientation lock/unlock functionality; handling layout adjustments during screen rotation
State synchronization:
Maintaining state synchronization with MainActivity; sharing text display box reference; synchronizing font size settings
Layout management:
Dynamically adjusting floating window layout; handling layout changes in different screen orientations; calculating and updating control sizes

2. Speech recognition module: wenetmodel

Contains model files, Recognize.java, VoiceRectView.java, RecognizerService.java, and cpp folder;

(1) assets directory:

wenetmodels_zn/: Chinese model
wenetmodels_en/: English model

(2) Recognize.java:

This class serves as the core bridge for speech recognition functionality, forwarding Java layer calls to the underlying C++ implementation. Its main role is as a wrapper for Java Native Interface (JNI) to connect the Java layer with the underlying WeNet speech recognition engine. Specific functions include:
Native library loading:
Static code block loads the local library named “wenet”; includes logging for successful/failed loading; throws runtime exception on failure
JNI method declarations:
init(): Initializes model and decoder; reset(): Resets decoder state; acceptWaveform(): Receives audio waveform data; setInputFinished(): Sets input end flag; getFinished(): Checks if recognition is complete; startDecode(): Starts decoding thread; getResult(): Retrieves recognition results; clearResultAtEndpoint(): Clears endpoint accumulated results; getState(): Gets current decoding state

(3) VoiceRectView.java: Used for drawing waveforms (not utilized in this project).

(4) RecognizerService.java:

Speech recognition service management: Runs speech recognition as a background service; handles service lifecycle (onCreate, onStartCommand, onDestroy); provides foreground service notification
Model management: Manages Chinese and English speech recognition models; extracts model files from assets directory to device storage; provides model initialization functionality
Speech recognition processing: Starts/stops speech recognition process; handles audio data queue; calls underlying recognition engine (Recognize class) for speech recognition
Callback mechanism: Provides initialization callback (InitializationCallback); provides recognition result callback (RecognitionCallback)
Multithreaded processing: Uses separate threads for model initialization and speech recognition; uses BlockingQueue for audio data; uses AtomicBoolean for recognition state management
Notification management: Creates and manages foreground service notification; handles notification click to return to main interface

(5) cpp directory:

Contains wenet.cc file, which is the C++ implementation responsible for loading models, initializing recognizers, processing audio data, calling recognition engine, and returning recognition results.

(6) Additional notes:

Do not delete pytorch_android-1.13.0.aar and wenet-openfst-android-1.0.2.aar in the wenetmodel\build folder during reconstruction.

3. Calling logic:

    1)、Initialization Phase
        // Initialization in MainActivity
        recognizer = new RecognizerService(context)  // Construct service
        ↓
        recognizer.initialize(callback)  //Trigger initialization process
            ↓
            extractAssets()  //Extract model files from APK to device storage: data/user/0/com.lhh.voiceAssistant.debug/files/wenet_models/
                ↓
                Recognize.init(modelDir)  // Initialize recognition engine
                ↓
                Recognize.reset()  // Reset engine state
    2)、Speech Recognition Phase
        recognizer.startRecognition(callback)  // Start recognition
        ↓
        Recognize.startDecode()  // Start decoder
        ↓
        // Process audio queue in loop
        while (isProcessing) {
            audioQueue.poll()  // Get audio data
            ↓
            Recognize.acceptWaveform(audioData)  // Feed audio data to engine
            ↓
            Recognize.getResult()  // Get recognition result
            ↓
            callback.onResult()  // Callback result
        }
    3)、 Auxiliary Method Calls
        // Audio Input
        recognizer.feedAudioData(short[])  // Feed audio data to queue
            ↓
            audioQueue.offer()  // Store in queue

        // Stop Recognition
        recognizer.stopRecognition()  
            ↓
            Recognize.setInputFinished()  // Notify input end
    4)、Key Call Sequences
        MainActivity.onCreate()
        → RecognizerService.initialize()
            → extractAssets() 
            → Recognize.init()
        → startRecognition()
            → Process audioQueue in loop
        → stopRecognition()

Lessons Learned:

Dependency version mismatches are the root of most issues - consult AI for optimal versions early.
CMake configuration:
Since the model is written in C, we had to set up a CMakeLists.txt configuration file for cross-platform compilation. The biggest challenges involved resolving library linking and dependency ordering issues.
These details have been recorded in the source code or in a STUDYME.md file within the relevant directory.

Downloads

App (Baidu Netdisk) Password: mvbr (~400MB)
The source code link has been removed to avoid copyright disputes. (2025-09-18)

Support My Work

Buy Me a Coffee at ko-fi.com