Lean in to listen:A Real-Time Speech Recognition APP (with Source Code)

2025-08-07·
XIETU
XIETU
· 4 min read · 阅读量

Origin

During a conversation with my hearing-impaired mother, I realized I needed to speak very loudly for her to hear me. This inspired me to develop software that could convert my speech into large, real-time text display.

Features

  • Real-time speech recognition (Chinese/English); No internet required; Only needs microphone and overlay permissions
  • Floating window display; Adjustable font size; Manual save; Auto-save (when exceeding 300 characters)
  • Built with WeNet open-source model
  • Demo Video

After two months of development, the UI interactions still need improvement. Potential expansions like translation or local audio file processing were shelved to focus on core objectives.

WeNet Introduction

While chatting with DeepSeek, I learned about open-source models available in China and ultimately chose WeNet due to its end-to-end architecture, industry-ready deployment capabilities, and streaming recognition support.

WeNet Speech Recognition Toolkit (version 3.1.0)

Repository address: https://github.com/wenet-e2e/wenet

1. Main directories:

  • .github/: Contains GitHub workflows and issue templates
  • docs/: Project documentation and image resources
  • examples/: Training examples for various speech datasets (aishell, librispeech, etc.)
  • runtime/: Runtime components
  • wenet/: Core source code
  • tools/: Auxiliary tools We only need files from the runtime/android directory.

2. Android directory structure:

\wenet-3.1.0\runtime\android
├── .gitignore
├── .gradle\
│   ├── 7.5\
│   │   ├── checksums\
│   │   │   └── checksums.lock
│   │   ├── dependencies-accessors\
│   │   │   ├── dependencies-accessors.lock
│   │   │   └── gc.properties
│   │   ├── fileChanges\
│   │   │   └── last-build.bin
│   │   ├── fileHashes\
│   │   │   └── fileHashes.lock
│   │   ├── gc.properties
│   │   └── vcsMetadata\
├── buildOutputCleanup\
│   │   ├── buildOutputCleanup.lock
│   │   └── cache.properties
│   └── vcs-1\
│       └── gc.properties
├── README.md
├── app\
│   ├── .gitignore
│   ├── build.gradle
│   ├── proguard-rules.pro
│   ├── src\
│   │   ├── androidTest\
│   │   │   └── java\
│   │   ├── main\
│   │   │   ├── AndroidManifest.xml
│   │   │   ├── assets\
│   │   │   ├── cpp\
│   │   │   ├── java\
│   │   │   └── res\
│   │   └── test\
│   │       └── java\
│   └── wenet.keystore
├── build.gradle
├── gradle\
│   └── wrapper\
│       ├── gradle-wrapper.jar
│       └── gradle-wrapper.properties
├── gradle.properties
├── gradlew
├── gradlew.bat
└── settings.gradle

The android\app\src\main\cpp folder contains only CMakeLists.txt and wenet.cc, along with 8 shortcuts that point to folders in wenet-3.1.0\runtime\core. You need to copy these folders into your project: bin, cmake, decoder, frontend, kaldi, patch, post_processor, and utils.

3. Build error: Missing WeTextProcessing

WeTextProcessing is an open-source toolkit focused on Text Normalization (TN) and Inverse Text Normalization (ITN), developed and maintained by the wenet-e2e team. It’s primarily used for text preprocessing and postprocessing in NLP tasks, with wide applications in speech recognition (ASR), machine translation, information retrieval, chatbots, etc.

  • Text Normalization (TN): Converts non-standard text (e.g., colloquial expressions) into standard written form. Examples:
    Input: “二点五平方电线” → Output: “2.5平方电线”
    Input: “你好 WeTextProcessing 1.0 船新版本儿” → Output: “你好 WeTextProcessing 1.0 全新版本”.

  • Inverse Text Normalization (ITN): Restores standardized text to natural language expressions.
    Examples:
    Input: “2.5平方电线” → Output: “二点五平方电线”
    Input: “1:05” → Output: “一点零五分”.

  • You need to include this library file or download and extract it directly to the \main\cpp directory
    Repository address:
    https://github.com/wenet-e2e/WeTextProcessing

APP Source Code Structure

Contains two modules:

1. Main module: app

The main module contains the app’s interface and logic, including:

(1) MainActivity.kt:
  • Speech recognition functionality:
    Uses RecognizerService for speech recognition; manages audio recording (AudioRecord) and recognition status; supports Chinese/English recognition switching

  • Interface management:
    Switching between main interface and floating window (FloatingWindowManager); font size adjustment; text display area management

  • File operations:
    Saving recognition results to files; automatic save option; clearing current results

  • Permission management:
    Handling floating window permission requests; audio recording permission checks

(2) FloatingWindowManager.kt:
  • Floating window management:
    Creating and managing floating window interface; handling display, hiding, and removal; managing layout parameters
  • Interface interaction:
    Implementing drag functionality; handling click events to show/hide button bar; providing close button to return to main interface
  • Font size control:
    Switching between large/small font display; synchronizing font size settings between main interface and floating window
  • Screen orientation lock:
    Providing orientation lock/unlock functionality; handling layout adjustments during screen rotation
  • State synchronization:
    Maintaining state synchronization with MainActivity; sharing text display box reference; synchronizing font size settings
  • Layout management:
    Dynamically adjusting floating window layout; handling layout changes in different screen orientations; calculating and updating control sizes

2. Speech recognition module: wenetmodel

Contains model files, Recognize.java, VoiceRectView.java, RecognizerService.java, and cpp folder;

(1) assets directory:
  • wenetmodels_zn/: Chinese model
  • wenetmodels_en/: English model
(2) Recognize.java:
  • This class serves as the core bridge for speech recognition functionality, forwarding Java layer calls to the underlying C++ implementation. Its main role is as a wrapper for Java Native Interface (JNI) to connect the Java layer with the underlying WeNet speech recognition engine. Specific functions include:

  • Native library loading:
    Static code block loads the local library named “wenet”; includes logging for successful/failed loading; throws runtime exception on failure

  • JNI method declarations:
    init(): Initializes model and decoder; reset(): Resets decoder state; acceptWaveform(): Receives audio waveform data; setInputFinished(): Sets input end flag; getFinished(): Checks if recognition is complete; startDecode(): Starts decoding thread; getResult(): Retrieves recognition results; clearResultAtEndpoint(): Clears endpoint accumulated results; getState(): Gets current decoding state

(3) VoiceRectView.java: Used for drawing waveforms (not utilized in this project).
(4) RecognizerService.java:
  • Speech recognition service management: Runs speech recognition as a background service; handles service lifecycle (onCreate, onStartCommand, onDestroy); provides foreground service notification

  • Model management: Manages Chinese and English speech recognition models; extracts model files from assets directory to device storage; provides model initialization functionality

  • Speech recognition processing: Starts/stops speech recognition process; handles audio data queue; calls underlying recognition engine (Recognize class) for speech recognition

  • Callback mechanism: Provides initialization callback (InitializationCallback); provides recognition result callback (RecognitionCallback)

  • Multithreaded processing: Uses separate threads for model initialization and speech recognition; uses BlockingQueue for audio data; uses AtomicBoolean for recognition state management

  • Notification management: Creates and manages foreground service notification; handles notification click to return to main interface

(5) cpp directory:

Contains wenet.cc file, which is the C++ implementation responsible for loading models, initializing recognizers, processing audio data, calling recognition engine, and returning recognition results.

(6) Additional notes:

Do not delete pytorch_android-1.13.0.aar and wenet-openfst-android-1.0.2.aar in the wenetmodel\build folder during reconstruction.

3. Calling logic:

    1)、Initialization Phase
        // Initialization in MainActivity
        recognizer = new RecognizerService(context)  // Construct service
        recognizer.initialize(callback)  //Trigger initialization process
            extractAssets()  //Extract model files from APK to device storage: data/user/0/com.lhh.voiceAssistant.debug/files/wenet_models/
                Recognize.init(modelDir)  // Initialize recognition engine
                Recognize.reset()  // Reset engine state
    2)、Speech Recognition Phase
        recognizer.startRecognition(callback)  // Start recognition
        Recognize.startDecode()  // Start decoder
        // Process audio queue in loop
        while (isProcessing) {
            audioQueue.poll()  // Get audio data
            Recognize.acceptWaveform(audioData)  // Feed audio data to engine
            Recognize.getResult()  // Get recognition result
            callback.onResult()  // Callback result
        }
    3)、 Auxiliary Method Calls
        // Audio Input
        recognizer.feedAudioData(short[])  // Feed audio data to queue
            audioQueue.offer()  // Store in queue

        // Stop Recognition
        recognizer.stopRecognition()  
            Recognize.setInputFinished()  // Notify input end
    4)、Key Call Sequences
        MainActivity.onCreate()
        → RecognizerService.initialize()
            → extractAssets() 
            → Recognize.init()
        → startRecognition()
            → Process audioQueue in loop
        → stopRecognition()

Lessons Learned:

  • Dependency version mismatches are the root of most issues - consult AI for optimal versions early.
  • CMake configuration:
    Since the model is written in C, we had to set up a CMakeLists.txt configuration file for cross-platform compilation. The biggest challenges involved resolving library linking and dependency ordering issues.
    These details have been recorded in the source code or in a STUDYME.md file within the relevant directory.

Downloads

App (Baidu Netdisk) Password: mvbr (~400MB)
Source Code (Baidu Netdisk) Password: btqy (~400MB, requires rebuild)