I’m developing an Android app for instant speech-to-speech translation with a strong focus on offline operation and low latency.
Requirements:
Target devices: Android 10+ on mid-range hardware such as Snapdragon 778G with 8 GB RAM
Language pairs: EN↔RU and EN↔FR, with possible expansion later
Offline-first approach
Privacy-focused design, ideally without cloud APIs
Target latency: under 500 ms from microphone input to translated audio output
Current pipeline:
Speech-to-text → local translation model → text-to-speech
Problems I’m running into:
Speech recognition latency is around 1.5 seconds even when using partial results
Local translation models are too slow on mid-range devices
The audio pipeline can block the UI if not handled carefully
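For context, here is a simplified, desktop-runnable sketch of the staged pipeline I have in mind: capture → STT → translation → TTS, each stage on its own thread, connected by bounded queues so a slow stage back-pressures the previous one instead of blocking the thread that started it. The stage bodies are placeholders of my own; on a real device they would wrap AudioRecord, the STT engine, the translation model, and TTS.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelineSketch {
    // A stage transform; may block (e.g. model inference).
    interface Transform<I, O> { O apply(I in) throws InterruptedException; }

    // Spawn one worker thread per stage: take from the input queue,
    // transform, put to the output queue. Bounded queues give natural
    // back-pressure; nothing here ever runs on the caller's (UI) thread.
    static <I, O> Thread stage(String name, BlockingQueue<I> in,
                               BlockingQueue<O> out, Transform<I, O> f) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    I item = in.take();
                    out.put(f.apply(item));
                }
            } catch (InterruptedException e) {
                // Normal shutdown path.
            }
        }, name);
        t.setDaemon(true);
        t.start();
        return t;
    }

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> audio      = new ArrayBlockingQueue<>(8);
        BlockingQueue<String> text       = new ArrayBlockingQueue<>(8);
        BlockingQueue<String> translated = new ArrayBlockingQueue<>(8);

        // Placeholder stage bodies standing in for real STT and MT calls.
        stage("stt", audio, text, chunk -> "text(" + chunk + ")");
        stage("mt", text, translated, t -> "ru(" + t + ")");

        audio.put("chunk-1"); // the capture side only enqueues, never blocks on inference
        System.out.println(translated.take()); // prints ru(text(chunk-1))
    }
}
```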
Main questions:
What architecture is most practical for sub-500 ms speech-to-speech translation on Android?
Which local translation models are currently the best fit for mobile devices if size and inference speed are the main constraints?
What is the best strategy for handling partial speech recognition results without triggering translation too early?
What threading or pipeline design works best for this kind of audio workflow on Android?
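To make the partial-results question concrete, this is the kind of gating I'm considering (a sketch; the class, method names, and 250 ms threshold are my own, not from any library): only forward a partial hypothesis to translation once it has stayed unchanged for some window, so every keystroke-level revision from the recognizer doesn't trigger a retranslation.

```java
public class PartialGate {
    private static final long STABLE_MS = 250; // assumption: would need tuning per STT model

    private String lastPartial = "";
    private long lastChangeAt = 0;
    private boolean emitted = false;

    // Called on every partial result; returns the text to translate,
    // or null if the hypothesis is still settling (or was already sent).
    public synchronized String onPartial(String partial, long nowMs) {
        if (!partial.equals(lastPartial)) {
            lastPartial = partial;   // hypothesis changed: restart the clock
            lastChangeAt = nowMs;
            emitted = false;
            return null;
        }
        if (!emitted && nowMs - lastChangeAt >= STABLE_MS) {
            emitted = true;          // stable long enough: translate once
            return partial;
        }
        return null;
    }
}
```

Is this debounce approach reasonable, or do people use something smarter, like only translating the prefix that is common across successive partials?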
I’d especially appreciate answers based on practical experience with TensorFlow Lite, ONNX Runtime Mobile, or similar mobile inference setups. My main goal is to understand what architecture and optimization strategy are realistic for low-latency offline translation on Android.
Also, if you don't mind sharing: roughly how much on-device storage does an offline language pack (the models for a single language pair) typically take up?