Continuing the journey of getting my hands dirty with voice UIs, I wrote down some user-perceived latency numbers I was seeing while building VUIs.
Key points:
- I used the 'pipeline' approach of STT + LLM + TTS (as opposed to the speech-to-speech approach, e.g. gpt-realtime) - a rough sketch of this pipeline is below
- This approach (with my specific setup) yielded latency well above the ~500ms target at which conversations feel "natural" and free of awkward silences
- With gpt-5-mini as the LLM I saw ~1.4s of latency, and with Llama 3.1-8b on Cerebras I saw ~1.1s
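
For reference, here is a minimal Python sketch of that pipeline with per-stage timing. The stage functions (`transcribe`, `generate_reply`, `synthesize`) are hypothetical placeholders rather than any specific provider's API, and the timing is sequential (non-streaming), so treat it as an illustration of where latency accumulates rather than my exact measurement setup.

```python
import time

# Placeholder stages: in a real pipeline these would call an STT, LLM,
# and TTS provider respectively. Names and signatures are hypothetical.
def transcribe(audio_in: bytes) -> str:
    return "what's the weather like today"

def generate_reply(transcript: str) -> str:
    return "It looks sunny for most of the afternoon."

def synthesize(reply: str) -> bytes:
    return b"\x00" * 1600  # stand-in for synthesized audio

def run_turn(audio_in: bytes) -> dict:
    """Run one conversational turn and record per-stage wall-clock latency."""
    timings = {}

    t0 = time.perf_counter()
    transcript = transcribe(audio_in)        # STT
    timings["stt_ms"] = (time.perf_counter() - t0) * 1000

    t1 = time.perf_counter()
    reply = generate_reply(transcript)       # LLM
    timings["llm_ms"] = (time.perf_counter() - t1) * 1000

    t2 = time.perf_counter()
    audio_out = synthesize(reply)            # TTS
    timings["tts_ms"] = (time.perf_counter() - t2) * 1000

    # User-perceived latency here = end of user speech -> reply audio ready.
    timings["total_ms"] = (time.perf_counter() - t0) * 1000
    return timings

if __name__ == "__main__":
    print(run_turn(b"\x00" * 16000))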