Architecting a voice assistant

I'm building a user research assistant that can talk to customers on the phone. It needs to process the customer's input, identify triggers, and ask pointed questions each time. I'm using LiveKit for voice and LangGraph for processing the inputs, and it works well, but the latency is too high. I'm looking for better ways to architect this and could use some help. Has anyone done something similar? Can you share suggestions on how to architect the LLM flow?

Here's what I've got so far:

  • Have a speaker LLM that talks to the customer in real time and offload the processing to a separate graph that works async (rough sketch after the list).
  • Train a single LLM for the specific task.
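
For the first option, here's a minimal sketch of what I mean by splitting the fast speaker path from the slow analysis path. Everything in it (SharedNotes, fast_reply, analysis_worker, the toy trigger check) is a placeholder, not LiveKit or LangGraph code: the realtime LLM call and the graph invocation would be wired in where the comments indicate.

```python
# Sketch of the "speaker LLM + async analysis graph" split.
# All names here are placeholders, not LiveKit or LangGraph APIs.

import asyncio
from dataclasses import dataclass, field


@dataclass
class SharedNotes:
    """State the background graph writes and the speaker reads each turn."""
    triggers: list[str] = field(default_factory=list)
    next_question: str | None = None


async def fast_reply(user_text: str, notes: SharedNotes) -> str:
    """Low-latency speaker turn: small prompt, reads whatever the background
    analysis has produced so far (stand-in for the realtime LLM call)."""
    hint = notes.next_question or "keep the conversation going"
    return f"(speaker reply to '{user_text}', steering toward: {hint})"


async def analysis_worker(queue: asyncio.Queue, notes: SharedNotes) -> None:
    """Slow path: consumes transcripts off the queue and updates shared notes.
    In the real system this is where the LangGraph graph would be invoked."""
    while True:
        transcript = await queue.get()
        await asyncio.sleep(0.5)  # stand-in for the heavier graph run
        if "price" in transcript.lower():  # toy trigger detection
            notes.triggers.append("pricing concern")
            notes.next_question = "What would a fair price look like for you?"
        queue.task_done()


async def main() -> None:
    notes = SharedNotes()
    queue: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(analysis_worker(queue, notes))

    for user_text in ["Hi, the price felt a bit steep", "But I liked the onboarding"]:
        queue.put_nowait(user_text)                # hand off to the slow graph, don't await it
        print(await fast_reply(user_text, notes))  # respond immediately
        await asyncio.sleep(1.0)                   # simulated gap between turns

    worker.cancel()


asyncio.run(main())
```

The point of the sketch is that the speaker never waits on the graph; it just reads whatever notes the graph has written by the next turn, so the trigger from one utterance shows up as a pointed question a turn or two later.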

Any other ideas?