SpeechStack
LiveKit Agents · updated May 14, 2025 · other · support

Multimodal Voice and Vision Assistant for iOS

A voice AI assistant with realtime audio and video input capabilities. Built for iOS, it supports front and back camera switching, natural voice conversations, live screen sharing, and background operation. The assistant can observe and interact seamlessly while users work on other tasks, making it suitable for hands-free assistance scenarios.

The numbers
latency
cost / min
framework: LiveKit Agents
The stack
telephony: Web Only
speech-to-text: Google Speech-to-Text
llm: Gemini 2.5 Pro
text-to-speech: Google Cloud TTS
System prompt
No prompt published.
Config
config.json
{
  "multimodal": true,
  "video_input": true,
  "image_format": "JPEG",
  "max_image_size": "1024x1024",
  "screen_sharing": true,
  "frame_rate_idle": 0.3,
  "camera_switching": true,
  "background_support": true,
  "frame_rate_speaking": 1
}
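
One way to read the config above: a minimal Python sketch that loads it and derives a frame-sampling interval and a downscaled frame size. It assumes that `frame_rate_idle` and `frame_rate_speaking` are frames per second and that `max_image_size` is a "WIDTHxHEIGHT" bound; the template does not state either. The helper names (`frame_interval_seconds`, `parse_max_image_size`, `fit_within`) are hypothetical and are not part of LiveKit Agents.

```python
import json

# The config.json published with the template, embedded so the sketch is self-contained.
CONFIG_JSON = """
{
  "multimodal": true,
  "video_input": true,
  "image_format": "JPEG",
  "max_image_size": "1024x1024",
  "screen_sharing": true,
  "frame_rate_idle": 0.3,
  "camera_switching": true,
  "background_support": true,
  "frame_rate_speaking": 1
}
"""


def frame_interval_seconds(config, user_speaking):
    """Seconds to wait between sampled video frames.

    Assumption: the frame_rate_* values are frames per second.
    """
    key = "frame_rate_speaking" if user_speaking else "frame_rate_idle"
    fps = float(config[key])
    if fps <= 0:
        raise ValueError(f"{key} must be positive, got {fps}")
    return 1.0 / fps


def parse_max_image_size(config):
    """Parse "WIDTHxHEIGHT" (assumed format) into an (int, int) tuple."""
    width, height = config["max_image_size"].lower().split("x")
    return int(width), int(height)


def fit_within(width, height, max_width, max_height):
    """Scale a frame down to fit the bounds, keeping aspect ratio; never upscale."""
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


config = json.loads(CONFIG_JSON)

if __name__ == "__main__":
    max_w, max_h = parse_max_image_size(config)
    print("interval while speaking:", frame_interval_seconds(config, True))
    print("interval while idle:", frame_interval_seconds(config, False))
    print("1920x1080 frame resized to:", fit_within(1920, 1080, max_w, max_h))
```

Under these assumptions, the assistant would sample about one frame per second while the user is speaking and roughly one frame every 3.3 seconds when idle, which keeps image traffic low during background operation.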
Tags
multimodal, vision, ios, mobile, screen-sharing, background-mode, gemini-live
contributed by @bcherry · Apache-2.0 · source: github discovery · languages: en-US