Multimodal Voice and Vision Assistant for iOS

A voice AI assistant with real-time audio and video input. Built for iOS, it supports switching between the front and back cameras, natural voice conversations, live screen sharing, and background operation. Because the assistant keeps observing and responding while users work in other apps, it suits hands-free assistance scenarios.


The numbers

latency: not published

cost / min: not published

framework: LiveKit Agents

The stack

telephony: Web only

speech-to-text: Google Speech-to-Text

llm: Gemini 2.5 Pro

text-to-speech: Google Cloud TTS

System prompt

No prompt published.

Config

config.json

{
  "multimodal": true,
  "video_input": true,
  "image_format": "JPEG",
  "max_image_size": "1024x1024",
  "screen_sharing": true,
  "frame_rate_idle": 0.3,
  "camera_switching": true,
  "background_support": true,
  "frame_rate_speaking": 1
}
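
The template does not document how these values are consumed, so the following is only a sketch of one plausible reading. It assumes `frame_rate_idle` and `frame_rate_speaking` are frames per second sampled from the video feed, and that `max_image_size` is a bounding box frames are scaled down to fit before being sent to the model. The helper names `frame_interval_seconds` and `fit_within` are hypothetical and are not part of LiveKit Agents or the template.

```python
import json

# The config published above, copied verbatim.
CONFIG = json.loads("""
{
  "multimodal": true,
  "video_input": true,
  "image_format": "JPEG",
  "max_image_size": "1024x1024",
  "screen_sharing": true,
  "frame_rate_idle": 0.3,
  "camera_switching": true,
  "background_support": true,
  "frame_rate_speaking": 1
}
""")


def frame_interval_seconds(config: dict, user_is_speaking: bool) -> float:
    """Seconds between sampled frames (assumption: rates are frames per second)."""
    key = "frame_rate_speaking" if user_is_speaking else "frame_rate_idle"
    return 1.0 / config[key]


def fit_within(width: int, height: int, max_image_size: str) -> tuple[int, int]:
    """Scale (width, height) to fit inside a 'WxH' box, keeping aspect ratio.

    Assumption: frames are only ever scaled down, never enlarged.
    """
    max_w, max_h = (int(part) for part in max_image_size.split("x"))
    scale = min(max_w / width, max_h / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    print(frame_interval_seconds(CONFIG, user_is_speaking=True))
    print(frame_interval_seconds(CONFIG, user_is_speaking=False))
    print(fit_within(1920, 1080, CONFIG["max_image_size"]))
```

Under this reading, the assistant would sample about one frame per second while the user is speaking and about one frame every 3.3 seconds when idle, which keeps vision token usage low between turns.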
