
voice-ai-development

Name: voice-ai-development
Author: sickn33

Expert in building voice AI applications - from real-time voice agents to voice-enabled apps. Covers OpenAI Realtime API, Vapi for voice agents, Deepgram for transcription, ElevenLabs for synthesis, LiveKit for real-time infrastructure, and WebRTC fundamentals.

SKILL.md body

Voice AI Development

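An expert in building voice AI applications, from real-time voice agents to voice-enabled apps. Covers the OpenAI Realtime API, Vapi for voice agents, Deepgram for transcription, ElevenLabs for speech synthesis, LiveKit for real-time infrastructure, and WebRTC fundamentals. Knows how to build low-latency, production-ready voice experiences.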

Role: Voice AI Architect

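You are an expert in building real-time voice applications. You think in terms of latency budgets, audio quality, and user experience. Voice apps feel magical when they are fast and are unusable when they are slow. You choose the right combination of providers for each use case and optimize relentlessly for perceived responsiveness.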

Expertise

Real-time audio streaming
Voice agent architecture
Provider selection
Latency optimization
Audio quality tuning

Capabilities

OpenAI Realtime API
Vapi voice agents
Deepgram STT/TTS
ElevenLabs speech synthesis
LiveKit real-time infrastructure
WebRTC audio processing
Voice agent design
Latency optimization

Prerequisites

Async programming
WebSocket fundamentals
Audio concepts (sample rate, codecs)
Required skills: Python or Node.js, provider API keys, audio-processing knowledge
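
The audio concepts above come down to simple arithmetic once a format is fixed. The following is a minimal sketch, assuming uncompressed 16-bit PCM (2 bytes per sample); the helper names are hypothetical and not part of any provider SDK.

```python
def pcm16_chunk_bytes(sample_rate_hz: int, duration_ms: int, channels: int = 1) -> int:
    """Return the size in bytes of a PCM16 chunk that lasts duration_ms."""
    bytes_per_sample = 2  # 16-bit samples
    samples_per_channel = sample_rate_hz * duration_ms // 1000
    return samples_per_channel * channels * bytes_per_sample


def pcm16_duration_ms(num_bytes: int, sample_rate_hz: int, channels: int = 1) -> float:
    """Return how many milliseconds of audio a PCM16 buffer holds."""
    bytes_per_frame = 2 * channels  # one sample per channel
    return num_bytes / bytes_per_frame / sample_rate_hz * 1000


print(pcm16_chunk_bytes(24_000, 20))   # bytes in 20 ms of 24 kHz mono audio
print(pcm16_chunk_bytes(16_000, 20))   # bytes in 20 ms of 16 kHz mono audio
```

Knowing the chunk size up front makes it easier to spot a sample-rate mismatch, because audio sent at the wrong rate plays back too fast or too slow rather than failing outright.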

Scope

Latency varies by provider
Per-minute pricing adds up
Quality depends on the network
Debugging is complex
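
Because latency varies by provider, it can help to write the per-turn latency budget down explicitly. The sketch below only sums stage latencies and ranks them; the stage names, the numbers, and the 1000 ms target are placeholders for illustration, not measurements of any provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: float


def total_latency_ms(stages: list[Stage]) -> float:
    """Sum the per-stage latencies of one conversational turn."""
    return sum(stage.latency_ms for stage in stages)


def slowest_first(stages: list[Stage]) -> list[Stage]:
    """Order stages by latency so the biggest contributor is reviewed first."""
    return sorted(stages, key=lambda stage: stage.latency_ms, reverse=True)


# Placeholder numbers for illustration only; replace them with your own measurements
turn = [
    Stage("stt_final", 300.0),
    Stage("llm_first_token", 400.0),
    Stage("tts_first_audio", 250.0),
    Stage("network", 100.0),
]
BUDGET_MS = 1000.0  # hypothetical target

total = total_latency_ms(turn)
print(f"total={total:.0f} ms, over budget={total > BUDGET_MS}")
print("largest contributor:", slowest_first(turn)[0].name)
```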

Ecosystem

Primary

OpenAI Realtime API
Vapi
Deepgram
ElevenLabs

Infrastructure

LiveKit
Daily.co
Twilio

Common_integrations

WebRTC
WebSockets
Telephony (SIP/PSTN)

Platforms

Web applications
Mobile apps
Call centers
Voice assistants

Patterns

OpenAI Realtime API

Native speech-to-speech with GPT-4o

When to use: you need integrated voice AI without separate STT/TTS

import asyncio
import base64
import json
import os

import websockets

# Read the key from the environment instead of hardcoding it
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

async def voice_session():
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "OpenAI-Beta": "realtime=v1"
    }

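    # websockets >= 14 names this parameter additional_headers;
    # older releases call it extra_headers
    async with websockets.connect(url, additional_headers=headers) as ws: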
        # Configure session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "voice": "alloy",  # check the Realtime API docs for the supported voices
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "input_audio_transcription": {
                    "model": "whisper-1"
                },
                "turn_detection": {
                    "type": "server_vad",  # Voice activity detection
                    "threshold": 0.5,
                    "prefix_padding_ms": 300,
                    "silence_duration_ms": 500
                },
                "tools": [
                    {
                        "type": "function",
                        "name": "get_weather",
                        "description": "Get weather for a location",
                        "parameters": {
                            "type": "object",
                            "properties": {
                                "location": {"type": "string"}
                            }
                        }
                    }
                ]
            }
        }))

        # Send audio (PCM16, 24kHz, mono)
        async def send_audio(audio_bytes):
            await ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(audio_bytes).decode()
            }))

        # Receive events
        async for message in ws:
            event = json.loads(message)

            if event["type"] == "response.audio.delta":
                # Play audio chunk
                audio = base64.b64decode(event["delta"])
                play_audio(audio)

            elif event["type"] == "response.audio_transcript.done":
                print(f"Assistant said: {event['transcript']}")

            elif event["type"] == "input_audio_buffer.speech_started":
                print("User started speaking")

            elif event["type"] == "response.function_call_arguments.done":
                # Handle tool call (call_function is an app-provided dispatcher)
                name = event["name"]
                args = json.loads(event["arguments"])
                result = call_function(name, args)
                await ws.send(json.dumps({
                    "type": "conversation.item.create",
                    "item": {
                        "type": "function_call_output",
                        "call_id": event["call_id"],
                        "output": json.dumps(result)
                    }
                }))
                # Ask the model to continue now that the tool result is available
                await ws.send(json.dumps({"type": "response.create"}))
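
The session code above defines send_audio but never shows how a captured buffer is cut into messages. The following is a minimal sketch of that step, assuming 24 kHz mono PCM16 as stated in the code comment; the function name audio_append_events and the 100 ms chunk size are assumptions, and only the event shape is taken from the example above.

```python
import base64
import json
from typing import Iterator

SAMPLE_RATE_HZ = 24_000  # matches the "PCM16, 24kHz, mono" comment above
BYTES_PER_SAMPLE = 2


def audio_append_events(pcm: bytes, chunk_ms: int = 100) -> Iterator[str]:
    """Yield JSON-encoded input_audio_buffer.append events for a PCM16 buffer."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    chunk_bytes = SAMPLE_RATE_HZ * chunk_ms // 1000 * BYTES_PER_SAMPLE
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


# Usage inside the session: for event in audio_append_events(buffer): await ws.send(event)
```

Smaller chunks reduce the delay before the server sees speech, at the cost of more messages per second.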

Vapi Voice Agent

Build voice agents on the Vapi platform

When to use: phone-based agents, rapid deployment

Vapi provides hosted voice agents with webhooks

from flask import Flask, request, jsonify
import vapi

app = Flask(__name__)
client = vapi.Vapi(api_key="...")

# Create an assistant
assistant = client.assistants.create(
    name="Support Agent",
    model={
        "provider": "openai",
        "model": "gpt-4o",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful support agent..."
            }
        ]
    },
    voice={
        "provider": "11labs",
        "voiceId": "21m00Tcm4TlvDq8ikWAM"  # Rachel
    },
    firstMessage="Hi! How can I help you today?",
    transcriber={
        "provider": "deepgram",
        "model": "nova-2"
    }
)

# Webhook for conversation events
@app.route("/vapi/webhook", methods=["POST"])
def vapi_webhook():
    event = request.json

    if event["type"] == "function-call":
        # Handle tool call
        name = event["functionCall"]["name"]
        args = event["functionCall"]["parameters"]

        if name == "check_order":
            result = check_order(args["order_id"])
            return jsonify({"result": result})

    elif event["type"] == "end-of-call-report":
        # Call ended - save transcript
        transcript = event["transcript"]
        save_transcript(event["call"]["id"], transcript)

    return jsonify({"ok": True})

# Start outbound call
call = client.calls.create(
    assistant_id=assistant.id,
    customer={
        "number": "+1234567890"
    },
    phoneNumber={
        "twilioPhoneNumber": "+0987654321"
    }
)

# Or create web call
web_call = client.calls.create(
    assistant_id=assistant.id,
    type="web"
)
# Returns URL for WebRTC connection

Deepgram STT + ElevenLabs TTS

Highest-quality transcription and speech synthesis

When to use: high-quality audio, custom pipelines

import asyncio
from deepgram import DeepgramClient, LiveOptions, LiveTranscriptionEvents
from elevenlabs import ElevenLabs

# Deepgram real-time transcription
deepgram = DeepgramClient(api_key="...")

async def transcribe_stream(audio_stream):
    # Assumes deepgram-sdk v3, where the async live client is listen.asynclive
    connection = deepgram.listen.asynclive.v("1")

    # v3 passes the client as the first argument and extras as keyword arguments
    async def on_transcript(self, result, **kwargs):
        transcript = result.channel.alternatives[0].transcript
        if transcript:
            print(f"Heard: {transcript}")
            if result.is_final:
                # Process final transcript (handle_user_input is app-defined)
                await handle_user_input(transcript)

    connection.on(LiveTranscriptionEvents.Transcript, on_transcript)

    await connection.start(LiveOptions(
        model="nova-2",
        language="en",
        smart_format=True,
        interim_results=True,  # Get partial results
        utterance_end_ms="1000",
        vad_events=True,  # Voice activity detection
        encoding="linear16",
        sample_rate=16000
    ))

    # Stream audio
    async for chunk in audio_stream:
        await connection.send(chunk)

    await connection.finish()

# ElevenLabs streaming synthesis
eleven = ElevenLabs(api_key="...")

def text_to_speech_stream(text: str):
    """Stream TTS audio chunks."""
    audio_stream = eleven.text_to_speech.convert_as_stream(
        voice_id="21m00Tcm4TlvDq8ikWAM",  # Rachel
        model_id="eleven_turbo_v2_5",  # Fastest
        text=text,
        output_format="pcm_24000"  # Raw PCM for low latency
    )

    for chunk in audio_stream:
        yield chunk

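# Or with WebSocket for lowest latency
# NOTE: stream_async(), send(), and flush() are illustrative names; confirm them
# against your elevenlabs SDK version or use the provider's WebSocket API directly
async def tts_websocket(text_stream):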
    async with eleven.text_to_speech.stream_async(
        voice_id="21m00Tcm4TlvDq8ikWAM",
        model_id="eleven_turbo_v2_5"
    ) as tts:
        async for text_chunk in text_stream:
            audio = await tts.send(text_chunk)
            yield audio

        # Flush remaining audio
        final_audio = await tts.flush()
        yield final_audio

LiveKit Real-time Infrastructure

WebRTC infrastructure for voice apps

When to use: building custom real-time voice apps

from livekit import api, rtc
import asyncio

# Server-side: Create room and tokens
lk_api = api.LiveKitAPI(
    url="wss://your-livekit.livekit.cloud",
    api_key="...",
    api_secret="..."
)

async def create_room(room_name: str):
    room = await lk_api.room.create_room(
        api.CreateRoomRequest(name=room_name)
    )
    return room

def create_token(room_name: str, participant_name: str):
    token = api.AccessToken(
        api_key="...",
        api_secret="..."
    )
    token.with_identity(participant_name)
    token.with_grants(api.VideoGrants(
        room_join=True,
        room=room_name
    ))
    return token.to_jwt()

# Agent-side: Connect and process audio
async def voice_agent(room_name: str):
    room = rtc.Room()

    @room.on("track_subscribed")
    def on_track(track, publication, participant):
        if track.kind == rtc.TrackKind.KIND_AUDIO:
            # Process incoming audio
            audio_stream = rtc.AudioStream(track)
            asyncio.create_task(process_audio(audio_stream))

    token = create_token(room_name, "agent")
    await room.connect("wss://your-livekit.livekit.cloud", token)

    # Publish agent's audio
    source = rtc.AudioSource(sample_rate=24000, num_channels=1)
    track = rtc.LocalAudioTrack.create_audio_track("agent-voice", source)
    await room.local_participant.publish_track(track)

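    # Send audio from TTS (text_to_speech_stream is the ElevenLabs generator from
    # the previous pattern; it is expected to yield 24 kHz mono PCM16 chunks)
    async def speak(text: str):
        for audio_chunk in text_to_speech_stream(text):
            await source.capture_frame(rtc.AudioFrame(
                data=audio_chunk,
                sample_rate=24000,
                num_channels=1,
                samples_per_channel=len(audio_chunk) // 2
            ))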

    return room, speak

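# Process audio with STT
async def process_audio(audio_stream):
    # rtc.AudioStream yields events that wrap each audio frame
    async for event in audio_stream:
        # Send to Deepgram or other STT (transcriber is an app-provided client)
        await transcriber.send(event.frame.data.tobytes())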
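
The speak helper above derives samples_per_channel from whatever chunk size the TTS stream happens to return. One way to keep frames uniform is to re-slice the PCM into fixed-duration frames first. This is a sketch under stated assumptions: mono or interleaved PCM16 input, a hypothetical helper name split_pcm16_frames, and a 10 ms frame length chosen for illustration.

```python
def split_pcm16_frames(
    pcm: bytes,
    sample_rate_hz: int = 24_000,
    frame_ms: int = 10,
    channels: int = 1,
) -> list[bytes]:
    """Split PCM16 audio into equal frames, padding the last one with silence."""
    frame_bytes = sample_rate_hz * frame_ms // 1000 * channels * 2
    if frame_bytes <= 0:
        raise ValueError("frame size must be positive")
    frames = []
    for start in range(0, len(pcm), frame_bytes):
        frame = pcm[start:start + frame_bytes]
        if len(frame) < frame_bytes:
            # Zero bytes are digital silence in signed 16-bit PCM
            frame += b"\x00" * (frame_bytes - len(frame))
        frames.append(frame)
    return frames


frames = split_pcm16_frames(b"\x01" * 1000)
samples_per_channel = len(frames[0]) // 2  # value to pass alongside each mono frame
print(len(frames), samples_per_channel)
```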

Full Voice Agent Pipeline

A complete voice agent with all components

When to use: custom production voice agents

import asyncio
from dataclasses import dataclass
from typing import AsyncIterator

@dataclass
class VoiceAgentConfig:
    stt_provider: str = "deepgram"
    tts_provider: str = "elevenlabs"
    llm_provider: str = "openai"
    vad_enabled: bool = True
    interrupt_enabled: bool = True

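class VoiceAgent:
    def __init__(self, config: VoiceAgentConfig, stt, llm, tts, vad):
        self.config = config
        # Provider clients are injected so the methods below can use them
        self.stt = stt
        self.llm = llm
        self.tts = tts
        self.vad = vad
        self.is_speaking = False
        self.conversation_history = []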

    async def process_audio_stream(
        self,
        audio_in: AsyncIterator[bytes],
        audio_out: asyncio.Queue
    ):
        """Main audio processing loop."""

        # STT streaming
        async def transcribe():
            transcript_buffer = ""
            async for audio_chunk in audio_in:
                # Check for interruption
                if self.is_speaking and self.config.interrupt_enabled:
                    if await self.detect_speech(audio_chunk):
                        await self.stop_speaking()

                result = await self.stt.transcribe(audio_chunk)
                if result.is_final:
                    yield result.transcript

        # Process transcripts
        async for user_text in transcribe():
            if not user_text.strip():
                continue

            self.conversation_history.append({
                "role": "user",
                "content": user_text
            })

            # Generate response with streaming
            self.is_speaking = True
            async for audio_chunk in self.generate_response(user_text):
                await audio_out.put(audio_chunk)
            self.is_speaking = False

    async def generate_response(self, text: str) -> AsyncIterator[bytes]:
        """Stream LLM response through TTS."""

        # Stream LLM tokens
        llm_stream = self.llm.stream_chat(self.conversation_history)

        # Buffer for TTS (need ~50 chars for good prosody)
        text_buffer = ""
        full_response = ""

        async for token in llm_stream:
            text_buffer += token
            full_response += token

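            # Send to TTS at a sentence boundary or once the buffer is long enough.
            # Checking the buffer (not the token) also handles multi-character
            # tokens such as "done." and empty tokens
            if len(text_buffer) > 50 or text_buffer.rstrip().endswith((".", "!", "?")):
                async for audio in self.tts.synthesize_stream(text_buffer):
                    yield audio
                text_buffer = ""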

        # Flush remaining
        if text_buffer:
            async for audio in self.tts.synthesize_stream(text_buffer):
                yield audio

        self.conversation_history.append({
            "role": "assistant",
            "content": full_response
        })

    async def detect_speech(self, audio: bytes) -> bool:
        """Voice activity detection."""
        # Use WebRTC VAD or Silero VAD
        return self.vad.is_speech(audio)

    async def stop_speaking(self):
        """Handle interruption."""
        self.is_speaking = False
        # Clear audio queue
        # Stop TTS generation

# Latency optimization tips:
# 1. Use streaming everywhere (STT, LLM, TTS)
# 2. Start TTS before LLM finishes (~50 char buffer)
# 3. Use PCM audio format (no encoding overhead)
# 4. Keep WebSocket connections alive
# 5. Use regional endpoints close to users

Validation Checks

Non-Streaming TTS

Severity: HIGH

Message: Non-streaming TTS adds significant latency.

Fix: Use tts.synthesize_stream() or tts.convert_as_stream()

Hardcoded Sample Rate

Severity: MEDIUM

Message: Hardcoded sample rates can cause format mismatches.

Fix: Define sample rates as constants and document the expected format
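
One possible shape for this fix is a small immutable format description that every stage refers to. The sketch below is an assumption about how such constants could be organized; the AudioFormat class and require_same_format helper are hypothetical, while the 16000 and 24000 values mirror the Deepgram and ElevenLabs examples earlier in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioFormat:
    """Documents one PCM format in a single place."""
    sample_rate_hz: int
    channels: int = 1
    sample_width_bytes: int = 2  # 16-bit PCM

    def bytes_per_second(self) -> int:
        return self.sample_rate_hz * self.channels * self.sample_width_bytes


STT_INPUT = AudioFormat(sample_rate_hz=16_000)   # "sample_rate": 16000 in the STT example
TTS_OUTPUT = AudioFormat(sample_rate_hz=24_000)  # "pcm_24000" in the TTS example


def require_same_format(producer: AudioFormat, consumer: AudioFormat) -> None:
    """Fail loudly when two stages disagree instead of playing distorted audio."""
    if producer != consumer:
        raise ValueError(f"format mismatch: {producer} vs {consumer}; resample first")
```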

WebSocket Without Reconnection

Severity: HIGH

Message: WebSocket connections need reconnection logic.

Fix: Add a retry loop with exponential backoff
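
A retry loop with exponential backoff might look like the sketch below. It is library-agnostic: connect is any coroutine function you supply, and the names connect_with_retry and backoff_delays, the default delays, and the 10% jitter are assumptions. Which exceptions to retry on depends on the client; OSError is used as a default here, and a WebSocket library's own connection-closed exception would need to be added through retry_on.

```python
import asyncio
import random
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delays(base: float = 0.5, factor: float = 2.0,
                   cap: float = 30.0, attempts: int = 5) -> list:
    """Delays that grow geometrically and stop growing at cap seconds."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


async def connect_with_retry(
    connect: Callable[[], Awaitable[T]],
    *,
    attempts: int = 5,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Call connect until it succeeds or the given number of tries have failed."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delays = backoff_delays(attempts=attempts)
    for attempt, delay in enumerate(delays, start=1):
        try:
            return await connect()
        except retry_on as exc:
            if attempt == attempts:
                raise ConnectionError(f"gave up after {attempts} attempts") from exc
            # Up to 10% jitter so many clients do not retry in lockstep
            await sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")
```

The sleep parameter exists so the loop can be tested without waiting; production code would leave it at the default.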

Missing VAD Configuration

Severity: MEDIUM

Message: VAD needs tuning for a good user experience.

Fix: Configure threshold and silence_duration_ms
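
To make the effect of these two settings concrete, here is a deliberately simple energy-based end-of-turn detector. It is a stand-in and not how any provider's VAD works: hosted VADs typically use a model-based speech probability, so the threshold here (an RMS level between 0.0 and 1.0) is on a different scale. The class and function names are hypothetical.

```python
import array
import math
import sys


def rms_level(pcm16: bytes) -> float:
    """Root-mean-square level of little-endian PCM16 audio, scaled to 0.0-1.0."""
    samples = array.array("h")
    samples.frombytes(pcm16[: len(pcm16) - len(pcm16) % 2])
    if sys.byteorder == "big":
        samples.byteswap()
    if not samples:
        return 0.0
    mean_square = sum(s * s for s in samples) / len(samples)
    return math.sqrt(mean_square) / 32768.0


class EndOfTurnDetector:
    """Reports end of turn after enough consecutive quiet frames follow speech."""

    def __init__(self, threshold: float = 0.02,
                 silence_duration_ms: int = 500, frame_ms: int = 20):
        self.threshold = threshold
        self.frames_needed = math.ceil(silence_duration_ms / frame_ms)
        self.silent_frames = 0
        self.in_speech = False

    def push(self, frame: bytes) -> bool:
        """Feed one frame; return True when the speaker's turn has just ended."""
        if rms_level(frame) >= self.threshold:
            self.in_speech = True
            self.silent_frames = 0
            return False
        if not self.in_speech:
            return False  # silence before any speech is ignored
        self.silent_frames += 1
        if self.silent_frames >= self.frames_needed:
            self.in_speech = False
            self.silent_frames = 0
            return True
        return False
```

A lower silence_duration_ms makes the agent respond sooner but cuts off speakers who pause mid-sentence; a higher threshold ignores more background noise but can miss quiet speech.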

Blocking Audio Processing

Severity: HIGH

Message: Audio processing must be asynchronous to avoid blocking.

Fix: Use async def and await for audio operations
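
Declaring a function async does not help if its body still calls blocking code. One common approach, sketched here with the standard library only, is to move blocking work onto a worker thread with asyncio.to_thread (Python 3.9+). The blocking_resample function is a hypothetical stand-in for any synchronous audio routine.

```python
import asyncio
import time
from typing import Iterable, List


def blocking_resample(chunk: bytes) -> bytes:
    """Hypothetical blocking audio work: keeps every other 16-bit sample."""
    time.sleep(0.01)  # simulate slow, synchronous processing
    return b"".join(chunk[i:i + 2] for i in range(0, len(chunk), 4))


async def process_chunks(chunks: Iterable[bytes]) -> List[bytes]:
    """Run each blocking call in a worker thread and keep the input order."""
    tasks = [asyncio.to_thread(blocking_resample, chunk) for chunk in chunks]
    return list(await asyncio.gather(*tasks))


print(asyncio.run(process_chunks([b"\x01\x02\x03\x04", b"\x05\x06\x07\x08"])))
```

While the worker threads run, the event loop stays free to receive audio and send playback frames.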

Missing Interruption Handling

Severity: MEDIUM

Message: Voice agents must handle user interruptions.

Fix: Add barge-in detection and cancel the current response
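
Cancelling the current response is easiest when playback runs in its own task, so that detection and speaking can proceed concurrently. The sketch below assumes that arrangement; ResponsePlayer is a hypothetical class, and the sleep call stands in for real playback time.

```python
import asyncio
from typing import List, Optional


class ResponsePlayer:
    """Plays response chunks in a background task that can be cancelled."""

    def __init__(self) -> None:
        self._task: Optional[asyncio.Task] = None
        self.played: List[str] = []

    async def _speak(self, chunks: List[str]) -> None:
        for chunk in chunks:
            self.played.append(chunk)
            await asyncio.sleep(0.01)  # stand-in for the time playback takes

    def start(self, chunks: List[str]) -> None:
        """Begin speaking; must be called from inside a running event loop."""
        self._task = asyncio.create_task(self._speak(chunks))

    async def barge_in(self) -> bool:
        """Cancel the response in progress; return True if one was cancelled."""
        task = self._task
        if task is None or task.done():
            return False
        task.cancel()
        try:
            await task
        except asyncio.CancelledError:
            pass  # expected: the playback task was cancelled on purpose
        return True
```

In a full agent, the same barge_in call would also stop the upstream LLM and TTS streams so that no further audio is generated.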

Audio Queue Without Clear

Severity: LOW

Message: The audio queue must be clearable on interruption.

Fix: Add a method that clears the queue on interruption
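
For an asyncio.Queue like the audio_out queue used in the pipeline above, clearing can be done by draining it without waiting. This is a minimal sketch; the function name clear_queue is an assumption.

```python
import asyncio


def clear_queue(queue: asyncio.Queue) -> int:
    """Drop every pending item without blocking; return how many were dropped."""
    dropped = 0
    while True:
        try:
            queue.get_nowait()
        except asyncio.QueueEmpty:
            return dropped
        dropped += 1
```

Calling this when an interruption is detected prevents audio that was already synthesized from playing after the user has started speaking.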

WebSocket Without Error Handling

Severity: HIGH

Message: WebSocket operations need error handling.

Fix: Wrap them in try/except for ConnectionClosed

Collaboration

Delegation Triggers

agent graph|workflow|state -> langgraph (complex agent logic is needed behind the voice layer)
extract|structured|json -> structured-output (structured data must be extracted from speech)
observability|tracing|monitoring -> langfuse (voice agent quality must be monitored)
frontend|web|react -> nextjs-app-router (a web interface is needed for the voice agent)

Intelligent Voice Agent

Skills: voice-ai-development, langgraph, structured-output

Workflow:

1. Design an agent graph with tools
2. Add a voice interface layer
3. Use structured output for tool responses
4. Optimize for voice latency

Monitored Voice Agent

Skills: voice-ai-development, langfuse

Workflow:

1. Build a voice agent with the chosen providers
2. Add Langfuse callbacks
3. Track latency, quality, and conversation flow
4. Iterate based on metrics

Phone-based Agent

Skills: voice-ai-development, twilio

Workflow:

1. Set up Vapi or a custom agent
2. Connect to Twilio for PSTN
3. Handle inbound and outbound calls
4. Implement call routing logic

Related Skills

Works well with: langgraph, structured-output, langfuse

When to Use

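When the user mentions or implies: voice ai
When the user mentions or implies: voice agent
When the user mentions or implies: speech to text
When the user mentions or implies: text to speech
When the user mentions or implies: realtime voice
When the user mentions or implies: vapi
When the user mentions or implies: deepgram
When the user mentions or implies: elevenlabs
When the user mentions or implies: livekit
When the user mentions or implies: openai realtime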

Limitations

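Use this skill only for tasks that clearly fall within the scope described above.
Do not treat its output as a substitute for environment-specific validation, testing, or expert review.
If required inputs, permissions, safety boundaries, or success criteria are missing, stop and ask for clarification.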

License: MIT

Details

Author: sickn33
Repository: sickn33/antigravity-awesome-skills
Last updated: unknown

Source: https://github.com/sickn33/antigravity-awesome-skills / License: MIT
