Routing note: if the user's intent is ambiguous, use the shared clarification template in references/intent-clarification.md.

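AI Gateway

TrueFoundry's AI Gateway gives access to more than 1,000 LLMs through a unified, OpenAI-compatible API, with rate limiting, budget controls, load balancing, routing, and observability.

When to Use

The user wants to access LLMs through TrueFoundry's unified OpenAI-compatible gateway, set up authentication tokens (PAT/VAT), or configure rate limiting, budget controls, or load balancing across providers.

When NOT to Use

The user wants to deploy a self-hosted model → deploying self-hosted models requires a TrueFoundry Enterprise account with a connected cluster. See https://truefoundry.com
The user wants to deploy a tool server → deploying workloads requires a TrueFoundry Enterprise account with a connected cluster. See https://truefoundry.com
The user wants to manage TrueFoundry platform credentials → prefer the status skill, and ask whether the user wants another valid path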

</objective> <context>

Overview

The AI Gateway sits between your application and the LLM providers:

Your App → AI Gateway → OpenAI / Anthropic / Azure / Self-hosted vLLM / etc.
                ↑
         Unified API + Auth + Rate Limiting + Routing + Logging

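Key benefits:

Single endpoint: one URL for all models (cloud and self-hosted)
One API key: use a PAT or VAT instead of managing a key per provider
OpenAI-compatible: works with OpenAI SDK clients
Rate limiting: per user, team, or application
Budget controls: enforce cost limits
Load balancing: balancing and fallback across model instances
Observability: request logs, cost tracking, analytics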

Gateway Endpoint

The gateway base URL is your TrueFoundry platform URL followed by /api/llm:

{TFY_BASE_URL}/api/llm

Example: https://your-org.truefoundry.cloud/api/llm
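As a small illustration of the rule above, the helper below derives the gateway base URL from a platform URL. The function name `gateway_base_url` is hypothetical; only the `/api/llm` suffix comes from this document.

```python
def gateway_base_url(platform_url: str) -> str:
    """Append the gateway path to a TrueFoundry platform URL."""
    # Strip trailing slashes so the result never contains "//api/llm".
    return platform_url.rstrip("/") + "/api/llm"


print(gateway_base_url("https://your-org.truefoundry.cloud/"))
```

Normalising the trailing slash avoids a common copy-paste mistake when the platform URL is taken from a browser address bar.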

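Authentication

Personal Access Tokens (PAT)

For development and personal use:

Go to TrueFoundry dashboard → Access → Personal Access Tokens
Click New Personal Access Token
Copy the token

Virtual Account Tokens (VAT)

For production applications (recommended):

Go to TrueFoundry dashboard → Access → Virtual Account Tokens
Click New Virtual Account (requires admin permissions)
Give it a name and select the models it may access
Copy the token

Why VATs are recommended for production:

They are not tied to a specific user, so they keep working when team membership changes
They support fine-grained model access control
They make per-application usage tracking possible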

</context> <instructions>

Calling Models

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    api_key="<your-PAT-or-VAT>",
    base_url="https://<your-truefoundry-url>/api/llm",
)

# Chat completion
response = client.chat.completions.create(
    model="openai/gpt-4o",  # or any configured model name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    max_tokens=200,
)
print(response.choices[0].message.content)

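Python (Streaming)

# Reuses the `client` created in the previous example.
stream = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about AI"}],
    stream=True,
)

for chunk in stream:
    # Some chunks may arrive without choices or with an empty delta; skip those.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)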
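When the full text is needed rather than incremental printing, the deltas can be accumulated. The sketch below is a minimal version of that idea; `collect_stream` and `fake_chunk` are hypothetical names, and the fake chunks only imitate the attribute shape used in the streaming loop so the example runs offline.

```python
from types import SimpleNamespace


def collect_stream(stream) -> str:
    """Join the text deltas of a chat-completion stream into one string."""
    parts = []
    for chunk in stream:
        # Skip chunks that carry no choices or an empty delta.
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object shaped like a streamed chunk (offline testing only)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [fake_chunk("Hello"), SimpleNamespace(choices=[]), fake_chunk(None), fake_chunk(", world")]
print(collect_stream(demo))
```

With a real client, pass the object returned by `client.chat.completions.create(..., stream=True)` in place of `demo`.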

cURL

curl "${TFY_BASE_URL}/api/llm/chat/completions" \
  -H "Authorization: Bearer ${TFY_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 200
  }'
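For environments where neither cURL nor the OpenAI SDK is available, the same request can be assembled with the Python standard library. This is a sketch under the assumption that the endpoint and headers are exactly those of the cURL example; `build_chat_request` is a hypothetical helper, and it only builds the request without sending it.

```python
import json
import urllib.request


def build_chat_request(base_url: str, token: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST request to the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 200,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/llm/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request("https://your-org.truefoundry.cloud", "<your-PAT-or-VAT>", "openai/gpt-4o", "Hello!")
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req), then JSON-decode the response body.
```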

JavaScript / Node.js

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "<your-PAT-or-VAT>",
  baseURL: "https://<your-truefoundry-url>/api/llm",
});

const response = await client.chat.completions.create({
  model: "openai/gpt-4o",
  messages: [{ role: "user", content: "Hello!" }],
});

Environment Variables

Set these for use with any OpenAI-compatible library:

export OPENAI_BASE_URL="${TFY_BASE_URL}/api/llm"
export OPENAI_API_KEY="<your-PAT-or-VAT>"

After that, any code that calls openai.OpenAI() without explicit arguments uses the gateway automatically.
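A missing variable usually surfaces later as a confusing authentication or connection error, so it can help to check both values up front. The following is a minimal sketch; `load_gateway_config` is a hypothetical helper, not part of any SDK.

```python
import os


def load_gateway_config(env=None) -> dict:
    """Read OPENAI_BASE_URL and OPENAI_API_KEY, failing loudly if either is missing."""
    env = os.environ if env is None else env
    missing = [name for name in ("OPENAI_BASE_URL", "OPENAI_API_KEY") if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {"base_url": env["OPENAI_BASE_URL"], "api_key": env["OPENAI_API_KEY"]}


# A plain dict stands in for os.environ so the example runs anywhere.
example_env = {
    "OPENAI_BASE_URL": "https://your-org.truefoundry.cloud/api/llm",
    "OPENAI_API_KEY": "<your-PAT-or-VAT>",
}
print(load_gateway_config(example_env)["base_url"])
```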

Supported APIs

API	Endpoint	Description
Chat Completions	`/chat/completions`	Chat with any model (streaming and non-streaming)
Completions	`/completions`	Legacy text completions
Embeddings	`/embeddings`	Text embeddings (string and list inputs)
Image Generation	`/images/generations`	Image generation
Image Editing	`/images/edits`	Image editing
Audio Transcription	`/audio/transcriptions`	Speech to text
Audio Translation	`/audio/translations`	Audio translation
Text-to-Speech	`/audio/speech`	Speech generation
Reranking	`/rerank`	Document reranking
Batch Processing	`/batches`	Batch predictions
Moderations	`/moderations`	Content safety

Supported Providers

The gateway supports more than 25 providers:

Provider	Example Model Names
OpenAI	`openai/gpt-4o`, `openai/gpt-4o-mini`
Anthropic	`anthropic/claude-sonnet-4-5-20250929`
Google Vertex	`google/gemini-2.0-flash`
AWS Bedrock	`bedrock/anthropic.claude-3-5-sonnet`
Azure OpenAI	`azure/gpt-4o`
Mistral	`mistral/mistral-large-latest`
Groq	`groq/llama-3.1-70b-versatile`
Cohere	`cohere/command-r-plus`
Together AI	`together/meta-llama/Meta-Llama-3.1-70B`
Self-hosted (vLLM/TGI)	`my-custom-model-name`

Model names depend on how they are configured in the gateway. For the exact names, check TrueFoundry dashboard → AI Gateway → Models.

Adding Models and Providers

This is currently done through the TrueFoundry dashboard UI:

Go to AI Gateway → Models
Click Add Provider Account
Select the provider (OpenAI, Anthropic, and so on)
Enter the API credentials
Select the models to enable

Adding a Self-Hosted Model (In-Cluster)

After deploying a self-hosted model:

Go to AI Gateway → Models → Add Provider Account
Select "Self Hosted" as the provider type
Enter the internal endpoint: http://{model-name}.{namespace}.svc.cluster.local:8000
The model becomes accessible through the gateway alongside the cloud models

Security: register only model endpoints that you control. External or untrusted model endpoints can return manipulated responses. Use internal cluster DNS (svc.cluster.local) for self-hosted models. Make sure provider API credentials are stored in TrueFoundry secrets rather than hardcoded.

Adding an External OpenAI-Compatible API (NVIDIA, Custom Providers)

For externally hosted, OpenAI-compatible APIs (NVIDIA Cloud API, custom inference endpoints, and so on), use type: provider-account/self-hosted-model together with auth_data:

# gateway.yaml: externally hosted API (e.g. NVIDIA Cloud)
- name: nvidia-external
  type: provider-account/self-hosted-model
  integrations:
    - name: nemotron-nano
      type: integration/model/self-hosted-model
      hosted_model_name: nvidia/nemotron-3-nano-30b-a3b
      url: "https://integrate.api.nvidia.com/v1"
      model_server: "openai-compatible"
      model_types: ["chat"]
      auth_data:
        type: bearer-auth
        bearer_token: "tfy-secret://<tenant>:<group>:<key>"

Then reference it in virtual model routing targets as "<provider-account-name>/<integration-name>":

targets:
  - model: "nvidia-external/nemotron-nano"  # "<provider-account-name>/<integration-name>"
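The naming rule can be checked mechanically before applying a manifest. The sketch below derives every valid target reference from a provider account entry; `target_refs` is a hypothetical helper, and the dictionary mirrors only the fields of the gateway.yaml entry that matter for naming.

```python
def target_refs(provider_account: dict) -> list:
    """Return "<provider-account-name>/<integration-name>" for each integration."""
    account = provider_account["name"]
    return [f"{account}/{integration['name']}" for integration in provider_account.get("integrations", [])]


# Mirrors the relevant fields of the nvidia-external entry.
nvidia_external = {
    "name": "nvidia-external",
    "type": "provider-account/self-hosted-model",
    "integrations": [{"name": "nemotron-nano", "type": "integration/model/self-hosted-model"}],
}
print(target_refs(nvidia_external))
```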

Apply with:

tfy apply -f gateway.yaml

Warning: provider-account/nvidia-nim does not exist in the schema, so do not use it. For all external OpenAI-compatible APIs, use provider-account/self-hosted-model with auth_data as shown above.

Schema source of truth: for canonical field names and types, check servicefoundry-server/src/autogen/models.ts in the platform repository. Do not guess field names from documentation alone.

Applying Gateway Configuration

Gateway YAML is applied directly with tfy apply; no service build or Docker image is needed:

# Preview changes
tfy apply -f gateway.yaml --dry-run --show-diff

# Apply
tfy apply -f gateway.yaml

Do not delegate gateway applies to the deployment skill. Gateway configs (type: gateway-*, type: provider-account/*) are applied inline with tfy apply.

Test after applying:

# Quick smoke test with cURL
curl "${TFY_BASE_URL}/api/llm/chat/completions" \
  -H "Authorization: Bearer ${TFY_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia-external/nemotron-nano",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'

Or in Python:

import os

from openai import OpenAI

# TFY_BASE_URL and TFY_API_KEY are the environment variables used by the cURL test.
client = OpenAI(
    api_key=os.environ["TFY_API_KEY"],
    base_url=f"{os.environ['TFY_BASE_URL']}/api/llm",
)
resp = client.chat.completions.create(
    model="nvidia-external/nemotron-nano",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)

Note: for one-off gateway config applies, use tfy apply directly. In CI/CD pipelines, integrate tfy apply into your existing automation.

Virtual Models and Load Balancing

Virtual models route requests across multiple model instances using a gateway-load-balancing-config manifest. Targets reference real catalog models as "<provider-account-name>/<integration-name>".

Weight-Based Routing

name: chat-routing
type: gateway-load-balancing-config
rules:
  - id: weighted-chat
    type: weight-based-routing
    when:
      subjects: ["*"]
      models: ["openai/gpt-4o"]
    load_balance_targets:
      - target: "openai-main/gpt-4o"
        weight: 70
        fallback_candidate: true
        retry_config:
          delay: 100
          attempts: 1
          on_status_codes: ["429", "500", "502", "503"]
      - target: "azure-backup/gpt-4o"
        weight: 30
        fallback_candidate: true
        retry_config:
          delay: 100
          attempts: 1
          on_status_codes: ["429", "500", "502", "503"]
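To build intuition for what the weights mean, the simulation below draws targets with probability proportional to their weight. It is an illustration of weighted selection in general, not the gateway's actual implementation; the seed and the number of draws are arbitrary.

```python
import random
from collections import Counter

# Targets and weights copied from the weighted-chat rule.
TARGETS = {"openai-main/gpt-4o": 70, "azure-backup/gpt-4o": 30}


def pick_target(rng: random.Random) -> str:
    """Pick one target with probability proportional to its weight."""
    names = list(TARGETS)
    weights = [TARGETS[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
counts = Counter(pick_target(rng) for _ in range(10_000))
share = counts["openai-main/gpt-4o"] / 10_000
print(f"openai-main share: {share:.2%}")
```

Over many requests the observed share approaches 70%, but any short window can deviate noticeably, which is worth remembering when reading per-minute dashboards.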

Latency-Based Routing

Automatically routes to the lowest-latency model, measured as time per output token over the last 20 minutes:

rules:
  - id: latency-chat
    type: latency-based-routing
    when:
      subjects: ["*"]
      models: ["openai/gpt-4o"]
    load_balance_targets:
      - target: "openai-main/gpt-4o"
        fallback_candidate: true
      - target: "azure-backup/gpt-4o"
        fallback_candidate: true
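The selection idea can be sketched as "lowest average time per output token wins". The code below assumes a simple arithmetic mean over the samples in the window; the gateway's real aggregation is not specified here, and the sample values are invented.

```python
from statistics import mean


def lowest_latency_target(samples: dict) -> str:
    """Return the target with the lowest mean time per output token (ms)."""
    averages = {target: mean(values) for target, values in samples.items() if values}
    if not averages:
        raise ValueError("no latency samples available")
    return min(averages, key=averages.get)


# Hypothetical measurements (ms per output token) from the recent window.
recent = {
    "openai-main/gpt-4o": [42.0, 38.5, 40.1],
    "azure-backup/gpt-4o": [35.2, 36.8, 34.9],
}
print(lowest_latency_target(recent))
```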

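Priority-Based Routing

Routes to the highest-priority healthy model, with an SLA cutoff: a target is automatically marked unhealthy when its time per output token (TPOT) exceeds the threshold: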

rules:
  - id: priority-chat
    type: priority-based-routing
    when:
      subjects: ["team:premium"]
      models: ["*"]
    load_balance_targets:
      - target: "openai-main/gpt-4o"
        priority: 0
        sla_cutoff:
          time_per_output_token_ms: 50
        fallback_candidate: true
      - target: "azure-backup/gpt-4o"
        priority: 1
        fallback_candidate: true
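The following sketch models the rule above under two assumptions that are not stated in this document: a lower priority number means higher priority (as the 0/1 values in the example suggest), and a target with no cutoff or no measurements counts as healthy. `pick_by_priority` is a hypothetical name.

```python
def pick_by_priority(targets: list, observed_tpot_ms: dict):
    """Return the healthy target with the lowest priority number, or None."""

    def healthy(target: dict) -> bool:
        cutoff = target.get("sla_cutoff", {}).get("time_per_output_token_ms")
        observed = observed_tpot_ms.get(target["target"])
        # No cutoff or no measurement: treat the target as healthy (assumption).
        return cutoff is None or observed is None or observed <= cutoff

    candidates = [t for t in targets if healthy(t)]
    return min(candidates, key=lambda t: t["priority"])["target"] if candidates else None


targets = [
    {"target": "openai-main/gpt-4o", "priority": 0, "sla_cutoff": {"time_per_output_token_ms": 50}},
    {"target": "azure-backup/gpt-4o", "priority": 1},
]
print(pick_by_priority(targets, {"openai-main/gpt-4o": 35}))  # within the 50 ms cutoff
print(pick_by_priority(targets, {"openai-main/gpt-4o": 80}))  # cutoff exceeded
```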

Sticky Sessions

Pins a user to the same target for a fixed period:

rules:
  - id: sticky-chat
    type: weight-based-routing
    sticky_routing:
      ttl_seconds: 3600
      session_identifiers:
        - key: x-user-id
          source: headers
    load_balance_targets:
      - target: "openai-main/gpt-4o"
        weight: 50
      - target: "azure-backup/gpt-4o"
        weight: 50
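Conceptually, sticky routing is a time-limited map from a session identifier to a target. The sketch below shows that idea with an in-memory dictionary and an injected clock; `StickyRouter` is a hypothetical class, the uniform choice stands in for the 50/50 weights, and none of this reflects how the gateway stores sessions internally.

```python
import random


class StickyRouter:
    """Pin each session identifier to one target until its TTL expires."""

    def __init__(self, targets, ttl_seconds, rng=None):
        self.targets = list(targets)
        self.ttl_seconds = ttl_seconds
        self.rng = rng or random.Random()
        self._pins = {}  # session id -> (target, expiry timestamp)

    def route(self, headers: dict, now: float) -> str:
        session_id = headers.get("x-user-id")
        pinned = self._pins.get(session_id)
        if session_id is not None and pinned and now < pinned[1]:
            return pinned[0]  # still inside the TTL window
        target = self.rng.choice(self.targets)
        if session_id is not None:
            self._pins[session_id] = (target, now + self.ttl_seconds)
        return target


router = StickyRouter(["openai-main/gpt-4o", "azure-backup/gpt-4o"], ttl_seconds=3600, rng=random.Random(1))
first = router.route({"x-user-id": "alice"}, now=0.0)
repeats = {router.route({"x-user-id": "alice"}, now=float(t)) for t in range(1, 3600, 300)}
print(first, repeats == {first})
```

Requests without the `x-user-id` header are never pinned in this sketch, so they are free to land on any target.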

Per-Target Header Overrides

load_balance_targets:
  - target: "openai-main/gpt-4o"
    weight: 80
    headers_override:
      set:
        x-region: us-east-1
      remove:
        - x-internal-debug

Fallback Behavior

Fallback is configured per target, inside load_balance_targets:

fallback_status_codes: defaults to ["401", "403", "404", "429", "500", "502", "503"]
fallback_candidate: true marks a target as eligible for failover
retry_config.on_status_codes controls which errors trigger a retry
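How retries and fallback interact can be sketched as two nested loops: retry the same target while the status is retryable, then move to the next fallback candidate if the final status is a fallback status. The code below is one plausible reading, not the gateway's documented algorithm. It assumes that `attempts` counts retries after the first call and that the first target in the list is the primary; `call_with_fallback` and `fake_send` are hypothetical.

```python
DEFAULT_FALLBACK_CODES = frozenset({"401", "403", "404", "429", "500", "502", "503"})


def call_with_fallback(targets, send, fallback_codes=DEFAULT_FALLBACK_CODES):
    """Try targets in order; return (target, status) of the last attempt made."""
    # Primary target first, then only targets marked as fallback candidates.
    ordered = targets[:1] + [t for t in targets[1:] if t.get("fallback_candidate")]
    result = None
    for target in ordered:
        retry = target.get("retry_config", {})
        retries_left = retry.get("attempts", 0)
        status = send(target["target"])
        while status in retry.get("on_status_codes", []) and retries_left > 0:
            retries_left -= 1
            status = send(target["target"])  # a real client would wait retry["delay"] first
        result = (target["target"], status)
        if status not in fallback_codes:
            break  # success, or an error that should not trigger failover
    return result


targets = [
    {"target": "openai-main/gpt-4o", "fallback_candidate": True,
     "retry_config": {"delay": 100, "attempts": 1, "on_status_codes": ["429", "500", "502", "503"]}},
    {"target": "azure-backup/gpt-4o", "fallback_candidate": True},
]
calls = []


def fake_send(name):
    """Stand-in transport: the primary always returns 503, the backup returns 200."""
    calls.append(name)
    return "503" if name.startswith("openai-main") else "200"


outcome = call_with_fallback(targets, fake_send)
print(outcome, calls)
```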

Apply

tfy apply -f gateway-load-balancing-config.yaml --dry-run --show-diff
tfy apply -f gateway-load-balancing-config.yaml

Note: targets must be real catalog models, not nested virtual models.

Rate Limiting

Use a gateway-rate-limiting-config manifest to set rate limits per user, team, model, or custom metadata. Only the first matching rule is applied, so place specific rules before generic ones.

name: rate-limits
type: gateway-rate-limiting-config
rules:
  - id: "team-rpm-limit"
    when:
      subjects: ["team:backend"]
      models: ["openai-main/gpt-4o"]
    limit_to: 20000
    unit: tokens_per_minute

  - id: "user-daily-limit"
    when:
      subjects: ["user:bob@example.com"]
      models: ["openai-main/gpt-4o"]
    limit_to: 1000
    unit: requests_per_day

  - id: "per-project-hourly"
    when: {}
    limit_to: 50000
    unit: tokens_per_hour
    rate_limit_applies_per: ["metadata.project_id"]

  - id: "global-fallback"
    when: {}
    limit_to: 500
    unit: requests_per_minute
    rate_limit_applies_per: ["user"]

Units: requests_per_minute, requests_per_hour, requests_per_day, tokens_per_minute, tokens_per_hour, tokens_per_day

rate_limit_applies_per: creates a separate limit for each entity (at most 2 values). Options: user, model, virtualaccount, metadata.<key>.

tfy apply -f gateway-rate-limiting-config.yaml
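Because only the first matching rule applies, rule order matters. The sketch below models first-match evaluation under simplifying assumptions that are not taken from the schema: each request has a single subject string, a missing `when` field matches everything, and "*" acts as a wildcard. `matches` and `first_matching_rule` are hypothetical helpers.

```python
def matches(rule: dict, subject: str, model: str) -> bool:
    """True if the rule's `when` clause accepts this subject and model."""
    when = rule.get("when") or {}
    for field, value in (("subjects", subject), ("models", model)):
        allowed = when.get(field)
        # A missing field matches everything; "*" is treated as a wildcard (assumption).
        if allowed is not None and "*" not in allowed and value not in allowed:
            return False
    return True


def first_matching_rule(rules: list, subject: str, model: str):
    """Return the id of the first rule that matches, or None."""
    return next((rule["id"] for rule in rules if matches(rule, subject, model)), None)


# The `when` clauses of the four rules in the manifest, in the same order.
rules = [
    {"id": "team-rpm-limit", "when": {"subjects": ["team:backend"], "models": ["openai-main/gpt-4o"]}},
    {"id": "user-daily-limit", "when": {"subjects": ["user:bob@example.com"], "models": ["openai-main/gpt-4o"]}},
    {"id": "per-project-hourly", "when": {}},
    {"id": "global-fallback", "when": {}},
]
print(first_matching_rule(rules, "user:bob@example.com", "openai-main/gpt-4o"))
print(first_matching_rule(rules, "user:carol@example.com", "openai-main/gpt-4o"))
```

Under this simplified model, a catch-all rule (when: {}) shadows every rule listed after it, so it is worth checking that a later catch-all is actually reachable.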

Budget Controls

Use a gateway-budget-config manifest to enforce cost limits per user, team, or metadata. Costs are tracked automatically based on model pricing.

name: budget-controls
type: gateway-budget-config
rules:
  - id: "team-monthly-budget"
    when:
      subjects: ["team:engineering"]
    limit_to: 5000
    unit: cost_per_month
    budget_applies_per: ["team"]
    alerts:
      thresholds: [75, 90, 100]
      notification_target:
        - type: email
          notification_channel: "budget-alerts"
          to_emails: ["lead@example.com"]

  - id: "user-daily-budget"
    when: {}
    limit_to: 100
    unit: cost_per_day
    budget_applies_per: ["user"]

  - id: "project-daily-budget"
    when:
      metadata:
        environment: "production"
    limit_to: 200
    unit: cost_per_day
    budget_applies_per: ["metadata.project_id"]

Units: cost_per_day (resets at UTC midnight), cost_per_week (resets on Monday), cost_per_month (resets on the 1st)
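The reset boundaries can be made concrete with a small date calculation. The sketch below assumes that weekly and monthly budgets also reset at UTC midnight, which this document states only for the daily unit; `next_reset` is a hypothetical helper and expects a timezone-aware datetime.

```python
from datetime import datetime, timedelta, timezone


def next_reset(unit: str, now: datetime) -> datetime:
    """Return the next UTC reset instant for a budget unit."""
    midnight = now.astimezone(timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)
    if unit == "cost_per_day":
        return midnight + timedelta(days=1)
    if unit == "cost_per_week":
        # weekday(): Monday == 0, so this always lands on the following Monday.
        return midnight + timedelta(days=7 - midnight.weekday())
    if unit == "cost_per_month":
        first = midnight.replace(day=1)
        if first.month == 12:
            return first.replace(year=first.year + 1, month=1)
        return first.replace(month=first.month + 1)
    raise ValueError(f"unknown unit: {unit}")


now = datetime(2025, 1, 15, 13, 30, tzinfo=timezone.utc)  # a Wednesday
for unit in ("cost_per_day", "cost_per_week", "cost_per_month"):
    print(unit, next_reset(unit, now).isoformat())
```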

budget_applies_per: creates a separate budget for each entity. Options: user, model, team, virtualaccount, metadata.<key>.

Alerts: set threshold percentages, with notifications delivered by email, Slack webhook, or Slack bot.
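Thresholds are percentages of limit_to, so working out which alerts have fired is simple arithmetic. In the sketch below the limit and thresholds come from the team-monthly-budget rule, the spend figure is invented, and `crossed_thresholds` is a hypothetical helper.

```python
def crossed_thresholds(spend: float, limit: float, thresholds: list) -> list:
    """Return the threshold percentages that the current spend has reached."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    used_percent = spend / limit * 100
    return [t for t in sorted(thresholds) if used_percent >= t]


# 4600 of 5000 is 92% of the budget, so the 75 and 90 thresholds are reached.
print(crossed_thresholds(spend=4600, limit=5000, thresholds=[75, 90, 100]))
```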

tfy apply -f gateway-budget-config.yaml

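Observability

Request Logging

Every gateway request is logged with:

Input and output tokens
Latency (time to first token and total)
Cost
Model and provider
User identity
Custom metadata

Custom Metadata

Tag requests with custom metadata for tracking: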

response = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
    extra_headers={
        "X-TFY-LOGGING-CONFIG": '{"project": "my-app", "environment": "production"}'
    },
)
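Hand-written JSON inside a Python string is easy to break with a stray quote, so the header value can be generated with json.dumps instead. This sketch keeps the header name exactly as used in the example above; `logging_config_header` is a hypothetical helper.

```python
import json


def logging_config_header(metadata: dict) -> dict:
    """Return extra headers carrying the metadata as a JSON string."""
    return {"X-TFY-LOGGING-CONFIG": json.dumps(metadata)}


headers = logging_config_header({"project": "my-app", "environment": "production"})
print(headers["X-TFY-LOGGING-CONFIG"])
# Pass it along as: client.chat.completions.create(..., extra_headers=headers)
```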

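Analytics

View usage analytics in the TrueFoundry dashboard:

Requests per minute by model
Tokens per minute by model
Failures per minute by model
Cost breakdown by model, user, and team

OpenTelemetry Integration

Export traces to your observability stack:

Prometheus + Grafana
Datadog
Custom OTEL collector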

Guardrails

For content filtering, PII detection, prompt injection prevention, and custom safety rules, use the guardrails skill. It configures the guardrail providers and rules that apply to this gateway's traffic.

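MCP Gateway Attachment Flow

If the user has already deployed a tool server and wants to attach it to the MCP gateway:

Verify the deployment status and endpoint URL in the TrueFoundry dashboard
Register the endpoint as an MCP server (mcp-servers skill)
Confirm the registration ID/name and share how to reference it in policies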

Framework Integration

The gateway works with popular AI frameworks:

LangChain

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="openai/gpt-4o",
    api_key="<your-PAT-or-VAT>",
    base_url="https://<your-truefoundry-url>/api/llm",
)

LlamaIndex

from llama_index.llms.openai import OpenAI

llm = OpenAI(
    model="openai/gpt-4o",
    api_key="<your-PAT-or-VAT>",
    api_base="https://<your-truefoundry-url>/api/llm",
)

Cursor / Claude Code / Cline

Configure the gateway as a custom API endpoint in the coding assistant's settings:

Base URL: {TFY_BASE_URL}/api/llm
API key: your PAT or VAT

Presenting Gateway Info

When the user asks about gateway setup, present:

AI Gateway:
  Endpoint: https://your-org.truefoundry.cloud/api/llm
  Auth:     Personal Access Token (PAT) or Virtual Account Token (VAT)

Available models (check the dashboard for the current list):
| Model Name       | Provider    | Type      |
|------------------|-------------|-----------|
| openai/gpt-4o    | OpenAI      | Cloud     |
| my-gemma-2b      | Self-hosted | vLLM (T4) |
| anthropic/claude | Anthropic   | Cloud     |

Usage:
  export OPENAI_BASE_URL="https://your-org.truefoundry.cloud/api/llm"
  export OPENAI_API_KEY="your-token"
  # Then use any OpenAI-compatible SDK

</instructions>

<success_criteria>

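Success Criteria

The user can call an LLM through the gateway endpoint using an OpenAI-compatible SDK or cURL
The user has a valid authentication token (PAT or VAT) configured for gateway access
The agent has confirmed that the target model name is available in the user's gateway configuration
The user can verify a successful response from the gateway with the correct model output
The agent has provided a working code snippet suited to the user's language and framework
Rate limiting, budget controls, or routing are configured if the user requested them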

</success_criteria>

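Composability

Deploy a model first: deploy the self-hosted model (requires TrueFoundry Enterprise), then add it to the gateway
Need API keys: create a PAT/VAT in TrueFoundry dashboard → Access
Rate limiting: configure in dashboard → AI Gateway → Rate Limiting
Routing config: apply routing YAML directly with tfy apply; in CI/CD pipelines, integrate tfy apply into your existing automation
Tool servers: deploy the tool server on your infrastructure, then register it with the gateway
Check deployed models: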

License: MIT
