
arize-prompt-optimization

Name: arize-prompt-optimization
Author: github

Optimizes, improves, and debugs LLM prompts using production trace data, evaluations, and annotations. Extracts prompts from spans, gathers performance signal, and runs a data-driven optimization loop using the ax CLI. Use when the user mentions optimize prompt, improve prompt, make AI respond better, improve output quality, prompt engineering, prompt tuning, or system prompt improvement.

SKILL.md content

Arize Prompt Optimization Skill

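SPACE: Every --space flag and the ARIZE_SPACE environment variable accepts either a space name (e.g., my-workspace) or a Base64-encoded space ID (e.g., U3BhY2U6...). Run ax spaces list to look yours up.

Concepts

Where Prompts Live in Trace Data

LLM applications emit spans that follow the OpenInference semantic conventions. Prompts are stored in different span attributes depending on the span kind and the instrumentation: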

Column	Contents	When to use
`attributes.llm.input_messages`	Structured chat messages (system, user, assistant, tool) in role-based format	Primary source for chat-based LLM prompts
`attributes.llm.input_messages.roles`	Array of roles: `system`, `user`, `assistant`, `tool`	Extract individual message roles
`attributes.llm.input_messages.contents`	Array of message content strings	Extract message text
`attributes.input.value`	Serialized prompt or user question (generic, all span kinds)	Fallback when structured messages are not available
`attributes.llm.prompt_template.template`	Template with `{variable}` placeholders (e.g., `"Answer {question} using {context}"`)	When the application uses prompt templates
`attributes.llm.prompt_template.variables`	Template variable values (JSON object)	See which values were substituted into the template
`attributes.output.value`	Model response text	See what the LLM produced
`attributes.llm.output_messages`	Structured model output (including tool calls)	Inspect tool-calling responses

Finding Prompts by Span Kind

LLM spans (attributes.openinference.span.kind = 'LLM'): Check attributes.llm.input_messages for structured chat messages, or attributes.input.value for a serialized prompt. Check attributes.llm.prompt_template.template for the template.
Chain/Agent spans: attributes.input.value holds the user's question. The actual LLM prompt lives on the child LLM spans, so walk down the trace tree.
Tool spans: attributes.input.value holds the tool input and attributes.output.value holds the tool result. These spans are usually not where prompts live.
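The lookup order described above can be sketched in Python. This is a minimal illustration, not part of the ax CLI: the helper names `dig` and `find_prompt` are hypothetical, the span is assumed to be a nested dict shaped like the jq paths used elsewhere in this skill, and trying the template before `input.value` is a choice made for this sketch.

```python
from typing import Any, Optional


def dig(data: Any, path: str) -> Any:
    """Follow a dotted path through nested dicts; return None if a key is missing."""
    for key in path.split("."):
        if not isinstance(data, dict):
            return None
        data = data.get(key)
    return data


def find_prompt(span: dict) -> Optional[tuple[str, Any]]:
    """Return (attribute_path, value) for the first prompt source found on an LLM span."""
    if dig(span, "attributes.openinference.span.kind") != "LLM":
        return None  # Chain/Agent/Tool spans: look at the child LLM spans instead
    # Assumed priority for this sketch: structured messages, then template, then raw input.
    for path in (
        "attributes.llm.input_messages",
        "attributes.llm.prompt_template.template",
        "attributes.input.value",
    ):
        value = dig(span, path)
        if value:
            return path, value
    return None
```

Returning the attribute path along with the value makes it easy to see which source a prompt came from when the spans in a project are instrumented differently.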

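Performance Signal Columns

These columns carry the feedback data used for optimization:

Column pattern	Source	Contents
`annotation.<name>.label`	Human reviewers	Categorical grade (e.g., `correct`, `incorrect`, `partial`)
`annotation.<name>.score`	Human reviewers	Numeric quality score (e.g., 0.0 to 1.0)
`annotation.<name>.text`	Human reviewers	Free-form explanation of the grade
`eval.<name>.label`	LLM-as-judge evals	Automated categorical assessment
`eval.<name>.score`	LLM-as-judge evals	Automated numeric score
`eval.<name>.explanation`	LLM-as-judge evals	Why the eval gave that score; the most valuable signal for optimization
`attributes.input.value`	Trace data	What went into the LLM
`attributes.output.value`	Trace data	What the LLM produced
`{experiment_name}.output`	Experiment runs	Output from a specific experiment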
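As a rough illustration of how these column patterns can be picked out of a record, the sketch below groups feedback columns by source and evaluator name. It assumes a flat row whose keys are the dotted column names from the table; `collect_feedback` is a hypothetical helper, and real exports may nest these fields instead.

```python
import re
from collections import defaultdict

# Matches annotation.<name>.<field> and eval.<name>.<field> column names.
FEEDBACK_COLUMN = re.compile(r"^(annotation|eval)\.(.+)\.(label|score|text|explanation)$")


def collect_feedback(row: dict) -> dict:
    """Group a flat row's feedback columns as {source: {name: {field: value}}}."""
    feedback = defaultdict(lambda: defaultdict(dict))
    for column, value in row.items():
        match = FEEDBACK_COLUMN.match(column)
        if match and value is not None:  # skip columns that carry no feedback
            source, name, field = match.groups()
            feedback[source][name][field] = value
    return {source: dict(names) for source, names in feedback.items()}
```

Grouping by name keeps a label next to its explanation, which is the pairing the optimization step relies on most.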

Prerequisites

Proceed directly with the task: run the ax command you need. Do not check versions, environment variables, or profiles up front.

If an ax command fails, troubleshoot based on the error:

command not found or a version error → see references/ax-setup.md
401 Unauthorized / missing API key → run ax profiles show to inspect the current profile. If the profile is missing or the API key is wrong, follow references/ax-profiles.md to create or update it. If the user doesn't have a key, direct them to https://app.arize.com/admin > API Keys
Space unknown → run ax spaces list and pick by name, or ask the user
Project unknown → ask the user, or run ax projects list -o json --limit 100 and present the results as selectable options
LLM provider call fails (missing OPENAI_API_KEY / ANTHROPIC_API_KEY) → run ax ai-integrations list --space SPACE to check for platform-managed credentials. If none exist, ask the user to provide the key or to create an integration via the arize-ai-provider-integration skill
Security: Never read .env files or search the filesystem for credentials. Use ax profiles for Arize credentials and ax ai-integrations for LLM provider keys. If credentials are not available through these channels, ask the user.

Phase 1: Extract the Current Prompt

Find LLM spans containing prompts

# Sample LLM spans (where prompts live)
ax spans export PROJECT --filter "attributes.openinference.span.kind = 'LLM'" -l 10 --stdout

# Filter by model
ax spans export PROJECT --filter "attributes.llm.model_name = 'gpt-4o'" -l 10 --stdout

# Filter by span name (e.g., a specific LLM call)
ax spans export PROJECT --filter "name = 'ChatCompletion'" -l 10 --stdout

Export a trace to inspect the prompt structure

# Export all spans in a trace
ax spans export PROJECT --trace-id TRACE_ID

# Export a single span
ax spans export PROJECT --span-id SPAN_ID

Extract prompts from the exported JSON

# Extract structured chat messages (system + user + assistant)
jq '.[0] | {
  messages: .attributes.llm.input_messages,
  model: .attributes.llm.model_name
}' trace_*/spans.json

# Extract the system prompt specifically
jq '[.[] | select(.attributes.llm.input_messages.roles[]? == "system")] | .[0].attributes.llm.input_messages' trace_*/spans.json

# Extract the prompt template and its variables
jq '.[0].attributes.llm.prompt_template' trace_*/spans.json

# Extract from input.value (fallback for unstructured prompts)
jq '.[0].attributes.input.value' trace_*/spans.json

Reconstruct the prompt as messages

Once you have the span data, reconstruct the prompt as a messages array:

[
  {"role": "system", "content": "You are a helpful assistant that..."},
  {"role": "user", "content": "Given {input}, answer the question: {question}"}
]

If a span has attributes.llm.prompt_template.template, the prompt uses variables. Preserve these placeholders ({variable} or {{variable}}); they are substituted at runtime.
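A minimal sketch of the reconstruction step, assuming the span stores messages as the parallel `roles` and `contents` arrays listed in the attribute table. `rebuild_messages` is a hypothetical helper name; the message text is copied untouched so that placeholders survive.

```python
def rebuild_messages(roles: list[str], contents: list[str]) -> list[dict]:
    """Pair each role with its content, in order, as chat-style messages."""
    if len(roles) != len(contents):
        raise ValueError("roles and contents must have the same length")
    # Content is passed through verbatim so {variable} placeholders stay intact.
    return [{"role": role, "content": content} for role, content in zip(roles, contents)]


messages = rebuild_messages(
    ["system", "user"],
    ["You are a helpful assistant.", "Given {input}, answer the question: {question}"],
)
```

Failing loudly on a length mismatch is deliberate here: silently truncating with `zip` would drop a message and change the prompt being optimized.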

Phase 2: Gather Performance Data

From traces (production feedback)

# Find error spans, which may indicate prompt failures
ax spans export PROJECT \
  --filter "status_code = 'ERROR' AND attributes.openinference.span.kind = 'LLM'" \
  -l 20 --stdout

# Find spans with poor feedback (here: annotated as incorrect)
ax spans export PROJECT \
  --filter "annotation.correctness.label = 'incorrect'" \
  -l 20 --stdout

# Find high-latency spans (may indicate an overly complex prompt)
ax spans export PROJECT \
  --filter "attributes.openinference.span.kind = 'LLM' AND latency_ms > 10000" \
  -l 20 --stdout

# Export error traces for detailed inspection
ax spans export PROJECT --trace-id TRACE_ID

From datasets and experiments

# Export a dataset (ground-truth examples)
ax datasets export DATASET_NAME --space SPACE
# -> dataset_*/examples.json

# Export experiment results (what the LLM generated)
ax experiments export EXPERIMENT_NAME --dataset DATASET_NAME --space SPACE
# -> experiment_*/runs.json

Merge dataset + experiment for analysis

Join the two files on example_id to see inputs side by side with outputs and evaluations:

# Count examples and runs
jq 'length' dataset_*/examples.json
jq 'length' experiment_*/runs.json

# View a single joined record
jq -s '
  .[0] as $dataset |
  .[1][0] as $run |
  ($dataset[] | select(.id == $run.example_id)) as $example |
  {
    input: $example,
    output: $run.output,
    evaluations: $run.evaluations
  }
' dataset_*/examples.json experiment_*/runs.json

# Find failed examples (eval score < threshold)
jq '[.[] | select(.evaluations.correctness.score < 0.5)]' experiment_*/runs.json
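For larger exports, the same join can be done in Python with a dictionary index instead of a nested scan. This is a sketch under the same assumptions as the jq above: examples carry `id`, runs carry `example_id`, `output`, and `evaluations`. `join_runs` and `failures` are hypothetical helper names.

```python
def join_runs(examples: list[dict], runs: list[dict]) -> list[dict]:
    """Attach each run to its dataset example via example_id."""
    by_id = {example["id"]: example for example in examples}
    joined = []
    for run in runs:
        example = by_id.get(run["example_id"])
        if example is None:
            continue  # the run refers to an example missing from the export
        joined.append({
            "input": example,
            "output": run.get("output"),
            "evaluations": run.get("evaluations", {}),
        })
    return joined


def failures(joined: list[dict], evaluator: str = "correctness", threshold: float = 0.5) -> list[dict]:
    """Keep joined records whose score for `evaluator` is below `threshold`."""
    kept = []
    for record in joined:
        score = record["evaluations"].get(evaluator, {}).get("score")
        if score is not None and score < threshold:
            kept.append(record)
    return kept
```

One behavioral difference is worth knowing: in jq, `null < 0.5` evaluates to true, so the jq filter also returns runs that have no score at all, while this sketch skips unscored runs.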

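Identify What to Optimize

Look for patterns across the failures:

Compare outputs to the ground truth: how does the LLM output differ from what was expected?
Read the eval explanations: eval.*.explanation tells you why something failed
Check the annotation text: human feedback describes specific problems
Look for verbosity mismatches: is the output too long or too short compared with the ground truth?
Check format compliance: is the output in the expected format?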
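The verbosity check in the list above can be roughed out with a word-count ratio. This is only a heuristic sketch: the `expected` and `actual_output` field names mirror the record shape prepared later in this skill, while the 0.5 and 2.0 thresholds and the helper names are arbitrary assumptions to tune per task.

```python
def verbosity_ratio(actual: str, expected: str) -> float:
    """Word count of the output divided by the word count of the ground truth."""
    expected_words = len(expected.split())
    if expected_words == 0:
        raise ValueError("expected output is empty")
    return len(actual.split()) / expected_words


def flag_verbosity(records: list[dict], low: float = 0.5, high: float = 2.0) -> list[str]:
    """Label each record 'too_short', 'too_long', or 'ok' by its ratio."""
    labels = []
    for record in records:
        ratio = verbosity_ratio(record["actual_output"], record["expected"])
        if ratio < low:
            labels.append("too_short")
        elif ratio > high:
            labels.append("too_long")
        else:
            labels.append("ok")
    return labels
```

If most failures come back as `too_long`, that points toward removing instructions that invite explanation, in line with the alignment strategy in the meta-prompt.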

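Phase 3: Optimize the Prompt

The optimization meta-prompt

Use this template to generate an improved version of the prompt. Fill in the two placeholders ({PASTE_ORIGINAL_PROMPT_HERE} and {PASTE_RECORDS_HERE}) and send it to your LLM (GPT-4o, Claude, etc.):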

You are an expert in prompt optimization. Given the original baseline prompt
and the associated performance data (inputs, outputs, evaluation labels, and
explanations), generate a revised version that improves results.

ORIGINAL BASELINE PROMPT
========================

{PASTE_ORIGINAL_PROMPT_HERE}

========================

PERFORMANCE DATA
================

The following records show how the current prompt performed. Each record
includes the input, the LLM output, and evaluation feedback:

{PASTE_RECORDS_HERE}

================

HOW TO USE THIS DATA

1. Compare outputs: Look at what the LLM generated vs what was expected
2. Review eval scores: Check which examples scored poorly and why
3. Examine annotations: Human feedback shows what worked and what didn't
4. Identify patterns: Look for common issues across multiple examples
5. Focus on failures: The rows where the output DIFFERS from the expected
   value are the ones that need fixing

ALIGNMENT STRATEGY

- If outputs have extra text or reasoning not present in the ground truth,
  remove instructions that encourage explanation or verbose reasoning
- If outputs are missing information, add instructions to include it
- If outputs are in the wrong format, add explicit format instructions
- Focus on the rows where the output differs from the target -- these are
  the failures to fix

RULES

Maintain Structure:
- Use the same template variables as the current prompt ({var} or {{var}})
- Don't change sections that are already working
- Preserve the exact return format instructions from the original prompt

Avoid Overfitting:
- DO NOT copy examples verbatim into the prompt
- DO NOT quote specific test data outputs exactly
- INSTEAD: Extract the ESSENCE of what makes good vs bad outputs
- INSTEAD: Add general guidelines and principles
- INSTEAD: If adding few-shot examples, create SYNTHETIC examples that
  demonstrate the principle, not real data from above

Goal: Create a prompt that generalizes well to new inputs, not one that
memorizes the test data.

OUTPUT FORMAT

Return the revised prompt as a JSON array of messages:

[
  {"role": "system", "content": "..."},
  {"role": "user", "content": "..."}
]

Also provide a brief reasoning section (bulleted list) explaining:
- What problems you found
- How the revised prompt addresses each one

Preparing the performance data

Format the records as a JSON array before pasting them into the template:

# From a dataset + experiment: join and select the relevant columns
jq -s '
  .[0] as $ds |
  [.[1][] | . as $run |
    ($ds[] | select(.id == $run.example_id)) as $ex |
    {
      input: $ex.input,
      expected: $ex.expected_output,
      actual_output: $run.output,
      eval_score: $run.evaluations.correctness.score,
      eval_label: $run.evaluations.correctness.label,
      eval_explanation: $run.evaluations.correctness.explanation
    }
  ]
' dataset_*/examples.json experiment_*/runs.json

# From exported spans: extract input/output pairs with status and model
jq '[.[] | select(.attributes.openinference.span.kind == "LLM") | {
  input: .attributes.input.value,
  output: .attributes.output.value,
  status: .status_code,
  model: .attributes.llm.model_name
}]' trace_*/spans.json
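Once the records are prepared, the meta-prompt's placeholders can be filled programmatically. The sketch below uses plain string replacement rather than `str.format`, because the meta-prompt itself contains literal `{var}` and `{{var}}` text that `str.format` would try to interpret. `fill_meta_prompt` is a hypothetical helper, not an ax feature.

```python
import json


def fill_meta_prompt(template: str, original_prompt: list[dict], records: list[dict]) -> str:
    """Substitute the two PASTE_* placeholders with pretty-printed JSON."""
    replacements = {
        "{PASTE_ORIGINAL_PROMPT_HERE}": json.dumps(original_prompt, indent=2, ensure_ascii=False),
        "{PASTE_RECORDS_HERE}": json.dumps(records, indent=2, ensure_ascii=False),
    }
    for marker, text in replacements.items():
        if marker not in template:
            raise ValueError(f"template is missing {marker}")
        # str.replace leaves every other brace in the template untouched.
        template = template.replace(marker, text)
    return template
```

Raising on a missing marker guards against sending a meta-prompt that silently lacks either the baseline prompt or the performance data.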

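Applying the revised prompt

After the LLM returns the revised messages array:

Compare the original and revised prompts side by side
Confirm that every template variable is preserved
Confirm that the format instructions are intact
Test on a few examples before rolling out fully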
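The variable check in the list above is easy to automate. This is a minimal sketch that assumes placeholders look like `{name}` or `{{name}}` with word-character names; `template_variables` and `check_variables` are hypothetical helpers, and the regex will also pick up any literal text of the same shape.

```python
import re

# Double-brace form is tried first so {{name}} is not read as a nested {name}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}|\{(\w+)\}")


def template_variables(messages: list[dict]) -> set[str]:
    """Collect variable names written as {name} or {{name}} in any message."""
    names = set()
    for message in messages:
        for double, single in PLACEHOLDER.findall(message["content"]):
            names.add(double or single)
    return names


def check_variables(original: list[dict], revised: list[dict]) -> dict:
    """Report variables the revision dropped or introduced."""
    before, after = template_variables(original), template_variables(revised)
    return {"missing": sorted(before - after), "added": sorted(after - before)}
```

A rename shows up as one entry in `missing` and one in `added`, which is the case most likely to break runtime substitution.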

Phase 4: Iterate

The optimization loop

1. Extract the prompt       -> Phase 1 (once)
2. Run an experiment        -> ax experiments create ...
3. Export the results       -> ax experiments export EXPERIMENT_NAME --dataset DATASET_NAME --space SPACE
4. Analyze failures         -> use jq to find low scores
5. Run the meta-prompt      -> Phase 3 with the new failure data
6. Apply the revised prompt
7. Repeat from step 2
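The control flow of this loop can be sketched with the experiment and revision steps passed in as functions. Everything here is illustrative: `optimize`, `run_experiment`, and `revise` are hypothetical names, each run is assumed to be a dict with a numeric `score`, and the stopping rule (no run below the threshold, or a round limit) is an assumption rather than something the skill prescribes.

```python
from typing import Callable

Messages = list[dict]


def optimize(
    prompt: Messages,
    run_experiment: Callable[[Messages], list[dict]],
    revise: Callable[[Messages, list[dict]], Messages],
    max_rounds: int = 3,
    threshold: float = 0.5,
) -> tuple[Messages, list[float]]:
    """Repeat steps 2-6 until no run fails or max_rounds is reached.

    Returns the best-scoring prompt and the mean score of every round.
    """
    best_prompt, best_score, history = prompt, float("-inf"), []
    for _ in range(max_rounds):
        runs = run_experiment(prompt)  # steps 2-3: run and export
        if not runs:
            raise ValueError("experiment returned no runs")
        mean = sum(run["score"] for run in runs) / len(runs)
        history.append(mean)
        if mean > best_score:
            best_prompt, best_score = prompt, mean
        failed = [run for run in runs if run["score"] < threshold]  # step 4
        if not failed:
            break
        prompt = revise(prompt, failed)  # steps 5-6: meta-prompt, then apply
    return best_prompt, history
```

Tracking the best prompt separately from the latest one means a revision that makes things worse does not overwrite an earlier, better version.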

Measuring improvement

# Compare scores across experiments
# Experiment A (baseline)
jq '[.[] | .evaluations.correctness.score] | add / length' experiment_a/runs.json

# Experiment B (optimized)
jq '[.[] | .evaluations.correctness.score] | add / length' experiment_b/runs.json

# Find examples that flipped from fail to pass
jq -s '
  [.[0][] | select(.evaluations.correctness.label == "incorrect")] as $fails |
  [.[1][] | select(.evaluations.correctness.label == "correct") |
    select(.example_id as $id | $fails | any(.example_id == $id))
  ] | length
' experiment_a/runs.json experiment_b/runs.json

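A/B comparison of two prompts

Create two experiments against the same dataset, each using a different prompt version
Export both: ax experiments export EXP_A and ax experiments export EXP_B
Compare average scores, failure rates, and flips on specific examples
Check for regressions: examples that passed with prompt A but fail with prompt B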
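The comparison steps above can be combined into one report. This sketch assumes each run carries `example_id` and an `evaluations.<evaluator>.label` of `correct` or `incorrect`, matching the jq used in this phase; `label_by_example` and `compare` are hypothetical helpers.

```python
def label_by_example(runs: list[dict], evaluator: str = "correctness") -> dict:
    """Map example_id -> evaluator label for one experiment's runs."""
    return {
        run["example_id"]: run.get("evaluations", {}).get(evaluator, {}).get("label")
        for run in runs
    }


def compare(runs_a: list[dict], runs_b: list[dict], evaluator: str = "correctness") -> dict:
    """Summarize how prompt B changed outcomes relative to prompt A."""
    a, b = label_by_example(runs_a, evaluator), label_by_example(runs_b, evaluator)
    shared = sorted(a.keys() & b.keys())  # only examples present in both experiments
    if not shared:
        raise ValueError("the experiments share no example_ids")
    return {
        "fixed": [i for i in shared if a[i] == "incorrect" and b[i] == "correct"],
        "regressed": [i for i in shared if a[i] == "correct" and b[i] == "incorrect"],
        "failure_rate_a": sum(a[i] == "incorrect" for i in shared) / len(shared),
        "failure_rate_b": sum(b[i] == "incorrect" for i in shared) / len(shared),
    }
```

Two prompts can have the same failure rate while failing on different examples, so the `fixed` and `regressed` lists are worth reading even when the rates match.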

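Prompt Engineering Best Practices

Apply these when writing or revising prompts:

Technique	When to apply	Example
Clear, detailed instructions	Output is vague or off-topic	"Classify the sentiment as exactly one of: positive, negative, neutral"
Instructions at the start	Model ignores later instructions	Put the task description before the examples
Step-by-step breakdown	Complex multi-step processes	"First extract entities, then classify each, then summarize"
Specific persona	Consistent style/tone is needed	"You are a senior financial analyst writing for institutional investors"
Delimiter tokens	Sections blend together	Use `---`, `###`, or XML tags to separate input from instructions
Few-shot examples	Output format needs clarifying	Show 2-3 synthetic input/output pairs
Output length specification	Responses are too long or too short	"Respond in exactly 2-3 sentences"
Reasoning instructions	Accuracy is critical	"Think step by step before answering"
"I don't know" guidelines	Hallucination is a risk	"If the answer is not in the provided context, say 'I don't have enough information'"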

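Variable Preservation

When optimizing prompts that use template variables:

Single braces ({variable}): Python str.format / f-string style. Most common in Arize.
Double braces ({{variable}}): Mustache style (Jinja also uses double braces). Use when the framework requires it.
Never add or remove variable placeholders during optimization
Never rename variables; runtime substitution depends on exact names
When adding few-shot examples, use literal values, not variable placeholders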

Workflows

Optimize a prompt from a failing trace

Find failing traces:

ax traces list PROJECT --filter "status_code = 'ERROR'" --limit 5

Export the trace:

ax spans export PROJECT --trace-id TRACE_ID

Extract the prompt from the LLM span:

jq '[.[] | select(.attributes.openinference.span.kind == "LLM")][0] | {
  messages: .attributes.llm.input_messages,
  template: .attributes.llm.prompt_template,
  output: .attributes.output.value,
  error: .attributes.exception.message
}' trace_*/spans.json

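Identify what failed from the error message or the output
Fill in the optimization meta-prompt (Phase 3) with the prompt and the error context
Apply the revised prompt

Optimize using a dataset and experiments

Find the dataset and experiments: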

ax datasets list --space SPACE
ax experiments list --dataset DATASET_NAME --space SPACE

Export both:

ax datasets export DATASET_NAME --space SPACE
ax experiments export EXPERIMENT_NAME --dataset DATASET_NAME --space SPACE

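Prepare the joined data for the meta-prompt
Run the optimization meta-prompt
Create a new experiment with the revised prompt to measure the improvement

Debug a prompt that produces the wrong format

Export spans whose output format is wrong: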

ax spans export PROJECT \
  --filter "attributes.openinference.span.kind = 'LLM' AND annotation.format.label = 'incorrect'" \
  -l 10 --stdout > bad_format.json

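Compare what the LLM is producing with what is expected
Add explicit format instructions to the prompt (JSON schema, examples, delimiters)
Common fix: add a few-shot example showing the exact desired output format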
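When the expected format is a JSON object, the comparison in the first step can be mechanized. The sketch below is an assumption-laden example: it treats "correct format" as "parses as a JSON object containing certain keys", and `format_problems` is a hypothetical helper. Other formats need their own checks.

```python
import json


def format_problems(output: str, required_keys: set[str]) -> list[str]:
    """Return the reasons an output fails a 'JSON object with these keys' contract."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError as error:
        return [f"not valid JSON: {error.msg}"]
    if not isinstance(parsed, dict):
        return ["top-level value is not a JSON object"]
    missing = sorted(required_keys - set(parsed))
    return [f"missing key: {key}" for key in missing]
```

Running this over the exported outputs and counting the reasons shows whether the dominant problem is extra prose around the JSON or wrong field names, which call for different prompt instructions.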

Reduce hallucination in a RAG prompt

Find traces where the model hallucinated:

ax spans export PROJECT \
  --filter "annotation.faithfulness.label = 'unfaithful'" \
  -l 20 --stdout

Export the trace and inspect the retriever and LLM spans together:

ax spans export PROJECT --trace-id TRACE_ID
jq '[.[] | {kind: .attributes.openinference.span.kind, name, input: .attributes.input.value, output: .attributes.output.value}]' trace_*/spans.json

Check whether the retrieved context actually contains the answer
Add grounding instructions to the system prompt: "Only use information from the provided context. If the answer is not in the context, say so."
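For a first pass over many traces, the context check can be approximated with word overlap. This is a crude lexical heuristic, not a faithfulness evaluation: `context_supports` is a hypothetical helper, the 0.6 threshold is an arbitrary assumption, and paraphrased answers will be missed.

```python
PUNCTUATION = ".,!?;:\"'()"


def words(text: str) -> set[str]:
    """Lowercased words with surrounding punctuation stripped."""
    cleaned = {word.strip(PUNCTUATION).lower() for word in text.split()}
    cleaned.discard("")
    return cleaned


def context_supports(answer: str, context: str, min_overlap: float = 0.6) -> bool:
    """True if at least `min_overlap` of the answer's words appear in the context."""
    answer_words = words(answer)
    if not answer_words:
        return False
    overlap = len(answer_words & words(context)) / len(answer_words)
    return overlap >= min_overlap
```

Spans where this returns False are candidates for manual review: either the retriever missed the relevant passage, or the model answered from outside the context.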

Troubleshooting

Problem	Solution
`ax: command not found`	See references/ax-setup.md
`No profile found`	No profile is configured. Follow references/ax-profiles.md to create one.
No `input_messages` on the span	Check the span kind. Chain/Agent spans do not store the prompt themselves; it is on the child LLM spans
Prompt template is `null`	Not every instrumentation emits `prompt_template`. Use `input_messages` or `input.value` instead
Variables lost after optimization	Verify that the revised prompt preserves every `{var}` placeholder from the original
Optimization makes things worse	Check for overfitting: the meta-prompt may have memorized the test data. Make sure few-shot examples are synthetic
No eval/annotation columns	Run evaluations first (via the Arize UI or SDK), then export again
Experiment output column not found	The column name is `{experiment_name}.output`; confirm the exact experiment name with `ax experiments get`
`jq` errors on span JSON	Make sure you are targeting the correct file path (e.g., `trace_*/spans.json`)

License: MIT (quoted in full because the license is permissive) · Original repository

Details

Author: github
Repository: github/awesome-copilot
License: MIT
Last updated: unknown


Source: https://github.com/github/awesome-copilot / License: MIT