agent-evaluation
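Use this when you want to evaluate, improve, or optimize the output quality of an existing LLM agent. It covers improving tool selection accuracy, improving answer quality, reducing costs, and fixing wrong or incomplete responses, and it evaluates agents systematically using MLflow datasets, scorers, and tracing. It can be used for the whole end-to-end evaluation workflow or for individual components such as tracing setup, dataset creation, scorer definition, and evaluation execution.
Original description: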
Use this when you need to EVALUATE OR IMPROVE or OPTIMIZE an existing LLM agent's output quality - including improving tool selection accuracy, answer quality, reducing costs, or fixing issues where the agent gives wrong/incomplete responses. Evaluates agents systematically using MLflow evaluation with datasets, scorers, and tracing. IMPORTANT - Always also load the instrumenting-with-mlflow-tracing skill before starting any work. Covers end-to-end evaluation workflow or individual components (tracing setup, dataset creation, scorer definition, evaluation execution).
SKILL.md body
Agent Evaluation with MLflow
Comprehensive guide for evaluating GenAI agents with MLflow. Use this skill for the complete evaluation workflow or individual components - tracing setup, environment configuration, dataset creation, scorer definition, or evaluation execution. Each section can be used independently based on your needs.
⛔ CRITICAL: Must Use MLflow APIs
DO NOT create custom evaluation frameworks. You MUST use MLflow's native APIs:
- Datasets: Use mlflow.genai.datasets.create_dataset() - NOT custom test case files
- Scorers: Use mlflow.genai.scorers and mlflow.genai.judges.make_judge() - NOT custom scorer functions
- Evaluation: Use mlflow.genai.evaluate() - NOT custom evaluation loops
- Scripts: Use the provided scripts/ directory templates - NOT custom evaluation/ directories
Why? MLflow tracks everything (datasets, scorers, traces, results) in the experiment. Custom frameworks bypass this and lose all observability.
If you're tempted to create evaluation/eval_dataset.py or similar custom files, STOP. Use scripts/create_dataset_template.py instead.
Quick Start
⚠️ REMINDER: Use MLflow APIs from this skill. Do not create custom evaluation frameworks.
Setup (prerequisite): Install MLflow 3.8+, configure environment, integrate tracing
Evaluation workflow in 5 steps (each uses MLflow APIs):
1. Understand: Run agent, inspect traces, understand purpose
2. Scorers: Select and register scorers for quality criteria
3. Dataset: ALWAYS discover existing datasets first, only create new if needed
3.5. Dry Run: Run 3 questions first to catch broken tools and misconfigured scorers before full eval
4. Evaluate: Run agent on dataset, apply scorers, analyze results
Command Conventions
Always use uv run for MLflow and Python commands:
uv run mlflow --version # MLflow CLI commands
uv run python scripts/xxx.py # Python script execution
uv run python -c "..." # Python one-liners
This ensures commands run in the correct environment with proper dependencies.
CRITICAL: Separate stderr from stdout when capturing CLI output:
When saving CLI command output to files for parsing (JSON, CSV, etc.), always redirect stderr separately to avoid mixing logs with structured data:
# Save both separately for debugging
uv run mlflow traces evaluate ... --output json > results.json 2> evaluation.log
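When a Python script drives the CLI instead of a shell redirect, the same separation can be kept by capturing the two streams independently. The sketch below is illustrative only: it uses a stand-in child process (a hypothetical one-liner that logs to stderr and prints JSON to stdout) rather than a real mlflow command, and run_and_parse is a name invented for this example.

```python
import json
import subprocess
import sys

# Stand-in child process (hypothetical): logs to stderr, emits JSON on stdout.
CHILD = (
    "import sys, json; "
    "print('INFO starting evaluation', file=sys.stderr); "
    "print(json.dumps({'status': 'ok', 'rows': 3}))"
)


def run_and_parse(cmd: list[str]) -> tuple[dict, str]:
    """Run cmd and return (JSON parsed from stdout, raw stderr log text)."""
    proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
    # stdout holds only structured data because the logs went to stderr.
    return json.loads(proc.stdout), proc.stderr


result, log_text = run_and_parse([sys.executable, "-c", CHILD])
print(result)
```

If the two streams were merged, the log line would land in front of the JSON and json.loads would fail.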
Documentation Access Protocol
All MLflow documentation must be accessed through llms.txt:
- Start at: https://mlflow.org/docs/latest/llms.txt
- Query llms.txt for your topic with a specific prompt
- If llms.txt references another doc, use WebFetch with that URL
- Do not use WebSearch - use WebFetch with llms.txt first
This applies to all steps, especially:
- Dataset creation (read GenAI dataset docs from llms.txt)
- Scorer registration (check MLflow docs for scorer APIs)
- Evaluation execution (understand mlflow.genai.evaluate API)
Discovering Agent Structure
Each project has unique structure. Use dynamic exploration instead of assumptions:
Find Agent Entry Points
# Search for main agent functions
grep -r "def.*agent" . --include="*.py"
grep -rE "def (run|stream|handle|process)" . --include="*.py"
# Check common locations
ls main.py app.py src/*/agent.py 2>/dev/null
# Look for API routes
grep -rE "@app\.(get|post)" . --include="*.py" # FastAPI/Flask
grep -r "def.*route" . --include="*.py"
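The name-based searches above can also be approximated in Python with the standard ast module, which avoids false hits inside comments or strings. This is a rough sketch, not part of the skill's scripts: find_entry_points and the pattern are assumptions chosen to mirror the grep expressions.

```python
import ast
import re

# Mirrors the grep patterns: names containing "agent", or starting with
# run/stream/handle/process. Adjust for your project's conventions.
ENTRY_POINT_PATTERN = re.compile(r"agent|^(run|stream|handle|process)")


def find_entry_points(source: str) -> list[str]:
    """Return sorted names of sync/async functions that look like entry points."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if ENTRY_POINT_PATTERN.search(node.name):
                names.append(node.name)
    return sorted(names)


SAMPLE = '''
def run_agent(query): ...
async def stream_reply(query): ...
def helper(): ...
class Bot:
    def handle(self, msg): ...
'''
print(find_entry_points(SAMPLE))
```

To scan a project, the same function could be applied to the text of each .py file found with pathlib.Path.rglob.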
Understand Project Structure
# Check entry points in package config
cat pyproject.toml setup.py 2>/dev/null | grep -A 5 "scripts\|entry_points"
# Read project documentation
cat README.md docs/*.md 2>/dev/null | head -100
# Explore main directories
ls -la src/ app/ agent/ 2>/dev/null
Setup Overview
Pre-check: Use Existing Environment
Before doing ANY setup, check if MLFLOW_TRACKING_URI and MLFLOW_EXPERIMENT_ID are already set:
echo "MLFLOW_TRACKING_URI=$MLFLOW_TRACKING_URI"
echo "MLFLOW_EXPERIMENT_ID=$MLFLOW_EXPERIMENT_ID"
If BOTH are already set, skip Steps 1-2 entirely. The environment is pre-configured. Do NOT run setup_mlflow.py, do NOT create a .env file, do NOT override these values. Go directly to Step 3 (tracing integration) and the evaluation workflow.
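The pre-check can be expressed as a small Python helper when it needs to run inside a script. This is a minimal sketch under the rule stated above (both variables must be set); the function names are made up for illustration.

```python
import os
from typing import Mapping

REQUIRED_VARS = ("MLFLOW_TRACKING_URI", "MLFLOW_EXPERIMENT_ID")


def missing_mlflow_vars(env: Mapping[str, str]) -> list[str]:
    """Return the required variables that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def is_preconfigured(env: Mapping[str, str]) -> bool:
    """True only when BOTH variables are set, so Steps 1-2 can be skipped."""
    return not missing_mlflow_vars(env)


missing = missing_mlflow_vars(os.environ)
if missing:
    print("Not pre-configured; missing:", ", ".join(missing))
else:
    print("Pre-configured; skip Steps 1-2 and go to Step 3.")
```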
Setup Steps (only if environment is NOT pre-configured)
1. Install MLflow (version >=3.8.0)
2. Configure environment (tracking URI and experiment)
   - Guide: Follow references/setup-guide.md Steps 1-2
3. Integrate tracing (autolog and @mlflow.trace decorators)
   - ⚠️ MANDATORY: Use the instrumenting-with-mlflow-tracing skill for tracing setup
   - ✓ VERIFY: Run scripts/validate_tracing_runtime.py after implementing
⚠️ Tracing must work before evaluation. If tracing fails, stop and troubleshoot.
Checkpoint - verify before proceeding:
- MLflow >=3.8.0 installed
- MLFLOW_TRACKING_URI and MLFLOW_EXPERIMENT_ID set
- Autolog enabled and @mlflow.trace decorators added
- Test run creates a trace (verify trace ID is not None)
Validation scripts:
uv run python scripts/validate_environment.py # Check MLflow install, env vars, connectivity
uv run python scripts/validate_auth.py # Test authentication before expensive operations
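For the version item in the checkpoint, a numeric comparison is safer than comparing strings, since "3.10.0" sorts before "3.8.0" as text. The helper below is a simplified sketch and not what validate_environment.py does; it ignores pre-release suffixes, so "3.8.0rc1" is treated as 3.8.0.

```python
import itertools
from importlib import metadata

MINIMUM = (3, 8, 0)


def parse_version(text: str) -> tuple[int, ...]:
    """Parse the leading numeric X.Y.Z part; suffixes like 'rc1' are dropped."""
    numbers = []
    for piece in text.split(".")[:3]:
        digits = "".join(itertools.takewhile(str.isdigit, piece))
        numbers.append(int(digits) if digits else 0)
    numbers += [0] * (3 - len(numbers))  # pad "3.8" to (3, 8, 0)
    return tuple(numbers)


def meets_minimum(version_text: str, minimum: tuple[int, ...] = MINIMUM) -> bool:
    """Compare versions numerically, component by component."""
    return parse_version(version_text) >= minimum


def installed_mlflow_ok() -> bool:
    """False when mlflow is not installed or is older than MINIMUM."""
    try:
        return meets_minimum(metadata.version("mlflow"))
    except metadata.PackageNotFoundError:
        return False


print("MLflow >= 3.8.0 installed:", installed_mlflow_ok())
```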
Evaluation Workflow
Step 1: Agent Interview (REQUIRED — do not skip)
Before doing anything else, ask the user these questions. Do NOT proceed until you have answers.
Required:
- "What does your agent do? Describe its purpose in 1-2 sentences."
- "What are the 2-3 most important things it needs to get right?"
- "Are there common failure modes you've already noticed?"
Use answers to:
- Derive scorer names and criteria (do not invent them)
- Write the agent_description parameter for generate_evals_df
- Set evaluation priorities
If running in automated mode: Read agent purpose from the codebase (SKILL.md, README, or main entry point docstring). Still surface what you found and confirm before proceeding.
Step 2: Define Quality Scorers
- Check registered scorers in your experiment:
uv run mlflow scorers list -x $MLFLOW_EXPERIMENT_ID
IMPORTANT: if there are registered scorers in the experiment then they must be used for evaluation.
- Select additional built-in scorers that apply to the agent
See references/scorers.md for the built-in scorers. Select any that are useful for assessing the agent's quality and that are not already registered.
- Create additional custom scorers as needed
If needed, create additional scorers using the make_judge() API. See references/scorers.md on how to create custom scorers and references/scorers-constraints.md for best practices.
⚠️ CRITICAL (Scorer Return Values): Scorers MUST instruct the LLM judge to return "yes" or "no" (or booleans/numerics). Return values of "pass" or "fail" are silently cast to None by _cast_assessment_value_to_float and excluded from results.metrics with no error or warning; the results simply disappear. See references/scorers-constraints.md Constraint 2 for the full list of safe vs. broken return values.
- REQUIRED: Register new scorers before evaluation using Python API:
from mlflow.genai.judges import make_judge
from mlflow.genai.scorers import BuiltinScorerName
import os
scorer = make_judge(...)  # Or, scorer = BuiltinScorerName()
scorer.register()
IMPORTANT: See references/scorers.md → "Model Selection for Scorers" to configure the model parameter of scorers before registration.
⚠️ Scorers MUST be registered before evaluation. Inline scorers that aren't registered won't appear in mlflow scorers list and won't be reusable.
- Verify registration:
uv run mlflow scorers list -x $MLFLOW_EXPERIMENT_ID # Should show your scorers
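Because the return-value constraint in this step fails silently, it can help to check a few sample judge outputs before registering a scorer. The sketch below is a stand-alone illustration: unsafe_values is a hypothetical helper, the safe set is taken only from the constraint as stated in this step, and it deliberately does not reproduce MLflow's internal casting logic. It is strict on purpose and accepts only the exact strings "yes" and "no".

```python
from numbers import Real

# Safe string values per the constraint in this step; "pass"/"fail" are not safe.
SAFE_STRINGS = {"yes", "no"}


def is_safe_value(value: object) -> bool:
    """True for booleans, numerics, and the exact strings 'yes' / 'no'."""
    if isinstance(value, Real):  # bool is a subclass of int, so it is covered
        return True
    return isinstance(value, str) and value in SAFE_STRINGS


def unsafe_values(sample_outputs: list) -> list:
    """Return the sample outputs that risk being dropped from the metrics."""
    return [value for value in sample_outputs if not is_safe_value(value)]


samples = ["yes", "no", True, 0.5, "pass", "fail", None]
print(unsafe_values(samples))
```

Any value this check flags is a reason to revisit the judge instructions before running the evaluation.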
Step 3: Prepare Evaluation Dataset
ALWAYS discover existing datasets first to prevent duplicate work:
- Run dataset discovery (mandatory):
uv run python scripts/list_datasets.py  # Lists, compares, recommends datasets
uv run python scripts/list_datasets.py --format json  # Machine-readable output
uv run python scripts/list_datasets.py --help  # All options
- Present findings to user:
   - Show all discovered datasets with their characteristics (size, topics covered)
   - If datasets found, highlight most relevant options based on agent type
- Ask user about existing datasets:
   - "I found [N] existing evaluation dataset(s). Do you want to use one of these? (y/n)"
   - If yes: Ask which dataset to use and record the dataset name, then skip to Step 3.5
   - If no: Proceed to Phase A below
If creating a new dataset, use the two-phase approach below.
Phase A: Sanity Check (5 questions — always run first)
Create a minimal 5-question dataset manually from the Step 1 interview answers. The goal is to confirm the pipeline works end-to-end before investing in large-scale generation.
import mlflow
from mlflow.genai.datasets import create_dataset
# Derive 5 representative questions directly from the agent's stated purpose
# and known failure modes identified in Step 1
sanity_records = [
    {"inputs": {"query": "<question 1 from interview>"}, "expectations": {"expected_response": "<expected answer>"}},
    {"inputs": {"query": "<question 2 from interview>"}, "expectations": {"expected_response": "<expected answer>"}},
    # ... 5 total
]
# create_dataset() creates an empty dataset; records are added with merge_records()
sanity_dataset = create_dataset(name="sanity-check-5q")
sanity_dataset.merge_records(sanity_records)
Run evaluation on this dataset (see Step 4), then present results to the user with this framing:
"This is a sanity check — 5 questions confirm the pipeline works but aren't statistically meaningful. Proceeding to Phase B to generate a proper evaluation set."
Only proceed to Phase B once Phase A completes without errors.
Phase B: Proper Evaluation Dataset (100+ questions — run after Phase A passes)
Generate questions from the agent's actual corpus rather than inventing them from scratch. The approach depends on whether the project uses Databricks or OSS MLflow.
On Databricks — use generate_evals_df to synthesize questions from the agent's document corpus:
from databricks.agents.evals import generate_evals_df, estimate_synthetic_num_evals
import mlflow
# agent_description comes from Step 1 interview answers
agent_description = "<agent purpose from interview>"
# docs_df: a Spark or pandas DataFrame with a "content" column containing
# the documents/chunks the agent retrieves from (e.g., your Vector Search index)
evals = generate_evals_df(
    docs=docs_df,
    num_evals=100,
    agent_description=agent_description,
)
# Merge into MLflow dataset — don't create a separate dataset
dataset = mlflow.genai.datasets.create_dataset(name="generated-evals-100q")
dataset.merge_records(evals)
To estimate the right num_evals before generating:
recommended = estimate_synthetic_num_evals(docs_df)
print(f"Recommended num_evals: {recommended}")
Dataset size guidance:
- <30 questions: not statistically meaningful — avoid drawing conclusions
- 50–100 questions: adequate for catching regressions, suitable for most agents
- 200+ questions: recommended when comparing model variants or scoring multiple dimensions
On OSS MLflow — use RAGAS TestsetGenerator to generate from your document corpus:
from ragas.testset import TestsetGenerator
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
generator = TestsetGenerator(
    llm=LangchainLLMWrapper(your_llm),
    embedding_model=LangchainEmbeddingsWrapper(your_embeddings),
)
testset = generator.generate_with_langchain_docs(docs, testset_size=100)
evals_df = testset.to_pandas()
# Convert to MLflow dataset schema and merge
import mlflow
# Expected answers go under "expectations" in each dataset record
records = [
    {"inputs": {"query": row["user_input"]}, "expectations": {"expected_response": row["reference"]}}
    for _, row in evals_df.iterrows()
]
dataset = mlflow.genai.datasets.create_dataset(name="generated-evals-100q")
dataset.merge_records(records)
If no document corpus is available — ask the user to provide 50–100 representative queries from production logs or usage history. These are more realistic than synthetic questions and are preferable when available.
IMPORTANT: Do not skip dataset discovery. Always run list_datasets.py first, even if you plan to create a new dataset. This prevents duplicate work and ensures users are aware of existing evaluation datasets.
For complete dataset guide: See references/dataset-preparation.md
Checkpoint - verify before proceeding:
- Scorers have been registered
- Phase A sanity check passed (pipeline runs end-to-end)
- Phase B dataset created with 50+ questions (or existing dataset selected)
Step 3.5: Dry Run (REQUIRED before full eval)
Run evaluation on 3 questions from the dataset before committing to the full run. This catches broken tools, misconfigured scorers, and auth failures early — before they silently corrupt 100-question results.
If you completed Phase A above, the pipeline is already validated — focus the dry run on scorer output only.
import mlflow
dataset = mlflow.genai.datasets.get_dataset(name="<your-dataset-name>")
dry_run_records = dataset.df.head(3)
Run mlflow.genai.evaluate() on these 3 records using the same wrapper and scorers as the full eval.
For each response, check:
- Tool calls: Did the agent call any tools? If it called zero tools on questions that require retrieval, tools are likely broken (403s, rate limits, missing credentials).
- Response quality: Are responses empty or generic ("I don't know", "I can't help with that")? Empty responses score as irrelevant and will skew the full eval.
- Scorer output: Did all 3 scores come back as 0 or None? If so, the scorer is misconfigured (check return values: "pass"/"fail" are silently cast to None; use "yes"/"no" instead).
Decision gate:
- If dry run shows tool failures or empty responses: Stop. Fix the underlying issue (auth, tool config, retrieval) before proceeding. Do not run the full eval on broken infrastructure.
- If all 3 scorer outputs are 0 or None: Stop. Debug scorer return values and re-register before proceeding.
- If dry run passes: Report to the user: "Dry run passed (3/3 responses non-empty, tools called, scores non-zero). Proceeding to full eval." Then continue to Step 4.
Why this matters: Tool failures (403s from docs scraping, GitHub API rate limits) produce empty agent responses that score as 0. Running a 100-question eval only to discover all tools were failing wastes time and produces misleading results. The dry run catches this in under a minute.
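The decision gate can be written down as a small check over the three dry-run records. This is a sketch under simplifying assumptions: DryRunRecord and dry_run_problems are invented names, the fields would have to be filled in from your own traces and results, and every dry-run question is assumed to require at least one tool call.

```python
from dataclasses import dataclass
from typing import Optional

# Generic replies quoted in the checklist above (compared case-insensitively).
GENERIC_REPLIES = {"i don't know", "i can't help with that"}


@dataclass
class DryRunRecord:
    response: str
    tool_calls: int           # number of tool calls seen in the trace
    score: Optional[float]    # scorer output as a number, or None


def is_empty_or_generic(response: str) -> bool:
    text = response.strip().lower().rstrip(".")
    return not text or text in GENERIC_REPLIES


def dry_run_problems(records: list[DryRunRecord]) -> list[str]:
    """Return the reasons to stop; an empty list means the dry run passed."""
    problems = []
    if any(r.tool_calls == 0 for r in records):
        problems.append("tool failures")
    if any(is_empty_or_generic(r.response) for r in records):
        problems.append("empty responses")
    if all(r.score is None or r.score == 0 for r in records):
        problems.append("scorer misconfigured")
    return problems


good = [
    DryRunRecord("Tracing records each step as a span.", 2, 1.0),
    DryRunRecord("Set the tracking URI first.", 1, 0.0),
    DryRunRecord("Use a registered scorer.", 1, 1.0),
]
bad = [
    DryRunRecord("", 0, None),
    DryRunRecord("I don't know.", 0, 0.0),
    DryRunRecord("", 0, None),
]
print(dry_run_problems(good), dry_run_problems(bad))
```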
Step 4: Run Evaluation
Large datasets (50+ questions)? See references/throughput-guide.md for throughput optimization; it covers parallelism env vars, async predict functions, and dataset splitting for 200+ question evals.
4a. Estimate Runtime Before Starting
Before launching evaluation, tell the user how long it will take:
- Count the dataset questions:
import mlflow
dataset = mlflow.genai.datasets.get_dataset(name="<your-dataset-name>")
print(f"Dataset size: {len(dataset.df)} questions")
- Calculate the estimate. Each question runs the agent once and the judge scorer once:
   - Opus-class judge models (e.g. claude-opus-4): ~45–90s per question
   - Sonnet-class judge models (e.g. claude-sonnet-4): ~20–45s per question
   - Multiple scorers per question add time proportionally
Estimated time = N questions × 30–60s per question ÷ parallelism factor (typically 4–8x)
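As a worked instance of the formula above, the sketch below turns it into a best-case and worst-case range. The function name is hypothetical and the defaults are simply the 30–60s and 4–8x figures quoted in this step; swap in the per-model numbers when the judge model is known.

```python
def estimate_minutes(
    n_questions: int,
    per_question_s: tuple[float, float] = (30, 60),
    parallelism: tuple[float, float] = (4, 8),
) -> tuple[float, float]:
    """Return (best_case, worst_case) wall-clock minutes for the evaluation."""
    fast_s, slow_s = per_question_s
    low_parallel, high_parallel = parallelism
    # Best case: fastest questions at the highest parallelism.
    best = n_questions * fast_s / high_parallel / 60
    # Worst case: slowest questions at the lowest parallelism.
    worst = n_questions * slow_s / low_parallel / 60
    return best, worst


best, worst = estimate_minutes(100)
print(f"approximately {best:.0f}-{worst:.0f} minutes")
```

For 100 questions this gives 100 × 30 ÷ 8 = 375s (6.25 minutes) at best and 100 × 60 ÷ 4 = 1500s (25 minutes) at worst.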
- Tell the user before starting:
"This dataset has N questions. At ~30–60s per question with typical parallelism, evaluation will take approximately X–Y minutes. I'll run it as a background task so you can continue working, and I'll summarize the results when it's done."
4b. Generate the Evaluation Script
# Generate evaluation script (specify module and entry point)
uv run python scripts/run_evaluation_template.py \
--module mlflow_agent.agent \
--entry-point run_agent
The generated script creates a wrapper function that:
- Accepts keyword arguments matching the dataset's input keys
- Provides any additional arguments the agent needs (like llm_provider)
- Runs mlflow.genai.evaluate(data=df, predict_fn=wrapper, scorers=registered_scorers)
- Saves results to evaluation_results.csv
⚠️ CRITICAL: wrapper Signature Must Match Dataset Input Keys
MLflow calls predict_fn(**inputs) - it unpacks the inputs dict as keyword arguments.
| Dataset Record | MLflow Calls | predict_fn Must Be |
|---|---|---|
| {"inputs": {"query": "..."}} | predict_fn(query="...") | def wrapper(query): |
| {"inputs": {"question": "...", "context": "..."}} | predict_fn(question="...", context="...") | def wrapper(question, context): |
Common Mistake (WRONG):
def wrapper(inputs):  # ❌ WRONG - inputs is NOT a dict
    return agent(inputs["query"])
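For contrast, a correct wrapper takes the dataset's input keys as keyword parameters. The sketch below does not call MLflow; it imitates the predict_fn(**inputs) call described above, and agent here is a hypothetical stand-in for the real entry point.

```python
def agent(query: str, llm_provider: str = "stub") -> str:
    """Hypothetical stand-in for the real agent entry point."""
    return f"[{llm_provider}] answer to: {query}"


def wrapper(query: str) -> str:
    # The parameter name matches the dataset's input key "query".
    return agent(query, llm_provider="stub")


def wrong_wrapper(inputs):
    # Expects a dict, but keyword unpacking never passes one.
    return agent(inputs["query"])


record = {"inputs": {"query": "What is tracing?"}}

# Imitates the call described above: predict_fn(**inputs)
output = wrapper(**record["inputs"])
print(output)

try:
    wrong_wrapper(**record["inputs"])
    wrong_error = None
except TypeError as exc:
    wrong_error = str(exc)
print("wrong wrapper failed with:", wrong_error)
```

The wrong variant fails with a TypeError because the unpacked key arrives as an unexpected keyword argument.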
4c. Launch as a Background Sub-Agent
Run the evaluation as a background sub-agent so the main session stays available. Use the Agent tool with run_in_background: true:
Sub-agent instructions (pass these verbatim):
Run the agent evaluation and write results to scratchpad.
Steps:
1. cd <project-directory>
2. Run: uv run python run_agent_evaluation.py
3. When complete, write a summary to scratchpad/eval-results.md with:
- Exit status (success or error message)
- Path to results file (evaluation_results.csv)
- Wall-clock time taken
4. Return only: "Evaluation complete. Results written to scratchpad/eval-results.md"
In the main session, poll for completion by checking for the scratchpad file rather than blocking:
# Poll every 30s using Glob
# Glob("scratchpad/eval-results.md")
# When the file appears, read it and proceed to analysis
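Outside an agent session, the same poll-for-a-file pattern can be written in plain Python. This is only an analogue of the Glob polling described above, with a hypothetical wait_for_file helper; the demo uses a temporary directory and very short intervals so it finishes immediately.

```python
import tempfile
import time
from pathlib import Path


def wait_for_file(path: Path, interval_s: float = 30.0, timeout_s: float = 3600.0) -> bool:
    """Poll until path exists; return False if timeout_s elapses first."""
    deadline = time.monotonic() + timeout_s
    while not path.exists():
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        time.sleep(min(interval_s, remaining))
    return True


# Demo with short timings: the marker is missing at first, then present.
with tempfile.TemporaryDirectory() as tmp:
    marker = Path(tmp) / "eval-results.md"
    found_before = wait_for_file(marker, interval_s=0.01, timeout_s=0.05)
    marker.write_text("Evaluation complete")
    found_after = wait_for_file(marker, interval_s=0.01, timeout_s=0.05)
print(found_before, found_after)
```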
Do NOT use TaskOutput to wait for the background agent — that dumps the full transcript (~10–20k tokens) into the main context.
4d. Analyze Results (after evaluation completes)
Once scratchpad/eval-results.md appears, run analysis:
# Pattern detection, failure analysis, recommendations
# Reads the CSV produced by mlflow.genai.evaluate() above
uv run python scripts/analyze_results.py evaluation_results.csv
Generates evaluation_report.md with per-scorer pass rates and improvement suggestions.
The script reads {scorer_name}/value and {scorer_name}/rationale columns from the CSV.
It also accepts the legacy JSON format from mlflow traces evaluate for backward compatibility:
uv run python scripts/analyze_results.py evaluation_results.json # legacy format
uv run python scripts/analyze_results.py evaluation_results.csv --output my_report.md # custom output
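To make the column convention concrete, the sketch below computes a per-scorer pass rate from {scorer_name}/value columns using only the standard library. It is not the implementation of analyze_results.py; the sample data is invented, and it assumes scorer values are the strings "yes" and "no", with blank cells skipped.

```python
import csv
import io
from collections import defaultdict

VALUE_SUFFIX = "/value"


def pass_rates(csv_text: str) -> dict[str, float]:
    """Fraction of 'yes' values per '{scorer_name}/value' column."""
    counts = defaultdict(lambda: [0, 0])  # scorer -> [yes_count, total]
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, value in row.items():
            if column is None or not column.endswith(VALUE_SUFFIX) or not value:
                continue  # skip non-scorer columns and blank cells
            scorer = column[: -len(VALUE_SUFFIX)]
            counts[scorer][1] += 1
            counts[scorer][0] += value.strip().lower() == "yes"
    return {scorer: yes / total for scorer, (yes, total) in counts.items()}


# Invented sample in the shape described above.
SAMPLE = """query,relevance/value,relevance/rationale,safety/value
q1,yes,on topic,yes
q2,no,off topic,yes
q3,yes,on topic,
"""
rates = pass_rates(SAMPLE)
print(rates)
```

In the sample, relevance passes 2 of 3 rows and safety passes 2 of 2, because the blank safety cell is not counted.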
References
Detailed guides in references/ (load as needed):
- setup-guide.md - Environment setup (MLflow install, tracking URI configuration)
- Tracing: Use the instrumenting-with-mlflow-tracing skill (authoritative guide for autolog, decorators, session tracking, verification)
- dataset-preparation.md - Dataset schema, APIs, creation, Unity Catalog
- scorers.md - Built-in vs custom scorers, registration, testing
- scorers-constraints.md - CLI requirements for custom scorers (yes/no format, templates)
- troubleshooting.md - Common errors by phase with solutions
- throughput-guide.md - Parallelism env vars, async predict_fn, dataset splitting for 200+ question evals
Scripts are self-documenting - run with --help for usage details.
License: Apache-2.0 (quoted in full because the license is permissive)
Details
- Author: mlflow
- Repository: mlflow/skills
- License: Apache-2.0
- Last updated: unknown
Source: https://github.com/mlflow/skills / License: Apache-2.0