Instructor: Structured LLM Outputs

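When to Use This Skill

Use Instructor when you need to:

Reliably extract structured data from LLM responses
Automatically validate outputs against a Pydantic schema
Retry failed extractions automatically, with validation errors fed back to the model
Parse complex JSON with type safety and validation
Stream partial results for real-time processing
Support multiple LLM providers through one consistent API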

GitHub stars: 15,000+ | Battle-tested by 100,000+ developers

Installation

# Base installation
pip install instructor

# Install with provider-specific extras
pip install "instructor[anthropic]"  # Anthropic Claude
pip install "instructor[openai]"     # OpenAI
pip install "instructor[all]"        # All providers

Quick Start

Basic Example: Extracting User Data

import instructor
from pydantic import BaseModel
from anthropic import Anthropic

# Define the output structure
class User(BaseModel):
    name: str
    age: int
    email: str

# Create the Instructor client
client = instructor.from_anthropic(Anthropic())

# Extract structured data
user = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "John Doe is 30 years old. His email is john@example.com"
    }],
    response_model=User
)

print(user.name)   # "John Doe"
print(user.age)    # 30
print(user.email)  # "john@example.com"

With OpenAI

from openai import OpenAI

client = instructor.from_openai(OpenAI())

user = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=User,
    messages=[{"role": "user", "content": "Extract: Alice, 25, alice@email.com"}]
)

Core Concepts

1. Response Models (Pydantic)

A response model defines the structure of the LLM output and the validation rules it must satisfy.

Basic Model

from pydantic import BaseModel, Field

class Article(BaseModel):
    title: str = Field(description="Article title")
    author: str = Field(description="Author name")
    word_count: int = Field(description="Number of words", gt=0)
    tags: list[str] = Field(description="List of relevant tags")

article = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Analyze this article: [article text]"
    }],
    response_model=Article
)

Benefits:

Type safety through Python type hints
Automatic validation (word_count > 0)
Self-documenting via Field descriptions
IDE autocompletion
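
The automatic validation listed above can be checked locally without calling an LLM. The following is a minimal sketch, assuming Pydantic v2 and reusing the Article model from this section; the two payloads are made-up stand-ins for raw model output.

```python
from pydantic import BaseModel, Field, ValidationError


class Article(BaseModel):
    title: str = Field(description="Article title")
    author: str = Field(description="Author name")
    word_count: int = Field(description="Number of words", gt=0)
    tags: list[str] = Field(description="List of relevant tags")


# Hypothetical payloads standing in for raw LLM output.
good = {"title": "Intro", "author": "Ann", "word_count": 850, "tags": ["ai"]}
bad = {**good, "word_count": 0}  # violates gt=0

article = Article.model_validate(good)

try:
    Article.model_validate(bad)
    failed_fields = []
except ValidationError as exc:
    # Each error entry records the path of the offending field.
    failed_fields = [err["loc"][0] for err in exc.errors()]

print(article.word_count)
print(failed_fields)
```

The field names collected in failed_fields are the kind of detail a retry can hand back to the model.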

Nested Models

class Address(BaseModel):
    street: str
    city: str
    country: str

class Person(BaseModel):
    name: str
    age: int
    address: Address  # Nested model

person = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "John lives at 123 Main St, Boston, USA"
    }],
    response_model=Person
)

print(person.address.city)  # "Boston"

Optional Fields

from typing import Optional

class Product(BaseModel):
    name: str
    price: float
    discount: Optional[float] = None  # Optional
    description: str = Field(default="No description")  # Default value

# The LLM does not have to supply discount or description
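
To see what happens when the optional fields are left out, the model can be validated against a partial payload. This is a small sketch assuming Pydantic v2; the payload is invented for illustration.

```python
from typing import Optional

from pydantic import BaseModel, Field


class Product(BaseModel):
    name: str
    price: float
    discount: Optional[float] = None
    description: str = Field(default="No description")


# Hypothetical LLM output that omits both optional fields.
product = Product.model_validate({"name": "Widget", "price": 9.99})

print(product.discount)
print(product.description)
# model_fields_set lists only the fields the payload actually supplied.
print(sorted(product.model_fields_set))
```

Checking model_fields_set is one way to tell a value the model extracted apart from a default that was filled in.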

Enums for Constrained Values

from enum import Enum

class Sentiment(str, Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    NEUTRAL = "neutral"

class Review(BaseModel):
    text: str
    sentiment: Sentiment  # Only these three values are allowed

review = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "This product is amazing!"
    }],
    response_model=Review
)

print(review.sentiment)  # Sentiment.POSITIVE
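
The constraint can be exercised directly with Pydantic. In this sketch (Pydantic v2 assumed, payloads invented), a known string is coerced to the enum member and an unknown one is rejected, which is the failure that would trigger a retry.

```python
from enum import Enum

from pydantic import BaseModel, ValidationError


class Sentiment(str, Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    NEUTRAL = "neutral"


class Review(BaseModel):
    text: str
    sentiment: Sentiment


# A valid string value is converted to the matching enum member.
ok = Review.model_validate({"text": "Great!", "sentiment": "positive"})

# A value outside the enum fails validation.
try:
    Review.model_validate({"text": "Meh", "sentiment": "mixed"})
    rejected = False
except ValidationError:
    rejected = True

print(ok.sentiment is Sentiment.POSITIVE, rejected)
```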

2. Validation

Pydantic validates the LLM output automatically. When validation fails, Instructor retries the request.

Built-in Validators

from pydantic import BaseModel, Field, EmailStr, HttpUrl  # EmailStr needs: pip install "pydantic[email]"

class Contact(BaseModel):
    name: str = Field(min_length=2, max_length=100)
    age: int = Field(ge=0, le=120)  # 0 <= age <= 120
    email: EmailStr  # Validates email format
    website: HttpUrl  # Validates URL format

# If the LLM returns invalid data, Instructor retries automatically

Custom Validators

import re

from pydantic import field_validator

class Event(BaseModel):
    name: str
    date: str
    attendees: int

    @field_validator('date')
    @classmethod
    def validate_date(cls, v):
        """Ensure the date is in YYYY-MM-DD format."""
        # fullmatch rejects trailing characters that re.match would accept
        if not re.fullmatch(r'\d{4}-\d{2}-\d{2}', v):
            raise ValueError('Date must be YYYY-MM-DD format')
        return v

    @field_validator('attendees')
    @classmethod
    def validate_attendees(cls, v):
        """Ensure there is at least one attendee."""
        if v < 1:
            raise ValueError('Must have at least 1 attendee')
        return v
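
The messages raised inside custom validators are what a failed attempt surfaces. As a rough sketch (Pydantic v2 assumed; error_messages is a hypothetical helper and the payloads are invented), the validators can be run locally to see those messages:

```python
import re

from pydantic import BaseModel, ValidationError, field_validator


class Event(BaseModel):
    name: str
    date: str
    attendees: int

    @field_validator("date")
    @classmethod
    def validate_date(cls, v: str) -> str:
        if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", v):
            raise ValueError("Date must be YYYY-MM-DD format")
        return v

    @field_validator("attendees")
    @classmethod
    def validate_attendees(cls, v: int) -> int:
        if v < 1:
            raise ValueError("Must have at least 1 attendee")
        return v


def error_messages(payload: dict) -> list[str]:
    """Hypothetical helper: return validation messages, or [] if valid."""
    try:
        Event.model_validate(payload)
        return []
    except ValidationError as exc:
        return [err["msg"] for err in exc.errors()]


# Both fields are wrong, so both validators report an error.
msgs = error_messages({"name": "Launch", "date": "12/01/2025", "attendees": 0})
for msg in msgs:
    print(msg)
```

Writing the ValueError text as an instruction ("Date must be YYYY-MM-DD format") tends to be more useful as feedback than a bare "invalid".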

Model-Level Validation

from datetime import datetime

from pydantic import model_validator

class DateRange(BaseModel):
    start_date: str
    end_date: str

    @model_validator(mode='after')
    def check_dates(self):
        """Ensure end_date is not earlier than start_date."""
        start = datetime.strptime(self.start_date, '%Y-%m-%d')
        end = datetime.strptime(self.end_date, '%Y-%m-%d')

        if end < start:
            raise ValueError('end_date must not be before start_date')
        return self
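
A possible variation, not taken from the original, is to declare the fields as datetime.date so that Pydantic parses the ISO strings itself and the model validator only compares values. A minimal sketch assuming Pydantic v2:

```python
from datetime import date

from pydantic import BaseModel, ValidationError, model_validator


class DateRange(BaseModel):
    start_date: date  # Pydantic parses "YYYY-MM-DD" strings into date objects
    end_date: date

    @model_validator(mode="after")
    def check_dates(self) -> "DateRange":
        if self.end_date < self.start_date:
            raise ValueError("end_date must not be before start_date")
        return self


valid = DateRange.model_validate(
    {"start_date": "2025-01-01", "end_date": "2025-01-31"}
)

# A reversed range is rejected by the cross-field check.
try:
    DateRange.model_validate(
        {"start_date": "2025-02-01", "end_date": "2025-01-01"}
    )
    reversed_accepted = True
except ValidationError:
    reversed_accepted = False

print((valid.end_date - valid.start_date).days, reversed_accepted)
```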

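3. Automatic Retries

When validation fails, Instructor retries automatically and passes the validation error back to the LLM as feedback.

# Retry up to 3 times on validation failure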
user = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Extract user from: John, age unknown"
    }],
    response_model=User,
    max_retries=3  # Default is 3
)

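# If age cannot be extracted, Instructor tells the LLM something like:
# "Validation error: age - field required"
# The LLM then tries again using that feedback

How it works:

The LLM generates output
Pydantic validates it
If invalid, the error message is sent back to the LLM
The LLM retries with the error feedback
This repeats until validation succeeds or max_retries is reached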
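
The loop described above can be sketched in plain Python. This is an illustration of the idea only, not Instructor's actual implementation: extract_with_retries and fake_llm are hypothetical names, and the fake model simply returns a bad answer first and a good one after it has seen feedback.

```python
import json
from typing import Callable

from pydantic import BaseModel, ValidationError


class User(BaseModel):
    name: str
    age: int


def extract_with_retries(
    llm: Callable[[list[dict]], str],
    messages: list[dict],
    max_retries: int = 3,
) -> User:
    """Illustrative validate-and-retry loop (hypothetical helper)."""
    history = list(messages)
    last_error = None
    for _ in range(max_retries):
        raw = llm(history)  # step 1: the LLM generates output
        try:
            return User.model_validate_json(raw)  # step 2: Pydantic validates
        except ValidationError as exc:
            last_error = exc
            # steps 3-4: send the error back so the next attempt can fix it
            history.append({"role": "assistant", "content": raw})
            history.append(
                {"role": "user", "content": f"Validation error, please fix:\n{exc}"}
            )
    raise RuntimeError(f"Gave up after {max_retries} attempts") from last_error


def fake_llm(history: list[dict]) -> str:
    """Stand-in for a real model: wrong at first, right after feedback."""
    if len(history) == 1:
        return json.dumps({"name": "John", "age": "unknown"})
    return json.dumps({"name": "John", "age": 30})


user = extract_with_retries(
    fake_llm, [{"role": "user", "content": "Extract user from: John, 30"}]
)
print(user)
```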

4. Streaming

Stream partial results for real-time processing.

Streaming Partial Objects

class Story(BaseModel):
    title: str
    content: str
    tags: list[str]

# Stream partial updates while the LLM is generating
for partial_story in client.messages.create_partial(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Write a short sci-fi story"
    }],
    response_model=Story
):
    # Fields stay None until the model has streamed them
    print(f"Title: {partial_story.title}")
    print(f"Content so far: {(partial_story.content or '')[:100]}...")
    # Update the UI in real time

Streaming Iterables

class Task(BaseModel):
    title: str
    priority: str

# Stream list items as they are generated
tasks = client.messages.create_iterable(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Generate 10 project tasks"
    }],
    response_model=Task
)

for task in tasks:
    print(f"- {task.title} ({task.priority})")
    # Process each task as it arrives

Provider Configuration

Anthropic Claude

import instructor
from anthropic import Anthropic

client = instructor.from_anthropic(
    Anthropic(api_key="your-api-key")
)

# Use with Claude models
response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[...],
    response_model=YourModel
)

OpenAI

from openai import OpenAI

client = instructor.from_openai(
    OpenAI(api_key="your-api-key")
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=YourModel,
    messages=[...]
)

Local Models (Ollama)

from openai import OpenAI

# Point the client at a local Ollama server
client = instructor.from_openai(
    OpenAI(
        base_url="http://localhost:11434/v1",
        api_key="ollama"  # Required but ignored
    ),
    mode=instructor.Mode.JSON
)

response = client.chat.completions.create(
    model="llama3.1",
    response_model=YourModel,
    messages=[...]
)

Common Patterns

Pattern 1: Extracting Data from Text

class CompanyInfo(BaseModel):
    name: str
    founded_year: int
    industry: str
    employees: int
    headquarters: str

text = """
Tesla, Inc. was founded in 2003. It operates in the automotive and energy
industry with approximately 140,000 employees. The company is headquartered
in Austin, Texas.
"""

company = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"Extract company information from: {text}"
    }],
    response_model=CompanyInfo
)

Pattern 2: Classification

class Category(str, Enum):
    TECHNOLOGY = "technology"
    FINANCE = "finance"
    HEALTHCARE = "healthcare"
    EDUCATION = "education"
    OTHER = "other"

class ArticleClassification(BaseModel):
    category: Category
    confidence: float = Field(ge=0.0, le=1.0)
    keywords: list[str]

classification = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Classify this article: [article text]"
    }],
    response_model=ArticleClassification
)

Pattern 3: Multi-Entity Extraction

class Person(BaseModel):
    name: str
    role: str

class Organization(BaseModel):
    name: str
    industry: str

class Entities(BaseModel):
    people: list[Person]
    organizations: list[Organization]
    locations: list[str]

text = "Tim Cook, CEO of Apple, announced at the event in Cupertino..."

entities = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"Extract all entities from: {text}"
    }],
    response_model=Entities
)

for person in entities.people:
    print(f"{person.name} - {person.role}")

Pattern 4: Structured Analysis

class SentimentAnalysis(BaseModel):
    overall_sentiment: Sentiment
    positive_aspects: list[str]
    negative_aspects: list[str]
    suggestions: list[str]
    score: float = Field(ge=-1.0, le=1.0)

review = "The product works well but setup was confusing..."

analysis = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"Analyze this review: {review}"
    }],
    response_model=SentimentAnalysis
)

Pattern 5: Batch Processing

def extract_person(text: str) -> Person:
    return client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Extract person from: {text}"
        }],
        response_model=Person
    )

texts = [
    "John Doe is a 30-year-old engineer",
    "Jane Smith, 25, works in marketing",
    "Bob Johnson, age 40, software developer"
]

people = [extract_person(text) for text in texts]
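
The list comprehension above runs one request at a time. Because each call mostly waits on the network, one option is to overlap the calls with a thread pool. The sketch below uses only the standard library, and its extract_person is a hypothetical stub that parses the name locally so the example runs without an API key; in practice it would be the LLM-backed function defined above.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_person(text: str) -> dict:
    """Hypothetical stub standing in for the LLM-backed extract_person."""
    name = text.split(",")[0].split(" is ")[0]
    return {"name": name}


texts = [
    "John Doe is a 30-year-old engineer",
    "Jane Smith, 25, works in marketing",
    "Bob Johnson, age 40, software developer",
]

# pool.map keeps results in the same order as the inputs.
with ThreadPoolExecutor(max_workers=4) as pool:
    people = list(pool.map(extract_person, texts))

for person in people:
    print(person["name"])
```

The max_workers value is an arbitrary choice here; a real deployment would size it against the provider's rate limits.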

Advanced Features

Union Types

from typing import Union

class TextContent(BaseModel):
    type: str = "text"
    content: str

class ImageContent(BaseModel):
    type: str = "image"
    url: HttpUrl
    caption: str

class Post(BaseModel):
    title: str
    content: Union[TextContent, ImageContent]  # Either type

# The LLM picks the appropriate type based on the content
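
One way to make the choice between union members explicit, offered here as a variation rather than something the original prescribes, is a Pydantic discriminated union: each member pins its type field with Literal, and the parent field names that discriminator. A sketch assuming Pydantic v2, with an invented payload:

```python
from typing import Literal, Union

from pydantic import BaseModel, Field, HttpUrl


class TextContent(BaseModel):
    type: Literal["text"] = "text"
    content: str


class ImageContent(BaseModel):
    type: Literal["image"] = "image"
    url: HttpUrl
    caption: str


class Post(BaseModel):
    title: str
    # The "type" value decides which member model is used for validation.
    content: Union[TextContent, ImageContent] = Field(discriminator="type")


post = Post.model_validate(
    {
        "title": "Sunset",
        "content": {
            "type": "image",
            "url": "https://example.com/sunset.png",
            "caption": "Evening sky",
        },
    }
)
print(type(post.content).__name__)
```

With a discriminator, a validation error points at the one member that was selected instead of listing failures for every member of the union.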

Dynamic Models

from pydantic import create_model

# Create a model at runtime
DynamicUser = create_model(
    'User',
    name=(str, ...),
    age=(int, Field(ge=0)),
    email=(EmailStr, ...)
)

user = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[...],
    response_model=DynamicUser
)
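
Dynamic models are most useful when the field list is only known at runtime. As a small sketch (Pydantic v2 assumed; field_spec is a hypothetical mapping, for instance loaded from configuration), the model can be built from a dictionary and checked locally:

```python
from pydantic import BaseModel, create_model

# Hypothetical runtime field specification: field name -> Python type.
field_spec = {"name": str, "age": int, "city": str}

# (type, ...) marks each field as required.
DynamicUser = create_model(
    "DynamicUser",
    **{field: (ftype, ...) for field, ftype in field_spec.items()},
)

user = DynamicUser.model_validate({"name": "Alice", "age": "25", "city": "Paris"})

print(user.age)
print(sorted(DynamicUser.model_fields))
```

The resulting class is a regular BaseModel subclass, so it can be passed as response_model like any statically defined model.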

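Custom Modes

# For providers without native structured output
client = instructor.from_anthropic(
    Anthropic(),
    mode=instructor.Mode.ANTHROPIC_JSON  # JSON mode for Anthropic clients
)

# Available modes include:
# - Mode.ANTHROPIC_TOOLS (recommended for Claude)
# - Mode.ANTHROPIC_JSON (JSON fallback for Claude)
# - Mode.TOOLS (OpenAI tool calling)
# - Mode.JSON (JSON fallback for OpenAI-compatible clients)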

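Context Management

# Single-use client: the underlying Anthropic client is a context manager
with Anthropic() as anthropic_client:
    client = instructor.from_anthropic(anthropic_client)
    result = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,
        messages=[...],
        response_model=YourModel
    )
    # The HTTP client is closed automatically on exit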

Error Handling

Handling Validation Errors

from instructor.exceptions import InstructorRetryException

try:
    user = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,
        messages=[...],
        response_model=User,
        max_retries=3
    )
except InstructorRetryException as e:
    # Raised once all retries are exhausted
    print(f"Failed after retries: {e}")
    # Handle appropriately
except Exception as e:
    print(f"API error: {e}")

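Custom Error Messages

from pydantic import ConfigDict

class ValidatedUser(BaseModel):
    name: str = Field(description="Full name, 2-100 characters")
    age: int = Field(description="Age between 0 and 120", ge=0, le=120)
    email: EmailStr = Field(description="Valid email address")

    # Pydantic v2 style config. The examples are embedded in the JSON schema
    # and, together with the Field descriptions, guide the LLM toward valid output.
    model_config = ConfigDict(
        json_schema_extra={
            "examples": [
                {
                    "name": "John Doe",
                    "age": 30,
                    "email": "john@example.com"
                }
            ]
        }
    )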

Best Practices

1. Write Clear Field Descriptions

# ❌ Bad: ambiguous
class Product(BaseModel):
    name: str
    price: float

# ✅ Good: descriptive
class Product(BaseModel):
    name: str = Field(description="Product name from the text")
    price: float = Field(description="Price in USD, without currency symbol")

2. Use Appropriate Validation

# ✅ Good: constrain the values
class Rating(BaseModel):
    score: int = Field(ge=1, le=5, description="Rating from 1 to 5 stars")
    review: str = Field(min_length=10, description="Review text, at least 10 chars")

3. Provide Examples in the Prompt

messages = [{
    "role": "user",
    "content": """Extract person info from: "John, 30, engineer"

Example format:
{
  "name": "John Doe",
  "age": 30,
  "occupation": "engineer"
}"""
}]
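
To keep the example in the prompt from drifting away from the schema, the example JSON can be generated from a model instance instead of being typed by hand. This is a sketch assuming Pydantic v2; PersonInfo and build_prompt are hypothetical names introduced for illustration.

```python
from pydantic import BaseModel


class PersonInfo(BaseModel):
    """Hypothetical model matching the example format shown above."""
    name: str
    age: int
    occupation: str


def build_prompt(text: str, example: BaseModel) -> str:
    """Hypothetical helper: embed a schema-consistent example in the prompt."""
    example_json = example.model_dump_json(indent=2)
    return f'Extract person info from: "{text}"\n\nExample format:\n{example_json}'


example = PersonInfo(name="John Doe", age=30, occupation="engineer")
messages = [{
    "role": "user",
    "content": build_prompt("John, 30, engineer", example),
}]

print(messages[0]["content"])
```

If a field is later renamed in PersonInfo, the example in the prompt changes with it.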

4. Use Enums for Fixed Categories

# ✅ Good: the Enum guarantees valid values
class Status(str, Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"

class Application(BaseModel):
    status: Status  # The LLM must choose a value from the enum

5. Handle Missing Data Gracefully

class PartialData(BaseModel):
    required_field: str
    optional_field: Optional[str] = None
    default_field: str = "default_value"

# The LLM only needs to supply required_field

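Comparison with Alternatives

Feature	Instructor	Manual JSON	LangChain	DSPy
Type safety	✅ Yes	❌ No	⚠️ Partial	✅ Yes
Automatic validation	✅ Yes	❌ No	❌ No	⚠️ Limited
Automatic retries	✅ Yes	❌ No	❌ No	✅ Yes
Streaming	✅ Yes	❌ No	✅ Yes	❌ No
Multi-provider	✅ Yes	⚠️ Manual	✅ Yes	✅ Yes
Learning curve	Low	Low	Medium	High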

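Choose Instructor when:

You need structured, validated output
You need type safety and IDE support
You need automatic retries
You are building a data extraction system

Choose an alternative when:

DSPy: you need prompt optimization
LangChain: you are building complex chains
Manual: you have a simple, one-off extraction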

Resources

Documentation: https://python.useinstructor.com
GitHub: https://github.com/jxnl/instructor (15k+ stars)
Cookbook: https://python.useinstructor.com/examples
Discord: community support available
