Anthropic ClaudeLLM・AI開発⭐ リポ 0品質スコア 50/100

vision-multimodal

Name: vision-multimodal
Author: lobbi-docs

画像分析・PDF処理・ドキュメント理解などのビジョンおよびマルチモーダル機能をClaudeに提供します。画像入力、base64エンコード、複数画像の処理、視覚的な分析が必要な場面で活用できます。

description の原文を見る

Vision and multimodal capabilities for Claude including image analysis, PDF processing, and document understanding. Activate for image input, base64 encoding, multiple images, and visual analysis.

SKILL.md 本文

ビジョン & マルチモーダルスキル

Claude のビジョン機能を活用して、画像分析、ドキュメント処理、マルチモーダル理解を実現します。

このスキルを使う場合

画像分析と説明
ドキュメント/PDF 処理
スクリーンショット分析
OCR 的なテキスト抽出
ビジュアル比較
グラフと図表の解釈

サポートされているフォーマット

フォーマット	ステータス	最適用途
JPEG	✓	写真、自然のシーン
PNG	✓	スクリーンショット、UI、テキスト
GIF	✓	アニメーション（最初のフレーム）
WebP	✓	モダン、圧縮
PDF	✓	ドキュメント（Files API 経由）

画像サイズガイドライン

最小: 200 ピクセル（小さいほど精度が低下）
最適: 1000x1000 ピクセル
最大: 8000x8000 ピクセル
トークンコスト: 約（幅 × 高さ）/ 1000
ヒント: 最大寸法を 1568px にリサイズして、30～50% のトークン節約が可能

コアパターン

パターン 1: 単一画像分析

import anthropic
import base64

client = anthropic.Anthropic()

# 画像をロードしてエンコード
with open("image.jpg", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/jpeg",
                    "data": image_data
                }
            },
            {
                "type": "text",
                "text": "Describe this image in detail."
            }
        ]
    }]
)

パターン 2: URL からの画像

import httpx

# URL からフェッチしてエンコード
image_url = "https://example.com/image.jpg"
response = httpx.get(image_url)
image_data = base64.standard_b64encode(response.content).decode("utf-8")

# 上記と同じパターンを使用

パターン 3: 複数画像

# 複数画像を比較（リクエストあたり最大 100 枚）
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": image1}},
        {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": image2}},
        {"type": "text", "text": "Compare these two images and list the differences."}
    ]
}]

パターン 4: 画像を使った Few-Shot

# 例で教える
messages = [
    # 例 1
    {"role": "user", "content": [
        {"type": "image", "source": {...}},
        {"type": "text", "text": "Classify this image."}
    ]},
    {"role": "assistant", "content": "Category: Landscape\nElements: Mountains, lake, trees"},

    # 例 2
    {"role": "user", "content": [
        {"type": "image", "source": {...}},
        {"type": "text", "text": "Classify this image."}
    ]},
    {"role": "assistant", "content": "Category: Portrait\nElements: Person, indoor, professional"},

    # ターゲット画像
    {"role": "user", "content": [
        {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": target_image}},
        {"type": "text", "text": "Classify this image."}
    ]}
]

パターン 5: PDF 処理

# Files API を使用（ベータ版）
with open("document.pdf", "rb") as f:
    pdf_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {
                    "type": "base64",
                    "media_type": "application/pdf",
                    "data": pdf_data
                }
            },
            {"type": "text", "text": "Summarize this document."}
        ]
    }]
)

ビジョンのプロンプトエンジニアリング

戦略 1: ロール割り当て

prompt = """You have perfect vision and exceptional attention to detail,
making you an expert at analyzing technical diagrams.

Analyze this architecture diagram and identify:
1. All components
2. Data flow between components
3. Potential bottlenecks"""

戦略 2: ステップバイステップの思考

prompt = """Before answering, analyze the image systematically:

<thinking>
1. What is the overall subject?
2. What are the key elements?
3. How do elements relate to each other?
4. What details stand out?
</thinking>

Then provide your answer based on this analysis."""

戦略 3: 構造化された出力

prompt = """Extract information from this receipt and return as JSON:

{
    "vendor": "",
    "date": "",
    "items": [{"name": "", "price": 0}],
    "total": 0
}"""

画像の最適化

from PIL import Image
import io

def optimize_for_claude(image_path, max_dimension=1568):
    """トークン使用量を 30～50% 削減するために画像をリサイズ"""
    with Image.open(image_path) as img:
        # 新しい寸法を計算
        ratio = min(max_dimension / img.width, max_dimension / img.height)
        if ratio < 1:
            new_size = (int(img.width * ratio), int(img.height * ratio))
            img = img.resize(new_size, Image.LANCZOS)

        # バイト列に変換
        buffer = io.BytesIO()
        img.save(buffer, format="JPEG", quality=85)
        return base64.standard_b64encode(buffer.getvalue()).decode("utf-8")

一般的な使用例

テキスト抽出（OCR 的）

prompt = """Extract all text from this image.
Preserve the original formatting and structure as much as possible.
If text is unclear, indicate with [unclear]."""

テーブル抽出

prompt = """Extract the table data from this image.
Return as a markdown table with proper headers and alignment."""

グラフ分析

prompt = """Analyze this chart:
1. What type of chart is this?
2. What are the axes/labels?
3. What are the key data points?
4. What trends or patterns are visible?"""

ベストプラクティス

すべきこと:

高品質な画像を使用（≥1000px）
トークン節約のため大きな画像をリサイズ
何を探すべきかについてコンテキストを提供
一貫した出力のため few-shot の例を使用

してはいけないこと:

200px より小さい画像を送信
手書きに対する完璧な OCR を期待
非常に大きな画像を送信（>8000px）
複数画像のトークンコストを無視

制限事項

特定の個人を識別できない
非常に小さなテキストではうまくいかないことがある
アニメーション GIF: 最初のフレームのみ分析
一部の特殊な記号は誤読される可能性がある

参照

[[llm-integration]] - API の基本
[[extended-thinking]] - 複雑な推論
[[citations-retrieval]] - ドキュメント引用

ライセンス: MIT(寛容ライセンスのため全文を引用しています) · 原本リポジトリ

詳細情報

作者: lobbi-docs
リポジトリ: lobbi-docs/claude
ライセンス: MIT
最終更新: 不明

GitHubで原本を見る →フィードバックを送る

Source: https://github.com/lobbi-docs/claude / ライセンス: MIT