llava

Name: llava
Author: davila7

Large Language and Vision Assistant. Enables visual instruction tuning and image-based conversations. Combines CLIP vision encoder with Vicuna/LLaMA language models. Supports multi-turn image chat, visual question answering, and instruction following. Use for vision-language chatbots or image understanding tasks. Best for conversational image analysis.

SKILL.md body

LLaVA - Large Language and Vision Assistant

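An open-source vision-language model with conversational image understanding.

When to use LLaVA

Use LLaVA for:

Building vision-language chatbots
Visual question answering (VQA)
Image description and captioning
Multi-turn image conversations
Following visual instructions
Understanding documents that contain images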

Metrics:

23,000+ GitHub stars
GPT-4V-level capability (stated goal)
Apache 2.0 license
Multiple model sizes (7B to 34B parameters)

Consider these alternatives instead:

GPT-4V: highest quality, API-based
CLIP: simple zero-shot classification
BLIP-2: well suited to caption generation
Flamingo: research model, not open source

Quick start

Installation

# Clone the repository
git clone https://github.com/haotian-liu/LLaVA
cd LLaVA

# Install the package in editable mode
pip install -e .

Basic usage

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates
from PIL import Image
import torch

# Load the model, tokenizer, and image processor
model_path = "liuhaotian/llava-v1.5-7b"
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path)
)

# Load the image (convert to RGB so grayscale or RGBA files also work)
image = Image.open("image.jpg").convert("RGB")
image_tensor = process_images([image], image_processor, model.config)
image_tensor = image_tensor.to(model.device, dtype=torch.float16)

# Build the conversation prompt
conv = conv_templates["llava_v1"].copy()
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nWhat is in this image?")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

# Generate the response
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        do_sample=True,
        temperature=0.2,
        max_new_tokens=512
    )

response = tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
print(response)
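
The `generate` call above samples with `temperature=0.2`. As a rough illustration of what that setting does, the sketch below applies generic softmax arithmetic to made-up logits. It is not LLaVA's internal sampling code, and the name `softmax_with_temperature` and the example logits are assumptions for illustration only. Dividing logits by a small temperature sharpens the next-token distribution; a larger temperature flattens it.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities after scaling by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens
logits = [2.0, 1.0, 0.5]

low = softmax_with_temperature(logits, 0.2)   # sharp: the top token dominates
high = softmax_with_temperature(logits, 0.7)  # flatter: more variety when sampling

print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

With these logits, the top token receives a noticeably larger share of the probability at 0.2 than at 0.7, which is why lower temperatures give more repeatable answers.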

Available models

Model	Parameters	VRAM (FP16)	Quality
LLaVA-v1.5-7B	7B	~14 GB	Good
LLaVA-v1.5-13B	13B	~28 GB	Better
LLaVA-v1.6-34B	34B	~70 GB	Best

# Hugging Face model IDs for each size
model_7b = "liuhaotian/llava-v1.5-7b"
model_13b = "liuhaotian/llava-v1.5-13b"
model_34b = "liuhaotian/llava-v1.6-34b"

# 4-bit quantization reduces VRAM
load_4bit = True  # pass to load_pretrained_model; cuts VRAM roughly 4x

CLI usage

# Single-image, one-shot query (llava.eval.run_llava accepts --query)
python -m llava.eval.run_llava \
    --model-path liuhaotian/llava-v1.5-7b \
    --image-file image.jpg \
    --query "What is in this image?"

# Multi-turn conversation
python -m llava.serve.cli \
    --model-path liuhaotian/llava-v1.5-7b \
    --image-file image.jpg
# Then type your questions interactively

Web UI (Gradio)

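# Launch the Gradio interface (three processes, one terminal each)
# 1. Controller
python -m llava.serve.controller --host 0.0.0.0 --port 10000
# 2. Gradio web server
python -m llava.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload
# 3. Model worker (--load-4bit is optional and reduces VRAM)
python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 \
    --port 40000 --worker http://localhost:40000 \
    --model-path liuhaotian/llava-v1.5-7b --load-4bit

# Open the URL printed by the Gradio server (http://localhost:7860 by default)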

Multi-turn conversation

# generate() is a user-defined helper (not part of LLaVA) that runs the
# tokenization and model.generate steps from Basic usage and returns the text.

# Initialize the conversation
conv = conv_templates["llava_v1"].copy()

# Turn 1
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nWhat is in this image?")
conv.append_message(conv.roles[1], None)
response1 = generate(conv, model, image)  # e.g. "A dog playing in a park"

# Turn 2
conv.messages[-1][1] = response1  # store the previous response in the history
conv.append_message(conv.roles[0], "What breed is the dog?")
conv.append_message(conv.roles[1], None)
response2 = generate(conv, model, image)  # e.g. "Golden Retriever"

# Turn 3
conv.messages[-1][1] = response2
conv.append_message(conv.roles[0], "What time of day is it?")
conv.append_message(conv.roles[1], None)
response3 = generate(conv, model, image)
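
The placeholder pattern above (append `None` for the assistant, then fill it in before the next turn) can be tried without a GPU. The `MiniConversation` class and `fake_generate` function below are hypothetical stand-ins, not LLaVA's `conv_templates` or a real model. They only mimic the `roles`, `messages`, `append_message`, and `get_prompt` shape used above so the bookkeeping is visible.

```python
class MiniConversation:
    """Hypothetical stand-in for a LLaVA conversation template."""

    def __init__(self, roles=("USER", "ASSISTANT")):
        self.roles = roles
        self.messages = []  # list of [role, text-or-None] pairs

    def append_message(self, role, text):
        self.messages.append([role, text])

    def get_prompt(self):
        # A None message renders as an open slot for the model to complete
        parts = []
        for role, text in self.messages:
            parts.append(f"{role}: {text}" if text is not None else f"{role}:")
        return " ".join(parts)


def fake_generate(conv):
    """Stand-in for model inference: reports how many user turns it saw."""
    user_turns = sum(1 for role, _ in conv.messages if role == conv.roles[0])
    return f"answer {user_turns}"


def chat_turn(conv, question, generate_fn):
    """Add a question, generate a reply, and store it in the placeholder."""
    conv.append_message(conv.roles[0], question)
    conv.append_message(conv.roles[1], None)  # open slot for the reply
    reply = generate_fn(conv)
    conv.messages[-1][1] = reply              # fill the slot before the next turn
    return reply


conv = MiniConversation()
first = chat_turn(conv, "<image>\nWhat is in this image?", fake_generate)
second = chat_turn(conv, "What breed is the dog?", fake_generate)
print(conv.get_prompt())
```

Filling the slot right after generation, rather than at the start of the next turn, is a small design choice that keeps the stored history complete at all times.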

Common tasks

# ask() in the snippets below is a user-defined helper (not part of LLaVA)
# that wraps the prompt-building and generation steps from Basic usage.

Image captioning

question = "Describe this image in detail."
response = ask(model, image, question)

Visual question answering

question = "How many people are in the image?"
response = ask(model, image, question)

Object detection (as text)

question = "List all the objects you can see in this image."
response = ask(model, image, question)

Scene understanding

question = "What is happening in this scene?"
response = ask(model, image, question)

Document understanding

question = "What is the main topic of this document?"
response = ask(model, document_image, question)

Training a custom model

# Stage 1: feature alignment (558K image-caption pairs)
bash scripts/v1_5/pretrain.sh

# Stage 2: visual instruction tuning (150K instruction data)
bash scripts/v1_5/finetune.sh

Quantization (VRAM reduction)

# 4-bit quantization
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path="liuhaotian/llava-v1.5-13b",
    model_base=None,
    model_name=get_model_name_from_path("liuhaotian/llava-v1.5-13b"),
    load_4bit=True  # cuts VRAM roughly 4x
)

# 8-bit quantization: pass this keyword instead of load_4bit
load_8bit=True  # cuts VRAM roughly 2x

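Best practices

Start with the 7B model - good quality with manageable VRAM
Use 4-bit quantization - cuts VRAM substantially
Use a GPU - CPU inference is very slow
Write clear prompts - specific questions get better answers
Multi-turn conversations - keep the conversation context
Temperature 0.2 to 0.7 - balances creativity and consistency
max_new_tokens 512 to 1024 - for detailed responses
Batch processing - process multiple images sequentially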
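
The last item above suggests handling multiple images one after another. A minimal sketch of that loop follows. Here `ask_fn` is an assumed user-supplied callable that takes an image path and a question and returns text (for example, a wrapper around the generation steps in Basic usage), and `caption_images` is a hypothetical name, not a LLaVA function.

```python
def caption_images(image_paths, ask_fn, question="Describe this image in detail."):
    """Run one query per image, in order, collecting answers and errors."""
    results = {}
    for path in image_paths:
        try:
            results[path] = {"ok": True, "answer": ask_fn(path, question)}
        except Exception as exc:  # keep going if one image fails
            results[path] = {"ok": False, "error": str(exc)}
    return results


# Usage with a dummy callable standing in for real LLaVA inference
def dummy_ask(path, question):
    if path.endswith(".txt"):
        raise ValueError("not an image")
    return f"caption for {path}"


results = caption_images(["a.jpg", "notes.txt", "b.png"], dummy_ask)
for path, outcome in results.items():
    print(path, outcome)
```

Recording failures per image instead of raising keeps a long run from stopping at the first unreadable file.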

Performance

Model	VRAM (FP16)	VRAM (4-bit)	Speed (tokens/sec)
7B	~14 GB	~4 GB	~20
13B	~28 GB	~8 GB	~12
34B	~70 GB	~18 GB	~5

Measured on an A100 GPU.
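
The VRAM columns above are close to what a weights-only estimate gives: parameter count times bytes per parameter. The sketch below does that arithmetic. It ignores activations, the KV cache, the vision encoder, and quantization overhead, which is an assumption and one reason the table's figures are somewhat higher. The function name `estimate_weight_gb` is hypothetical.

```python
def estimate_weight_gb(params_billion, bits_per_param):
    """Weights-only memory estimate in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9


for size in (7, 13, 34):
    fp16 = estimate_weight_gb(size, 16)
    int8 = estimate_weight_gb(size, 8)
    int4 = estimate_weight_gb(size, 4)
    print(f"{size}B: FP16 ~{fp16:.1f} GB, 8-bit ~{int8:.1f} GB, 4-bit ~{int4:.1f} GB")
```

Going from 16 bits to 4 bits divides the weight footprint by four, and 8 bits halves it, which matches the rough 4x and 2x reductions quoted for quantization.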

Benchmarks

LLaVA achieves competitive scores on the following benchmarks:

VQAv2: 78.5%
GQA: 62.0%
MM-Vet: 35.4%
MMBench: 64.3%

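Limitations

Hallucination - may describe objects that are not in the image
Spatial reasoning - struggles to judge precise positions
Small text - has difficulty reading fine print
Object counting - counts are unreliable when there are many objects
VRAM requirements - needs a powerful GPU
Inference speed - slower than CLIP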

Integration with frameworks

LangChain

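from langchain.llms.base import LLM

class LLaVALLM(LLM):
    @property
    def _llm_type(self) -> str:
        # Required by LangChain's LLM base class
        return "llava"

    def _call(self, prompt, stop=None, run_manager=None, **kwargs):
        # Custom LLaVA inference goes here; run_llava is a hypothetical
        # user-defined helper that returns the generated text.
        response = run_llava(prompt)
        return response

llm = LLaVALLM()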

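Gradio app

import gradio as gr

def chat(message, history, image):
    # ChatInterface calls fn(message, history, *additional_inputs)
    # ask_llava is a user-defined helper, not part of LLaVA or Gradio
    response = ask_llava(model, image, message)
    return response

demo = gr.ChatInterface(
    chat,
    additional_inputs=[gr.Image(type="pil")],
    title="LLaVA Chat"
)
demo.launch()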

Resources

GitHub: https://github.com/haotian-liu/LLaVA ⭐ 23,000+
Paper: https://arxiv.org/abs/2304.08485
Demo: https://llava.hliu.cc
Models: https://huggingface.co/liuhaotian
License: Apache 2.0

Source: https://github.com/davila7/claude-code-templates / License: MIT