
model-pruning

Name: model-pruning
Author: davila7

Reduce LLM size and accelerate inference using pruning techniques like Wanda and SparseGPT. Use when compressing models without retraining, achieving 50% sparsity with minimal accuracy loss, or enabling faster inference on hardware accelerators. Covers unstructured pruning, structured pruning, N:M sparsity, magnitude pruning, and one-shot methods.

SKILL.md contents

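Model Pruning: Compressing LLMs

When to use this skill

Use model pruning when you need to:

Reduce model size by 40-60% with less than 1% accuracy loss
Accelerate inference with hardware-friendly sparsity (2-4× speedup)
Deploy on constrained hardware (mobile, edge devices)
Compress without retraining (one-shot methods)
Serve efficiently with a reduced memory footprint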
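
The size reduction listed above only materializes when the zeros are not stored: a pruned weight kept in a dense tensor still occupies a full slot. A rough back-of-the-envelope sketch, assuming a hypothetical 7-billion-parameter model in fp16 and a 2:4 compressed layout that stores two values plus a 2-bit position index for each of them per group of four (the function name and the parameter count are illustrative, not part of any library):

```python
def weight_memory_gb(num_params, bits_per_weight=16, layout="dense"):
    """Approximate weight storage in GB (1 GB = 1e9 bytes) for a given layout."""
    if layout == "dense":
        # Zeros in a dense tensor still occupy a full slot each
        bits = num_params * bits_per_weight
    elif layout == "2:4":
        # Per group of 4: 2 stored values + 2 position indices of 2 bits each
        groups = num_params / 4
        bits = groups * (2 * bits_per_weight + 2 * 2)
    else:
        raise ValueError(f"unknown layout: {layout}")
    return bits / 8 / 1e9


num_params = 7e9  # assumption: roughly a 7B-parameter model
dense = weight_memory_gb(num_params)
compressed = weight_memory_gb(num_params, layout="2:4")
print(f"dense fp16:     {dense:.2f} GB")
print(f"2:4 compressed: {compressed:.2f} GB ({compressed / dense:.1%} of dense)")
```

Under these assumptions the compressed weights take a little more than half of the dense size, because the position metadata is stored alongside the kept values.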

Key techniques: Wanda (weights × activations), SparseGPT (second-order), structured pruning, N:M sparsity

Papers: Wanda, ICLR 2024 (arXiv 2306.11695); SparseGPT (arXiv 2301.00774)

Installation

# If a repository lacks the files used below, follow its README instead

# Wanda implementation
git clone https://github.com/locuslab/wanda
cd wanda
pip install -r requirements.txt
cd ..

# Optional: SparseGPT
git clone https://github.com/IST-DASLab/sparsegpt
cd sparsegpt
pip install -e .
cd ..

# Dependencies
pip install torch transformers accelerate

Quick start

Wanda pruning (one-shot, no retraining)

Source: ICLR 2024 (arXiv 2306.11695)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Calibration data (a small dataset used only for activation statistics;
# three sentences are for illustration, real runs use far more text)
calib_data = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning is transforming the world.",
    "Artificial intelligence powers modern applications.",
]

# Wanda pruning function
def wanda_prune(model, calib_data, sparsity=0.5):
    """
    Wanda: prune by |weight| × input-activation norm.

    Args:
        sparsity: fraction of weights to prune (0.5 = 50%)
    """
    # 1. Collect activation statistics
    sq_sums = {}  # per layer: running sum of squared inputs for each input feature

    def hook_fn(name):
        def hook(module, input, output):
            # Flatten (batch, seq, in_features) to (tokens, in_features) and
            # accumulate squared activations over all calibration tokens
            x = input[0].detach().float().reshape(-1, input[0].shape[-1])
            sq_sums[name] = sq_sums.get(name, 0) + x.pow(2).sum(dim=0)
        return hook

    # Register hooks on all linear layers (the output head is left dense)
    hooks = []
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear) and "lm_head" not in name:
            hooks.append(module.register_forward_hook(hook_fn(name)))

    # Run the calibration data through the model
    model.eval()
    with torch.no_grad():
        for text in calib_data:
            inputs = tokenizer(text, return_tensors="pt").to(model.device)
            model(**inputs)

    # Remove the hooks
    for hook in hooks:
        hook.remove()

    # 2. Prune weights by |weight| × activation norm
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear) and name in sq_sums:
            W = module.weight.data
            act_norm = sq_sums[name].sqrt()  # L2 norm per input feature

            # Importance: |weight| × activation norm, shape (out_features, in_features)
            importance = W.abs().float() * act_norm.unsqueeze(0)

            # Wanda compares weights within each output row, so pick the
            # lowest-scoring fraction of every row (this also avoids
            # torch.quantile, which rejects fp16 and very large tensors)
            num_prune = int(W.shape[1] * sparsity)
            prune_idx = torch.topk(importance, num_prune, dim=1, largest=False).indices

            # Zero the selected weights in place (pruning)
            W.scatter_(1, prune_idx, 0.0)

    return model

# Apply Wanda pruning (50% sparsity, one-shot, no retraining)
pruned_model = wanda_prune(model, calib_data, sparsity=0.5)

# Save
pruned_model.save_pretrained("./llama-2-7b-wanda-50")
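
After pruning it is worth confirming that the requested sparsity was actually reached. A minimal sketch, assuming that "sparsity" means the fraction of exactly-zero entries in `nn.Linear` weights; `measure_sparsity` is a hypothetical helper, demonstrated on a small stand-in model rather than an LLM:

```python
import torch
import torch.nn as nn


def measure_sparsity(model):
    """Return the fraction of exactly-zero weights across all nn.Linear layers."""
    zeros, total = 0, 0
    for module in model.modules():
        if isinstance(module, nn.Linear):
            zeros += (module.weight == 0).sum().item()
            total += module.weight.numel()
    return zeros / total if total else 0.0


# Toy check on a small stand-in model (not an LLM)
toy = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 4))
with torch.no_grad():
    toy[0].weight[:, :4] = 0.0  # zero half of the first layer's weights
print(f"sparsity: {measure_sparsity(toy):.3f}")
```

Called as `measure_sparsity(pruned_model)`, the result should be close to the requested sparsity when every linear layer was pruned; any layer left dense pulls the figure down.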

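SparseGPT (second-order pruning)

Source: arXiv 2301.00774

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sparsegpt import SparseGPT  # sparsegpt.py from the cloned repo must be importable

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16, device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Calibration data (load_calibration_data is a placeholder returning ~128 text samples)
calib_data = load_calibration_data()

# In the reference repo, SparseGPT wraps one layer at a time, and its scripts
# (e.g. llama.py) loop over the linear layers of every decoder block.
# Sketch for a single layer, assuming the repo's add_batch / fasterprune / free methods:
layer = model.model.layers[0].self_attn.q_proj
pruner = SparseGPT(layer)

# Accumulate the layer's input statistics (Hessian) from the calibration data
handle = layer.register_forward_hook(
    lambda module, inp, out: pruner.add_batch(inp[0].data, out.data)
)
with torch.no_grad():
    for text in calib_data:
        model(**tokenizer(text, return_tensors="pt").to(model.device))
handle.remove()

# Prune (one-shot, layer-wise reconstruction)
pruner.fasterprune(
    0.5,                    # 50% sparsity
    prunen=0,               # unstructured (0), or N:M structured (e.g. prunen=2, prunem=4)
    prunem=0,
    percdamp=0.01,          # damping for the Hessian inverse
)
pruner.free()

# Result reported in the paper: near-lossless pruning at 50% sparsity for large models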

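N:M structured pruning (hardware accelerators)

import torch
import torch.nn.functional as F

def nm_prune(weight, n=2, m=4):
    """
    N:M pruning: keep the N largest-magnitude weights in every group of M consecutive weights.
    Example: 2:4 = keep 2 out of every 4 weights.

    2:4 is the pattern described for NVIDIA Ampere sparse tensor cores;
    other N:M ratios may not be accelerated by the hardware.
    """
    # Flatten so the weights can be split into groups of M
    # (when in_features is divisible by M, no group straddles two rows)
    shape = weight.shape
    weight_flat = weight.flatten()

    # Pad to a multiple of M
    pad_size = (m - weight_flat.numel() % m) % m
    weight_padded = F.pad(weight_flat, (0, pad_size))

    # Reshape to (num_groups, m)
    weight_grouped = weight_padded.reshape(-1, m)

    # Find the top N entries of each group
    _, indices = torch.topk(weight_grouped.abs(), n, dim=-1)

    # Build the mask
    mask = torch.zeros_like(weight_grouped)
    mask.scatter_(1, indices, 1.0)

    # Apply the mask
    weight_pruned = weight_grouped * mask

    # Drop the padding and restore the original shape
    weight_pruned = weight_pruned.flatten()[:weight_flat.numel()]
    return weight_pruned.reshape(shape)

# Apply 2:4 sparsity (NVIDIA hardware)
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        module.weight.data = nm_prune(module.weight.data, n=2, m=4)

# 50% sparsity. The up-to-2× speedup on A100 sparse tensor cores also requires a
# kernel or runtime that consumes the compressed 2:4 format (for example TensorRT);
# zeroed weights stored in a dense tensor still run at dense speed.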

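Core concepts

1. Pruning criteria

Magnitude pruning (baseline):

# Prune the weights with the smallest absolute value
# (torch.quantile needs float32/float64 input and caps the tensor size;
# use topk or kthvalue for large layers)
importance = weight.abs()
threshold = torch.quantile(importance.float().flatten(), sparsity)
mask = importance >= threshold

Wanda (weights × activations):

# Importance = |weight| × input-activation norm (one norm per input feature)
importance = weight.abs() * activation
# Better than magnitude alone because it accounts for how strongly each input is used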
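
To make the difference between the two criteria concrete, here is a tiny worked example on a single output row. The weights and activation norms are made up for illustration, and `keep_indices` is a hypothetical helper rather than part of the Wanda code base:

```python
def keep_indices(scores, sparsity=0.5):
    """Indices of the weights that survive pruning the lowest-scoring fraction."""
    num_prune = int(len(scores) * sparsity)
    order = sorted(range(len(scores)), key=lambda i: scores[i])  # ascending score
    return sorted(order[num_prune:])


# One output row with four weights; values are made up for illustration
weights = [0.5, -0.4, 0.1, 0.3]
act_norms = [0.1, 0.1, 10.0, 1.0]  # assumed per-input-feature activation norms

magnitude_scores = [abs(w) for w in weights]
wanda_scores = [abs(w) * a for w, a in zip(weights, act_norms)]

print("magnitude keeps:", keep_indices(magnitude_scores))
print("wanda keeps:    ", keep_indices(wanda_scores))
```

Magnitude pruning keeps the two largest weights, while the Wanda score keeps the two weights attached to the strongly activated inputs, even though one of them is the smallest weight in the row.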

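SparseGPT (second-order):

# Uses second-order (Hessian) information for the importance score
# More accurate, but more expensive to compute
# H_inv is the inverse of the layer-wise Hessian H = X @ X.T (plus damping)
importance = weight ** 2 / torch.diag(H_inv).unsqueeze(0)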
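
Written out, the three criteria can be summarized as follows. The notation is an assumption made here for clarity: $W$ is the layer's weight matrix, $X$ is the matrix of calibration inputs with one row per input feature, $X_j$ is its $j$-th row, and $\lambda$ is a small damping term. The second-order score uses the diagonal of the inverse Hessian, as in the Optimal Brain Surgeon formulation that SparseGPT builds on:

```latex
\begin{aligned}
S^{\text{mag}}_{ij} &= |W_{ij}| \\
S^{\text{Wanda}}_{ij} &= |W_{ij}| \cdot \lVert X_j \rVert_2 \\
\varepsilon^{\text{SparseGPT}}_{ij} &= \frac{W_{ij}^2}{[H^{-1}]_{jj}},
\qquad H = X X^\top + \lambda I
\end{aligned}
```

In all three cases the weights with the lowest score are removed. SparseGPT additionally updates the remaining weights of the row to compensate for each removal, which is where most of its extra cost comes from.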

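2. Structured vs. unstructured

Unstructured (fine-grained):

Prunes individual weights
Higher quality (better accuracy)
No hardware speedup (irregular sparsity)

Structured (coarse-grained):

Prunes entire neurons, heads, or layers
Lower quality (larger accuracy loss)
Hardware speedup (regular sparsity)

Semi-structured (N:M):

Combines the advantages of both
50% sparsity (2:4) → up to 2× speedup on NVIDIA GPUs
Minimal accuracy loss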
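
Structured pruning speeds up ordinary dense kernels because the weight matrix itself gets smaller. A minimal sketch of neuron-level pruning, assuming the L2 norm of each output row as the importance score (one simple choice among many); `prune_output_neurons` is a hypothetical helper:

```python
import torch
import torch.nn as nn


def prune_output_neurons(layer, keep_ratio=0.5):
    """Return a smaller nn.Linear keeping the output neurons with the largest L2 norm."""
    num_keep = max(1, int(layer.out_features * keep_ratio))
    row_norms = layer.weight.detach().norm(dim=1)  # one norm per output neuron
    keep = torch.topk(row_norms, num_keep).indices.sort().values

    pruned = nn.Linear(layer.in_features, num_keep, bias=layer.bias is not None)
    with torch.no_grad():
        pruned.weight.copy_(layer.weight[keep])
        if layer.bias is not None:
            pruned.bias.copy_(layer.bias[keep])
    return pruned, keep


layer = nn.Linear(16, 8)
smaller, kept = prune_output_neurons(layer, keep_ratio=0.5)
print(tuple(layer.weight.shape), "->", tuple(smaller.weight.shape))
```

The returned `kept` indices matter: whatever layer consumes this output must have its input columns sliced with the same indices, otherwise the shapes no longer line up.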

3. Sparsity patterns

# Unstructured (irregular positions)
# [1, 0, 1, 0, 1, 1, 0, 0]
# Pros: flexible, high quality
# Cons: no speedup

# Structured (blocks)
# [1, 1, 0, 0, 1, 1, 0, 0]
# Pros: hardware-friendly
# Cons: larger accuracy loss

# N:M (semi-structured)
# [1, 0, 1, 0] [1, 1, 0, 0]  (2:4 pattern)
# Pros: hardware speedup + good quality
# Cons: requires specific hardware (NVIDIA)
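
An overall sparsity of 50% does not by itself mean a mask follows the 2:4 pattern; the constraint applies to every group of four. A small checker, written as a sketch under the assumption that masks are flat Python lists of 0/1 values (`is_nm_sparse` is a hypothetical helper):

```python
def is_nm_sparse(mask, n=2, m=4):
    """True if every consecutive group of m entries has at most n nonzeros."""
    if len(mask) % m != 0:
        raise ValueError("mask length must be a multiple of m")
    return all(
        sum(1 for v in mask[i:i + m] if v != 0) <= n
        for i in range(0, len(mask), m)
    )


# Both masks are 50% sparse overall
print(is_nm_sparse([1, 0, 1, 0, 1, 1, 0, 0]))  # groups [1,0,1,0] and [1,1,0,0]
print(is_nm_sparse([1, 1, 1, 0, 0, 0, 0, 1]))  # first group has three nonzeros
```

Only the first mask can be handed to 2:4 hardware, which is why N:M pruning selects weights per group instead of applying one threshold to the whole tensor.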

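Pruning strategies

Strategy 1: Gradual magnitude pruning

def gradual_prune(model, initial_sparsity=0.0, final_sparsity=0.5, num_steps=100):
    """Increase sparsity gradually during training."""
    for step in range(num_steps):
        # Current sparsity target; (step + 1) so the last step reaches final_sparsity
        current_sparsity = initial_sparsity + (final_sparsity - initial_sparsity) * ((step + 1) / num_steps)

        # Prune to the current sparsity
        for module in model.modules():
            if isinstance(module, torch.nn.Linear):
                weight = module.weight.data
                num_prune = int(weight.numel() * current_sparsity)
                if num_prune == 0:
                    continue
                # kthvalue avoids torch.quantile's dtype and input-size limits
                threshold = weight.abs().flatten().float().kthvalue(num_prune).values
                mask = weight.abs() > threshold
                weight *= mask.to(weight.dtype)

        # One training step (train_step is a placeholder for your own training code)
        train_step(model)

    return model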
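
The function above ramps sparsity linearly. A common alternative in the gradual-pruning literature is a cubic ramp, which prunes quickly while the network still has plenty of redundancy and slows down as the target is approached. A sketch of such a schedule, with `cubic_sparsity` as a hypothetical helper name:

```python
def cubic_sparsity(step, num_steps, initial_sparsity=0.0, final_sparsity=0.5):
    """Sparsity target at `step` (0-based) following a cubic ramp."""
    progress = min(1.0, (step + 1) / num_steps)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3


schedule = [cubic_sparsity(s, num_steps=5) for s in range(5)]
print([round(s, 3) for s in schedule])
```

Swapping this in for the linear expression inside `gradual_prune` changes only how `current_sparsity` is computed; the pruning and training steps stay the same.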

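Strategy 2: Layer-wise pruning

def prune_layer(module, sparsity):
    """Magnitude-prune a single linear layer in place."""
    weight = module.weight.data
    num_prune = int(weight.numel() * sparsity)
    if num_prune > 0:
        threshold = weight.abs().flatten().float().kthvalue(num_prune).values
        weight *= (weight.abs() > threshold).to(weight.dtype)

def layer_wise_prune(model, sparsity_per_layer):
    """
    Apply a different sparsity to each layer.

    sparsity_per_layer maps a module-name fragment to a sparsity, e.g. for
    Hugging Face LLaMA names such as "model.layers.0.self_attn.q_proj":
        {"layers.0.": 0.3, "layers.1.": 0.4, "layers.2.": 0.5, "layers.3.": 0.6}
    The trailing dot keeps "layers.1." from also matching "layers.10.".
    """
    # Heuristic: prune early layers less (assumed more important)
    # and later layers more (assumed less important)
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            # Find the schedule entry for this module's layer
            for layer_name, sparsity in sparsity_per_layer.items():
                if layer_name in name:
                    # Prune with the layer-specific sparsity
                    prune_layer(module, sparsity)
                    break

    return model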
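
Writing a per-layer schedule by hand gets tedious for models with dozens of blocks. One way to generate it is to interpolate linearly between a sparsity for the first block and one for the last. This is a sketch under two assumptions: module names look like `model.layers.<i>.<...>` (as in Hugging Face LLaMA models), and a linear ramp over depth is an acceptable heuristic. `linear_layer_schedule` is a hypothetical helper:

```python
def linear_layer_schedule(num_layers, first=0.3, last=0.6):
    """Map 'layers.<i>.' name fragments to sparsities rising linearly with depth."""
    if num_layers == 1:
        return {"layers.0.": first}
    step = (last - first) / (num_layers - 1)
    return {f"layers.{i}.": round(first + i * step, 4) for i in range(num_layers)}


schedule = linear_layer_schedule(4)
print(schedule)
print("mean sparsity:", round(sum(schedule.values()) / len(schedule), 4))
```

The mean of the schedule is the overall sparsity of the pruned blocks when all blocks have the same number of weights, so `first` and `last` can be chosen symmetrically around the global target.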

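Strategy 3: Iterative pruning + fine-tuning

def iterative_prune_finetune(model, target_sparsity=0.5, iterations=5):
    """Prune in stages, fine-tuning between iterations."""
    current_sparsity = 0.0
    sparsity_increment = target_sparsity / iterations

    for i in range(iterations):
        # Raise the sparsity target
        current_sparsity += sparsity_increment

        # Prune (prune_model is a placeholder for any of the pruning functions above)
        prune_model(model, sparsity=current_sparsity)

        # Fine-tune to recover accuracy (fine_tune is a placeholder for your training loop)
        fine_tune(model, epochs=2, lr=1e-5)

    return model

# Result: better accuracy than one-shot pruning at high sparsity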
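
One practical detail when fine-tuning a pruned model: ordinary gradient updates also touch the pruned positions, so zeros drift away from zero unless the masks are enforced. A minimal sketch on a toy model with random data, assuming magnitude pruning and plain SGD; `magnitude_masks` and `apply_masks` are hypothetical helpers:

```python
import torch
import torch.nn as nn


def magnitude_masks(model, sparsity=0.5):
    """Zero the smallest-magnitude weights per Linear layer and return the keep-masks."""
    masks = {}
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            w = module.weight.data
            k = int(w.numel() * sparsity)
            if k == 0:
                continue
            threshold = w.abs().flatten().kthvalue(k).values
            masks[name] = w.abs() > threshold
            w.mul_(masks[name])
    return masks


def apply_masks(model, masks):
    """Re-zero pruned positions, e.g. after every optimizer step."""
    modules = dict(model.named_modules())
    with torch.no_grad():
        for name, mask in masks.items():
            modules[name].weight.mul_(mask)


model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))
masks = magnitude_masks(model, sparsity=0.5)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(3):  # stand-in for a fine-tuning loop on random data
    loss = model(torch.randn(4, 8)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    apply_masks(model, masks)  # without this, pruned weights drift away from zero
```

The same idea carries over to a real training loop: compute the masks once per pruning round and re-apply them after each optimizer step.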

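Production deployment

Complete pruning pipeline

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

def production_pruning_pipeline(
    model_name="meta-llama/Llama-2-7b-hf",
    target_sparsity=0.5,
    method="wanda",  # or "sparsegpt"
    finetune_dataset=None,  # optional tokenized dataset for recovery fine-tuning
):
    # 1. Load the model
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="cuda"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # 2. Load calibration data as a list of non-empty strings
    calib_dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1000]")
    calib_texts = [t for t in calib_dataset["text"] if t.strip()]

    # 3. Apply pruning
    if method == "wanda":
        pruned_model = wanda_prune(model, calib_texts, sparsity=target_sparsity)
    elif method == "sparsegpt":
        # sparsegpt_prune is a placeholder for a wrapper that runs SparseGPT
        # layer by layer, as the scripts in the SparseGPT repository do
        pruned_model = sparsegpt_prune(model, calib_texts, sparsity=target_sparsity)
    else:
        raise ValueError(f"unknown method: {method}")

    # 4. (Optional) Fine-tune to recover accuracy.
    # Plain dense fine-tuning can move pruned weights away from zero
    # unless the pruning masks are re-applied after each optimizer step.
    if finetune_dataset is not None:
        training_args = TrainingArguments(
            output_dir="./pruned-model",
            num_train_epochs=1,
            per_device_train_batch_size=4,
            learning_rate=1e-5,
            bf16=True,
        )

        trainer = Trainer(
            model=pruned_model,
            args=training_args,
            train_dataset=finetune_dataset,
        )

        trainer.train()

    # 5. Save
    pruned_model.save_pretrained("./pruned-llama-7b-50")
    tokenizer.save_pretrained("./pruned-llama-7b-50")

    return pruned_model

# Usage
pruned_model = production_pruning_pipeline(
    model_name="meta-llama/Llama-2-7b-hf",
    target_sparsity=0.5,
    method="wanda"
)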

Evaluation

from lm_eval import evaluator

# Evaluate the pruned model against the original
original_results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-2-7b-hf",
    tasks=["arc_easy", "hellaswag", "winogrande"],
)

pruned_results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=./pruned-llama-7b-50",
    tasks=["arc_easy", "hellaswag", "winogrande"],
)

def get_acc(results, task):
    metrics = results["results"][task]
    # lm-eval 0.4+ reports "acc,none"; older releases used "acc"
    return metrics.get("acc,none", metrics.get("acc"))

# Compare
original_acc = get_acc(original_results, "arc_easy")
pruned_acc = get_acc(pruned_results, "arc_easy")
print(f"Original: {original_acc:.3f}")
print(f"Pruned: {pruned_acc:.3f}")
print(f"Accuracy drop: {original_acc - pruned_acc:.3f}")

# Typical results at 50% sparsity (approximate; they vary by model and task):
# - Wanda: <1% accuracy loss
# - SparseGPT: <0.5% accuracy loss
# - Magnitude: 2-3% accuracy loss
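
When comparing against figures like these, it helps to be explicit about whether a "1% loss" means one percentage point or one percent of the original score. A small sketch that reports both; the accuracies are placeholders rather than measured results, and `accuracy_drop` is a hypothetical helper:

```python
def accuracy_drop(original_acc, pruned_acc):
    """Per-task drop as (percentage points, relative percent of the original)."""
    report = {}
    for task, orig in original_acc.items():
        pruned = pruned_acc[task]
        points = (orig - pruned) * 100
        relative = (orig - pruned) / orig * 100 if orig else float("nan")
        report[task] = (points, relative)
    return report


# Placeholder accuracies, not measured results
original_acc = {"task_a": 0.80, "task_b": 0.50}
pruned_acc = {"task_a": 0.76, "task_b": 0.49}

for task, (points, relative) in accuracy_drop(original_acc, pruned_acc).items():
    print(f"{task}: -{points:.1f} points ({relative:.1f}% relative)")
```

The two numbers diverge most on tasks with low baseline accuracy, so stating which one is meant avoids confusion when reading a comparison table.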

Best practices

1. Choosing the sparsity

# Conservative (safe)
sparsity = 0.3  # 30%, <0.5% loss

# Balanced (recommended)
sparsity = 0.5  # 50%, ~1% loss

# Aggressive (risky)
sparsity = 0.7  # 70%, 2-5% loss

# Extreme (model-dependent)
sparsity = 0.9  # 90%, severe loss

2. Choosing a method

# One-shot, no retraining → Wanda or SparseGPT
if no_retraining_budget:
    use_method = "wanda"  # faster

# Best quality → SparseGPT
if need_best_quality:
    use_method = "sparsegpt"  # more accurate

# Hardware speedup → N:M structured
if need_speedup:
    use_method = "nm_prune"  # 2:4 (check which N:M ratios your hardware accelerates)

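3. Avoid common pitfalls

# ❌ Bad: pruning without calibration data
prune_random(model)  # placeholder: no activation statistics

# ✅ Good: use calibration data
wanda_prune(model, calib_data)

# ❌ Bad: very high sparsity in one shot
wanda_prune(model, calib_data, sparsity=0.9)  # severe accuracy loss

# ✅ Good: gradual or iterative
iterative_prune_finetune(model, target_sparsity=0.9, iterations=10)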

Performance comparison

Pruning methods at 50% sparsity (LLaMA-7B):

Method	Accuracy change	Speed	Memory	Retraining required
Magnitude	-2.5%	1.0×	-50%	No
Wanda	-0.8%	1.0×	-50%	No
SparseGPT	-0.4%	1.0×	-50%	No
N:M (2:4)	-1.0%	2.0×	-50%	No
Structured	-3.0%	2.0×	-50%	No

Source: Wanda paper (ICLR 2024), SparseGPT paper

Resources

Wanda paper (ICLR 2024): https://arxiv.org/abs/2306.11695
Wanda GitHub: https://github.com/locuslab/wanda
SparseGPT paper: https://arxiv.org/abs/2301.00774
SparseGPT GitHub: https://github.com/IST-DASLab/sparsegpt
NVIDIA Sparse Tensor Cores: https://developer.nvidia.com/blog/accelerating-inference-with-sparsity-using-ampere-and-tensorrt/

Source: https://github.com/davila7/claude-code-templates / License: MIT