
moe-training

Name: moe-training
Author: davila7

Train Mixture of Experts (MoE) models using DeepSpeed or HuggingFace. Use when training large-scale models with limited compute (5× cost reduction vs dense models), implementing sparse architectures like Mixtral 8x7B or DeepSeek-V3, or scaling model capacity without proportional compute increase. Covers MoE architectures, routing mechanisms, load balancing, expert parallelism, and inference optimization.

SKILL.md body

MoE Training: Mixture of Experts

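When to use this skill

Use MoE Training when you want to:

Train larger models with limited compute (5× cost reduction vs. dense models)
Scale model capacity without a proportional increase in compute
Get better performance per unit of compute budget than a dense model
Specialize experts for different domains, tasks, or languages
Reduce inference latency through sparse activation (Mixtral activates only about 13B of its 47B parameters per token)
Implement state-of-the-art models such as Mixtral 8x7B, DeepSeek-V3, or Switch Transformers

Notable MoE models: Mixtral 8x7B (Mistral AI), DeepSeek-V3, Switch Transformers (Google), GLaM (Google), NLLB-MoE (Meta)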

Installation

# DeepSpeed with MoE support (quote the spec so the shell does not treat ">" as a redirect)
pip install "deepspeed>=0.6.0"

# Megatron-DeepSpeed for large-scale training
git clone https://github.com/microsoft/Megatron-DeepSpeed
cd Megatron-DeepSpeed
pip install -r requirements.txt

# Alternative: HuggingFace Transformers
pip install transformers accelerate

Quick start

Basic MoE architecture

import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Sparse Mixture of Experts layer."""

    def __init__(self, hidden_size, num_experts=8, top_k=2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k

        # Expert networks (FFN)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, 4 * hidden_size),
                nn.GELU(),
                nn.Linear(4 * hidden_size, hidden_size)
            )
            for _ in range(num_experts)
        ])

        # Gating network (router)
        self.gate = nn.Linear(hidden_size, num_experts)

    def forward(self, x):
        # x shape: (batch_size, seq_len, hidden_size)
        batch_size, seq_len, hidden_size = x.shape

        # Flatten for routing
        x_flat = x.view(-1, hidden_size)  # (batch_size * seq_len, hidden_size)

        # Compute gate scores
        gate_logits = self.gate(x_flat)  # (batch_size * seq_len, num_experts)

        # Top-k routing
        gate_scores = torch.softmax(gate_logits, dim=-1)
        topk_scores, topk_indices = torch.topk(gate_scores, self.top_k, dim=-1)

        # Normalize the top-k scores so they sum to 1 per token
        topk_scores = topk_scores / topk_scores.sum(dim=-1, keepdim=True)

        # Dispatch tokens to experts and combine the outputs
        output = torch.zeros_like(x_flat)

        for i in range(self.top_k):
            expert_idx = topk_indices[:, i]
            expert_scores = topk_scores[:, i].unsqueeze(-1)

            # Route tokens to their experts
            for expert_id in range(self.num_experts):
                mask = (expert_idx == expert_id)
                if mask.any():
                    expert_input = x_flat[mask]
                    expert_output = self.experts[expert_id](expert_input)
                    output[mask] += expert_scores[mask] * expert_output

        # Restore the original shape
        return output.view(batch_size, seq_len, hidden_size)

DeepSpeed MoE training

# Training script using MoE
deepspeed pretrain_gpt_moe.py \
  --num-layers 24 \
  --hidden-size 1024 \
  --num-attention-heads 16 \
  --seq-length 2048 \
  --max-position-embeddings 2048 \
  --micro-batch-size 4 \
  --global-batch-size 256 \
  --train-iters 500000 \
  --lr 0.0001 \
  --min-lr 0.00001 \
  --lr-decay-style cosine \
  --num-experts 128 \
  --moe-expert-parallel-size 4 \
  --moe-loss-coeff 0.01 \
  --moe-train-capacity-factor 1.25 \
  --moe-eval-capacity-factor 2.0 \
  --fp16 \
  --deepspeed_config ds_config.json

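Core concepts

1. MoE architecture

Key components:

Experts: multiple specialized FFN networks (typically 8 to 128)
Router/gate: a learned network that selects which experts to use
Top-k routing: only k experts are activated per token (k=1 or k=2)
Load balancing: keeps expert utilization even

Input token
    ↓
Router (gating network)
    ↓
Top-k expert selection (e.g. 2 of 8)
    ↓
Expert 1 (weight: 0.6) + Expert 5 (weight: 0.4)
    ↓
Weighted combination
    ↓
Output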
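
The flow above can be traced on a single token. The following is a minimal sketch, assuming made-up router logits for 8 experts; the values are illustrative only and were chosen so that the two winning experts end up with weights close to the 0.6 / 0.4 split in the diagram.

```python
import torch

# Hypothetical router logits for one token over 8 experts (illustrative values).
gate_logits = torch.tensor([[2.0, 0.1, -1.0, 0.3, 1.6, -0.5, 0.0, -2.0]])

# Probabilities over all 8 experts
gate_scores = torch.softmax(gate_logits, dim=-1)

# Keep the 2 most likely experts for this token
topk_scores, topk_indices = torch.topk(gate_scores, k=2, dim=-1)

# Renormalize so the two combination weights sum to 1
topk_weights = topk_scores / topk_scores.sum(dim=-1, keepdim=True)

print(topk_indices.tolist())   # zero-based indices of the selected experts
print(topk_weights.tolist())   # weights used in the weighted combination
```

With these logits the zero-based indices 0 and 4 are selected, which correspond to "Expert 1" and "Expert 5" in the one-based diagram.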

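2. Routing mechanisms

Top-1 routing (Switch Transformer):

# Simplest routing: one expert per token
gate_logits = router(x)  # (batch, seq_len, num_experts)
expert_idx = torch.argmax(gate_logits, dim=-1)  # hard routing

Top-2 routing (Mixtral):

# Top-2: two experts per token
gate_scores = torch.softmax(router(x), dim=-1)  # (batch, seq_len, num_experts)
top2_scores, top2_indices = torch.topk(gate_scores, k=2, dim=-1)

# Normalize the two selected scores so they sum to 1
top2_scores = top2_scores / top2_scores.sum(dim=-1, keepdim=True)

# Combine expert outputs. Assumes expert_outputs holds every expert's output
# with shape (batch, seq_len, num_experts, hidden); gather picks the two selected.
idx = top2_indices.unsqueeze(-1).expand(-1, -1, -1, expert_outputs.size(-1))
selected = torch.gather(expert_outputs, 2, idx)  # (batch, seq_len, 2, hidden)
output = (top2_scores.unsqueeze(-1) * selected).sum(dim=2)

Expert-choice routing:

# Experts select their top-k tokens (instead of tokens selecting experts)
# Every expert receives exactly capacity_per_expert tokens, so load is balanced by construction
expert_scores = router(x).transpose(-1, -2)  # (batch, num_experts, seq_len)
topk_scores, topk_tokens = torch.topk(expert_scores, k=capacity_per_expert, dim=-1)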
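
To make the expert-choice idea concrete, here is a small runnable sketch. The sizes, the `capacity_factor` value, and the softmax over experts before the transpose are assumptions made for illustration; they are not taken from a specific library.

```python
import torch

torch.manual_seed(0)
num_tokens, hidden_size, num_experts = 16, 8, 4
capacity_factor = 2.0  # assumed: average number of experts per token
capacity_per_expert = int(num_tokens * capacity_factor / num_experts)  # 8

x = torch.randn(num_tokens, hidden_size)
router = torch.nn.Linear(hidden_size, num_experts, bias=False)

# Token-to-expert affinities, transposed so that each row belongs to one expert
affinity = torch.softmax(router(x), dim=-1).t()  # (num_experts, num_tokens)

# Each expert picks its own top-k tokens
weights, token_idx = torch.topk(affinity, k=capacity_per_expert, dim=-1)

# How many experts picked each token (may be 0 for some tokens, several for others)
experts_per_token = torch.bincount(token_idx.flatten(), minlength=num_tokens)
print(token_idx.shape, experts_per_token.tolist())
```

Every expert processes exactly `capacity_per_expert` tokens, which is where the load-balance guarantee comes from. The trade-off visible in `experts_per_token` is that an individual token can be chosen by several experts or by none.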

3. Load balancing

Auxiliary loss:

def load_balancing_loss(gate_logits, expert_indices, num_experts):
    """Encourage even utilization across experts."""
    # Fraction of routed tokens assigned to each expert
    expert_counts = torch.bincount(expert_indices.flatten(), minlength=num_experts)
    expert_fraction = expert_counts.float() / expert_indices.numel()

    # Mean gate probability per expert (averaged over tokens)
    gate_probs = torch.softmax(gate_logits, dim=-1).mean(dim=0)

    # Auxiliary loss: smallest when tokens and probabilities are spread evenly
    aux_loss = num_experts * (expert_fraction * gate_probs).sum()

    return aux_loss

# Add to the main loss
total_loss = language_model_loss + 0.01 * load_balancing_loss(...)
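
As a sanity check on the auxiliary loss, the sketch below evaluates the same formula on two hand-built cases: perfectly balanced routing and a router that has collapsed onto one expert. The tensors are synthetic and top-1 routing is assumed for simplicity.

```python
import torch

def load_balancing_loss(gate_logits, expert_indices, num_experts):
    # Same formula as in the text: N * sum_i(fraction_i * mean_prob_i)
    counts = torch.bincount(expert_indices.flatten(), minlength=num_experts)
    fraction = counts.float() / expert_indices.numel()
    probs = torch.softmax(gate_logits, dim=-1).mean(dim=0)
    return num_experts * (fraction * probs).sum()

num_experts, num_tokens = 8, 64

# Balanced: uniform router probabilities, tokens spread evenly over experts
balanced_logits = torch.zeros(num_tokens, num_experts)
balanced_idx = torch.arange(num_tokens) % num_experts
balanced = load_balancing_loss(balanced_logits, balanced_idx, num_experts)

# Collapsed: the router strongly prefers expert 0 and sends it every token
collapsed_logits = torch.zeros(num_tokens, num_experts)
collapsed_logits[:, 0] = 10.0
collapsed_idx = torch.zeros(num_tokens, dtype=torch.long)
collapsed = load_balancing_loss(collapsed_logits, collapsed_idx, num_experts)

print(float(balanced), float(collapsed))
```

The balanced case evaluates to 1.0 and the collapsed case approaches `num_experts`, so the term grows as routing concentrates on fewer experts.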

Router z-loss (stability):

def router_z_loss(logits):
    """Penalize large router logits to keep the softmax numerically stable."""
    z_loss = torch.logsumexp(logits, dim=-1).pow(2).mean()
    return z_loss

total_loss = lm_loss + 0.01 * aux_loss + 0.001 * router_z_loss(gate_logits)
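
The effect of the z-loss is easiest to see by scaling the same logits. In this sketch (toy logits, chosen for illustration) the selected expert does not change, but the penalty grows sharply with logit magnitude.

```python
import torch

def router_z_loss(logits):
    # Squared log-sum-exp of the router logits, averaged over tokens
    return torch.logsumexp(logits, dim=-1).pow(2).mean()

logits = torch.tensor([[1.0, 0.0, -1.0, 0.5]])  # one token, four experts

small = router_z_loss(logits)
large = router_z_loss(10 * logits)

# Scaling the logits does not change which expert wins
same_choice = torch.equal(logits.argmax(dim=-1), (10 * logits).argmax(dim=-1))

print(float(small), float(large), same_choice)
```

Because the routing decision is unchanged while the loss increases, the term acts on the scale of the logits rather than on which expert is chosen.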

4. Expert parallelism

# DeepSpeed configuration
{
  "train_batch_size": 256,
  "fp16": {"enabled": true},
  "moe": {
    "enabled": true,
    "num_experts": 128,
    "expert_parallel_size": 8,  # spread the 128 experts across 8 GPUs (16 per GPU)
    "capacity_factor": 1.25,    # expert capacity = tokens_per_batch * capacity_factor / num_experts
    "drop_tokens": true,        # drop tokens that exceed an expert's capacity
    "use_residual": false
  }
}

Training configuration

DeepSpeed MoE configuration

{
  "train_batch_size": 256,
  "gradient_accumulation_steps": 1,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.0001,
      "betas": [0.9, 0.999],
      "eps": 1e-8
    }
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 16
  },
  "moe": {
    "enabled": true,
    "num_experts": 128,
    "expert_parallel_size": 8,
    "moe_loss_coeff": 0.01,
    "train_capacity_factor": 1.25,
    "eval_capacity_factor": 2.0,
    "min_capacity": 4,
    "drop_tokens": true,
    "use_residual": false,
    "use_tutel": false
  },
  "zero_optimization": {
    "stage": 1
  }
}

Training script

#!/bin/bash

# Mixtral-style MoE training
deepspeed --num_gpus 8 pretrain_moe.py \
  --model-parallel-size 1 \
  --num-layers 32 \
  --hidden-size 4096 \
  --num-attention-heads 32 \
  --seq-length 2048 \
  --max-position-embeddings 4096 \
  --micro-batch-size 2 \
  --global-batch-size 256 \
  --train-iters 500000 \
  --save-interval 5000 \
  --eval-interval 1000 \
  --eval-iters 100 \
  --lr 0.0001 \
  --min-lr 0.00001 \
  --lr-decay-style cosine \
  --lr-warmup-iters 2000 \
  --clip-grad 1.0 \
  --weight-decay 0.1 \
  --num-experts 8 \
  --moe-expert-parallel-size 4 \
  --moe-loss-coeff 0.01 \
  --moe-train-capacity-factor 1.25 \
  --moe-eval-capacity-factor 2.0 \
  --disable-moe-token-dropping \
  --fp16 \
  --deepspeed \
  --deepspeed_config ds_config_moe.json \
  --data-path /path/to/data \
  --vocab-file /path/to/vocab.json \
  --merge-file /path/to/merges.txt
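
The numbers in the launch command above imply a few derived quantities that are worth checking before a long run. The sketch below is plain arithmetic on those flag values. The divisibility checks reflect the usual constraint that experts are split evenly across an expert-parallel group that sits inside the data-parallel group; treat them as assumptions to confirm against your DeepSpeed version.

```python
# Values taken from the launch command
num_gpus = 8
model_parallel_size = 1
micro_batch_size = 2
global_batch_size = 256
num_experts = 8
expert_parallel_size = 4

# Data-parallel replicas and the gradient accumulation needed to reach the global batch
data_parallel_size = num_gpus // model_parallel_size
samples_per_step = micro_batch_size * data_parallel_size
assert global_batch_size % samples_per_step == 0, "global batch must be a multiple"
grad_accum_steps = global_batch_size // samples_per_step

# Expert placement (assumed constraints: even split, EP group inside the DP group)
assert num_experts % expert_parallel_size == 0
assert data_parallel_size % expert_parallel_size == 0
experts_per_gpu = num_experts // expert_parallel_size
expert_replicas = data_parallel_size // expert_parallel_size

print(grad_accum_steps, experts_per_gpu, expert_replicas)
```

If the JSON config sets `gradient_accumulation_steps` explicitly, it needs to agree with the value implied by the command-line batch sizes.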

Advanced patterns

Mixtral 8x7B architecture

class MixtralMoEBlock(nn.Module):
    """Mixtral-style MoE block with 8 experts and top-2 routing."""

    def __init__(self, config):
        super().__init__()
        self.hidden_dim = config.hidden_size
        self.ffn_dim = config.intermediate_size
        self.num_experts = config.num_local_experts  # 8
        self.top_k = config.num_experts_per_tok       # 2

        # 8 expert FFNs (simplified two-layer MLP; the released Mixtral uses a gated SwiGLU FFN)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(self.hidden_dim, self.ffn_dim, bias=False),
                nn.SiLU(),
                nn.Linear(self.ffn_dim, self.hidden_dim, bias=False)
            )
            for _ in range(self.num_experts)
        ])

        # Router
        self.gate = nn.Linear(self.hidden_dim, self.num_experts, bias=False)

    def forward(self, hidden_states):
        batch_size, sequence_length, hidden_dim = hidden_states.shape

        # Flatten to (batch * seq_len, hidden_dim)
        hidden_states = hidden_states.view(-1, hidden_dim)

        # Router logits
        router_logits = self.gate(hidden_states)  # (batch * seq_len, num_experts)

        # Softmax, then top-2
        routing_weights = torch.softmax(router_logits, dim=1)
        routing_weights, selected_experts = torch.topk(routing_weights, self.top_k, dim=-1)

        # Normalize the routing weights so they sum to 1 per token
        routing_weights /= routing_weights.sum(dim=-1, keepdim=True)

        # Initialize the output
        final_hidden_states = torch.zeros_like(hidden_states)

        # Route tokens to experts
        for expert_idx in range(self.num_experts):
            expert_layer = self.experts[expert_idx]
            # idx: token positions routed to this expert; top_x: which top-k slot chose it
            idx, top_x = torch.where(selected_experts == expert_idx)

            if idx.shape[0] == 0:
                continue

            # Tokens assigned to the current expert
            current_hidden_states = hidden_states[idx]

            # Expert forward pass
            current_hidden_states = expert_layer(current_hidden_states)

            # Weight by the routing score
            current_hidden_states *= routing_weights[idx, top_x, None]

            # Accumulate into the output
            final_hidden_states.index_add_(0, idx, current_hidden_states)

        # Restore the original shape
        return final_hidden_states.view(batch_size, sequence_length, hidden_dim)
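
The block above uses a plain two-layer MLP per expert. The released Mixtral models use a gated (SwiGLU-style) feed-forward network instead. A minimal sketch of such an expert follows; the class name `GatedExpert` is made up for this example, and the `w1`/`w2`/`w3` naming is borrowed from common open-source implementations rather than from this document.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedExpert(nn.Module):
    """Gated FFN expert: down(silu(gate(x)) * up(x)), with no biases."""

    def __init__(self, hidden_dim, ffn_dim):
        super().__init__()
        self.w1 = nn.Linear(hidden_dim, ffn_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(hidden_dim, ffn_dim, bias=False)  # up projection
        self.w2 = nn.Linear(ffn_dim, hidden_dim, bias=False)  # down projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

# Small sizes for a quick shape check
expert = GatedExpert(hidden_dim=16, ffn_dim=32)
out = expert(torch.randn(5, 16))
num_params = sum(p.numel() for p in expert.parameters())
print(out.shape, num_params)
```

Note that a gated expert has three weight matrices rather than two, which matters when estimating parameter counts.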

PR-MoE (Pyramid-Residual-MoE)

# DeepSpeed PR-MoE: up to 3× better parameter efficiency
deepspeed pretrain_gpt_moe.py \
  --num-layers 24 \
  --hidden-size 1024 \
  --num-attention-heads 16 \
  --num-experts "[128, 64, 32, 16]" \
  --mlp-type residual \
  --moe-expert-parallel-size 4 \
  --moe-loss-coeff 0.01 \
  --fp16

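Best practices

1. Choosing the number of experts

# Rule of thumb: more experts = more capacity, but with diminishing returns
# Typical settings:
# - Small models (1B to 7B): 8 to 16 experts
# - Medium models (7B to 30B): 8 to 64 experts
# - Large models (30B and up): 64 to 256 experts

# Example: Mixtral 8x7B
# Total parameters: about 47B (less than 8 × 7B = 56B, because attention and embedding weights are shared across experts)
# Active parameters: about 13B per token (top-2 routing; only the expert FFNs are duplicated)
# Efficiency: 47B of capacity for roughly 13B worth of compute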
2. Tuning the capacity factor

# capacity = (tokens_per_batch / num_experts) * capacity_factor

# Training: lower capacity (faster, some tokens are dropped)
train_capacity_factor = 1.25  # 25% buffer

# Evaluation: higher capacity (fewer or no drops)
eval_capacity_factor = 2.0    # 100% buffer

# Formula:
expert_capacity = int((seq_len * batch_size / num_experts) * capacity_factor)
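
A worked instance of the capacity formula, followed by a rough simulation of token dropping. The batch shape and the skewed routing distribution are invented for illustration; real routing statistics depend on the model and data.

```python
import torch

torch.manual_seed(0)
batch_size, seq_len, num_experts = 4, 2048, 8
capacity_factor = 1.25

tokens = batch_size * seq_len
expert_capacity = int((tokens / num_experts) * capacity_factor)  # 8192 / 8 * 1.25 = 1280

# Hypothetical skewed top-1 routing: expert 0 is twice as likely as any other expert
probs = torch.tensor([2.0] + [1.0] * (num_experts - 1))
assignment = torch.multinomial(probs, tokens, replacement=True)
load = torch.bincount(assignment, minlength=num_experts)

# Tokens beyond an expert's capacity would be dropped when drop_tokens is enabled
dropped = (load - expert_capacity).clamp(min=0)

print(expert_capacity, load.tolist(), dropped.tolist())
```

With a 25% buffer, only the over-subscribed expert loses tokens here. Raising the capacity factor reduces drops at the cost of more padding and compute per expert.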

3. Learning rate guidelines

# MoE models need a lower learning rate than dense models
# - Dense model: lr = 6e-4
# - MoE model: lr = 1e-4 (3 to 6 times lower)

# Also extend the decay schedule
dense_lr_decay_iters = 300000
moe_lr_decay_iters = 500000  # 1.5 to 2 times longer
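
For reference, here is a minimal sketch of a linear-warmup plus cosine-decay schedule using the values that appear in this document's launch commands (peak 1e-4, minimum 1e-5, 2000 warmup iterations, 500000 decay iterations). The helper name `lr_at` is made up, and the exact schedule implemented by a given training framework may differ in detail.

```python
import math

def lr_at(step, max_lr=1e-4, min_lr=1e-5, warmup_iters=2000, decay_iters=500000):
    """Learning rate at a given step: linear warmup, then cosine decay to min_lr."""
    if step < warmup_iters:
        return max_lr * step / warmup_iters
    progress = min(1.0, (step - warmup_iters) / (decay_iters - warmup_iters))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 1000, 2000, 250000, 500000):
    print(step, lr_at(step))
```

The rate climbs linearly to the peak at the end of warmup and settles at `min_lr` once `decay_iters` is reached.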

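4. Tuning loss coefficients

# Start from the standard values
moe_loss_coeff = 0.01    # auxiliary loss (load balancing)
router_z_loss_coeff = 0.001  # router z-loss (stability)

# If load imbalance persists, increase the auxiliary loss
if max_expert_usage / min_expert_usage > 2.0:
    moe_loss_coeff = 0.1  # stronger load balancing

# If training is unstable, increase the z-loss
if grad_norm > 10.0:
    router_z_loss_coeff = 0.01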
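
The imbalance check above needs a usage ratio. One way to obtain it from the router's top-k indices is sketched below; the helper name `expert_usage_ratio` and the sample indices are hypothetical.

```python
import torch

def expert_usage_ratio(expert_indices, num_experts):
    """Ratio of the most-used to the least-used expert (min count clamped to 1)."""
    counts = torch.bincount(expert_indices.flatten(), minlength=num_experts).float()
    return (counts.max() / counts.min().clamp(min=1.0)).item()

# Hypothetical top-2 assignments for 6 tokens over 4 experts
topk_indices = torch.tensor([[0, 1], [0, 2], [0, 3], [1, 0], [2, 0], [3, 1]])

ratio = expert_usage_ratio(topk_indices, num_experts=4)
print(ratio)  # expert 0 is used 5 times, the least-used experts 2 times
```

In this toy batch the ratio is 2.5, which would cross the 2.0 threshold used above. In practice the counts are usually averaged over many steps before reacting.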

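5. Avoid common pitfalls

from torch.optim import Adam

# ❌ Bad: same learning rate as a dense model
optimizer = Adam(model.parameters(), lr=6e-4)

# ✅ Good: lower learning rate for the MoE parameters
# (non_moe_params / moe_params stand for however your model exposes the two groups)
optimizer = Adam([
    {'params': model.non_moe_params, 'lr': 6e-4},
    {'params': model.moe_params, 'lr': 1e-4}
])

# ❌ Bad: no load balancing
loss = lm_loss

# ✅ Good: add the auxiliary losses
loss = lm_loss + 0.01 * aux_loss + 0.001 * z_loss

# ❌ Bad: too many experts for a small dataset
num_experts = 128  # overfitting risk

# ✅ Good: match the number of experts to the diversity of the data
num_experts = 8  # suitable for a small dataset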
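
Standard `nn.Module` objects do not expose `non_moe_params` or `moe_params` attributes, so the two groups have to be built by hand. One possible way is to split on parameter names, as sketched here with a made-up toy model. Keeping the router with the non-expert group is a choice made for this example, not a rule.

```python
import torch
import torch.nn as nn

class TinyMoEModel(nn.Module):
    """Toy model (hypothetical) with a dense projection, a router, and 4 experts."""

    def __init__(self, hidden=8, num_experts=4):
        super().__init__()
        self.attn_proj = nn.Linear(hidden, hidden)
        self.gate = nn.Linear(hidden, num_experts, bias=False)
        self.experts = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(num_experts)])

model = TinyMoEModel()

# Split parameters by name: anything under "experts." is an expert parameter
moe_params = [p for n, p in model.named_parameters() if n.startswith("experts.")]
non_moe_params = [p for n, p in model.named_parameters() if not n.startswith("experts.")]

optimizer = torch.optim.Adam([
    {"params": non_moe_params, "lr": 6e-4},
    {"params": moe_params, "lr": 1e-4},
])

print(len(non_moe_params), len(moe_params))
```

The name prefix depends on how the model is defined, so adjust the filter to match the module names in your own code.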

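Inference optimization

Sparse inference

# Run only the experts that were actually selected (large memory savings when experts are offloaded)
@torch.no_grad()
def moe_inference(x, model, top_k=2):
    """Sparse MoE inference: load and run only the selected experts."""
    # Router; x has shape (num_tokens, hidden_size)
    gate_logits = model.gate(x)
    topk_scores, topk_indices = torch.topk(
        torch.softmax(gate_logits, dim=-1),
        k=top_k,
        dim=-1
    )
    # Renormalize so the weights match training-time routing
    topk_scores = topk_scores / topk_scores.sum(dim=-1, keepdim=True)

    # Visit each selected expert once and run it only on its own tokens
    output = torch.zeros_like(x)
    for expert_id in torch.unique(topk_indices).tolist():
        # load_expert is a placeholder for fetching one expert from disk/offload
        expert = model.load_expert(expert_id)
        token_idx, slot = torch.where(topk_indices == expert_id)
        output[token_idx] += topk_scores[token_idx, slot, None] * expert(x[token_idx])

    return output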

Resources

DeepSpeed MoE tutorial: https://www.deepspeed.ai/tutorials/mixture-of-experts-nlg/
Mixtral paper: https://arxiv.org/abs/2401.04088
Switch Transformers: https://arxiv.org/abs/2101.03961
HuggingFace MoE guide: https://huggingface.co/blog/moe
NVIDIA MoE blog: https://developer.nvidia.com/blog/applying-mixture-of-experts-in-llm-architectures/

Details

Author: davila7
Repository: davila7/claude-code-templates
License: MIT
Last updated: unknown


Source: https://github.com/davila7/claude-code-templates / License: MIT