
sentencepiece

Name: sentencepiece
Author: davila7

Language-independent tokenizer that treats text as raw Unicode and supports the BPE and Unigram algorithms. Fast (50k sentences/sec), lightweight (6MB memory), with a deterministic vocabulary. Used by T5, ALBERT, XLNet, and mBART. It trains on raw text without pre-tokenization. Use it when you need multilingual support, CJK languages, or reproducible tokenization.

SKILL.md content

SentencePiece - Language-Independent Tokenization

An unsupervised tokenizer that works directly on raw text, with no language-specific preprocessing.

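When to use SentencePiece

Use SentencePiece when you:

Build multilingual models (no language-specific rules)
Work with CJK languages (Chinese, Japanese, Korean)
Need reproducible tokenization (deterministic vocabulary)
Want to train on raw text (no pre-tokenization needed)
Need a lightweight deployment (6MB memory, 50k sentences/sec)

Performance:

Speed: 50,000 sentences/sec
Memory: about 6MB for a loaded model
Languages: all (language-independent)

Use an alternative instead:

HuggingFace Tokenizers: faster training, more flexibility
tiktoken: OpenAI models (GPT-3.5/4)
BERT WordPiece: English-centric tasks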

Quick start

Installation

# Python
pip install sentencepiece

# C++ (requires CMake)
git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build && cd build
cmake .. && make -j $(nproc)
sudo make install

Training a model

# Command line (BPE, vocabulary size 8000)
spm_train --input=data.txt --model_prefix=m --vocab_size=8000 --model_type=bpe

# Python API
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='m',
    vocab_size=8000,
    model_type='bpe'
)

Training time: about 1-2 minutes for a 100MB corpus

Encoding and decoding

import sentencepiece as spm

# Load the trained model
sp = spm.SentencePieceProcessor(model_file='m.model')

# Encode to pieces (exact pieces depend on the trained model)
pieces = sp.encode('This is a test', out_type=str)
print(pieces)  # e.g. ['▁This', '▁is', '▁a', '▁test']

# Encode to IDs (exact IDs depend on the trained model)
ids = sp.encode('This is a test', out_type=int)
print(ids)  # e.g. [284, 47, 11, 1243]

# Decode IDs back to text
text = sp.decode(ids)
print(text)  # "This is a test"

Language-independent design

Whitespace as a symbol (▁)

text = "Hello world"
pieces = sp.encode(text, out_type=str)
print(pieces)  # ['▁Hello', '▁world']

# Decoding restores the whitespace
decoded = sp.decode_pieces(pieces)
print(decoded)  # "Hello world"

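Key principle: text is treated as raw Unicode, and whitespace is encoded as ▁ (a meta symbol), so the original spacing can be recovered on decode.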
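To make the meta-symbol idea concrete, here is a minimal pure-Python sketch of the round trip. It is an illustration only, not the SentencePiece implementation: the helper names (`escape_whitespace`, `split_on_meta`, `detokenize`) are hypothetical, and the split is a naive word-level one rather than a learned subword segmentation.

```python
# Toy illustration of the whitespace meta symbol; not the library's code.
META = "\u2581"  # the "▁" character

def escape_whitespace(text: str) -> str:
    """Mark every space with the meta symbol and add one at the start."""
    return META + text.replace(" ", META)

def split_on_meta(escaped: str) -> list[str]:
    """Hypothetical word-level split: start a new piece at each meta symbol."""
    pieces = []
    for ch in escaped:
        if ch == META or not pieces:
            pieces.append(ch)
        else:
            pieces[-1] += ch
    return pieces

def detokenize(pieces: list[str]) -> str:
    """Concatenate pieces, turn meta symbols back into spaces, drop the leading one."""
    return "".join(pieces).replace(META, " ")[1:]

pieces = split_on_meta(escape_whitespace("Hello world"))
print(pieces)              # ['▁Hello', '▁world']
print(detokenize(pieces))  # Hello world
```

Because the space is stored inside the pieces themselves, detokenization is plain string concatenation and needs no language-specific rules.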

Tokenization algorithms

BPE (Byte-Pair Encoding)

spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='bpe_model',
    vocab_size=16000,
    model_type='bpe'
)

Used by: mBART
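For readers unfamiliar with BPE, the following is a sketch of the textbook merge-learning loop on a toy word-frequency table. It is not how SentencePiece implements its BPE trainer (which works on raw sentences and is optimized differently); the function name `learn_bpe` and the toy corpus are assumptions made for illustration.

```python
from collections import Counter

def learn_bpe(words: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    """Learn merge rules from a word -> frequency table (toy version)."""
    # Start with each word split into single characters.
    vocab = {tuple(word): freq for word, freq in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the first pair seen
        merges.append(best)
        # Rewrite every word with the best pair fused into one symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges

corpus = {"▁low": 5, "▁lower": 2, "▁newest": 6, "▁widest": 3}
print(learn_bpe(corpus, 3))  # [('e', 's'), ('es', 't'), ('▁', 'l')]
```

Each merge adds one symbol to the vocabulary, so the number of merges controls the final vocabulary size.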

Unigram (default)

spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='unigram_model',
    vocab_size=8000,
    model_type='unigram'
)

Used by: T5, ALBERT, XLNet
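A Unigram model assigns each piece a probability and tokenizes by choosing the segmentation with the highest total probability. The sketch below shows that search (Viterbi over character positions) in pure Python. The piece log-probabilities are invented for illustration and `viterbi_segment` is a hypothetical helper, not a SentencePiece API.

```python
import math

def viterbi_segment(text: str, logp: dict[str, float]) -> list[str]:
    """Return the segmentation of `text` with the highest total log-probability."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end to recover the pieces.
    pieces, end = [], n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1]

# Hypothetical piece log-probabilities, made up for illustration.
toy_logp = {"▁token": -3.0, "ization": -4.0, "▁tok": -5.0, "en": -3.5,
            "▁": -2.0, "t": -6.0, "o": -6.0, "k": -6.0, "e": -6.0, "n": -6.0,
            "i": -6.0, "z": -6.0, "a": -6.0}
print(viterbi_segment("▁tokenization", toy_logp))  # ['▁token', 'ization']
```

Unlike BPE, which replays a fixed merge list, this formulation scores whole segmentations, which is what makes sampling alternative tokenizations possible.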

Training configuration

Essential parameters

spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='m',
    vocab_size=32000,
    model_type='unigram',
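    character_coverage=0.9995,  # default; use 1.0 for small character sets
    user_defined_symbols=['[SEP]', '[CLS]'],
    unk_piece='<unk>',
    num_threads=16
)

Character coverage

The values below follow the defaults recommended in the SentencePiece README: 0.9995 for languages with a rich character set, 1.0 for languages with a small one.

Language type	Coverage	Reason
Small character set (e.g. English)	1.0	Every character can be kept cheaply
Rich character set (Chinese, Japanese)	0.9995	Very rare characters are dropped and map to <unk>
Multilingual	0.9995	Balance between coverage and vocabulary size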
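As a rough illustration of what a coverage value means, the sketch below keeps the most frequent characters until their occurrences reach the requested fraction of the corpus. This is an approximation of the idea, not SentencePiece's exact procedure, and `chars_for_coverage` is a hypothetical helper.

```python
from collections import Counter

def chars_for_coverage(corpus: str, coverage: float) -> set[str]:
    """Smallest set of most-frequent characters whose occurrences make up
    at least `coverage` of all characters in `corpus`."""
    counts = Counter(corpus)
    total = sum(counts.values())
    kept, covered = set(), 0
    for ch, n in counts.most_common():
        if covered >= coverage * total:
            break
        kept.add(ch)
        covered += n
    return kept

toy = "a" * 9990 + "b" * 9 + "鬱"  # 10,000 characters, one of them very rare
print(len(chars_for_coverage(toy, 1.0)))     # 3
print(len(chars_for_coverage(toy, 0.9995)))  # 2
```

At 0.9995 the single rare character is left out of the alphabet, which is the trade-off the coverage setting controls: a smaller base alphabet in exchange for some characters becoming unknown.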

Encoding options

Subword regularization

# Sample a different tokenization on each call
for _ in range(3):
    pieces = sp.encode('tokenization', out_type=str, enable_sampling=True, alpha=0.1)
    print(pieces)

# Output (differs from run to run), for example:
# ['▁token', 'ization']
# ['▁tok', 'en', 'ization']

Use case: data augmentation for robustness.
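The role of `alpha` can be sketched with a toy model: each candidate segmentation is drawn with probability proportional to its probability raised to the power `alpha`, so a small `alpha` flattens the distribution and a large one concentrates on the best split. The code below enumerates every segmentation, which is only feasible for toy inputs and is not how the library samples. The log-probabilities and helper names are assumptions.

```python
import math
import random

def all_segmentations(text, logp):
    """Yield (pieces, total log-probability) for every in-vocabulary split."""
    if not text:
        yield [], 0.0
        return
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in logp:
            for rest, score in all_segmentations(text[end:], logp):
                yield [head] + rest, logp[head] + score

def sample_segmentation(text, logp, alpha, rng):
    """Draw one segmentation with probability proportional to P(segmentation) ** alpha."""
    candidates = list(all_segmentations(text, logp))
    weights = [math.exp(alpha * score) for _, score in candidates]
    pieces, _ = rng.choices(candidates, weights=weights, k=1)[0]
    return pieces

# Hypothetical piece log-probabilities, made up for illustration.
toy_logp = {"▁token": -3.0, "ization": -4.0, "▁tok": -5.0, "en": -3.5}
rng = random.Random(0)
for _ in range(3):
    print(sample_segmentation("▁tokenization", toy_logp, alpha=0.1, rng=rng))
```

With `alpha=0.1` the two candidate splits are drawn with fairly similar probability; raising `alpha` makes the higher-scoring split dominate.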

Common patterns

T5-style training

spm.SentencePieceTrainer.train(
    input='c4_corpus.txt',
    model_prefix='t5',
    vocab_size=32000,
    model_type='unigram',
    user_defined_symbols=[f'<extra_id_{i}>' for i in range(100)],
    pad_id=0,
    eos_id=1,
    unk_id=2,
    bos_id=-1  # disable <s>; its default ID (1) would collide with eos_id
)

Integration with Transformers

from transformers import T5Tokenizer

# T5 uses SentencePiece internally
tokenizer = T5Tokenizer.from_pretrained('t5-base')
inputs = tokenizer('translate English to French: Hello', return_tensors='pt')

Performance benchmarks

Training speed

Corpus	BPE (16k)	Unigram (8k)
100 MB	1-2 min	3-4 min
1 GB	10-15 min	30-40 min

Tokenization speed

SentencePiece: 50,000 sentences/sec
HF Tokenizers: 200,000 sentences/sec (4x faster)
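Throughput figures like these depend heavily on hardware, sentence length, and vocabulary, so it can be worth measuring on your own data. The helper below is a minimal timing sketch. `sentences_per_second` is a hypothetical function, and `str.split` stands in for a real tokenizer so the example runs without a trained model.

```python
import time

def sentences_per_second(encode, sentences, repeats=3):
    """Best-of-`repeats` throughput of `encode` over `sentences`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for s in sentences:
            encode(s)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on very coarse timers.
    return len(sentences) / max(best, 1e-9)

# Stand-in tokenizer so the sketch runs on its own; swap in e.g. `sp.encode`.
sample = ["This is a test"] * 10_000
rate = sentences_per_second(str.split, sample)
print(f"{rate:,.0f} sentences/sec")
```

Taking the best of several runs reduces noise from warm-up and other processes; passing the whole list to a tokenizer's batch interface, where one exists, may give different numbers than this per-sentence loop.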

Supported models

T5 family: t5-base, t5-large (32k vocabulary, Unigram)
ALBERT: albert-base-v2 (30k vocabulary, Unigram)
XLNet: xlnet-base-cased (32k vocabulary, Unigram)
mBART: facebook/mbart-large-50 (250k vocabulary, BPE)

References

Training guide - detailed options, corpus preparation
Algorithms - BPE vs Unigram, subword regularization

Resources

GitHub: https://github.com/google/sentencepiece ⭐ 10,000+
Paper: https://arxiv.org/abs/1808.06226 (EMNLP 2018)
Version: 0.2.0+

Details

Author: davila7
Repository: davila7/claude-code-templates
License: MIT
Source: https://github.com/davila7/claude-code-templates
