
regex-vs-llm-structured-text

Name: regex-vs-llm-structured-text
Author: affaan-m

A decision framework for choosing between regular expressions and an LLM when parsing structured text. It recommends starting with regex and adding an LLM only for low-confidence edge cases.


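SKILL.md body

Regex vs LLM for Structured Text Parsing

A practical decision framework for parsing structured text such as quizzes, forms, invoices, and documents. The key insight is that regex handles 95-98% of cases cheaply and deterministically, so expensive LLM calls can be reserved for the remaining edge cases.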

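When to Use

Parsing structured text with repeating patterns (questions, forms, tables)
Deciding whether to use regex or an LLM for text extraction
Building hybrid pipelines that combine both approaches
Optimizing the cost/accuracy trade-off in text processing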

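Decision Framework

Is the text format consistent and repeating?
├── Yes (≥90% follows a pattern) → Start with regex
│   ├── Regex handles ≥95% → Done, no LLM needed
│   └── Regex handles <95% → Add an LLM for edge cases only
└── No (free-form, highly variable) → Use an LLM directly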
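
The tree above can also be read as a small function. The following is a minimal sketch of that reading: the name `choose_strategy`, its parameters, and the returned labels are illustrative assumptions, and only the 90% and 95% cut-offs come from the tree itself.

```python
from typing import Optional


def choose_strategy(
    pattern_conformance: float,
    regex_success: Optional[float] = None,
) -> str:
    """Encode the decision tree; both inputs are fractions between 0 and 1.

    pattern_conformance: share of the input that follows a repeating pattern.
    regex_success: share of items the regex parsed correctly, once measured.
    """
    if pattern_conformance < 0.90:
        # Free-form or highly variable text: skip regex entirely.
        return "llm_direct"
    if regex_success is None:
        # Consistent format, nothing measured yet: build the regex first.
        return "start_with_regex"
    if regex_success >= 0.95:
        return "regex_only"
    return "regex_plus_llm_edge_cases"


print(choose_strategy(0.97, 0.98))
```

How the two fractions are measured is left open here; counting matched items against a hand-checked sample is one option.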

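Architecture Pattern

[Regex Parser] ─── extracts structure (95-98% accuracy)
    │
    ▼
[Text Cleaner] ─── removes noise (markers, page numbers, artifacts)
    │
    ▼
[Confidence Scorer] ─── flags low-confidence extractions
    │
    ├── High confidence (≥0.95) → direct output
    │
    └── Low confidence (<0.95) → [LLM Validator] → output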
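
The Text Cleaner stage is not spelled out in the implementation that follows, so here is a rough sketch of what it might look like. Everything in it is an assumption made for illustration: the name `clean_text`, the three noise patterns (page-number lines, bracketed mark annotations, invisible characters), and the choice to work on plain strings. Real documents will need their own pattern list.

```python
import re

# Hypothetical noise patterns; adjust to the documents actually being parsed.
PAGE_LINE = re.compile(
    r"^\s*(?:Page\s+\d+(?:\s+of\s+\d+)?|-\s*\d+\s*-)\s*$", re.IGNORECASE
)
INLINE_MARKER = re.compile(r"\[\d+\s*(?:marks?|pts?|points?)\]", re.IGNORECASE)
ARTIFACTS = re.compile("[\u200b\ufeff\x0c]")  # zero-width space, BOM, form feed


def clean_text(raw: str) -> str:
    """Return a cleaned copy of raw; the input string is never modified."""
    # Remove invisible characters first: str.splitlines() treats a form
    # feed as a line break, which would otherwise split lines unexpectedly.
    raw = ARTIFACTS.sub("", raw)
    kept = []
    for line in raw.splitlines():
        if PAGE_LINE.match(line):
            continue  # drop lines that contain only a page number
        kept.append(INLINE_MARKER.sub("", line).rstrip())
    return "\n".join(kept)


sample = "1. What is 2+2? [2 marks]\nA. 3\nPage 3 of 10\nB. 4\n\ufeffAnswer: B"
print(clean_text(sample))
```

The same function can be applied to an extracted field instead of the whole document, which matches the position of the cleaner after the parser in the diagram.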

Implementation

1. Regex Parser (Handles Most Cases)

import re
from dataclasses import dataclass

@dataclass(frozen=True)
class ParsedItem:
    id: str
    text: str
    choices: tuple[str, ...]
    answer: str
    confidence: float = 1.0

def parse_structured_text(content: str) -> list[ParsedItem]:
    """Parse structured text using regex patterns."""
    pattern = re.compile(
        r"(?P<id>\d+)\.\s*(?P<text>.+?)\n"
        r"(?P<choices>(?:[A-D]\..+?\n)+)"
        r"Answer:\s*(?P<answer>[A-D])",
        re.MULTILINE | re.DOTALL,
    )
    items = []
    for match in pattern.finditer(content):
        choices = tuple(
            c.strip() for c in re.findall(r"[A-D]\.\s*(.+)", match.group("choices"))
        )
        items.append(ParsedItem(
            id=match.group("id"),
            text=match.group("text").strip(),
            choices=choices,
            answer=match.group("answer"),
        ))
    return items
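
A regex parser does not raise when an item fails to match; the item is simply absent from the result. One cheap check is to compare the number of parsed items with the number of lines that look like the start of an item. The sketch below assumes every item begins on its own line with a number and a period, as in the pattern above; `ITEM_START` and `regex_success_rate` are hypothetical names, not part of the original skill.

```python
import re

# Assumption: an item starts on its own line as "<number>. ".
ITEM_START = re.compile(r"^\d+\.\s", re.MULTILINE)


def regex_success_rate(content: str, parsed_count: int) -> float:
    """Estimate the share of items the parser captured (0.0 to 1.0)."""
    expected = len(ITEM_START.findall(content))
    if expected == 0:
        # No recognizable item starts: treat the text as not covered.
        return 0.0
    return min(1.0, parsed_count / expected)


doc = (
    "1. First?\nA. x\nB. y\nAnswer: A\n"
    "2. Second?\nA. x\nAnswer: A\n"
    "3. Third?\n"
    "4. Fourth?\n"
)
print(regex_success_rate(doc, parsed_count=3))
```

In use, `parsed_count` would be `len(parse_structured_text(content))`. The estimate is only as good as the item-start heuristic, so treat a low value as a prompt to inspect the input by hand.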

2. Confidence Scoring

Flag items that may need LLM review:

@dataclass(frozen=True)
class ConfidenceFlag:
    item_id: str
    score: float
    reasons: tuple[str, ...]

def score_confidence(item: ParsedItem) -> ConfidenceFlag:
    """Score extraction confidence and flag issues."""
    reasons = []
    score = 1.0

    if len(item.choices) < 3:
        reasons.append("few_choices")
        score -= 0.3

    if not item.answer:
        reasons.append("missing_answer")
        score -= 0.5

    if len(item.text) < 10:
        reasons.append("short_text")
        score -= 0.2

    return ConfidenceFlag(
        item_id=item.id,
        score=max(0.0, score),
        reasons=tuple(reasons),
    )

def identify_low_confidence(
    items: list[ParsedItem],
    threshold: float = 0.95,
) -> list[ConfidenceFlag]:
    """Return items below confidence threshold."""
    flags = [score_confidence(item) for item in items]
    return [f for f in flags if f.score < threshold]
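
With the penalties used above, the highest score a flagged item can reach is 0.8, so any threshold above 0.8 and up to 1.0 selects the same set: every item with at least one flag. The sketch below enumerates the reachable scores to make that visible. The `PENALTIES` table restates the values from `score_confidence`; the helper names are illustrative.

```python
from itertools import combinations

# Penalty values restated from score_confidence above.
PENALTIES = {"few_choices": 0.3, "missing_answer": 0.5, "short_text": 0.2}


def score_for(reasons: tuple[str, ...]) -> float:
    """Score an item receives when exactly these reasons are flagged."""
    return max(0.0, 1.0 - sum(PENALTIES[r] for r in reasons))


def reachable_scores() -> dict[tuple[str, ...], float]:
    """Map every combination of reasons to its score, rounded for display."""
    names = sorted(PENALTIES)
    return {
        combo: round(score_for(combo), 2)
        for size in range(len(names) + 1)
        for combo in combinations(names, size)
    }


for combo, score in reachable_scores().items():
    print(combo or ("no flags",), score)
```

If finer control is wanted, for example tolerating `short_text` alone, the threshold would have to drop to 0.8 or below, or the penalties would need different weights.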

3. LLM Validator (Edge Cases Only)

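import json
from dataclasses import replace

def validate_with_llm(
    item: ParsedItem,
    original_text: str,
    client,
) -> ParsedItem:
    """Use LLM to fix low-confidence extractions."""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Cheapest model for validation
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": (
                f"Extract the question, choices, and answer from this text.\n\n"
                f"Text: {original_text}\n\n"
                f"Current extraction: {item}\n\n"
                f"Return corrected JSON with keys 'text', 'choices', 'answer' "
                f"if needed, or 'CORRECT' if accurate."
            ),
        }],
    )
    # Assumption: the reply is either the bare word CORRECT or a JSON object
    # with the keys named in the prompt. Anything else keeps the regex result.
    reply = response.content[0].text.strip()
    if reply == "CORRECT":
        return item
    try:
        data = json.loads(reply)
        return replace(  # new instance; the frozen input item is not mutated
            item,
            text=str(data["text"]),
            choices=tuple(str(c) for c in data["choices"]),
            answer=str(data["answer"]),
        )
    except (json.JSONDecodeError, KeyError, TypeError):
        return item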

4. Hybrid Pipeline

def process_document(
    content: str,
    *,
    llm_client=None,
    confidence_threshold: float = 0.95,
) -> list[ParsedItem]:
    """Full pipeline: regex -> confidence check -> LLM for edge cases."""
    # Step 1: Regex extraction (handles 95-98%)
    items = parse_structured_text(content)

    # Step 2: Confidence scoring
    low_confidence = identify_low_confidence(items, confidence_threshold)

    if not low_confidence or llm_client is None:
        return items

    # Step 3: LLM validation (only for flagged items)
    low_conf_ids = {f.item_id for f in low_confidence}
    result = []
    for item in items:
        if item.id in low_conf_ids:
            result.append(validate_with_llm(item, content, llm_client))
        else:
            result.append(item)

    return result

Real-World Metrics

From a production quiz-parsing pipeline (410 items):

Metric	Value
Regex success rate	98.0%
Low-confidence items	8 (2.0%)
LLM calls needed	~5
Cost savings vs. all-LLM	~95%
Test coverage	93%
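
The percentages in the table can be checked against the raw counts. The sketch below does that arithmetic under two assumptions that the source does not state: that the success rate is the share of items not flagged as low-confidence, and that an all-LLM baseline would make one call per item. The ~95% cost figure is the source's own number and is not derived here, since cost also depends on tokens per call.

```python
def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


TOTAL_ITEMS = 410    # from the table
LOW_CONFIDENCE = 8   # from the table
LLM_CALLS = 5        # the table reports "~5", so this is approximate

low_conf_share = percent(LOW_CONFIDENCE, TOTAL_ITEMS)
clean_share = percent(TOTAL_ITEMS - LOW_CONFIDENCE, TOTAL_ITEMS)
# Assumption: the all-LLM baseline makes exactly one call per item.
calls_avoided = percent(TOTAL_ITEMS - LLM_CALLS, TOTAL_ITEMS)

print(low_conf_share, clean_share, calls_avoided)
```

The first two values agree with the 2.0% and 98.0% rows. The third is a call-count reduction only and should not be read as a cost figure.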

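Best Practices

Start with regex: even an imperfect regex gives you a baseline to improve on
Use confidence scoring to identify programmatically which items need an LLM
Use the cheapest LLM for validation (Haiku-class models are sufficient)
Never mutate parsed items: return new instances from cleaning/validation steps
TDD works well for parsers: write tests for known patterns first, then edge cases
Log metrics (regex success rate, LLM call count) to track pipeline health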
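
The "never mutate" rule can be enforced by the data type itself. The sketch below repeats the `ParsedItem` definition from the implementation section so that it runs on its own; `strip_text` is a hypothetical cleaning step added for illustration. It relies on `dataclasses.replace`, which builds a new instance with selected fields changed.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class ParsedItem:
    id: str
    text: str
    choices: tuple[str, ...]
    answer: str
    confidence: float = 1.0


def strip_text(item: ParsedItem) -> ParsedItem:
    """Hypothetical cleaning step: returns a new item, input stays untouched."""
    return replace(item, text=item.text.strip())


original = ParsedItem(id="1", text="  What is 2 + 2?  ", choices=("3", "4"), answer="B")
cleaned = strip_text(original)

try:
    original.text = "changed"  # frozen dataclasses reject attribute assignment
except FrozenInstanceError:
    print("mutation blocked; original is unchanged")
```

Because every step returns a fresh instance, the regex output stays available for comparison against the cleaned or LLM-corrected version.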

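Anti-Patterns to Avoid

Sending all text to an LLM when regex handles 95% or more of cases (expensive and slow)
Using regex for free-form, highly variable text (an LLM is the better fit)
Skipping confidence scoring and hoping the regex "just works"
Mutating parsed objects during cleaning/validation steps
Not testing edge cases (malformed input, missing fields, encoding issues)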

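Use Cases

Quiz/exam question parsing
Form data extraction
Invoice/receipt processing
Document structure parsing (headings, sections, tables)
Any structured text with repeating patterns where cost matters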

License: MIT (quoted in full because the license is permissive)

Details

Author: affaan-m
Repository: affaan-m/everything-claude-code
License: MIT
Last updated: unknown


Source: https://github.com/affaan-m/everything-claude-code / License: MIT
