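Transform Real Text into Synthetic Text with an LLM

Converts real unstructured or semi-structured text records into synthetic counterparts in which all personally identifiable information, date-specific details, and location information are replaced, while semantic intent and structure are preserved. Claude rewrites each record according to explicit preserve/change rules.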

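When to Use

You have real text records that need anonymizing (customer reviews, medical records, support tickets, emails)
You want to preserve intent, sentiment, and structure while changing every specific detail
You need LLM-driven paraphrasing rather than simple PII substitution
You can afford the API cost of batch transformation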

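Information to Gather

Real data path (JSONL or CSV with text columns): the source records
Record type: a description of what the records represent (e.g. "customer support ticket", "medical record")
Fields to keep: fields left unchanged (e.g. "category", "priority")
Fields to transform: fields to rewrite (e.g. "body", "customer_name")
Preservation rules: explicit constraints (e.g. "preserve the length distribution", "preserve tone", "remove all dates")
Output path: where to save the transformed JSONL (default: ./synthetic-data-workspace/outputs/)
QA threshold: upper bound on similarity to the source (default: 0.75); records above it are flagged as too close to the source. The QA script below measures this with TF-IDF cosine similarity, which is lexical rather than semantic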
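
The inputs above can be collected into a single object and sanity-checked before any API call is made. The following is a minimal sketch: the class name `TransformConfig`, its field names, and the `validate` method are illustrative assumptions rather than part of the original skill, and only the two default values come from the list above.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class TransformConfig:
    """Hypothetical container for the inputs listed above."""
    input_path: str                                        # JSONL or CSV with text columns
    record_type: str                                       # e.g. "customer support ticket"
    keep_fields: list = field(default_factory=list)        # left unchanged
    transform_fields: list = field(default_factory=list)   # rewritten by the LLM
    preserve_rules: str = ""                               # e.g. "preserve tone"
    output_dir: str = "./synthetic-data-workspace/outputs/"
    qa_threshold: float = 0.75                             # similarity ceiling for QA

    def validate(self):
        """Return a list of problems; an empty list means the config is usable."""
        problems = []
        if Path(self.input_path).suffix.lower() not in {".jsonl", ".csv"}:
            problems.append("input_path must end in .jsonl or .csv")
        overlap = set(self.keep_fields) & set(self.transform_fields)
        if overlap:
            problems.append(f"fields both kept and transformed: {sorted(overlap)}")
        if not 0.0 < self.qa_threshold <= 1.0:
            problems.append("qa_threshold must be in (0, 1]")
        return problems
```

Checking for a field that appears in both lists is worth doing early, since the two lists give the model contradictory instructions for that field.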

Steps

Install the Anthropic SDK and dependencies:

pip install anthropic pandas tqdm numpy scikit-learn

Write a transformation prompt with explicit rules:

You are anonymising {record_type} for research/testing purposes.

Read this real record:
{real_record_json}

Rewrite it as a synthetic record following these rules:
- PRESERVE: semantic intent, tone, structure, technical content, logical flow
- PRESERVE: {preservation_rules}
- CHANGE: all names, places, dates, email addresses, phone numbers, URLs, IDs
- CHANGE: specific quoted text, proper nouns, organizational names
- MAINTAIN: field schema, word count approximately

Return only valid JSON, no markdown.
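
The placeholders in the template above can be filled with plain `str.format`. This is a sketch under the assumption that each record is a Python dict; the names `PROMPT_TEMPLATE` and `build_prompt` are hypothetical. Braces inside the record text are safe, because `str.format` does not re-parse substituted values.

```python
import json

PROMPT_TEMPLATE = """You are anonymising {record_type} for research/testing purposes.

Read this real record:
{real_record_json}

Rewrite it as a synthetic record following these rules:
- PRESERVE: semantic intent, tone, structure, technical content, logical flow
- PRESERVE: {preservation_rules}
- CHANGE: all names, places, dates, email addresses, phone numbers, URLs, IDs
- CHANGE: specific quoted text, proper nouns, organizational names
- MAINTAIN: field schema, word count approximately

Return only valid JSON, no markdown."""


def build_prompt(record, record_type, preservation_rules):
    """Fill the template for one record (a dict)."""
    return PROMPT_TEMPLATE.format(
        record_type=record_type,
        # ensure_ascii=False keeps non-ASCII text readable for the model
        real_record_json=json.dumps(record, indent=2, ensure_ascii=False),
        preservation_rules=preservation_rules,
    )
```

Keeping the template as a module-level constant makes it easy to review the rules separately from the batch logic.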

Write a batch transformation script with progress reporting and resume support:

import json
import anthropic
import pandas as pd
from pathlib import Path
import time

def transform_records(input_path, output_path, record_type,
                      preserve_rules, transform_fields, locale="en"):
    client = anthropic.Anthropic()
    
    # Load real records
    if input_path.endswith('.jsonl'):
        with open(input_path) as f:
            real_records = [json.loads(line) for line in f]
    else:  # CSV
        df = pd.read_csv(input_path)
        real_records = df.to_dict(orient='records')
    
    # Track progress and resume if interrupted. The source index of every
    # finished record is appended to a sidecar file, so a record skipped after
    # an error is retried on the next run instead of shifting the line count.
    # Delete both the output file and the sidecar to start over.
    done_path = Path(f"{output_path}.done")
    completed = set()
    if done_path.exists():
        completed = {int(tok) for tok in done_path.read_text().split()}

    synthetic_records = []

    for idx, real_record in enumerate(real_records):
        if idx in completed:
            continue

        record_json = json.dumps(real_record, indent=2)

        prompt = f"""Transform this {record_type} into a synthetic version:

{record_json}

Rules:
- PRESERVE semantic intent, tone, structure, technical details
- PRESERVE: {preserve_rules}
- REWRITE ONLY these fields: {', '.join(transform_fields)}; copy all other fields unchanged
- CHANGE ALL: names, locations, dates, emails, phone numbers, IDs, URLs
- CHANGE: specific quoted text and proper nouns
- MAINTAIN: approximate length and field schema

Return ONLY valid JSON, no markdown or explanation."""

        try:
            message = client.messages.create(
                model="claude-3-5-haiku-20241022",
                max_tokens=1000,  # raise for long records; truncated output fails to parse
                messages=[{"role": "user", "content": prompt}]
            )

            response_text = message.content[0].text.strip()

            # Strip a markdown code fence if the model added one anyway.
            # (str.lstrip('json') would strip characters, not the prefix.)
            if response_text.startswith('```'):
                response_text = response_text.split('```')[1]
                if response_text.startswith('json'):
                    response_text = response_text[len('json'):]
                response_text = response_text.strip()

            synthetic_record = json.loads(response_text)

            # Append to output (streaming write for resume capability),
            # then mark this source index as done
            with open(output_path, 'a') as f:
                f.write(json.dumps(synthetic_record) + '\n')
            with open(done_path, 'a') as f:
                f.write(f"{idx}\n")

            synthetic_records.append(synthetic_record)

            if (idx + 1) % 10 == 0:
                print(f"Transformed {idx + 1}/{len(real_records)} records...")

            # Politeness: small delay between API calls
            time.sleep(0.5)

        except json.JSONDecodeError as e:
            # Not marked as done, so the next run retries this record
            print(f"Warning: JSON parse error on record {idx}: {e}")
            continue
        except anthropic.APIError as e:
            # Also left unmarked; rerun the script to retry
            print(f"API error on record {idx}: {e}")
            time.sleep(2)
            continue

    print(f"Transformed {len(synthetic_records)} records to {output_path}")
    return synthetic_records

if __name__ == '__main__':
    transform_records(
        input_path="real_tickets.jsonl",
        output_path="synthetic_tickets.jsonl",
        record_type="customer support ticket",
        preserve_rules="priority level, issue category, technical complexity",
        transform_fields=["customer_name", "body", "email"]
    )

Optional: QA check for information leakage (flags records that are too close to the source):

import json

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def check_leakage(real_path, synth_path, threshold=0.75):
    # Both inputs are read as JSONL; convert a CSV source to JSONL first
    with open(real_path) as f:
        real_records = [json.loads(line) for line in f if line.strip()]
    with open(synth_path) as f:
        synth_records = [json.loads(line) for line in f if line.strip()]
    
    # Concatenate all text fields
    real_texts = [' '.join(str(v) for v in r.values()) for r in real_records]
    synth_texts = [' '.join(str(v) for v in r.values()) for r in synth_records]
    
    # Fit one TF-IDF vocabulary over both sets so the vectors are comparable
    vectorizer = TfidfVectorizer()
    all_texts = real_texts + synth_texts
    tfidf = vectorizer.fit_transform(all_texts)
    
    flagged = []
    for i, synth_idx in enumerate(range(len(real_texts), len(all_texts))):
        # Similarity of one synthetic record to every real record (1 x n_real)
        similarity = cosine_similarity(tfidf[synth_idx], tfidf[:len(real_texts)])
        max_sim = np.max(similarity)
        
        if max_sim > threshold:
            flagged.append({
                'synthetic_record_idx': i,
                'max_similarity': float(max_sim),
                'closest_real_idx': int(np.argmax(similarity))
            })
    
    if flagged:
        print(f"Flagged {len(flagged)} records with similarity > {threshold}")
        for item in flagged[:5]:
            print(f"  Record {item['synthetic_record_idx']}: similarity={item['max_similarity']:.3f}")
    else:
        print(f"All {len(synth_records)} records passed leakage check (similarity <= {threshold})")
    
    return flagged

if __name__ == '__main__':
    # Same file names as the transformation script above
    check_leakage("real_tickets.jsonl", "synthetic_tickets.jsonl")
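
TF-IDF cosine similarity is computed over the whole record, so a single email address or phone number carried over from the source can still pass the threshold. A complementary exact-match check could look like the sketch below. The function names and both regular expressions are illustrative assumptions; the patterns are deliberately simple and are not a complete PII detector.

```python
import re

# Illustrative patterns only; they will miss many identifier formats.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
DIGITS_RE = re.compile(r"\d[\d\s().-]{5,}\d")  # phone-like, date-like or ID-like runs


def identifiers(record):
    """Collect email-like and digit-run tokens from every value of a record."""
    text = " ".join(str(v) for v in record.values())
    found = {m.lower() for m in EMAIL_RE.findall(text)}
    # Normalise digit runs by dropping separators so "555-0100" equals "555 0100"
    found |= {re.sub(r"\D", "", m) for m in DIGITS_RE.findall(text)}
    return found


def find_carried_over(real_records, synth_records):
    """Return (synthetic_index, shared_identifiers) for each leaking record."""
    real_ids = set()
    for record in real_records:
        real_ids |= identifiers(record)
    leaks = []
    for i, record in enumerate(synth_records):
        shared = identifiers(record) & real_ids
        if shared:
            leaks.append((i, sorted(shared)))
    return leaks
```

Because the comparison is against identifiers from all real records, it does not depend on real and synthetic records being in the same order.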

Run the transformation and QA:

python transform_real_to_synth.py
python check_leakage.py  # optional QA

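Outputs / Side Effects

A JSONL file containing the transformed synthetic records
Records match the schema of the real data, with anonymized content
Optional: a QA report flagging residual similarity to the source
API cost (roughly $0.01 to $0.05 per record with Haiku, depending on length)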
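
The schema match described above is something the model is asked to do, not something the script enforces. A small per-record check can confirm it. This is a sketch, assuming records are flat dicts; `validate_record` and its `keep_fields` parameter are hypothetical names, and the function is meant to be called where the real and synthetic record are both at hand (for example right after parsing the response).

```python
def validate_record(real_record, synth_record, keep_fields):
    """Return a list of problems found in one synthetic record."""
    if not isinstance(synth_record, dict):
        return ["synthetic record is not a JSON object"]
    problems = []
    missing = set(real_record) - set(synth_record)
    extra = set(synth_record) - set(real_record)
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected fields: {sorted(extra)}")
    # Fields the user asked to keep must come back byte-for-byte identical
    for name in keep_fields:
        if name in real_record and synth_record.get(name) != real_record[name]:
            problems.append(f"kept field changed: {name}")
    return problems
```

A record with a non-empty problem list could be skipped in the same way as a JSON parse error, so that it is retried on a later run.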

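Security / Constraints

Leakage risk: the LLM may inadvertently retain identifying information; always run the QA check
Semantic fidelity: the transformation can alter nuance; validate the output against your domain requirements
API cost and rate limits: monitor usage and implement backoff for rate-limit errors
Resume: the script writes records incrementally, so an interrupted run can be restarted
Data handling: never commit real data to version control; process it locally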
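
The backoff mentioned above can be sketched with the standard library alone. The helper name `call_with_backoff` and its parameters are assumptions for illustration. With the Anthropic SDK, the exception to retry on would presumably be `anthropic.RateLimitError`; check the installed version before relying on that.

```python
import random
import time


def call_with_backoff(fn, retry_on, max_attempts=5, base_delay=1.0,
                      max_delay=30.0, sleep=time.sleep):
    """Call fn(); on an exception listed in retry_on, wait and try again."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            # Exponential delay (1s, 2s, 4s, ...) capped at max_delay,
            # scaled by random jitter so parallel workers do not retry in step
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))
```

In the batch script, the `client.messages.create(...)` call could be wrapped in a zero-argument lambda and passed as `fn`. The `sleep` parameter exists only so the waiting can be replaced in tests.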

License: MIT (quoted in full because the license is permissive) · Original repository
