イベント重複排除

複数ソースのデータに対する正規化選択、レピュテーションスコアリング、ハッシュベースのグループ化。

このスキルを使用する場合

複数のソース(ニュース、イベント、製品)からデータを集約する場合
同じコンテンツが異なるアウトレット/ソースから現れる場合
重複から「最良」版を選択する必要がある場合
最適化のために重複排除メトリクスを追跡する場合

コアコンセプト

単純なURLの重複排除では不十分です。本番環境では以下が必要です:

セマンティック類似性によるグループ化(同じストーリー、異なるアウトレット)
正規化選択(「最良」版を選択)
レピュテーションスコアリング(信頼性の高いソースを優先)
ID ベースとコンテンツベースの両方の重複排除

2つのモード:

IDベース: ソースがユニークIDを持つ場合、IDが衝突したときに「最良」版を保持
コンテンツベース: セマンティック類似性でグループ化し、各グループから正規化を選択

実装

TypeScript

import { createHash } from 'crypto';

interface DeduplicationResult<T> {
  items: T[];
  originalCount: number;
  dedupedCount: number;
  reductionPercent: number;
  duplicateGroups?: number;
}

// ============================================
// ID-Based Deduplication
// ============================================

function deduplicateById<T extends { id: string }>(
  items: T[],
  preferFn: (existing: T, candidate: T) => T
): DeduplicationResult<T> {
  const seen = new Map<string, T>();
  
  for (const item of items) {
    const existing = seen.get(item.id);
    if (existing) {
      seen.set(item.id, preferFn(existing, item));
    } else {
      seen.set(item.id, item);
    }
  }
  
  const dedupedItems = Array.from(seen.values());
  const reductionPercent = items.length > 0
    ? Math.round((1 - dedupedItems.length / items.length) * 100)
    : 0;
  
  return {
    items: dedupedItems,
    originalCount: items.length,
    dedupedCount: dedupedItems.length,
    reductionPercent,
  };
}

// ============================================
// Content-Based Deduplication
// ============================================

interface Article {
  title: string;
  url: string;
  domain: string;
  publishedAt: string;
  tone?: number;
}

/**
 * Generate deduplication key from content
 * Groups by: normalized title + source country + date
 */
function generateDedupKey(article: Article): string {
  const normalizedTitle = article.title
    .toLowerCase()
    .replace(/[^\w\s]/g, '')
    .trim()
    .slice(0, 50);

  const dateStr = article.publishedAt?.slice(0, 10).replace(/-/g, '') || 'unknown';

  return `${normalizedTitle}|${dateStr}`;
}

/**
 * Generate unique ID from URL
 */
function generateEventId(url: string): string {
  return createHash('md5').update(url).digest('hex').slice(0, 12);
}

/**
 * Source reputation scoring
 */
function getReputationScore(domain: string): number {
  // Tier 1: Wire services and major international
  const tier1 = ['reuters.com', 'apnews.com', 'bbc.com', 'bbc.co.uk', 
                 'aljazeera.com', 'france24.com', 'dw.com'];
  if (tier1.some(r => domain.includes(r))) return 100;
  
  // Tier 2: Major newspapers
  const tier2 = ['nytimes.com', 'washingtonpost.com', 'theguardian.com', 
                 'ft.com', 'economist.com', 'wsj.com'];
  if (tier2.some(r => domain.includes(r))) return 75;
  
  // Tier 3: Regional/national
  const tier3 = ['cnn.com', 'foxnews.com', 'nbcnews.com', 'abcnews.go.com'];
  if (tier3.some(r => domain.includes(r))) return 50;
  
  return 10;
}

/**
 * Select canonical article from duplicate group
 */
function selectCanonical<T extends Article>(
  group: { item: T; source: string }[]
): { item: T; source: string } {
  return group.reduce((best, current) => {
    const bestScore = getReputationScore(best.item.domain) + 
                      Math.abs(best.item.tone || 0);
    const currentScore = getReputationScore(current.item.domain) + 
                         Math.abs(current.item.tone || 0);
    
    return currentScore > bestScore ? current : best;
  });
}

/**
 * Deduplicate articles from multiple sources
 */
function deduplicateArticles<T extends Article>(
  sourceResults: { sourceName: string; articles: T[] }[]
): DeduplicationResult<T & { source: string }> {
  const groups = new Map<string, { item: T; source: string }[]>();
  let totalArticles = 0;

  // Group articles by dedup key
  for (const { sourceName, articles } of sourceResults) {
    for (const article of articles) {
      totalArticles++;
      const key = generateDedupKey(article);
      
      if (!groups.has(key)) {
        groups.set(key, []);
      }
      groups.get(key)!.push({ item: article, source: sourceName });
    }
  }

  // Select canonical article from each group
  const items: (T & { source: string })[] = [];
  
  for (const group of groups.values()) {
    const canonical = selectCanonical(group);
    items.push({ ...canonical.item, source: canonical.source });
  }

  const reductionPercent = totalArticles > 0 
    ? Math.round((1 - items.length / totalArticles) * 100)
    : 0;

  console.log(`[Dedup] ${totalArticles} → ${items.length} (${reductionPercent}% reduction)`);

  return {
    items,
    originalCount: totalArticles,
    dedupedCount: items.length,
    reductionPercent,
    duplicateGroups: groups.size,
  };
}

使用例

ID ベースの重複排除

const events = await fetchEvents();

const result = deduplicateById(events, (existing, candidate) => {
  // Prefer events with coordinates
  if (!existing.lat && candidate.lat) return candidate;
  // Prefer higher sentiment magnitude
  if (Math.abs(candidate.sentiment) > Math.abs(existing.sentiment)) {
    return candidate;
  }
  return existing;
});

console.log(`Reduced ${result.reductionPercent}% duplicates`);

複数ソースの集約

const results = await Promise.all([
  fetchFromSourceA(),
  fetchFromSourceB(),
  fetchFromSourceC(),
]);

const { items, reductionPercent } = deduplicateArticles([
  { sourceName: 'source-a', articles: results[0] },
  { sourceName: 'source-b', articles: results[1] },
  { sourceName: 'source-c', articles: results[2] },
]);

// items now contains canonical articles with source attribution

ベストプラクティス

セマンティックグループ化 - URLだけでなく、正規化されたコンテンツでグループ化します
レピュテーションスコアリング - 正規化の対象として信頼性の高いソースを優先します
最良版の選択 - ID が衝突するとき、最も多くのデータを持つバージョンを保持します
削減追跡 - 重複排除がどの程度役に立ったかをログに記録します
ソース属性 - 正規化がどのソースから来たかを追跡します

一般的な誤り

単純なURL重複排除(異なるアウトレットからの同じストーリーを見落とす)
重複からのランダム選択(品質信号を失う)
正規化がない(大文字小文字/句読点の違いが偽陰性を作成する)
削減メトリクスを追跡していない(最適化できない)
ハードコードされたソースリスト(設定可能にします)

deduplication

SKILL.md 本文

イベント重複排除

このスキルを使用する場合

コアコンセプト

実装

TypeScript

使用例

ID ベースの重複排除

複数ソースの集約

ベストプラクティス

一般的な誤り

関連パターン

詳細情報