
video-temporal-reasoning

Name: video-temporal-reasoning
Author: ADu2021

Use SpookyBench to diagnose and improve temporal pattern recognition in video-language models. SpookyBench isolates temporal information from spatial cues, so it tests whether a model truly understands time.

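SKILL.md content

Improving Temporal Reasoning When Spatial Information Is Obscured

Video-language models are good at recognizing obvious spatiotemporal patterns, but they struggle when only temporal information is available. SpookyBench exposes this blind spot: humans can recognize temporal patterns from purely temporal sequences (such as biological signals or communication protocols), whereas current models fail. The gap reflects a fundamental limitation in how models process temporal relationships.

The core problem is architectural. Most vision-language models encode frames once into a key-value cache and then reason purely in text space. This single-pass encoding favors static spatial features and discards temporal dynamics. Humans, by contrast, actively track change over time and integrate it into their reasoning. Addressing this requires architectural changes that allow temporal patterns to be extracted regardless of spatial information.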

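Core Concepts

Temporal blindness occurs when spatial information dominates temporal pattern recognition. SpookyBench isolates temporal information in visually noisy frames, where:

Spatial obscurity: information is encoded in noise-like images with no clear spatial pattern
Temporal encoding: the temporal sequence carries all of the meaningful information
Progressive disclosure: humans recognize the pattern gradually; models fail consistently

The benchmark covers:

Biological signaling patterns (neurons, DNA sequences rendered as visual frames)
Covert communication protocols
Temporal state machines
Time-series patterns (stock movements, audio-like patterns)

Improving temporal reasoning requires that a model extract and reason over temporal sequences independently, rather than as a by-product of spatial encoding.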
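
A toy example can make the "by-product" point concrete. The sketch below is not taken from SpookyBench; it assumes each frame has already been reduced to a single integer brightness value, and the helper names `mean_pool` and `frame_deltas` are hypothetical. It shows that an order-agnostic summary cannot tell a clip from a reordering of the same frames, whereas frame-to-frame differences can.

```python
# Toy illustration: order-agnostic pooling vs. explicit frame differences.
# Assumption: each frame is summarized by one integer brightness value.

def mean_pool(frames):
    """Order-agnostic summary: average over all frames."""
    return sum(frames) / len(frames)

def frame_deltas(frames):
    """Order-aware summary: difference between consecutive frames."""
    return [b - a for a, b in zip(frames, frames[1:])]

clip_a = [10, 20, 10, 20, 30, 30]   # one temporal pattern
clip_b = [30, 10, 20, 30, 20, 10]   # the same frames in a different order

same_pooled = mean_pool(clip_a) == mean_pool(clip_b)
same_deltas = frame_deltas(clip_a) == frame_deltas(clip_b)
print("pooled features identical:", same_pooled)
print("delta features identical:", same_deltas)
```

Any feature that is invariant to frame order behaves like `mean_pool` here, which is one reason explicit difference features appear in the architecture described in this skill.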

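Architecture Overview

Temporal feature extraction: mechanisms that compute temporal derivatives, differences, or patterns across frames
Separate spatial and temporal pathways: process spatial and temporal information independently
Ordered frame aggregation: attend to relative frame position and temporal order
Temporal attention mechanisms: focus on frame-to-frame transitions rather than individual frames
Time-aware embeddings: positional encodings that capture temporal relationships
SpookyBench evaluation: test on purely temporal tasks to isolate the capability

Implementation

Create a time-aware video encoder that separates spatial and temporal processing: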

# Temporal-aware video understanding component
import torch
import torch.nn as nn

class TemporalVideoEncoder(nn.Module):
    """
    Separate spatial and temporal feature extraction pathways.
    Enables reasoning about temporal patterns independent of spatial content.
    """
    def __init__(self, hidden_dim=768, num_frames=8, num_temporal_layers=4):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.num_frames = num_frames

        # Spatial encoder: standard vision features per frame
        self.spatial_encoder = nn.Linear(2048, hidden_dim)  # From ViT backbone

        # Temporal encoder: reasons about frame-to-frame relationships
        self.temporal_processor = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(
                d_model=hidden_dim,
                nhead=8,
                dim_feedforward=2048,
                batch_first=True,
                activation='gelu'
            ),
            num_layers=num_temporal_layers
        )

        # Temporal difference layers: explicitly compute frame deltas
        self.temporal_diff_layers = nn.ModuleList([
            nn.Linear(hidden_dim * 2, hidden_dim) for _ in range(3)
        ])

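        # Time-aware positional encoding, registered as a non-persistent buffer
        # so it moves with the module when .to(device) is called
        self.register_buffer(
            "temporal_pos_encoding",
            self._create_temporal_positions(num_frames),
            persistent=False,
        )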

    def _create_temporal_positions(self, num_frames):
        """Create positional encodings that emphasize temporal structure"""
        # Sinusoidal encoding with temporal frequency emphasis
        positions = torch.arange(num_frames).float().unsqueeze(1)
        # Vary frequency to capture different temporal scales
        div_term = torch.exp(torch.arange(0, self.hidden_dim, 2).float() *
                            -(torch.log(torch.tensor(1000.0)) / self.hidden_dim))
        pe = torch.zeros(num_frames, self.hidden_dim)
        pe[:, 0::2] = torch.sin(positions * div_term)
        pe[:, 1::2] = torch.cos(positions * div_term)
        return pe

    def forward(self, frame_features):
        """
        Args:
            frame_features: (batch, num_frames, spatial_dim)
        Returns:
            temporal_features: (batch, num_frames, hidden_dim)
        """
        # Encode spatial features per frame
        batch, num_frames, spatial_dim = frame_features.shape
        spatial_encoded = self.spatial_encoder(frame_features)  # (B, T, H)

        # Add temporal position information (sliced so clips shorter than
        # the configured num_frames also work)
        pos_enc = self.temporal_pos_encoding[:num_frames].to(spatial_encoded.device)
        spatial_encoded = spatial_encoded + pos_enc.unsqueeze(0)

        # Apply temporal transformer
        temporal_encoded = self.temporal_processor(spatial_encoded)

        # Compute explicit temporal differences
        for diff_layer in self.temporal_diff_layers:
            # Pair each frame with its successor; the last frame has no
            # successor, so it is paired with itself
            next_frames = torch.cat([temporal_encoded[:, 1:], temporal_encoded[:, -1:]], dim=1)
            pair_tensor = torch.cat([temporal_encoded, next_frames], dim=-1)  # (B, T, 2H)
            diff_features = diff_layer(pair_tensor)  # (B, T, H)
            # Blend with original temporal features
            temporal_encoded = 0.7 * temporal_encoded + 0.3 * diff_features

        return temporal_encoded

Implement a SpookyBench evaluation wrapper that tests temporal understanding:

def create_spooky_benchmark_example(pattern_type='biological', length=8):
    """
    Create SpookyBench-style temporal pattern in images.
    Pure temporal information encoding.
    """
    import numpy as np
    from PIL import Image

    # Generate temporal pattern
    if pattern_type == 'biological':
        # Simulate neuron firing pattern (spike train)
        pattern = np.random.binomial(n=1, p=0.3, size=length)
    elif pattern_type == 'communication':
        # Morse-like encoding (fixed 8-bit template, truncated to `length`)
        pattern = [1, 0, 1, 0, 1, 1, 1, 0][:length]
    elif pattern_type == 'timeseries':
        # Smooth oscillation with noise
        t = np.linspace(0, 2*np.pi, length)
        pattern = np.sin(t) + np.random.normal(0, 0.1, length)
        pattern = (pattern > 0.5).astype(int)
    else:
        raise ValueError(f"Unknown pattern_type: {pattern_type!r}")

    # Normalize to a plain list of ints so callers can compare with ==
    pattern = [int(bit) for bit in pattern]

    # Encode as noisy images (spatial obscurity)
    frames = []
    for bit_value in pattern:
        # Create noise-dominant frame
        noise = np.random.normal(0.5, 0.2, (224, 224, 3))
        noise = np.clip(noise, 0, 1)

        # Add subtle temporal signal (hard to detect spatially):
        # a slight global brightness shift, so the information lives in
        # the frame sequence rather than in any spatial pattern
        scale = 1.05 if bit_value == 1 else 0.95
        # Re-clip after scaling so the uint8 cast below cannot wrap around
        frame = np.clip(noise * scale, 0, 1)

        # Convert to image
        frame_img = Image.fromarray((frame * 255).astype(np.uint8))
        frames.append(frame_img)

    return frames, pattern

# Evaluate model on SpookyBench
def evaluate_temporal_understanding(model, num_examples=50):
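    """
    Test if model can recognize temporal patterns in noisy frames.
    This loop scores exact-match recovery of the bit sequence only.
    Other possible success metrics (not implemented here):
    - Classification of temporal pattern types
    - Prediction of next frame's bit value
    - Temporal sequence length estimation
    `model.predict_temporal_pattern` and `parse_response_as_binary_sequence`
    are not defined in this file and must be supplied by the caller.
    """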
    pattern_types = ['biological', 'communication', 'timeseries']
    results = {ptype: {'correct': 0, 'total': 0} for ptype in pattern_types}

    for ptype in pattern_types:
        for _ in range(num_examples):
            frames, true_pattern = create_spooky_benchmark_example(ptype, length=8)

            # Ask model to recognize pattern
            prompt = f"What is the temporal pattern in these frames? Pattern type: {ptype}"
            response = model.predict_temporal_pattern(frames, prompt)
            predicted_pattern = parse_response_as_binary_sequence(response)

            # Check if model correctly identified temporal sequence
            # (compare as plain lists; == on NumPy arrays is elementwise)
            if list(predicted_pattern) == [int(bit) for bit in true_pattern]:
                results[ptype]['correct'] += 1
            results[ptype]['total'] += 1

    # Report results
    for ptype in pattern_types:
        acc = results[ptype]['correct'] / max(1, results[ptype]['total'])
        print(f"{ptype}: {acc:.2%} temporal pattern recognition")

    return results
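
The evaluation loop above calls `parse_response_as_binary_sequence` without defining it. The following is one possible sketch of that helper, not the author's implementation: it assumes the model answers in free text containing 0/1 digits. `bit_accuracy` is an additional hypothetical helper that gives partial credit, since exact match on an 8-bit sequence is a strict criterion.

```python
import re

def parse_response_as_binary_sequence(response):
    """Collect 0/1 digits from free-form text, e.g. '1 0 1' or '101' -> [1, 0, 1].

    Hypothetical helper: only tokens made up entirely of 0s and 1s are read,
    so unrelated numbers such as '8 frames' are ignored, but a stray token
    like 'frame 10' would still be picked up.
    """
    tokens = re.findall(r"\b[01]+\b", response)
    return [int(ch) for tok in tokens for ch in tok]

def bit_accuracy(predicted, target):
    """Fraction of target positions matched.

    Missing predictions count as wrong; predictions beyond the target
    length are ignored.
    """
    if not target:
        return 0.0
    hits = sum(1 for p, t in zip(predicted, target) if p == t)
    return hits / len(target)

reply = "The sequence looks like 1, 0, 1, 0, 1, 1, 0, 0 across the 8 frames."
predicted = parse_response_as_binary_sequence(reply)
target = [1, 0, 1, 0, 1, 1, 1, 0]
print(predicted, bit_accuracy(predicted, target))
```

A stricter variant could require the model to answer in a fixed format (for example a single line of eight digits) and reject anything else, which avoids picking up stray digits.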

Create data augmentation strategies that improve temporal reasoning during training:

class TemporalAugmentation:
    """Augmentations that preserve temporal structure while obscuring spatial information"""

    @staticmethod
    def noise_injection(frames, noise_level=0.7):
        """Add overwhelming noise while preserving temporal signal"""
        noisy_frames = []
        for frame in frames:
            noise = torch.randn_like(frame) * noise_level
            noisy_frame = frame * 0.3 + noise  # Signal becomes subtle
            noisy_frames.append(noisy_frame)
        return noisy_frames

    @staticmethod
    def spatial_blur(frames, blur_sigma=5):
        """Blur spatial details while keeping temporal transitions sharp"""
        from torchvision.transforms import GaussianBlur
        blur_transform = GaussianBlur(kernel_size=9, sigma=(blur_sigma, blur_sigma))
        blurred = [blur_transform(f) for f in frames]
        return blurred

    @staticmethod
    def temporal_frequency_filter(frames):
        """
        First-order frame differencing: keeps frame-to-frame change (motion)
        and cancels static spatial content. Returns len(frames) - 1 frames.
        """
        filtered = []
        for i in range(1, len(frames)):
            # Frame difference captures temporal changes
            diff = frames[i] - frames[i-1]
            filtered.append(diff)
        return filtered
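
To see why frame differencing separates temporal from spatial content, consider a toy case. This is a sketch under simplifying assumptions: frames are flat lists of integer pixel values rather than tensors, and `temporal_difference` is a hypothetical stand-in for `temporal_frequency_filter`. A fixed background is shared by all frames and only a global brightness offset changes over time.

```python
# Toy check: frame differencing cancels a static background.
# Assumption: frames are flat lists of integer pixel values.

background = [12, 200, 47, 91]      # fixed spatial content, identical in every frame
offsets = [0, 5, 5, 0, 5]           # per-frame global brightness shift (the temporal signal)

frames = [[px + off for px in background] for off in offsets]

def temporal_difference(frames):
    """Per-pixel difference between each frame and the one before it."""
    return [
        [cur - prev for prev, cur in zip(frames[i - 1], frames[i])]
        for i in range(1, len(frames))
    ]

diffs = temporal_difference(frames)
for d in diffs:
    print(d)
```

The background values never appear in the output; only the change in offset does. Because the result has one fewer frame than the input, any per-frame labels need to be realigned before training on difference frames.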

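Practical Guidelines

Aspect	Recommendation	Notes
Temporal attention heads	4-8	Heads dedicated to temporal reasoning
Frame sampling strategy	Every N frames	Balances temporal resolution against compute
Temporal context length	8-16 frames	Enough for pattern recognition without being excessive
Temporal positional encoding	Sinusoidal + learned	Helps the model understand order
Training data augmentation	Noise + blur + temporal filtering	Robustness to spatial obscurity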
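When to use temporal reasoning improvements:

The model struggles on purely temporal reasoning tasks
Videos contain subtle temporal patterns (anomalies, sequences)
Spatial information is unreliable or occluded
Temporal understanding is critical to the domain (biology, communications)
You have access to datasets with temporal annotations

When not to use:

Spatial information is primary (object detection, scene understanding)
Temporal reasoning capability is not needed
The compute budget is very limited
The video dataset is small (<10,000 videos)
Temporal patterns are obvious (no SpookyBench-style challenge)

Common pitfalls:

Not separating temporal and spatial information during training
Temporal encoders that do not explicitly model frame differences
Temporal context length too short for patterns to emerge
Training only on spatially biased datasets (builds no temporal skill)
Evaluating only on conventional video benchmarks that reward spatial encoding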
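
One way to check whether an evaluation actually rewards temporal understanding is a frame-order control: score the same clips with their frames reversed and compare. The sketch below is illustrative and not part of SpookyBench. `order_gap` and both toy predictors are hypothetical, each frame is reduced to a single number, and reversal is used instead of random shuffling so the result is deterministic.

```python
# Frame-order control: does accuracy depend on the order of the frames?
# Assumption: each clip is (list of per-frame values, label).

def accuracy(predict, clips):
    """Fraction of clips for which predict(frames) equals the label."""
    return sum(predict(frames) == label for frames, label in clips) / len(clips)

def order_gap(predict, clips):
    """Accuracy on the original clips minus accuracy with frames reversed."""
    reversed_clips = [(frames[::-1], label) for frames, label in clips]
    return accuracy(predict, clips) - accuracy(predict, reversed_clips)

def order_aware(frames):
    """Toy predictor that looks at the direction of change."""
    return "rising" if frames[-1] > frames[0] else "falling"

def order_agnostic(frames):
    """Toy predictor that only looks at average brightness."""
    return "rising" if sum(frames) / len(frames) > 3 else "falling"

clips = [
    ([1, 2, 3, 4], "rising"),
    ([4, 3, 2, 1], "falling"),
    ([0, 5, 6, 9], "rising"),
    ([9, 7, 2, 1], "falling"),
]

print("order-aware gap:", order_gap(order_aware, clips))
print("order-agnostic gap:", order_gap(order_agnostic, clips))
```

A gap near zero suggests that the predictor, or the benchmark it is scored on, is not using temporal order at all.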

References

Time Blindness: Why Video-Language Models Can't See What Humans Can? https://arxiv.org/abs/2505.24867

License: MIT (quoted in full because the license is permissive)

Details

Author: ADu2021
Repository: ADu2021/skillXiv
License: MIT
Last updated: 2026/3/26


Source: https://github.com/ADu2021/skillXiv / License: MIT
