
computer-use-agents

Name: computer-use-agents
Author: sickn33

Build AI agents that interact with computers like humans do - viewing screens, moving cursors, clicking buttons, and typing text. Covers Anthropic's Computer Use, OpenAI's Operator/CUA, and open-source alternatives.

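SKILL.md body

Computer Use Agents

Build AI agents that interact with computers the way humans do: viewing the screen, moving the cursor, clicking buttons, and typing text. This skill covers Anthropic's Computer Use, OpenAI's Operator/CUA, and open-source alternatives, with a focus on sandboxing, security, and the particular challenges of vision-based control.

Patterns

Perception-Reasoning-Action Loop

The basic architecture of a computer use agent: observe the screen, reason about the next action, execute it, and repeat. The loop couples a vision model to action execution through an iterative pipeline.

Key components:

Perception: a screenshot captures the current screen state
Reasoning: a vision-language model analyzes the screen and plans
Action: mouse and keyboard operations are executed
Feedback: the result is observed, and the agent continues or corrects course

Key insight: a vision agent is completely still during its "thinking" phase (1-5 seconds), which produces a detectable pause pattern.

When to use: building a computer use agent from scratch, integrating a vision model with desktop control, understanding agent behavior patterns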

from anthropic import Anthropic
from PIL import Image
import base64
import io
import pyautogui
import time

class ComputerUseAgent:
    """
    Implementation of the perception-reasoning-action loop.
    Based on Anthropic's Computer Use pattern.
    """

    def __init__(self, client: Anthropic, model: str = "claude-sonnet-4-20250514"):
        self.client = client
        self.model = model
        self.max_steps = 50  # guard against runaway loops
        self.action_delay = 0.5  # seconds between actions

    def capture_screenshot(self) -> str:
        """Capture the screen and return it as a base64-encoded PNG."""
        screenshot = pyautogui.screenshot()
        # Resize for token efficiency (1280x800 is a reasonable balance)
        screenshot = screenshot.resize((1280, 800), Image.LANCZOS)

        buffer = io.BytesIO()
        screenshot.save(buffer, format="PNG")
        return base64.b64encode(buffer.getvalue()).decode()

    def _to_screen(self, x: int, y: int) -> tuple[int, int]:
        """Map model coordinates (1280x800 screenshot space) to real screen pixels."""
        width, height = pyautogui.size()
        return round(x * width / 1280), round(y * height / 800)

    def execute_action(self, action: dict) -> dict:
        """Execute a mouse or keyboard action on the computer."""
        action_type = action.get("type")

        if action_type == "click":
            # The model sees the resized screenshot, so scale back before clicking
            x, y = self._to_screen(action["x"], action["y"])
            button = action.get("button", "left")
            pyautogui.click(x, y, button=button)
            return {"success": True, "action": f"clicked at ({x}, {y})"}

        elif action_type == "type":
            text = action["text"]
            pyautogui.typewrite(text, interval=0.02)
            return {"success": True, "action": f"typed {len(text)} chars"}

        elif action_type == "key":
            key = action["key"]
            pyautogui.press(key)
            return {"success": True, "action": f"pressed {key}"}

        elif action_type == "scroll":
            direction = action.get("direction", "down")
            amount = action.get("amount", 3)
            # pyautogui scrolls up for positive values and down for negative ones
            scroll = -amount if direction == "down" else amount
            pyautogui.scroll(scroll)
            return {"success": True, "action": f"scrolled {direction}"}

        elif action_type == "move":
            x, y = self._to_screen(action["x"], action["y"])
            pyautogui.moveTo(x, y)
            return {"success": True, "action": f"moved to ({x}, {y})"}

        else:
            return {"success": False, "error": f"Unknown action: {action_type}"}

    def run(self, task: str) -> dict:
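        """
        Run the perception-reasoning-action loop until the task completes.

        Loop:
        1. Take a screenshot of the current state
        2. Send the task context to the vision model
        3. Parse the action from the response
        4. Execute the action
        5. Repeat until done or the step limit is reached
        """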
        messages = []
        step_count = 0

        system_prompt = """You are a computer use agent. You can see the screen
        and control mouse/keyboard.

        Available actions (respond with JSON):
        - {"type": "click", "x": 100, "y": 200, "button": "left"}
        - {"type": "type", "text": "hello world"}
        - {"type": "key", "key": "enter"}
        - {"type": "scroll", "direction": "down", "amount": 3}
        - {"type": "done", "result": "task completed successfully"}

        Always respond with ONLY a JSON action object.
        Be precise with coordinates - click exactly where needed.
        If you see an error, try to recover.
        """

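        last_result = None  # outcome of the previous action, fed back to the model

        while step_count < self.max_steps:
            step_count += 1

            # 1. Perceive: capture the current screen
            screenshot_b64 = self.capture_screenshot()

            # 2. Reason: send the task, the previous outcome, and the screenshot
            prompt = f"Task: {task}\n\nStep {step_count}."
            if last_result is not None:
                prompt += f" Previous action result: {last_result}."
            prompt += " What action should I take?"
            user_content = [
                {"type": "text", "text": prompt},
                {"type": "image", "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": screenshot_b64
                }}
            ]

            messages.append({"role": "user", "content": user_content})

            response = self.client.messages.create(
                model=self.model,
                max_tokens=1024,
                system=system_prompt,
                messages=messages
            )

            assistant_message = response.content[0].text
            messages.append({"role": "assistant", "content": assistant_message})

            # 3. Parse the action from the response
            import json
            import re
            try:
                action = json.loads(assistant_message)
            except json.JSONDecodeError:
                # Fall back to the first {...} span in the reply
                match = re.search(r'\{[^}]+\}', assistant_message)
                try:
                    action = json.loads(match.group()) if match else None
                except json.JSONDecodeError:
                    action = None
            if not isinstance(action, dict):
                last_result = {"success": False, "error": "Reply was not a JSON action object"}
                continue

            # Check for completion
            if action.get("type") == "done":
                return {
                    "success": True,
                    "result": action.get("result"),
                    "steps": step_count
                }

            # 4. Act: execute, and keep the outcome so failures are not ignored
            try:
                last_result = self.execute_action(action)
            except Exception as exc:  # e.g. a required key is missing from the action
                last_result = {"success": False, "error": str(exc)}

            # Short delay so the UI can update
            time.sleep(self.action_delay)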

        return {
            "success": False,
            "error": "Max steps reached",
            "steps": step_count
        }

# Usage example
agent = ComputerUseAgent(Anthropic())
result = agent.run("Open Chrome and search for 'weather today'")

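Anti-patterns

Running without a step limit (infinite loops)
No delay between actions (the UI cannot keep up)
Full-resolution screenshots (token explosion)
Ignoring action failures (no recovery)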
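
The token-explosion anti-pattern also applies to history: a loop that appends a screenshot on every step resends all earlier images on each request. One possible mitigation, sketched below, keeps only the most recent screenshots. The function name `prune_screenshots`, the `keep_last` parameter, and the placeholder text are illustrative assumptions, and the sketch assumes message content is either a string or a list of dict blocks as in the loop above.

```python
def prune_screenshots(messages: list[dict], keep_last: int = 2) -> list[dict]:
    """Return a copy of messages in which only the newest keep_last images remain.

    Older image blocks are replaced by a short text placeholder so the
    conversation keeps its turn structure. The input list is not modified.
    """
    remaining = keep_last
    pruned = []
    # Walk from newest to oldest so the most recent images are the ones kept
    for message in reversed(messages):
        content = message["content"]
        if isinstance(content, list):
            new_content = []
            for block in reversed(content):
                if block.get("type") == "image":
                    if remaining > 0:
                        remaining -= 1
                    else:
                        block = {"type": "text", "text": "[screenshot omitted]"}
                new_content.append(block)
            content = list(reversed(new_content))
        pruned.append({**message, "content": content})
    return list(reversed(pruned))
```

In a loop like the one above, the pruned copy would be passed as `messages=` on each request while the full history is kept for logging.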

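Sandboxed Environment Pattern

A computer use agent must always run in an isolated sandbox. Never give the agent direct access to your main system; the security risk is too high. Use a Docker container with a virtual desktop.

Key isolation requirements:

Network: restrict to the endpoints that are actually needed
Filesystem: read-only, or scoped to a temporary directory
Credentials: no access to host credentials
System calls: filter dangerous syscalls
Resources: limit CPU, memory, and run time

The goal is to minimize the blast radius: if the agent fails, the damage stays inside the sandbox.

When to use: deploying a computer use agent, testing agent behavior safely, running untrusted automation tasks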

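# Dockerfile for a sandboxed computer use environment
# Based on the pattern in Anthropic's reference implementation

FROM ubuntu:22.04

# Avoid interactive prompts (e.g. tzdata) during package installation
ENV DEBIAN_FRONTEND=noninteractive

# Install the desktop environment plus the tools the agent uses to act
# (xdotool for input, scrot for screenshots).
# NOTE: on Ubuntu 22.04 the firefox apt package is a snap wrapper and may not
# run inside a container; a non-snap build is a common workaround.
RUN apt-get update && apt-get install -y \
    xvfb \
    x11vnc \
    fluxbox \
    xterm \
    xdotool \
    scrot \
    firefox \
    python3 \
    python3-pip \
    supervisor

# Security: create a non-root user
RUN useradd -m -s /bin/bash agent && \
    mkdir -p /home/agent/.vnc

# Install Python dependencies
COPY requirements.txt /tmp/
RUN pip3 install -r /tmp/requirements.txt

# Security: remove file capabilities from the Python binary
RUN apt-get update && apt-get install -y --no-install-recommends libcap2-bin && \
    setcap -r /usr/bin/python3 || true

# Copy the agent code
COPY --chown=agent:agent . /app
WORKDIR /app

# Supervisor configuration (virtual display + VNC)
COPY supervisord.conf /etc/supervisor/conf.d/

# Expose only the VNC port (the desktop itself is not exposed directly)
EXPOSE 5900

# Run as the non-root user
USER agent

CMD ["/usr/bin/supervisord", "-c", "/etc/supervisor/conf.d/supervisord.conf"]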

# docker-compose.yml with security constraints
version: '3.8'

services:
  computer-use-agent:
    build: .
    # NOTE: Docker may not publish ports for a container attached only to an
    # internal network; add a second, non-internal network if these are needed.
    ports:
      - "5900:5900"  # VNC (monitoring)
      - "8080:8080"  # API (control)

    # Security constraints
    security_opt:
      - no-new-privileges:true
      - seccomp:seccomp-profile.json

    # Resource limits
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 4G
        reservations:
          cpus: '0.5'
          memory: 1G

    # Network isolation
    networks:
      - agent-network

    # No access to the host filesystem
    volumes:
      - agent-tmp:/tmp

    # Read-only root filesystem
    read_only: true
    tmpfs:
      - /run
      - /var/run

    # Environment
    environment:
      - DISPLAY=:99
      - NO_PROXY=localhost

networks:
  agent-network:
    driver: bridge
    internal: true  # no internet access by default

volumes:
  agent-tmp:

# Python wrapper that adds runtime sandboxing
import socket
import subprocess
import uuid
from dataclasses import dataclass
from typing import Optional

import requests

@dataclass
class SandboxConfig:
    """Configuration for the agent sandbox."""
    network_allowed: Optional[list[str]] = None  # allowed domains
    max_runtime_seconds: int = 300
    max_memory_mb: int = 2048
    allow_downloads: bool = False
    allow_clipboard: bool = False

class SandboxedAgent:
    """
    Run a computer use agent inside a Docker sandbox.
    """

    def __init__(self, config: SandboxConfig):
        self.config = config
        self.container_id: Optional[str] = None

    def start(self) -> str:
        """Start the sandbox container."""
        # Build an argument list instead of a shell string to avoid injection
        cmd = [
            "docker", "run", "-d",
            "--name", f"computer-use-sandbox-{uuid.uuid4().hex[:8]}",
            "--security-opt", "no-new-privileges",
            "--cap-drop", "ALL",
            "--memory", f"{self.config.max_memory_mb}m",
            "--cpus", "2",
            "--read-only",
            "--tmpfs", "/tmp",
        ]
        if self.config.network_allowed:
            # NOTE: --add-host only pins name resolution. It does not block
            # other destinations; enforce the allowlist with an egress proxy
            # or firewall rules.
            for domain in self.config.network_allowed:
                cmd += ["--add-host", f"{domain}:{socket.gethostbyname(domain)}"]
            cmd += ["-p", "127.0.0.1:8080:8080"]  # control API, loopback only
        else:
            # No network at all, so the HTTP API used by execute_task is unreachable
            cmd += ["--network", "none"]
        cmd.append("computer-use-agent:latest")

        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        self.container_id = result.stdout.strip()

        # Set a kill timer
        subprocess.Popen([
            "sh", "-c",
            f"sleep {int(self.config.max_runtime_seconds)} && docker kill {self.container_id}"
        ])

        return self.container_id

    def execute_task(self, task: str) -> dict:
        """Run a task inside the sandbox."""
        if not self.container_id:
            self.start()

        # Send the task to the agent through its API
        response = requests.post(
            "http://localhost:8080/task",
            json={"task": task},
            timeout=self.config.max_runtime_seconds
        )
        response.raise_for_status()

        return response.json()

    def stop(self):
        """Stop and remove the sandbox."""
        if self.container_id:
            subprocess.run(["docker", "rm", "-f", self.container_id])
            self.container_id = None

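Anti-patterns

Running the agent directly on the host system
Giving the sandbox full network access
Running as root inside the container
No resource limits (denial of service)
Persistent storage (data can leak between runs)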
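
Several of these anti-patterns can be caught before a container starts by checking the `docker run` arguments. The sketch below is one way to do that, not a complete policy: the function name `audit_docker_args`, the set of required flags, and the messages are all illustrative assumptions.

```python
from typing import Optional

# Flags this sketch treats as mandatory, with the problem reported when missing
REQUIRED_FLAGS = {
    "--read-only": "root filesystem is writable",
    "--cap-drop": "Linux capabilities are not dropped",
    "--memory": "no memory limit",
    "--cpus": "no CPU limit",
}

def _option_value(args: list[str], name: str) -> Optional[str]:
    """Return the value of an option given as '--name value' or '--name=value'."""
    for i, arg in enumerate(args):
        if arg == name and i + 1 < len(args):
            return args[i + 1]
        if arg.startswith(name + "="):
            return arg.split("=", 1)[1]
    return None

def audit_docker_args(args: list[str]) -> list[str]:
    """Return human-readable problems found in a docker run argument list."""
    # Normalise '--flag=value' to '--flag' so both spellings are recognised
    flags = {arg.split("=", 1)[0] for arg in args if arg.startswith("--")}
    problems = [msg for flag, msg in REQUIRED_FLAGS.items() if flag not in flags]
    if "--privileged" in flags:
        problems.append("container runs privileged")
    if _option_value(args, "--network") == "host":
        problems.append("container shares the host network")
    return problems
```

A wrapper such as the one above could call this on its argument list and refuse to start the container when the returned list is not empty.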

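Anthropic's Computer Use Implementation

The official implementation pattern built on Claude's Computer Use capability. Claude 3.5 Sonnet was the first frontier model to offer computer use. Claude Opus 4.5 is currently billed as the best model in the world for computer use.

Key capabilities:

screenshot: capture the current screen state
mouse: click, move, and drag operations
keyboard: type text, press keys
bash: run shell commands
text_editor: view and edit files

Tool versions:

computer_20251124 (Opus 4.5): adds a zoom action for detailed inspection
computer_20250124 (all other models): standard capabilities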
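
The version list above can be turned into a small lookup so the tool definition follows the model. This is a minimal sketch: the function name `computer_tool_for` is hypothetical, and matching on the substring "opus-4-5" is an assumption about how model IDs are spelled. Each tool version also needs its matching beta flag, which this sketch does not cover.

```python
def computer_tool_for(model: str, width: int = 1280, height: int = 800) -> dict:
    """Build the computer tool definition for a model ID.

    Follows the list above: Opus 4.5 gets computer_20251124, every other
    model gets computer_20250124.
    """
    # Assumption: Opus 4.5 model IDs contain the substring "opus-4-5"
    tool_type = "computer_20251124" if "opus-4-5" in model else "computer_20250124"
    return {
        "type": tool_type,
        "name": "computer",
        "display_width_px": width,
        "display_height_px": height,
    }
```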

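Important limitation: according to Anthropic's documentation, some UI elements, such as dropdowns and scrollbars, can be tricky for Claude to manipulate.

When to use: building production computer use agents, when the highest-quality vision understanding is needed, when full desktop control is required (not just a browser)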

from anthropic import Anthropic
import base64
import io
import json
import subprocess
from PIL import Image

class AnthropicComputerUse:
    """
    Computer Use built on Anthropic's predefined tools.

    Requires:
    - A Docker container with a virtual display
    - VNC for watching the agent's actions
    - An implementation for each tool (below)
    """

    # Tool versions and the beta flag must match the model. These are the
    # values documented for Claude 4 models; check the current Anthropic docs
    # before relying on them.
    BETA_FLAG = "computer-use-2025-01-24"

    def __init__(self):
        self.client = Anthropic()
        self.model = "claude-sonnet-4-20250514"
        self.screen_size = (1280, 800)

    def get_tools(self) -> list:
        """Define the computer use tools as plain dicts."""
        return [
            {
                "type": "computer_20250124",
                "name": "computer",
                "display_width_px": self.screen_size[0],
                "display_height_px": self.screen_size[1],
            },
            {"type": "bash_20250124", "name": "bash"},
            {"type": "text_editor_20250728", "name": "str_replace_based_edit_tool"},
        ]

    def execute_tool(self, name: str, input: dict) -> dict:
        """Dispatch a tool call and return its result; errors become results."""
        try:
            if name == "computer":
                return self._handle_computer_action(input)
            elif name == "bash":
                return self._handle_bash(input)
            elif name == "str_replace_based_edit_tool":
                return self._handle_editor(input)
            return {"error": f"Unknown tool: {name}"}
        except Exception as e:
            return {"error": str(e)}

    def _click(self, input: dict, button: str, repeat: int = 1) -> dict:
        """Move to the coordinate (if one is given), then click."""
        if input.get("coordinate"):
            x, y = input["coordinate"]
            subprocess.run(["xdotool", "mousemove", str(x), str(y)], check=True)
        subprocess.run(["xdotool", "click", "--repeat", str(repeat), button], check=True)
        return {"success": True}

    def _handle_computer_action(self, input: dict) -> dict:
        """Handle a computer control action."""
        action = input.get("action")

        if action == "screenshot":
            # Capture with scrot (-o overwrites the previous file)
            subprocess.run(["scrot", "-o", "/tmp/screenshot.png"], check=True)

            # Resize for efficiency
            img = Image.open("/tmp/screenshot.png")
            img = img.resize(self.screen_size, Image.LANCZOS)

            buffer = io.BytesIO()
            img.save(buffer, format="PNG")

            return {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": base64.b64encode(buffer.getvalue()).decode()
                }
            }

        elif action == "mouse_move":
            x, y = input["coordinate"]
            subprocess.run(["xdotool", "mousemove", str(x), str(y)], check=True)
            return {"success": True}

        elif action == "left_click":
            return self._click(input, "1")

        elif action == "right_click":
            return self._click(input, "3")

        elif action == "double_click":
            return self._click(input, "1", repeat=2)

        elif action == "type":
            text = input.get("text", "")
            # Type with a per-key delay for reliability; "--" ends option parsing
            subprocess.run(["xdotool", "type", "--delay", "50", "--", text], check=True)
            return {"success": True}

        elif action == "key":
            # The key name or combination (e.g. "ctrl+s") arrives in "text"
            key = input.get("text", "")
            # Map common key names to xdotool keysyms
            key_map = {
                "return": "Return",
                "enter": "Return",
                "tab": "Tab",
                "escape": "Escape",
                "backspace": "BackSpace",
            }
            xdotool_key = key_map.get(key.lower(), key)
            subprocess.run(["xdotool", "key", "--", xdotool_key], check=True)
            return {"success": True}

        elif action == "scroll":
            if input.get("coordinate"):
                x, y = input["coordinate"]
                subprocess.run(["xdotool", "mousemove", str(x), str(y)], check=True)
            direction = input.get("scroll_direction", "down")
            amount = int(input.get("scroll_amount", 3))
            # X11 maps the scroll wheel to buttons 4 (up), 5 (down), 6 (left), 7 (right)
            button = {"up": "4", "down": "5", "left": "6", "right": "7"}.get(direction, "5")
            subprocess.run(["xdotool", "click", "--repeat", str(amount), button], check=True)
            return {"success": True}

        return {"error": f"Unknown action: {action}"}

    def _handle_bash(self, input: dict) -> dict:
        """Run a bash command."""
        command = input.get("command", "")

        # Coarse guard only: a substring blocklist is easy to bypass, so the
        # sandbox, not this check, is the real security boundary.
        dangerous_patterns = ["rm -rf", "mkfs", "dd if=", "> /dev/"]
        for pattern in dangerous_patterns:
            if pattern in command:
                return {"error": "Dangerous command blocked"}

        try:
            result = subprocess.run(
                command,
                shell=True,
                capture_output=True,
                text=True,
                timeout=30
            )
            return {
                "stdout": result.stdout[:10000],  # cap output size
                "stderr": result.stderr[:1000],
                "returncode": result.returncode
            }
        except subprocess.TimeoutExpired:
            return {"error": "Command timed out"}

    def _handle_editor(self, input: dict) -> dict:
        """Handle text editor operations (view and str_replace only)."""
        command = input.get("command")
        path = input.get("path")

        if command == "view":
            with open(path, "r") as f:
                content = f.read()
            return {"content": content[:50000]}  # cap size

        elif command == "str_replace":
            old_str = input.get("old_str")
            new_str = input.get("new_str", "")
            with open(path, "r") as f:
                content = f.read()
            if old_str not in content:
                return {"error": "old_str not found in file"}
            content = content.replace(old_str, new_str, 1)
            with open(path, "w") as f:
                f.write(content)
            return {"success": True}

        return {"error": f"Unknown editor command: {command}"}

    @staticmethod
    def _to_tool_result(tool_use_id: str, result: dict) -> dict:
        """Wrap a handler result in the tool_result shape the API expects."""
        # tool_result content must be a string or a list of content blocks
        if result.get("type") == "image":
            content = [result]
        else:
            content = json.dumps(result)
        return {
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": content,
            "is_error": "error" in result,
        }

    def run_task(self, task: str, max_steps: int = 50) -> dict:
        """Run a computer use task in an agent loop."""
        messages = [{"role": "user", "content": task}]
        tools = self.get_tools()

        for step in range(max_steps):
            response = self.client.beta.messages.create(
                model=self.model,
                max_tokens=4096,
                tools=tools,
                messages=messages,
                betas=[self.BETA_FLAG]
            )

            # Handle tool use
            if response.stop_reason == "tool_use":
                messages.append({"role": "assistant", "content": response.content})

                tool_results = []
                for block in response.content:
                    if block.type == "tool_use":
                        result = self.execute_tool(block.name, block.input)
                        tool_results.append(self._to_tool_result(block.id, result))

                messages.append({"role": "user", "content": tool_results})
                continue

            # Any other stop reason ends the loop; only end_turn counts as success
            text = "".join(b.text for b in response.content if b.type == "text")
            return {
                "success": response.stop_reason == "end_turn",
                "result": text,
                "steps": step + 1
            }

        return {"success": False, "error": "Max steps reached"}

Anti-patterns

Omitting the betas flag that the computer use tool version requires
Full-resolution screenshots (wasteful)
No command sanitization for the bash tool
Unbounded execution time
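
Downscaling screenshots has a side effect: the model then reports coordinates in the downscaled space, so they must be mapped back before clicking on a larger display. The following is a minimal sketch of that mapping. The function name `scale_to_screen` and the default sizes are illustrative assumptions.

```python
def scale_to_screen(
    x: int,
    y: int,
    shot_size: tuple[int, int] = (1280, 800),
    screen_size: tuple[int, int] = (2560, 1600),
) -> tuple[int, int]:
    """Map a point in screenshot space to real screen pixels.

    The result is clamped so it always lies inside the screen.
    """
    px = round(x * screen_size[0] / shot_size[0])
    py = round(y * screen_size[1] / shot_size[1])
    px = min(max(px, 0), screen_size[0] - 1)
    py = min(max(py, 0), screen_size[1] - 1)
    return px, py
```

When the virtual display already has the same size as the screenshots sent to the model, the mapping is the identity and can be skipped.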

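Browser Use Pattern (Playwright-based)

For browser-only automation, structured DOM access is more efficient than pixel-based computer use. With Playwright MCP, an LLM controls the browser through accessibility snapshots rather than screenshots.

Advantages over the vision-based approach:

Faster: no image processing
Cheaper: text tokens instead of vision tokens
More accurate: elements are targeted directly
More reliable: no coordinate drift

When to use vision vs. structured access:

Vision: desktop applications, complex UIs, visual verification
Structured: web automation, form filling, data extraction

When to use: browser-only automation tasks, form filling and web interaction, when speed and cost matter more than visual understanding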

from playwright.async_api import async_playwright
from dataclasses import dataclass
from typing import Optional
import asyncio

@dataclass
class BrowserAction:
    """A structured browser action."""
    action: str  # click, type, navigate, scroll, extract
    selector: Optional[str] = None
    text: Optional[str] = None
    url: Optional[str] = None

class BrowserUseAgent:
    """
    Browser automation with Playwright, driven by structured commands.
    More efficient than the pixel-based approach for web tasks.
    """

    def __init__(self):
        self.browser = None
        self.page = None

    async def start(self, headless: bool = True):
        """Start a browser session."""
        self.playwright = await async_playwright().start()
        self.browser = await self.playwright.chromium.launch(headless=headless)
        self.page = await self.browser.new_page()

    async def get_page_snapshot(self) -> dict:
        """
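        Get a structured snapshot of the page for the LLM.
        Uses the accessibility tree for efficiency.
        """
        # Get the accessibility tree.
        # NOTE: page.accessibility is deprecated in recent Playwright releases;
        # check the docs for your version if this call is unavailable.
        snapshot = await self.page.accessibility.snapshot()

        # Get simplified DOM information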
        elements = await self.page.evaluate('''() => {
            const interactable = [];
            const selector = 'a, button, input, select, textarea, [role="button"]';
            document.querySelectorAll(selector).forEach((el, i) => {
                const rect = el.getBoundingClientRect();
                if (rect.width > 0 && rect.height > 0) {
                    interactable.push({
                        index: i,
                        tag: el.tagName.toLowerCase(),
                        text: el.textContent?.trim().slice(0, 100),
                        type: el.type,
                        placeholder: el.placeholder,
                        name: el.name,
                        id: el.id,
                        class: el.className
                    });
                }
            });
            return interactable;
        }''')

        return {
            "url": self.page.url,
            "title": await self.page.title(),
            "accessibility_tree": snapshot,
            "interactable_elements": elements[:50]  # capped for token efficiency
        }

    async def execute_action(self, action: BrowserAction) -> dict:
        """Execute a structured browser action."""

        try:
            if action.action == "navigate":
                await self.page.goto(action.url, wait_until="domcontentloaded")
                return {"success": True, "url": self.page.url}

            elif action.action == "click":
                await self.page.click(action.selector, timeout=5000)
                try:
                    await self.page.wait_for_load_state("networkidle", timeout=5000)
                except Exception:
                    # The click itself succeeded; the page just never went idle
                    pass
                return {"success": True}

            elif action.action == "type":
                await self.page.fill(action.selector, action.text)
                return {"success": True}

            elif action.action == "scroll":
                direction = action.text or "down"
                distance = 500 if direction == "down" else -500
                await self.page.evaluate(f"window.scrollBy(0, {distance})")
                return {"success": True}

            elif action.action == "extract":
                # Extract text content; text_content() returns None when there is none
                if action.selector:
                    text = await self.page.text_content(action.selector)
                else:
                    text = await self.page.text_content("body")
                return {"success": True, "text": (text or "")[:5000]}

            elif action.action == "screenshot":
                # Fall back to vision when needed
                screenshot = await self.page.screenshot(type="png")
                import base64
                return {
                    "success": True,
                    "image": base64.b64encode(screenshot).decode()
                }

        except Exception as e:
            return {"success": False, "error": str(e)}

        return {"success": False, "error": f"Unknown action: {action.action}"}

    async def run_with_llm(self, task: str, llm_client, max_steps: int = 20):
        """
        Run a browser task with the LLM deciding each step.
        Uses the structured DOM instead of screenshots.
        """

        system_prompt = """You are a browser automation agent. You receive
        page snapshots with interactable elements and decide actions.

        Respond with JSON action:
        - {"action": "navigate", "url": "https://..."}
        - {"action": "click", "selector": "button.submit"}
        - {"action": "type", "selector": "input[name='email']", "text": "..."}
        - {"action": "scroll", "text": "down"}
        - {"action": "extract", "selector": ".results"}
        - {"action": "done", "result": "task completed"}

        Use CSS selectors based on the element info provided.
        Prefer id > name > class > text content for selectors.
        """

        messages = []

        for step in range(max_steps):
            # Get the current page state
            snapshot = await self.get_page_snapshot()

            user_message = f"""Task: {task}

            Current page:
            URL: {snapshot['url']}
            Title: {snapshot['title']}

            Interactable elements:
            {snapshot['interactable_elements']}

            What action should I take?"""

            messages.append({"role": "user", "content": user_message})

            # Get the LLM's decision
            response = llm_client.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=1024,
                system=system_prompt,
                messages=messages
            )

            assistant_text = response.content[0].text
            messages.append({"role": "assistant", "content": assistant_text})

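            # Parse and execute; a malformed reply is reported back, not fatal
            import json
            try:
                action_dict = json.loads(assistant_text)
                if action_dict.get("action") == "done":
                    return {"success": True, "result": action_dict.get("result")}
                action = BrowserAction(**action_dict)
            except (json.JSONDecodeError, TypeError, AttributeError) as exc:
                messages.append({
                    "role": "user",
                    "content": f"Invalid action: {exc}"
                })
                continue

            result = await self.execute_action(action)

            if not result.get("success"):
                messages.append({
                    "role": "user",
                    "content": f"Action failed: {result.get('error')}"
                })

            await asyncio.sleep(0.5)  # rate limiting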

        return {"success": False, "error": "Max steps reached"}

    async def close(self):
        """Clean up the browser."""
        if self.browser:
            await self.browser.close()
        if hasattr(self, 'playwright'):
            await self.playwright.stop()

# Usage example
async def main():
    agent = BrowserUseAgent()
    await agent.start(headless=False)

    from anthropic import Anthropic
    result = await agent.run_with_llm(
        "Go to weather.com and find the weather for New York",
        Anthropic()
    )

    print(result)
    await agent.close()

asyncio.run(main())

Anti-patterns

Using screenshots when DOM access would work
Not waiting for the page to load
Hard-coded selectors that break
No error recovery for stale elements
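
One way to avoid hard-coded selectors is to derive them from the element info in each snapshot, following the preference order given in the system prompt above (id, then name, then class, then text content). The sketch below assumes element dicts shaped like the entries of `interactable_elements`. The function name `build_selector` is hypothetical, values are not CSS-escaped, and `:has-text()` is a Playwright selector extension rather than standard CSS.

```python
from typing import Optional

def build_selector(element: dict) -> Optional[str]:
    """Build a selector for one snapshot element, preferring id > name > class > text."""
    tag = element.get("tag") or "*"
    element_id = element.get("id")
    if element_id:
        return f"#{element_id}"
    name = element.get("name")
    if name:
        return f'{tag}[name="{name}"]'
    classes = (element.get("class") or "").split()
    if classes:
        return tag + "".join(f".{c}" for c in classes)
    text = (element.get("text") or "").strip()
    if text:
        # Playwright-specific pseudo-class; matches elements containing the text
        return f'{tag}:has-text("{text}")'
    return None  # nothing usable; the caller should skip this element
```

Because the selector is rebuilt from a fresh snapshot on every step, a changed class name or id shows up in the next selector instead of breaking a fixed string.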

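User Confirmation Pattern

For sensitive actions, the agent should pause and ask a human to confirm. ChatGPT agent likewise pauses and asks for confirmation before sensitive steps such as completing a purchase.

Sensitivity levels:

Low: navigation, reading (auto-approved)
Medium: form input, clicks (logged, sometimes confirmed)
High: purchases, authentication, file operations (always confirmed)
Critical: entering credentials, financial transactions (confirmation plus review)

When to use: actions with real-world consequences, financial transactions, authentication flows, file modifications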

from enum import Enum
from dataclasses import dataclass
from typing import Callable, Optional
import asyncio

class ActionSeverity(Enum):
    LOW = "low"           # auto-approve
    MEDIUM = "medium"     # log, optionally confirm
    HIGH = "high"         # always confirm
    CRITICAL = "critical" # confirm plus detailed review

@dataclass
class SensitiveAction:
    """An action that may need user confirmation."""
    action_type: str
    description: str
    severity: ActionSeverity
    details: dict

class ConfirmationGate:
    """
    Gate sensitive actions behind user confirmation.
    """

    # Mapping from action type to sensitivity
    ACTION_SEVERITY = {
        # Low: auto-approve
        "navigate": ActionSeverity.LOW,
        "scroll": ActionSeverity.LOW,
        "read": ActionSeverity.LOW,
        "screenshot": ActionSeverity.LOW,

        # Medium: log, sometimes confirm
        "click": ActionSeverity.MEDIUM,
        "type": ActionSeverity.MEDIUM,
        "search": ActionSeverity.MEDIUM,

        # High: always confirm
        "download": ActionSeverity.HIGH,
        "submit_form": ActionSeverity.HIGH,
        "login": ActionSeverity.HIGH,
        "file_write": ActionSeverity.HIGH,

        # Critical: confirm with a full review
        "purchase": ActionSeverity.CRITICAL,
        "enter_password": ActionSeverity.CRITICAL,
        "enter_credit_card": ActionSeverity.CRITICAL,
        "send_money": ActionSeverity.CRITICAL,
        "delete": ActionSeverity.CRITICAL,
    }

    def __init__(
        self,
        confirm_callback: Callable[[SensitiveAction], bool] = None,
        auto_confirm_low: bool = True,
        auto_confirm_medium: bool = False
    ):
        self.confirm_callback = confirm_callback or self._default_confirm
        self.auto_confirm_low = auto_confirm_low
        self.auto_confirm_medium = auto_confirm_medium
        self.action_log = []

    def _default_confirm(self, action: SensitiveAction) -> bool:
        """Default confirmation through a CLI prompt."""
        print(f"\n{'='*60}")
        print(f"ACTION CONFIRMATION REQUIRED")
        print(f"{'='*60}")
        print(f"Type: {action.action_type}")
        print(f"Severity: {action.severity.value.upper()}")
        print(f"Description: {action.description}")
        print(f"Details: {action.details}")
        print(f"{'='*60}")

        while True:
            response = input("Allow this action? [y/n]: ").lower().strip()
            if response in ['y', 'yes']:
                return True
            elif response in ['n', 'no']:
                return False

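    def classify_action(self, action_type: str, context: dict) -> ActionSeverity:
        """Classify an action's sensitivity, taking context into account."""
        base_severity = self.ACTION_SEVERITY.get(action_type, ActionSeverity.MEDIUM)

        # Escalate based on context
        if context.get("involves_credentials"):
            return ActionSeverity.CRITICAL
        if context.get("involves_money"):
            return ActionSeverity.CRITICAL
        if context.get("irreversible"):
            # Compare by explicit rank; the enum values are strings, and
            # comparing them alphabetically would order the levels incorrectly.
            rank = {
                ActionSeverity.LOW: 0,
                ActionSeverity.MEDIUM: 1,
                ActionSeverity.HIGH: 2,
                ActionSeverity.CRITICAL: 3,
            }
            return max(base_severity, ActionSeverity.HIGH, key=rank.__getitem__)

        return base_severity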

    def check_action(
        self,
        action_type: str,
        description: str,
        details: dict = None
    ) -> tuple[bool, str]:
        """
        Check whether an action should proceed.
        Returns (approved, reason).
        """
        from datetime import datetime

        details = details or {}
        severity = self.classify_action(action_type, details)

        action = SensitiveAction(
            action_type=action_type,
            description=description,
            severity=severity,
            details=details
        )

        # Auto-approve low severity (and medium, if configured); otherwise ask
        if severity == ActionSeverity.LOW and self.auto_confirm_low:
            approved, reason = True, "auto-approved (low severity)"
        elif severity == ActionSeverity.MEDIUM and self.auto_confirm_medium:
            approved, reason = True, "auto-approved (medium severity)"
        else:
            approved = self.confirm_callback(action)
            reason = "user approved" if approved else "user rejected"

        # Log every action with its outcome, including rejections
        self.action_log.append({
            "action": action,
            "approved": approved,
            "reason": reason,
            "timestamp": datetime.now().isoformat()
        })

        return approved, reason

class ConfirmedComputerUseAgent:
    """
    A computer use agent with a confirmation gate.
    """

    def __init__(self, base_agent, confirmation_gate: ConfirmationGate):
        self.agent = base_agent
        self.gate = confirmation_gate

    def execute_action(self, action: dict) -> dict:
        """Execute an action after a confirmation check."""
        action_type = action.get("type", "unknown")

        # Build the description
        if action_type == "click":
            desc = f"Click at ({action.get('x')}, {action.get('y')})"
        elif action_type == "type":
            text = action.get('text', '')
            # Mask the text if it looks like a password
            if self._looks_sensitive(text):
                desc = f"Type sensitive text ({len(text)} chars)"
            else:
                desc = f"Type: {text[:50]}..."
        else:
            desc = f"Execute: {action_type}"

        # Context for sensitivity classification
        context = {
            "involves_credentials": self._looks_sensitive(action.get("text", "")),
            "involves_money": self._mentions_money(action),
        }

        # Check with the gate
        approved, reason = self.gate.check_action(
            action_type, desc, context
        )

        if not approved:
            return {
                "success": False,
                "error": f"Action blocked: {reason}",
                "action": action_type
            }

        # Execute once approved
        return self.agent.execute_action(action)

    def _looks_sensitive(self, text: str) -> bool:
        """Check whether the text looks like sensitive data."""
        if not text:
            return False
        # Common patterns
        patterns = [
            r'\b\d{16}\b',  # credit card number
            r'\b\d{3,4}\b.*\b\d{3,4}\b',  # CVV-like
            r'password',
            r'secret',
            r'api.?key',
            r'token'
        ]
        import re
        return any(re.search(p, text.lower()) for p in patterns)

    def _mentions_money(self, action: dict) -> bool:
        """Check whether the action is money-related."""
        text = str(action)
        money_patterns = [
            r'\$\d+', r'pay', r'purchase', r'buy', r'checkout',
            r'credit', r'debit', r'invoice', r'payment'
        ]
        import re
        return any(re.search(p, text.lower()) for p in money_patterns)

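# Usage example
# (base_agent is assumed to be an existing agent that exposes execute_action())
gate = ConfirmationGate(
    auto_confirm_low=True,
    auto_confirm_medium=False  # confirm clicks and typing
)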

agent = ConfirmedComputerUseAgent(base_agent, gate)
result = agent.execute_action({"type": "click", "x": 500, "y": 300})

Anti-patterns

Auto-approving every action
Not logging rejected actions
Showing full passwords in confirmation prompts
No confirmation timeout (hangs forever)
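
The last anti-pattern above, a confirmation prompt that can hang forever, can be avoided by bounding how long the agent waits. The following is a minimal sketch, not part of the original skill: `confirm_with_timeout` and its `ask` parameter are hypothetical names, and treating a timeout as a rejection (failing closed) is an assumed policy.

```python
import queue
import threading
from typing import Callable

def confirm_with_timeout(
    ask: Callable[[], bool],
    timeout_s: float = 30.0,
) -> tuple[bool, str]:
    """Run a blocking confirmation prompt, but give up after timeout_s.

    `ask` is any zero-argument callable returning True (approve) or
    False (reject). A timeout counts as a rejection (fail closed).
    """
    answers: queue.Queue = queue.Queue(maxsize=1)

    def worker() -> None:
        try:
            answers.put(bool(ask()))
        except Exception:
            # A crashing prompt must never approve the action.
            answers.put(False)

    # Daemon thread: an unanswered prompt cannot keep the process alive.
    threading.Thread(target=worker, daemon=True).start()

    try:
        approved = answers.get(timeout=timeout_s)
    except queue.Empty:
        return False, f"confirmation timed out after {timeout_s:g}s"
    return approved, ("user approved" if approved else "user rejected")
```

A callback of this shape could be adapted to whatever callable the gate invokes for confirmation; the returned reason string mirrors the ones used by the gate above so that timeouts show up in logs as a distinct outcome.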

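Action Logging Pattern

Every action a computer use agent takes should be logged, for the following reasons:

Debugging failed automations
Security audits
Reproducibility
Compliance requirements

The log format should capture:

Timestamp
Action type and parameters
Before and after screenshots
Success/failure status
Model reasoning (when available)

When to use: production computer use deployments, debugging automation failures, security-critical environments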

from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional, Any
import json
import os

@dataclass
class ActionLogEntry:
    """A single action log entry."""
    timestamp: datetime
    action_type: str
    parameters: dict
    success: bool
    error: Optional[str] = None
    screenshot_before: Optional[str] = None  # path to the screenshot file
    screenshot_after: Optional[str] = None
    model_reasoning: Optional[str] = None
    duration_ms: Optional[int] = None

    def to_dict(self) -> dict:
        return {
            "timestamp": self.timestamp.isoformat(),
            "action_type": self.action_type,
            "parameters": self._sanitize_params(self.parameters),
            "success": self.success,
            "error": self.error,
            "screenshot_before": self.screenshot_before,
            "screenshot_after": self.screenshot_after,
            "model_reasoning": self.model_reasoning,
            "duration_ms": self.duration_ms
        }

    def _sanitize_params(self, params: dict) -> dict:
        """Redact sensitive data from parameters (matched by key name)."""
        sanitized = {}
        sensitive_keys = ['password', 'secret', 'token', 'key', 'credit_card']

        for k, v in params.items():
            if any(s in k.lower() for s in sensitive_keys):
                sanitized[k] = "[REDACTED]"
            elif isinstance(v, str) and len(v) > 100:
                sanitized[k] = v[:100] + "...[truncated]"
            else:
                sanitized[k] = v

        return sanitized

@dataclass
class TaskSession:
    """A complete task execution session."""
    session_id: str
    task: str
    start_time: datetime
    end_time: Optional[datetime] = None
    actions: list[ActionLogEntry] = field(default_factory=list)
    success: bool = False
    final_result: Optional[str] = None

class ActionLogger:
    """
    Comprehensive action logging for computer use agents.
    """

    def __init__(self, log_dir: str = "./agent_logs"):
        self.log_dir = log_dir
        self.screenshot_dir = os.path.join(log_dir, "screenshots")
        os.makedirs(self.screenshot_dir, exist_ok=True)

        self.current_session: Optional[TaskSession] = None

    def start_session(self, task: str) -> str:
        """Start a new task session."""
        import uuid
        session_id = str(uuid.uuid4())[:8]

        self.current_session = TaskSession(
            session_id=session_id,
            task=task,
            start_time=datetime.now()
        )

        return session_id

    def log_action(
        self,
        action_type: str,
        parameters: dict,
        success: bool,
        error: Optional[str] = None,
        screenshot_before: Optional[bytes] = None,
        screenshot_after: Optional[bytes] = None,
        model_reasoning: Optional[str] = None,
        duration_ms: Optional[int] = None
    ):
        """Log a single action."""
        if not self.current_session:
            raise RuntimeError("No active session")

        # Save screenshots, if provided
        screenshot_paths = {}
        timestamp_str = datetime.now().strftime("%Y%m%d_%H%M%S_%f")

        if screenshot_before:
            path = os.path.join(
                self.screenshot_dir,
                f"{self.current_session.session_id}_{timestamp_str}_before.png"
            )
            with open(path, "wb") as f:
                f.write(screenshot_before)
            screenshot_paths["before"] = path

        if screenshot_after:
            path = os.path.join(
                self.screenshot_dir,
                f"{self.current_session.session_id}_{timestamp_str}_after.png"
            )
            with open(path, "wb") as f:
                f.write(screenshot_after)
            screenshot_paths["after"] = path

        # Create the log entry
        entry = ActionLogEntry(
            timestamp=datetime.now(),
            action_type=action_type,
            parameters=parameters,
            success=success,
            error=error,
            screenshot_before=screenshot_paths.get("before"),
            screenshot_after=screenshot_paths.get("after"),
            model_reasoning=model_reasoning,
            duration_ms=duration_ms
        )

        self.current_session.actions.append(entry)

        # Also append to the running log file
        self._append_to_log(entry)

    def _append_to_log(self, entry: ActionLogEntry):
        """Append the entry to the session's JSONL log file."""
        log_file = os.path.join(
            self.log_dir,
            f"session_{self.current_session.session_id}.jsonl"
        )

        with open(log_file, "a") as f:
            f.write(json.dumps(entry.to_dict()) + "\n")

    def end_session(self, success: bool, result: Optional[str] = None):
        """End the current session."""
        if not self.current_session:
            return

        self.current_session.end_time = datetime.now()
        self.current_session.success = success
        self.current_session.final_result = result

        # Write the session summary
        summary_file = os.path.join(
            self.log_dir,
            f"session_{self.current_session.session_id}_summary.json"
        )

        summary = {
            "session_id": self.current_session.session_id,
            "task": self.current_session.task,
            "start_time": self.current_session.start_time.isoformat(),
            "end_time": self.current_session.end_time.isoformat(),
            "duration_seconds": (
                self.current_session.end_time -
                self.current_session.start_time
            ).total_seconds(),
            "total_actions": len(self.current_session.actions),
            "successful_actions": sum(
                1 for a in self.current_session.actions if a.success
            ),
            "failed_actions": sum(
                1 for a in self.current_session.actions if not a.success
            ),
            "success": success,
            "final_result": result
        }

        with open(summary_file, "w") as f:
            json.dump(summary, f, indent=2)

        self.current_session = None

    def get_session_replay(self, session_id: str) -> list[dict]:
        """Load every logged action of a session for replay or debugging."""
        log_file = os.path.join(self.log_dir, f"session_{session_id}.jsonl")

        actions = []
        with open(log_file, "r") as f:
            for line in f:
                actions.append(json.loads(line))

        return actions

# Integration with the agent
class LoggedComputerUseAgent:
    """Computer use agent with comprehensive logging."""

    def __init__(self, base_agent, logger: ActionLogger):
        self.agent = base_agent
        self.logger = logger

    def run_task(self, task: str) -> dict:
        """Run a task with full logging."""
        session_id = self.logger.start_session(task)

        try:
            result = self._run_with_logging(task)
            self.logger.end_session(
                success=result.get("success", False),
                result=result.get("result")
            )
            return result
        except Exception as e:
            self.logger.end_session(success=False, result=str(e))
            raise

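    def _run_with_logging(self, task: str) -> dict:
        """Internal run loop with per-action logging."""
        # This would wrap the base agent's run method and log each action.
        # It is left unimplemented here, so fail loudly instead of returning
        # None, which would break result.get(...) in run_task().
        raise NotImplementedError("wrap the base agent's run loop here")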
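
The wrapper above leaves the per-action logging step unspecified. One way that step might look is sketched below. This is an illustration under assumptions, not the original author's implementation: `execute_and_log` is a hypothetical helper, it assumes `agent.execute_action(action)` returns a dict with a `"success"` key and an optional `"error"` key, and it assumes the logger accepts the keyword arguments of `ActionLogger.log_action` shown earlier. Screenshots are omitted to keep the sketch short.

```python
import time
from typing import Any, Optional

def execute_and_log(agent: Any, logger: Any, action: dict,
                    reasoning: Optional[str] = None) -> dict:
    """Execute one action on `agent` and record the outcome with `logger`."""
    # Everything except the action type is treated as a parameter.
    params = {k: v for k, v in action.items() if k != "type"}

    started = time.perf_counter()
    try:
        result = agent.execute_action(action)
    except Exception as exc:
        # Turn exceptions into a failed result so the log stays complete.
        result = {"success": False, "error": f"{type(exc).__name__}: {exc}"}
    duration_ms = int((time.perf_counter() - started) * 1000)

    logger.log_action(
        action_type=action.get("type", "unknown"),
        parameters=params,
        success=bool(result.get("success", False)),
        error=result.get("error"),
        model_reasoning=reasoning,
        duration_ms=duration_ms,
    )
    return result
```

A run loop could call this helper once per model-proposed action and stop when a result reports failure or the step limit is reached.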

Anti-patterns

Not sanitizing sensitive data in logs
Keeping screenshots indefinitely (storage cost)
No log file rotation
Logging synchronously (blocks the agent)
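
To illustrate the last anti-pattern, log writes can be moved off the agent's thread with a queue and a background worker. This is a minimal sketch rather than a production logger: `BackgroundJsonlWriter` is a hypothetical class, and it assumes records are JSON-serializable dicts such as the output of `ActionLogEntry.to_dict()`.

```python
import json
import queue
import threading

class BackgroundJsonlWriter:
    """Append dict records to a JSONL file from a background thread."""

    _STOP = object()  # sentinel telling the worker to exit

    def __init__(self, path: str):
        self._path = path
        self._queue: queue.Queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def write(self, record: dict) -> None:
        """Enqueue a record; returns immediately without touching disk."""
        self._queue.put(record)

    def close(self) -> None:
        """Write everything queued so far, then stop the worker."""
        self._queue.put(self._STOP)
        self._thread.join()

    def _run(self) -> None:
        with open(self._path, "a", encoding="utf-8") as f:
            while True:
                item = self._queue.get()
                if item is self._STOP:
                    break
                f.write(json.dumps(item) + "\n")
                f.flush()
```

Calling `close()` at the end of a session matters here: because the worker is a daemon thread, records still in the queue could be lost if the process exits without it.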

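Sharp Edges

Web content can hijack the agent

Severity: CRITICAL

Situation: A computer use agent browses the web

Symptoms: The agent suddenly takes unexpected actions. It clicks malicious links, enters credentials on phishing sites, or downloads files it should not. It ignores its instructions and follows embedded commands instead.

Why this breaks: "Every agent that processes untrusted content is subject to prompt injection risk, but browser use amplifies this risk in two ways. First, the attack surface is vast: every webpage, embedded document, advertisement, and dynamically loaded script represents a potential vector for malicious instructions. Second, browser agents can take many different actions (navigating to URLs, filling forms, clicking buttons, downloading files) that attackers can exploit."

Real attacks have already occurred:

"A Microsoft Copilot agent was hijacked by an email containing malicious instructions, allowing attackers to extract an entire CRM database."
"Google's Workspace services were manipulated: hidden prompts in calendar invites and emails tricked Gemini agents into deleting events and exposing sensitive messages."

Even a 1% attack success rate represents meaningful risk at scale.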
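
The scale argument can be made concrete with a short calculation. The sketch below assumes each exposure to malicious content is an independent trial with the same success probability, which is a simplification; the function name is hypothetical.

```python
def p_at_least_one_success(p_single: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - p_single) ** attempts

# How a 1% per-attempt rate compounds as exposure grows.
for n in (10, 100, 1000):
    print(n, round(p_at_least_one_success(0.01, n), 3))
```

Under that independence assumption, a 1% per-attempt rate compounds to roughly a 63% chance of at least one successful attack after 100 exposures.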

Recommended fix:

Defense in depth: no single solution works

Sandboxing (most effective):

# Docker with strict isolation
# --network none leaves the container with no network access (no internet!)
docker run \
    --security-opt no-new-privileges \
    --cap-drop ALL \
    --network none \
    --read-only \
    computer-use-agent

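Classifier-based detection (shown here as a simple pattern heuristic):

import re

def scan_for_injection(content: str) -> bool:
    """Detect likely prompt injection attempts."""
    patterns = [
        r"ignore.*instructions",
        r"disregard.*previous",
        r"new.*instructions",
        r"you are now",
        r"act as if",
        r"pretend to be",
    ]
    return any(re.search(p, content.lower()) for p in patterns)

# Check page content before processing it
# (this fragment runs inside the agent's async page handler)
page_text = await page.text_content("body") or ""  # text_content() may return None
if scan_for_injection(page_text):
    return {"error": "Potential injection detected"}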

User confirmation for sensitive actions:

SENSITIVE_ACTIONS = {"download", "submit", "login", "purchase"}

if action_type in SENSITIVE_ACTIONS:
    if not await get_user_confirmation(action):
        return {"error": "User rejected action"}

Scoped credentials:

Do not give the agent access to all credentials
Use short-lived, narrowly scoped tokens
Revoke them once the task completes
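
The three rules above can be sketched as a small in-process model. Everything here is hypothetical (`ScopedToken`, `run_task_with_token`, and the scope strings in the usage note are illustrative names), and real enforcement has to happen at the service that issues and validates credentials, not inside the agent process.

```python
import secrets
import time
from dataclasses import dataclass, field

@dataclass
class ScopedToken:
    """Hypothetical short-lived credential limited to named scopes."""
    scopes: frozenset
    ttl_s: float = 300.0
    value: str = field(default_factory=lambda: secrets.token_urlsafe(32))
    issued_at: float = field(default_factory=time.monotonic)
    revoked: bool = False

    def allows(self, scope: str) -> bool:
        """True only if the token is unrevoked, unexpired, and covers scope."""
        expired = time.monotonic() - self.issued_at > self.ttl_s
        return not self.revoked and not expired and scope in self.scopes

    def revoke(self) -> None:
        self.revoked = True

def run_task_with_token(task, scopes: set, ttl_s: float = 300.0):
    """Issue a token for a single task and revoke it when the task ends."""
    token = ScopedToken(scopes=frozenset(scopes), ttl_s=ttl_s)
    try:
        return task(token)
    finally:
        token.revoke()  # revoke even if the task raised
```

For example, a task handed a token for `{"calendar:read"}` would be refused for a `"mail:send"` check, and any reference to the token kept after the task returns would be refused for everything.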

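Vision agents click the exact center

Severity: MEDIUM

Situation: The agent clicks UI elements

Symptoms: The agent's clicks are detectable as non-human. Websites may block the agent or present a CAPTCHA. Anti-bot systems flag the interactions.

Why this breaks: "When a vision model identifies a button, it computes the center. Click coordinates land on mathematically precise positions: the exact element center or grid-aligned pixel values. Humans do not click centers; their click distribution follows a Gaussian pattern around the target."

The screenshot loop also creates a detectable pattern: "Predictable pauses. Vision agents are completely still during the 'thinking' phase. The pattern looks like: action → complete stillness (1-5 seconds) → action → complete stillness → action."

Sophisticated anti-bot systems detect:

Perfect center clicks
No mouse movement while "thinking"
Consistent timing between actions
Lack of micro-movements and hesitation

Recommended fix:

Add human-like variance to actions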

import random
import time

def humanized_click(x: int, y: int) -> tuple[int, int]:
    """Add human-like variance to click coordinates."""
    # Gaussian distribution around the target
    # Humans typically land within about 10px of the target
    x_offset = int(random.gauss(0, 5))
    y_offset = int(random.gauss(0, 5))

    return (x + x_offset, y + y_offset)

def humanized_delay():
    """Add a human-like delay between actions."""

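The source is cut off at this point, so the original body of `humanized_delay` is not available. As an illustration of the idea only, the sketch below draws a skewed random pause and clamps it to a range. The function name, the log-normal distribution, and every numeric default are assumptions, not the author's implementation.

```python
import random
import time
from typing import Callable, Optional

def humanized_delay_sketch(
    min_s: float = 0.3,
    max_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> float:
    """Sleep for a randomized, human-like interval and return its length."""
    rng = rng or random.Random()
    # Log-normal draws are skewed: mostly short pauses, occasionally long ones.
    delay = rng.lognormvariate(-0.5, 0.6)
    # Clamp to the allowed range so outliers cannot stall the agent.
    delay = min(max(delay, min_s), max_s)
    sleep(delay)
    return delay
```

Passing `sleep` and `rng` as parameters keeps the function testable without real waiting, which is a design choice of this sketch.
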
License: MIT (quoted in full because the license is permissive)

Details

Author: sickn33
Repository: sickn33/antigravity-awesome-skills
License: MIT
Last updated: unknown


Source: https://github.com/sickn33/antigravity-awesome-skills / License: MIT
