How to Use Claude for Desktop Automation with Accessibility Tree

Mar 17, 2026

I was frustrated. Every time I tried to automate repetitive tasks on my desktop, I ended up with brittle scripts that broke the moment an application updated its UI. Recording macros? Forget it. Writing AppleScript or AutoHotkey scripts? They work until a button moves or a label changes.

Then I stumbled upon a Reddit post where someone was using Claude Code as the “brain” for desktop automation. The key insight: instead of relying on fragile selectors, they were feeding Claude the accessibility tree of the active application and letting Claude figure out what to click.

The Problem with Traditional Desktop Automation

Traditional approaches have serious limitations:

Macro recorders capture exact coordinates and timing. Move a window, and your automation fails.
Selector-based automation (like Selenium for web) requires maintaining element locators. UI changes break everything.
Visual recognition (computer vision, OCR) is slow and error-prone.
Scripting languages (AppleScript, AutoHotkey) require knowing application-specific dictionaries and APIs.

What I needed was something that could understand intent and adapt to UI changes automatically.

What is an Accessibility Tree?

The accessibility tree is a semantic representation of a UI that screen readers use. It contains:

Element types (button, text field, menu)
Labels and descriptions
States (enabled, focused, expanded)
Hierarchy and relationships
Actions available on each element

Here’s what a simple accessibility tree might look like for a mail application:

Application: Mail
├── Window: "Inbox - 42 messages"
│   ├── Toolbar
│   │   ├── Button: "New Message" (enabled)
│   │   ├── Button: "Reply" (disabled)
│   │   └── Button: "Delete" (disabled)
│   ├── Table: Message List
│   │   ├── Row: "John Smith - Project Update - Today"
│   │   ├── Row: "Jane Doe - Meeting Notes - Yesterday"
│   │   └── ...
│   └── StaticText: "Select a message to read"

This is structured, semantic data - perfect for an LLM to understand and act upon.
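
To make this concrete, here is a minimal sketch of such a tree as plain Python data. The AXNode class and the render function are names invented for illustration; they are not part of any accessibility API, and a real tree carries many more attributes.

```python
from dataclasses import dataclass, field


@dataclass
class AXNode:
    """One element of an accessibility tree (hypothetical, simplified)."""
    role: str
    title: str = ""
    enabled: bool = True
    children: list = field(default_factory=list)


def render(node: AXNode, depth: int = 0) -> str:
    """Render the tree as an indented outline, one element per line."""
    label = f'{node.role}: "{node.title}"' if node.title else node.role
    if not node.enabled:
        label += " (disabled)"
    lines = ["  " * depth + label]
    for child in node.children:
        lines.append(render(child, depth + 1))
    return "\n".join(lines)


mail = AXNode("Application", "Mail", children=[
    AXNode("Window", "Inbox", children=[
        AXNode("Button", "New Message"),
        AXNode("Button", "Reply", enabled=False),
    ]),
])
print(render(mail))
```

An outline like this is what ends up in the prompt, whether you serialize it as indented text or as JSON.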

Building the Automation Pipeline

Step 1: Extract the Accessibility Tree

On macOS, you can use the Accessibility API through Python with pyobjc:

from AppKit import NSWorkspace
from ApplicationServices import (
    AXUIElementCreateApplication,
    AXUIElementCopyAttributeValue,
    kAXChildrenAttribute,
    kAXRoleAttribute,
    kAXTitleAttribute,
)

def get_accessibility_tree(app_pid: int) -> dict:
    """Extract the accessibility tree from an application by PID."""
    app_element = AXUIElementCreateApplication(app_pid)

    def build_tree(element):
        error, children = AXUIElementCopyAttributeValue(
            element, kAXChildrenAttribute, None
        )

        error, role = AXUIElementCopyAttributeValue(
            element, kAXRoleAttribute, None
        )
        error, title = AXUIElementCopyAttributeValue(
            element, kAXTitleAttribute, None
        )

        node = {
            "role": role or "unknown",
            "title": title or "",
            "children": []
        }

        if children:
            for child in children:
                node["children"].append(build_tree(child))

        return node

    return build_tree(app_element)

def get_frontmost_app():
    """Get the PID of the frontmost application."""
    workspace = NSWorkspace.sharedWorkspace()
    frontmost = workspace.frontmostApplication()
    return frontmost.processIdentifier()
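
The prompt in Step 2 lets Claude refer to an element by elementId, but the tree built above carries no IDs. One possible way to close that gap is to tag each node with a path-based ID before serializing. This is a sketch under my own assumptions: assign_ids and find_by_id are hypothetical helpers, and the IDs are positional, so they are only valid for the snapshot they were generated from.

```python
from typing import Optional


def assign_ids(node: dict, path: str = "0") -> dict:
    """Return a copy of the tree where every node has a path-based "id"."""
    return {
        "id": path,
        "role": node.get("role", "unknown"),
        "title": node.get("title", ""),
        "children": [
            assign_ids(child, f"{path}.{i}")
            for i, child in enumerate(node.get("children", []))
        ],
    }


def find_by_id(node: dict, element_id: str) -> Optional[dict]:
    """Look up a node by the id that assign_ids gave it."""
    if node["id"] == element_id:
        return node
    for child in node["children"]:
        found = find_by_id(child, element_id)
        if found is not None:
            return found
    return None


tree = {"role": "AXApplication", "title": "Mail", "children": [
    {"role": "AXWindow", "title": "Inbox", "children": [
        {"role": "AXButton", "title": "New Message", "children": []},
    ]},
]}
tagged = assign_ids(tree)
print(find_by_id(tagged, "0.0.0")["title"])
```

Because the UI can change between snapshots, an ID should be resolved against the same tree that was sent to Claude, not a later one.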

Step 2: Send to Claude for Interpretation

Now comes the magic - let Claude understand the UI and plan actions:

import anthropic
import json

def get_action_plan(tree: dict, instruction: str) -> list:
    """Ask Claude to plan actions based on accessibility tree and user instruction."""

    client = anthropic.Anthropic()

    prompt = f"""You are a desktop automation assistant. Given an accessibility tree and a user instruction, determine what actions to take.

Accessibility Tree (JSON):
{json.dumps(tree, indent=2)}

User Instruction: {instruction}

Return a JSON array of actions. Each action should have:
- type: "click", "type", "key", or "wait"
- For click: elementId or description
- For type: text to type
- For key: key name (e.g., "return", "tab", "escape")
- For wait: duration in seconds

Example response:
[
  {{"type": "click", "description": "New Message button"}},
  {{"type": "type", "text": "recipient@example.com"}},
  {{"type": "key", "key": "tab"}},
  {{"type": "type", "text": "Subject line"}}
]

Only return the JSON array, no other text."""

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )

    return json.loads(response.content[0].text)
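
Calling json.loads directly on the reply assumes the model returns nothing but the array. In practice a reply may be wrapped in a Markdown code fence or carry a stray sentence, so a more defensive parser can help. The sketch below is one way to do that; parse_actions and ALLOWED_TYPES are hypothetical names, and the allowed types simply mirror the four listed in the prompt.

```python
import json

ALLOWED_TYPES = {"click", "type", "key", "wait"}
FENCE = "`" * 3  # Markdown code-fence marker


def parse_actions(raw: str) -> list:
    """Parse a model reply into a validated list of action dicts."""
    text = raw.strip()
    # Drop a surrounding Markdown code fence if the model added one.
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]
        text = "\n".join(lines)
    # Keep only the outermost JSON array, ignoring stray prose around it.
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end < start:
        raise ValueError("No JSON array found in model reply")
    actions = json.loads(text[start:end + 1])
    for action in actions:
        if not isinstance(action, dict) or action.get("type") not in ALLOWED_TYPES:
            raise ValueError(f"Unsupported action: {action!r}")
    return actions
```

With this in place, the last line of get_action_plan could return parse_actions(response.content[0].text) instead of calling json.loads directly.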

Step 3: Execute the Actions

Execute Claude’s plan using pyautogui:

import pyautogui
import time

pyautogui.FAILSAFE = True  # Move mouse to corner to abort

def execute_actions(actions: list, tree: dict):
    """Execute a list of actions from Claude's plan."""

    for action in actions:
        if action["type"] == "click":
            # Find element coordinates from tree
            coords = find_element_coordinates(tree, action.get("description"))
            if coords:
                pyautogui.click(coords["x"], coords["y"])
            else:
                # No match: report it instead of clicking blindly
                print(f"Could not find: {action.get('description')}")

        elif action["type"] == "type":
            pyautogui.write(action["text"])

        elif action["type"] == "key":
            pyautogui.press(action["key"])

        elif action["type"] == "wait":
            time.sleep(action["duration"])

        # Small delay between actions for app to respond
        time.sleep(0.5)

def find_element_coordinates(tree: dict, description: str) -> dict | None:
    """Search tree for element matching description and return coordinates."""
    # This requires getting AXPosition and AXSize from elements
    # Simplified here - actual implementation would traverse tree
    # and match by role/title/description
    pass
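
One possible way to fill in the stub is sketched below. It assumes each node may carry "position" and "size" entries (as you would get by also reading AXPosition and AXSize in Step 1, which the extractor above does not do), and it uses a deliberately naive word-overlap score to pick the best match. Both the extra keys and the scoring rule are my assumptions, not the original author's implementation.

```python
from typing import Optional


def find_element_coordinates(tree: dict, description: str) -> Optional[dict]:
    """Return the center of the best-matching node, or None if nothing matches.

    Assumes nodes may carry "position" ({"x", "y"}) and "size"
    ({"width", "height"}) keys; nodes without them are never selected.
    """
    words = set(description.lower().split())
    best_score, best_node = 0, None

    def visit(node: dict) -> None:
        nonlocal best_score, best_node
        label = f'{node.get("title", "")} {node.get("role", "")}'.lower()
        # Score = number of description words that appear in the label.
        score = sum(1 for word in words if word in label)
        if score > best_score and "position" in node and "size" in node:
            best_score, best_node = score, node
        for child in node.get("children", []):
            visit(child)

    visit(tree)
    if best_node is None:
        return None
    pos, size = best_node["position"], best_node["size"]
    return {"x": pos["x"] + size["width"] / 2,
            "y": pos["y"] + size["height"] / 2}
```

Substring matching is crude: it will happily match "Send" against "Send Later". For anything beyond a prototype, matching on an explicit element ID is likely to be more reliable than matching on a free-text description.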

Creating an MCP Server for Desktop Control

The Model Context Protocol (MCP) provides a standardized way for Claude to interact with tools. Here’s how to create an MCP server for desktop automation:

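import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import {
  CallToolRequestSchema,
  ListToolsRequestSchema,
} from "@modelcontextprotocol/sdk/types.js";

// getActiveAppTree, clickByDescription, clickAtPosition, typeText and pressKey
// are platform-specific helpers you need to implement yourself.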

const server = new Server(
  { name: "desktop-automation", version: "1.0.0" },
  { capabilities: { tools: {} } }
);

// Tool definitions
server.setRequestHandler(ListToolsRequestSchema, async () => ({
  tools: [
    {
      name: "get_accessibility_tree",
      description: "Get the accessibility tree of the active application",
      inputSchema: { type: "object", properties: {} },
    },
    {
      name: "click_element",
      description: "Click an element by description or coordinates",
      inputSchema: {
        type: "object",
        properties: {
          description: {
            type: "string",
            description: "Description of element to click (e.g., 'New Message button')",
          },
          x: { type: "number" },
          y: { type: "number" },
        },
      },
    },
    {
      name: "type_text",
      description: "Type text into the focused element",
      inputSchema: {
        type: "object",
        properties: {
          text: { type: "string" },
        },
        required: ["text"],
      },
    },
    {
      name: "press_key",
      description: "Press a keyboard key",
      inputSchema: {
        type: "object",
        properties: {
          key: {
            type: "string",
            description: "Key name: return, tab, escape, command, etc.",
          },
        },
        required: ["key"],
      },
    },
  ],
}));

// Tool execution
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  const { name, arguments: args } = request.params;

  switch (name) {
    case "get_accessibility_tree":
      const tree = await getActiveAppTree();
      return { content: [{ type: "text", text: JSON.stringify(tree, null, 2) }] };

    case "click_element":
      if (args.description) {
        await clickByDescription(args.description);
      } else if (args.x !== undefined && args.y !== undefined) {
        await clickAtPosition(args.x, args.y);
      }
      return { content: [{ type: "text", text: "Clicked" }] };

    case "type_text":
      await typeText(args.text);
      return { content: [{ type: "text", text: "Typed" }] };

    case "press_key":
      await pressKey(args.key);
      return { content: [{ type: "text", text: "Pressed" }] };

    default:
      throw new Error(`Unknown tool: ${name}`);
  }
});

// Start server
async function main() {
  const transport = new StdioServerTransport();
  await server.connect(transport);
}

main();

Why Claude Works Better Than Other Models

The Reddit user who pioneered this approach noted that Claude’s interpretation of accessibility tree data was significantly better than other models. Here’s why:

Context window handling: Accessibility trees can be large. Claude handles them gracefully without truncating important context.
Structured data understanding: Claude excels at parsing and reasoning about JSON/structured data formats.
Instruction following: When you say “find the compose button and click it”, Claude understands the intent and can reason about multiple ways to accomplish it.
Error recovery: When something doesn’t work, Claude can explain why and suggest alternatives.
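
Whatever the model, a large tree still costs tokens and latency, so it can pay to trim it before sending. The sketch below drops unlabeled nodes that have no surviving children and caps the depth. prune_tree is a hypothetical helper and the default depth limit is an arbitrary assumption; note that pruning unlabeled leaves can also discard icon-only buttons, so check the result against your target app.

```python
from typing import Optional


def prune_tree(node: dict, max_depth: int = 12) -> Optional[dict]:
    """Return a smaller copy of the tree, or None if the node carries no information."""
    children = []
    if max_depth > 0:
        for child in node.get("children", []):
            kept = prune_tree(child, max_depth - 1)
            if kept is not None:
                children.append(kept)
    # An untitled node with no surviving children tells the model nothing.
    if not node.get("title") and not children:
        return None
    return {
        "role": node.get("role", "unknown"),
        "title": node.get("title", ""),
        "children": children,
    }
```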

Common Mistakes to Avoid

I made several mistakes before getting this working reliably:

Not handling timing correctly: Applications need time to update their state between actions. I now add a short delay after every action, followed by a verification step:

import time

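def execute_with_retry(action, max_retries=3):
    # execute_action and verify_action_completed are app-specific helpers
    # that you supply; they are not defined in this article.
    for attempt in range(max_retries):
        execute_action(action)
        time.sleep(0.5)  # Let app update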

        # Verify the action succeeded
        if verify_action_completed(action):
            return True

        print(f"Retry {attempt + 1}/{max_retries}")

    return False
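
A fixed sleep is either too long or too short. An alternative is to poll for the expected state with a timeout, as in the sketch below. wait_until is a hypothetical helper and the timeout and interval defaults are arbitrary; the condition you pass in would typically re-read the accessibility tree and check for the element you expect.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 5.0,
               interval: float = 0.1) -> bool:
    """Poll condition() until it returns True or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

This returns as soon as the app is ready and gives a clear failure signal when it never gets there.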

Over-specifying instructions: I initially wrote instructions like “click the button at coordinates (342, 156)”. Now I write “click the Send button” and let Claude figure it out:

# BAD: Too specific, brittle
instruction = "Move mouse to x=342, y=156 and click, then wait 0.3 seconds"

# GOOD: Natural, intent-based
instruction = "Click the Send button to submit the form"

Ignoring accessibility tree changes: During long workflows, the tree changes. Re-read it between major steps:

def automate_workflow(instruction: str):
    steps = parse_into_steps(instruction)

    for step in steps:
        # Always get fresh tree before each action
        current_tree = get_accessibility_tree(get_frontmost_app())

        actions = get_action_plan(current_tree, step)
        execute_actions(actions, current_tree)

        # Verify and adjust if needed
        verify_result(step)
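
A cheap way to tell whether the tree changed at all between two reads is to compare fingerprints of the serialized tree. The sketch below assumes the tree is a JSON-serializable dict like the one built in Step 1; tree_fingerprint is a hypothetical helper name.

```python
import hashlib
import json


def tree_fingerprint(tree: dict) -> str:
    """Hash a JSON-serializable tree so two snapshots can be compared cheaply."""
    # sort_keys makes the hash independent of dict key order.
    canonical = json.dumps(tree, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


before = {"role": "AXWindow", "title": "Inbox", "children": []}
after = {"role": "AXWindow", "title": "New Message", "children": []}
print(tree_fingerprint(before) != tree_fingerprint(after))
```

If the fingerprint is identical before and after a click, the click probably did nothing, which is a useful signal for a retry. A changed fingerprint only shows that something changed, not that the right thing did.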

Assuming all apps have good accessibility support: Some applications (especially Electron apps and older software) have incomplete or incorrect accessibility trees. Test your target applications first:

def check_app_accessibility_support(app_pid: int) -> dict:
    tree = get_accessibility_tree(app_pid)

    # Count meaningful elements
    def count_elements(node):
        count = 1
        if node.get("children"):
            for child in node["children"]:
                count += count_elements(child)
        return count

    element_count = count_elements(tree)

    # Check for common issues
    issues = []
    if element_count < 10:
        issues.append("Very few elements - poor accessibility support")
    if not tree.get("children"):
        issues.append("No children found - tree might be empty")

    return {
        "element_count": element_count,
        "issues": issues,
        "quality": "good" if not issues else "problematic"
    }
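
Element count alone can be misleading: an app may expose hundreds of elements with no labels. A complementary check, sketched here under the assumption that nodes use the "title" key from Step 1, is the fraction of elements that have a non-empty title. label_coverage is a hypothetical helper, and what counts as a healthy ratio will vary by app.

```python
def label_coverage(tree: dict) -> float:
    """Return the fraction of nodes with a non-empty title (0.0 to 1.0)."""
    total = labeled = 0
    stack = [tree]
    while stack:
        node = stack.pop()
        total += 1
        if node.get("title"):
            labeled += 1
        stack.extend(node.get("children", []))
    return labeled / total
```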

A Complete Example: Automating Email

Here’s a complete example that composes and sends an email:

#!/usr/bin/env python3
"""Automate composing and sending an email using Claude."""

import anthropic
import json
# Assumes the Step 1 code is saved as accessibility.py and the Step 3 code as executor.py
from accessibility import get_accessibility_tree, get_frontmost_app
from executor import execute_actions
import time

def send_email(to: str, subject: str, body: str):
    """Compose and send an email through the frontmost mail app."""

    client = anthropic.Anthropic()

    # Get current state
    tree = get_accessibility_tree(get_frontmost_app())

    # Step 1: Click compose/new message
    prompt = f"""
    Current accessibility tree:
    {json.dumps(tree, indent=2)}

    User wants to compose a new email.
    Find and click the button to create a new message.
    Return actions as JSON array.
    """

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}]
    )

    actions = json.loads(response.content[0].text)
    execute_actions(actions, tree)
    time.sleep(1)  # Wait for compose window

    # Step 2: Fill in the email
    tree = get_accessibility_tree(get_frontmost_app())

    prompt = f"""
    Current accessibility tree:
    {json.dumps(tree, indent=2)}

    Fill in this email:
    To: {to}
    Subject: {subject}
    Body: {body}

    Tab between fields as needed. Return actions as JSON array.
    """

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}]
    )

    actions = json.loads(response.content[0].text)
    execute_actions(actions, tree)
    time.sleep(0.5)

    # Step 3: Send
    tree = get_accessibility_tree(get_frontmost_app())

    prompt = f"""
    Current accessibility tree:
    {json.dumps(tree, indent=2)}

    Click the send button to send this email.
    Return actions as JSON array.
    """

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}]
    )

    actions = json.loads(response.content[0].text)
    execute_actions(actions, tree)

    print("Email sent!")

if __name__ == "__main__":
    send_email(
        to="recipient@example.com",
        subject="Test from Claude Automation",
        body="This email was composed and sent by Claude!"
    )
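
The three steps above repeat the same read, plan, execute cycle. A small loop can express that once. The sketch below takes the three operations as parameters so it can be exercised without a real app or API key; run_steps and its parameter names are hypothetical, and the fixed settle delay is a simplification.

```python
import time
from typing import Callable


def run_steps(
    steps: list,
    get_tree: Callable[[], dict],
    plan: Callable[[dict, str], list],
    execute: Callable[[list, dict], None],
    settle_seconds: float = 1.0,
) -> None:
    """Run each instruction against a freshly read accessibility tree."""
    for step in steps:
        tree = get_tree()            # re-read: the previous step changed the UI
        actions = plan(tree, step)   # e.g. get_action_plan from Step 2
        execute(actions, tree)       # e.g. execute_actions from Step 3
        time.sleep(settle_seconds)   # crude wait for the app to settle


# Possible wiring with the functions from this article (not executed here):
# run_steps(
#     ["Click the button to create a new message",
#      "Fill in the To, Subject and Body fields",
#      "Click the send button"],
#     lambda: get_accessibility_tree(get_frontmost_app()),
#     get_action_plan,
#     execute_actions,
# )
```

Passing the operations in as functions also makes it straightforward to swap the fixed delay for a verification step later.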

Platform-Specific Notes

macOS: Use the Accessibility API via pyobjc, or call AppleScript directly for some tasks. Grant accessibility permission to your terminal or IDE (on recent macOS versions, under System Settings > Privacy & Security > Accessibility).

Windows: Use the UI Automation framework via pywinauto or the Windows Accessibility API. Some apps require running as administrator.

Linux: Use AT-SPI (Assistive Technology Service Provider Interface). Works best with GTK applications; Qt and Electron apps have varying support.

Next Steps

To start building your own desktop automation:

Explore accessibility tools: Use Accessibility Inspector (macOS), Accessibility Insights (Windows), or Accerciser (Linux) to see what data your target apps expose.
Start small: Automate a single repetitive task first - like moving emails to folders or filling in forms.
Build an MCP server: This gives you a standardized interface that works with Claude Code and other MCP-compatible tools.
Add verification: After each action, re-read the accessibility tree to confirm the expected state change occurred.
Handle errors gracefully: When Claude can’t find an element or an action fails, provide clear feedback and recovery options.

The combination of Claude’s reasoning capabilities with accessibility tree data creates a powerful automation system that adapts to UI changes automatically. No more brittle selectors or broken macros.

Final Words + More Resources

My intention with this article was to share my knowledge and experience in the hope that it helps others. If you want to get in touch, you can contact me by email.
