How to Use Claude for Desktop Automation with Accessibility Tree
I was frustrated. Every time I tried to automate repetitive tasks on my desktop, I ended up with brittle scripts that broke the moment an application updated its UI. Recording macros? Forget it. Writing AppleScript or AutoHotkey scripts? They work until a button moves or a label changes.
Then I stumbled upon a Reddit post where someone was using Claude Code as the “brain” for desktop automation. The key insight: instead of relying on fragile selectors, they were feeding Claude the accessibility tree of the active application and letting Claude figure out what to click.
The Problem with Traditional Desktop Automation
Traditional approaches have serious limitations:
- Macro recorders capture exact coordinates and timing. Move a window, and your automation fails.
- Selector-based automation (like Selenium for web) requires maintaining element locators. UI changes break everything.
- Visual recognition (computer vision, OCR) is slow and error-prone.
- Scripting languages (AppleScript, AutoHotkey) require knowing application-specific dictionaries and APIs.
What I needed was something that could understand intent and adapt to UI changes automatically.
What is an Accessibility Tree?
The accessibility tree is a semantic representation of a UI that screen readers use. It contains:
- Element types (button, text field, menu)
- Labels and descriptions
- States (enabled, focused, expanded)
- Hierarchy and relationships
- Actions available on each element
Here’s what a simple accessibility tree might look like for a mail application:

    Application: Mail
    ├── Window: "Inbox - 42 messages"
    │   ├── Toolbar
    │   │   ├── Button: "New Message" (enabled)
    │   │   ├── Button: "Reply" (disabled)
    │   │   └── Button: "Delete" (disabled)
    │   ├── Table: Message List
    │   │   ├── Row: "John Smith - Project Update - Today"
    │   │   ├── Row: "Jane Doe - Meeting Notes - Yesterday"
    │   │   └── ...
    │   └── StaticText: "Select a message to read"

This is structured, semantic data - perfect for an LLM to understand and act upon.
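As a rough sketch of what "structured, semantic data" means in practice, the mail tree above could be held as nested Python dicts and queried directly. The key names (`role`, `title`, `enabled`, `children`) and both helper functions are illustrative assumptions, not an official schema:

```python
# Hypothetical node layout mirroring the mail example; key names are illustrative.
mail_tree = {
    "role": "Application", "title": "Mail",
    "children": [{
        "role": "Window", "title": "Inbox - 42 messages",
        "children": [
            {"role": "Toolbar", "title": "", "children": [
                {"role": "Button", "title": "New Message", "enabled": True},
                {"role": "Button", "title": "Reply", "enabled": False},
                {"role": "Button", "title": "Delete", "enabled": False},
            ]},
            {"role": "StaticText", "title": "Select a message to read"},
        ],
    }],
}


def render_outline(node, depth=0):
    """Return the tree as a list of indented lines, one element per line."""
    label = node["role"]
    title = node.get("title", "")
    if title:
        label += f': "{title}"'
    if "enabled" in node:
        label += " (enabled)" if node["enabled"] else " (disabled)"
    lines = ["  " * depth + label]
    for child in node.get("children", []):
        lines.extend(render_outline(child, depth + 1))
    return lines


def enabled_buttons(node):
    """Collect the titles of all enabled buttons, depth first."""
    found = []
    if node["role"] == "Button" and node.get("enabled"):
        found.append(node["title"])
    for child in node.get("children", []):
        found.extend(enabled_buttons(child))
    return found


print("\n".join(render_outline(mail_tree)))
print(enabled_buttons(mail_tree))
```

An indented outline like this is also a compact alternative to raw JSON when the tree is pasted into a prompt.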
Building the Automation Pipeline
Step 1: Extract the Accessibility Tree
On macOS, you can use the Accessibility API through Python with pyobjc:

    from AppKit import NSWorkspace
    from ApplicationServices import (
        AXUIElementCreateApplication,
        AXUIElementCopyAttributeValue,
        kAXChildrenAttribute,
        kAXRoleAttribute,
        kAXTitleAttribute,
    )


    def get_accessibility_tree(app_pid: int) -> dict:
        """Extract the accessibility tree from an application by PID."""
        app_element = AXUIElementCreateApplication(app_pid)

        def build_tree(element):
            # Each call returns an (error, value) pair; value is None on failure
            error, children = AXUIElementCopyAttributeValue(
                element, kAXChildrenAttribute, None
            )
            error, role = AXUIElementCopyAttributeValue(
                element, kAXRoleAttribute, None
            )
            error, title = AXUIElementCopyAttributeValue(
                element, kAXTitleAttribute, None
            )

            node = {
                "role": role or "unknown",
                "title": title or "",
                "children": [],
            }

            # Recurse into child elements, if any
            if children:
                for child in children:
                    node["children"].append(build_tree(child))

            return node

        return build_tree(app_element)


    def get_frontmost_app():
        """Get the PID of the frontmost application."""
        workspace = NSWorkspace.sharedWorkspace()
        frontmost = workspace.frontmostApplication()
        return frontmost.processIdentifier()

Step 2: Send to Claude for Interpretation
Now comes the magic - let Claude understand the UI and plan actions:

    import anthropic
    import json


    def get_action_plan(tree: dict, instruction: str) -> list:
        """Ask Claude to plan actions based on accessibility tree and user instruction."""
        client = anthropic.Anthropic()

        prompt = f"""You are a desktop automation assistant. Given an accessibility tree and a user instruction, determine what actions to take.

    Accessibility Tree (JSON):
    {json.dumps(tree, indent=2)}

    User Instruction: {instruction}

    Return a JSON array of actions. Each action should have:
    - type: "click", "type", "key", or "wait"
    - For click: elementId or description
    - For type: text to type
    - For key: key name (e.g., "return", "tab", "escape")
    - For wait: duration in seconds

    Example response:
    [
      {{"type": "click", "description": "New Message button"}},
      {{"type": "key", "key": "tab"}},
      {{"type": "type", "text": "Subject line"}}
    ]

    Only return the JSON array, no other text."""

        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )

        return json.loads(response.content[0].text)

Step 3: Execute the Actions
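Before executing anything, it can help to parse Claude's reply defensively, since a plain `json.loads` raises if the reply is wrapped in a code fence and says nothing about malformed actions. The following is a minimal sketch under that assumption; `parse_action_plan` and `ACTION_FIELDS` are hypothetical names, and the field rules simply restate the action format requested in the prompt:

```python
import json

# For each action type, at least one of the listed fields must be present.
# These rules restate the prompt's action format; the names are illustrative.
ACTION_FIELDS = {
    "click": ("description", "elementId"),
    "type": ("text",),
    "key": ("key",),
    "wait": ("duration",),
}

FENCE = "`" * 3  # Markdown code fence marker


def parse_action_plan(raw_text: str) -> list:
    """Parse model output into a validated list of action dicts."""
    text = raw_text.strip()
    # Tolerate a Markdown code fence around the JSON, in case the model adds one.
    if text.startswith(FENCE):
        text = text.strip("`").strip()
        if text.startswith("json"):
            text = text[len("json"):]

    actions = json.loads(text)
    if not isinstance(actions, list):
        raise ValueError("expected a JSON array of actions")

    for i, action in enumerate(actions):
        if not isinstance(action, dict):
            raise ValueError(f"action {i}: expected an object")
        fields = ACTION_FIELDS.get(action.get("type"))
        if fields is None:
            raise ValueError(f"action {i}: unknown type {action.get('type')!r}")
        if not any(field in action for field in fields):
            raise ValueError(f"action {i}: needs one of {fields}")
    return actions
```

Rejecting a bad plan up front is cheaper than discovering it halfway through a sequence of clicks and keystrokes.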
Execute Claude’s plan using pyautogui:

    import pyautogui
    import time

    pyautogui.FAILSAFE = True  # Move mouse to corner to abort


    def execute_actions(actions: list, tree: dict):
        """Execute a list of actions from Claude's plan."""
        for action in actions:
            if action["type"] == "click":
                # Find element coordinates from tree
                coords = find_element_coordinates(tree, action.get("description"))
                if coords:
                    pyautogui.click(coords["x"], coords["y"])
                else:
                    # No match: report it rather than clicking blindly
                    print(f"Could not find: {action.get('description')}")

            elif action["type"] == "type":
                pyautogui.write(action["text"])

            elif action["type"] == "key":
                pyautogui.press(action["key"])

            elif action["type"] == "wait":
                time.sleep(action["duration"])

            # Small delay between actions for app to respond
            time.sleep(0.5)


    def find_element_coordinates(tree: dict, description: str) -> dict | None:
        """Search tree for element matching description and return coordinates."""
        # Simplified sketch. A full implementation would read AXPosition and
        # AXSize for each element and match on role/title/description.
        # Assumption: nodes may carry optional "x" and "y" keys, which the
        # Step 1 extractor does not record as written.
        if not description:
            return None
        title = tree.get("title", "")
        if title and title.lower() in description.lower() and "x" in tree and "y" in tree:
            return {"x": tree["x"], "y": tree["y"]}
        for child in tree.get("children", []):
            found = find_element_coordinates(child, description)
            if found:
                return found
        return None

Creating an MCP Server for Desktop Control
The Model Context Protocol (MCP) provides a standardized way for Claude to interact with tools. Here’s how to create an MCP server for desktop automation:

    import { Server } from "@modelcontextprotocol/sdk/server/index.js";
    import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
    import {
      CallToolRequestSchema,
      ListToolsRequestSchema,
    } from "@modelcontextprotocol/sdk/types.js";

    // getActiveAppTree, clickByDescription, clickAtPosition, typeText and pressKey
    // are platform-specific helpers assumed to be implemented elsewhere.

    const server = new Server(
      { name: "desktop-automation", version: "1.0.0" },
      { capabilities: { tools: {} } }
    );

    // Tool definitions
    server.setRequestHandler(ListToolsRequestSchema, async () => ({
      tools: [
        {
          name: "get_accessibility_tree",
          description: "Get the accessibility tree of the active application",
          inputSchema: { type: "object", properties: {} },
        },
        {
          name: "click_element",
          description: "Click an element by description or coordinates",
          inputSchema: {
            type: "object",
            properties: {
              description: {
                type: "string",
                description: "Description of element to click (e.g., 'New Message button')",
              },
              x: { type: "number" },
              y: { type: "number" },
            },
          },
        },
        {
          name: "type_text",
          description: "Type text into the focused element",
          inputSchema: {
            type: "object",
            properties: {
              text: { type: "string" },
            },
            required: ["text"],
          },
        },
        {
          name: "press_key",
          description: "Press a keyboard key",
          inputSchema: {
            type: "object",
            properties: {
              key: {
                type: "string",
                description: "Key name: return, tab, escape, command, etc.",
              },
            },
            required: ["key"],
          },
        },
      ],
    }));

    // Tool execution
    server.setRequestHandler(CallToolRequestSchema, async (request) => {
      const { name, arguments: args = {} } = request.params;

      switch (name) {
        case "get_accessibility_tree": {
          const tree = await getActiveAppTree();
          return { content: [{ type: "text", text: JSON.stringify(tree, null, 2) }] };
        }

        case "click_element":
          if (args.description) {
            await clickByDescription(args.description);
          } else if (args.x !== undefined && args.y !== undefined) {
            await clickAtPosition(args.x, args.y);
          }
          return { content: [{ type: "text", text: "Clicked" }] };

        case "type_text":
          await typeText(args.text);
          return { content: [{ type: "text", text: "Typed" }] };

        case "press_key":
          await pressKey(args.key);
          return { content: [{ type: "text", text: "Pressed" }] };

        default:
          throw new Error(`Unknown tool: ${name}`);
      }
    });

    // Start server
    async function main() {
      const transport = new StdioServerTransport();
      await server.connect(transport);
    }

    main().catch(console.error);

Why Claude Works Better Than Other Models
The Reddit user who pioneered this approach noted that Claude’s interpretation of accessibility tree data was significantly better than other models. Here’s why:
- Context window handling: Accessibility trees can be large. Claude handles them gracefully without truncating important context.
- Structured data understanding: Claude excels at parsing and reasoning about JSON/structured data formats.
- Instruction following: When you say “find the compose button and click it”, Claude understands the intent and can reason about multiple ways to accomplish it.
- Error recovery: When something doesn’t work, Claude can explain why and suggest alternatives.
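Even with a large context window, it is usually worth trimming the tree before sending it. The sketch below shows one possible approach, assuming the `role`/`title`/`children` node layout from Step 1; `prune_tree` and its `max_depth` default are hypothetical, and the pruning rule (drop leaves with no title and an unknown role, omit empty keys) is only one of many reasonable choices:

```python
import json


def prune_tree(node: dict, max_depth: int = 8, depth: int = 0):
    """Return a trimmed copy of node, or None if the subtree carries no information."""
    children = []
    if depth < max_depth:
        for child in node.get("children", []):
            kept = prune_tree(child, max_depth, depth + 1)
            if kept is not None:
                children.append(kept)

    role = node.get("role", "unknown")
    title = node.get("title", "")
    # A leaf with no title and no known role tells the model nothing.
    if not children and not title and role == "unknown":
        return None

    # Omit empty keys so the serialized tree stays small.
    pruned = {"role": role}
    if title:
        pruned["title"] = title
    if children:
        pruned["children"] = children
    return pruned


tree = {
    "role": "AXWindow", "title": "Inbox", "children": [
        {"role": "unknown", "title": "", "children": []},
        {"role": "AXButton", "title": "Reply", "children": []},
    ],
}
compact = json.dumps(prune_tree(tree), separators=(",", ":"))
print(len(json.dumps(tree, indent=2)), "->", len(compact), "characters")
```

Serializing without indentation, as in the last lines, saves further tokens at no cost to the model's ability to read the structure.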
Common Mistakes to Avoid
I made several mistakes before getting this working reliably:
Not handling timing correctly: Applications need time to update their state between actions. I added delays after every action and verification steps:

    import time


    def execute_with_retry(action, max_retries=3):
        # execute_action and verify_action_completed are app-specific helpers
        # that you supply for your own workflow.
        for attempt in range(max_retries):
            execute_action(action)
            time.sleep(0.5)  # Let app update

            # Verify the action succeeded
            if verify_action_completed(action):
                return True

            print(f"Retry {attempt + 1}/{max_retries}")

        return False

Over-specifying instructions: I initially wrote instructions like “click the button at coordinates (342, 156)”. Now I write “click the Send button” and let Claude figure it out:

    # BAD: Too specific, brittle
    instruction = "Move mouse to x=342, y=156 and click, then wait 0.3 seconds"

    # GOOD: Natural, intent-based
    instruction = "Click the Send button to submit the form"

Ignoring accessibility tree changes: During long workflows, the tree changes. Re-read it between major steps:

    def automate_workflow(instruction: str):
        steps = parse_into_steps(instruction)

        for step in steps:
            # Always get fresh tree before each action
            current_tree = get_accessibility_tree(get_frontmost_app())

            actions = get_action_plan(current_tree, step)
            execute_actions(actions, current_tree)

            # Verify and adjust if needed
            verify_result(step)

Assuming all apps have good accessibility support: Some applications (especially Electron apps and older software) have incomplete or incorrect accessibility trees. Test your target applications first:

    def check_app_accessibility_support(app_pid: int) -> dict:
        tree = get_accessibility_tree(app_pid)

        # Count meaningful elements
        def count_elements(node):
            count = 1
            if node.get("children"):
                for child in node["children"]:
                    count += count_elements(child)
            return count

        element_count = count_elements(tree)

        # Check for common issues
        issues = []
        if element_count < 10:
            issues.append("Very few elements - poor accessibility support")
        if not tree.get("children"):
            issues.append("No children found - tree might be empty")

        return {
            "element_count": element_count,
            "issues": issues,
            "quality": "good" if not issues else "problematic",
        }

A Complete Example: Automating Email
Here’s a complete example that composes and sends an email:

    #!/usr/bin/env python3
    """Automate composing and sending an email using Claude."""

    import json
    import time

    import anthropic

    # Local modules holding the Step 1 and Step 3 code from this article
    from accessibility import get_accessibility_tree, get_frontmost_app
    from executor import execute_actions


    def send_email(to: str, subject: str, body: str):
        """Compose and send an email through the frontmost mail app."""
        client = anthropic.Anthropic()

        # Get current state
        tree = get_accessibility_tree(get_frontmost_app())

        # Step 1: Click compose/new message
        prompt = f"""
        Current accessibility tree:
        {json.dumps(tree, indent=2)}

        User wants to compose a new email.
        Find and click the button to create a new message.
        Return actions as JSON array.
        """

        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=256,
            messages=[{"role": "user", "content": prompt}],
        )

        actions = json.loads(response.content[0].text)
        execute_actions(actions, tree)
        time.sleep(1)  # Wait for compose window

        # Step 2: Fill in the email
        tree = get_accessibility_tree(get_frontmost_app())

        prompt = f"""
        Current accessibility tree:
        {json.dumps(tree, indent=2)}

        Fill in this email:
        To: {to}
        Subject: {subject}
        Body: {body}

        Tab between fields as needed.
        Return actions as JSON array.
        """

        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=512,
            messages=[{"role": "user", "content": prompt}],
        )

        actions = json.loads(response.content[0].text)
        execute_actions(actions, tree)
        time.sleep(0.5)

        # Step 3: Send
        tree = get_accessibility_tree(get_frontmost_app())

        prompt = f"""
        Current accessibility tree:
        {json.dumps(tree, indent=2)}

        Click the send button to send this email.
        Return actions as JSON array.
        """

        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=256,
            messages=[{"role": "user", "content": prompt}],
        )

        actions = json.loads(response.content[0].text)
        execute_actions(actions, tree)

        print("Email sent!")


    if __name__ == "__main__":
        send_email(
            to="recipient@example.com",  # placeholder; replace with a real address
            subject="Test from Claude Automation",
            body="This email was composed and sent by Claude!",
        )

Platform-Specific Notes
macOS: Use the Accessibility API via pyobjc or directly call AppleScript for some tasks. Enable accessibility permissions for your terminal/IDE.
Windows: Use the UI Automation framework via pywinauto or the Windows Accessibility API. Some apps require running as administrator.
Linux: Use AT-SPI (Assistive Technology Service Provider Interface). Works best with GTK applications; Qt and Electron apps have varying support.
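If the same script needs to run on more than one platform, the choice of backend can be made from `sys.platform`. This is only a sketch of the dispatch step: the `BACKENDS` table restates the notes above, and the dictionary keys and `pick_backend` name are hypothetical:

```python
import sys

# Mapping taken from the platform notes above; keys and names are illustrative.
BACKENDS = {
    "darwin": {"api": "Accessibility API", "binding": "pyobjc"},
    "win32": {"api": "UI Automation", "binding": "pywinauto"},
    "linux": {"api": "AT-SPI", "binding": None},  # no Python binding named in the text
}


def pick_backend(platform: str = sys.platform) -> dict:
    """Return the accessibility backend description for a sys.platform value."""
    for prefix, backend in BACKENDS.items():
        if platform.startswith(prefix):
            return backend
    raise NotImplementedError(f"no accessibility backend known for {platform!r}")


if __name__ == "__main__":
    print(pick_backend("darwin"))
```

The tree extraction code itself still has to be written once per platform; only the rest of the pipeline (prompting, planning, execution) can be shared.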
Next Steps
To start building your own desktop automation:
- Explore accessibility tools: Use Accessibility Inspector (macOS), Accessibility Insights (Windows), or Accerciser (Linux) to see what data your target apps expose.
- Start small: Automate a single repetitive task first - like moving emails to folders or filling in forms.
- Build an MCP server: This gives you a standardized interface that works with Claude Code and other MCP-compatible tools.
- Add verification: After each action, re-read the accessibility tree to confirm the expected state change occurred.
- Handle errors gracefully: When Claude can’t find an element or an action fails, provide clear feedback and recovery options.
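The verification step can be sketched as a diff between two tree snapshots taken before and after an action. This assumes the `role`/`title`/`children` node layout from Step 1; `collect_elements`, `diff_trees`, and `action_had_effect` are hypothetical helpers, and identifying elements by path and title is a simplification (identical siblings collapse into one entry):

```python
def collect_elements(node: dict, path: str = "") -> set:
    """Flatten a tree into a set of (path, title) pairs."""
    here = f"{path}/{node.get('role', 'unknown')}"
    elements = {(here, node.get("title", ""))}
    for child in node.get("children", []):
        elements |= collect_elements(child, here)
    return elements


def diff_trees(before: dict, after: dict) -> dict:
    """Report which elements appeared or disappeared between two snapshots."""
    old, new = collect_elements(before), collect_elements(after)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


def action_had_effect(before: dict, after: dict, expected_title: str) -> bool:
    """True if an element with expected_title appeared after the action."""
    added = diff_trees(before, after)["added"]
    return any(title == expected_title for _, title in added)


before = {"role": "App", "title": "Mail", "children": [
    {"role": "Window", "title": "Inbox"},
]}
after = {"role": "App", "title": "Mail", "children": [
    {"role": "Window", "title": "Inbox"},
    {"role": "Window", "title": "New Message"},
]}
print(diff_trees(before, after))
print(action_had_effect(before, after, "New Message"))
```

When the expected change is missing, the diff itself is useful feedback to pass back to Claude along with the fresh tree.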
The combination of Claude’s reasoning capabilities with accessibility tree data creates a powerful automation system that adapts to UI changes automatically. No more brittle selectors or broken macros.
Final Words + More Resources
My intention with this article was to share my knowledge and experience with others. If you want to get in touch, you can contact me by email.