Bridging the gap between natural language and precise computer automation
The Problem
Modern computer automation faces a fundamental challenge: bridging the gap between natural language commands and precise UI interactions. Traditional automation tools require explicit scripting with exact coordinates or element identifiers, making them brittle and inaccessible to non-technical users.
Natural Language Ambiguity
Users express tasks in varied, informal ways, e.g. "open notepad and type hello".
Dynamic UI Elements
Screen layouts change with resolution, theme, and application state.
Visual Element Identification
Clicking the right button requires understanding visual context.
Cross-Application Automation
Different applications follow different UI patterns and interaction models.
Our Solution
ASTR implements a Two-Model Pipeline architecture that separates task planning from visual execution. This separation allows each model to specialize, improving accuracy and maintainability.
Planner Model (Gemini Flash Lite)
Understands WHAT to do — converts natural language into structured execution plans with keyboard and visual click steps.
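To make "structured execution plan" concrete, here is a minimal sketch of what the planner's output might look like. The field names (`steps`, `action`, `target`, etc.) are illustrative assumptions, not ASTR's actual schema:

```python
# Hypothetical plan for "open notepad and type hello".
# Keyboard steps are fully specified; visual_click steps carry only a
# description, to be resolved into coordinates by the Vision Mapper.
plan = {
    "task": "open notepad and type hello",
    "steps": [
        {"action": "key_combo", "keys": ["win", "r"]},        # open Run dialog
        {"action": "type_text", "text": "notepad"},
        {"action": "key_press", "key": "enter"},
        {"action": "visual_click", "description": "text editing area"},
        {"action": "type_text", "text": "hello"},
    ],
}

def validate_plan(plan: dict) -> bool:
    """Check that every step declares a known action type."""
    known = {"key_combo", "key_press", "type_text", "visual_click"}
    return all(step.get("action") in known for step in plan.get("steps", []))
```

Separating keyboard steps from visual-click steps is what lets the planner finish without ever seeing the screen.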
Vision Mapper (Gemini 2.5 Flash)
Understands WHERE to click — uses FastSAM Set-of-Mark technique to identify UI elements in annotated screenshots.
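The core of the Set-of-Mark technique is a numbered mapping: each segmented region gets a visible mark on the screenshot, the vision model answers with a mark number, and that number is mapped back to pixel coordinates. A minimal sketch of that mapping, assuming detections arrive as `(x, y, w, h)` boxes (FastSAM's real output format differs):

```python
# Sketch of Set-of-Mark bookkeeping: number each detected region and
# record the click point (box center) for each mark.
def assign_marks(boxes):
    """Map mark number -> (x, y) click point for each (x, y, w, h) box."""
    return {
        idx + 1: (x + w // 2, y + h // 2)
        for idx, (x, y, w, h) in enumerate(boxes)
    }

boxes = [(10, 20, 100, 30), (150, 20, 80, 30)]  # toy detections
marks = assign_marks(boxes)
# If the vision model answers "mark 2", the client clicks marks[2].
```

The real pipeline would also draw the mark numbers onto the screenshot before sending it to the model; only the lookup logic is shown here.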
Local Execution Client
Executes the plan using pyautogui for mouse/keyboard control with real-time WebSocket communication.
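A sketch of the client's step dispatcher. In the real client the steps would arrive over a WebSocket and the backend would be pyautogui itself; here the backend is injected so the dispatch logic can be shown without a GUI. Step shapes match the hypothetical plan schema above and are assumptions, not ASTR's wire format:

```python
# Dispatch one plan step to an input backend (pyautogui in the real client).
def execute_step(step, backend):
    action = step["action"]
    if action == "type_text":
        backend.write(step["text"])    # pyautogui.write
    elif action == "key_press":
        backend.press(step["key"])     # pyautogui.press
    elif action == "visual_click":
        x, y = step["coords"]          # coordinates resolved by the Vision Mapper
        backend.click(x, y)            # pyautogui.click
    else:
        raise ValueError(f"unknown action: {action}")
```

Injecting the backend also makes the client testable: a recording fake can stand in for pyautogui in unit tests.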
Technology Stack
Performance
Plan Generation: 1-2s
Vision Pass: 3-5s
Per Step: 0.3s
Total Task: 5-15s