ASTR

Bridging the gap between natural language and precise computer automation

The Problem

Modern computer automation faces a fundamental challenge: bridging the gap between natural language commands and precise UI interactions. Traditional automation tools require explicit scripting with exact coordinates or element identifiers, making them brittle and inaccessible to non-technical users.

Natural Language Ambiguity

Users express tasks in varied, informal ways like "open notepad and type hello"

Dynamic UI Elements

Screen layouts change based on resolution, themes, and application state

Visual Element Identification

Clicking the right button requires understanding visual context

Cross-Application Automation

Different apps have different UI patterns and interaction models

Our Solution

ASTR implements a Two-Model Pipeline architecture that separates task planning from visual execution. This separation allows each model to specialize, improving accuracy and maintainability.

1. Planner Model (Gemini Flash Lite)

Understands WHAT to do — converts natural language into structured execution plans with keyboard and visual click steps.
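As an illustration of what "structured execution plan" can mean here, a planner output for "open notepad and type hello" might look like the sketch below. The step types and field names are assumptions for illustration, not ASTR's actual schema:

```python
# Hypothetical structured execution plan for "open notepad and type hello".
# Step types and field names are illustrative, not ASTR's actual schema.
plan = {
    "task": "open notepad and type hello",
    "steps": [
        {"type": "keyboard", "action": "hotkey", "keys": ["win", "r"]},  # open Run dialog
        {"type": "keyboard", "action": "write", "text": "notepad"},
        {"type": "keyboard", "action": "press", "key": "enter"},
        {"type": "visual_click", "target": "the text editing area"},     # deferred to the vision mapper
        {"type": "keyboard", "action": "write", "text": "hello"},
    ],
}

# The planner resolves keyboard steps itself and defers anything that
# needs screen understanding to visual_click steps.
kinds = [s["type"] for s in plan["steps"]]
print(kinds.count("keyboard"), kinds.count("visual_click"))  # 4 1
```

The key property is the split itself: keyboard steps need no screen context, while each visual click step carries only a natural-language target description for the second model to resolve.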

2. Vision Mapper (Gemini 2.5 Flash)

Understands WHERE to click — uses FastSAM Set-of-Mark technique to identify UI elements in annotated screenshots.
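Set-of-Mark prompting overlays numbered labels on segmented UI regions so the vision model only has to answer with a mark number rather than raw pixel coordinates. A minimal sketch of the mapping step follows; the FastSAM segmentation and the model call are stubbed out, and the box values are made-up stand-ins:

```python
# Sketch of the Set-of-Mark mapping step. In ASTR, FastSAM segments the
# screenshot and each region gets a numbered mark; here the boxes are
# hard-coded stand-ins for FastSAM output, as (x, y, width, height).
marks = {
    1: (10, 40, 200, 30),    # e.g. a menu bar region
    2: (10, 80, 980, 600),   # e.g. the text editing area
    3: (900, 10, 80, 24),    # e.g. a close button
}

def mark_to_click_point(mark_id, marks):
    """Convert the mark chosen by the vision model into a screen
    coordinate: the center of that mark's bounding box."""
    x, y, w, h = marks[mark_id]
    return (x + w // 2, y + h // 2)

# Suppose the vision model, shown the annotated screenshot and the target
# "the text editing area", answered with mark 2:
chosen = 2
print(mark_to_click_point(chosen, marks))  # (500, 380)
```

Because the model chooses among a small set of labeled regions, ambiguity collapses to a single integer answer, which is much easier to validate than free-form coordinates.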

3. Local Execution Client

Executes the plan using pyautogui for mouse/keyboard control with real-time WebSocket communication.
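A minimal sketch of the client's dispatch loop, with the pyautogui calls hidden behind an injectable executor so the control flow runs standalone. In the real client the executor would wrap pyautogui (e.g. `write`, `press`, `click`) and steps would arrive over the WebSocket; the step schema here is an assumption:

```python
# Sketch of the local execution client's dispatch loop. The executor is
# injected so this runs without pyautogui; the real client would wrap
# pyautogui.write/press/click and receive steps over a WebSocket.
def run_plan(steps, executor, resolve_click):
    for step in steps:
        if step["type"] == "keyboard":
            if step["action"] == "write":
                executor.write(step["text"])
            elif step["action"] == "press":
                executor.press(step["key"])
        elif step["type"] == "visual_click":
            # Ask the vision mapper WHERE the target is, then click there.
            x, y = resolve_click(step["target"])
            executor.click(x, y)

class RecordingExecutor:
    """Test double that records actions instead of moving the mouse."""
    def __init__(self):
        self.log = []
    def write(self, text): self.log.append(("write", text))
    def press(self, key):  self.log.append(("press", key))
    def click(self, x, y): self.log.append(("click", x, y))

steps = [
    {"type": "keyboard", "action": "write", "text": "notepad"},
    {"type": "keyboard", "action": "press", "key": "enter"},
    {"type": "visual_click", "target": "text editing area"},
    {"type": "keyboard", "action": "write", "text": "hello"},
]
ex = RecordingExecutor()
run_plan(steps, ex, resolve_click=lambda target: (500, 380))
print(ex.log)
```

Separating the loop from the executor also makes the client testable: the same plan can be replayed against a recorder in CI and against pyautogui on the user's machine.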

Technology Stack


Gemini AI: Planner & Vision
FastSAM: UI Detection
React Native: Mobile App
Flask: Backend Server
WebSocket: Real-time Communication
Python: Automation

Performance

Plan Generation: 1-2s
Vision Pass: 3-5s
Per Step: 0.3s
Total Task: 5-15s