A deep dive into ASTR's Two-Model Pipeline architecture
- 2 AI Models
- 1-2s Plan Generation
- 3-5s Vision Pass
- 0.3s Per Step
Main Pipeline
From natural language command to task completion
Two-Model Architecture
Separating task planning from visual execution for maximum accuracy
Planner Model (Gemini Flash Lite)
Understands WHAT to do — converts natural language into structured execution plans.
- Supports General and FlexiSIGN modes
- Outputs a JSON sequence of steps
- Handles keyboard and visual click actions
- ~1-2 seconds response time
Vision Mapper (Gemini 2.5 Flash)
Understands WHERE to click — identifies UI elements in annotated screenshots.
- Uses the Set-of-Mark (SoM) technique
- FastSAM detects all UI elements
- Returns element ID mappings
- Single-pass for efficiency
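The SoM step reduces to a lookup: each FastSAM bounding box gets a numeric mark drawn on the screenshot, the vision model answers with a mark ID, and the click lands at that box's center. This is a minimal sketch of that lookup under assumed data shapes (boxes as `(x1, y1, x2, y2)` tuples); the helper names are illustrative, not ASTR's API.

```python
# Assumed shape: FastSAM yields (x1, y1, x2, y2) bounding boxes.

def mark_boxes(boxes):
    """Assign Set-of-Mark IDs to detected boxes: {mark_id: box}."""
    return {i: box for i, box in enumerate(boxes, start=1)}

def click_point(marked, element_id):
    """Center of the box the vision model selected by mark ID."""
    x1, y1, x2, y2 = marked[element_id]
    return ((x1 + x2) // 2, (y1 + y2) // 2)
```

Because the vision model only ever returns an integer ID, it never has to produce pixel coordinates itself, which is what makes the single annotated pass reliable.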
Communication Flow
How components interact during task execution
System Architecture
How the components communicate and work together
Single-Pass Vision Pipeline
Efficient visual processing with caching for multi-step tasks
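One plausible way to cache a single vision pass across a multi-step task is to key the element-ID mapping by a hash of the screenshot, so the expensive detect-and-map pass runs once per unique screen state. This sketch is an assumption about how such caching could work, not ASTR's implementation; `VisionCache` and `mapper` are hypothetical names.

```python
import hashlib

class VisionCache:
    """Cache one detect+map pass per unique screenshot (assumed design)."""

    def __init__(self, mapper):
        self._mapper = mapper   # callable: screenshot bytes -> {name: mark_id}
        self._cache = {}

    def mappings(self, screenshot: bytes) -> dict:
        key = hashlib.sha256(screenshot).hexdigest()
        if key not in self._cache:
            # Only invoke the vision model for unseen screen states.
            self._cache[key] = self._mapper(screenshot)
        return self._cache[key]
```

On a static screen, later steps in the plan then cost a dictionary lookup instead of another 3-5s vision pass.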
Execution Modes
Two strategies for different scenarios
Vision Mode
Screenshot-based automation using FastSAM + Gemini Vision
✓ Works with any application
✓ No pre-configuration needed
✓ Adapts to UI changes
Direct Mode
Windows UI Automation for known applications
✓ Faster execution
✓ More reliable
✓ Direct element access
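A natural way to combine the two modes is a dispatcher that prefers Direct (UI Automation) for applications with known element trees and falls back to Vision otherwise. The registry and function below are purely illustrative assumptions; the page does not specify how ASTR chooses between modes.

```python
# Illustrative registry of apps with known UI Automation trees.
KNOWN_APPS = {"notepad", "explorer"}

def pick_mode(app_name: str) -> str:
    """Prefer the faster, more reliable Direct mode when available."""
    return "direct" if app_name.lower() in KNOWN_APPS else "vision"
```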
Planner Output Format
Structured JSON execution plans
$ cat planner_output.json
{
  "mode": "general",
  "sequence": [
    { "order": 1, "type": "keyboard", "value": "win", "desc": "Open Start menu" },
    { "order": 2, "type": "keyboard", "value": "chrome", "desc": "Type app name" },
    { "order": 3, "type": "keyboard", "value": "enter", "desc": "Launch" },
    { "order": 4, "type": "visual_click", "target_name": "address_bar", "desc": "Click URL bar" }
  ]
}
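An executor for this format is a loop that sorts steps by `order` and dispatches on `type`. The sketch below uses injected stand-in backends (`press_key`, `visual_click`); a real implementation might drive pyautogui or Windows UI Automation, but those bindings are assumptions, not what ASTR necessarily uses.

```python
def run_plan(plan, press_key, visual_click):
    """Dispatch each plan step to the matching backend, in order."""
    log = []
    for step in sorted(plan["sequence"], key=lambda s: s["order"]):
        if step["type"] == "keyboard":
            press_key(step["value"])          # key press or typed text
        elif step["type"] == "visual_click":
            visual_click(step["target_name"]) # resolved via the Vision Mapper
        log.append(step["desc"])
    return log
```

Keeping the backends injectable is what lets the same plan run under either Vision or Direct mode.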