Discover How It Works

A deep dive into ASTR's Two-Model Pipeline architecture

  • 2 AI models
  • 1-2s plan generation
  • 3-5s vision pass
  • 0.3s per step

Main Pipeline

From natural language command to task completion

Two-Model Architecture

Separating task planning from visual execution for maximum accuracy

01

Planner Model (Gemini Flash Lite)

Understands WHAT to do — converts natural language into structured execution plans.

  • Supports General and FlexiSIGN modes
  • Outputs JSON sequence of steps
  • Handles keyboard and visual click actions
  • ~1-2 seconds response time

02

Vision Mapper (Gemini 2.5 Flash)

Understands WHERE to click — identifies UI elements in annotated screenshots.

  • Uses Set-of-Mark (SoM) technique
  • FastSAM segments the screenshot into candidate UI elements
  • Returns element ID mappings
  • Single-pass for efficiency
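The mapping step listed above can be sketched in a few lines of Python. This is an illustration under assumptions, not ASTR's actual code: the `Box` type, the reading-order numbering, and the shape of the model reply (`{target_name: mark_id}`) are all hypothetical, and the detector and model calls are replaced by hard-coded data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Bounding box of one detected UI element, in screen pixels."""
    x1: int
    y1: int
    x2: int
    y2: int

    def center(self) -> tuple[int, int]:
        return ((self.x1 + self.x2) // 2, (self.y1 + self.y2) // 2)


def assign_marks(boxes: list[Box]) -> dict[int, Box]:
    """Give each box a numeric mark ID in reading order (top to bottom, left to right)."""
    ordered = sorted(boxes, key=lambda b: (b.y1, b.x1))
    return {mark_id: box for mark_id, box in enumerate(ordered, start=1)}


def resolve_targets(marks: dict[int, Box], mapping: dict[str, int]) -> dict[str, tuple[int, int]]:
    """Turn a {target_name: mark_id} mapping into click points, skipping unknown IDs."""
    return {
        name: marks[mark_id].center()
        for name, mark_id in mapping.items()
        if mark_id in marks
    }


# Hypothetical detector output (in ASTR this role is played by FastSAM).
detected = [Box(400, 50, 1200, 90), Box(10, 50, 50, 90), Box(10, 200, 300, 240)]
marks = assign_marks(detected)

# Hypothetical vision-model reply: which mark corresponds to which named target.
model_reply = {"address_bar": 2, "missing_button": 99}
clicks = resolve_targets(marks, model_reply)
print(clicks)  # {'address_bar': (800, 70)}
```

Numbering the boxes before the model sees them means the model only has to return small integers, which are then translated back to pixel coordinates locally.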

Communication Flow

How components interact during task execution

System Architecture

How the components communicate and work together

Single-Pass Vision Pipeline

Efficient visual processing with caching for multi-step tasks
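One way to picture the caching idea is a small lookup table that is filled by a single vision pass and then reused by later steps. The sketch below is a guess at the shape of such a cache: `VisionCache`, `prime`, `lookup`, and `invalidate` are hypothetical names, the invalidation policy is an assumption, and the vision pass itself is a stand-in function.

```python
from typing import Callable

Coords = tuple[int, int]


class VisionCache:
    """Caches target coordinates so one vision pass can serve several plan steps."""

    def __init__(self, vision_pass: Callable[[list[str]], dict[str, Coords]]):
        self._vision_pass = vision_pass  # hypothetical: screenshot + detection + model call
        self._coords: dict[str, Coords] = {}
        self.passes = 0                  # number of (slow) vision passes run so far

    def prime(self, target_names: list[str]) -> None:
        """Resolve every target that is not cached yet in a single vision pass."""
        missing = [name for name in target_names if name not in self._coords]
        if missing:
            self._coords.update(self._vision_pass(missing))
            self.passes += 1

    def lookup(self, target_name: str) -> Coords:
        """Return cached coordinates, running a new pass only on a cache miss."""
        if target_name not in self._coords:
            self.prime([target_name])
        return self._coords[target_name]

    def invalidate(self) -> None:
        """Drop all cached coordinates, e.g. after the screen changes (assumed policy)."""
        self._coords.clear()


def fake_vision_pass(names: list[str]) -> dict[str, Coords]:
    # Stand-in for the real pipeline; returns made-up coordinates.
    return {name: (100 * i, 50) for i, name in enumerate(names, start=1)}


cache = VisionCache(fake_vision_pass)
cache.prime(["address_bar", "search_button"])   # one pass for both targets
first = cache.lookup("address_bar")             # served from the cache
second = cache.lookup("search_button")          # served from the cache
print(first, second, cache.passes)              # (100, 50) (200, 50) 1
```

With the quoted figures of 3-5s per vision pass, resolving all targets up front avoids paying that cost again on every click step.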

Execution Modes

Two strategies for different scenarios

Vision Mode

Screenshot-based automation using FastSAM + Gemini Vision

✓ Works with any application

✓ No pre-configuration needed

✓ Adapts to UI changes

Direct Mode

Windows UI Automation for known applications

✓ Faster execution

✓ More reliable

✓ Direct element access
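Choosing between the two modes could be as simple as checking whether the target application is on a list of known applications. The following is a minimal sketch under that assumption: `ExecutionMode`, `KNOWN_APPS`, and `choose_mode` are hypothetical names, and the registry entries are placeholders rather than applications ASTR is documented to support.

```python
from enum import Enum


class ExecutionMode(Enum):
    VISION = "vision"  # screenshot-based; works with any application
    DIRECT = "direct"  # Windows UI Automation; for known applications


# Hypothetical registry of applications with Direct Mode support.
KNOWN_APPS = frozenset({"notepad", "calculator"})


def choose_mode(app_name: str, known_apps: frozenset = KNOWN_APPS) -> ExecutionMode:
    """Prefer Direct Mode for known applications, otherwise fall back to Vision Mode."""
    if app_name.strip().lower() in known_apps:
        return ExecutionMode.DIRECT
    return ExecutionMode.VISION


print(choose_mode("Notepad"))           # ExecutionMode.DIRECT
print(choose_mode("some_unknown_app"))  # ExecutionMode.VISION
```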

Planner Output Format

Structured JSON execution plans

planner_output.json
{
  "mode": "general",
  "sequence": [
    {
      "order": 1, "type": "keyboard", "value": "win", "desc": "Open Start menu"
    },
    {
      "order": 2, "type": "keyboard", "value": "chrome", "desc": "Type app name"
    },
    {
      "order": 3, "type": "keyboard", "value": "enter", "desc": "Launch"
    },
    {
      "order": 4, "type": "visual_click", "target_name": "address_bar", "desc": "Click URL bar"
    }
  ]
}
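A consumer of this format might parse and validate the plan before executing it. The sketch below assumes only the fields visible in the sample above (`mode`, `sequence`, `order`, `type`, `value`, `target_name`, `desc`); the `Step` class, `parse_plan`, and the validation rules are illustrative and not taken from ASTR itself.

```python
import json
from dataclasses import dataclass
from typing import Optional

ALLOWED_TYPES = {"keyboard", "visual_click"}


@dataclass(frozen=True)
class Step:
    order: int
    type: str
    desc: str
    value: Optional[str] = None        # used by keyboard steps
    target_name: Optional[str] = None  # used by visual_click steps


def parse_plan(raw: str) -> tuple[str, list[Step]]:
    """Parse planner JSON into (mode, steps sorted by order).

    Raises ValueError for an unknown step type, a keyboard step without
    'value', or a visual_click step without 'target_name'.
    """
    data = json.loads(raw)
    steps = []
    for item in data["sequence"]:
        step = Step(**item)
        if step.type not in ALLOWED_TYPES:
            raise ValueError(f"unknown step type: {step.type!r}")
        if step.type == "keyboard" and step.value is None:
            raise ValueError(f"step {step.order}: keyboard step needs 'value'")
        if step.type == "visual_click" and step.target_name is None:
            raise ValueError(f"step {step.order}: visual_click step needs 'target_name'")
        steps.append(step)
    return data["mode"], sorted(steps, key=lambda s: s.order)


raw_plan = """
{
  "mode": "general",
  "sequence": [
    {"order": 2, "type": "visual_click", "target_name": "address_bar", "desc": "Click URL bar"},
    {"order": 1, "type": "keyboard", "value": "enter", "desc": "Launch"}
  ]
}
"""

mode, steps = parse_plan(raw_plan)
visual_targets = [s.target_name for s in steps if s.type == "visual_click"]
print(mode, [s.order for s in steps], visual_targets)  # general [1, 2] ['address_bar']
```

Collecting the `visual_click` targets up front is what would let a single vision pass resolve all of them at once.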