A deep dive into ASTR's Two-Model Pipeline architecture
- 2 AI Models
- 1-2s Plan Generation
- 3-5s Vision Pass
- 0.3s Per Step
Main Pipeline
From natural language command to task completion
Two-Model Architecture
Separating task planning from visual execution for maximum accuracy
Planner Model (Gemini Flash Lite)
Understands WHAT to do — converts natural language into structured execution plans.
- Supports General and FlexiSIGN modes
- Outputs a JSON sequence of steps
- Handles keyboard and visual click actions
- ~1-2 seconds response time
Vision Mapper (Gemini 2.5 Flash)
Understands WHERE to click — identifies UI elements in annotated screenshots.
- Uses the Set-of-Mark (SoM) technique
- FastSAM detects all UI elements
- Returns element ID mappings
- Single-pass for efficiency
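The SoM step reduces to a lookup: each FastSAM bounding box gets a numeric mark drawn on the screenshot, the vision model answers with a mark ID, and the click lands at that box's center. This is a minimal sketch of that lookup under assumed data shapes (boxes as `(x1, y1, x2, y2)` tuples); the helper names are illustrative, not ASTR's API.

```python
# Assumed shape: FastSAM yields (x1, y1, x2, y2) bounding boxes.

def mark_boxes(boxes):
    """Assign Set-of-Mark IDs to detected boxes: {mark_id: box}."""
    return {i: box for i, box in enumerate(boxes, start=1)}

def click_point(marked, element_id):
    """Center of the box the vision model selected by mark ID."""
    x1, y1, x2, y2 = marked[element_id]
    return ((x1 + x2) // 2, (y1 + y2) // 2)
```

Because the vision model only ever returns an integer ID, it never has to produce pixel coordinates itself, which is what makes the single annotated pass reliable.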
Communication Flow
How components interact during task execution
System Architecture
How the components communicate and work together
Single-Pass Vision Pipeline
Efficient visual processing with caching for multi-step tasks
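One plausible way to cache a single vision pass across a multi-step task is to key the element-ID mapping by a hash of the screenshot, so the expensive detect-and-map pass runs once per unique screen state. This sketch is an assumption about how such caching could work, not ASTR's implementation; `VisionCache` and `mapper` are hypothetical names.

```python
import hashlib

class VisionCache:
    """Cache one detect+map pass per unique screenshot (assumed design)."""

    def __init__(self, mapper):
        self._mapper = mapper   # callable: screenshot bytes -> {name: mark_id}
        self._cache = {}

    def mappings(self, screenshot: bytes) -> dict:
        key = hashlib.sha256(screenshot).hexdigest()
        if key not in self._cache:
            # Only invoke the vision model for unseen screen states.
            self._cache[key] = self._mapper(screenshot)
        return self._cache[key]
```

On a static screen, later steps in the plan then cost a dictionary lookup instead of another 3-5s vision pass.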
Execution Modes
Two strategies for different scenarios
Vision Mode
Screenshot-based automation using FastSAM + Gemini Vision
✓ Works with any application
✓ No pre-configuration needed
✓ Adapts to UI changes
Direct Mode
Windows UI Automation for known applications
✓ Faster execution
✓ More reliable
✓ Direct element access
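A natural way to combine the two modes is a dispatcher that prefers Direct (UI Automation) for applications with known element trees and falls back to Vision otherwise. The registry and function below are purely illustrative assumptions; the page does not specify how ASTR chooses between modes.

```python
# Illustrative registry of apps with known UI Automation trees.
KNOWN_APPS = {"notepad", "explorer"}

def pick_mode(app_name: str) -> str:
    """Prefer the faster, more reliable Direct mode when available."""
    return "direct" if app_name.lower() in KNOWN_APPS else "vision"
```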
Planner Output Format
Structured JSON execution plans
$ cat planner_output.json
{
  "mode": "general",
  "sequence": [
    { "order": 1, "type": "keyboard", "value": "win", "desc": "Open Start menu" },
    { "order": 2, "type": "keyboard", "value": "chrome", "desc": "Type app name" },
    { "order": 3, "type": "keyboard", "value": "enter", "desc": "Launch" },
    { "order": 4, "type": "visual_click", "target_name": "address_bar", "desc": "Click URL bar" }
  ]
}
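An executor for this format is a loop that sorts steps by `order` and dispatches on `type`. The sketch below uses injected stand-in backends (`press_key`, `visual_click`); a real implementation might drive pyautogui or Windows UI Automation, but those bindings are assumptions, not what ASTR necessarily uses.

```python
def run_plan(plan, press_key, visual_click):
    """Dispatch each plan step to the matching backend, in order."""
    log = []
    for step in sorted(plan["sequence"], key=lambda s: s["order"]):
        if step["type"] == "keyboard":
            press_key(step["value"])          # key press or typed text
        elif step["type"] == "visual_click":
            visual_click(step["target_name"]) # resolved via the Vision Mapper
        log.append(step["desc"])
    return log
```

Keeping the backends injectable is what lets the same plan run under either Vision or Direct mode.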