SmudgeAI is a Windows-native autonomous desktop AI agent that understands screen context through vision models and UI element trees, enabling it to navigate any application and execute complex multi-step workflows with minimal human intervention. It leverages LLM-based reasoning to adapt to unknown interfaces while maintaining security through built-in permission systems and safeguards.
- Overview
- Architecture
- Core Components
- AI Engine & Multimodal Processing
- Desktop Automation Stack
- UI Detection & Element Matching
- Security & Safety Systems
- Error Handling & Recovery
- Multi-Monitor & DPI Scaling Support
- Internationalization & Localization
- Logging & Observability
- Capabilities Summary
- Comparison with OpenClaw
- Bug Fixes & Security Hardening
- Getting Started
- Configuration
- Roadmap
SmudgeAI is designed to be a fully autonomous desktop control agent that can:
- Understand screen context through vision models and UI element trees
- Navigate any Windows application using UIA-based element discovery
- Execute multi-step workflows with verification and rollback
- Adapt to unknown interfaces through LLM-based reasoning
- Operate safely with permission systems and safeguards
Design principles:
- Reliability First - Every action is verified before proceeding
- Security by Default - Permission systems and input sanitization on all dangerous operations
- Cross-UI Adaptability - Works with any Windows application through UIA and CV fallback
- Production Ready - Structured logging, error recovery, and observability
┌─────────────────────────────────────────────────────────────────────────────┐
│ SmudgeAI Architecture │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────────────────────────────────────┐ │
│ │ GUI │────▶│ AI Engine │ │
│ │ (PyQt5) │ │ ┌─────────────┐ ┌─────────────────────┐ │ │
│ │ │ │ │ Groq Client │ │ Gemini Client │ │ │
│ │ - Input │ │ │ (Primary) │ │ (Fallback) │ │ │
│ │ - Display │ │ └─────────────┘ └─────────────────────┘ │ │
│ │ - Status │ │ ┌─────────────────────────────────────┐ │ │
│ └──────────────┘ │ │ Rate Limiter & Model Cycling │ │ │
│ │ │ └─────────────────────────────────────┘ │ │
│ │ │ ┌─────────────────────────────────────┐ │ │
│ ▼ │ │ Conversation History Manager │ │ │
│ ┌──────────────────────────────────────────────────────────────┐ │ │
│ │ Task Manager │ │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────────────┐ │ │ │
│ │ │ Permission │ │ Tool │ │ Error │ │ │ │
│ │ │ System │ │ Registry │ │ Classifier │ │ │ │
│ │ └────────────┘ └────────────┘ └────────────────────┘ │ │ │
│ └──────────────────────────────────────────────────────────────┘ │ │
│ │ │
│ ┌────────────────────┼────────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────┐ │
│ │ Desktop │ │ CV/UI │ │ Local VLM │ │
│ │ State │ │ Integration │ │ (Optional) │ │
│ │ (UIA) │ │ │ │ │ │
│ └─────────────┘ └─────────────┘ └─────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Action │ │
│ │ Execution │ │
│ │ (pyautogui) │ │
│ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
| Component | Responsibility | Language/Framework |
|---|---|---|
| GUI | User input, status display, permission dialogs | PyQt5 |
| AI Engine | LLM orchestration, vision analysis, tool calling | Python (async) |
| Task Manager | Tool execution, permission checks, caching | Python |
| Desktop State | UIA element tree, window management | pywinauto + pygetwindow |
| CV/UI Integration | Template matching, LLM coordinate verification | OpenCV + PIL |
| Error Handler | Error classification, retry strategies | Python |
| Structured Logging | Correlation IDs, action tracking | Python (custom) |
The AI Engine is the brain of SmudgeAI, orchestrating all LLM interactions.
- Multi-Provider Support: Groq (primary), Google Gemini (fallback)
- Vision Analysis: Llama 3.2 Vision (Groq) → Gemini 1.5 Flash fallback
- Model Cycling: Automatic failover when rate limits hit
- Rate Limiting: Built-in rate limiter with exponential backoff
- Conversation History: Properly serialized message history (dict-based)
- Tool Schema Generation: Dynamic tool schema from Python functions
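As a rough illustration of dynamic schema generation, the sketch below derives an OpenAI-compatible tool description from a function's signature and docstring. `build_tool_schema`, `_JSON_TYPES`, and the sample `type_text` stub are hypothetical names used for illustration, not SmudgeAI's actual code, and the type mapping covers only a few primitives.

```python
import inspect
from typing import Any, Callable, Dict, get_type_hints

# Hypothetical mapping from Python annotations to JSON Schema type names.
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def build_tool_schema(func: Callable[..., Any]) -> Dict[str, Any]:
    """Derive a function-calling schema from a plain Python function."""
    hints = get_type_hints(func)
    properties: Dict[str, Any] = {}
    required = []
    for name, param in inspect.signature(func).parameters.items():
        # Unannotated or unmapped parameters fall back to "string".
        json_type = _JSON_TYPES.get(hints.get(name, str), "string")
        properties[name] = {"type": json_type}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "type": "function",
        "function": {
            "name": func.__name__,
            "description": inspect.getdoc(func) or "",
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


def type_text(text: str, interval: float = 0.05) -> str:
    """Type the given text at the current cursor position."""
    return f"typed {len(text)} characters"


schema = build_tool_schema(type_text)
```

Parameters without a default end up in `required`, so the model is told that `text` is mandatory while `interval` is optional.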
class RateLimiter:
    requests_in_window: int     # Requests in current window
    window_size: float = 60.0   # 60-second window
    max_requests: int = 30      # Max 30 requests per window
    blocked_until: float        # Timestamp when block expires
    consecutive_errors: int     # Track consecutive failures
Backoff Strategy:
- Base delay: 30 seconds
- Exponential: 2^consecutive_errors
- Jitter: Random 0-10 seconds
- Max block: 300 seconds (5 minutes)
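One way the four backoff parameters could combine is shown below. This is a sketch under an assumption the list does not state: that the base delay is multiplied by the exponential factor, jitter is added afterwards, and the total is capped at the maximum block. `compute_block_seconds` is a hypothetical helper name.

```python
import random
from typing import Optional

BASE_DELAY = 30.0   # seconds
MAX_JITTER = 10.0   # seconds
MAX_BLOCK = 300.0   # seconds (5 minutes)


def compute_block_seconds(consecutive_errors: int,
                          jitter: Optional[float] = None) -> float:
    """Return how long to block after a run of consecutive failures."""
    if jitter is None:
        jitter = random.uniform(0.0, MAX_JITTER)
    # Clamp the exponent so a very long error streak cannot overflow a float.
    exponent = min(max(consecutive_errors, 0), 10)
    # Assumed combination: base * 2^errors + jitter, capped at MAX_BLOCK.
    return min(BASE_DELAY * (2 ** exponent) + jitter, MAX_BLOCK)
```

Passing `jitter` explicitly keeps the function deterministic for testing; with the default, each call draws a fresh random offset.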
Screenshot → Groq Llama 3.2 Vision → JSON Elements → Coordinate Verification → Click
↓ (fallback)
Gemini 1.5 Flash Vision
Desktop State captures and maintains the UI element hierarchy of all windows.
Previously, pywinauto initialization caused race conditions. Now:
def _ensure_pywinauto_com_init():
    import pythoncom
    # CoInitializeEx takes only the threading-model flag
    pythoncom.CoInitializeEx(pythoncom.COINIT_MULTITHREADED)
Supported element types:
- WINDOW, BUTTON, EDIT, MENU, MENU_ITEM
- TAB, CHECKBOX, RADIO_BUTTON, COMBOBOX
- LIST, LIST_ITEM, TEXT, UNKNOWN
from dataclasses import dataclass
from typing import List
@dataclass
class UIElement:
    title: str
    element_type: ElementType
    rect: tuple          # (x, y, width, height)
    automation_id: str   # UIA AutomationId
    class_name: str      # Window class name
    is_visible: bool
    is_enabled: bool
    children: List["UIElement"]  # Quoted because the class refers to itself
@dataclass
class WindowInfo:
    title: str
    process_name: str
    rect: tuple
    is_active: bool
    elements: List[UIElement]
CV/UI Integration provides hybrid UI detection, combining UIA accessibility with computer vision.
- UIA Detection: Uses Windows Accessibility API for precise element location
- Template Matching: OpenCV-based image matching for visual elements
- LLM Vision: Groq/Gemini vision for complex UI interpretation
- Coordinate Verification: All LLM coordinates verified against UIA elements
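The coordinate check in the last bullet comes down to a point-in-rectangle test against each UIA element. A minimal sketch, assuming rectangles use the `(x, y, width, height)` layout of `UIElement.rect`; the function names are illustrative rather than taken from the codebase.

```python
from typing import Iterable, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height)


def point_in_rect(x: int, y: int, rect: Rect) -> bool:
    """True if (x, y) lies inside rect; right and bottom edges are exclusive."""
    left, top, width, height = rect
    return left <= x < left + width and top <= y < top + height


def is_coord_in_any_bounds(x: int, y: int, rects: Iterable[Rect]) -> bool:
    """True if the point falls inside at least one element rectangle."""
    return any(point_in_rect(x, y, rect) for rect in rects)
```

A coordinate proposed by the vision model that lands in no rectangle is a candidate hallucination and can be down-weighted rather than clicked blindly.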
ScreenHelper handles multi-monitor and DPI scaling:
class ScreenHelper:
    _dpi_scale: float     # System DPI (1.0 = 100%)
    _monitor_info: Dict   # Per-monitor bounds
    def adjust_coords_for_monitor(self, x, y, window_rect):
        # Offset coordinates for secondary monitors
        ...
    def scale_screenshot_for_dpi(self, screenshot_path):
        # Scale screenshot to match actual pixel coordinates
        ...
RobustClicker retries clicks with backoff and a circuit breaker:
class RobustClicker:
    _max_retries: int = 3
    _circuit_breaker_max_iterations: int = 10
    _circuit_breaker_time_budget: float = 30.0  # seconds
    # Exponential backoff: 0.5s → 1.0s → 2.0s (capped at 5s)
    def _get_retry_delay(self, attempt) -> float:
        ...
The Task Manager executes tools and actions requested by the AI Engine.
DANGEROUS_ACTIONS_REGEX patterns detect:
- File deletion: `delete`, `remove`
- System commands: `shutdown`, `restart`, `kill`
- Registry modifications: `reg delete`, `reg add`
- Shell execution: `exec`, `cmd`, `powershell`
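To make the two-part matching concrete, here is a small sketch in which a request is flagged only when both an action pattern and a target pattern match. The three patterns are an illustrative subset, and the tool names `restart_pc`-style entries and `modify_registry` in particular are assumptions, as is the `find_dangerous_action` helper itself.

```python
import re
from typing import List, Optional, Tuple

# Illustrative subset: (action regex, target regex, tool name).
DANGEROUS_PATTERNS: List[Tuple[str, str, str]] = [
    (r"\b(delete|remove)\b", r"\b(file|folder|directory)\b", "delete_file"),
    (r"\bshutdown\b", r".*", "shutdown_pc"),
    (r"\breg\s+(delete|add)\b", r".*", "modify_registry"),  # hypothetical name
]


def find_dangerous_action(request: str) -> Optional[str]:
    """Return the matching tool name, or None if the request looks safe."""
    text = request.lower()
    for action_re, target_re, tool in DANGEROUS_PATTERNS:
        if re.search(action_re, text) and re.search(target_re, text):
            return tool
    return None
```

Requiring both halves to match keeps a phrase such as "remove the highlight" from triggering the file-deletion rule.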
class PermissionSystem:
    DANGEROUS_PATTERNS = [
        (r"delete", r"(file|folder|directory)", "delete_file", ...),
        (r"shutdown", r".*", "shutdown_pc", ...),
        # ... 17 total patterns
    ]
    SYSTEM_DIRS = [
        r"C:\Windows", r"C:\Program Files",
        r"C:\System32", r"C:\Boot", r"C:\Recovery"
    ]
Command Cache:
- Expiry: 24 hours
- Max Size: 50 entries
- Validation: Schema validation, sensitive data redaction
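The expiry and size limits above could be enforced with an ordered dictionary, as in the sketch below. The `CommandCache` class and its injectable `clock` parameter are hypothetical, and evicting the oldest entry first is an assumption; the document does not say which eviction policy is used.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class CommandCache:
    """Bounded cache whose entries expire after a fixed time-to-live."""

    def __init__(self, max_entries: int = 50,
                 ttl_seconds: float = 24 * 3600,
                 clock: Callable[[], float] = time.time) -> None:
        self._entries: "OrderedDict[str, tuple]" = OrderedDict()
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._clock = clock

    def put(self, key: str, value: Any) -> None:
        self._entries.pop(key, None)          # re-inserting refreshes position
        self._entries[key] = (self._clock(), value)
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the oldest entry

    def get(self, key: str) -> Optional[Any]:
        item = self._entries.get(key)
        if item is None:
            return None
        stored_at, value = item
        if self._clock() - stored_at > self._ttl:
            del self._entries[key]             # expired: drop and report a miss
            return None
        return value
```

Injecting the clock makes expiry testable without waiting 24 hours.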
The Error Handler provides structured error classification and recovery.
| Category | Description | Action |
|---|---|---|
| RETRYABLE | Transient failures (timeout, network) | Retry with backoff |
| FATAL | Unrecoverable errors | Log and escalate |
| ESCALATION | Requires human intervention | Block and alert |
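The table's mapping from category to action can be sketched as a keyword classifier plus a lookup. The keyword lists are assumptions for illustration (the retryable ones echo terms used elsewhere in this document), and `classify_error` and `action_for` are hypothetical helper names.

```python
from enum import Enum


class ErrorCategory(Enum):
    RETRYABLE = "retryable"
    FATAL = "fatal"
    ESCALATION = "escalation"


# Hypothetical keyword hints; a real classifier would likely use richer rules.
ESCALATION_HINTS = ("permission denied", "access denied", "confirmation required")
RETRYABLE_HINTS = ("timeout", "timed out", "rate limit", "429", "connection", "network")

ACTIONS = {
    ErrorCategory.RETRYABLE: "Retry with backoff",
    ErrorCategory.FATAL: "Log and escalate",
    ErrorCategory.ESCALATION: "Block and alert",
}


def classify_error(message: str) -> ErrorCategory:
    """Map an error message to a category; unknown errors are treated as fatal."""
    text = message.lower()
    if any(hint in text for hint in ESCALATION_HINTS):
        return ErrorCategory.ESCALATION
    if any(hint in text for hint in RETRYABLE_HINTS):
        return ErrorCategory.RETRYABLE
    return ErrorCategory.FATAL


def action_for(message: str) -> str:
    """Return the recovery action from the table for a given error message."""
    return ACTIONS[classify_error(message)]
```

Checking escalation hints first means an error that needs a human is never retried automatically, even if it also mentions a timeout.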
class RetryStrategy:
    max_retries: int = 3
    base_delay: float = 0.5
    max_delay: float = 10.0
    exponential_base: float = 2.0
    jitter: bool = True
User Request: "Click the Save button"
┌─────────────────────────────────────────────────────────────┐
│ 1. Screenshot Capture │
│ - pyautogui.screenshot() │
│ - Scale for DPI if needed │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Element Detection │
│ a) UIA Lookup (if window context available) │
│ b) Template Matching (OpenCV) │
│ c) LLM Vision Analysis │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Coordinate Verification │
│ - Check if LLM coords fall within any UIA element │
│ - Penalize confidence if unverified │
│ - Filter elements below 0.3 confidence threshold │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 4. Action Execution │
│ - pyautogui.click(x, y) │
│ - Adjust for monitor offset/DPI │
│ - Verify state change post-click │
└─────────────────────────────────────────────────────────────┘
- Text + Tools: Groq `llama-3.3-70b-versatile` (primary)
- Vision: Groq `llama-3.2-11b-vision-preview` (primary)
- Fallback: Gemini 1.5 Flash for both
All messages stored as plain dictionaries to prevent type confusion:
def _message_to_dict(msg) -> dict:
    return {
        "role": getattr(msg, "role", "assistant"),
        "content": getattr(msg, "content", None),
        "tool_calls": getattr(msg, "tool_calls", None),
        "tool_call_id": getattr(msg, "tool_call_id", None)
    }
AI Decision → Tool Call → Permission Check → Action → Verification → State Update
Permission Check:
├─ Dangerous pattern detected?
├─ System directory access?
├─ Path traversal attempt?
└─ Callback to user if needed
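The two path-related checks in this tree might look like the sketch below. It uses `ntpath` so that Windows path rules apply on any platform, and it is purely lexical: unlike `os.path.realpath`, it does not resolve symlinks or junctions. `check_path` and the shortened `SYSTEM_DIRS` list are illustrative assumptions.

```python
import ntpath
from typing import Optional

# Shortened, illustrative list of protected roots.
SYSTEM_DIRS = [r"C:\Windows", r"C:\Program Files"]


def has_traversal(path: str) -> bool:
    """True if any path component is exactly '..'."""
    return ".." in path.replace("/", "\\").split("\\")


def is_system_directory(path: str) -> bool:
    """True if the normalized path is a protected root or lies beneath one."""
    normalized = ntpath.normcase(ntpath.normpath(path))
    for root in SYSTEM_DIRS:
        root_norm = ntpath.normcase(root)
        # Require a separator after the root so C:\WindowsApps is not matched.
        if normalized == root_norm or normalized.startswith(root_norm + "\\"):
            return True
    return False


def check_path(path: str) -> Optional[str]:
    """Return an error message, or None if the path passes both checks."""
    if has_traversal(path):
        return "Error: Path traversal detected"
    if is_system_directory(path):
        return "Error: Access denied to system directory"
    return None
```

Comparing whole components rather than substrings avoids rejecting harmless names such as `notes..txt`.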
| Category | Actions |
|---|---|
| Application | open_application, close_application, switch_to_window |
| Navigation | click_element, double_click, right_click, hover |
| Text Input | type_text, press_key, hotkey, select_all |
| File System | create_file, read_file, delete_file, list_directory |
| Clipboard | read_clipboard, write_clipboard |
| System | shutdown_pc, restart_pc, get_system_info |
| Web | search_google, open_url, get_page_text |
| Office | send_email, create_document |
Every click action is verified by comparing before/after state:
async def _execute_click_with_verification(element, x, y):
    # x, y: screen coordinates of the element's click target
    pre_state = desktop_state.get_state_summary()
    pyautogui.click(x, y)
    await asyncio.sleep(0.2)  # Give the UI time to react
    post_state = desktop_state.get_state_summary()
    return pre_state != post_state or _verify_element_exists(element)
Detection strategies, in priority order:
- UIA (Windows Accessibility): Fastest, most reliable for standard Windows apps
- Template Matching (OpenCV): Good for custom/non-standard UI
- LLM Vision: Last resort for complex or dynamic UIs
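The priority order above amounts to a fallback chain: try each detector in turn and stop at the first one that returns a location. In this sketch the three detectors are stand-in stubs with made-up coordinates; real ones would query UIA, OpenCV, and a vision model. All names here are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Point = Tuple[int, int]
Detector = Callable[[str], Optional[Point]]


def locate_element(description: str,
                   detectors: List[Tuple[str, Detector]]
                   ) -> Optional[Tuple[str, Point]]:
    """Try each detector in priority order and return the first hit."""
    for name, detect in detectors:
        point = detect(description)
        if point is not None:
            return name, point
    return None


# Stand-in detectors for illustration only.
def uia_lookup(description: str) -> Optional[Point]:
    return (412, 88) if description == "Save" else None


def template_match(description: str) -> Optional[Point]:
    return (40, 600) if description == "Logo" else None


def llm_vision(description: str) -> Optional[Point]:
    return (10, 10)  # pretend the model always proposes something


CHAIN = [("uia", uia_lookup), ("template", template_match), ("vision", llm_vision)]
```

Returning the detector name alongside the point lets the caller apply stricter verification to results that came from the vision model.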
UI element matching supports localized button text:
_LOCALIZED_UI_TERMS = {
    "de_DE": {"save": ["Speichern"], "open": ["Öffnen"], ...},
    "fr_FR": {"save": ["Enregistrer"], "open": ["Ouvrir"], ...},
    "ja_JP": {"save": ["保存"], "open": ["開く"], ...},
    # ... 6 languages supported
}
LLM-provided coordinates are verified against UIA elements:
def _verify_llm_coordinates(elements):
    for elem in elements:
        if _is_coord_in_any_element_bounds(elem):
            elem.confidence += 0.3  # Boost verified coords
        else:
            elem.confidence -= 0.4  # Penalize hallucinated coords
    return [e for e in elements if e.confidence > 0.3]
All dangerous operations require explicit permission:
DANGEROUS_PATTERNS = [
    (r"delete", r"(file|folder)", "delete_file"),
    (r"shutdown", r".*", "shutdown_pc"),
    (r"exec", r"(shell|cmd|powershell)", "execute_shell"),
    # ... 17 patterns total
]
Built-in Safety:
- SAFE_MODE flag for human-in-the-loop
- System directory protection
- Path traversal detection
- Confirmation dialogs for destructive actions
Clipboard Injection Prevention:
PROMPT_INJECTION_PATTERNS = [
    r"(?i)ignore\s+(previous|all|your)\s+instructions",
    r"(?i)jailbreak",
    r"(?i)\bDAN\s+mode",
    # ... 15+ patterns
]
SKILL.md Template Injection Prevention:
import re
def _sanitize_skill_content(content):
    # Remove {{...}}, {%...%}, <script>, javascript:
    # (only the {{...}} and <script> cases are shown here)
    sanitized = re.sub(r'\{\{[^}]+\}\}', '[TEMPLATE_REMOVED]', content)
    sanitized = re.sub(r'(?is)<script[^>]*>.*?</script>', '[SCRIPT_REMOVED]', sanitized)
    return sanitized
Path Traversal Prevention:
import os
def create_file(path):
    # Check the raw path first: realpath() resolves ".." and would hide it
    if ".." in path.replace("/", os.sep).split(os.sep):
        return "Error: Path traversal detected"
    real_path = os.path.realpath(path)
    if is_system_directory(real_path):
        return "Error: Access denied to system directory"
    # ... proceed with file creation
Safe Deserialization:
def _validate_cache_entry(entry):
    if not isinstance(entry, dict):
        return False
    for k in ["tool", "args", "result", "timestamp"]:
        if k not in entry:
            return False
    # Redact sensitive keys
    for k in SENSITIVE_CACHE_KEYS:
        if k in entry:
            entry[k] = "***REDACTED***"
    return True
API Key Validation:
ALLOWED_API_KEY_PREFIXES = {
    "GROQ_API_KEY": ["gsk_"],
    "GOOGLE_API_KEY": ["AIza"],
}
def _validate_api_key(key_name, value):
    prefixes = ALLOWED_API_KEY_PREFIXES.get(key_name, [])
    if not any(value.startswith(p) for p in prefixes):
        return False  # Invalid prefix = potential fake key
    return True
Error Classification:
class ErrorClassifier:
    RETRY_PATTERNS = {
        "timeout": {
            "patterns": ["timeout", "timed out"],
            "severity": ErrorSeverity.MEDIUM,
            "suggested_fix": "Retry with exponential backoff"
        },
        "rate_limit": {
            "patterns": ["rate limit", "429"],
            "severity": ErrorSeverity.HIGH,
            "suggested_fix": "Wait and retry with longer backoff"
        },
        "connection": {
            "patterns": ["connection", "network"],
            "severity": ErrorSeverity.MEDIUM,
            "suggested_fix": "Check network, retry after delay"
        }
    }
Circuit Breaker:
import time
class RobustClicker:
    _circuit_breaker_max_iterations = 10
    _circuit_breaker_time_budget = 30.0  # seconds
    def _check_circuit_breaker(self):
        if self._circuit_breaker_iterations >= self._circuit_breaker_max_iterations:
            return False  # Block further attempts
        if time.time() - self._start_time > self._circuit_breaker_time_budget:
            return False  # Time budget exceeded
        return True
Multi-Monitor Support:
class ScreenHelper:
    def get_primary_monitor_offset(self) -> Tuple[int, int]:
        # Returns (x, y) offset of primary monitor
        ...
    def adjust_coords_for_monitor(self, x, y, window_rect):
        # Adjust for windows on secondary monitors
        # pyautogui uses primary monitor coords
        # Secondary monitors may have negative coords
        ...
DPI Scaling:
def _init_dpi(self):
    user32 = ctypes.windll.user32
    user32.SetProcessDPIAware()
    self._dpi_scale = user32.GetDpiForSystem() / 96.0
def scale_screenshot_for_dpi(self, screenshot_path):
    if self._dpi_scale != 1.0:
        img = Image.open(screenshot_path)
        new_size = (int(img.width * self._dpi_scale), ...)
        img_scaled = img.resize(new_size, Image.LANCZOS)
        img_scaled.save(screenshot_path)
Supported Locales:
| Locale | Language | Coverage |
|---|---|---|
| en_US | English | Default (no mapping needed) |
| de_DE | German | 24 common UI terms |
| fr_FR | French | 24 common UI terms |
| es_ES | Spanish | 24 common UI terms |
| ja_JP | Japanese | 24 common UI terms |
| zh_CN | Chinese (Simplified) | 24 common UI terms |
| ko_KR | Korean | 24 common UI terms |
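A lookup like `get_localized_terms` could be backed by per-locale dictionaries, as sketched here. The table below is a tiny illustrative subset rather than the 24-term lists described above, and returning lowercased terms across all locales at once is an assumption about the intended behavior.

```python
from typing import Dict, List

# Illustrative subset only; the real tables are described as 24 terms per locale.
LOCALIZED_UI_TERMS: Dict[str, Dict[str, List[str]]] = {
    "de_DE": {"save": ["Speichern"], "open": ["Öffnen"]},
    "fr_FR": {"save": ["Enregistrer"], "open": ["Ouvrir"]},
    "ja_JP": {"save": ["保存"], "open": ["開く"]},
}


def get_localized_terms(word: str) -> List[str]:
    """Return the word plus its known translations, lowercased and de-duplicated."""
    key = word.lower()
    terms = [key]
    for table in LOCALIZED_UI_TERMS.values():
        for term in table.get(key, []):
            lowered = term.lower()
            if lowered not in terms:
                terms.append(lowered)
    return terms
```

Lowercasing both sides keeps the comparison consistent with a matcher that lowercases the user's description before searching.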
def _match_element(elements, description):
    search_terms = set(description.lower().split())
    # Expand with localized terms; iterate over a copy because the set grows
    for word in list(search_terms):
        localized = get_localized_terms(word)  # e.g. ["Speichern", "Save"]
        search_terms.update(localized)
    # Match against element labels
    for elem in elements:
        if _fuzzy_match(search_terms, elem.label):
            return elem
    return None
Structured Logging:
class CorrelationLogger:
    session_id: str      # Unique per application run
    correlation_id: str  # Unique per action/request
    def log_action(self, action, target, result, duration_ms):
        # Format: [SESSION_ID][CORR_ID][INFO] ACTION: action | target=x result=y duration_ms=z
        ...
    def log_api_call(self, provider, model, success, duration_ms, error):
        # Track API performance and failures
        ...
    def log_tool_execution(self, tool_name, args, success, duration_ms):
        # Monitor tool performance
        ...
with LogContext(user_id="123", action="click"):
    # All logs within this block include user_id=123 and action=click
    click_element("Save")
Capabilities Summary:
| Capability | Implementation | Status |
|---|---|---|
| Open/close applications | subprocess + pyautogui | ✅ |
| Click UI elements | UIA + pyautogui | ✅ |
| Type text | pyautogui | ✅ |
| Read clipboard | pyperclip (sanitized) | ✅ |
| Navigate file system | os/shutil modules | ✅ |
| Web search | Google search API | ✅ |
| Send emails | SMTP | ✅ |
| Handle multi-monitor | ScreenHelper class | ✅ |
| Handle DPI scaling | ctypes DPI detection | ✅ |
| Localized UI matching | 7 languages | ✅ |
| Permission system | 17 dangerous patterns | ✅ |
| Error recovery | Circuit breaker + retry | ✅ |
| Vision analysis | Groq Vision + Gemini | ✅ |
| Coordinate verification | UIA validation | ✅ |
| Prompt injection detection | Regex patterns | ✅ |
| Path traversal prevention | realpath + validation | ✅ |
| Command cache | Validated + redacted | ✅ |
Current limitations:
| Capability | Limitation | Priority |
|---|---|---|
| Browser automation | No Playwright/CDP | High |
| Cross-platform | Windows only | Medium |
| Drag-and-drop verification | Basic implementation | Medium |
| Voice input | Basic STT support | Low |
| Memory persistence | Session only | Medium |
Comparison with OpenClaw:
| Aspect | SmudgeAI | OpenClaw |
|---|---|---|
| UI Understanding | UIA + CV + Vision (hybrid) | CDP + ARIA snapshots |
| Action Execution | pyautogui (coordinate-based) | Playwright (ref-based) |
| Verification | Before/after state diff | Explicit state contracts |
| Cross-platform | Windows only | macOS/Linux/Windows |
| Planning | LLM-based | Attempt-based with compaction |
| Rate Limiting | Built-in circuit breaker | API-level |
| Security | Permission system + injection prevention | Not evaluated |
| Localization | 7 languages | Not evaluated |
Where SmudgeAI is stronger:
- Hybrid Detection: Combines UIA, CV, and Vision for robustness
- Security: Comprehensive permission and injection prevention
- Localization: Native multi-language UI matching
- Error Recovery: Circuit breaker and retry strategies built-in
Where OpenClaw is stronger:
- Browser Automation: OpenClaw's CDP-based approach is more reliable
- Action Verification: OpenClaw's state contracts are more explicit
- Cross-platform: SmudgeAI is Windows-only
Bug fixes:
| Bug | Root Cause | Fix |
|---|---|---|
| COM Threading Conflict | pywinauto initialized before COM | Explicit CoInitializeEx at usage time |
| Hallucinated Coordinates | LLM provided unverified coords | UIA verification + confidence adjustment |
| History Type Mixing | Object/dict type inconsistency | Always serialize to dict |
| Infinite Loops | No circuit breaker | Max iterations + time budget |
| Silent Exception Crashes | No exception handler in Worker | Signal-based error propagation |
Security hardening:
| Vulnerability | Attack Vector | Mitigation |
|---|---|---|
| Prompt Injection | Malicious clipboard content | Regex pattern detection + redaction |
| Template Injection | Malicious SKILL.md | Template tag sanitization |
| API Key Exposure | .env commit | Key prefix validation + masked logging |
| Path Traversal | `../` in file paths | `realpath` validation + forbidden dirs |
| Unsafe Deserialization | Malicious cache JSON | Schema validation + redaction |
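The prompt-injection mitigation in the table pairs detection with redaction. A minimal sketch of that idea, using two of the patterns quoted earlier in this document; the `redact_injections` helper and the `[REDACTED]` placeholder are assumptions, not the project's actual implementation.

```python
import re
from typing import Tuple

# Two of the documented patterns; the full list is described as 15+ entries.
PROMPT_INJECTION_PATTERNS = [
    r"(?i)ignore\s+(previous|all|your)\s+instructions",
    r"(?i)jailbreak",
]


def redact_injections(text: str) -> Tuple[str, bool]:
    """Replace suspicious phrases and report whether anything was flagged."""
    flagged = False
    for pattern in PROMPT_INJECTION_PATTERNS:
        text, count = re.subn(pattern, "[REDACTED]", text)
        flagged = flagged or count > 0
    return text, flagged
```

Returning the flag alongside the cleaned text lets the caller log the attempt or ask the user before passing clipboard content to the model.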
- Windows 10/11
- Python 3.9+
- API Keys (Groq and/or Google Gemini)
# Clone the repository
git clone https://github.com/your-repo/SmudgeAI.git
cd SmudgeAI
# Install dependencies
pip install -r requirements.txt
# Create .env file
echo "GROQ_API_KEY=gsk_xxxxx" > .env
echo "GOOGLE_API_KEY=AIzaxxxxx" >> .env
# Run the application
python main.py
Dependencies (requirements.txt):
pyqt5>=5.15.0
pywinauto>=0.6.8
pyautogui>=0.9.54
opencv-python>=4.8.0
Pillow>=10.0.0
groq>=0.4.0
google-generativeai>=0.3.0
python-dotenv>=1.0.0
pyperclip>=1.8.2
pygetwindow>=0.0.9
# AI Provider
AI_PROVIDER=groq # or "gemini"
# API Keys
GROQ_API_KEY=gsk_xxxxx
GOOGLE_API_KEY=AIzaxxxxx
# Search APIs (Optional)
TAVILY_API_KEY=tvly-xxxxx
SERPER_API_KEY=serper_xxxxx
# Email (Optional)
EMAIL_USER=your_email@gmail.com
EMAIL_PASSWORD=app_specific_password
# Logging
LOG_LEVEL=INFO # DEBUG, INFO, WARNING, ERROR
LOG_FILE=smudgeai.log
# Safety
SAFE_MODE=True # Require confirmation for dangerous actions
Roadmap:
- UIA-based element discovery
- pyautogui action execution
- Vision-based UI analysis
- Multi-monitor support
- DPI scaling handling
- Coordinate verification
- Circuit breaker pattern
- Error classification
- Permission system
- Command caching
- Prompt injection prevention
- Path traversal prevention
- Safe deserialization
- API key validation
- SKILL.md sanitization
- Multi-step workflow execution
- Context compaction for long tasks
- Learning from user corrections
- Cross-application workflows
- Browser automation (Playwright/CDP)
- Cross-platform support (macOS, Linux)
- Voice input/output
- Persistent memory
- Web dashboard for monitoring
MIT License - See LICENSE file for details.
Contributions welcome! Please read CONTRIBUTING.md for guidelines.
Built with ❤️ for Windows desktop automation