Computer Use enables CortexPrism AI agents to interact with graphical user interfaces through screenshots, mouse control, and keyboard input. This allows agents to automate GUI-based tasks, interact with web browsers, and control desktop applications.
The computer use feature provides agents with the ability to:
- See the screen via screenshots
- Control the mouse (move, click, drag, scroll)
- Type using keyboard input and shortcuts
- Interact with any GUI application or web browser
This implementation is based on Anthropic's Computer Use API and provides a standardized interface for desktop automation.
Computer use requires the following system packages:
# Debian/Ubuntu
sudo apt-get install xvfb xdotool scrot x11-utils
# Fedora/RHEL
sudo dnf install xorg-x11-server-Xvfb xdotool scrot xorg-x11-utils
# Arch Linux
sudo pacman -S xorg-server-xvfb xdotool scrot xorg-utils

Package descriptions:
- xvfb - X Virtual Frame Buffer (creates virtual displays)
- xdotool - Command-line X11 automation (mouse and keyboard control)
- scrot - Screenshot utility
- x11-utils - X11 utilities

For a fully isolated environment, use the provided Docker image:
# Build the computer use Docker image
docker build -f docker/computer-use.Dockerfile -t cortex/computer-use .
# Run a container
docker run -d --name cortex-computer cortex/computer-use
# Execute commands inside
docker exec cortex-computer xdotool getmouselocation

The computer tool is available as a built-in tool named computer with the following actions:
// Take a screenshot
{
"action": "screenshot"
}
// Click at coordinates
{
"action": "left_click",
"coordinate": [100, 200]
}
// Type text
{
"action": "type",
"text": "Hello, world!"
}
// Press keyboard shortcut
{
"action": "key",
"text": "ctrl+s"
}
// Scroll
{
"action": "scroll",
"scroll_direction": "down",
"scroll_amount": 5
}

| Action | Description | Required Parameters | Optional Parameters |
|---|---|---|---|
| screenshot | Capture current display | None | None |
| left_click | Left-click at position | None | coordinate |
| right_click | Right-click at position | None | coordinate |
| middle_click | Middle-click at position | None | coordinate |
| double_click | Double-click at position | None | coordinate |
| triple_click | Triple-click at position | None | coordinate |
| mouse_move | Move cursor to position | coordinate | None |
| left_click_drag | Drag from one point to another | coordinate, drag_to | None |
| left_mouse_down | Press left mouse button | None | None |
| left_mouse_up | Release left mouse button | None | None |
| type | Type text string | text | None |
| key | Press key or key combination | text | None |
| hold_key | Hold key for duration | text, duration | None |
| scroll | Scroll in direction | scroll_direction | scroll_amount |
| wait | Pause execution | None | duration |

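The parameter rules in the table can be checked mechanically before an action is dispatched. The following is a minimal sketch, not the project's actual validator: the names `ActionSpec`, `ACTION_SPECS`, and `validateAction` are hypothetical, and the required/optional lists are transcribed directly from the table.

```typescript
// Hypothetical sketch: parameter rules transcribed from the action table.
type ParamName =
  | "coordinate" | "drag_to" | "text" | "duration"
  | "scroll_direction" | "scroll_amount";

interface ActionSpec {
  required: ParamName[];
  optional: ParamName[];
}

const click: ActionSpec = { required: [], optional: ["coordinate"] };
const bare: ActionSpec = { required: [], optional: [] };

const ACTION_SPECS: Record<string, ActionSpec> = {
  screenshot: bare,
  left_click: click,
  right_click: click,
  middle_click: click,
  double_click: click,
  triple_click: click,
  mouse_move: { required: ["coordinate"], optional: [] },
  left_click_drag: { required: ["coordinate", "drag_to"], optional: [] },
  left_mouse_down: bare,
  left_mouse_up: bare,
  type: { required: ["text"], optional: [] },
  key: { required: ["text"], optional: [] },
  hold_key: { required: ["text", "duration"], optional: [] },
  scroll: { required: ["scroll_direction"], optional: ["scroll_amount"] },
  wait: { required: [], optional: ["duration"] },
};

type ActionRequest = { action: string } & Partial<Record<ParamName, unknown>>;

// Returns a list of problems; an empty list means the request is well-formed.
function validateAction(request: ActionRequest): string[] {
  // hasOwnProperty guards against inherited keys such as "toString".
  if (!Object.prototype.hasOwnProperty.call(ACTION_SPECS, request.action)) {
    return [`unknown action: ${request.action}`];
  }
  const spec = ACTION_SPECS[request.action];
  const errors: string[] = [];
  for (const name of spec.required) {
    if (request[name] === undefined) {
      errors.push(`${request.action}: missing required parameter "${name}"`);
    }
  }
  const allowed = new Set<string>(["action", ...spec.required, ...spec.optional]);
  for (const name of Object.keys(request)) {
    if (!allowed.has(name)) {
      errors.push(`${request.action}: unexpected parameter "${name}"`);
    }
  }
  return errors;
}
```

Rejecting unexpected parameters is a design choice made for this sketch; the real tool may silently ignore them.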
Modifiers:
- ctrl, control - Control key
- alt, option - Alt key
- shift - Shift key
- super, win, cmd - Windows/Command/Super key

Special Keys:
- Return, Enter - Enter/Return key
- Escape, Esc - Escape key
- Tab - Tab key
- Space - Space bar
- Backspace - Backspace key
- Delete, Del - Delete key

Arrow Keys:
- Up, Down, Left, Right

Function Keys:
- F1 through F12

Key Combinations:
Use + to combine keys: ctrl+s, alt+tab, shift+Return
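To illustrate how the aliases and the + syntax fit together, here is a small sketch that splits a combination into canonical modifiers and a final key. It is an illustration only: `parseKeyCombo` and the alias tables are hypothetical names, the choice of canonical spelling (for example `ctrl` over `control`) is arbitrary, and case-insensitive matching is an assumption that the source does not confirm.

```typescript
// Hypothetical alias tables built from the key lists above.
const MODIFIER_ALIASES: Record<string, string> = {
  ctrl: "ctrl", control: "ctrl",
  alt: "alt", option: "alt",
  shift: "shift",
  super: "super", win: "super", cmd: "super",
};

const KEY_ALIASES: Record<string, string> = {
  return: "Return", enter: "Return",
  escape: "Escape", esc: "Escape",
  tab: "Tab", space: "Space", backspace: "Backspace",
  delete: "Delete", del: "Delete",
  up: "Up", down: "Down", left: "Left", right: "Right",
};

interface KeyCombo {
  modifiers: string[];
  key: string;
}

// Own-property lookup so that names like "constructor" are not matched.
function lookup(table: Record<string, string>, name: string): string | undefined {
  return Object.prototype.hasOwnProperty.call(table, name) ? table[name] : undefined;
}

function normalizeKey(key: string): string {
  const lower = key.toLowerCase();
  const alias = lookup(KEY_ALIASES, lower);
  if (alias !== undefined) return alias;
  if (/^f([1-9]|1[0-2])$/.test(lower)) return lower.toUpperCase(); // F1..F12
  return key; // ordinary keys such as "s" pass through unchanged
}

function parseKeyCombo(combo: string): KeyCombo {
  const parts = combo.split("+").map((part) => part.trim());
  const key = parts.pop(); // the last segment is the key, the rest are modifiers
  if (key === undefined || key === "" || parts.some((part) => part === "")) {
    throw new Error(`invalid key combination: "${combo}"`);
  }
  const modifiers: string[] = [];
  for (const part of parts) {
    const canonical = lookup(MODIFIER_ALIASES, part.toLowerCase());
    if (canonical === undefined) throw new Error(`unknown modifier: "${part}"`);
    if (modifiers.indexOf(canonical) === -1) modifiers.push(canonical);
  }
  return { modifiers, key: normalizeKey(key) };
}
```

Under these assumptions, `parseKeyCombo("option+Esc")` yields the modifier `alt` and the key `Escape`. Because + is the separator, this sketch cannot express a combination whose final key is the plus sign itself.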
Custom configuration is a planned enhancement; computer use currently always runs with its default settings. The configuration takes the following form:
{
"computer_use": {
"enabled": true,
"display_width": 1024,
"display_height": 768,
"runtime": "native", // or "docker"
"screenshot_format": "png", // or "jpeg"
"screenshot_quality": 85,
"action_timeout_ms": 5000,
"save_screenshots": true
}
}

Computer use operations are subject to several security measures:
- Approval Gate: All computer use actions require user approval by default
- Policy Validation: Actions are checked against security policies before execution
- Audit Logging: All computer use operations are logged to the Cortex Lens audit system
- Sensitive Data Detection: Attempts to type potentially sensitive data (passwords, API keys) are blocked
- Virtual Display Isolation: Operations run in isolated virtual displays, not the host's main display
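The source does not describe how sensitive data is detected, so the following is only a heuristic sketch of what such a check might look like before a `type` action runs. The rule names, the patterns, and the `detectSensitive` function are all assumptions made for illustration; a production detector would need broader coverage and tuning against false positives.

```typescript
// Hypothetical heuristic rules; none of these are confirmed by the source.
const SENSITIVE_RULES: { rule: string; pattern: RegExp }[] = [
  {
    rule: "credential assignment",
    pattern: /\b(password|passwd|secret|api[_-]?key|token)\s*[:=]\s*\S+/i,
  },
  // AWS access key IDs are 20 characters and begin with "AKIA".
  { rule: "AWS access key ID", pattern: /\bAKIA[0-9A-Z]{16}\b/ },
  { rule: "private key block", pattern: /-----BEGIN [A-Z ]*PRIVATE KEY-----/ },
  // Long unbroken runs of token characters often indicate a secret.
  { rule: "long opaque token", pattern: /\b[A-Za-z0-9_\-]{32,}\b/ },
];

// Returns the names of all rules that match; an empty list means no match.
function detectSensitive(text: string): string[] {
  return SENSITIVE_RULES
    .filter(({ pattern }) => pattern.test(text))
    .map(({ rule }) => rule);
}

// A type action would be blocked when any rule matches.
function mayType(text: string): boolean {
  return detectSensitive(text).length === 0;
}
```

The patterns deliberately omit the `g` flag so that `RegExp.prototype.test` stays stateless across calls.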
Computer use policies can be defined in the security system:
{
"id": "computer-use-require-approval",
"kind": "computer",
"pattern": "*",
"effect": "allow",
"conditions": ["requires_approval"]
}

src/computer-use/
├── types.ts # Type definitions
├── display.ts # Virtual display management (Xvfb)
├── screenshot.ts # Screenshot capture
├── mouse.ts # Mouse control
├── keyboard.ts # Keyboard control
└── executor.ts # Action coordinator
- Agent requests a computer use action through the computer tool
- Tool validates the request and checks the approval gate
- Security validator checks policies
- Virtual display is started (if not already running)
- Action is executed via appropriate controller (mouse/keyboard/screenshot)
- Result is returned to agent (screenshot path, cursor position, etc.)
- Operation is logged to audit system
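The steps above can be sketched as a single function with its collaborators injected. This is a simplified, synchronous illustration rather than the code in executor.ts: the real implementation is presumably asynchronous, and the names `FlowDeps`, `handleComputerAction`, and `matchesPattern` are hypothetical. Two further assumptions are labeled in the comments: a pattern is either `*` or an exact action name, and a request with no matching policy is denied.

```typescript
interface Policy {
  id: string;
  kind: string;
  pattern: string;
  effect: "allow" | "deny";
  conditions: string[];
}

interface FlowRequest {
  action: string;
  [param: string]: unknown;
}

interface FlowResult {
  ok: boolean;
  detail: string;
}

// Hypothetical collaborators standing in for the real subsystems.
interface FlowDeps {
  policies: Policy[];
  requestApproval(request: FlowRequest): boolean;
  ensureDisplay(): void; // starts the virtual display if it is not running
  execute(request: FlowRequest): string; // e.g. returns a screenshot path
  audit(entry: string): void;
}

// Assumption: only "*" and exact names are supported as patterns.
function matchesPattern(pattern: string, action: string): boolean {
  return pattern === "*" || pattern === action;
}

function handleComputerAction(request: FlowRequest, deps: FlowDeps): FlowResult {
  // Every outcome, including rejections, is written to the audit log.
  const finish = (ok: boolean, detail: string): FlowResult => {
    deps.audit(`${request.action}: ${ok ? "ok" : "rejected"} (${detail})`);
    return { ok, detail };
  };
  if (request.action.trim() === "") return finish(false, "invalid request");
  if (!deps.requestApproval(request)) return finish(false, "approval denied");
  const policy = deps.policies.find(
    (p) => p.kind === "computer" && matchesPattern(p.pattern, request.action),
  );
  // Assumption: no matching policy means the action is denied.
  if (policy === undefined || policy.effect !== "allow") {
    return finish(false, "blocked by policy");
  }
  deps.ensureDisplay();
  return finish(true, deps.execute(request));
}
```

Injecting the collaborators keeps the ordering of the steps testable without a running X server.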
Screenshots are saved to disk by default to avoid the 8,000 character tool output truncation limit:
- Saved to: ~/.cortex/data/screenshots/
- Format: screenshot_<timestamp>.png (or .jpeg)
- Tool returns the file path instead of base64 data
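A short sketch shows why returning a path is the practical choice and how the file name could be built. Base64 encodes every 3 bytes as 4 characters, so any image larger than 6,000 bytes already exceeds an 8,000 character limit. The function names are hypothetical, and using epoch milliseconds for `<timestamp>` is an assumption; the source does not specify the timestamp format.

```typescript
type ScreenshotFormat = "png" | "jpeg";

// Length of the padded base64 encoding of a payload of the given size.
function base64Length(byteCount: number): number {
  return 4 * Math.ceil(byteCount / 3);
}

// Builds a path following the screenshot_<timestamp>.<format> convention.
// Assumption: <timestamp> is milliseconds since the Unix epoch.
function screenshotPath(
  baseDir: string,
  capturedAt: Date,
  format: ScreenshotFormat = "png",
): string {
  const dir = baseDir.replace(/\/+$/, ""); // drop trailing slashes
  return `${dir}/screenshot_${capturedAt.getTime()}.${format}`;
}

// True when inlining the image as base64 would exceed the output limit.
function exceedsOutputLimit(byteCount: number, limit: number = 8000): boolean {
  return base64Length(byteCount) > limit;
}
```

Note that `screenshotPath` does not expand `~`; a caller would need to resolve the home directory before writing the file.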
// 1. Take initial screenshot
{ action: 'screenshot' }
// 2. Click Firefox icon
{ action: 'left_click', coordinate: [100, 50] }
// 3. Wait for Firefox to load
{ action: 'wait', duration: 2 }
// 4. Focus address bar
{ action: 'key', text: 'ctrl+l' }
// 5. Type URL
{ action: 'type', text: 'https://example.com' }
// 6. Navigate
{ action: 'key', text: 'Return' }
// 7. Wait for page load
{ action: 'wait', duration: 3 }
// 8. Capture result
{ action: 'screenshot' }

// 1. Open text editor
{ action: 'left_click', coordinate: [150, 100] }
// 2. Wait for editor
{ action: 'wait', duration: 1 }
// 3. Type content
{ action: 'type', text: 'Hello, world!' }
// 4. Save file
{ action: 'key', text: 'ctrl+s' }
// 5. Enter filename
{ action: 'type', text: 'document.txt' }
// 6. Confirm
{ action: 'key', text: 'Return' }

Error: Xvfb not found. Install with: apt-get install xvfb
Solution: Install the required system packages as described in the Requirements section.
Error: Display :99 already running
Solution: The display manager reuses existing displays. This is not an error and can be ignored.
Error: No screenshot tool available
Solution: Install either scrot or imagemagick:
sudo apt-get install scrot
# OR
sudo apt-get install imagemagick

Error: Cannot create virtual display: Permission denied
Solution: Ensure your user has permission to create X11 displays. This may require running in a container or with appropriate permissions.
- Platform: Currently Linux-only (macOS and Windows support planned for future releases)
- Display Resolution: Fixed resolution per session (default: 1024x768)
- Single Display: One virtual display per executor instance
- No GUI Observation: Agent cannot directly see GUI elements, only screenshots
- Coordinate-based: Actions use pixel coordinates, no element identification
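Because actions address raw pixels on a fixed-size display, a coordinate outside the display bounds cannot refer to anything on screen. A minimal sketch of a bounds check follows, assuming the default 1024x768 resolution and zero-based coordinates with the origin at the top-left corner, as in X11. The names `isOnScreen` and `clampToScreen` are hypothetical, and whether the real tool rejects or clamps out-of-range coordinates is not stated in the source.

```typescript
interface DisplaySize {
  width: number;
  height: number;
}

const DEFAULT_DISPLAY: DisplaySize = { width: 1024, height: 768 };

// Valid pixels run from (0, 0) to (width - 1, height - 1).
function isOnScreen(
  coordinate: [number, number],
  display: DisplaySize = DEFAULT_DISPLAY,
): boolean {
  const [x, y] = coordinate;
  return (
    Number.isInteger(x) && Number.isInteger(y) &&
    x >= 0 && x < display.width &&
    y >= 0 && y < display.height
  );
}

// Rounds to whole pixels and pulls the point back inside the display.
function clampToScreen(
  coordinate: [number, number],
  display: DisplaySize = DEFAULT_DISPLAY,
): [number, number] {
  const clamp = (value: number, max: number): number =>
    Math.min(Math.max(Math.round(value), 0), max);
  return [
    clamp(coordinate[0], display.width - 1),
    clamp(coordinate[1], display.height - 1),
  ];
}
```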
Planned improvements include:
- OCR Integration - Extract text from screenshots
- Visual Element Detection - Find UI elements by description
- Smart Waiting - Wait for UI elements to appear
- Recording/Playback - Record and replay action sequences
- Browser DevTools - Direct browser automation via Chrome DevTools Protocol
- Accessibility APIs - Use platform accessibility APIs for better element identification
- Multi-monitor Support - Handle multiple displays
- macOS/Windows Support - Native support for other platforms