Computer-Using Agent (Computer Use)

A computer-using agent is an AI agent that operates software the way a person does: it views the screen through screenshots and acts through mouse and keyboard, rather than through dedicated APIs. It runs an agentic loop of seeing the screen, deciding an action, executing it, and taking a new screenshot.

What is a Computer-Using Agent?

A computer-using agent is an AI system that controls a computer through the same interface a human uses: it looks at the screen and acts with a mouse and keyboard. Instead of calling a purpose-built API for each task, it takes screenshots to perceive the current state, then issues low-level actions like moving the cursor, clicking, typing, and pressing keys. This lets it operate arbitrary software, including applications that expose no API at all.

The category became widely available in late 2024, when Anthropic released a computer use capability for Claude; other vendors have since shipped comparable agents. Anthropic's computer use tool gives the model screenshot capture, mouse control, and keyboard input so it can interact with desktop environments and browsers. On WebArena, a benchmark for autonomous web navigation across real websites, Anthropic reports state-of-the-art results among single-agent systems.

The appeal is generality. Because the agent works through pixels and input events, the same agent can in principle drive any GUI: filling forms, navigating a web app, operating a legacy desktop tool, or chaining steps across several applications, without bespoke integrations for each one.

An AI agent that operates software via screenshots and mouse/keyboard, like a human.
Perceives the screen by taking screenshots and acts with low-level input events.
Can drive any GUI, including applications that expose no API.
Became widely available in late 2024 with Anthropic's computer use for Claude.
Generality is the appeal: one agent can in principle operate many applications.

How does the agentic loop work?

Computer use runs as a loop. The model is given a goal and a screenshot of the current screen. It decides on the next action and returns a tool call; the host system then executes that action against the real or virtual environment, captures the resulting screenshot, and sends it back to the model. The loop repeats until the task is complete or a stopping condition is hit.
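The loop described above can be sketched without a real model or display. In this minimal illustration, `FakeEnvironment`, `fake_model`, and `run_loop` are hypothetical stand-ins rather than part of any SDK; the step limit is one possible stopping condition.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeEnvironment:
    """Hypothetical stand-in for a virtual display; records executed actions."""
    executed: list = field(default_factory=list)

    def screenshot(self) -> str:
        # A real host would return image bytes; a label is enough here.
        return f"screen-after-{len(self.executed)}-actions"

    def execute(self, action: dict) -> None:
        self.executed.append(action)


def fake_model(goal: str, screenshot: str, step: int) -> Optional[dict]:
    """Hypothetical policy: click, type the goal, then signal completion."""
    plan = [
        {"action": "left_click", "coordinate": [512, 384]},
        {"action": "type", "text": goal},
    ]
    return plan[step] if step < len(plan) else None


def run_loop(goal: str, env: FakeEnvironment, max_steps: int = 10) -> str:
    """See, decide, execute, re-capture; stop on completion or step limit."""
    screenshot = env.screenshot()
    for step in range(max_steps):
        action = fake_model(goal, screenshot, step)
        if action is None:             # the model reports the task is complete
            return "done"
        env.execute(action)            # the host performs the requested action
        screenshot = env.screenshot()  # fresh observation for the next turn
    return "step_limit"                # stopping condition: too many iterations
```

In a real deployment the call to `fake_model` would be an API request and `FakeEnvironment` would wrap the virtual display.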

A typical setup has a few moving parts: a virtual display (often an X11 display via Xvfb) that the agent sees and controls, the tool definitions the model can call, integration code that translates abstract requests such as move mouse or take screenshot into real operations, and the agent loop that shuttles actions and screenshots between the model and the environment.
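The integration code mentioned above could look roughly like the following. This is a sketch that assumes an X11 display driven by the `xdotool` and `scrot` command-line tools, and assumes actions arrive as dicts with `action`, `coordinate`, and `text` keys. The function `to_command` is a hypothetical helper; it only builds the argument list and does not run anything.

```python
def to_command(action: dict, screenshot_path: str = "/tmp/screen.png") -> list[str]:
    """Translate an abstract action into an argv list for an X11 tool."""
    kind = action["action"]
    if kind == "mouse_move":
        x, y = action["coordinate"]
        return ["xdotool", "mousemove", str(x), str(y)]
    if kind == "left_click":
        x, y = action["coordinate"]
        # xdotool chains commands: move the pointer, then press button 1.
        return ["xdotool", "mousemove", str(x), str(y), "click", "1"]
    if kind == "type":
        # "--" keeps text that starts with "-" from being parsed as an option.
        return ["xdotool", "type", "--", action["text"]]
    if kind == "key":
        return ["xdotool", "key", "--", action["text"]]
    if kind == "screenshot":
        return ["scrot", screenshot_path]
    raise ValueError(f"unsupported action: {kind}")
```

A host would pass the result to `subprocess.run` with the `DISPLAY` environment variable pointing at the virtual display.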

A known weakness is that the model can assume an action succeeded without checking. A common mitigation is to prompt the agent to take a screenshot after each step and explicitly evaluate whether the intended outcome was achieved before moving on. Placing instruction text before the screenshot in the request also improves click accuracy.

Loop: see screenshot, decide action, execute it, capture a new screenshot, repeat.
Stops when the task is complete or a stopping condition is reached.
Setup includes a virtual display, tool definitions, integration code, and the agent loop.
Prompt the agent to screenshot and verify after each step to avoid assumed success.
Put instruction text before the screenshot image to improve click accuracy.
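Both mitigations can be applied when the host assembles a user turn. The sketch below uses the text and base64 image content-block shapes of the Messages API; `build_user_turn` and the wording of `VERIFY_REMINDER` are illustrative assumptions, not an official prompt.

```python
import base64

# Illustrative wording; adjust to the task at hand.
VERIFY_REMINDER = (
    "After each step, take a screenshot and evaluate whether the intended "
    "outcome was achieved before moving on."
)


def build_user_turn(instruction: str, png_bytes: bytes) -> dict:
    """Build a user message with the instruction text ahead of the screenshot."""
    return {
        "role": "user",
        "content": [
            # Text first, image second.
            {"type": "text", "text": f"{instruction}\n\n{VERIFY_REMINDER}"},
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": base64.b64encode(png_bytes).decode("ascii"),
                },
            },
        ],
    }
```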

Available actions and tool configuration

Anthropic's computer use tool exposes a set of basic actions available in all versions, including screenshot to capture the display, mouse_move and left_click at given coordinates, type to enter text, key to press a key or combination such as ctrl+s, left_click_drag for click-and-drag, and hold_key to hold a key for a duration. Later tool versions add enhanced actions: the computer_20250124 version introduced scroll, double_click, triple_click, and wait, and the newest computer_20251124 version adds a zoom action for inspecting fine screen detail.

The tool is configured with a type string, a name, and the display width and height in pixels; an X11 display number can also be supplied. Anthropic recommends a modest resolution (its examples use 1024 by 768) because very high resolutions can hurt accuracy and increase cost. The computer use tool is in beta and requires a beta header on the API request.
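When the physical display is larger than the resolution given to the model, one common approach is to downscale screenshots before sending them and to map the model's coordinates back to the real screen. The helper below is a minimal sketch of that mapping; `scale_to_screen` and its default sizes are assumptions for illustration.

```python
def scale_to_screen(x: int, y: int,
                    model_size: tuple = (1024, 768),
                    screen_size: tuple = (2048, 1536)) -> tuple:
    """Map a coordinate from the model's resolution to the real display."""
    model_w, model_h = model_size
    screen_w, screen_h = screen_size
    # Scale each axis independently, then round to the nearest pixel.
    return round(x * screen_w / model_w), round(y * screen_h / model_h)
```

For example, the centre of a 1024 by 768 screenshot, `(512, 384)`, maps to `(1024, 768)` on a 2048 by 1536 display. If the two aspect ratios differ, the screenshot must be resized with the same per-axis factors for the mapping to stay accurate.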

As an example, recent Claude models use the computer_20251124 tool type with the beta header computer-use-2025-11-24, while earlier models used computer_20250124 with the beta header computer-use-2025-01-24. Computer use is commonly combined with bash and text-editor tools to give the agent broader control of the machine.
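Because the tool type and the beta header must match, it can help to keep the pairing in one place. The sketch below uses only the two pairings named above; `BETA_HEADERS` and `computer_tool` are hypothetical helper names.

```python
# Tool type -> matching beta header, as described in the text above.
BETA_HEADERS = {
    "computer_20251124": "computer-use-2025-11-24",
    "computer_20250124": "computer-use-2025-01-24",
}


def computer_tool(tool_type: str, width: int = 1024, height: int = 768) -> tuple:
    """Return (tool_definition, beta_header) for a known tool version."""
    if tool_type not in BETA_HEADERS:
        raise ValueError(f"unknown tool type: {tool_type}")
    tool = {
        "type": tool_type,
        "name": "computer",
        "display_width_px": width,
        "display_height_px": height,
    }
    return tool, BETA_HEADERS[tool_type]
```

The returned pair can be passed as `tools=[tool]` and `betas=[header]` in a request.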

Basic actions (all versions): screenshot, mouse_move, left_click, left_click_drag, right_click, type, key, hold_key.
computer_20250124 adds enhanced actions: scroll, double_click, triple_click, and wait.
computer_20251124 adds a zoom action for inspecting fine screen detail.
Configured with a tool type string, a name, and the display width and height in pixels.
Anthropic recommends a modest resolution (examples use 1024 by 768) for accuracy and cost.

Code: defining the computer use tool

The snippet below configures the computer use tool with Anthropic's Python SDK, setting the tool type, display dimensions, and the required beta header. The host application is responsible for running the agent loop: executing each requested action against the environment and returning a fresh screenshot as a tool result.

```python
import anthropic

client = anthropic.Anthropic()

response = client.beta.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,
    tools=[
        {
            "type": "computer_20251124",
            "name": "computer",
            "display_width_px": 1024,
            "display_height_px": 768,
            "display_number": 1,
        }
    ],
    messages=[
        {"role": "user", "content": "Open the browser and search for the weather."}
    ],
    betas=["computer-use-2025-11-24"],
)

# response.content may contain a tool_use block describing the next action
# (for example screenshot, left_click, or type). The host executes it and
# returns a new screenshot as a tool_result, then the loop continues.
print(response.content)
```

Configure Anthropic's computer use tool with the required beta header.
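The host's half of the exchange is to pick out the requested action and answer it with a screenshot. The sketch below works on plain dicts shaped like Messages API content blocks, which is an assumption made so the example runs without the SDK or a network call. `next_actions` and `tool_result_turn` are hypothetical helpers.

```python
def next_actions(content: list) -> list:
    """Return the tool_use blocks from a response's content list."""
    return [block for block in content if block.get("type") == "tool_use"]


def tool_result_turn(tool_use_id: str, screenshot_b64: str) -> dict:
    """Build the user turn that returns a screenshot for a given tool call."""
    return {
        "role": "user",
        "content": [
            {
                "type": "tool_result",
                "tool_use_id": tool_use_id,  # must match the tool_use block's id
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": screenshot_b64,
                        },
                    }
                ],
            }
        ],
    }
```

The host appends the assistant's response and this user turn to the message list, then sends the next request.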

Risks and safety considerations

Letting a model click and type on a real machine raises distinct risks. The most prominent is prompt injection: a webpage, document, or image the agent views can contain instructions that try to override the user's intent. Anthropic notes that the model may, in some circumstances, follow commands found in on-screen content, so it trains the model to resist injection and runs classifiers that can flag suspected injections and steer the agent to ask for user confirmation before acting.

Recommended precautions include isolating the agent from sensitive data and high-impact actions, running it in a sandboxed or virtual environment, and keeping a human in the loop for consequential steps. Be cautious about providing credentials: Anthropic suggests passing them inside dedicated tags, and only after reviewing its prompt injection guidance. Limiting what the agent can reach limits the damage if it is misled.
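A human-in-the-loop check can be a small gate in the host, separate from anything the model or Anthropic's classifiers do. The sketch below is one possible policy and rests on stated assumptions: the host itself decides when a context is `consequential`, and the actions in `READ_ONLY_ACTIONS` are treated as harmless. All names here are hypothetical.

```python
from typing import Callable

# Assumption: actions that only observe or wait never need approval.
READ_ONLY_ACTIONS = {"screenshot", "mouse_move", "wait", "zoom"}


def approve(action: dict, consequential: bool,
            ask_human: Callable[[dict], bool]) -> bool:
    """Return True if the host may execute `action`."""
    if action.get("action") in READ_ONLY_ACTIONS:
        return True
    if not consequential:
        return True
    # State-changing action in a consequential context: defer to a person.
    return bool(ask_human(action))
```

In practice `ask_human` might show the pending action and the latest screenshot to an operator, and the host would skip or abort when it returns False.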

Reliability is the other constraint. Computer use can still misread UI elements such as dropdowns and scrollbars, where keyboard shortcuts often work better, and it can be slower and costlier than an API-based integration when one exists. For repeatable, high-volume tasks a direct API is usually preferable; computer use is most useful where no API is available or where a task spans many disparate applications.

Prompt injection is the main risk: on-screen content can try to hijack the agent.
Anthropic trains the model to resist injection and runs classifiers that can request confirmation.
Precautions: sandbox the agent, limit data and actions, keep a human in the loop.
Handle credentials carefully: pass them inside dedicated tags, and only after reviewing injection guidance.
Computer use can misread UI elements and is slower and costlier than a direct API.

Key takeaways

A computer-using agent operates software through screenshots and mouse/keyboard, so it can drive any GUI, including applications with no API.
It runs an agentic loop: see the screen, decide an action, execute it, capture a new screenshot, and repeat until done.
Anthropic's computer use tool exposes actions like screenshot, click, type, key, and scroll, is in beta, and requires a beta header (for example computer-use-2025-11-24).
A modest display resolution (Anthropic's examples use 1024 by 768) improves accuracy and reduces cost.
Prompt injection is the key risk; mitigations include sandboxing, limiting data and actions, keeping a human in the loop, and Anthropic's injection classifiers.

Frequently asked questions

What is a computer-using agent?

A computer-using agent is an AI agent that controls a computer the way a person does, by viewing the screen through screenshots and acting through mouse and keyboard. This lets it operate any graphical application, including software that exposes no API, rather than relying on dedicated integrations.

How does Anthropic's computer use tool work?

Anthropic's computer use tool gives the model screenshot capture plus mouse and keyboard control. The model sees a screenshot and returns an action such as click or type; the host executes it and sends back a new screenshot, and the loop repeats. It is a beta feature requiring a beta header.

What actions can the agent perform?

Basic actions available in all versions include taking a screenshot, moving the mouse, left-clicking at coordinates, click-and-drag, typing text, pressing keys or combinations like ctrl+s, and holding a key. The computer_20250124 version added enhanced actions such as scroll, double_click, triple_click, and wait, and computer_20251124 adds a zoom action.

Is computer use safe?

It carries real risks, chiefly prompt injection, where on-screen content tries to hijack the agent. Anthropic trains the model to resist injection and runs classifiers that can ask for confirmation. Recommended safeguards are sandboxing, limiting accessible data and actions, and keeping a human in the loop.

When should computer use be chosen over an API?

Use computer use when no API exists, or when a task spans many disparate applications that lack a unified interface. For repeatable, high-volume tasks where a direct API is available, the API is usually faster, cheaper, and more reliable than driving the GUI through screenshots.
