Remote OpenClaw Blog
ClawWork: What Happens When You Make an OpenClaw Agent Earn Its Own Living
5 min read
Most AI agent benchmarks measure how well a model answers questions or completes constrained tasks. ClawWork from HKUDS takes a different approach: it gives an AI agent $10, assigns it professional work tasks, makes it pay for its own API calls, and measures whether it stays solvent.
That's not a hypothetical stress test. It's a live benchmark system that has attracted 3,300+ GitHub stars in its first few days. For the full overview of OpenClaw and its capabilities, see our complete guide to OpenClaw.
The Core Concept
ClawWork transforms OpenClaw (via Nanobot) from an AI assistant into an "AI coworker" with economic accountability: the agent starts with $10, pays for its own API calls, and must earn income by completing professional tasks to quality standards.
In the simulation:
- The agent starts with $10
- Every LLM call costs real money (deducted automatically based on actual token usage)
- Income comes only from completing professional work tasks to a quality standard
- If the balance hits zero, the agent is insolvent — game over
The result: the agent has to make real trade-off decisions. Work on a task for income now, or invest time learning to do future tasks better? Spend tokens on web searches to improve quality, or submit with existing knowledge? These are decisions with actual consequences in the simulation.
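The economic loop described above is easy to sketch. This is a minimal illustration of the solvency mechanic, not the project's actual API: the field names and numbers are assumptions.

```python
# Minimal sketch of ClawWork's solvency loop. Field names and numbers
# are illustrative assumptions, not the project's actual API.
def run_episode(tasks, start_balance=10.0):
    balance = start_balance
    for task in tasks:
        balance -= task["api_cost"]          # every LLM call is paid for
        if balance <= 0:
            return balance, "insolvent"      # balance hit zero: game over
        # income = potential payment scaled by the judged quality score
        balance += task["payment"] * task["quality_score"]
    return balance, "solvent"

tasks = [{"api_cost": 0.25, "payment": 260.0, "quality_score": 0.5}]
print(run_episode(tasks))  # → (139.75, 'solvent')
```

The key property is that costs are deducted before income arrives, so an expensive task attempt can bankrupt the agent even if the work would eventually have paid well.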
What Is the GDPVal Dataset?
ClawWork uses the GDPVal dataset — 220 real professional tasks across 44 economic sectors, including technology, finance, healthcare, and legal, originally designed to estimate AI's contribution to GDP. Payments are calculated from BLS hourly wages, and quality is scored by GPT-5.2.
Task categories include:
- Technology: Computer systems management, software engineering, data analysis
- Finance: Financial analysis, compliance, auditing
- Healthcare: Health administration, social work
- Legal/Operations: Property management, project coordination
Payment is calculated on real economic value:
Payment = quality_score × (estimated_hours × BLS_hourly_wage)
Tasks range from $82 to $5,004 in potential payment, with an average around $260. The quality score (0.0 to 1.0) is evaluated by GPT-5.2 using category-specific rubrics — not a generic "did it answer" check.
What Does the Leaderboard Show?
Top performers on the ClawWork leaderboard reach $1,500+/hour equivalent salary, meaning their work quality × task volume × efficiency outperforms typical human white-collar productivity in these domains.
Survival is the hard constraint. Starting with only $10 creates genuine pressure. One bad task or careless use of expensive web search calls can wipe the balance. Models that are sloppy with tokens don't survive long regardless of raw capability.
Strategic decisions matter. The "work vs. learn" choice is a real fork. Agents that invest too heavily in learning tasks run out of money. Agents that never learn plateau in quality. The balance between immediate income and capability building affects long-run performance.
The Nanobot Integration (ClawMode)
ClawWork integrates directly into live OpenClaw and Nanobot deployments via ClawMode, adding real-time token cost tracking and the ability to assign paid professional tasks through Telegram, Discord, or WhatsApp.
With ClawMode active, your regular Nanobot instance gains economic awareness:
- Every conversation costs tokens (tracked in real-time)
- The /clawwork command lets any user in your Telegram/Discord/WhatsApp assign paid professional tasks
- Tasks are automatically classified into 44 occupational categories with BLS wage-based pricing
- A cost footer appears on every response:
Cost: $0.0075 | Balance: $999.99 | Status: thriving
This transforms a personal AI assistant into something that has to demonstrate value — you can literally see whether your agent is earning more than it costs to run.
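The footer shown above is simple to reproduce. This sketch formats one; the status labels and the thresholds that pick them are assumptions, not ClawMode's actual rules:

```python
# Hypothetical sketch of a ClawMode-style cost footer. The status
# labels and balance thresholds are assumptions for illustration.
def cost_footer(call_cost, balance):
    if balance <= 0:
        status = "insolvent"
    elif balance > 100:
        status = "thriving"
    else:
        status = "surviving"
    return f"Cost: ${call_cost:.4f} | Balance: ${balance:.2f} | Status: {status}"

print(cost_footer(0.0075, 999.99))
# → Cost: $0.0075 | Balance: $999.99 | Status: thriving
```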
Getting It Running
Standalone simulation:
git clone https://github.com/HKUDS/ClawWork.git
cd ClawWork
conda create -n clawwork python=3.10
conda activate clawwork
pip install -r requirements.txt
# Terminal 1: Start the dashboard
./start_dashboard.sh
# Terminal 2: Run the agent
./run_test_agent.sh
# Open http://localhost:3000
You'll need an OpenAI API key (for the GPT-4o agent and LLM evaluation) and an E2B API key (for code execution in sandboxed environments).
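Both keys are typically supplied as environment variables. The names below follow each provider's usual convention (`OPENAI_API_KEY`, `E2B_API_KEY`), but verify them against the repository's own configuration template:

```shell
# Export the two required keys before running the scripts above.
# Variable names follow provider conventions; check the repo's
# config template for the exact names it expects.
export OPENAI_API_KEY="sk-..."
export E2B_API_KEY="e2b_..."
```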
Live dashboard metrics include:
- Balance chart updating in real-time
- Work vs. learn activity distribution
- Income, costs, net worth, survival status
- Individual task quality scores and payment amounts
- Knowledge base from learning sessions
Multi-Agent Competition
ClawWork supports running multiple AI models, such as GPT-4o and Claude, head-to-head in the same economic environment:
"agents": [
{"signature": "gpt4o-run", "basemodel": "gpt-4o", "enabled": true},
{"signature": "claude-run", "basemodel": "claude-sonnet-4-5-20250929", "enabled": true}
]
This produces genuinely useful comparisons: not "which model scores higher on MMLU" but "which model can sustain economic viability doing real professional work."
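A runner would presumably filter this config down to the enabled agents before launching them. A minimal sketch, with the field names taken from the snippet above (the surrounding file layout is an assumption):

```python
import json

# Parse a multi-agent config like the fragment above and keep only
# the enabled entries. Field names come from the snippet; the file
# layout around them is assumed for illustration.
config = json.loads("""
{
  "agents": [
    {"signature": "gpt4o-run", "basemodel": "gpt-4o", "enabled": true},
    {"signature": "claude-run", "basemodel": "claude-sonnet-4-5-20250929", "enabled": false}
  ]
}
""")

active = [a["signature"] for a in config["agents"] if a["enabled"]]
print(active)  # → ['gpt4o-run']
```

Each enabled agent then runs against the same task stream and starting balance, so differences in the final balances are attributable to the models rather than the environment.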
Why Does ClawWork Matter for OpenClaw Operators?
ClawWork demonstrates three principles that apply to every OpenClaw production deployment: token cost consciousness, quality over speed, and the real tradeoff between accumulating context and acting on immediate tasks.
Token cost consciousness. ClawWork makes token spending visible in a way that most deployments don't. If you're running OpenClaw heavily and haven't set up cost monitoring, you're flying blind on API spend.
Quality matters more than speed. The economic model rewards quality (payment × quality_score) rather than task volume. This mirrors how real assistant value works — a fast but sloppy agent is worse than a thoughtful one.
The "work vs. learn" tradeoff is real. In production OpenClaw deployments, the equivalent question is: how much context should your agent accumulate versus how much should it act on immediate tasks? ClawWork makes this tradeoff visible and measurable. For advanced techniques for managing token costs and model routing in production, and for the operator workflows that deliver the most consistent daily value, see our dedicated guides. ClawWork validates what we found across 336 real OpenClaw use cases: the operators who get the most value focus on quality over speed.
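If you want the same cost visibility in your own deployment, the core is just multiplying token counts by per-token prices. A minimal sketch; the model name and rates are purely illustrative, so substitute your provider's current pricing:

```python
# Estimate the dollar cost of one LLM call from its token counts.
# Prices are dollars per 1M tokens and purely illustrative; look up
# your provider's current price sheet.
PRICES = {"example-model": {"input": 2.50, "output": 10.00}}

def call_cost(model, input_tokens, output_tokens):
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

print(round(call_cost("example-model", 2_000, 500), 6))  # → 0.01
```

Logging this per call, the way ClawMode's footer does, is usually enough to spot which workloads dominate your spend.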
Links:
- GitHub: github.com/HKUDS/ClawWork
- Live leaderboard: Available in the repository dashboard
Running OpenClaw in production? Use the API Cost Monitor to track spending, and the free Cost Optimizer skill to route tasks to the cheapest model automatically.