Methodology
How a score is made.
Trust comes from showing our work. Here is exactly how every AgentRank score is produced.
The corpus
Each category has 50 hand-authored, standardized tasks: the same questions for every product. A task asks the agent to perform a real-world operation against the target product (for example, create a customer with an email and a name). Each task has a weight between 0.5 and 2.0; high-weight tasks (canonical happy paths, security-critical flows) count for more in the final score.
The harness
For each (agent, product, task) triple we spin up an isolated agent session, hand the agent a tool to invoke the product's API, and capture every tool call across a multi-turn plan loop. The agent's choice of operation, parameters, and reasoning is recorded into a trace.
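The recording step described above can be sketched in Python. This is a minimal illustration, not the production harness: the names Trace, make_recording_tool, run_task, and the agent_step callback signature are all hypothetical, and the real harness also records the agent's reasoning, which is left out here.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """One trace per (agent, product, task) triple."""
    agent: str
    product: str
    task_id: str
    calls: list = field(default_factory=list)


def make_recording_tool(api, trace):
    """Wrap a product API callable so every invocation lands in the trace.

    `api` is a hypothetical callable taking (operation, params).
    """
    def tool(operation, **params):
        entry = {"operation": operation, "params": params}
        try:
            entry["result"] = api(operation, params)
        except Exception as exc:  # record the failure instead of aborting the run
            entry["error"] = repr(exc)
        trace.calls.append(entry)
        return entry
    return tool


def run_task(agent, agent_step, product, task_id, prompt, api, max_turns=10):
    """Drive a multi-turn plan loop and return the captured trace.

    `agent_step(prompt, tool, observation)` is an assumed interface that
    returns (done, observation) after each turn.
    """
    trace = Trace(agent, product, task_id)
    tool = make_recording_tool(api, trace)
    observation = None
    for _ in range(max_turns):
        done, observation = agent_step(prompt, tool, observation)
        if done:
            break
    return trace
```

Wrapping the tool, rather than asking the agent to report what it did, means the trace reflects the calls that were actually made.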
Pass / fail
Each task ships with explicit success criteria: which operation must be invoked (any of N idiomatic SDK or HTTP forms), what semantic keywords must appear in parameters, and whether errors are tolerated. If any criterion fails, we classify the failure into one of ten categories (wrong endpoint, schema mismatch, picked a competitor, hallucinated capability, …).
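As a rough illustration of the three criteria above, here is a minimal Python sketch. The class names, field names, and matching rules (substring keyword matching, keywords checked on a call that used an accepted operation) are assumptions made for the example, and the returned reasons are plain descriptions, not the ten failure categories.

```python
from dataclasses import dataclass


@dataclass
class Criteria:
    allowed_operations: set    # any of N idiomatic SDK or HTTP forms
    required_keywords: set     # must appear in the call's parameters
    errors_tolerated: bool = False


@dataclass
class ToolCall:
    operation: str
    params: dict
    error: bool = False


def _has_keywords(call, keywords):
    # Assumption: a keyword matches if it appears in any parameter name or value.
    text = " ".join(f"{k} {v}" for k, v in call.params.items()).lower()
    return all(kw.lower() in text for kw in keywords)


def evaluate(trace, criteria):
    """Return (passed, reasons); the task passes only if every criterion holds."""
    reasons = []
    matching = [c for c in trace if c.operation in criteria.allowed_operations]
    if not matching:
        reasons.append("required operation not invoked")
    elif not any(_has_keywords(c, criteria.required_keywords) for c in matching):
        reasons.append("required keywords missing from parameters")
    if not criteria.errors_tolerated and any(c.error for c in trace):
        reasons.append("error occurred but errors are not tolerated")
    return (not reasons, reasons)
```

For the customer-creation example, a call to either accepted form with both an email and a name in its parameters would pass; a call to a different endpoint would fail on the first criterion.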
Scoring
agent_score(product, agent) = weighted_sum(task_results) / total_weight × 100
product_score(product) = weighted_avg(agent_scores)
weights:
  Claude Code 0.30
  Cursor 0.30
  Codex 0.15
  Windsurf 0.15
  ChatGPT 0.10
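The two formulas can be worked through in a few lines of Python. This sketch assumes each task result is binary (a pass earns the task's full weight, a fail earns nothing); the TaskResult type and function bodies are illustrative, and only the agent weights are taken from the list above.

```python
from dataclasses import dataclass

# Agent weights as listed in the scoring section.
AGENT_WEIGHTS = {
    "Claude Code": 0.30,
    "Cursor": 0.30,
    "Codex": 0.15,
    "Windsurf": 0.15,
    "ChatGPT": 0.10,
}


@dataclass
class TaskResult:
    weight: float   # between 0.5 and 2.0
    passed: bool    # assumption: pass/fail only, no partial credit


def agent_score(results):
    """Weighted share of passed tasks, scaled to 0-100."""
    total_weight = sum(r.weight for r in results)
    if total_weight == 0:
        return 0.0
    passed_weight = sum(r.weight for r in results if r.passed)
    return passed_weight / total_weight * 100


def product_score(agent_scores, weights=AGENT_WEIGHTS):
    """Weighted average of per-agent scores for one product."""
    total = sum(weights[a] for a in agent_scores)
    return sum(weights[a] * s for a, s in agent_scores.items()) / total
```

For example, an agent that passes a weight-2.0 task and a weight-1.0 task but fails another weight-1.0 task scores 3.0 / 4.0 × 100 = 75.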
Agents tested today (and what's next)
AgentRank's public launch starts with two agents: Claude Code and OpenAI Codex. Until the remaining three agents (Cursor, Windsurf, ChatGPT) are wired into the harness, published scores are computed from the two live agents only, with their weights rescaled so that they still sum to 1.0. Scores produced this way are labeled preliminary on the score page; once the full agent set is live, every product is re-run and the published score is replaced with the full-set result. Our goal is transparency: you can always see exactly which agents contributed to the number you are looking at.
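One plausible reading of the re-weighting is proportional rescaling: keep each live agent's weight in the same ratio and divide by their sum. The sketch below assumes that reading; the function names are hypothetical, and the exact rule the harness applies is not spelled out here.

```python
FULL_WEIGHTS = {
    "Claude Code": 0.30,
    "Cursor": 0.30,
    "Codex": 0.15,
    "Windsurf": 0.15,
    "ChatGPT": 0.10,
}


def renormalize(weights, live_agents):
    """Rescale the live agents' weights proportionally so they sum to 1.0."""
    live = {agent: weights[agent] for agent in live_agents}
    total = sum(live.values())
    if total <= 0:
        raise ValueError("no live agents with positive weight")
    return {agent: w / total for agent, w in live.items()}


def preliminary_score(agent_scores, weights=FULL_WEIGHTS):
    """Product score computed from whichever agents have results."""
    live_weights = renormalize(weights, agent_scores.keys())
    return sum(live_weights[a] * s for a, s in agent_scores.items())
```

Under this assumption, Claude Code (0.30) and Codex (0.15) rescale to two thirds and one third, so agent scores of 80 and 50 would give a preliminary product score of 70.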
Reproducibility
Paid plans include the full per-task trace for every run — every tool call, every argument, every model reply. You can re-run any product on demand and compare diffs across re-runs.
Questions or disputes? Email us. Every score is reproducible.