1 UC Berkeley · 2 Embodied Science · 3 AMD
RHO: Your Coding Agent is Secretly a Roboticist
RHO (Robotics Harness Optimization) lets tool-enabled LLM coding agents unleash their inner roboticist. Through reflective evolutionary search, it writes and rewrites a robot's control code in simulation until the code itself solves the task, reaching state of the art among Code-as-Policies methods.
Past research
Recent Code-as-Policies (CaP) systems rely on iterative code-generation loops at test time, which often leaves them unsuitable for real-time robotics tasks.
The idea
RHO hands the job to a tool-enabled coding agent and lets it practice in a simulator through reflective evolutionary search: it edits, runs, and debugs whole programs against a scalar reward, keeping the variants that help. The end product is one self-contained multi-file code repository. Because that product is just code, you can read it, debug it, and reuse it. (For experts: this relocates the model's expensive search out of deployment and into a one-time training phase, so what ships on primitive-only benchmarks is a frozen Repository-as-Policy.)
A candidate repository runs in the simulator, earning a reward plus detailed feedback on what went wrong.
The repository and that feedback go to the coding agent, which mutates many files at once.
A child is kept only if it beats its parent on a paired minibatch of trials (the acceptance gate).
The best variants are retained as a diverse library, a Pareto frontier over task coverage, not a single greedy chain.
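The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Repo` and `Scores` types, the `run_trial` and `mutate` callables, the "strictly higher total reward" gate, and uniform parent sampling are all hypothetical stand-ins for the simulator, the coding agent, and whatever rules RHO actually uses.

```python
import random
from typing import Callable, Dict, List, Tuple

Repo = Dict[str, str]      # hypothetical: file path -> file contents
Scores = Dict[str, float]  # hypothetical: task name -> mean reward


def accept(parent_rewards: List[float], child_rewards: List[float]) -> bool:
    """Acceptance gate: both lists come from the same trials (paired);
    the child is kept only if its total reward is strictly higher."""
    if len(parent_rewards) != len(child_rewards):
        raise ValueError("paired comparison needs the same trials for both")
    return sum(child_rewards) > sum(parent_rewards)


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b on every task and better on one."""
    return all(a[t] >= b[t] for t in b) and any(a[t] > b[t] for t in b)


def pareto_frontier(pool: List[Scores]) -> List[Scores]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in pool if not any(dominates(o, c) for o in pool)]


def evolve(
    seed: Repo,
    run_trial: Callable[[Repo, int], Tuple[float, str]],  # -> (reward, feedback)
    mutate: Callable[[Repo, List[str]], Repo],            # stand-in for the coding agent
    trial_ids: List[int],
    generations: int,
    batch_size: int,
    rng: random.Random,
) -> List[Repo]:
    """Run the edit/evaluate/gate loop and return all accepted repositories."""
    accepted = [seed]
    for _ in range(generations):
        parent = rng.choice(accepted)              # assumption: uniform parent sampling
        batch = rng.sample(trial_ids, batch_size)  # same trials for parent and child
        parent_runs = [run_trial(parent, t) for t in batch]
        feedback = [fb for _, fb in parent_runs]
        child = mutate(parent, feedback)           # may touch many files at once
        child_runs = [run_trial(child, t) for t in batch]
        if accept([r for r, _ in parent_runs], [r for r, _ in child_runs]):
            accepted.append(child)
    return accepted
```

Evaluating parent and child on the same sampled trials keeps the comparison paired, so a child cannot pass the gate merely by drawing easier trials. `pareto_frontier` is shown separately: given per-task scores for each accepted candidate, it keeps specialists that win on different tasks rather than only the single best average.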
Unlike prior reflective optimization methods which optimize only a single prompt, function, or delimited code region, RHO optimizes entire multi-file repositories of code at once.
A single coding agent (single-turn), no matter how long it is allowed to think, tops out at just 30.0% on the held-out tasks. The reflective evolutionary loop is what lifts the same model and tools to 70%. The loop is the method, not just the agent.
The proof
A single execution of the candidate repository, no VDM, and no LLM code-generation calls.
This contrast is scoped to LIBERO-PRO (perturbed): task-level and position perturbations that test generalization beyond the standard LIBERO training distribution.
On standard LIBERO, OpenVLA and π0.5 both score over 90.0%; under LIBERO-PRO's perturbations OpenVLA scores 0.0% and π0.5 scores 12.83%.
Reflective evolution · CaP-Bench Robosuite
| candidate | generation | mean reward | best-by-validation | note |
|---|---|---|---|---|
7 Robosuite tasks · 87 accepted candidates · the deployed solver is selected by validation at generation 134. These curves are best-by-train running maxima of mean shaped reward on the 10 training trial IDs, distinct from the held-out per-task success counts: five tasks reach training reward 1.0, two_arm_lift plateaus near 0.73, and nut_assembly stays near 0.06.
Lines are best-by-train running maxima, so each curve is non-decreasing. A full data table is below.
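To make "best-by-train running maximum" concrete, here is a small sketch. The function names and the reward values are made up for illustration and are not data from the paper; the second helper shows the separate idea of choosing the deployed candidate by validation score rather than by training reward.

```python
from itertools import accumulate
from typing import List


def running_best(rewards: List[float]) -> List[float]:
    """Running maximum: entry i is the best reward seen in generations 0..i."""
    return list(accumulate(rewards, max))


def pick_by_validation(val_scores: List[float]) -> int:
    """Index of the generation with the highest validation score (first on ties)."""
    return max(range(len(val_scores)), key=val_scores.__getitem__)


# Made-up per-generation rewards, not data from the paper.
train = [0.10, 0.40, 0.30, 0.73, 0.50]
val = [0.05, 0.35, 0.45, 0.40, 0.42]

print(running_best(train))      # [0.1, 0.4, 0.4, 0.73, 0.73]
print(pick_by_validation(val))  # 2
```

A running maximum can only stay flat or rise, which is why a plateau in such a curve means no later candidate beat the earlier best on the training trials, not that every later candidate scored the same.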
Here the deployment is multi-turn: a language-model agent keeps running while the robot works. Rather than write standalone code, RHO optimizes the agent's harness: its system prompt and the bodies of its tools (the agent's control interface). On the hard held-out split, success nearly doubles from 23.5% to 44.3% (p < 0.001, 3-run average), while using 27% fewer tool calls and 20% less wall-clock time (single representative run; the sign holds under the 3-run average).
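As a quick check on "nearly doubles", the two success rates quoted above can be compared directly. The helper name `gain` is just for illustration; only the 23.5% and 44.3% figures come from the text.

```python
from typing import Tuple


def gain(before_pct: float, after_pct: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, ratio after/before)."""
    return after_pct - before_pct, after_pct / before_pct


points, ratio = gain(23.5, 44.3)
print(f"+{points:.1f} points, {ratio:.2f}x")  # +20.8 points, 1.89x
```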
On Robosuite, mutating a full multi-file repository wins in fewer generations (200 vs 250), and far outperforms rigid structured representations (behavior trees and finite-state machines).
Only the top two bars share the same backend (Codex GPT-5.5). The behavior-tree baseline used Claude Sonnet 4.6 and the finite-state-machine baseline used Qwen3.6-27B, so the gap to those two bars reflects model capability as well as representation, not a pure structure-only comparison.
Aggregate gains come with per-task tradeoffs. RHO is below the multi-turn baseline on object-swap on LIBERO-PRO (12.2%) and on cube_restack and two_arm_lift on Robosuite, and nut_assembly remains an open challenge for both RHO and the baseline.
Run it yourself
Karim Elmaaroufi, Justin Svegliato, Sarunas Kalade, Graham Schelle, Sanjit A. Seshia, and Matei Zaharia. "RHO: Your Coding Agent is Secretly a Roboticist." arXiv:2606.16458, 2026.
@misc{elmaaroufi2026rho,
title = {{$\rho$}: Your Coding Agent is Secretly a Roboticist},
author = {Elmaaroufi, Karim and Svegliato, Justin and Kalade, Sarunas and Schelle, Graham and Seshia, Sanjit A. and Zaharia, Matei},
year = {2026},
howpublished = {arXiv preprint},
eprint = {2606.16458},
archivePrefix = {arXiv},
primaryClass = {cs.RO},
url = {https://arxiv.org/abs/2606.16458},
note = {Robotics Harness Optimization (RHO)}
}