The Price of Thinking: CostAwareToolEnv
CostAwareToolEnv teaches LLM agents when a tool call is worth its cost, and when the smarter move is to stop spending.
The Problem with Free Tools
Modern LLM agents have access to tools — search engines, calculators, code interpreters, databases. In every real deployment, these tools cost something: API fees, latency, rate limits, compute time. But almost every existing RL benchmark treats tools as free and unlimited.
This creates a gap between research and reality. An agent trained on "use whatever tools you want" behaves terribly in production where every call costs money. CostAwareToolEnv closes that gap.
The agent is given a fixed budget of 50 cost units to spend across 10 questions. Every tool call deducts from that budget. The agent must decide: which tool is worth calling for this question? How many times should I call tools before committing an answer? Is it worth spending 2.0 on an LLM call, or can a 0.1 calculator solve this?
In one sentence: CostAwareToolEnv is a deployed, test-covered OpenEnv environment for studying whether reinforcement learning can teach LLM agents to route across heterogeneous tools under a shared budget, rather than treating every tool call as free.
Where this came from
CostAwareToolEnv generalizes SearchEconomicsEnv (Yashaswi Sharma, University of Southern California / Ceramic AI), which posed a simpler version: given a fixed number of search calls, can an RL agent learn to answer HotpotQA questions efficiently? That work showed agents could learn non-trivial search strategies. We ask the harder question: can the same principle scale to multiple tools and multiple domains?
[Diagram: "information has a cost" → SearchEconomicsEnv (1 search tool + HotpotQA) → CostAwareToolEnv (6 tools + 4 domains + shared budget) → AgentBeats Phase 2 (OpenEnv submission)]
| | SearchEconomicsEnv | CostAwareToolEnv |
|---|---|---|
| Tools | 1 (search) | 6 (search, wiki, calc, code, LLM, commit) |
| Datasets | HotpotQA only | HotpotQA + MATH + GPQA + HumanEval |
| Budget unit | # of search calls | Cost units per tool |
| Core challenge | How many searches? | Which tool, when, under budget pressure? |
Environment Design
CostAwareToolEnv is an OpenEnv-native sequential MDP in which an LLM agent selects from 6 tools with heterogeneous costs under a shared episode budget, across 10 questions drawn from 4 domains.
Episode structure
[Diagram: episode flow. The episode starts with budget = 50 units and samples 10 questions (HotpotQA / MATH / GPQA / HumanEval). Each step shows the observation (question + domain + budget + context) and the agent acts. A tool call runs the selected tool, charges its cost, and appends the result; the loop repeats until the budget is exhausted or the 8-step limit is reached. A commit is graded with Exact Match + token F1 and receives the commit reward (quality + efficiency bonus). The episode continues while questions and budget remain, then ends.]
START EPISODE
    Budget = 50.0 units
    Draw 10 questions (mix: 40% HotpotQA, 30% MATH, 20% GPQA, 10% HumanEval)
    FOR each question:
        Show agent: question text, domain, remaining budget, context window
        LOOP (max 8 steps per question):
            Agent picks a tool + sends a query
            Environment runs the tool, charges the cost, returns results
            Results added to agent's context window
            IF agent calls "commit" → grade answer, compute reward, next question
            IF budget exhausted → episode ends immediately
END EPISODE
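The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the shipped server code: `run_episode`, `EpisodeLog`, and the `policy` callable signature are hypothetical names, tool execution is stubbed out, and two behaviors are assumptions (hitting the 8-step limit simply advances to the next question, and a call that drives the budget to zero or below ends the episode at once).

```python
from dataclasses import dataclass, field

# Costs from the tool table; commit is free.
TOOL_COSTS = {
    "calculator": 0.1, "code_executor": 0.3, "wiki_lookup": 0.5,
    "ceramic_search": 1.0, "llm_reason": 2.0, "commit": 0.0,
}
MAX_STEPS_PER_QUESTION = 8


@dataclass
class EpisodeLog:
    budget: float
    committed: int = 0
    tool_calls: list = field(default_factory=list)


def run_episode(questions, policy, budget=50.0):
    """Drive one episode. `policy(question, context, budget)` returns (tool, payload)."""
    log = EpisodeLog(budget=budget)
    for question in questions:
        context = []  # per-question tool results shown back to the agent
        for _ in range(MAX_STEPS_PER_QUESTION):
            tool, payload = policy(question, context, log.budget)
            if tool == "commit":
                log.committed += 1  # grading and the commit reward would happen here
                break
            log.budget -= TOOL_COSTS[tool]
            log.tool_calls.append(tool)
            context.append(f"{tool}({payload!r}) -> <stub result>")
            if log.budget <= 0:
                return log  # assumption: exhaustion ends the episode immediately
    return log


def calc_then_commit(question, context, budget):
    """Toy policy: one calculator call, then commit."""
    return ("commit", "2") if context else ("calculator", "1+1")
```

Running `run_episode(["q"] * 10, calc_then_commit)` commits all ten questions and leaves roughly 49.0 units, while a policy that always calls `llm_reason` exhausts the budget after 25 calls without committing once.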
The six tools
| Tool | Cost | What it does | Best for |
|---|---|---|---|
| calculator | 0.1 | Safe AST-based math expression evaluator | MATH arithmetic |
| code_executor | 0.3 | Sandboxed Python exec() with import blocking | HumanEval, complex algebra |
| wiki_lookup | 0.5 | Wikipedia REST API, first paragraph | Entity lookups |
| ceramic_search | 1.0 | Ceramic AI web search API, top-5 results | HotpotQA multi-hop |
| llm_reason | 2.0 | Together AI LLM call (Llama-3-8B), 512 tokens | GPQA graduate-level |
| commit | 0.0 | Submit answer for grading | Always free |
Costs span a 20:1 ratio from calculator to llm_reason. A single LLM reasoning call burns 4% of the entire episode budget. The agent must learn that this is sometimes worth it (GPQA) and sometimes wasteful (simple arithmetic).
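The two figures quoted above follow directly from the cost table. A small sketch, with `TOOL_COSTS`, `EPISODE_BUDGET`, and the helper names chosen here for illustration only:

```python
# Costs copied from the tool table; names are illustrative, not the env's API.
TOOL_COSTS = {
    "calculator": 0.1, "code_executor": 0.3, "wiki_lookup": 0.5,
    "ceramic_search": 1.0, "llm_reason": 2.0, "commit": 0.0,
}
EPISODE_BUDGET = 50.0


def budget_share(tool):
    """Fraction of the full episode budget consumed by one call to `tool`."""
    return TOOL_COSTS[tool] / EPISODE_BUDGET


def cost_ratio(expensive, cheap):
    """How many calls to `cheap` one call to `expensive` would pay for."""
    return TOOL_COSTS[expensive] / TOOL_COSTS[cheap]
```

`cost_ratio("llm_reason", "calculator")` gives the 20:1 spread, and `budget_share("llm_reason")` gives the 4% figure.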
[Diagram: tool routing. Each question is routed to calculator (0.1), code_executor (0.3, code execution), wiki_lookup (0.5, entity fact), ceramic_search (1.0, multi-hop factual), or llm_reason (2.0, hard science reasoning). Every result is appended to the context window. If the agent is confident it commits (0.0); otherwise it routes again.]
Observation space
At every step, the agent sees: the question text and domain tag, remaining budget and fraction thereof, tool call history and results for the current question, number of questions remaining, and running accuracy. The agent emits a structured action specifying tool selection, query/expression/code, and (for commit) an answer.
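One way to picture these two structures is as plain dataclasses. The field names below are hypothetical and chosen to mirror the description; the actual OpenEnv types in the repository may be named and shaped differently.

```python
from dataclasses import dataclass, field
from typing import Optional

TOOLS = {"calculator", "code_executor", "wiki_lookup",
         "ceramic_search", "llm_reason", "commit"}


@dataclass
class Observation:
    question: str
    domain: str                 # e.g. "MATH"
    remaining_budget: float     # cost units left in the episode
    budget_fraction: float      # remaining_budget / total budget
    questions_remaining: int
    running_accuracy: float
    history: list = field(default_factory=list)  # (tool, result) pairs for this question


@dataclass
class Action:
    tool: str
    query: Optional[str] = None   # query, expression, or code for non-commit tools
    answer: Optional[str] = None  # required only for commit

    def validate(self):
        """Raise ValueError if the action is not well-formed."""
        if self.tool not in TOOLS:
            raise ValueError(f"unknown tool: {self.tool}")
        if self.tool == "commit" and self.answer is None:
            raise ValueError("commit requires an answer")
        if self.tool != "commit" and not self.query:
            raise ValueError(f"{self.tool} requires a query")
```

Validating actions before dispatch keeps malformed model output from being charged as a tool call, though whether the real server rejects or penalizes such output is not specified here.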
Four domains
| Domain | Mix | Why it matters for tool selection |
|---|---|---|
| HotpotQA | 40% | Multi-hop factual QA — needs ceramic_search or wiki_lookup (multiple calls) |
| MATH | 30% | Competition math — calculator for arithmetic, code_executor for algebra, llm_reason for proofs |
| GPQA | 20% | Graduate-level science — often requires llm_reason, which costs 2.0 |
| HumanEval | 10% | Code generation — needs code_executor to verify, maybe llm_reason to plan |
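A seeded sampler for the domain mix might look like the sketch below. It assumes each question's domain is drawn independently with the table's weights; the environment could equally use fixed counts (4/3/2/1), which the text does not specify. `sample_domains` is a hypothetical helper.

```python
import random

# Mix from the domain table above.
DOMAIN_MIX = {"HotpotQA": 0.4, "MATH": 0.3, "GPQA": 0.2, "HumanEval": 0.1}


def sample_domains(n_questions=10, seed=0):
    """Draw one domain label per question, reproducibly for a given seed."""
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    domains = list(DOMAIN_MIX)
    weights = list(DOMAIN_MIX.values())
    return rng.choices(domains, weights=weights, k=n_questions)
```

Because the generator is seeded locally, two calls with the same seed return the same domain sequence, which is the property a reproducible episode needs.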
The Reward Formula — Deep Dive
This is the core intellectual contribution. The reward has two components that create constant pressure to be both correct and frugal.
Part 1: Step reward (every tool call)
Every tool call produces a negative reward equal to its cost, so the agent pays for each action as it takes it:

$$R_{\text{step}} = -c_{\text{tool}}$$
Part 2: Commit reward (on answer submission)
$$R_{\text{commit}} = r_{\text{wrong}} + \text{quality} \cdot (r_{\text{right}} - r_{\text{wrong}}) + \eta \cdot \gamma \cdot \frac{B_{\text{remaining}}}{B_{\text{total}}}$$

Tool costs are already charged as step rewards, so the episode return is the sum of all step rewards and commit rewards.
where:
- $r_{\text{wrong}} = -0.5$, $r_{\text{right}} = 1.0$ — wrong answers are punished, correct ones rewarded
- $\text{quality} \in [0,1]$ — computed from Exact Match (1.0) or Token F1 (partial credit)
- $\eta = \mathbb{1}[\text{quality} \geq 0.5]$ — efficiency bonus gate: only awarded if answer is at least half-right
- $\gamma = 0.1$ — efficiency weight
- $B_{\text{remaining}} / B_{\text{total}}$ — fraction of budget still unspent
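The commit reward defined by these terms can be written as a short function. This is a sketch built from the constants listed above; the function and constant names are illustrative rather than taken from the repository.

```python
R_WRONG = -0.5   # reward at quality 0
R_RIGHT = 1.0    # reward at quality 1, before the efficiency bonus
GAMMA = 0.1      # efficiency weight
GATE = 0.5       # minimum quality needed to earn the bonus


def commit_reward(quality, remaining_budget, total_budget=50.0):
    """Linear quality term plus a gated bonus for unspent budget."""
    if not 0.0 <= quality <= 1.0:
        raise ValueError("quality must lie in [0, 1]")
    base = R_WRONG + quality * (R_RIGHT - R_WRONG)
    eta = 1.0 if quality >= GATE else 0.0
    bonus = eta * GAMMA * remaining_budget / total_budget
    return base + bonus
```

For example, `commit_reward(1.0, 49.9)` is about 1.10, while `commit_reward(0.0, 49.5)` is exactly -0.5 because the gate withholds the bonus.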
Worked examples
| Scenario | Tools used | $R_{\text{step}}$ | Quality | $R_{\text{commit}}$ | Total |
|---|---|---|---|---|---|
| A: Right, cheap | 1× calculator (0.1) | −0.1 | 1.0 | +1.10 | +1.00 |
| B: Right, expensive | 3× ceramic_search (3.0) | −3.0 | 1.0 | +1.09 | −1.91 |
| C: Wrong | 1× wiki_lookup (0.5) | −0.5 | 0.0 | −0.50 | −1.00 |
| D: Partial (F1=0.6) | 1× llm_reason (2.0) | −2.0 | 0.6 | +0.50 | −1.50 |
Scenario A is the dream: right answer, cheap tool, big total reward. Scenario B shows that even a correct answer with excessive tool use produces a negative total. The formula makes cost-awareness unavoidable.
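As a check on Scenario B, the arithmetic can be written out. This assumes the question is the first in the episode, so the remaining budget is $50 - 3.0 = 47$, and that the bonus uses the budget left after the tool costs are charged:

$$R_{\text{commit}} = -0.5 + 1.0 \cdot (1.0 - (-0.5)) + 1 \cdot 0.1 \cdot \frac{47}{50} = 1.0 + 0.094 = 1.094$$

$$R_{\text{total}} = R_{\text{step}} + R_{\text{commit}} = -3.0 + 1.094 = -1.906 \approx -1.91$$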
Why this formula shape
- The efficiency bonus gate ($\eta$): Prevents a degenerate strategy where the agent commits immediately with a random guess to collect efficiency bonus without trying.
- Linear quality scaling: Partial credit (via Token F1) provides gradient signal even for close-but-not-exact answers, making learning easier.
- Budget-ratio efficiency: As budget drains, each correct answer is worth slightly less bonus, pushing the agent to be consistently frugal across all 10 questions.
Answer grading
Grading produces a quality score in $[0,1]$. The pipeline: (1) extract the answer from the agent's response (JSON parsing → prefix matching → last-line fallback), (2) normalize both prediction and ground truth (lowercase, remove articles and punctuation, tokenize), (3) compute Exact Match (binary) and Token F1 (the harmonic mean of token-level precision and recall), (4) quality = 1.0 if EM, else F1.
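Steps (2) to (4) can be sketched as follows. This follows the common SQuAD-style recipe and is an assumption about the details: the order of normalization steps and the handling of empty strings may differ in the shipped grader, and the answer-extraction step (1) is omitted.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation and English articles, split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction, truth):
    return normalize(prediction) == normalize(truth)


def token_f1(prediction, truth):
    """Harmonic mean of token precision and recall over the shared tokens."""
    pred, gold = normalize(prediction), normalize(truth)
    if not pred or not gold:
        return float(pred == gold)  # both empty counts as a match
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def quality(prediction, truth):
    """1.0 on exact match, otherwise partial credit from token F1."""
    return 1.0 if exact_match(prediction, truth) else token_f1(prediction, truth)
```

With these definitions, "The Eiffel Tower." matches "eiffel tower" exactly, while "eiffel tower paris" earns partial credit because one of its three tokens is not in the reference.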
Why GRPO
We train with Group Relative Policy Optimization (GRPO), introduced in DeepSeekMath (Shao et al., 2024) and used to train DeepSeek-R1.
No critic model needed. PPO requires a separate value network, which roughly doubles memory. GRPO eliminates it by estimating advantages from the relative rewards of the completions sampled for the same prompt.
Natural fit for verifiable rewards. Our reward is deterministic arithmetic — no learned reward model needed.
Accessible training. Single-node, significantly less VRAM than PPO.
The GRPO objective
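For each prompt $q$, GRPO samples a group of $G$ completions $\{o_1, \ldots, o_G\}$, and each completion receives a reward $r_i$. The advantage is the z-score of that reward within the group:

$$\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \ldots, r_G\})}{\operatorname{std}(\{r_1, \ldots, r_G\})}$$

The policy maximizes the clipped surrogate objective with a KL penalty toward a reference policy, in the form given by Shao et al. (2024):

$$J_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\left(\min\left(\rho_{i,t}\,\hat{A}_i,\ \operatorname{clip}(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_i\right) - \beta\, D_{\text{KL}}\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right]\right)\right]$$

where $\rho_{i,t} = \pi_\theta(o_{i,t} \mid q, o_{i,<t}) \,/\, \pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})$ is the token-level probability ratio, $\epsilon$ is the clipping range, and $\beta$ is the KL weight.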
In our setup, each completion is an entire multi-question episode trajectory: the agent's sequence of tool selections, queries, and commits across 10 questions under a shared budget. A trajectory that routes correctly and cheaply gets high reward; one that wastes budget or answers wrong gets low reward.
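The group-relative advantage is simple enough to compute by hand. The sketch below uses the population standard deviation and a small `eps` to avoid division by zero when every trajectory in a group earns the same return; both choices are assumptions, and training libraries may make different ones.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Z-score each reward against the other completions for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]
```

Feeding in the four episode returns from the worked examples, `group_relative_advantages([1.00, -1.91, -1.00, -1.51])`, gives the cheap-and-correct trajectory the only positive advantage. A group with identical returns yields all zeros, so it contributes no gradient signal.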
Why RL is the right framework
- Delayed rewards. You don't know if a tool call was helpful until you commit. The agent must assign credit backwards.
- Exploration. The agent must try different tool combinations to discover which work best per domain. No labeled "correct tool sequence" exists.
- Multi-step planning. 10 questions share one budget. A good agent plans across the whole episode — spending too much early leaves nothing for later.
What a Trained Agent Should Learn
A well-trained agent should exhibit these behaviors — none of which are explicitly programmed:
- Domain routing. Math question → calculator. Factual multi-hop → search. Graduate science → llm_reason. The agent learns the domain→tool mapping from reward signal alone.
- Confidence-based committing. If the calculator returns a clean number for an arithmetic question, commit immediately. Don't waste 0.5 on a Wikipedia lookup you don't need.
- Budget awareness. In early questions with plenty of budget, use ceramic_search. By question 8 with only 5 units left and 3 questions remaining, switch to calculator-only even for non-math questions.
- Failure recovery. If the first tool returns garbage, try a different tool rather than committing a bad answer.
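Fixed-policy baselines cannot adapt in these ways. Even the domain oracle only follows a hardcoded mapping; adjusting to budget pressure, confidence, and tool failures requires learning from feedback across thousands of episodes.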
What We Actually Built
- A complete environment server. FastAPI endpoints, WebSocket client support, per-session state, Docker/Hugging Face deployment metadata, and a browser demo route.
- A six-tool action space. Ceramic search, Wikipedia lookup, calculator, Python executor, LLM reasoning, and commit, each with explicit costs and normalized error handling.
- A verifiable reward function. Every tool call is penalized by cost; commits are scored with Exact Match, token F1, and an efficiency bonus gated by answer quality.
- Reference baselines and tests. Random, cheapest-first, and domain-oracle policies ship with unit tests covering the API, tools, sandbox behavior, and reward-facing contracts.
Baselines and Honest Results Status
Three shipped baselines
| Baseline | Policy | What it isolates |
|---|---|---|
| Random tool | Picks tool uniformly at random; commits after 3 steps with "I don't know" | Absolute floor — any RL agent that can't beat this is broken |
| Cheapest first | Calls tools in ascending cost order: calc→code→wiki→search→LLM | Great budget efficiency, terrible accuracy on factual/science questions |
| Domain oracle | Hardcoded domain→tool mapping (HotpotQA→search, MATH→calc, GPQA→LLM, HumanEval→code) | Performance ceiling for non-learning approaches |
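The three policies in the table can be sketched as small functions of the step index. The mappings and the cost ordering come from the table; everything else is an assumption made for illustration, including how many calls the oracle and cheapest-first policies make before committing and whether the random policy samples only non-commit tools. These are not the contents of the `baselines/` scripts.

```python
import random

ORACLE_ROUTING = {
    "HotpotQA": "ceramic_search", "MATH": "calculator",
    "GPQA": "llm_reason", "HumanEval": "code_executor",
}
CHEAPEST_FIRST_ORDER = ["calculator", "code_executor", "wiki_lookup",
                        "ceramic_search", "llm_reason"]


def domain_oracle(domain, step):
    """Assumed behavior: one call to the mapped tool, then commit."""
    return ORACLE_ROUTING[domain] if step == 0 else "commit"


def cheapest_first(step):
    """Walk the tools in ascending cost order, then commit."""
    if step < len(CHEAPEST_FIRST_ORDER):
        return CHEAPEST_FIRST_ORDER[step]
    return "commit"


def random_tool(step, rng):
    """Uniform random tool for 3 steps, then commit (with "I don't know")."""
    return rng.choice(CHEAPEST_FIRST_ORDER) if step < 3 else "commit"
```

Writing the baselines as pure functions of `(domain, step)` makes it easy to see what each one ignores: none of them reads the remaining budget or the tool results.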
Expected performance targets
| Metric | Random | Cheapest-first | Domain oracle | Target: RL-trained |
|---|---|---|---|---|
| Accuracy (avg) | ~20-30% | ~40-50% | ~65-75% | ≥ oracle accuracy |
| Avg budget spent | ~35/50 | ~8/50 | ~25/50 | < oracle, ≥ cheapest |
| Cost-adjusted reward | negative | low-positive | medium | highest |
The core claim fails if the trained policy cannot beat the domain oracle on cost-adjusted reward.
Honest status of the trained policy
We do not claim a converged, baseline-beating trained checkpoint. The research contribution in this submission is the environment, reward design, baselines, deployment path, and the training-ready interface. What we have is:
- Environment validated end-to-end. Reset/step API, tool dispatch, answer grading, reward calculation, concurrent session handling, and browser demo are implemented and covered by tests.
- Environment deployed and tested. HF Space serves concurrent sessions. WebSocket client connects, steps episodes, returns structured observations.
- All three baselines functional. They provide sanity checks for random exploration, low-cost heuristics, and hardcoded domain routing.
- No training logs yet. We were unable to complete Env Factory integration during the submission window because the current interface did not support our multi-tool action flow cleanly enough for reliable rollouts.
- Training risks identified honestly. See the next section for the concrete failure modes this environment is designed to expose.
GRPO Training: What This Environment Is Built to Test
The next research step is to train with TRL's GRPO via an OpenEnv-compatible rollout loop. We are careful about the claim: this submission ships the environment and baselines, not a final trained policy. The limiting factor was not the reward design or environment server; it was Env Factory integration. Our environment requires a model to make structured, repeated multi-tool calls across an episode, and we were not able to make that interaction reliable enough inside the current Env Factory path to produce trustworthy training logs before submission. We plan to continue the experiments as Env Factory stabilizes and as more post-training model series become available.
The failure modes below are the concrete behaviors the environment is designed to make measurable once that post-training loop is stable.
[Diagram: failure modes and how the environment surfaces them. Spending nothing and answering poorly is measured by the quality gate and wrong-answer penalty. Overusing expensive tools (solving early, losing budget) is measured by the shared budget and cumulative step costs. Collapsing to one domain (same tool everywhere) is measured against the domain-oracle baseline. Variable trajectory lengths (harder batching and credit assignment) are exposed by multi-step OpenEnv episodes.]
1. Reward scale mismatch
Step rewards (−0.1 to −2.0) and commit rewards (−0.5 to +1.1) operate on different scales. A useful trained policy must learn that a costly call can still be rational when it raises answer quality enough to recover the cost.
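One way to quantify this mismatch is a break-even calculation: how much must a call raise answer quality before the commit reward repays its cost? The sketch below ignores the efficiency bonus (which adds at most 0.1) and uses the reward constants stated earlier; `break_even_quality_gain` is a hypothetical helper, not part of the environment.

```python
R_WRONG = -0.5
R_RIGHT = 1.0


def break_even_quality_gain(tool_cost):
    """Quality increase at which the linear commit term exactly offsets the cost."""
    return tool_cost / (R_RIGHT - R_WRONG)
```

Under this simplified reading, a calculator call (0.1) pays for itself with a quality gain of about 0.07, and a ceramic_search call (1.0) needs about 0.67. Quality is bounded by 1.0, so any break-even value above 1.0, as with a 2.0-cost call, marks a charge that the commit reward on a single question cannot offset by itself.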
2. Budget-exhaustion cliff
When the agent exhausts its shared budget, the episode ends. This makes early overspending visible: a policy that solves the first few questions with expensive tools can lose the rest of the episode.
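A rough way to see the cliff is to ask how many questions a constant per-question spend can cover. The helper below is illustrative only and assumes the agent spends the same amount on every question.

```python
def questions_covered(per_question_spend, budget=50.0, n_questions=10):
    """Number of questions that can be fully paid for at a constant spend."""
    if per_question_spend <= 0:
        return n_questions  # free strategies never hit the cliff
    return min(n_questions, int(budget // per_question_spend))
```

Three ceramic_search calls per question (3.0) cover all ten questions. Three llm_reason calls per question (6.0) cover only eight, and eight llm_reason calls per question (16.0) cover three before the budget runs out.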
3. Variable-length trajectory handling
Episodes can end after different numbers of tool calls because agents commit, run out of steps, or spend the budget. That makes batching and credit assignment harder than single-turn QA, and it is exactly why a realistic tool-use environment matters.
4. Domain-collapse risk
The domain mix is intentionally uneven: 40% HotpotQA, 30% MATH, 20% GPQA, 10% HumanEval. A weak policy can overfit to the most common domain and call the same tool everywhere. The domain-oracle baseline makes that failure easy to detect.
5. The commit-immediately attractor
Because commit is free, an untrained agent can minimize spending by answering immediately. The quality gate blocks the efficiency bonus for poor answers, so a successful policy must learn the marginal value of information rather than simply learning to spend nothing.
Why OpenEnv
OpenEnv provides: (1) a standard WebSocket contract consumable by training clients, (2) per-session state with concurrent session support, and (3) a uniform deployment path — same code runs in-process for tests, as Docker for dev, and as a HF Space for training. The remaining integration work is specifically at the Env Factory/model-control layer: making repeated structured multi-tool calls stable enough for post-training rollouts.
Prior Work and Foundations
- Weitzman (1979) "Optimal Search for the Best Alternative" — foundational search economics. Information has a cost; rational agents should not search beyond expected marginal gain.
- SearchEconomicsEnv (Yashaswi Sharma / University of Southern California / Ceramic AI) — direct predecessor. Single-tool (search), single-dataset (HotpotQA), budget-constrained. Proved the principle.
- ReAct (Yao et al., 2023) — interleaving reasoning and tool calls. The paradigm our agent operates within.
- Toolformer (Schick et al., 2023) — self-supervised tool learning for LLMs.
- GRPO / DeepSeekMath (Shao et al., 2024) — group-relative advantages. Our training algorithm.
- DeepSeek-R1 (Guo et al., 2025) — GRPO at scale for reasoning.
- CATP-LLM (Wu et al., ICCV 2025) — cost-aware tool planning via offline RL. We differ: online GRPO, episode-level budget, broader benchmarks.
- Agent-R1 (Cheng et al., 2025) — end-to-end RL for LLM agents. Complementary: capability + our cost-awareness could compose.
- OpenEnv (Meta PyTorch) — base types, WebSocket protocol, submission framework.
Quick Start
# 1. Run the env locally
pip install -r requirements.txt
python app.py # FastAPI on port 8000
# 2. Or use the HF Space
export ENV_BASE_URL="https://landrew9-cost-aware-tool-env.hf.space"
# 3. Run baselines
python baselines/random_tool.py
python baselines/cheapest_first.py
python baselines/oracle.py
# 4. Train with GRPO (requires TRL + vLLM)
# See training client docs in RESEARCH.md
All episodes are seeded and reproducible. The Ceramic AI fallback client provides deterministic offline results when no API key is set, so the full environment runs without external dependencies.
What We Did Not Do (Yet)
- No converged checkpoint. Environment, baselines, tests, and deployment are complete; convergence is the next milestone.
- No Env Factory training logs yet. We could not complete a reliable Env Factory integration for repeated multi-tool calls in time for submission. This is planned follow-up work as the Env Factory path and available post-training model series mature.
- Fixed cost model. Real API costs are dynamic. Our fixed costs are a useful simplification.
- No human evaluation. All grading is automated (EM + F1). Human evaluation of answer quality is future work.
- Single budget per episode. Per-question budgets or adaptive budgets are natural extensions.
- Ceramic AI dependency. Live web search requires an API key. Fallback client enables offline training but loses real-world retrieval quality.
Conclusion
CostAwareToolEnv reframes a practical engineering problem, "LLM agents waste money on tools", as a verifiable RL task. It combines six tools with a 20:1 cost ratio, four domains requiring fundamentally different tool strategies, a shared episode budget, and a decomposed reward that penalizes every tool call while rewarding correct-and-frugal commits. The completed contribution is the environment: a deployed OpenEnv-native benchmark, explicit cost model, reward implementation, baselines, tests, and submission artifact. The research question (can a GRPO-trained LLM beat the domain oracle baseline on cost-adjusted score?) is now ready to evaluate cleanly. Convergence is the next milestone, not a current claim.
References
- Weitzman, M. "Optimal Search for the Best Alternative." Econometrica, 1979.
- Yao, S., et al. "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR 2023.
- Schick, T., et al. "Toolformer: Language Models Can Teach Themselves to Use Tools." NeurIPS 2023.
- Shao, Z., et al. "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." arXiv:2402.03300, 2024.
- Guo, D., et al. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv:2501.12948, 2025.
- Wu, Y., et al. "CATP-LLM: Empowering Large Language Models for Cost-Aware Tool Planning." ICCV 2025.
- Cheng, M., et al. "Agent-R1: Training Powerful LLM Agents with End-to-End RL." arXiv:2511.14460, 2025.
- Yang, Z., et al. "HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering." EMNLP 2018.
- Hendrycks, D., et al. "Measuring Mathematical Problem Solving with the MATH Dataset." NeurIPS 2021.
- Rein, D., et al. "GPQA: A Graduate-Level Google-Proof Q&A Benchmark." arXiv:2311.12022, 2023.
- Chen, M., et al. "Evaluating Large Language Models Trained on Code." arXiv:2107.03374, 2021.
- Schulman, J., et al. "Proximal Policy Optimization Algorithms." arXiv:1707.06347, 2017.