Show HN: Terminal-Bench-RL: Training long-horizon terminal agents with RL

(github.com)

124 points | by Danau5tin 2 days ago

12 comments

tjungblut 2 days ago
If you are curious, like me, about how the actual reinforcement learning happens: it uses verl [1] underneath. The paper "HybridFlow: A Flexible and Efficient RLHF Framework" [2] explains it really well.
[1] https://github.com/volcengine/verl
[2] https://arxiv.org/abs/2409.19256v2
anorwell 2 days ago
Some of the comments so far seem to be misunderstanding this submission. As I understand it:
1. Custom scaffolding (system prompt and tools) using Qwen3-32B achieved 13.75% on Terminal-Bench. No training was involved.
2. The author has built an RL system, but it has not been used for anything due to cost limitations.
So there's actually no result related to training here. It is well known that the scaffolding used can have a large impact on benchmark outcomes (the Terminal-Bench leaderboard also demonstrates this [1]).
[1] https://www.tbench.ai/leaderboard
- esafak 2 days ago
  It looks like the submission has two aspects that are being conflated.
  1. Tooling for training a terminal agent.
  2. An agent that was _not_ trained with this tooling but was prompt-engineered instead. I could not find the author's discussion on this point.
OtherShrezzing 2 days ago
That you've spent in the low thousands (by the looks of it) and managed to beat GPT-4.1 is an amazing insight into the moat of the big AI labs.
rboyd 2 days ago
Great work! There should be a way for entities to crowdfund model training. Can a model like this be partially evaluated during training, so that early stopping saves cost?
What are the best papers/resources on SOTA long-horizon RL?
Thanks.
TarasBob 2 days ago
I'm willing to help fund this if the creator is interested. I sent him an email.
enigma101 2 days ago
Did you consider a Kickstarter to overcome the GPU poorness? 30 to 50 should be doable.
bravesoul2 2 days ago
Wow, amazing! It's amazing that a "one-person band" can do this much. It crosses many skill sets.
thomasfromcdnjs 2 days ago
How much did you spend?
lostmsu a day ago
Why do you need 50k? Can't you tune using LoRA?
- Danau5tin a day ago
  Exactly my first thought when I realised the cost! LoRA is currently not supported by rLLM (the team told me they aim to support it in the next release), but it is certainly possible to port to verl directly or to another RL framework. I just did not have the time to port again (I have already done so twice, as other RL frameworks had issues).
erdaltoprak 2 days ago
This is incredible work