The game environment looks pretty neat. Not surprised to see LLMs struggling, but with a benchmark to focus new techniques on, I'm excited to see how some of the new solutions trying to top the leaderboard will do.
What business has the smallest context window to operate?
Like maybe if you can have constraints in place such that the space of variables is minimal we already have economically relevant AI
Like a drop shipping t-shirt thing - surely the right sequence of LMs can
(1) parse out vibes/trends (e.g., "67 is currently a meme") (2) tool call that out to a print shop (3) spam it on twitter
Seems like there's just so much white space on benchmarks and gyms for this
Even in the minimal example there are way more variables than it first seems.
1. How many shirts do we order? 2. When is it worth moving on to the next trend? 3. How should we handle shipping? Do we market globally or locally?
Even the smallest business requires a lot of balancing of priorities and planning for the long run with uncertain returns
True
What's like the most minimally scoped business someone could operate entirely digitally though? Is it the drop ship crap? Or maybe like a web game w/ ad revenue?
Webgame with ad rev might work -- I was thinking some kind of churned out self-publishing of children's books? Though I'm not sure if you'd actually turn a profit. Whatever it was, it'd definitely have to be heavily engineered though -- custom tools, and basically a glorified flow chart.
It feels like we are pretty far away from LLMs running a concession stand (see Andon Labs), so not surprised they would struggle here. Still, the failure modes are super interesting, and having benchmarks seems to be the starting point for domain-specific improvements.
Saving this for the next time I get over-caffeinated and try to convince my friends that economically viable AI will make their CPG business irrelevant
I'm kinda curious how a VLM would do -- better spatial reasoning but worse planning? I don't use an AI web browser, but I'd be curious to know what happens if you throw something like OpenAI Atlas at the game's webpage.
So there are a couple of papers that try to use LLMs for UI-based enterprise task benchmarking, like WorkArena++ (ServiceNow), where the agent has to solve a couple of relatively simple enterprise tasks (like creating incident tickets based on criteria the agent has to determine itself, etc.). This benchmark in particular had quite low accuracy numbers, especially on the more composite tasks. Curious about the OpenAI Atlas thing too.
Have you talked to Alex Duffy from Good Start Labs? Recommend reaching out
Insanely cool
cool