Show HN: Web-eval-agent – Let the coding agent debug itself (github.com)

76 points by neversettles 1 day ago | 12 comments

proc0 1 day ago [-]

Interesting. I see from the video example it took a lot of steps and there is a lot of output for a simple task. I'm thinking this probably doesn't scale very well and more complex tasks might have performance challenges. I do think it's the right direction for AI coding.

neversettles 23 hours ago [-]

Yeah, I suppose to esafak's point, perhaps a benchmark for browser agent QA testing would be needed.

gitroom 10 hours ago [-]

Gotta say, getting rid of all the clicking and checking just sounds like a huge win. I hate wasting time on all that.

klntsky 15 hours ago [-]

I told Windsurf to install Playwright, identify the crucial workflows of the app, and add tests for them. Not without my input, but I got what I wanted without getting my hands dirty.

Does this thing add much on top?

neversettles 12 hours ago [-]

The power here is that the coding agent can test visually, like a human would. So if a button isn't visible, the browser agent would use vision to detect that it's missing.

It sorta tests 'just like a human would' to make sure the flow that's implemented works as expected.

esafak 1 day ago [-]

Is there a benchmark for this? If not, you ought to (crowd?)start one for everybody's sake.

neversettles 1 day ago [-]

We started with browser-use because they had the best evals: https://browser-use.com/posts/sota-technical-report

- but we found that Laminar came out with a better browser agent (& a better eval): https://www.lmnr.ai/ so we're looking to migrate over soon!

nico 1 day ago [-]

Looks amazing. Congrats on the release!

How does this compare to Browser MCP (https://browsermcp.io/)?

neversettles 1 day ago [-]

In Browser MCP, it looks like Cursor controls each action along the way. What we actually wanted was a single browser agent with a high-quality eval that could perform all the actions independently (browser-use).

GreenGames 1 day ago [-]

This is very cool! Does your MCP server preserve cookies/localStorage between steps, or would developers need to manually script auth handshakes?

neversettles 1 day ago [-]

Between steps it would preserve cookies, but atm when the Playwright browser launches, it starts with a fresh browser state, so you'd have to go through OAuth to log in each time.

We're adding browser state persistence soon, hoping to enable it so that once you sign in with Google, it can stay signed in on your local machine.
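
For anyone curious what that kind of persistence can look like, here is a minimal sketch using Playwright's storage state feature (Python sync API). This is not web-eval-agent's code or its planned implementation. The `browser_state.json` file name and the `context_options` and `run_session` helpers are made up for illustration, and it assumes Playwright and its Chromium build are installed.

```python
from pathlib import Path

# Hypothetical location for the saved cookies/localStorage snapshot.
STATE_FILE = Path("browser_state.json")


def context_options(state_file: Path = STATE_FILE) -> dict:
    """Build kwargs for browser.new_context().

    If a saved state file exists, reuse it so the session starts logged in;
    otherwise return no options, which gives a fresh browser state.
    """
    if state_file.exists():
        return {"storage_state": str(state_file)}
    return {}


def run_session(url: str, state_file: Path = STATE_FILE) -> None:
    """Open a page, let the user or agent act, then save the state to disk."""
    # Imported here so the helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        context = browser.new_context(**context_options(state_file))
        page = context.new_page()
        page.goto(url)
        # ... sign in manually once, or let the browser agent do its steps ...
        # Write cookies and localStorage out so the next launch can reuse them.
        context.storage_state(path=str(state_file))
        browser.close()
```

The first call to `run_session` starts fresh and writes the state file after you sign in; later calls pick that file up and skip the login. The file holds live session cookies, so it should stay on the local machine and out of version control.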

GreenGames 1 day ago [-]

Oh okay thanks - that would be fire tbh
