I made an AI agent scrape 19 real websites. Here's what I learned.

Written by Brittany Joiner

Everyone keeps saying AI agents are expensive to run. I wanted to know where the money actually goes.

So our team built agent-browser-shield, a layer that sits between a browser-use agent and the pages it visits. It strips out the junk before the page ever reaches the model. Cookie banners, ads, footers, chat widgets, all the stuff a human ignores without thinking. Then I ran a real test to see if it actually mattered.

Before you assume this is another boring marketing piece: I didn't run this to prove the tool wins. I ran it because I think the honest version is more useful. It matters to know where a tool makes a difference and where it doesn't, and the only way to know that is to measure it on real sites and look at the real numbers, good and bad.

So here's what I found.

How I actually ran this

I tried to make this something anyone could rerun, not a number I pulled out of thin air.

I wrote a list of 19 tasks. Each one is a real site, a real goal, and a clear definition of what counts as success. Things like "find the cheapest white IKEA Billy bookcase," "get the top air fryer on Target," "pull the newest React bug off GitHub."

Then our cloud harness drives a real browser-use agent through each task two ways:

Baseline: the agent hits the raw page, no shield.
Guarded: the same agent, same task, but the shield is on and stripping the junk first.

Every task runs 3 times, each in a fresh session, so one lucky pass or one weird fluke can't skew the result. As it goes, the harness counts the tokens from the actual agent runs, works out the cost from real per-token pricing, and tracks how often the task actually passed. Then it spits out a report you can read side by side.
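To make that bookkeeping concrete, here's a minimal sketch of how per-run numbers could roll up into one line of the report. This is not the harness's actual code. The Run fields, the input/output token split, the prices, and the sample runs are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Run:
    input_tokens: int
    output_tokens: int
    passed: bool


# Hypothetical per-token prices in USD; substitute your model's real rates.
PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000
PRICE_PER_OUTPUT_TOKEN = 15.00 / 1_000_000


def summarize(runs):
    """Average tokens and cost per run, plus pass rate, for one task in one mode."""
    n = len(runs)
    total_tokens = sum(r.input_tokens + r.output_tokens for r in runs)
    total_cost = sum(
        r.input_tokens * PRICE_PER_INPUT_TOKEN
        + r.output_tokens * PRICE_PER_OUTPUT_TOKEN
        for r in runs
    )
    return {
        "avg_tokens": total_tokens / n,
        "avg_cost_usd": total_cost / n,
        "pass_rate": sum(r.passed for r in runs) / n,
    }


# Made-up runs for a single task: three baseline, three guarded.
baseline = summarize([Run(12000, 400, False), Run(11000, 350, True), Run(13000, 450, False)])
guarded = summarize([Run(5000, 400, True), Run(4800, 380, True), Run(5200, 420, True)])
```

Putting the baseline and guarded summaries next to each other is the side-by-side view: same task, same metrics, shield off versus on.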

If you want the exact commands and setup, it's all in the benchmark README. You can clone it and run the whole thing yourself.

The junk on a page is a tax you pay on every step

Here's the part nobody measures. An agent doesn't read a page once. It re-reads it on every single step as it works toward the goal. So every cookie banner and every footer link isn't a one-time cost. You're paying for it again and again, the whole way through the task.

On a clean page that's fine. On a cluttered one it adds up fast, and you never see the bill until you add up the tokens.
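Here's the back-of-the-envelope version of that tax. It's a simplified model, assuming the agent re-reads the whole page on every step and the page doesn't change between steps. The token counts and step count are invented for illustration, not measured.

```python
def tokens_for_task(useful_tokens, junk_tokens, steps):
    """Total page tokens read if the agent re-reads the full page at every step."""
    return steps * (useful_tokens + junk_tokens)


# Made-up numbers: 1,000 tokens of useful content, 3,000 of clutter, 10 steps.
raw = tokens_for_task(1_000, 3_000, 10)
cleaned = tokens_for_task(1_000, 0, 10)

# Share of the raw run's page tokens that went to junk.
junk_share = (raw - cleaned) / raw
```

In this toy case the clutter is paid for ten times over, once per step, so 3,000 junk tokens on the page turn into 30,000 junk tokens on the bill.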

Where the shield paid off

The wins landed exactly where you'd guess: messy pages. Commerce, search results, listings. The stuff that's mostly noise.

GitHub trending: 71% fewer tokens
Weather lookup: 62% fewer tokens
Target air fryer: 51% fewer tokens
GitHub React bug: 42% fewer tokens
Etsy shop sales: 37% fewer tokens

In real money, the weather task dropped from $0.034 a run to $0.013. That's a small number until you're running thousands of these a day, which is exactly what people building agents are doing.
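To see how that scales, here's a quick sketch using the two per-run costs for the weather task. The daily volume is a hypothetical figure, so plug in your own.

```python
baseline_cost = 0.034  # USD per run, weather task on the raw page
guarded_cost = 0.013   # USD per run, weather task behind the shield


def daily_savings(baseline, guarded, runs_per_day):
    """Dollars saved per day at a given run volume."""
    return (baseline - guarded) * runs_per_day


# Fractional cost reduction per run (roughly 62%).
reduction = (baseline_cost - guarded_cost) / baseline_cost

# Hypothetical volume: 5,000 runs of this one task per day.
saved = daily_savings(baseline_cost, guarded_cost, 5_000)
```

At that assumed volume, two cents a run works out to about $105 a day for a single task.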

The surprise: the cheaper agent was also the better agent

This is the one I didn't see coming. I figured stripping the page would save tokens but maybe cost me a little accuracy. Less context, more mistakes, right?

Opposite happened.

The Target air fryer task passed 1 out of 3 times on the raw page. Behind the shield it passed 3 out of 3. GitHub trending did the same thing, 1 out of 3 to a clean sweep. Etsy went from 2 out of 3 to 3 out of 3.

Turns out all that noise wasn't just expensive. It was distracting the agent, pulling it toward the wrong buttons and burying the thing it actually needed. Clear the clutter and the agent stops getting lost. Cheaper and more reliable at the same time, which almost never happens.

Where it didn't help (the honest part)

It's not magic, and I'm not going to pretend it is.

On pages that were already clean, the shield didn't help and sometimes hurt. A plain Wikipedia article or a simple docs page has barely any junk to strip, so the extra processing was just overhead. A couple of those tasks actually used more tokens with the shield on.

So I'm not telling you to slap this on every site you touch. That's not the takeaway. The win comes from the messy pages, and the messy pages are exactly where agents do most of their real work: shopping, searching, scraping, filling out forms.

If you're running agents at any kind of volume, especially scraping, that's where I'd point the benchmark. Run it on your own tasks and see if it actually saves you money. Your sites aren't my sites, so don't take my word for it. Measure it.
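One way to act on your own numbers is a simple per-site gate. This is a hypothetical sketch, not something the shield or the benchmark does for you. The dictionary keys, the 10% threshold, and the sample results are all assumptions for illustration.

```python
def should_enable_shield(baseline, guarded, min_token_saving=0.10):
    """Enable only if the pass rate did not drop and tokens fell by a meaningful margin."""
    if guarded["pass_rate"] < baseline["pass_rate"]:
        return False
    saving = 1 - guarded["avg_tokens"] / baseline["avg_tokens"]
    return saving >= min_token_saving


# Made-up benchmark results for two kinds of site.
messy_shop = should_enable_shield(
    baseline={"avg_tokens": 12_000, "pass_rate": 1 / 3},
    guarded={"avg_tokens": 6_000, "pass_rate": 1.0},
)
clean_docs = should_enable_shield(
    baseline={"avg_tokens": 3_000, "pass_rate": 1.0},
    guarded={"avg_tokens": 3_200, "pass_rate": 1.0},
)
```

With these sample numbers the messy shop gets the shield and the clean docs page doesn't, which matches the pattern I saw: keep it on where it saves tokens without costing passes, and leave it off everywhere else.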

So what?

Noise is the silent tax on every agent you run. Most people never measure it, so they never know they're paying it. I didn't either until I put real numbers next to it.

That's the problem agent-browser-shield is going after. We're building the safety and efficiency layer for the agentic web out in the open.

We need your help!

We're adding new cleaning rules and new benchmark tasks all the time, and we want to do it based on how people are actually using this, not on what we guess. The harder questions (like teaching it to tell a noisy page from a clean one on its own) are exactly the ones we're working through right now.

Clone the benchmark and run it on your own tasks.

If you do, open an issue or a PR and share what you got, even if it shows the shield made things worse. Especially if it made things worse. That's the data that actually makes this better, and building it honestly in the open is the whole point.

If that's the kind of thing you'd build with us, the repo is right here. Star it to follow our progress and support the project! 🛡️