
iocaine: poisoning AI's

Iocaine is not just another anti-bot tool. It is a Rust-built tarpit that traps AI scrapers, feeds them synthetic nonsense, and turns content theft into self-inflicted data poisoning.

At first glance, the project looks easy to summarize: a tarpit that serves garbage to bots. That summary is not wrong, but it is too small. After reading through the stable 3.x branch, its workspace layout, dependency choices, and changelog, the more accurate description is sharper: Iocaine has grown into a programmable HTTP decision engine that can classify requests, generate poisoned content, preserve legitimate traffic, export metrics, and, in recent versions, optionally cooperate with nftables.

The Problem It Attacks

The web has always had robots. Search engines, RSS readers, uptime monitors, academic crawlers, archive tools, SEO crawlers, and link checkers are not new. The newer pain is the intensity and entitlement of large-scale scraping, especially when public websites, documentation, blogs, and Git forges are treated as raw input for machine-learning pipelines.

Traditional defenses are imperfect. robots.txt is a polite request. Rate limiting reduces volume but does not necessarily change intent. User-Agent blocking becomes theater the moment the crawler starts lying. Commercial WAF and CDN tools help, but they also add opacity, cost, and dependency. Iocaine chooses a different axis: do not simply reject the crawler; let it keep crawling, but put it in the wrong place.

The trick is not to spend your resources fighting the crawler. The trick is to make the crawler spend its resources on itself.

The basic topology is simple:

Internet

   ↓

Reverse proxy: Caddy, Nginx, HAProxy, or similar

   ↓

Iocaine

   ├── benign visitor → fallback to the real backend

   └── suspicious scraper → poisoned content, fake links, maze behavior, metrics, and possible firewalling

 

Iocaine is designed to sit between upstream resources and the fronting reverse proxy. The proxy effectively asks Iocaine whether a request should see the real site. If the request looks benign, the proxy falls back to the backend. If it looks suspicious, the visitor receives plausible garbage and links into a maze.

The economics matter. The project is engineered so that serving garbage is cheap for the operator, close to the cost of serving a static file. Meanwhile the crawler must download, parse, queue, classify, and follow links. That flips the cost asymmetry.

The 3.x series is where Iocaine becomes more than a toy. The configuration system was replaced, request handlers became mandatory, handlers became responsible for rendering output, and a single instance gained the ability to run multiple servers. Alongside HTTP and metrics servers, the project added an HAProxy SPOA mode. It also ships with a built-in configuration, request handler, and training corpus, so it can function out of the box.

That design says something important: Iocaine is not trying to be a static blocklist. It is trying to be a programmable traffic policy engine.

The operator can now express decisions around trusted User-Agents, trusted IPs, trusted paths, ASNs, CIDR ranges, poisoned URLs, headers, query parameters, cookies, and request metadata. It is HTTP triage with a mean streak.
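
To make the idea of an ordered traffic policy concrete, here is a minimal Python sketch. It is not Iocaine's handler language or its built-in rule set; the Request and Policy types, the example paths, networks, and agent strings are all assumptions made up for illustration. The point is only that the first matching rule decides whether a request sees the real site or the garbage.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Request:
    path: str
    client_ip: str
    user_agent: str


@dataclass
class Policy:
    # Every value below is a made-up example, not an Iocaine default.
    trusted_paths: tuple = ("/feed.xml", "/.well-known/")
    trusted_networks: tuple = (ipaddress.ip_network("192.0.2.0/24"),)
    poisoned_prefix: str = "/maze/"
    trusted_agents: tuple = ("UptimeProbe",)
    suspicious_agents: tuple = ("GPTBot", "CCBot")


def decide(req: Request, policy: Policy) -> str:
    """Return "real" or "garbage"; the first matching rule wins."""
    if any(req.path.startswith(p) for p in policy.trusted_paths):
        return "real"
    ip = ipaddress.ip_address(req.client_ip)
    if any(ip in net for net in policy.trusted_networks):
        return "real"
    # Poisoned URLs are checked before User-Agent trust, because a
    # User-Agent string is trivial to spoof.
    if req.path.startswith(policy.poisoned_prefix):
        return "garbage"
    if any(a in req.user_agent for a in policy.trusted_agents):
        return "real"
    if any(a in req.user_agent for a in policy.suspicious_agents):
        return "garbage"
    return "real"  # default: assume a benign visitor
```

The rule order is itself a design choice. Placing trusted paths first keeps feeds and well-known endpoints safe even when the client looks bot-like, which is usually what an operator wants.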

The Stack: Rust Outside, Script Inside

The stable 3.x branch uses the Rust 2024 edition with a minimum supported Rust version of 1.88. That is a reasonable technical choice for a piece of edge infrastructure: predictable performance, safe concurrency, strong typing, and easy static deployment.

The workspace is split into four main crates:

iocaine          → the main executable and operational integration

iocaine-powder   → the core library, intended for embedding Iocaine logic elsewhere

iocaine-label    → bundled third-party embedded resources, including Fennel and ai.robots.txt data

iocaine-table    → a wrapper around an optional dependency, keeping the rest of the project cleaner

 

This split is not cosmetic. It is a separation of responsibilities. iocaine-powder turns the core idea into an embeddable library. iocaine remains the complete application. iocaine-label isolates vendored resources so they can be updated or patched downstream. iocaine-table hides an optional feature-bound dependency.

The dependency graph tells the rest of the story:

axum / tower-http      → HTTP server and middleware

Tokio                  → asynchronous runtime

tracing                → structured logging

Prometheus             → metrics

Figment + KDL/TOML/YAML/JSON → flexible configuration

mlua + Fennel          → Lua/Fennel scripting

Roto                   → scripting and rule logic

maxminddb              → GeoIP/ASN-based decisions

ipnet / ipnet-trie     → networks, CIDRs, efficient matching

regex / aho-corasick   → text matching and multi-pattern scanning

upon                   → templating

minify-html            → compact generated output

mimalloc               → custom allocator on x86-64 Linux

libnftables1-sys       → experimental nftables integration

rust-embed             → embedded resources inside the binary

 

That stack is not accidental. It is built to classify traffic cheaply, not merely block it crudely.

The nonsense content is part of the weapon. It must be cheap to generate, plausible enough to keep crawlers engaged, and poisonous enough to degrade naive collection pipelines.

If the response looked like an obvious block page, the crawler could abandon the path, rotate identity, switch to browser automation, or retry more aggressively. If the response looks like content, the crawler continues. This is the difference between a wall and a swamp.

A wall says: do not enter. A swamp says: come a little farther.
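
One cheap way to produce plausible nonsense is a word-level Markov chain trained on a small corpus. The sketch below is an illustration of that general technique, not Iocaine's generator; the corpus, the function names, and the choice to seed the generator from the request path (so the same URL always yields the same page) are assumptions.

```python
import hashlib
import random
from collections import defaultdict

# Toy corpus; a real deployment would train on far more text.
CORPUS = (
    "the quick brown fox jumps over the lazy dog while "
    "the quiet cat watches the brown dog sleep"
)


def train(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def generate(chain, seed_text, length=20):
    """Emit `length` words, deterministically derived from seed_text."""
    digest = hashlib.sha256(seed_text.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    starts = sorted(chain)  # sorted so the choice is reproducible
    word = rng.choice(starts)
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        # A word with no recorded follower is a dead end: restart.
        word = rng.choice(followers) if followers else rng.choice(starts)
        out.append(word)
    return " ".join(out)
```

Generation is a handful of dictionary lookups per word, which is why this kind of output can be served for close to the cost of a static file.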

Poisoned URLs: The Clever Part

One of the strongest ideas in Iocaine is the poisoned URL. A simple collector enters the maze and receives fake links carrying identifiers. Later, if a different agent comes back with one of those URLs (perhaps from another IP, perhaps using a real browser, perhaps with a more convincing User-Agent), Iocaine has a strong signal. A normal human visitor would not have discovered that URL.

That turns crawler behavior into evidence. The bot carries the poison forward. When it later tries to appear legitimate, it reveals itself by the path it follows.
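
The mechanism can be sketched with a keyed hash. This is not how Iocaine encodes its identifiers; the /maze/ prefix, the secret, and both function names are hypothetical. The sketch only shows that a server can later recognize a link it minted without storing every link it ever handed out.

```python
import hashlib
import hmac
import secrets

# Hypothetical secret; in practice it would come from configuration.
SECRET = b"example-secret-change-me"


def _tag(token: str) -> str:
    """Truncated HMAC-SHA256 of the token, as hex."""
    return hmac.new(SECRET, token.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def make_poisoned_url(prefix: str = "/maze/") -> str:
    """Mint a fake link that carries a random token and its tag."""
    token = secrets.token_hex(8)
    return f"{prefix}{token}-{_tag(token)}"


def is_poisoned(path: str, prefix: str = "/maze/") -> bool:
    """True only for paths whose tag matches the token they carry."""
    if not path.startswith(prefix):
        return False
    token, sep, tag = path[len(prefix):].partition("-")
    if not sep:
        return False
    # Compare as bytes in constant time.
    return hmac.compare_digest(tag.encode("utf-8"), _tag(token).encode("ascii"))
```

Any later request for which is_poisoned returns True must have learned the URL from the maze, regardless of the IP address or User-Agent it now presents.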

Earlier deployments often used HTTP 421 Misdirected Request as the fallback signal: if Iocaine decided a request should see the real site, it returned a status the reverse proxy could intercept and route to the upstream backend. In 3.5.0, the built-in script gained the ability to configure that fallback status instead of hardcoding 421.

That sounds small, but it matters. Different proxies and environments may prefer different integration contracts.
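
The contract itself is small enough to sketch. The toy server below is not Iocaine; the marker list and function names are assumptions, and only the 421 status and port 42069 come from this article. It answers suspicious clients with a 200 and garbage, and everyone else with an empty response carrying the configurable fallback status, which the reverse proxy is expected to intercept.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

FALLBACK_STATUS = 421  # configurable; 421 is the historical choice
SUSPICIOUS_MARKERS = ("GPTBot", "CCBot")  # made-up example list

GARBAGE = (
    b"<html><body><p>Plausible nonsense.</p>"
    b'<a href="/maze/next">more</a></body></html>'
)


def build_response(user_agent, fallback_status=FALLBACK_STATUS):
    """Return (status, headers, body) for one request."""
    if any(marker in user_agent for marker in SUSPICIOUS_MARKERS):
        return 200, {"Content-Type": "text/html; charset=utf-8"}, GARBAGE
    # Benign: send no content, only the signal the proxy intercepts.
    return fallback_status, {}, b""


class TarpitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        user_agent = self.headers.get("User-Agent", "")
        status, headers, body = build_response(user_agent)
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=42069):
    """Blocking helper; call it manually to try the sketch behind a proxy."""
    ThreadingHTTPServer(("127.0.0.1", port), TarpitHandler).serve_forever()
```

Because the status is a parameter rather than a constant, the same logic can sit behind a proxy that already uses 421 for something else.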

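With Caddy, the pattern is roughly the following. The handle_response directive has no body, so it writes nothing itself and a 421 from Iocaine falls through to the reverse_proxy that follows: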

example.com {

  @read method GET HEAD

 

  reverse_proxy @read 127.0.0.1:42069 {

    @fallback status 421

    handle_response @fallback

  }

 

  reverse_proxy 127.0.0.1:8080

}

 

With Nginx, the same idea is to intercept the fallback status and jump to the real backend:

location / {

    proxy_pass http://127.0.0.1:42069;

    proxy_intercept_errors on;

    error_page 421 = @real_backend;

}

 

location @real_backend {

    proxy_pass http://127.0.0.1:8080;

}

 

This is also where the danger lives. Misconfigure this layer and you can poison Googlebot, RSS readers, ActivityPub fetchers, uptime monitors, APIs, or real users. Good tools do not protect you from sloppy operations.

Modern Iocaine is not a fixed list of bot names. It has a scripting engine with Roto and optional Lua/Fennel support. The request handler can reason over headers, cookies, query parameters, source addresses, CIDRs, ASNs, User-Agents, Sec-CH-UA, poisoned URLs, trusted paths, trusted IPs, trusted User-Agents, metrics, and challenge responses.

That is where the project becomes surgical. It does not need to treat all machine traffic as hostile. A feed reader, a webhook, a monitoring probe, a legitimate search engine, and a hostile scraper can be treated differently.

A defense mechanism without observability is superstition. Iocaine exposes Prometheus metrics for requests, generated garbage, maze depth, request handler hits, process statistics, and, in recent versions, firewall blocks.

That allows the operator to ask concrete questions: how many requests reached the real backend, how many were fed poison, which rules are firing, whether the maze is being explored, whether upstream load dropped, and whether firewalling is doing anything useful.
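
As a rough illustration of answering the first two questions from a metrics scrape, the sketch below sums counters from Prometheus text-format output and computes the share of requests that were fed poison. The metric names are hypothetical placeholders, not Iocaine's real metric names, and the parser assumes samples without trailing timestamps.

```python
def parse_metrics(text):
    """Sum sample values per metric name, ignoring labels and comments."""
    totals = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field (no timestamps assumed).
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        totals[name] = totals.get(name, 0.0) + float(value)
    return totals


def poison_ratio(
    totals,
    garbage="example_garbage_served_total",  # hypothetical metric name
    requests="example_requests_total",  # hypothetical metric name
):
    """Fraction of all requests that received garbage."""
    total = totals.get(requests, 0.0)
    return totals.get(garbage, 0.0) / total if total else 0.0


SAMPLE = """\
# TYPE example_requests_total counter
example_requests_total{host="example.com"} 900
example_requests_total{host="git.example.com"} 100
# TYPE example_garbage_served_total counter
example_garbage_served_total{host="example.com"} 150
example_garbage_served_total{host="git.example.com"} 100
"""
```

In a real setup the same ratio would more naturally be a PromQL expression on a dashboard; the script form is just easier to read in an article.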

Starting in the 3.3.0 series, Iocaine gained experimental nftables-based firewalling. It talks to the kernel through netlink without relying on external binaries. A built-in rule can push certain bad actors into firewall blocks, and later versions redesigned this subsystem for simplicity and performance.

I would be conservative here. The tarpit and metrics are the first tools to use. Kernel-level blocking should come after the operator understands false positives and traffic patterns. A clever firewall is still a foot-gun if pointed at the wrong visitor.

How I Would Deploy It After Reading the Code

I would not throw Iocaine in front of everything immediately. The sane deployment path is layered:

1. Start with a lab domain, not a commercial site.

2. Run Iocaine locally behind Caddy or Nginx.

3. Enable Prometheus metrics from day one.

4. Define trusted paths for RSS, ActivityPub, APIs, webhooks, and known good bots.

5. Test with curl, known bad User-Agents, normal browsers, and real feed readers.

6. Watch logs for several days.

7. Only then move it in front of real public services.
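
Step 5 is easier to repeat if it is written down as a small test matrix. The sketch below is one possible shape, under loud assumptions: the paths, User-Agent strings, and expected verdicts are examples, and the probe relies on a hypothetical marker comment that you would place in the real site's HTML so that garbage pages can be told apart.

```python
import http.client

# (path, User-Agent, expected verdict); all rows are examples.
CASES = [
    ("/", "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0", "real"),
    ("/feed.xml", "ExampleFeedReader/1.0", "real"),
    ("/", "GPTBot/1.0", "garbage"),
]


def probe(host, port, path, user_agent, marker=b"<!-- real-site -->"):
    """Fetch one URL and report whether the real site answered."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path, headers={"User-Agent": user_agent})
        body = conn.getresponse().read()
    finally:
        conn.close()
    return "real" if marker in body else "garbage"


def run_matrix(cases, probe_fn):
    """Return the cases whose actual verdict differs from the expected one."""
    failures = []
    for path, user_agent, expected in cases:
        actual = probe_fn(path, user_agent)
        if actual != expected:
            failures.append((path, user_agent, expected, actual))
    return failures
```

Against a lab host this could be called as run_matrix(CASES, lambda path, ua: probe("lab.example.com", 80, path, ua)), where lab.example.com is a placeholder. An empty result means every row behaved as expected; scripted probes complement, but do not replace, testing with real browsers and real feed readers.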

 

I would test it first on a personal site, documentation, a public directory, or a Git forge. I would not start with a production education portal, a payment path, a login flow, or anything whose organic search visibility matters.

Iocaine is strongest around technical blogs, public documentation, wikis, Forgejo/Gitea/cgit instances, static sites, personal sites, and services suffering from abusive crawler traffic. Git forges are an especially good fit because commits, diffs, file trees, raw files, blame pages, and history create a huge surface for crawlers to waste time on.

Where It Can Hurt You

It is risky on commercial, SEO-heavy sites, public APIs, e-commerce, educational portals, federated services that were not carefully tested, and environments with aggressive reverse-proxy caching. The obvious disaster is caching garbage and serving it to humans. The second disaster is poisoning a legitimate crawler. The third is forgetting that RSS, ActivityPub, webhooks, and monitoring tools can look bot-like from a distance.

Iocaine is spectacular because it is not a sanitized enterprise appliance. It has attitude. It is technical, sarcastic, and hostile in the right direction. But beneath the attitude there is real engineering: Rust, async IO, scripting, flexible configuration, embedded resources, metrics, network matching, ASN logic, templates, persistent state, configurable fallback, and experimental firewall integration.

The project changes the question. The old question was: how do I make the bot go away? Iocaine asks: how do I make the bot keep walking into the wrong room, spending its money instead of mine?
