| 1 |
Some sites put a security wall in front of their pages (Cloudflare, Imperva, DataDome, Akamai). The wall lets /robots.txt and /sitemap.xml through (those are designed to be machine-readable), so we happily build a 10,000-task job — then every page hits the wall and 403s. We retry, exhaust the retry budget, and bill the time anyway.
Recognise the wall on first contact, stop knocking. Two layers: (a) a pre-flight probe that fetches the homepage before enqueueing any tasks; if a WAF is detected, abort the job with a clear domain-blocked status. (b) a mid-job circuit breaker for jobs that slip past the probe — if the first N pages all return WAF markers, mark the domain blocked and short-circuit the remaining tasks. Surface the block to the customer so they can ask the site owner for an allowlist.
Add response-fingerprint checks in the crawler response handler (cf-mitigated header, _Incapsula_Resource body marker, Server: DataDome header, tiny body paired with a 403/202 status). New domains.waf_blocked flag. Pre-flight probe step in job initialisation. Circuit-breaker counter in the job runner. Customer-facing block status on the job.
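A minimal sketch of the fingerprint check, shared by the pre-flight probe and the circuit breaker. The marker list is the one above; the package layout, the WAFVendor type, and the DetectWAF name are illustrative, not existing code:

```go
package crawler

import (
	"net/http"
	"strings"
)

// WAFVendor identifies which wall we hit; empty means none detected.
type WAFVendor string

const (
	WAFNone       WAFVendor = ""
	WAFCloudflare WAFVendor = "cloudflare"
	WAFImperva    WAFVendor = "imperva"
	WAFDataDome   WAFVendor = "datadome"
	WAFUnknown    WAFVendor = "unknown"
)

// DetectWAF checks a response for the markers listed above. The body
// prefix is passed in so the caller controls how much is read.
func DetectWAF(resp *http.Response, body []byte) WAFVendor {
	if resp.Header.Get("cf-mitigated") != "" {
		return WAFCloudflare
	}
	if strings.EqualFold(resp.Header.Get("Server"), "DataDome") {
		return WAFDataDome
	}
	if strings.Contains(string(body), "_Incapsula_Resource") {
		return WAFImperva
	}
	// Generic challenge signature: a tiny interstitial body with 403/202.
	if (resp.StatusCode == http.StatusForbidden || resp.StatusCode == http.StatusAccepted) &&
		len(body) < 2048 {
		return WAFUnknown
	}
	return WAFNone
}
```

The probe calls this once on the homepage response; the job runner counts consecutive non-empty results over the first N pages and trips the breaker.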
5 — we're honouring exactly what these sites are telling us. The opposite of a workaround. |
2 — small detector + DB flag + two branches (pre-flight + circuit). ~250 LOC. |
3 — affected jobs finish near-instantly instead of grinding through 10K useless tasks. Doesn't change healthy-domain throughput. |
5 — eliminates the worst single source of doomed task volume on the platform. |
| 2 |
Shopify stores have predictable expensive URL families (sort/filter/locale combos) and the platform throttles us hard. We crawl them blindly today — locale fan-out and filter combinatorics blow up the URL set 10–100×, then Shopify throttles, we retry, jobs drag for hours. Skipping at request time wouldn't help — the tasks already exist and have to be paid for. The fix has to happen at discovery so the bad URLs never become tasks. |
Detect Shopify, then follow Shopify's own published rules before enqueueing: drop the expensive URL families during sitemap/link discovery, collapse locale prefixes to canonical paths, use the segmented product/page sitemaps, run one task at a time per store. Smaller jobs finish faster because there are fewer tasks, not because each task is faster. |
Detect via powered-by: Shopify header. Apply Shopify's robots.txt template denylist (filter/sort/cart/account patterns) at the link extraction + sitemap parse layer, not at request time. Sitemap parser uses /sitemap_products_1.xml / _pages_1.xml / _collections_1.xml with query params preserved (Shopify returns 400 if ?from=…&to=… is stripped). Locale-prefix canonicalisation (/en-sk/foo → /foo) at discovery. DomainPacer concurrency cap = 1 for Shopify. |
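A sketch of the discovery-time filter, assuming it runs on every URL coming out of link extraction and sitemap parsing, with path and query already split. The deny prefixes paraphrase Shopify's robots.txt template; the regexes and the KeepShopifyURL name are illustrative:

```go
package discovery

import (
	"regexp"
	"strings"
)

// Deny prefixes from Shopify's published robots.txt template (abridged).
var shopifyDenyPrefixes = []string{
	"/cart", "/checkout", "/account", "/orders", "/search", "/admin",
}

// sortFilterParams catches the combinatoric sort/filter query families.
var sortFilterParams = regexp.MustCompile(`(^|&)(sort_by=|filter\.|q=)`)

// localePrefix is a rough match for /en-sk/-style prefixes; real code
// should check against the store's actual locale list to avoid false hits.
var localePrefix = regexp.MustCompile(`^/[a-z]{2}(-[a-z]{2})?/`)

// KeepShopifyURL decides at discovery time whether a URL becomes a task,
// returning the locale-canonicalised path if it does.
func KeepShopifyURL(path, rawQuery string) (canonical string, keep bool) {
	for _, p := range shopifyDenyPrefixes {
		if strings.HasPrefix(path, p) {
			return "", false
		}
	}
	if sortFilterParams.MatchString(rawQuery) {
		return "", false
	}
	// Collapse /en-sk/foo -> /foo so locale duplicates never become tasks.
	return localePrefix.ReplaceAllString(path, "/"), true
}
```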
5 — every rule we apply is one Shopify itself publishes (robots template, sitemap structure, locale convention). |
4 — multiple touch points: detection, sitemap parser, denylist at discovery, locale dedup, pacer override. ~800 LOC. |
5 — Shopify long-tail jobs shrink 10–100× and finish proportionally faster. Glossier-class stores stop being multi-hour jobs. |
5 — fewer tasks, fewer retries from throttling, fewer locale duplicates persisted. Compounding save. |
| 3 |
Some sites block our HoverBot/1.0 user-agent specifically (Amazon, target.com.au, woolworths.com.au). The likely cause is a substring match against a banned-word list (bot, crawl, spider) — the cheapest filter, runs first. We're hitting the dumbest layer of their defence with the word "Bot" in the UA. |
Two compliant fixes: (a) Rename the UA to Hover/1.0 (+https://www.goodnative.co/hover). We still identify as automated via the contact URL, still publish a bot info page, still honour robots.txt — we just drop the trigger word. Honest, not a workaround. (b) For sites that still block (rare — usually Akamai/DataDome with stricter rules), outreach process: publish a public Hover bot info page and request allowlist inclusion from site owners. Slow, manual, scales by demand only. |
One-line UA constant change in internal/crawler/config.go (and propagate it through the robots.txt parser, which currently extracts hoverbot from the UA string). Add a /.well-known/hover-bot or marketing page documenting purpose, rate limits, contact email, and the allowlist request flow.
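The engineering half is roughly this, assuming the UA lives in a single constant (identifier names and file layout here are illustrative):

```go
package crawler

import "strings"

// Before: "HoverBot/1.0 (+https://www.goodnative.co/hover)"
const UserAgent = "Hover/1.0 (+https://www.goodnative.co/hover)"

// RobotsToken derives the product token ("Hover") that the robots.txt
// parser matches against User-agent lines, so the two can't drift apart.
func RobotsToken() string {
	return strings.SplitN(UserAgent, "/", 2)[0]
}
```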
5 — both options are honest. Option (a) is just renaming ourselves to dodge a naive substring filter; we're still openly identifying as an automated service. |
1 — UA constant change is a few lines. Marketing/info page is content work, not engineering. ~30 LOC. |
2 — only affects sites whose block trigger is the word "bot". Real but bounded set. |
3 — those sites currently fail every page task and burn the retry budget; rename converts them to working crawls. |
| 4 |
When a crawl fails we can't tell why — DNS? TCP? TLS? WAF? Slow page? Soft-block? 4xx? Everything buckets as "failed", so #319-style investigations require fresh manual curl probes. We also retry blindly: a WAF block gets retried 3× just like a transient TCP failure. |
Tag every failure with a specific class, store structured details in JSON, and derive retry policy from the class. Terminal failures (WAF, soft-block, 4xx, parse error) stop retrying entirely. Transient failures (DNS, TCP, 5xx, 429) retry with class-appropriate backoff. We get observability and faster job completion from the same change. |
Migration: ALTER TABLE tasks ADD COLUMN failure_class text, ADD COLUMN failure_details jsonb. failure_class enum: dns, tcp, tls, http_4xx, http_5xx, http_429, waf, soft_block, ttfb_timeout, body_timeout, parse_error. failure_details JSONB carries: HTTP status, selected response headers (server, cf-ray, cf-mitigated, retry-after, content-length, content-type), detection markers triggered, body snippet (512 B, sanitised), Go error class, timing breakdown (DNS/TCP/TLS/TTFB/total). Per-class retry table in config: terminal classes don't retry; transient classes get class-specific backoff and retry-after honoured for 429/5xx. |
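A sketch of the class-driven retry decision, assuming classification happens upstream in the response handler. The enum values mirror the migration; the policy numbers and the RetryPolicy/Decide names are illustrative stand-ins for the config-driven table:

```go
package retry

import "time"

// FailureClass mirrors the tasks.failure_class enum (abridged here).
type FailureClass string

const (
	ClassDNS       FailureClass = "dns"
	ClassTCP       FailureClass = "tcp"
	ClassTLS       FailureClass = "tls"
	ClassHTTP4xx   FailureClass = "http_4xx"
	ClassHTTP5xx   FailureClass = "http_5xx"
	ClassHTTP429   FailureClass = "http_429"
	ClassWAF       FailureClass = "waf"
	ClassSoftBlock FailureClass = "soft_block"
	ClassParse     FailureClass = "parse_error"
)

// RetryPolicy: absence from the table means terminal, never retried.
type RetryPolicy struct {
	MaxAttempts int
	BaseBackoff time.Duration // doubled per attempt
}

// Hard-coded for illustration; config-driven in the real system.
var policies = map[FailureClass]RetryPolicy{
	ClassDNS:     {MaxAttempts: 3, BaseBackoff: 5 * time.Second},
	ClassTCP:     {MaxAttempts: 3, BaseBackoff: 5 * time.Second},
	ClassTLS:     {MaxAttempts: 2, BaseBackoff: 10 * time.Second},
	ClassHTTP5xx: {MaxAttempts: 2, BaseBackoff: 30 * time.Second},
	ClassHTTP429: {MaxAttempts: 3, BaseBackoff: time.Minute},
	// waf, soft_block, http_4xx, parse_error: terminal.
}

// Decide reports whether to retry and after how long. retryAfter carries a
// parsed Retry-After header, which overrides backoff for 429/5xx.
func Decide(class FailureClass, attempt int, retryAfter time.Duration) (bool, time.Duration) {
	p, ok := policies[class]
	if !ok || attempt >= p.MaxAttempts {
		return false, 0 // terminal class or retry budget exhausted
	}
	if retryAfter > 0 && (class == ClassHTTP429 || class == ClassHTTP5xx) {
		return true, retryAfter
	}
	return true, p.BaseBackoff << attempt
}
```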
5 — pure observability + smarter retries. Zero deception. |
3 — touches every error path in the crawler client, plus a migration and a config-driven retry table. ~400 LOC + migration. |
4 — terminal-failure short-circuiting stops infinite retry loops; smarter backoff on 429/5xx clears throttle storms faster; queue slots free up sooner. |
5 — biggest single saving on retry waste. Today a 4xx page might cost 4× its real cost in retries. After this, 1×. |
| 5 |
Single-page apps (React/Vue/Next.js client-side) ship a near-empty HTML shell; the real page only exists after JavaScript runs. We see nothing useful — app.posthog.com returned 0 paragraphs and 0 anchors. We don't really waste resources on these (one HTTP fetch is cheap), but we get zero data value from the task. |
Run a real browser ourselves. Self-hosted headless Chromium in a separate worker fleet, routed only to SPA-detected domains. Vendor services are a non-starter — they're competitors, ongoing per-render cost, and we'd hand them our crawl volume. |
Integrate Chromium via chromedp or Playwright in a dedicated worker pool. Browser sandboxing, separate Fly machine class with more RAM, independent scaling profile. Browser pool with reused contexts. Block ads/fonts/images/third-party trackers (~30–50% wall-time win). Network-idle timeout instead of full load. SPA detector (low text-to-script ratio, single root div, near-empty body) routes only matching domains to the fleet. New failure_details.rendering field tying back to #4. Per-page cost is 10–100× the HTTP path: ~50–200 MB RAM vs 1–5 MB; ~500–2000 ms CPU vs 10–50 ms; ~1–5 s wall vs 100–500 ms. |
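A minimal render sketch with chromedp, assuming a pre-warmed browser pool supplies the parent context; pool management, request blocking, and the network-idle wait are omitted here, and the package/function names are illustrative:

```go
package render

import (
	"context"
	"time"

	"github.com/chromedp/chromedp"
)

// Render navigates the page in headless Chromium and returns the
// post-JavaScript DOM. The 5s budget mirrors the wall-time figures above.
func Render(parent context.Context, url string) (string, error) {
	ctx, cancel := chromedp.NewContext(parent)
	defer cancel()
	ctx, cancel = context.WithTimeout(ctx, 5*time.Second)
	defer cancel()

	var dom string
	err := chromedp.Run(ctx,
		chromedp.Navigate(url),
		chromedp.OuterHTML("html", &dom),
	)
	return dom, err
}
```

The returned DOM feeds the existing extraction pipeline exactly like a plain HTTP body, so everything downstream of the fetch stays unchanged.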
5 — fully transparent, just executing the page as a browser would, same as Googlebot. |
10/5 (off the chart) — new worker fleet, browser sandbox, separate Fly machine config, distinct memory/CPU profile, retry semantics, ongoing infra-operator burden. ~2000+ LOC + new infra. Defer until cheaper wins are in. |
2 — only affects the SPA bucket (~1–5% of tasks). Doesn't speed up the bulk. |
2 — SPA tasks today "succeed" with empty body, so they're not really wasted resources, just low-value successes. The win is data quality, not resource recovery. |
Context
Closes the investigation phase of #319: probing the original failing-domain set plus a wider sample surfaced five distinct failure categories. This issue tracks the resulting action plan; #319 is closed as the investigation that produced it.
Of the original #319 domains, smashingmagazine, changelog, fly.io, render.com, and clerk.com now return HTTP 200 to HoverBot/1.0, likely cleared by the HTTP/2 disable (15d146f3) and the DomainPacer changes. The remaining failure modes concentrate in the five categories above.