
Crawler reliability action plan (from #319 investigation) #365

@simonsmallchua


Context

This closes the investigation phase of #319: probing the original failing-domain set plus a wider sample surfaced five distinct failure categories. This issue tracks the resulting action plan; #319 is closed as the investigation that produced it.

Of the original #319 domains, smashingmagazine, changelog, fly.io, render.com, and clerk.com now return HTTP 200 to HoverBot/1.0 — likely cleared by the HTTP/2 disable (15d146f3) and the DomainPacer changes. The remaining failure modes concentrate in the five categories below.

Action plan

Each item below states the problem and solution in plain English, then the technical change, then four scores on a 1–5 scale: compliance/transparency (5 = fully honest), complexity/LoC (5 = big), increased throughput (5 = big), and reduced wasted resources (5 = big).
1. WAF walls (Cloudflare, Imperva, DataDome, Akamai)

Problem: Some sites put a security wall in front of their pages. The wall lets /robots.txt and /sitemap.xml through (those are designed to be machine-readable), so we happily build a 10,000-task job — then every page hits the wall and 403s. We retry, exhaust, and bill the time anyway.

Solution: Recognise the wall on first contact, stop knocking. Two layers: (a) a pre-flight probe that fetches the homepage before enqueueing any tasks; if a WAF is detected, abort the job with a clear domain-blocked status. (b) A mid-job circuit breaker for jobs that slip past the probe — if the first N pages all return WAF markers, mark the domain blocked and short-circuit the remaining tasks. Surface the block to the customer so they can ask the site owner for an allowlist.

Technical change: Add response-fingerprint checks in the crawler response handler (cf-mitigated: header, _Incapsula_Resource body marker, Server: DataDome, tiny-body-with-403/202). New domains.waf_blocked flag. Pre-flight probe step in job initialisation. Circuit-breaker counter in the job runner. Customer-facing block status on the job. A detector sketch follows this item.

Compliance/transparency: 5 — we're honouring exactly what these sites are telling us. The opposite of a workaround.
Complexity/LoC: 2 — small detector + DB flag + two branches (pre-flight + circuit). ~250 LOC.
Increased throughput: 3 — affected jobs finish near-instantly instead of grinding through 10K useless tasks. Doesn't change healthy-domain throughput.
Reduced wasted resources: 5 — eliminates the worst single source of doomed task volume on the platform.
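A minimal sketch of the response-fingerprint detector, limited to the markers named above. DetectWAF, WAFVendor, and the 2 KB tiny-body threshold are illustrative assumptions, not decided names or values:

```go
package crawler

import (
	"bytes"
	"net/http"
	"strings"
)

// WAFVendor records which wall we hit, for failure_details.
type WAFVendor string

const (
	WAFNone       WAFVendor = ""
	WAFCloudflare WAFVendor = "cloudflare"
	WAFImperva    WAFVendor = "imperva"
	WAFDataDome   WAFVendor = "datadome"
	WAFUnknown    WAFVendor = "unknown"
)

// DetectWAF inspects one response for the fingerprint markers.
// body may be truncated; only the markers above are checked.
func DetectWAF(resp *http.Response, body []byte) WAFVendor {
	if resp.Header.Get("cf-mitigated") != "" {
		return WAFCloudflare
	}
	if bytes.Contains(body, []byte("_Incapsula_Resource")) {
		return WAFImperva
	}
	if strings.EqualFold(resp.Header.Get("Server"), "DataDome") {
		return WAFDataDome
	}
	// A tiny body on a 403 (or DataDome's 202 interstitial) is a
	// strong challenge-page signal even without a vendor marker.
	if (resp.StatusCode == http.StatusForbidden || resp.StatusCode == http.StatusAccepted) &&
		len(body) > 0 && len(body) < 2048 {
		return WAFUnknown
	}
	return WAFNone
}
```

The same function serves both layers: the pre-flight probe calls it on the homepage response, and the circuit breaker counts non-WAFNone results over the first N tasks.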
2. Shopify stores (expensive URL families, hard throttling)

Problem: Shopify stores have predictable expensive URL families (sort/filter/locale combos) and the platform throttles us hard. We crawl them blindly today — locale fan-out and filter combinatorics blow up the URL set 10–100×, then Shopify throttles, we retry, and jobs drag for hours. Skipping at request time wouldn't help — the tasks already exist and have to be paid for. The fix has to happen at discovery, so the bad URLs never become tasks.

Solution: Detect Shopify, then follow Shopify's own published rules before enqueueing: drop the expensive URL families during sitemap/link discovery, collapse locale prefixes to canonical paths, use the segmented product/page sitemaps, and run one task at a time per store. Smaller jobs finish faster because there are fewer tasks, not because each task is faster.

Technical change: Detect via the powered-by: Shopify header. Apply Shopify's robots.txt template denylist (filter/sort/cart/account patterns) at the link-extraction + sitemap-parse layer, not at request time. Sitemap parser uses /sitemap_products_1.xml / _pages_1.xml / _collections_1.xml with query params preserved (Shopify returns 400 if ?from=…&to=… is stripped). Locale-prefix canonicalisation (/en-sk/foo → /foo) at discovery. DomainPacer concurrency cap = 1 for Shopify. A discovery-filter sketch follows this item.

Compliance/transparency: 5 — every rule we apply is one Shopify itself publishes (robots template, sitemap structure, locale convention).
Complexity/LoC: 4 — multiple touch points: detection, sitemap parser, denylist at discovery, locale dedup, pacer override. ~800 LOC.
Increased throughput: 5 — Shopify long-tail jobs shrink 10–100× and finish proportionally faster. Glossier-class stores stop being multi-hour jobs.
Reduced wasted resources: 5 — fewer tasks, fewer retries from throttling, fewer locale duplicates persisted. Compounding save.
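A sketch of the discovery-time filter, to show where it sits (link extraction, not request time). The prefix list and query patterns paraphrase Shopify's public robots template; KeepShopifyURL and the locale regex are hypothetical names for illustration:

```go
package discovery

import (
	"net/url"
	"regexp"
	"strings"
)

// Path prefixes Shopify's own robots.txt template disallows.
var shopifyDenyPrefix = []string{
	"/cart", "/checkout", "/account", "/orders", "/admin", "/search",
}

// Filter/sort query params that multiply one collection page into
// hundreds of near-duplicate URLs.
var shopifyDenyQuery = regexp.MustCompile(`(^|&)(sort_by|filter\.|q)=`)

// Locale prefixes like /en-sk/ collapse to the canonical path at
// discovery time, e.g. /en-sk/foo -> /foo.
var localePrefix = regexp.MustCompile(`(?i)^/[a-z]{2}(-[a-z]{2})?/`)

// KeepShopifyURL decides whether a candidate link becomes a task,
// returning the locale-canonicalised path if it survives.
func KeepShopifyURL(u *url.URL) (canonicalPath string, keep bool) {
	path := localePrefix.ReplaceAllString(u.Path, "/")
	for _, p := range shopifyDenyPrefix {
		if strings.HasPrefix(path, p) {
			return "", false
		}
	}
	if shopifyDenyQuery.MatchString(u.RawQuery) {
		return "", false
	}
	return path, true
}
```

Because this runs during sitemap parsing and link extraction, the denied URLs never exist as tasks, which is the whole point of doing it here rather than in the request path.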
3. User-agent substring blocks (Amazon, target.com.au, woolworths.com.au)

Problem: Some sites block our HoverBot/1.0 user-agent specifically. The likely cause is a substring match against a banned-word list (bot, crawl, spider) — the cheapest filter, and it runs first. We're hitting the dumbest layer of their defence with the word "Bot" in the UA.

Solution: Two compliant fixes. (a) Rename the UA to Hover/1.0 (+https://www.goodnative.co/hover). We still identify as automated via the contact URL, still publish a bot info page, still honour robots.txt — we just drop the trigger word. Honest, not a workaround. (b) For sites that still block (rare — usually Akamai/DataDome with stricter rules), an outreach process: publish a public Hover bot info page and request allowlist inclusion from site owners. Slow, manual, scales by demand only.

Technical change: One-line UA constant change in internal/crawler/config.go (and propagate through the robots.txt parser, which currently extracts hoverbot from the UA string). Add a /.well-known/hover-bot or marketing page documenting purpose, rate limits, contact email, and the allowlist request flow. A sketch of the constant change follows this item.

Compliance/transparency: 5 — both options are honest. Option (a) is just renaming ourselves to dodge a naive substring filter; we're still openly identifying as an automated service.
Complexity/LoC: 1 — the UA constant change is a few lines. The marketing/info page is content work, not engineering. ~30 LOC.
Increased throughput: 2 — only affects sites whose block trigger is the word "bot". A real but bounded set.
Reduced wasted resources: 3 — those sites currently fail every page task and burn the retry budget; the rename converts them to working crawls.
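The change itself, with the robots.txt product token derived from the UA so the parser can't drift out of sync with it. The constant and function names are assumptions about internal/crawler/config.go, not its actual contents:

```go
package config

import "strings"

// UserAgent drops the "Bot" trigger word but keeps the contact URL,
// so we still openly identify as an automated service.
const UserAgent = "Hover/1.0 (+https://www.goodnative.co/hover)"

// RobotsToken returns the product token robots.txt rules match on
// ("Hover"), derived from UserAgent instead of a second constant.
func RobotsToken() string {
	tok, _, _ := strings.Cut(UserAgent, "/")
	return tok
}
```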
4. Opaque failures and blind retries

Problem: When a crawl fails we can't tell why — DNS? TCP? TLS? WAF? Slow page? Soft-block? 4xx? Everything buckets as "failed", so #319-style investigations require fresh manual curl probes. We also retry blindly: a WAF block gets retried 3× just like a transient TCP failure.

Solution: Tag every failure with a specific class, store structured details in JSON, and derive the retry policy from the class. Terminal failures (WAF, soft-block, 4xx, parse error) stop retrying entirely. Transient failures (DNS, TCP, 5xx, 429) retry with class-appropriate backoff. We get observability and faster job completion from the same change.

Technical change: Migration: ALTER TABLE tasks ADD COLUMN failure_class text, ADD COLUMN failure_details jsonb. failure_class enum: dns, tcp, tls, http_4xx, http_5xx, http_429, waf, soft_block, ttfb_timeout, body_timeout, parse_error. failure_details JSONB carries: HTTP status, selected response headers (server, cf-ray, cf-mitigated, retry-after, content-length, content-type), detection markers triggered, a body snippet (512 B, sanitised), the Go error class, and a timing breakdown (DNS/TCP/TLS/TTFB/total). Per-class retry table in config: terminal classes don't retry; transient classes get class-specific backoff, with retry-after honoured for 429/5xx. A retry-table sketch follows this item.

Compliance/transparency: 5 — pure observability + smarter retries. Zero deception.
Complexity/LoC: 3 — touches every error path in the crawler client, plus a migration and a config-driven retry table. ~400 LOC + migration.
Increased throughput: 4 — terminal-failure short-circuiting stops infinite retry loops; smarter backoff on 429/5xx clears throttle storms faster; queue slots free up sooner.
Reduced wasted resources: 5 — the biggest single saving on retry waste. Today a 4xx page might cost 4× its real cost in retries; after this, 1×.
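A sketch of the class taxonomy and the config-driven retry table. The class names mirror the proposed enum; the retry counts and backoff durations are placeholder assumptions, not decided values:

```go
package retry

import "time"

type FailureClass string

const (
	ClassDNS         FailureClass = "dns"
	ClassTCP         FailureClass = "tcp"
	ClassTLS         FailureClass = "tls"
	ClassHTTP4xx     FailureClass = "http_4xx"
	ClassHTTP5xx     FailureClass = "http_5xx"
	ClassHTTP429     FailureClass = "http_429"
	ClassWAF         FailureClass = "waf"
	ClassSoftBlock   FailureClass = "soft_block"
	ClassTTFBTimeout FailureClass = "ttfb_timeout"
	ClassBodyTimeout FailureClass = "body_timeout"
	ClassParseError  FailureClass = "parse_error"
)

// Policy is one row of the retry table.
type Policy struct {
	MaxRetries       int
	BaseBackoff      time.Duration
	HonourRetryAfter bool // obey the Retry-After header when present
}

// Terminal classes get the zero Policy (no retries); transient
// classes get class-appropriate backoff.
var Policies = map[FailureClass]Policy{
	ClassDNS:         {MaxRetries: 2, BaseBackoff: 30 * time.Second},
	ClassTCP:         {MaxRetries: 2, BaseBackoff: 10 * time.Second},
	ClassTLS:         {MaxRetries: 1, BaseBackoff: 10 * time.Second},
	ClassHTTP5xx:     {MaxRetries: 3, BaseBackoff: time.Minute, HonourRetryAfter: true},
	ClassHTTP429:     {MaxRetries: 3, BaseBackoff: 2 * time.Minute, HonourRetryAfter: true},
	ClassTTFBTimeout: {MaxRetries: 1, BaseBackoff: 5 * time.Minute},
	ClassBodyTimeout: {MaxRetries: 1, BaseBackoff: 5 * time.Minute},
	// Terminal: retrying cannot change the outcome.
	ClassHTTP4xx:    {},
	ClassWAF:        {},
	ClassSoftBlock:  {},
	ClassParseError: {},
}
```

Keeping this as data rather than branching logic is what makes it config-driven: tuning a backoff is a table edit, not a code change through every error path.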
5. Client-side SPAs (React/Vue/Next.js)

Problem: Single-page apps ship a near-empty HTML shell; the real page only exists after JavaScript runs. We see nothing useful — app.posthog.com returned 0 paragraphs and 0 anchors. We don't really waste resources on these (one HTTP fetch is cheap), but we get zero data value from the task.

Solution: Run a real browser ourselves: self-hosted headless Chromium in a separate worker fleet, routed only to SPA-detected domains. Vendor services are a non-starter — they're competitors, an ongoing per-render cost, and we'd hand them our crawl volume.

Technical change: Integrate Chromium via chromedp or Playwright in a dedicated worker pool. Browser sandboxing, a separate Fly machine class with more RAM, an independent scaling profile. Browser pool with reused contexts. Block ads/fonts/images/third-party trackers (~30–50% wall-time win). Network-idle timeout instead of full load. An SPA detector (low text-to-script ratio, single root div, near-empty body) routes only matching domains to the fleet; a detector sketch follows this item. New failure_details.rendering field tying back to item 4. Per-page cost is 10–100× the HTTP path: ~50–200 MB RAM vs 1–5 MB; ~500–2000 ms CPU vs 10–50 ms; ~1–5 s wall time vs 100–500 ms.

Compliance/transparency: 5 — fully transparent; we're just executing the page as a browser would, same as Googlebot.
Complexity/LoC: 10/5 (off the chart) — new worker fleet, browser sandbox, separate Fly machine config, distinct memory/CPU profile, new retry semantics, ongoing infra-operator burden. ~2000+ LOC + new infra. Defer until the cheaper wins are in.
Increased throughput: 2 — only affects the SPA bucket (~1–5% of tasks). Doesn't speed up the bulk.
Reduced wasted resources: 2 — SPA tasks today "succeed" with an empty body, so they're not really wasted resources, just low-value successes. The win is data quality, not resource recovery.
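A sketch of the SPA detector heuristic named above (low text-to-script ratio, single framework mount point, near-empty body). The thresholds, mount-point IDs, and regex-based HTML handling are rough illustrative assumptions; a real implementation would likely use a proper HTML parser:

```go
package render

import (
	"regexp"
	"strings"
)

var (
	scriptRe = regexp.MustCompile(`(?is)<script\b.*?</script>`)
	tagRe    = regexp.MustCompile(`(?s)<[^>]+>`)
	// Common React/Vue/Next.js mount-point ids.
	mountRe = regexp.MustCompile(`(?i)<div[^>]+id=["'](root|app|__next)["']`)
)

// LooksLikeSPA flags a near-empty HTML shell: a framework mount
// point, almost no rendered text, and script dominating the payload.
func LooksLikeSPA(html string) bool {
	scriptLen := 0
	for _, m := range scriptRe.FindAllString(html, -1) {
		scriptLen += len(m)
	}
	stripped := tagRe.ReplaceAllString(scriptRe.ReplaceAllString(html, " "), " ")
	text := strings.Join(strings.Fields(stripped), " ")
	// The app.posthog.com case (0 paragraphs, 0 anchors) lands here.
	return mountRe.MatchString(html) && len(text) < 500 && scriptLen > 5*len(text)
}
```

Running this on the plain-HTTP fetch we already do means routing to the browser fleet costs nothing extra for the ~95%+ of tasks that aren't SPAs.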

Recommended order

  1. #3 (UA rename) — 30 LOC, ships first, unblocks the Amazon / AU retail tail.
  2. #1 (WAF detection + pre-flight) — biggest single saving on doomed task volume.
  3. #4 (failure granularity + retry policy) — observability + retry-waste reduction in the same change. Unblocks future #319-style diagnosis from the dashboard.
  4. #2 (Shopify handler) — biggest throughput win for the Shopify long tail; ship after #1 + #4 provide the framework (WAF detection + failure classes).
  5. #5 (self-hosted headless Chromium) — defer until 1–4 are in production. 10/5 effort, lowest throughput impact, highest infra cost.

Out of scope for this issue
