prototype: H2O bench framework #1

Merged
singaraiona merged 40 commits into RayforceDB:master from
ser-vasilich:prototype
May 8, 2026
Conversation

Contributor

@ser-vasilich ser-vasilich commented May 8, 2026

Summary

  • Full H2O.ai canonical bench (groupby q1..q7, join_inner/left, sort_single/multi) across 8 engines: rayforce, polars, duckdb, chdb, datafusion, pandas, questdb, timescale.
  • Subprocess-isolated workers, identical Python-level timing for every adapter, swap monitor, SHA256 manifest, cross-adapter row-count validation.
  • Static reports under docs/: histogram + index (8 dbs / 11 ops / 88 data points dynamic counters) + scaling sweep with engine/op filters.

Test plan

  • make bench-all DOCKER=ON ITERATIONS=3 WARMUP=1 runs to completion on a clean checkout.
  • make bench-scaling DOCKER=ON produces docs/scaling.html with 7 sizes per (engine, op).
  • Render docs/index.html in Chrome, hard-refresh, verify hero counters reflect 8/11/88 and table rebuilds dynamically.

🤖 Generated with Claude Code

ser-vasilich and others added 30 commits May 4, 2026 14:54
Each benchmark operation now runs in its own child process via bench.worker.
Memory is guaranteed released between runs, and a single engine crash no
longer aborts the whole suite. Pattern borrowed from teide-bench.

bench/worker.py — child entrypoint. Lazy-imports the adapter, runs warmup
+ measured iterations, writes JSON to --result, exits via os._exit() to
skip Python cleanup (some engines segfault on exit).

bench/runner.py — orchestrator. Spawns workers via subprocess.run with a
600s timeout, reads back JSON, aggregates into BenchmarkRun objects. The
old in-process BenchmarkRunner class is replaced by an OrchestratorConfig
+ run_suite() function.
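The orchestrator/worker split can be sketched as follows. The inline child below stands in for the real `python -m bench.worker` invocation; the CLI flags and result schema here are assumptions, not the actual bench API:

```python
import json
import subprocess
import sys
import tempfile

def run_one(adapter: str, op: str, timeout_s: float = 600.0) -> dict:
    """Spawn one benchmark operation in a child process and read its JSON
    result back. Hypothetical shape of the runner/worker handshake."""
    with tempfile.NamedTemporaryFile(suffix=".json", delete=False) as f:
        result_path = f.name
    child = (
        "import json, os, sys; "
        "json.dump({'adapter': sys.argv[1], 'op': sys.argv[2], 'ms': [1.0]}, "
        "open(sys.argv[3], 'w')); "
        "os._exit(0)"  # exit without Python cleanup, like the real worker
    )
    subprocess.run(
        [sys.executable, "-c", child, adapter, op, result_path],
        timeout=timeout_s,  # 600s cap, matching the orchestrator
        check=True,
    )
    with open(result_path) as f:
        return json.load(f)
```

A crash or timeout in one child surfaces as a single failed result instead of aborting the suite.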

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the OS pages out, results stop reflecting engine performance and
start reflecting disk I/O. swapcheck.py samples psutil.swap_memory()
before and after each operation; runner warns when growth exceeds 100MB
or when swap is already in use at startup.

Threshold matches the conventional "noise floor" for analytical workloads
on workstations — smaller deltas usually come from unrelated background
activity, larger ones almost always indicate the dataset doesn't fit in
RAM.
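The decision logic reduces to a pure function over two samples; the real swapcheck.py gets both inputs from psutil.swap_memory().used (psutil is left out so this sketch stays dependency-free):

```python
# 100 MB growth threshold — the conventional noise floor for analytical
# workloads on workstations, per the commit message.
SWAP_WARN_BYTES = 100 * 1024 ** 2

def swap_warnings(before_used: int, after_used: int) -> list[str]:
    """Classify a before/after pair of swap-usage samples, in bytes."""
    warnings = []
    if before_used > 0:
        warnings.append("swap already in use at startup")
    if after_used - before_used > SWAP_WARN_BYTES:
        warnings.append("swap grew more than 100MB during the operation")
    return warnings
```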

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaced eval_str("(timeit ...)") wrapper with time.perf_counter_ns
around eval_str(query). The other adapters (duckdb, polars, questdb,
timescale) already time externally via Adapter._time_it; rayforce was
the only one measuring inside the engine, which excluded Python-binding
overhead and skewed comparisons in its favor.
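The uniform Python-level timing amounts to a perf_counter_ns bracket around the engine call — a sketch of the _time_it idea; the real helper's name and signature may differ:

```python
import time

def time_it(fn, *args, **kwargs):
    """Time a single engine call at the Python level, so binding overhead
    is included identically for every adapter. Returns (result, ms)."""
    t0 = time.perf_counter_ns()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter_ns() - t0) / 1e6
    return result, elapsed_ms
```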

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bench/engine_source.py — clones the requested rayforce-py branch into
.deps/rayforce-py-branch-<name>/ (or fast-forwards an existing checkout)
and returns the directory. The orchestrator passes it to the worker as
--rayforce-local, so a branch behaves exactly like a local clone.

engine_label() emits 'rayforce@branch (commit) dirty' for reports — same
shape as teide-bench used. Wired into OrchestratorConfig but only
threaded into reports in a later commit.

.deps/ ignored.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
py mode: existing RayforceAdapter (rayforce-py wrapper, PyPI when shipped).
rfl mode: new RayforceRflAdapter that drives the native binary via
generated .rfl scripts — same approach teide-bench used for rayforce2,
adapted to the new API (left-join / inner-join / xasc / select).

Each call to run_full() builds one .rfl with read-csv outside (timeit ...),
n_warmup blind runs, then n_iter measured runs that each println their
ms — so a single binary invocation does the whole warmup + measurement
cycle and we don't pay CSV-read cost per iteration.

Adapter.run_full() is the new orchestration hook on the base class. The
default implementation matches the old worker loop (per-iter calls);
RayforceRflAdapter overrides it because per-iter binary launches would
re-parse the CSV every time.

CLI: --rayforce-mode {py,rfl} + --rayforce-bin <path>. Default py because
once rayforce-py lands on PyPI that becomes the canonical path; today rfl
is the practical fallback (rayforce-py not yet released).
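The run_full hook might look roughly like this; `run_one` and the exact signature are assumptions, not the real bench API:

```python
import time

class Adapter:
    """Base-class sketch. run_one is a hypothetical per-query hook."""

    def run_one(self, query):
        raise NotImplementedError

    def run_full(self, query, n_warmup: int, n_iter: int) -> list[float]:
        """Default orchestration: per-iteration calls, matching the old
        worker loop. An adapter like RayforceRflAdapter would override
        this to batch the whole warmup + measurement cycle into one
        binary invocation and avoid re-parsing the CSV each time."""
        for _ in range(n_warmup):            # blind warmup runs
            self.run_one(query)
        times_ms = []
        for _ in range(n_iter):              # measured runs
            t0 = time.perf_counter_ns()
            self.run_one(query)
            times_ms.append((time.perf_counter_ns() - t0) / 1e6)
        return times_ms
```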

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Optional benchmark mode that complements the standard H2O sort (s1, s6)
with a typed scaling curve. Random pattern only — focus is throughput
per type, not stability under partially-sorted input.

bench/generators/sort_grid.py — random columns at 9 points per decade
up to a configurable max. dtypes: u8 i16 i32 i64 f64 str8 str16. The
str8/str16 split exists to surface RAY_STR's SSO boundary at 12 bytes:
str8 stays inline, str16 spills to the pool, and the same effect
applies to DuckDB VARCHAR (12-byte inline) and Polars Utf8.

Adapter.run_sort_typed_full(csv, dtype, n_warmup, n_iter) is the new
optional hook. duckdb / polars / rayforce-py / rayforce-rfl implement
it; questdb / timescale don't (excluded from grid by default — Docker
overhead and SQL setup cost dwarf the actual sort).

bench/sort_grid_runner.py + sort_grid_worker.py mirror the H2O
orchestrator/worker split: each (adapter, dtype, length) triple is
its own subprocess. Default 3 iterations, 1 warmup — fewer than the
H2O suite because the grid sweeps O(adapters × dtypes × lengths)
combinations and we want the whole thing to fit in a coffee break.
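"9 points per decade" resolves to a simple size grid — a sketch; the real bench/generators/sort_grid.py also emits the typed columns:

```python
def grid_sizes(max_n: int) -> list[int]:
    """Row counts at 9 points per decade (10, 20, ... 90, 100, 200, ...)
    up to a configurable max."""
    sizes: list[int] = []
    decade = 10
    while decade <= max_n:
        for k in range(1, 10):
            n = k * decade
            if n > max_n:
                break
            sizes.append(n)
        decade *= 10
    return sizes
```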

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
generate_histogram_html — self-contained Plotly bar chart, log-Y, grouped
by adapter. Same shape teide-bench used for results/bench.html, kept
deliberately ascetic so it works without docs/index.html infrastructure
and screenshots from both repos look comparable.

generate_sort_grid_html — log-log scaling curve fed from
docs/sort_data.json. One trace per (adapter, dtype) pair: color encodes
the engine, line dash distinguishes the dtype within a color group.
Legend supports group-toggle so the viewer can isolate one engine or
one dtype across engines.

Wired into runner.py (writes docs/histogram.html alongside index.html)
and sort_grid_runner.py (writes docs/sort.html alongside sort_data.json).

ENGINE_COLORS palette stays compatible with teide-bench so cross-repo
screenshots remain visually consistent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Makefile: bench-sort-ext target for the extended grid, RAYFORCE_MODE /
RAYFORCE_BIN / SORT_MAX / SORT_DTYPES knobs, fixed JOIN_DATA path to
match what bench.generate actually emits (joinNxM uses 'x' not '_').

README: prototype-branch summary at the top, rayforce execution-mode
explainer, extended sort grid section with the str8/str16 SSO rationale,
roadmap pointing at ClickBench / TPC-H / JOB next.

rayforce_rfl_adapter: read-csv → .csv.read. The reference scripts in
~/rayforce/bench/h2o/*.rfl still use the old read-csv name but the
current binary registers .csv.read (eval.c:2181) — those .rfl files
are stale.

runner: dependency check no longer hard-fails when an unrequested
adapter is missing. Worker fails cleanly if the user actually picks an
unavailable adapter.

Smoke-tested locally:
  - groupby (duckdb + polars) on 10k rows
  - groupby (rayforce rfl mode) — works without rayforce-py installed
  - extended sort grid (rayforce rfl + duckdb + polars, max=100)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three new engines round out the embedded competitor set:

* chdb — embedded ClickHouse via the chdb Python package. Lets us measure
  against ClickHouse without running a server. Session API so CREATE
  TABLE state persists across queries within one adapter instance.

* pandas — slow baseline included for context. Almost everyone reading
  the report has a mental model calibrated against pandas; the "of
  course pandas is slow" column makes the rest of the chart legible.

* datafusion — Apache Rust+Arrow query engine. It's the substrate of
  InfluxDB 3, GlareDB, ROAPI, RisingLight and Sail, so measuring against
  it covers the Apache columnar ecosystem rather than just one product.

bench/adapters/__init__.py now imports each adapter lazily — a missing
optional dep no longer breaks the whole module. print_dependency_status
lists every recognized engine so the user sees what's installed.
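The lazy-import pattern is a few lines; the module naming in the real bench/adapters/__init__.py is not shown here, so this is a generic sketch:

```python
import importlib

def lazy_import(module_name: str):
    """Import an optional dependency on demand, returning None when it is
    missing — so one absent engine package (e.g. chdb) only matters if
    that engine is actually requested."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None
```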

Default adapters for both runners now: rayforce, duckdb, polars, chdb,
datafusion, pandas. ALL=1 still adds questdb + timescale.

Smoke-tested locally on 10k groupby + sort grid on str8/i64 to 100 rows
— all 5 new adapters return sensible numbers. Color palette in
report.py extended for the new engines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
questdb: ILP ingestion is async — the table (on first write) and rows
both lag the flush by ~1s. The old loader returned immediately after
sender.flush(), so the first benchmark query landed on an empty or
nonexistent table. Now wait_for_commit polls SELECT count(*) until the
visible row count matches the load, swallowing 'table does not exist'
during the racy window. 30s timeout matches QuestDB's worst-case commit
cadence.
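The polling contract can be sketched engine-agnostically; count_fn stands in for the SELECT count(*) round-trip, and the helper name mirrors the commit message:

```python
import time

def wait_for_commit(count_fn, expected_rows: int,
                    timeout_s: float = 30.0, poll_s: float = 0.05) -> None:
    """Poll until the visible row count matches the load. count_fn may
    raise while the table does not exist yet during the racy window."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if count_fn() >= expected_rows:
                return
        except Exception:
            pass  # 'table does not exist' during the async ILP commit
        time.sleep(poll_s)
    raise RuntimeError(f"rows not visible after {timeout_s}s")
```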

timescale: post_start retry loop was too short (5×2s = 10s). Postgres
opens its port before initdb finishes, so the port-ready check returns
long before psql can connect. Bumped to 15×2s = 30s, matching the
ready_timeout we wait on.

Both surfaced as silent N/A or 'database benchmark does not exist' on
the all-8-adapter snapshot — both are pre-existing bugs unrelated to
the prototype refactor, but blocked clean cross-engine numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Generators now produce data byte-identical to ~/rayforce/bench/h2o/*.rfl
expectations and to teide-bench's gen/generate.py:

  groupby:  id1..id3 string, id4..id6 int64, v1 int[1..5],
            v2 int[1..15], v3 float[0..100) 6dp
  join:     id1..id3 int64, id4..id6 string, v1 or v2 float

Cross-machine determinism: PCG64 (stable since numpy 1.17, doesn't
shift on default_rng changes) + SHA256 of every emitted file in
manifest.json. Two users on different machines must see the same hash
for the same (n_rows, k, seed) — if they don't, generator changed and
benchmark numbers are no longer comparable.
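The manifest side of the contract needs nothing beyond stdlib hashlib; field names and layout here are assumptions about what manifest.json contains:

```python
import hashlib
import json
from pathlib import Path

def write_manifest(csv_paths, out_path="manifest.json") -> dict:
    """SHA256 every emitted file so two machines can diff generator
    output byte-for-byte. Returns the filename -> hexdigest mapping."""
    manifest = {}
    for p in map(Path, csv_paths):
        manifest[p.name] = hashlib.sha256(p.read_bytes()).hexdigest()
    Path(out_path).write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest
```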

The schema bump from 6 → 9 columns is what unlocks q7 (6-key groupby)
in the next commit, and aligns join key shapes (int IDs + string
sides) so the adapters have something to actually stress different
join paths against.

Default --right-rows for join goes 1m → 10m to match canonical H2O J1
(left and right tables the same size by default; previous 1/10 ratio
was a project-specific shortcut).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Canonical H2O q7: SUM(v3), COUNT(v1) GROUP BY id1..id6. With the new
schema this stresses high-cardinality hashing on a mix of string and
integer keys — which is exactly where engines diverge most sharply.
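The q7 shape — SUM(v3), COUNT(v1) over all six id columns — can be illustrated with stdlib sqlite3 on a toy 9-column table (sqlite is not one of the benchmarked engines; this only shows the query shape each adapter expresses in its own dialect):

```python
import sqlite3

# Toy canonical-schema table: id1..id3 string, id4..id6 int, v1/v2 int, v3 float.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE groupby (id1, id2, id3, id4, id5, id6, v1, v2, v3)")
con.executemany(
    "INSERT INTO groupby VALUES (?,?,?,?,?,?,?,?,?)",
    [("a", "b", "c", 1, 2, 3, 1, 1, 0.5),
     ("a", "b", "c", 1, 2, 3, 2, 1, 1.5),
     ("x", "y", "z", 4, 5, 6, 3, 1, 2.0)],
)
Q7 = """
    SELECT id1, id2, id3, id4, id5, id6, SUM(v3) AS v3, COUNT(v1) AS cnt
    FROM groupby
    GROUP BY id1, id2, id3, id4, id5, id6
"""
result = sorted(con.execute(Q7).fetchall())
```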

Adapter changes mirror the schema bump from 6 → 9 columns:

* duckdb / polars / pandas / chdb / datafusion: SQL or DataFrame q7,
  pulling all six id columns through GROUP BY.

* rayforce-py: _get_column_types now returns Symbol for id1..id3 (string
  IDs) and I64 for id4..id6 + v1/v2, F64 for v3. Falls back to STR if
  Symbol isn't exposed by the wrapper. The first-data-row sniff also
  disambiguates groupby vs join layout (where id4..id6 are strings).

* rayforce-rfl: GROUPBY_SCHEMA is now [SYMBOL SYMBOL SYMBOL I64 I64 I64
  I64 I64 F64], JOIN_SCHEMA is [I64 I64 I64 SYMBOL SYMBOL SYMBOL F64].
  Matches ~/rayforce/bench/h2o/q*.rfl byte-for-byte.

* questdb / timescale: SQL q7 only — ILP and COPY already handled the
  string IDs correctly, no schema fix needed.

Smoke on 10k canonical groupby (id3 high-cardinality → ~10k groups in q7):
  rayforce 9ms, polars 3ms, duckdb 9ms, pandas 11ms, datafusion 14ms,
  chdb 14ms, questdb 18ms, timescale 21ms. The high-cardinality hash
  paths separate engines much more visibly than q1..q5 on 100 groups.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bench.scaling_runner sweeps a list of sizes (default 10..1m) and runs
each adapter through every H2O op plus the typed-sort grid at each size,
producing one JSON suitable for an interactive log-log scaling chart.

Adaptive iteration counts borrow teide-bench/sort_bench_multi staircase:
n≤100→21/5, ≤100k→7/3, ≤10m→5/2, larger→3/1. Tiny inputs need many
runs to drown out the perf_counter floor (~50µs); huge inputs are
already slow so we cut down. Joins skipped under 1000 rows — both
sides are tiny and the curve adds nothing.
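The staircase is a direct transcription into a lookup function (name is an assumption):

```python
def iter_counts(n_rows: int) -> tuple[int, int]:
    """(measured_iterations, warmup_runs) for a given input size, per the
    adaptive staircase: n<=100 -> 21/5, <=100k -> 7/3, <=10m -> 5/2,
    larger -> 3/1."""
    if n_rows <= 100:
        return 21, 5
    if n_rows <= 100_000:
        return 7, 3
    if n_rows <= 10_000_000:
        return 5, 2
    return 3, 1
```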

generate_scaling_html in bench.report adapts teide-bench's
sort_bench_plot.py: two checkbox groups (Engines + Operations), All/None
buttons, plus four preset buttons that one-click switch the op filter
between groupby/join/sort-h2o/sort-typed. Plotly.react redraws on
every toggle. One trace per (engine, op) pair: engine→colour,
op→line-dash + marker symbol. Default-on a starter triple
(groupby_q1 + sort_i64 + sort_str8) so the page isn't a wall of lines
on first load.

Smoke on 4 adapters × 100,1k,10k → 172 data points, JSON 66KB,
HTML 14KB.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Makefile:
* SIZE default 1m → 10m. Canonical H2O is benchmarked at 10m+ rows; 1m
  was a hold-over from quick-iteration days. 10k/100k/1m still
  available via SIZE=...
* New bench-scaling target with default SIZES=10,100,1k,10k,100k,1m,
  driving bench.scaling_runner. Skips Docker engines by default since
  the sweep generates many subprocess spawns and TSDB engines are
  already disproportionately slow.
* JOIN_DATA paths follow canonical H2O: equal-size left and right
  tables (J1 standard), so 10mx10m instead of the previous 10mx1m.

README:
* Quick-start switches to make bench-scaling as the showcase command.
* New Reproducibility section explains the SHA256 manifest contract:
  same seed + size + machine produce byte-identical CSVs, mismatch
  means the generator changed.
* GroupBy section lists q7 alongside q1..q6 and documents the 9-column
  canonical schema (id1..id3 string, id4..id6 int).
* Join section notes the inverted spread (int keys + string sides).
* Scaling sweep section explains adaptive iter_counts and the engine/op
  filter UI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
FAIRNESS.md directly contradicted current behaviour — it claimed
rayforce uses internal (timeit ...) and that id1..id3 were int64. Both
were true two months ago and are false now. The rewrite covers what the
prototype branch actually does:

  * Python-level perf_counter_ns around every engine call (no more
    asymmetric internal timeit).
  * Subprocess isolation as the primary mechanism for fairness.
  * Canonical H2O 9-col schema + per-engine type mapping table covering
    all 8 adapters (chdb / datafusion were missing entirely).
  * Adaptive iteration counts table for bench-scaling.
  * SHA256 manifest contract — the verifiability story.
  * Swap monitor — what the warnings mean and when to trust the number.
  * Explicit list of what's deliberately excluded (server engines from
    sort-ext, partial-sort patterns, nullable workloads, value-level
    cross-engine comparison).
  * Source-file pointers throughout so claims are checkable.

README sections that were equally stale:

  * Project Structure missed worker.py, scaling_runner.py,
    sort_grid_runner.py, sort_grid_worker.py, engine_source.py,
    swapcheck.py, and four adapter files (chdb, datafusion, pandas,
    rayforce_rfl). Updated tree shows the real layout.

  * Data Format documented the old 6-col int-id schema. Now lists the
    canonical 9-col groupby and 7-col join layouts with example values
    and a snippet showing how to verify SHA256 across machines.

  * "Benchmarking with Local Rayforce Build" only knew about
    --rayforce-local. Now covers all three rayforce flows: --rayforce-local
    (path), --rayforce-branch (clone), and --rayforce-mode rfl
    (native binary, no rayforce-py needed).

  * "Server-Based Adapters" referenced make targets that don't exist
    (make infra-start/stop/status/cleanup). Replaced with the actual
    interface — ALL=1 for auto-start, python -m bench.infra for manual.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3-key inner/left join across all 8 adapters (was: only id1, which on
canonical H2O J1 — both tables 10M, id1 cardinality 100 — produced a
10^12-row Cartesian-ish result and OOMed). Now matches teide-bench and
~/rayforce/bench/h2o/j1.rfl: ON id1 = id1 AND id2 = id2 AND id3 = id3.

* duckdb / polars / pandas / chdb / datafusion / timescale: SQL/expr
  with three keys.
* questdb: implements joins for the first time (was NotImplementedError);
  loads the right side via ILP, waits for commit, joins on three keys.
* rayforce: (inner-join [id1 id2 id3] left right) instead of
  (ij `id1 left right) — same canonical form as in
  ~/rayforce/bench/h2o/j1.rfl.
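The 3-key equi-join shape is easy to sketch as a plain hash join over lists of dicts — illustrative only; every real adapter expresses it in its own SQL/expression dialect:

```python
def inner_join_3key(left, right, keys=("id1", "id2", "id3")):
    """Hash inner join on the composite (id1, id2, id3) key. Joining on
    id1 alone at cardinality 100 is what produced the Cartesian-ish
    blow-up; the 3-key form keeps the output bounded."""
    index: dict[tuple, list] = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    joined = []
    for row in left:
        for match in index.get(tuple(row[k] for k in keys), []):
            joined.append({**match, **row})
    return joined
```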

Drop rfl mode entirely. Reasoning: the rfl path went through the
.csv.read builtin which produces a table without the hash index that
Operation.READ_CSV + binary_set attaches. Net effect was every
Symbol-keyed select rehashed 10M rows from scratch (~30x slowdown,
including a hard timeout on q6). With rayforce-py 1.0.0 now on PyPI,
keeping rfl as a "fallback" only re-introduces an asymmetric timing
path. Single Python entry for every engine, period.

Removed:
* bench/adapters/rayforce_rfl_adapter.py
* --rayforce-mode / --rayforce-bin CLI flags from runner / worker /
  scaling_runner / sort_grid_runner / sort_grid_worker
* RAYFORCE_MODE / RAYFORCE_BIN Makefile knobs
* rfl sections in README and FAIRNESS.md

Smoke on 10k canonical H2O groupby (rayforce + duckdb + polars + chdb,
all py-mode):
  rayforce: 0.10-1.15ms (q1..q7)
  polars:   3.87-6.00ms — ~25x slower
  chdb:     5.03-14.67ms — ~37x slower
  duckdb:   7.04-19.54ms — ~52x slower

A reproducer for the .csv.read perf gap was packaged separately for
Anton (see /tmp/rayforce-csvread-repro.tar.gz) — that's an upstream bug,
not something to work around in the bench harness.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…esults

* rayforce_adapter sort: backtick form (xasc t `id1) is not parsed by
  current rayforce; switched to canonical (xasc t 'id1) for single-key
  and (xasc t [id1 id2 id3]) for multi-key. Smoke-tested on 10k and 10m.

* docs/index.html overview: split the single bar chart into "Fast queries
  (groupby q1..q6)" and "Heavy queries (q7, joins, sorts)" so the
  multi-second q7/sort entries don't flatten the sub-second q1..q6 group
  to invisible slivers. Same _buildBarOption helper feeds both, FAST_TASKS
  set decides the partition.

* 10M results merged from three runs (groupby+sort partial first, then
  join with right.csv from data/join_10mx10m, then a 1-iter rayforce
  join because rayforce-py crashes on repeated 10M-row right-table
  reloads — likely a memory leak in the wrapper, separate bug for Anton).

Headline numbers (10M, median ms):
  rayforce: groupby 24-1747, join 511/627, sort 5264/19787
  next-fastest on each op trails by 3x-15x except sort_multi (pandas wins).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bench.scaling_runner sweep across sizes 10, 100, 1k, 10k, 100k, 1m on
rayforce + duckdb + polars + pandas + chdb + datafusion. 623 data
points → docs/scaling_data.json + interactive log-log chart in
docs/scaling.html with engine + op filters and preset buttons (groupby /
join / sort H2O / sort typed).

Two rayforce-py workarounds applied during the run, both flagged for
upstream:

* String column type for sort grid: rf.String exists in 1.0.0 but
  Table.from_csv() asks for c.ray_name and String doesn't expose one,
  so we can't request a RAY_STR column at load time. Fall back to
  Symbol for str8/str16 in the sort grid — same scan path the
  ~/rayforce/bench/h2o/q*.rfl examples use.

* xasc syntax: backtick form (xasc t `id1) parses as an error in 1.0.0;
  switched to (xasc t 'id1) for single key and (xasc t [id1 id2 id3])
  for multi-key.

Per-adapter coverage (ops × sizes):
  every embedded engine: full 16 ops on 10/100, full 18 ops on 1k+
  rayforce: full 16/16/18/18/17/18 — one 100k sort_f64 lost to a
  rayforce-py worker crash that we already see in the 10M join path
  (Repeated load+save of large right tables crashes the wrapper —
  separate bug for Anton, see /tmp/rayforce-csvread-repro for the
  related .csv.read perf gap).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scaling_runner now accepts -i/--iterations, -w/--warmup (override the
adaptive 21/7/5/3 staircase) and --metric min|median (default median).

Server engines added to the scaling chart at every size, run with
1 warmup + 2 timed + min aggregation — fewer iterations because
Docker round-trip dominates the small-N timing anyway, and "best of N"
gives a clean lower bound. On 10..100 rows the curves flatten into
"network overhead" territory — that's exactly the diagnostic value
the user asked for: see how unusable QuestDB / Timescale become at
small scale.

Coverage matrix now:
  embedded engines (rayforce / duckdb / polars / pandas / chdb /
                    datafusion):   full 16/16/18/18/18/18 × 6 sizes
  server engines (questdb / timescale): full 9/9/11/11/11/11 × 6 sizes
                                        (no sort grid — they don't
                                        implement run_sort_typed_full)

747 data points total.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
run_sort_typed_full implemented for both server engines:

* questdb: ILP load with the appropriate column type (BYTE/SHORT/INT/
  LONG/DOUBLE/SYMBOL). u8 widens to SHORT — QuestDB has no UINT.
  str8/str16 land on SYMBOL (low-cardinality dictionary) which is the
  natural type for the financial / market-data segment QuestDB targets.

* timescale: CREATE TABLE with PostgreSQL-native type then COPY-from-
  STDIN. SMALLINT/SMALLINT/INTEGER/BIGINT/DOUBLE PRECISION/TEXT.
  PostgreSQL has no UINT8 either; SMALLINT covers 0..255 safely.

sort_grid_worker.py and scaling_runner.SORT_GRID_ADAPTERS now allow
both server engines through the typed-sort path.

Final scaling coverage matrix (op count per adapter × size):
                    10     100    1000   10000  100000 1000000
  rayforce:         16/16  16/16  18/18  18/18  17/18  18/18
  every other:      16/16  16/16  18/18  18/18  18/18  18/18

831 data points total. Single rayforce miss (sort_f64 at 100k) carries
over from the earlier rayforce-py worker-crash pattern documented in
the prior commits — not a fix-now item.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pre-existing histogram.html was generated from the very first 10m
run, before xasc syntax fix and before the join data was merged in.
It had rayforce sort_* as null and missing join_inner/join_left rows
for everyone.

Re-rendered from docs/data_10m.json (the merged, complete dataset)
so it now matches docs/index.html in coverage: all 8 adapters × 11 ops.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The preset buttons (Groupby / Join / Sort H2O / Sort typed) and the All
button under "Operations" added clutter for the actual workflow: the
viewer toggles individual ops on demand. Only the None button stays —
quick way to clear the chart back to one explicit selection.

Engines panel (All / None) untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Task Breakdown was a per-op bar chart (8 adapters sorted by median).
Same ranking is already legible from the heights in Overview's fast/heavy
charts, so the extra section was redundant. It also hardcoded six tabs
(groupby_q1..q6) — every op past q6 was silently invisible.

Removed the markup (section, .task-tabs / .task-panel divs, six
hardcoded chart-containers), the JS (initTaskCharts, updateTaskChart,
showTask, taskCharts), and the orphan CSS rules. Lazy-load observer
now only resizes the two surviving overview charts (fast / heavy).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…scaling

bench.runner's all mode expected join's left.csv/right.csv to live in
the same directory it was given for groupby — they live in
data/join_<n>x<n>/ instead, so all-mode joins always failed silently
(median_ms=0, N/A
in the chart). New --join-data flag lets the orchestrator point each
suite at its own dataset; Makefile bench-all target now forwards it
automatically based on SIZE.

Makefile bench-scaling now forwards ITERATIONS / WARMUP — earlier the
scaling sweep ignored them and stuck to the adaptive 21/7/5/3 staircase
even when the user asked for fixed counts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
generate_html_report no longer patches index.html via regex. It writes
two artifacts and stops:

  docs/data.json — pretty-printed dataset (tooling, manifest, share)
  docs/data.js   — window.chartData = {...}; one-line module

index.html now ships as a static file and pulls the dataset via a plain
<script src="data.js"></script> include. No JSON parser, no fetch and
its file:// CORS quirks, no regex over data that could legitimately
contain '};' in some future schema. data.js is just a JS file the
browser already trusts to set globals.

The 70KB of inline data dropped from index.html as a side effect.
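Both artifacts can be emitted with nothing but the stdlib; the function name is an assumption:

```python
import json
from pathlib import Path

def write_report_data(dataset: dict, docs_dir) -> None:
    """Write the two report artifacts: pretty-printed JSON for tooling,
    and a one-line JS module that the static index.html pulls in via
    <script src="data.js"></script> — no fetch, no regex patching."""
    docs = Path(docs_dir)
    docs.mkdir(parents=True, exist_ok=True)
    (docs / "data.json").write_text(json.dumps(dataset, indent=2))
    (docs / "data.js").write_text(
        "window.chartData = " + json.dumps(dataset) + ";")
```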

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two engine selectors under "Detailed Results > Compare Databases"
hardcoded the original five adapters (rayforce / duckdb / polars /
questdb / timescale). After we added pandas, chdb, and datafusion in
2be5755 the markup was never extended, so those three never appeared
in the side-by-side comparison.

Now lists all eight in both dropdowns, alphabetically grouped by
purpose: embedded engines first, then server engines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scaling_runner now accepts --stop-infra (parity with bench.runner) and
the Makefile bench-scaling target forwards STOP_INFRA when ALL=1.

Without this, a scaling sweep that included questdb / timescale left
their containers running after exit, holding multi-GB of buffer pool
and query-plan caches in RAM. Server engines don't release that on
psycopg connection close — the only way is to stop the container.
With --stop-infra the runner does that as the last step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bench / bench-* and bench-scaling now declare file dependencies on
.venv/bin/python and on $(GROUPBY_DATA)/data.csv / $(JOIN_DATA)/left.csv,
so make creates whatever is missing before running the bench:

  * No .venv yet → python3 -m venv .venv + pip install requirements.
  * No data/groupby_<SIZE>_k100/ yet → bench.generate groupby.
  * No data/join_<SIZE>x<SIZE>/ yet → bench.generate join.
  * Everything already there → straight to the bench, no rebuild.

PYTHON now defaults to .venv/bin/python (the file target makes sure it
exists), so `make bench-all` works out of the box on a fresh checkout
without the user remembering to run `make setup` and `make data` first.

Also rename the per-row count in the runner output: "rows=N" -> "result=N rows".
The previous label read like input row count; "result" is unambiguous.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ALL=1 was overloaded against `make bench-all` (run all suites). New
DOCKER=ON makes the intent unambiguous: "switch on the Docker-backed
engines (QuestDB + TimescaleDB)". Strict ifeq match means typos like
DOCKER=0 or DOCKER=off don't accidentally enable them — only the
explicit ON value does.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bench.runner already auto-started questdb / timescale containers when
they were in the adapter list. scaling_runner didn't, so a sweep with
DOCKER=ON gave 'Connection refused' on every server-engine point — the
recent run was 220 errored entries out of 836 because of this.

Now scaling_runner calls start_required_infrastructure(adapters) before
running the sweep, mirroring runner.py. If a container fails to come
up, that adapter is dropped from the run with a warning instead of
poisoning the chart with 200+ identical 'Connection refused' rows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ser-vasilich and others added 10 commits May 5, 2026 22:35
…eout

Two bugs surfaced in the latest 1M run:

1. str8 / str16 measurements showed median=0.36ms / rows=0 — i.e. the
   sort ran against an empty table because the 30s ILP commit deadline
   silently expired and we proceeded anyway. Now: 120s deadline and
   raise RuntimeError on miss, so a row that didn't load shows up as
   ERROR rather than as a fake "QuestDB sorts 1M strings in 0.36ms".

2. Random 1M unique strings were going through ILP `symbols={...}` —
   QuestDB Symbol is a dictionary type for low-cardinality categoricals,
   not a general string column. Switched to ILP `columns={...}`, which
   maps to STRING and handles unique values per row without the symbol-
   dictionary bottleneck.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rayforce-py 1.0.0 sometimes crashes between the timed block and the
JSON write — the parent then sees "Expecting value: line 1 column 1"
with no idea why. Random pattern, not deterministic by op or row count
(seen on q3/q7/join_inner/sort_single across the same sweep where
q1/q2/q4/q5/q6 succeeded on identical conditions).

Worker calls in scaling_runner now go through _run_worker_with_retry:
  - capture_output=True so we get stderr
  - one retry on empty/missing JSON; second subprocess in a fresh
    Python interpreter clears whatever state crashed the first
  - on final failure, error message includes the last 3 lines of the
    worker's stderr instead of the cryptic "Expecting value" string

Retries print a "retry [adapter/op n=N]: ..." line so the user can see
flakiness even when it eventually succeeded. Other adapters never hit
this path in practice; the cost is negligible.
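A minimal sketch of the retry idea; here the child's stdout stands in for the worker's --result JSON file, and the helper name mirrors the commit message:

```python
import subprocess
import sys

def run_worker_with_retry(cmd: list[str], retries: int = 1) -> str:
    """Run a worker subprocess, retrying once in a fresh interpreter when
    it dies before producing output. On final failure, report the tail
    of stderr instead of a cryptic JSON-decode error."""
    last = None
    for attempt in range(retries + 1):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode == 0 and proc.stdout.strip():
            return proc.stdout
        last = proc
        if attempt < retries:
            print(f"retry: rc={proc.returncode}")  # surface flakiness
    tail = "\n".join(last.stderr.splitlines()[-3:])
    raise RuntimeError(f"worker failed after {retries + 1} attempts; "
                       f"stderr tail:\n{tail}")
```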

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
For each benchmark (and for scaling, each (op, size) pair) all adapters
that successfully returned a result should agree on the number of output
rows. Disagreement = either a SQL-semantics bug in one adapter (e.g.
NULL handling, distinct vs. not-distinct join) or a real engine
difference worth knowing about.

Both runner.py and scaling_runner.py now print a "Row-count validation"
block at the end:

  Row-count validation:
    OK — all 11 benchmark(s) returned the same row count from every adapter

or, when somebody disagrees:

  Row-count validation:
    WARNING — 2 benchmark(s) disagree across adapters:
      groupby_q7: chdb=10000, duckdb=9998, polars=10000

So far our existing data passes this check on canonical H2O 10M; if a
new schema or query introduces a divergence, the bench loudly says so
instead of silently averaging incompatible results.
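The cross-adapter check reduces to a few lines; the results data shape here is an assumption:

```python
def validate_row_counts(results: dict) -> list[str]:
    """results[benchmark][adapter] -> output row count. Returns one
    warning line per benchmark whose adapters disagree; an empty list
    means the suite-wide check passed."""
    warnings = []
    for bench, counts in sorted(results.items()):
        if len(set(counts.values())) > 1:
            detail = ", ".join(f"{a}={n}" for a, n in sorted(counts.items()))
            warnings.append(f"{bench}: {detail}")
    return warnings
```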

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add 10m to the default scaling-curve sweep (was 10..1m). Lets
make bench-scaling cover the full size range — including the
real-world 10M-row scenario — without callers having to spell out
SIZES=... manually. Override remains available for shorter runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ser-vasilich ser-vasilich changed the title prototype: H2O bench framework + cross-adapter check prototype: H2O bench framework May 8, 2026
@singaraiona singaraiona merged commit aa4b402 into RayforceDB:master May 8, 2026