prototype: H2O bench framework #1
Merged
singaraiona merged 40 commits into RayforceDB:master on May 8, 2026
Conversation
Each benchmark operation now runs in its own child process via bench.worker. Memory is guaranteed to be released between runs, and a single engine crash no longer aborts the whole suite. Pattern borrowed from teide-bench. bench/worker.py — child entrypoint. Lazy-imports the adapter, runs warmup + measured iterations, writes JSON to --result, exits via os._exit() to skip Python cleanup (some engines segfault on exit). bench/runner.py — orchestrator. Spawns workers via subprocess.run with a 600s timeout, reads back JSON, aggregates into BenchmarkRun objects. The old in-process BenchmarkRunner class is replaced by an OrchestratorConfig + run_suite() function. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
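A minimal sketch of the orchestrator side of that pattern; module path, CLI flags, and the result schema are illustrative, not the exact bench/runner.py code:

```python
# Illustrative orchestrator-side code for the per-operation subprocess
# pattern. Flag names and the result fields are assumptions.
import json
import subprocess
import sys
import tempfile
from pathlib import Path

WORKER_TIMEOUT_S = 600  # same 600s budget mentioned above


def run_one_op(adapter: str, op: str, data_dir: str) -> dict:
    """Run a single (adapter, op) pair in its own child process."""
    with tempfile.TemporaryDirectory() as tmp:
        result_path = Path(tmp) / "result.json"
        cmd = [
            sys.executable, "-m", "bench.worker",
            "--adapter", adapter,
            "--op", op,
            "--data", data_dir,
            "--result", str(result_path),
        ]
        try:
            subprocess.run(cmd, timeout=WORKER_TIMEOUT_S, check=True)
        except subprocess.TimeoutExpired:
            return {"adapter": adapter, "op": op, "error": "timeout"}
        except subprocess.CalledProcessError as exc:
            # An engine crash costs this one data point, not the whole suite.
            return {"adapter": adapter, "op": op, "error": f"exit code {exc.returncode}"}
        return json.loads(result_path.read_text())
```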
When the OS pages out, results stop reflecting engine performance and start reflecting disk I/O. swapcheck.py samples psutil.swap_memory() before and after each operation; runner warns when growth exceeds 100MB or when swap is already in use at startup. Threshold matches the conventional "noise floor" for analytical workloads on workstations — smaller deltas usually come from unrelated background activity, larger ones almost always indicate the dataset doesn't fit in RAM. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
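Roughly what the before/after sampling looks like, assuming psutil; the threshold constant mirrors the 100MB noise floor above, and the function names are illustrative:

```python
# Rough shape of the swap monitor described above.
import psutil

SWAP_GROWTH_WARN = 100 * 1024 * 1024  # bytes


def swap_before() -> int:
    swap = psutil.swap_memory()
    if swap.used > 0:
        print(f"WARNING: {swap.used / 2**20:.0f}MB of swap already in use at startup")
    return swap.used


def swap_after(before_used: int, op: str) -> None:
    delta = psutil.swap_memory().used - before_used
    if delta > SWAP_GROWTH_WARN:
        print(f"WARNING: swap grew by {delta / 2**20:.0f}MB during {op}; "
              "the timing reflects disk I/O, not engine performance")
```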
Replaced eval_str("(timeit ...)") wrapper with time.perf_counter_ns
around eval_str(query). The other adapters (duckdb, polars, questdb,
timescale) already time externally via Adapter._time_it; rayforce was
the only one measuring inside the engine, which excluded Python-binding
overhead and skewed comparisons in its favor.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
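The external-timing wrapper is essentially this; eval_str stands in for the rayforce-py evaluation call, and the helper name is illustrative:

```python
# The same external timing every other adapter already uses, applied to
# rayforce so Python-binding overhead is included for all engines.
import time


def time_query_ms(eval_str, query: str) -> float:
    """Wall-clock milliseconds, Python-binding overhead included."""
    start = time.perf_counter_ns()
    eval_str(query)
    return (time.perf_counter_ns() - start) / 1e6
```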
bench/engine_source.py — clones the requested rayforce-py branch into .deps/rayforce-py-branch-<name>/ (or fast-forwards an existing checkout) and returns the directory. The orchestrator passes it to the worker as --rayforce-local, so a branch behaves exactly like a local clone. engine_label() emits 'rayforce@branch (commit) dirty' for reports — same shape as teide-bench used. Wired into OrchestratorConfig but only threaded into reports in a later commit. .deps/ ignored. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
py mode: existing RayforceAdapter (rayforce-py wrapper, PyPI when shipped).
rfl mode: new RayforceRflAdapter that drives the native binary via
generated .rfl scripts — same approach teide-bench used for rayforce2,
adapted to the new API (left-join / inner-join / xasc / select).
Each call to run_full() builds one .rfl with read-csv outside (timeit ...),
n_warmup blind runs, then n_iter measured runs that each println their
ms — so a single binary invocation does the whole warmup + measurement
cycle and we don't pay CSV-read cost per iteration.
Adapter.run_full() is the new orchestration hook on the base class. The
default implementation matches the old worker loop (per-iter calls);
RayforceRflAdapter overrides it because per-iter binary launches would
re-parse the CSV every time.
CLI: --rayforce-mode {py,rfl} + --rayforce-bin <path>. Default py because
once rayforce-py lands on PyPI that becomes the canonical path; today rfl
is the practical fallback (rayforce-py not yet released).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Optional benchmark mode that complements the standard H2O sort (s1, s6) with a typed scaling curve. Random pattern only — focus is throughput per type, not stability under partially-sorted input. bench/generators/sort_grid.py — random columns at 9 points per decade up to a configurable max. dtypes: u8 i16 i32 i64 f64 str8 str16. The str8/str16 split exists to surface RAY_STR's SSO boundary at 12 bytes: str8 stays inline, str16 spills to the pool, and the same effect applies to DuckDB VARCHAR (12-byte inline) and Polars Utf8. Adapter.run_sort_typed_full(csv, dtype, n_warmup, n_iter) is the new optional hook. duckdb / polars / rayforce-py / rayforce-rfl implement it; questdb / timescale don't (excluded from grid by default — Docker overhead and SQL setup cost dwarf the actual sort). bench/sort_grid_runner.py + sort_grid_worker.py mirror the H2O orchestrator/worker split: each (adapter, dtype, length) triple is its own subprocess. Default 3 iterations, 1 warmup — fewer than the H2O suite because the grid sweeps O(adapters × dtypes × lengths) combinations and we want the whole thing to fit in a coffee break. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
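A compact sketch of how such a grid could be generated; the length staircase and per-dtype column contents are assumptions, not the exact bench/generators/sort_grid.py output:

```python
# Illustrative generator for the typed sort grid: 9 lengths per decade
# plus a random column per dtype.
import numpy as np

NP_INT = {"u8": np.uint8, "i16": np.int16, "i32": np.int32, "i64": np.int64}


def grid_lengths(max_rows: int) -> list[int]:
    """1..9, 10..90, 100..900, ... capped at max_rows (9 points per decade)."""
    out, decade = [], 1
    while decade <= max_rows:
        out.extend(d * decade for d in range(1, 10) if d * decade <= max_rows)
        decade *= 10
    return out


def random_column(dtype: str, n: int, seed: int = 42) -> np.ndarray:
    rng = np.random.Generator(np.random.PCG64(seed))
    if dtype == "f64":
        return rng.random(n)
    if dtype.startswith("str"):
        # str8 stays within a 12-byte SSO budget, str16 spills past it
        width = int(dtype[3:])
        letters = np.array(list("abcdefghijklmnopqrstuvwxyz"))
        return np.array(["".join(rng.choice(letters, size=width)) for _ in range(n)])
    np_dtype = NP_INT[dtype]
    info = np.iinfo(np_dtype)
    return rng.integers(info.min, info.max, size=n, dtype=np_dtype, endpoint=True)
```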
generate_histogram_html — self-contained Plotly bar chart, log-Y, grouped by adapter. Same shape teide-bench used for results/bench.html, kept deliberately ascetic so it works without docs/index.html infrastructure and screenshots from both repos look comparable. generate_sort_grid_html — log-log scaling curve fed from docs/sort_data.json. One trace per (adapter, dtype) pair: color encodes the engine, line dash distinguishes the dtype within a color group. Legend supports group-toggle so the viewer can isolate one engine or one dtype across engines. Wired into runner.py (writes docs/histogram.html alongside index.html) and sort_grid_runner.py (writes docs/sort.html alongside sort_data.json). ENGINE_COLORS palette stays compatible with teide-bench so cross-repo screenshots remain visually consistent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Makefile: bench-sort-ext target for the extended grid, RAYFORCE_MODE / RAYFORCE_BIN / SORT_MAX / SORT_DTYPES knobs, fixed JOIN_DATA path to match what bench.generate actually emits (joinNxM uses 'x' not '_'). README: prototype-branch summary at the top, rayforce execution-mode explainer, extended sort grid section with the str8/str16 SSO rationale, roadmap pointing at ClickBench / TPC-H / JOB next. rayforce_rfl_adapter: read-csv → .csv.read. The reference scripts in ~/rayforce/bench/h2o/*.rfl still use the old read-csv name but the current binary registers .csv.read (eval.c:2181) — those .rfl files are stale. runner: dependency check no longer hard-fails when an unrequested adapter is missing. Worker fails cleanly if the user actually picks an unavailable adapter. Smoke-tested locally: - groupby (duckdb + polars) on 10k rows - groupby (rayforce rfl mode) — works without rayforce-py installed - extended sort grid (rayforce rfl + duckdb + polars, max=100) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three new engines round out the embedded competitor set: * chdb — embedded ClickHouse via the chdb Python package. Lets us measure against ClickHouse without running a server. Session API so CREATE TABLE state persists across queries within one adapter instance. * pandas — slow baseline included for context. Almost everyone reading the report has a mental model calibrated against pandas; the "of course pandas is slow" column makes the rest of the chart legible. * datafusion — Apache Rust+Arrow query engine. It's the substrate of InfluxDB 3, GlareDB, ROAPI, RisingLight and Sail, so measuring against it covers the Apache columnar ecosystem rather than just one product. bench/adapters/__init__.py now imports each adapter lazily — a missing optional dep no longer breaks the whole module. print_dependency_status lists every recognized engine so the user sees what's installed. Default adapters for both runners now: rayforce, duckdb, polars, chdb, datafusion, pandas. ALL=1 still adds questdb + timescale. Smoke-tested locally on 10k groupby + sort grid on str8/i64 to 100 rows — all 5 new adapters return sensible numbers. Color palette in report.py extended for the new engines. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
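The lazy-import idea, sketched with illustrative module and class names:

```python
# Sketch of lazy adapter loading: an adapter module is imported only when
# requested, so a missing optional dependency (e.g. chdb) cannot break
# unrelated runs. Names are assumptions, not the exact __init__.py.
import importlib

_ADAPTERS = {
    "duckdb": ("bench.adapters.duckdb_adapter", "DuckDBAdapter"),
    "polars": ("bench.adapters.polars_adapter", "PolarsAdapter"),
    "chdb": ("bench.adapters.chdb_adapter", "ChdbAdapter"),
    "datafusion": ("bench.adapters.datafusion_adapter", "DataFusionAdapter"),
    "pandas": ("bench.adapters.pandas_adapter", "PandasAdapter"),
}


def get_adapter(name: str):
    module_name, class_name = _ADAPTERS[name]
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(f"adapter '{name}' needs an uninstalled dependency: {exc}") from exc
    return getattr(module, class_name)


def print_dependency_status() -> None:
    for name, (module_name, _) in _ADAPTERS.items():
        try:
            importlib.import_module(module_name)
            print(f"  {name:<12} OK")
        except ImportError:
            print(f"  {name:<12} missing")
```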
questdb: ILP ingestion is async — the table (on first write) and rows both lag the flush by ~1s. The old loader returned immediately after sender.flush(), so the first benchmark query landed on an empty or nonexistent table. Now wait_for_commit polls SELECT count(*) until the visible row count matches the load, swallowing 'table does not exist' during the racy window. 30s timeout matches QuestDB's worst-case commit cadence. timescale: post_start retry loop was too short (5×2s = 10s). Postgres opens its port before initdb finishes, so the port-ready check returns long before psql can connect. Bumped to 15×2s = 30s, matching the ready_timeout we wait on. Both surfaced as silent N/A or 'database benchmark does not exist' on the all-8-adapter snapshot — both are pre-existing bugs unrelated to the prototype refactor, but blocked clean cross-engine numbers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
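The wait_for_commit polling loop is roughly this shape; query stands in for whatever helper runs SQL against QuestDB, and the error-text match is an assumption:

```python
# Poll until ILP-ingested rows become visible, tolerating the window
# where the table does not yet exist.
import time


def wait_for_commit(query, table: str, expected_rows: int, timeout_s: float = 30.0) -> None:
    deadline = time.monotonic() + timeout_s
    visible = 0
    while time.monotonic() < deadline:
        try:
            visible = query(f"SELECT count(*) FROM {table}")
        except Exception as exc:
            # The table itself can lag the first flush; swallow only that case.
            if "does not exist" not in str(exc):
                raise
            visible = 0
        if visible >= expected_rows:
            return
        time.sleep(0.25)
    raise RuntimeError(
        f"{table}: only {visible} of {expected_rows} rows visible after {timeout_s}s"
    )
```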
Generators now produce data byte-identical to ~/rayforce/bench/h2o/*.rfl
expectations and to teide-bench's gen/generate.py:
groupby: id1..id3 string, id4..id6 int64, v1 int[1..5],
v2 int[1..15], v3 float[0..100) 6dp
join: id1..id3 int64, id4..id6 string, v1 or v2 float
Cross-machine determinism: PCG64 (stable since numpy 1.17, doesn't
shift on default_rng changes) + SHA256 of every emitted file in
manifest.json. Two users on different machines must see the same hash
for the same (n_rows, k, seed) — if they don't, generator changed and
benchmark numbers are no longer comparable.
The schema bump from 6 → 9 columns is what unlocks q7 (6-key groupby)
in the next commit, and aligns join key shapes (int IDs + string
sides) so the adapters have something to actually stress different
join paths against.
Default --right-rows for join goes 1m → 10m to match canonical H2O J1
(left and right tables the same size by default; previous 1/10 ratio
was a project-specific shortcut).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
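A minimal sketch of that determinism contract, with an assumed manifest layout:

```python
# PCG64-seeded generator plus a SHA256 manifest entry per emitted file.
import hashlib
import json
from pathlib import Path

import numpy as np


def make_rng(seed: int) -> np.random.Generator:
    # PCG64 is pinned explicitly so a future default_rng change cannot
    # silently alter the data stream.
    return np.random.Generator(np.random.PCG64(seed))


def record_sha256(manifest_path: Path, csv_path: Path) -> str:
    digest = hashlib.sha256(csv_path.read_bytes()).hexdigest()
    manifest = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    manifest[csv_path.name] = digest
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return digest
```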
Canonical H2O q7: SUM(v3), COUNT(v1) GROUP BY id1..id6. With the new schema this stresses high-cardinality hashing on a mix of string and integer keys — which is exactly where engines diverge most sharply. Adapter changes mirror the schema bump from 6 → 9 columns: * duckdb / polars / pandas / chdb / datafusion: SQL or DataFrame q7, pulling all six id columns through GROUP BY. * rayforce-py: _get_column_types now returns Symbol for id1..id3 (string IDs) and I64 for id4..id6 + v1/v2, F64 for v3. Falls back to STR if Symbol isn't exposed by the wrapper. The first-data-row sniff also disambiguates groupby vs join layout (where id4..id6 are strings). * rayforce-rfl: GROUPBY_SCHEMA is now [SYMBOL SYMBOL SYMBOL I64 I64 I64 I64 I64 F64], JOIN_SCHEMA is [I64 I64 I64 SYMBOL SYMBOL SYMBOL F64]. Matches ~/rayforce/bench/h2o/q*.rfl byte-for-byte. * questdb / timescale: SQL q7 only — ILP and COPY already handled the string IDs correctly, no schema fix needed. Smoke on 10k canonical groupby (id3 high-cardinality → ~10k groups in q7): rayforce 9ms, polars 3ms, duckdb 9ms, pandas 11ms, datafusion 14ms, chdb 14ms, questdb 18ms, timescale 21ms. The high-cardinality hash paths separate engines much more visibly than q1..q5 on 100 groups. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bench.scaling_runner sweeps a list of sizes (default 10..1m) and runs each adapter through every H2O op plus the typed-sort grid at each size, producing one JSON suitable for an interactive log-log scaling chart. Adaptive iteration counts borrow teide-bench/sort_bench_multi staircase: n≤100→21/5, ≤100k→7/3, ≤10m→5/2, larger→3/1. Tiny inputs need many runs to drown out the perf_counter floor (~50µs); huge inputs are already slow so we cut down. Joins skipped under 1000 rows — both sides are tiny and the curve adds nothing. generate_scaling_html in bench.report adapts teide-bench's sort_bench_plot.py: two checkbox groups (Engines + Operations), All/None buttons, plus four preset buttons that one-click switch the op filter between groupby/join/sort-h2o/sort-typed. Plotly.react redraws on every toggle. One trace per (engine, op) pair: engine→colour, op→line-dash + marker symbol. Default-on a starter triple (groupby_q1 + sort_i64 + sort_str8) so the page isn't a wall of lines on first load. Smoke on 4 adapters × 100,1k,10k → 172 data points, JSON 66KB, HTML 14KB. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
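The adaptive staircase from that sweep, written out as a helper with the thresholds listed above:

```python
# (n_iter, n_warmup) per input size.
def adaptive_iterations(n_rows: int) -> tuple[int, int]:
    if n_rows <= 100:
        return 21, 5          # tiny inputs: drown out the perf_counter floor
    if n_rows <= 100_000:
        return 7, 3
    if n_rows <= 10_000_000:
        return 5, 2
    return 3, 1               # huge inputs are already slow enough on their own
```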
Makefile: * SIZE default 1m → 10m. Canonical H2O is benchmarked at 10m+ rows; 1m was a hold-over from quick-iteration days. 10k/100k/1m still available via SIZE=... * New bench-scaling target with default SIZES=10,100,1k,10k,100k,1m, driving bench.scaling_runner. Skips Docker engines by default since the sweep generates many subprocess spawns and TSDB engines are already disproportionately slow. * JOIN_DATA paths follow canonical H2O: equal-size left and right tables (J1 standard), so 10mx10m instead of the previous 10mx1m. README: * Quick-start switches to make bench-scaling as the showcase command. * New Reproducibility section explains the SHA256 manifest contract: same seed + size + machine produce byte-identical CSVs, mismatch means the generator changed. * GroupBy section lists q7 alongside q1..q6 and documents the 9-column canonical schema (id1..id3 string, id4..id6 int). * Join section notes the inverted spread (int keys + string sides). * Scaling sweep section explains adaptive iter_counts and the engine/op filter UI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
FAIRNESS.md was directly contradicting current behaviour — it claimed
rayforce uses internal (timeit ...) and that id1..id3 were int64. Both
were true two months ago and false now. Rewrite covers what the prototype
branch actually does:
* Python-level perf_counter_ns around every engine call (no more
asymmetric internal timeit).
* Subprocess isolation as the primary mechanism for fairness.
* Canonical H2O 9-col schema + per-engine type mapping table covering
all 8 adapters (chdb / datafusion were missing entirely).
* Adaptive iteration counts table for bench-scaling.
* SHA256 manifest contract — the verifiability story.
* Swap monitor — what the warnings mean and when to trust the number.
* Explicit list of what's deliberately excluded (server engines from
sort-ext, partial-sort patterns, nullable workloads, value-level
cross-engine comparison).
* Source-file pointers throughout so claims are checkable.
README sections that were equally stale:
* Project Structure missed worker.py, scaling_runner.py,
sort_grid_runner.py, sort_grid_worker.py, engine_source.py,
swapcheck.py, and four adapter files (chdb, datafusion, pandas,
rayforce_rfl). Updated tree shows the real layout.
* Data Format documented the old 6-col int-id schema. Now lists the
canonical 9-col groupby and 7-col join layouts with example values
and a snippet showing how to verify SHA256 across machines.
* "Benchmarking with Local Rayforce Build" only knew about
--rayforce-local. Now covers all three rayforce flows: --rayforce-local
(path), --rayforce-branch (clone), and --rayforce-mode rfl
(native binary, no rayforce-py needed).
* "Server-Based Adapters" referenced make targets that don't exist
(make infra-start/stop/status/cleanup). Replaced with the actual
interface — ALL=1 for auto-start, python -m bench.infra for manual.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3-key inner/left join across all 8 adapters (was: only id1, which on canonical H2O J1 — both tables 10M, id1 cardinality 100 — produced a 10^12-row Cartesian-ish result and OOMed). Now matches teide-bench and ~/rayforce/bench/h2o/j1.rfl: ON id1 = id1 AND id2 = id2 AND id3 = id3. * duckdb / polars / pandas / chdb / datafusion / timescale: SQL/expr with three keys. * questdb: implements joins for the first time (was NotImplementedError); loads the right side via ILP, waits for commit, joins on three keys. * rayforce: (inner-join [id1 id2 id3] left right) instead of (ij `id1 left right) — same canonical form as in ~/rayforce/bench/h2o/j1.rfl. Drop rfl mode entirely. Reasoning: the rfl path went through the .csv.read builtin which produces a table without the hash index that Operation.READ_CSV + binary_set attaches. Net effect was every Symbol-keyed select rehashed 10M rows from scratch (~30x slowdown, including a hard timeout on q6). With rayforce-py 1.0.0 now on PyPI, keeping rfl as a "fallback" only re-introduces an asymmetric timing path. Single Python entry for every engine, period. Removed: * bench/adapters/rayforce_rfl_adapter.py * --rayforce-mode / --rayforce-bin CLI flags from runner / worker / scaling_runner / sort_grid_runner / sort_grid_worker * RAYFORCE_MODE / RAYFORCE_BIN Makefile knobs * rfl sections in README and FAIRNESS.md Smoke on 10k canonical H2O groupby (rayforce + duckdb + polars + chdb, all py-mode): rayforce: 0.10-1.15ms (q1..q7) polars: 3.87-6.00ms — ~25x slower chdb: 5.03-14.67ms — ~37x slower duckdb: 7.04-19.54ms — ~52x slower A reproducer for the .csv.read perf gap was packaged separately for Anton (see /tmp/rayforce-csvread-repro.tar.gz) — that's an upstream bug, not something to work around in the bench harness. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…esults * rayforce_adapter sort: backtick form (xasc t \`id1) is not parsed by current rayforce; switched to canonical (xasc t 'id1) for single-key and (xasc t [id1 id2 id3]) for multi-key. Smoke-tested on 10k and 10m. * docs/index.html overview: split the single bar chart into "Fast queries (groupby q1..q6)" and "Heavy queries (q7, joins, sorts)" so the multi-second q7/sort entries don't flatten the sub-second q1..q6 group to invisible slivers. Same _buildBarOption helper feeds both, FAST_TASKS set decides the partition. * 10M results merged from three runs (groupby+sort partial first, then join with right.csv from data/join_10mx10m, then a 1-iter rayforce join because rayforce-py crashes on repeated 10M-row right-table reloads — likely a memory leak in the wrapper, separate bug for Anton). Headline numbers (10M, median ms): rayforce: groupby 24-1747, join 511/627, sort 5264/19787 next-fastest on each op trails by 3x-15x except sort_multi (pandas wins). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bench.scaling_runner sweep across sizes 10, 100, 1k, 10k, 100k, 1m on rayforce + duckdb + polars + pandas + chdb + datafusion. 623 data points → docs/scaling_data.json + interactive log-log chart in docs/scaling.html with engine + op filters and preset buttons (groupby / join / sort H2O / sort typed). Two rayforce-py workarounds applied during the run, both flagged for upstream: * String column type for sort grid: rf.String exists in 1.0.0 but Table.from_csv() asks for c.ray_name and String doesn't expose one, so we can't request a RAY_STR column at load time. Fall back to Symbol for str8/str16 in the sort grid — same scan path the ~/rayforce/bench/h2o/q*.rfl examples use. * xasc syntax: backtick form (xasc t `id1) parses as an error in 1.0.0; switched to (xasc t 'id1) for single key and (xasc t [id1 id2 id3]) for multi-key. Per-adapter coverage (ops × sizes): every embedded engine: full 16 ops on 10/100, full 18 ops on 1k+ rayforce: full 16/16/18/18/17/18 — one 100k sort_f64 lost to a rayforce-py worker crash that we already see in the 10M join path (Repeated load+save of large right tables crashes the wrapper — separate bug for Anton, see /tmp/rayforce-csvread-repro for the related .csv.read perf gap). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scaling_runner now accepts -i/--iterations, -w/--warmup (override the
adaptive 21/7/5/3 staircase) and --metric min|median (default median).
Server engines added to the scaling chart at every size, run with
1 warmup + 2 timed + min aggregation — fewer iterations because
Docker round-trip dominates the small-N timing anyway, and "best of N"
gives a clean lower bound. On 10..100 rows the curves flatten into
"network overhead" territory — that's exactly the diagnostic value
the user asked for: see how unusable QuestDB / Timescale become at
small scale.
Coverage matrix now:
embedded engines (rayforce / duckdb / polars / pandas / chdb /
datafusion): full 16/16/18/18/18 × 6 sizes
server engines (questdb / timescale): full 9/9/11/11/11/11 × 6 sizes
(no sort grid — they don't
implement run_sort_typed_full)
747 data points total.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
run_sort_typed_full implemented for both server engines:
* questdb: ILP load with the appropriate column type (BYTE/SHORT/INT/
LONG/DOUBLE/SYMBOL). u8 widens to SHORT — QuestDB has no UINT.
str8/str16 land on SYMBOL (low-cardinality dictionary) which is the
natural type for the financial / market-data segment QuestDB targets.
* timescale: CREATE TABLE with PostgreSQL-native type then COPY-from-
STDIN. SMALLINT/SMALLINT/INTEGER/BIGINT/DOUBLE PRECISION/TEXT.
PostgreSQL has no UINT8 either; SMALLINT covers 0..255 safely.
sort_grid_worker.py and scaling_runner.SORT_GRID_ADAPTERS now allow
both server engines through the typed-sort path.
Final scaling coverage matrix (op count per adapter × size):
10 100 1000 10000 100000 1000000
rayforce: 16/16 16/16 18/18 18/18 17/18 18/18
every other: 16/16 16/16 18/18 18/18 18/18 18/18
831 data points total. Single rayforce miss (sort_f64 at 100k) carries
over from the earlier rayforce-py worker-crash pattern documented in
the prior commits — not a fix-now item.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
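The per-dtype column choices described above, sketched as lookup tables; the assignments explicitly called out (u8 widened to SHORT, strings to SYMBOL/TEXT) follow the commit message, while the dict layout itself is an assumption about how the adapters organise it:

```python
# Typed-sort column mapping for the two server engines.
QUESTDB_SORT_TYPES = {
    "u8": "SHORT",        # no unsigned types in QuestDB; SHORT covers 0..255
    "i16": "SHORT",
    "i32": "INT",
    "i64": "LONG",
    "f64": "DOUBLE",
    "str8": "SYMBOL",     # dictionary-encoded, QuestDB's natural categorical type
    "str16": "SYMBOL",
}

TIMESCALE_SORT_TYPES = {
    "u8": "SMALLINT",     # PostgreSQL has no UINT8 either
    "i16": "SMALLINT",
    "i32": "INTEGER",
    "i64": "BIGINT",
    "f64": "DOUBLE PRECISION",
    "str8": "TEXT",
    "str16": "TEXT",
}
```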
The pre-existing histogram.html was generated from the very first 10m run, before xasc syntax fix and before the join data was merged in. It had rayforce sort_* as null and missing join_inner/join_left rows for everyone. Re-rendered from docs/data_10m.json (the merged, complete dataset) so it now matches docs/index.html in coverage: all 8 adapters × 11 ops. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The preset buttons (Groupby / Join / Sort H2O / Sort typed) and the All button under "Operations" added clutter for the actual workflow: the viewer toggles individual ops on demand. Only the None button stays — quick way to clear the chart back to one explicit selection. Engines panel (All / None) untouched. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Task Breakdown was a per-op bar chart (8 adapters sorted by median). Same ranking is already legible from the heights in Overview's fast/heavy charts, so the extra section was redundant. It also hardcoded six tabs (groupby_q1..q6) — every op past q6 was silently invisible. Removed the markup (section, .task-tabs / .task-panel divs, six hardcoded chart-containers), the JS (initTaskCharts, updateTaskChart, showTask, taskCharts), and the orphan CSS rules. Lazy-load observer now only resizes the two surviving overview charts (fast / heavy). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…scaling bench.runner's all mode expected join's left.csv/right.csv to live in the same directory it was given for groupby — they live in data/join_<n>x<n>/ instead, so all-mode joins always failed silently (median_ms=0, N/A in the chart). New --join-data flag lets the orchestrator point each suite at its own dataset; Makefile bench-all target now forwards it automatically based on SIZE. Makefile bench-scaling now forwards ITERATIONS / WARMUP — earlier the scaling sweep ignored them and stuck to the adaptive 21/7/5/3 staircase even when the user asked for fixed counts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
generate_html_report no longer patches index.html via regex. It writes
two artifacts and stops:
docs/data.json — pretty-printed dataset (tooling, manifest, share)
docs/data.js — window.chartData = {...}; one-line module
index.html now ships as a static file and pulls the dataset via a plain
<script src="data.js"></script> include. No JSON parser, no fetch and
its file:// CORS quirks, no regex over data that could legitimately
contain '};' in some future schema. data.js is just a JS file the
browser already trusts to set globals.
The 70KB of inline data dropped from index.html as a side effect.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
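In outline, the report writer now does no more than this; paths and the dataset shape are illustrative:

```python
# Write the two artifacts: pretty JSON for tooling, plus a one-line data.js
# that index.html pulls in with a plain <script src="data.js"> include.
import json
from pathlib import Path


def write_report_data(dataset: dict, docs_dir: Path) -> None:
    docs_dir.mkdir(parents=True, exist_ok=True)
    # data.json: human-diffable, shareable, consumable by other tools.
    (docs_dir / "data.json").write_text(json.dumps(dataset, indent=2))
    # data.js: the browser sets window.chartData without fetch() or a JSON
    # parser, so file:// viewing works with no CORS quirks.
    (docs_dir / "data.js").write_text(
        "window.chartData = " + json.dumps(dataset) + ";\n"
    )
```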
The two engine selectors under "Detailed Results > Compare Databases" hardcoded the original five adapters (rayforce / duckdb / polars / questdb / timescale). After we added pandas, chdb, and datafusion in 2be5755 the markup was never extended, so those three never appeared in the side-by-side comparison. Both dropdowns now list all eight, grouped by purpose (embedded engines first, then server engines) and alphabetized within each group. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scaling_runner now accepts --stop-infra (parity with bench.runner) and the Makefile bench-scaling target forwards STOP_INFRA when ALL=1. Without this, a scaling sweep that included questdb / timescale left their containers running after exit, holding multi-GB of buffer pool and query-plan caches in RAM. Server engines don't release that on psycopg connection close — the only way is to stop the container. With --stop-infra the runner does that as the last step. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bench / bench-* and bench-scaling now declare file dependencies on .venv/bin/python and on $(GROUPBY_DATA)/data.csv / $(JOIN_DATA)/left.csv, so make creates whatever is missing before running the bench: * No .venv yet → python3 -m venv .venv + pip install requirements. * No data/groupby_<SIZE>_k100/ yet → bench.generate groupby. * No data/join_<SIZE>x<SIZE>/ yet → bench.generate join. * Everything already there → straight to the bench, no rebuild. PYTHON now defaults to .venv/bin/python (the file target makes sure it exists), so `make bench-all` works out of the box on a fresh checkout without the user remembering to run `make setup` and `make data` first. Also rename the per-row count in the runner output: "rows=N" -> "result=N rows". The previous label read like input row count; "result" is unambiguous. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ALL=1 was overloaded against `make bench-all` (run all suites). New DOCKER=ON makes the intent unambiguous: "switch on the Docker-backed engines (QuestDB + TimescaleDB)". Strict ifeq match means typos like DOCKER=0 or DOCKER=off don't accidentally enable them — only the explicit ON value does. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bench.runner already auto-started questdb / timescale containers when they were in the adapter list. scaling_runner didn't, so a sweep with DOCKER=ON gave 'Connection refused' on every server-engine point — the recent run was 220 errored entries out of 836 because of this. Now scaling_runner calls start_required_infrastructure(adapters) before running the sweep, mirroring runner.py. If a container fails to come up, that adapter is dropped from the run with a warning instead of poisoning the chart with 200+ identical 'Connection refused' rows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eout
Two bugs surfaced in the latest 1M run:
1. str8 / str16 measurements showed median=0.36ms / rows=0 — i.e. the
sort ran against an empty table because the 30s ILP commit deadline
silently expired and we proceeded anyway. Now: 120s deadline and
raise RuntimeError on miss, so a row that didn't load shows up as
ERROR rather than as a fake "QuestDB sorts 1M strings in 0.36ms".
2. Random 1M unique strings were going through ILP `symbols={...}` —
QuestDB Symbol is a dictionary type for low-cardinality categoricals,
not a general string column. Switched to ILP `columns={...}`, which
maps to STRING and handles unique values per row without the symbol-
dictionary bottleneck.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rayforce-py 1.0.0 sometimes crashes between the timed block and the
JSON write — the parent then sees "Expecting value: line 1 column 1"
with no idea why. Random pattern, not deterministic by op or row count
(seen on q3/q7/join_inner/sort_single across the same sweep where
q1/q2/q4/q5/q6 succeeded under identical conditions).
Worker calls in scaling_runner now go through _run_worker_with_retry:
- capture_output=True so we get stderr
- one retry on empty/missing JSON; second subprocess in a fresh
Python interpreter clears whatever state crashed the first
- on final failure, error message includes the last 3 lines of the
worker's stderr instead of the cryptic "Expecting value" string
Retries print a "retry [adapter/op n=N]: ..." line so the user can see
flakiness even when it eventually succeeded. Other adapters never hit
this path in practice; the cost is negligible.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
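A sketch of the retry wrapper; flag handling and the result-file convention are illustrative:

```python
# Capture stderr, retry once in a fresh interpreter on empty/missing JSON,
# and surface the stderr tail instead of the bare "Expecting value" message.
import json
import subprocess
from pathlib import Path


def run_worker_with_retry(cmd: list[str], result_path: Path, label: str, retries: int = 1) -> dict:
    last_stderr = ""
    for attempt in range(retries + 1):
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=600)
        last_stderr = proc.stderr
        try:
            return json.loads(result_path.read_text())
        except (FileNotFoundError, json.JSONDecodeError):
            if attempt < retries:
                print(f"retry [{label}]: worker produced no JSON, relaunching")
    tail = "\n".join(last_stderr.splitlines()[-3:])
    raise RuntimeError(f"{label}: worker failed after {retries + 1} attempts\n{tail}")
```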
For each benchmark (and for scaling, each (op, size) pair) all adapters
that successfully returned a result should agree on the number of output
rows. Disagreement = either a SQL-semantics bug in one adapter (e.g.
NULL handling, distinct vs. not-distinct join) or a real engine
difference worth knowing about.
Both runner.py and scaling_runner.py now print a "Row-count validation"
block at the end:
Row-count validation:
OK — all 11 benchmark(s) returned the same row count from every adapter
or, when somebody disagrees:
Row-count validation:
WARNING — 2 benchmark(s) disagree across adapters:
groupby_q7: chdb=10000, duckdb=9998, polars=10000
So far our existing data passes this check on canonical H2O 10M; if a
new schema or query introduces a divergence, the bench loudly says so
instead of silently averaging incompatible results.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
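The validation pass itself is small; a sketch assuming each result record carries benchmark, adapter, and rows fields:

```python
# Group result rows by benchmark and flag any benchmark where adapters
# disagree on output row count.
from collections import defaultdict


def validate_row_counts(results: list[dict]) -> None:
    by_bench: dict[str, dict[str, int]] = defaultdict(dict)
    for r in results:
        if r.get("rows") is not None:
            by_bench[r["benchmark"]][r["adapter"]] = r["rows"]

    disagreements = {
        bench: counts for bench, counts in by_bench.items()
        if len(set(counts.values())) > 1
    }
    print("Row-count validation:")
    if not disagreements:
        print(f"  OK — all {len(by_bench)} benchmark(s) returned the same "
              "row count from every adapter")
        return
    print(f"  WARNING — {len(disagreements)} benchmark(s) disagree across adapters:")
    for bench, counts in sorted(disagreements.items()):
        detail = ", ".join(f"{a}={n}" for a, n in sorted(counts.items()))
        print(f"    {bench}: {detail}")
```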
Add 10m to the default scaling-curve sweep (was 10..1m). Lets make bench-scaling cover the full size range — including the real-world 10M-row scenario — without callers having to spell out SIZES=... manually. Override remains available for shorter runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
docs/: histogram + index (8 dbs / 11 ops / 88 data points, dynamic counters) + scaling sweep with engine/op filters.
Test plan
make bench-all DOCKER=ON ITERATIONS=3 WARMUP=1 runs to completion on a clean checkout.
make bench-scaling DOCKER=ON produces docs/scaling.html with 7 sizes per (engine, op).
🤖 Generated with Claude Code