prototype: H2O bench framework #1

Merged
singaraiona merged 40 commits into RayforceDB:master from
ser-vasilich:prototype
May 8, 2026
Conversation

Contributor

@ser-vasilich ser-vasilich commented May 8, 2026

Summary

  • Full H2O.ai canonical bench (groupby q1..q7, join_inner/left, sort_single/multi) across 8 engines: rayforce, polars, duckdb, chdb, datafusion, pandas, questdb, timescale.
  • Subprocess-isolated workers, identical Python-level timing for every adapter, swap monitor, SHA256 manifest, cross-adapter row-count validation.
  • Static reports under docs/: histogram + index (8 dbs / 11 ops / 88 data points dynamic counters) + scaling sweep with engine/op filters.

Test plan

  • make bench-all DOCKER=ON ITERATIONS=3 WARMUP=1 runs to completion on a clean checkout.
  • make bench-scaling DOCKER=ON produces docs/scaling.html with 7 sizes per (engine, op).
  • Render docs/index.html in Chrome, hard-refresh, verify hero counters reflect 8/11/88 and table rebuilds dynamically.

🤖 Generated with Claude Code

ser-vasilich and others added 30 commits May 4, 2026 14:54
Each benchmark operation now runs in its own child process via bench.worker.
Memory is guaranteed released between runs, and a single engine crash no
longer aborts the whole suite. Pattern borrowed from teide-bench.

bench/worker.py — child entrypoint. Lazy-imports the adapter, runs warmup
+ measured iterations, writes JSON to --result, exits via os._exit() to
skip Python cleanup (some engines segfault on exit).

bench/runner.py — orchestrator. Spawns workers via subprocess.run with a
600s timeout, reads back JSON, aggregates into BenchmarkRun objects. The
old in-process BenchmarkRunner class is replaced by an OrchestratorConfig
+ run_suite() function.
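The orchestrator/worker split can be sketched as follows. The inline child below stands in for the real `python -m bench.worker` invocation; the CLI flags and result schema here are assumptions, not the actual bench API:

```python
import json
import subprocess
import sys
import tempfile

def run_one(adapter: str, op: str, timeout_s: float = 600.0) -> dict:
    """Spawn one benchmark operation in a child process and read its JSON
    result back. Hypothetical shape of the runner/worker handshake."""
    with tempfile.NamedTemporaryFile(suffix=".json", delete=False) as f:
        result_path = f.name
    child = (
        "import json, os, sys; "
        "json.dump({'adapter': sys.argv[1], 'op': sys.argv[2], 'ms': [1.0]}, "
        "open(sys.argv[3], 'w')); "
        "os._exit(0)"  # exit without Python cleanup, like the real worker
    )
    subprocess.run(
        [sys.executable, "-c", child, adapter, op, result_path],
        timeout=timeout_s,  # 600s cap, matching the orchestrator
        check=True,
    )
    with open(result_path) as f:
        return json.load(f)
```

A crash or timeout in one child surfaces as a single failed result instead of aborting the suite.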

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the OS pages out, results stop reflecting engine performance and
start reflecting disk I/O. swapcheck.py samples psutil.swap_memory()
before and after each operation; runner warns when growth exceeds 100MB
or when swap is already in use at startup.

Threshold matches the conventional "noise floor" for analytical workloads
on workstations — smaller deltas usually come from unrelated background
activity, larger ones almost always indicate the dataset doesn't fit in
RAM.
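The decision logic reduces to a pure function over two samples; the real swapcheck.py gets both inputs from psutil.swap_memory().used (psutil is left out so this sketch stays dependency-free):

```python
# 100 MB growth threshold — the conventional noise floor for analytical
# workloads on workstations, per the commit message.
SWAP_WARN_BYTES = 100 * 1024 ** 2

def swap_warnings(before_used: int, after_used: int) -> list[str]:
    """Classify a before/after pair of swap-usage samples, in bytes."""
    warnings = []
    if before_used > 0:
        warnings.append("swap already in use at startup")
    if after_used - before_used > SWAP_WARN_BYTES:
        warnings.append("swap grew more than 100MB during the operation")
    return warnings
```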

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaced eval_str("(timeit ...)") wrapper with time.perf_counter_ns
around eval_str(query). The other adapters (duckdb, polars, questdb,
timescale) already time externally via Adapter._time_it; rayforce was
the only one measuring inside the engine, which excluded Python-binding
overhead and skewed comparisons in its favor.
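The uniform Python-level timing amounts to a perf_counter_ns bracket around the engine call — a sketch of the _time_it idea; the real helper's name and signature may differ:

```python
import time

def time_it(fn, *args, **kwargs):
    """Time a single engine call at the Python level, so binding overhead
    is included identically for every adapter. Returns (result, ms)."""
    t0 = time.perf_counter_ns()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter_ns() - t0) / 1e6
    return result, elapsed_ms
```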

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bench/engine_source.py — clones the requested rayforce-py branch into
.deps/rayforce-py-branch-<name>/ (or fast-forwards an existing checkout)
and returns the directory. The orchestrator passes it to the worker as
--rayforce-local, so a branch behaves exactly like a local clone.

engine_label() emits 'rayforce@branch (commit) dirty' for reports — same
shape as teide-bench used. Wired into OrchestratorConfig but only
threaded into reports in a later commit.

.deps/ ignored.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
py mode: existing RayforceAdapter (rayforce-py wrapper, PyPI when shipped).
rfl mode: new RayforceRflAdapter that drives the native binary via
generated .rfl scripts — same approach teide-bench used for rayforce2,
adapted to the new API (left-join / inner-join / xasc / select).

Each call to run_full() builds one .rfl with read-csv outside (timeit ...),
n_warmup blind runs, then n_iter measured runs that each println their
ms — so a single binary invocation does the whole warmup + measurement
cycle and we don't pay CSV-read cost per iteration.

Adapter.run_full() is the new orchestration hook on the base class. The
default implementation matches the old worker loop (per-iter calls);
RayforceRflAdapter overrides it because per-iter binary launches would
re-parse the CSV every time.

CLI: --rayforce-mode {py,rfl} + --rayforce-bin <path>. Default py because
once rayforce-py lands on PyPI that becomes the canonical path; today rfl
is the practical fallback (rayforce-py not yet released).
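The run_full hook might look roughly like this; `run_one` and the exact signature are assumptions, not the real bench API:

```python
import time

class Adapter:
    """Base-class sketch. run_one is a hypothetical per-query hook."""

    def run_one(self, query):
        raise NotImplementedError

    def run_full(self, query, n_warmup: int, n_iter: int) -> list[float]:
        """Default orchestration: per-iteration calls, matching the old
        worker loop. An adapter like RayforceRflAdapter would override
        this to batch the whole warmup + measurement cycle into one
        binary invocation and avoid re-parsing the CSV each time."""
        for _ in range(n_warmup):            # blind warmup runs
            self.run_one(query)
        times_ms = []
        for _ in range(n_iter):              # measured runs
            t0 = time.perf_counter_ns()
            self.run_one(query)
            times_ms.append((time.perf_counter_ns() - t0) / 1e6)
        return times_ms
```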

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Optional benchmark mode that complements the standard H2O sort (s1, s6)
with a typed scaling curve. Random pattern only — focus is throughput
per type, not stability under partially-sorted input.

bench/generators/sort_grid.py — random columns at 9 points per decade
up to a configurable max. dtypes: u8 i16 i32 i64 f64 str8 str16. The
str8/str16 split exists to surface RAY_STR's SSO boundary at 12 bytes:
str8 stays inline, str16 spills to the pool, and the same effect
applies to DuckDB VARCHAR (12-byte inline) and Polars Utf8.

Adapter.run_sort_typed_full(csv, dtype, n_warmup, n_iter) is the new
optional hook. duckdb / polars / rayforce-py / rayforce-rfl implement
it; questdb / timescale don't (excluded from grid by default — Docker
overhead and SQL setup cost dwarf the actual sort).

bench/sort_grid_runner.py + sort_grid_worker.py mirror the H2O
orchestrator/worker split: each (adapter, dtype, length) triple is
its own subprocess. Default 3 iterations, 1 warmup — fewer than the
H2O suite because the grid sweeps O(adapters × dtypes × lengths)
combinations and we want the whole thing to fit in a coffee break.
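"9 points per decade" resolves to a simple size grid — a sketch; the real bench/generators/sort_grid.py also emits the typed columns:

```python
def grid_sizes(max_n: int) -> list[int]:
    """Row counts at 9 points per decade (10, 20, ... 90, 100, 200, ...)
    up to a configurable max."""
    sizes: list[int] = []
    decade = 10
    while decade <= max_n:
        for k in range(1, 10):
            n = k * decade
            if n > max_n:
                break
            sizes.append(n)
        decade *= 10
    return sizes
```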

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
generate_histogram_html — self-contained Plotly bar chart, log-Y, grouped
by adapter. Same shape teide-bench used for results/bench.html, kept
deliberately ascetic so it works without docs/index.html infrastructure
and screenshots from both repos look comparable.

generate_sort_grid_html — log-log scaling curve fed from
docs/sort_data.json. One trace per (adapter, dtype) pair: color encodes
the engine, line dash distinguishes the dtype within a color group.
Legend supports group-toggle so the viewer can isolate one engine or
one dtype across engines.

Wired into runner.py (writes docs/histogram.html alongside index.html)
and sort_grid_runner.py (writes docs/sort.html alongside sort_data.json).

ENGINE_COLORS palette stays compatible with teide-bench so cross-repo
screenshots remain visually consistent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Makefile: bench-sort-ext target for the extended grid, RAYFORCE_MODE /
RAYFORCE_BIN / SORT_MAX / SORT_DTYPES knobs, fixed JOIN_DATA path to
match what bench.generate actually emits (joinNxM uses 'x' not '_').

README: prototype-branch summary at the top, rayforce execution-mode
explainer, extended sort grid section with the str8/str16 SSO rationale,
roadmap pointing at ClickBench / TPC-H / JOB next.

rayforce_rfl_adapter: read-csv → .csv.read. The reference scripts in
~/rayforce/bench/h2o/*.rfl still use the old read-csv name but the
current binary registers .csv.read (eval.c:2181) — those .rfl files
are stale.

runner: dependency check no longer hard-fails when an unrequested
adapter is missing. Worker fails cleanly if the user actually picks an
unavailable adapter.

Smoke-tested locally:
  - groupby (duckdb + polars) on 10k rows
  - groupby (rayforce rfl mode) — works without rayforce-py installed
  - extended sort grid (rayforce rfl + duckdb + polars, max=100)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three new engines round out the embedded competitor set:

* chdb — embedded ClickHouse via the chdb Python package. Lets us measure
  against ClickHouse without running a server. Session API so CREATE
  TABLE state persists across queries within one adapter instance.

* pandas — slow baseline included for context. Almost everyone reading
  the report has a mental model calibrated against pandas; the "of
  course pandas is slow" column makes the rest of the chart legible.

* datafusion — Apache Rust+Arrow query engine. It's the substrate of
  InfluxDB 3, GlareDB, ROAPI, RisingLight and Sail, so measuring against
  it covers the Apache columnar ecosystem rather than just one product.

bench/adapters/__init__.py now imports each adapter lazily — a missing
optional dep no longer breaks the whole module. print_dependency_status
lists every recognized engine so the user sees what's installed.
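The lazy-import pattern is a few lines; the module naming in the real bench/adapters/__init__.py is not shown here, so this is a generic sketch:

```python
import importlib

def lazy_import(module_name: str):
    """Import an optional dependency on demand, returning None when it is
    missing — so one absent engine package (e.g. chdb) only matters if
    that engine is actually requested."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None
```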

Default adapters for both runners now: rayforce, duckdb, polars, chdb,
datafusion, pandas. ALL=1 still adds questdb + timescale.

Smoke-tested locally on 10k groupby + sort grid on str8/i64 to 100 rows
— all 5 new adapters return sensible numbers. Color palette in
report.py extended for the new engines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
questdb: ILP ingestion is async — the table (on first write) and rows
both lag the flush by ~1s. The old loader returned immediately after
sender.flush(), so the first benchmark query landed on an empty or
nonexistent table. Now wait_for_commit polls SELECT count(*) until the
visible row count matches the load, swallowing 'table does not exist'
during the racy window. 30s timeout matches QuestDB's worst-case commit
cadence.
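The polling contract can be sketched engine-agnostically; count_fn stands in for the SELECT count(*) round-trip, and the helper name mirrors the commit message:

```python
import time

def wait_for_commit(count_fn, expected_rows: int,
                    timeout_s: float = 30.0, poll_s: float = 0.05) -> None:
    """Poll until the visible row count matches the load. count_fn may
    raise while the table does not exist yet during the racy window."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if count_fn() >= expected_rows:
                return
        except Exception:
            pass  # 'table does not exist' during the async ILP commit
        time.sleep(poll_s)
    raise RuntimeError(f"rows not visible after {timeout_s}s")
```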

timescale: post_start retry loop was too short (5×2s = 10s). Postgres
opens its port before initdb finishes, so the port-ready check returns
long before psql can connect. Bumped to 15×2s = 30s, matching the
ready_timeout we wait on.

Both surfaced as silent N/A or 'database benchmark does not exist' on
the all-8-adapter snapshot — both are pre-existing bugs unrelated to
the prototype refactor, but blocked clean cross-engine numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Generators now produce data byte-identical to ~/rayforce/bench/h2o/*.rfl
expectations and to teide-bench's gen/generate.py:

  groupby:  id1..id3 string, id4..id6 int64, v1 int[1..5],
            v2 int[1..15], v3 float[0..100) 6dp
  join:     id1..id3 int64, id4..id6 string, v1 or v2 float

Cross-machine determinism: PCG64 (stable since numpy 1.17, doesn't
shift on default_rng changes) + SHA256 of every emitted file in
manifest.json. Two users on different machines must see the same hash
for the same (n_rows, k, seed) — if they don't, generator changed and
benchmark numbers are no longer comparable.
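The manifest side of the contract needs nothing beyond stdlib hashlib; field names and layout here are assumptions about what manifest.json contains:

```python
import hashlib
import json
from pathlib import Path

def write_manifest(csv_paths, out_path="manifest.json") -> dict:
    """SHA256 every emitted file so two machines can diff generator
    output byte-for-byte. Returns the filename -> hexdigest mapping."""
    manifest = {}
    for p in map(Path, csv_paths):
        manifest[p.name] = hashlib.sha256(p.read_bytes()).hexdigest()
    Path(out_path).write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest
```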

The schema bump from 6 → 9 columns is what unlocks q7 (6-key groupby)
in the next commit, and aligns join key shapes (int IDs + string
sides) so the adapters have something to actually stress different
join paths against.

Default --right-rows for join goes 1m → 10m to match canonical H2O J1
(left and right tables the same size by default; previous 1/10 ratio
was a project-specific shortcut).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Canonical H2O q7: SUM(v3), COUNT(v1) GROUP BY id1..id6. With the new
schema this stresses high-cardinality hashing on a mix of string and
integer keys — which is exactly where engines diverge most sharply.
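The q7 shape — SUM(v3), COUNT(v1) over all six id columns — can be illustrated with stdlib sqlite3 on a toy 9-column table (sqlite is not one of the benchmarked engines; this only shows the query shape each adapter expresses in its own dialect):

```python
import sqlite3

# Toy canonical-schema table: id1..id3 string, id4..id6 int, v1/v2 int, v3 float.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE groupby (id1, id2, id3, id4, id5, id6, v1, v2, v3)")
con.executemany(
    "INSERT INTO groupby VALUES (?,?,?,?,?,?,?,?,?)",
    [("a", "b", "c", 1, 2, 3, 1, 1, 0.5),
     ("a", "b", "c", 1, 2, 3, 2, 1, 1.5),
     ("x", "y", "z", 4, 5, 6, 3, 1, 2.0)],
)
Q7 = """
    SELECT id1, id2, id3, id4, id5, id6, SUM(v3) AS v3, COUNT(v1) AS cnt
    FROM groupby
    GROUP BY id1, id2, id3, id4, id5, id6
"""
result = sorted(con.execute(Q7).fetchall())
```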

Adapter changes mirror the schema bump from 6 → 9 columns:

* duckdb / polars / pandas / chdb / datafusion: SQL or DataFrame q7,
  pulling all six id columns through GROUP BY.

* rayforce-py: _get_column_types now returns Symbol for id1..id3 (string
  IDs) and I64 for id4..id6 + v1/v2, F64 for v3. Falls back to STR if
  Symbol isn't exposed by the wrapper. The first-data-row sniff also
  disambiguates groupby vs join layout (where id4..id6 are strings).

* rayforce-rfl: GROUPBY_SCHEMA is now [SYMBOL SYMBOL SYMBOL I64 I64 I64
  I64 I64 F64], JOIN_SCHEMA is [I64 I64 I64 SYMBOL SYMBOL SYMBOL F64].
  Matches ~/rayforce/bench/h2o/q*.rfl byte-for-byte.

* questdb / timescale: SQL q7 only — ILP and COPY already handled the
  string IDs correctly, no schema fix needed.

Smoke on 10k canonical groupby (id3 high-cardinality → ~10k groups in q7):
  rayforce 9ms, polars 3ms, duckdb 9ms, pandas 11ms, datafusion 14ms,
  chdb 14ms, questdb 18ms, timescale 21ms. The high-cardinality hash
  paths separate engines much more visibly than q1..q5 on 100 groups.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bench.scaling_runner sweeps a list of sizes (default 10..1m) and runs
each adapter through every H2O op plus the typed-sort grid at each size,
producing one JSON suitable for an interactive log-log scaling chart.

Adaptive iteration counts borrow teide-bench/sort_bench_multi staircase:
n≤100→21/5, ≤100k→7/3, ≤10m→5/2, larger→3/1. Tiny inputs need many
runs to drown out the perf_counter floor (~50µs); huge inputs are
already slow so we cut down. Joins skipped under 1000 rows — both
sides are tiny and the curve adds nothing.
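The staircase is a direct transcription into a lookup function (name is an assumption):

```python
def iter_counts(n_rows: int) -> tuple[int, int]:
    """(measured_iterations, warmup_runs) for a given input size, per the
    adaptive staircase: n<=100 -> 21/5, <=100k -> 7/3, <=10m -> 5/2,
    larger -> 3/1."""
    if n_rows <= 100:
        return 21, 5
    if n_rows <= 100_000:
        return 7, 3
    if n_rows <= 10_000_000:
        return 5, 2
    return 3, 1
```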

generate_scaling_html in bench.report adapts teide-bench's
sort_bench_plot.py: two checkbox groups (Engines + Operations), All/None
buttons, plus four preset buttons that one-click switch the op filter
between groupby/join/sort-h2o/sort-typed. Plotly.react redraws on
every toggle. One trace per (engine, op) pair: engine→colour,
op→line-dash + marker symbol. Default-on a starter triple
(groupby_q1 + sort_i64 + sort_str8) so the page isn't a wall of lines
on first load.

Smoke on 4 adapters × 100,1k,10k → 172 data points, JSON 66KB,
HTML 14KB.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Makefile:
* SIZE default 1m → 10m. Canonical H2O is benchmarked at 10m+ rows; 1m
  was a hold-over from quick-iteration days. 10k/100k/1m still
  available via SIZE=...
* New bench-scaling target with default SIZES=10,100,1k,10k,100k,1m,
  driving bench.scaling_runner. Skips Docker engines by default since
  the sweep generates many subprocess spawns and TSDB engines are
  already disproportionately slow.
* JOIN_DATA paths follow canonical H2O: equal-size left and right
  tables (J1 standard), so 10mx10m instead of the previous 10mx1m.

README:
* Quick-start switches to make bench-scaling as the showcase command.
* New Reproducibility section explains the SHA256 manifest contract:
  same seed + size + machine produce byte-identical CSVs, mismatch
  means the generator changed.
* GroupBy section lists q7 alongside q1..q6 and documents the 9-column
  canonical schema (id1..id3 string, id4..id6 int).
* Join section notes the inverted spread (int keys + string sides).
* Scaling sweep section explains adaptive iter_counts and the engine/op
  filter UI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
FAIRNESS.md directly contradicted current behaviour — it claimed
rayforce uses internal (timeit ...) and that id1..id3 were int64. Both
were true two months ago and are false now. The rewrite covers what the
prototype branch actually does:

  * Python-level perf_counter_ns around every engine call (no more
    asymmetric internal timeit).
  * Subprocess isolation as the primary mechanism for fairness.
  * Canonical H2O 9-col schema + per-engine type mapping table covering
    all 8 adapters (chdb / datafusion were missing entirely).
  * Adaptive iteration counts table for bench-scaling.
  * SHA256 manifest contract — the verifiability story.
  * Swap monitor — what the warnings mean and when to trust the number.
  * Explicit list of what's deliberately excluded (server engines from
    sort-ext, partial-sort patterns, nullable workloads, value-level
    cross-engine comparison).
  * Source-file pointers throughout so claims are checkable.

README sections that were equally stale:

  * Project Structure missed worker.py, scaling_runner.py,
    sort_grid_runner.py, sort_grid_worker.py, engine_source.py,
    swapcheck.py, and four adapter files (chdb, datafusion, pandas,
    rayforce_rfl). Updated tree shows the real layout.

  * Data Format documented the old 6-col int-id schema. Now lists the
    canonical 9-col groupby and 7-col join layouts with example values
    and a snippet showing how to verify SHA256 across machines.

  * "Benchmarking with Local Rayforce Build" only knew about
    --rayforce-local. Now covers all three rayforce flows: --rayforce-local
    (path), --rayforce-branch (clone), and --rayforce-mode rfl
    (native binary, no rayforce-py needed).

  * "Server-Based Adapters" referenced make targets that don't exist
    (make infra-start/stop/status/cleanup). Replaced with the actual
    interface — ALL=1 for auto-start, python -m bench.infra for manual.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3-key inner/left join across all 8 adapters (was: only id1, which on
canonical H2O J1 — both tables 10M, id1 cardinality 100 — produced a
10^12-row Cartesian-ish result and OOMed). Now matches teide-bench and
~/rayforce/bench/h2o/j1.rfl: ON id1 = id1 AND id2 = id2 AND id3 = id3.

* duckdb / polars / pandas / chdb / datafusion / timescale: SQL/expr
  with three keys.
* questdb: implements joins for the first time (was NotImplementedError);
  loads the right side via ILP, waits for commit, joins on three keys.
* rayforce: (inner-join [id1 id2 id3] left right) instead of
  (ij `id1 left right) — same canonical form as in
  ~/rayforce/bench/h2o/j1.rfl.
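The 3-key equi-join shape is easy to sketch as a plain hash join over lists of dicts — illustrative only; every real adapter expresses it in its own SQL/expression dialect:

```python
def inner_join_3key(left, right, keys=("id1", "id2", "id3")):
    """Hash inner join on the composite (id1, id2, id3) key. Joining on
    id1 alone at cardinality 100 is what produced the Cartesian-ish
    blow-up; the 3-key form keeps the output bounded."""
    index: dict[tuple, list] = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    joined = []
    for row in left:
        for match in index.get(tuple(row[k] for k in keys), []):
            joined.append({**match, **row})
    return joined
```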

Drop rfl mode entirely. Reasoning: the rfl path went through the
.csv.read builtin which produces a table without the hash index that
Operation.READ_CSV + binary_set attaches. Net effect was every
Symbol-keyed select rehashed 10M rows from scratch (~30x slowdown,
including a hard timeout on q6). With rayforce-py 1.0.0 now on PyPI,
keeping rfl as a "fallback" only re-introduces an asymmetric timing
path. Single Python entry for every engine, period.

Removed:
* bench/adapters/rayforce_rfl_adapter.py
* --rayforce-mode / --rayforce-bin CLI flags from runner / worker /
  scaling_runner / sort_grid_runner / sort_grid_worker
* RAYFORCE_MODE / RAYFORCE_BIN Makefile knobs
* rfl sections in README and FAIRNESS.md

Smoke on 10k canonical H2O groupby (rayforce + duckdb + polars + chdb,
all py-mode):
  rayforce: 0.10-1.15ms (q1..q7)
  polars:   3.87-6.00ms — ~25x slower
  chdb:     5.03-14.67ms — ~37x slower
  duckdb:   7.04-19.54ms — ~52x slower

A reproducer for the .csv.read perf gap was packaged separately for
Anton (see /tmp/rayforce-csvread-repro.tar.gz) — that's an upstream bug,
not something to work around in the bench harness.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…esults

* rayforce_adapter sort: backtick form (xasc t `id1) is not parsed by
  current rayforce; switched to canonical (xasc t 'id1) for single-key
  and (xasc t [id1 id2 id3]) for multi-key. Smoke-tested on 10k and 10m.

* docs/index.html overview: split the single bar chart into "Fast queries
  (groupby q1..q6)" and "Heavy queries (q7, joins, sorts)" so the
  multi-second q7/sort entries don't flatten the sub-second q1..q6 group
  to invisible slivers. Same _buildBarOption helper feeds both, FAST_TASKS
  set decides the partition.

* 10M results merged from three runs (groupby+sort partial first, then
  join with right.csv from data/join_10mx10m, then a 1-iter rayforce
  join because rayforce-py crashes on repeated 10M-row right-table
  reloads — likely a memory leak in the wrapper, separate bug for Anton).

Headline numbers (10M, median ms):
  rayforce: groupby 24-1747, join 511/627, sort 5264/19787
  next-fastest on each op trails by 3x-15x except sort_multi (pandas wins).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bench.scaling_runner sweep across sizes 10, 100, 1k, 10k, 100k, 1m on
rayforce + duckdb + polars + pandas + chdb + datafusion. 623 data
points → docs/scaling_data.json + interactive log-log chart in
docs/scaling.html with engine + op filters and preset buttons (groupby /
join / sort H2O / sort typed).

Two rayforce-py workarounds applied during the run, both flagged for
upstream:

* String column type for sort grid: rf.String exists in 1.0.0 but
  Table.from_csv() asks for c.ray_name and String doesn't expose one,
  so we can't request a RAY_STR column at load time. Fall back to
  Symbol for str8/str16 in the sort grid — same scan path the
  ~/rayforce/bench/h2o/q*.rfl examples use.

* xasc syntax: backtick form (xasc t `id1) parses as an error in 1.0.0;
  switched to (xasc t 'id1) for single key and (xasc t [id1 id2 id3])
  for multi-key.

Per-adapter coverage (ops × sizes):
  every embedded engine: full 16 ops on 10/100, full 18 ops on 1k+
  rayforce: full 16/16/18/18/17/18 — one 100k sort_f64 lost to a
  rayforce-py worker crash that we already see in the 10M join path
  (Repeated load+save of large right tables crashes the wrapper —
  separate bug for Anton, see /tmp/rayforce-csvread-repro for the
  related .csv.read perf gap).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scaling_runner now accepts -i/--iterations, -w/--warmup (override the
adaptive 21/7/5/3 staircase) and --metric min|median (default median).

Server engines added to the scaling chart at every size, run with
1 warmup + 2 timed + min aggregation — fewer iterations because
Docker round-trip dominates the small-N timing anyway, and "best of N"
gives a clean lower bound. On 10..100 rows the curves flatten into
"network overhead" territory — that's exactly the diagnostic value
the user asked for: see how unusable QuestDB / Timescale become at
small scale.

Coverage matrix now:
  embedded engines (rayforce / duckdb / polars / pandas / chdb /
                    datafusion):   full 16/16/18/18/18/18 × 6 sizes
  server engines (questdb / timescale): full 9/9/11/11/11/11 × 6 sizes
                                        (no sort grid — they don't
                                        implement run_sort_typed_full)

747 data points total.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
run_sort_typed_full implemented for both server engines:

* questdb: ILP load with the appropriate column type (BYTE/SHORT/INT/
  LONG/DOUBLE/SYMBOL). u8 widens to SHORT — QuestDB has no UINT.
  str8/str16 land on SYMBOL (low-cardinality dictionary) which is the
  natural type for the financial / market-data segment QuestDB targets.

* timescale: CREATE TABLE with PostgreSQL-native type then COPY-from-
  STDIN. SMALLINT/SMALLINT/INTEGER/BIGINT/DOUBLE PRECISION/TEXT.
  PostgreSQL has no UINT8 either; SMALLINT covers 0..255 safely.

sort_grid_worker.py and scaling_runner.SORT_GRID_ADAPTERS now allow
both server engines through the typed-sort path.

Final scaling coverage matrix (op count per adapter × size):
                    10     100    1000   10000  100000 1000000
  rayforce:         16/16  16/16  18/18  18/18  17/18  18/18
  every other:      16/16  16/16  18/18  18/18  18/18  18/18

831 data points total. Single rayforce miss (sort_f64 at 100k) carries
over from the earlier rayforce-py worker-crash pattern documented in
the prior commits — not a fix-now item.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pre-existing histogram.html was generated from the very first 10m
run, before xasc syntax fix and before the join data was merged in.
It had rayforce sort_* as null and missing join_inner/join_left rows
for everyone.

Re-rendered from docs/data_10m.json (the merged, complete dataset)
so it now matches docs/index.html in coverage: all 8 adapters × 11 ops.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The preset buttons (Groupby / Join / Sort H2O / Sort typed) and the All
button under "Operations" added clutter for the actual workflow: the
viewer toggles individual ops on demand. Only the None button stays —
quick way to clear the chart back to one explicit selection.

Engines panel (All / None) untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Task Breakdown was a per-op bar chart (8 adapters sorted by median).
Same ranking is already legible from the heights in Overview's fast/heavy
charts, so the extra section was redundant. It also hardcoded six tabs
(groupby_q1..q6) — every op past q6 was silently invisible.

Removed the markup (section, .task-tabs / .task-panel divs, six
hardcoded chart-containers), the JS (initTaskCharts, updateTaskChart,
showTask, taskCharts), and the orphan CSS rules. Lazy-load observer
now only resizes the two surviving overview charts (fast / heavy).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…scaling

bench.runner's all mode expected join's left.csv/right.csv to live in
the same directory it was given for groupby — they live in
data/join_<n>x<n>/ instead, so all-mode joins always failed silently
(median_ms=0, N/A
in the chart). New --join-data flag lets the orchestrator point each
suite at its own dataset; Makefile bench-all target now forwards it
automatically based on SIZE.

Makefile bench-scaling now forwards ITERATIONS / WARMUP — earlier the
scaling sweep ignored them and stuck to the adaptive 21/7/5/3 staircase
even when the user asked for fixed counts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
generate_html_report no longer patches index.html via regex. It writes
two artifacts and stops:

  docs/data.json — pretty-printed dataset (tooling, manifest, share)
  docs/data.js   — window.chartData = {...}; one-line module

index.html now ships as a static file and pulls the dataset via a plain
<script src="data.js"></script> include. No JSON parser, no fetch and
its file:// CORS quirks, no regex over data that could legitimately
contain '};' in some future schema. data.js is just a JS file the
browser already trusts to set globals.

The 70KB of inline data dropped from index.html as a side effect.
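Both artifacts can be emitted with nothing but the stdlib; the function name is an assumption:

```python
import json
from pathlib import Path

def write_report_data(dataset: dict, docs_dir) -> None:
    """Write the two report artifacts: pretty-printed JSON for tooling,
    and a one-line JS module that the static index.html pulls in via
    <script src="data.js"></script> — no fetch, no regex patching."""
    docs = Path(docs_dir)
    docs.mkdir(parents=True, exist_ok=True)
    (docs / "data.json").write_text(json.dumps(dataset, indent=2))
    (docs / "data.js").write_text(
        "window.chartData = " + json.dumps(dataset) + ";")
```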

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two engine selectors under "Detailed Results > Compare Databases"
hardcoded the original five adapters (rayforce / duckdb / polars /
questdb / timescale). After we added pandas, chdb, and datafusion in
2be5755 the markup was never extended, so those three never appeared
in the side-by-side comparison.

Now lists all eight in both dropdowns, alphabetically grouped by
purpose: embedded engines first, then server engines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scaling_runner now accepts --stop-infra (parity with bench.runner) and
the Makefile bench-scaling target forwards STOP_INFRA when ALL=1.

Without this, a scaling sweep that included questdb / timescale left
their containers running after exit, holding multi-GB of buffer pool
and query-plan caches in RAM. Server engines don't release that on
psycopg connection close — the only way is to stop the container.
With --stop-infra the runner does that as the last step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bench / bench-* and bench-scaling now declare file dependencies on
.venv/bin/python and on $(GROUPBY_DATA)/data.csv / $(JOIN_DATA)/left.csv,
so make creates whatever is missing before running the bench:

  * No .venv yet → python3 -m venv .venv + pip install requirements.
  * No data/groupby_<SIZE>_k100/ yet → bench.generate groupby.
  * No data/join_<SIZE>x<SIZE>/ yet → bench.generate join.
  * Everything already there → straight to the bench, no rebuild.

PYTHON now defaults to .venv/bin/python (the file target makes sure it
exists), so `make bench-all` works out of the box on a fresh checkout
without the user remembering to run `make setup` and `make data` first.

Also rename the per-row count in the runner output: "rows=N" -> "result=N rows".
The previous label read like input row count; "result" is unambiguous.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ALL=1 was overloaded against `make bench-all` (run all suites). New
DOCKER=ON makes the intent unambiguous: "switch on the Docker-backed
engines (QuestDB + TimescaleDB)". Strict ifeq match means typos like
DOCKER=0 or DOCKER=off don't accidentally enable them — only the
explicit ON value does.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bench.runner already auto-started questdb / timescale containers when
they were in the adapter list. scaling_runner didn't, so a sweep with
DOCKER=ON gave 'Connection refused' on every server-engine point — the
recent run was 220 errored entries out of 836 because of this.

Now scaling_runner calls start_required_infrastructure(adapters) before
running the sweep, mirroring runner.py. If a container fails to come
up, that adapter is dropped from the run with a warning instead of
poisoning the chart with 200+ identical 'Connection refused' rows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ser-vasilich and others added 10 commits May 5, 2026 22:35
…eout

Two bugs surfaced in the latest 1M run:

1. str8 / str16 measurements showed median=0.36ms / rows=0 — i.e. the
   sort ran against an empty table because the 30s ILP commit deadline
   silently expired and we proceeded anyway. Now: 120s deadline and
   raise RuntimeError on miss, so a row that didn't load shows up as
   ERROR rather than as a fake "QuestDB sorts 1M strings in 0.36ms".

2. Random 1M unique strings were going through ILP `symbols={...}` —
   QuestDB Symbol is a dictionary type for low-cardinality categoricals,
   not a general string column. Switched to ILP `columns={...}`, which
   maps to STRING and handles unique values per row without the symbol-
   dictionary bottleneck.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rayforce-py 1.0.0 sometimes crashes between the timed block and the
JSON write — the parent then sees "Expecting value: line 1 column 1"
with no idea why. Random pattern, not deterministic by op or row count
(seen on q3/q7/join_inner/sort_single across the same sweep where
q1/q2/q4/q5/q6 succeeded on identical conditions).

Worker calls in scaling_runner now go through _run_worker_with_retry:
  - capture_output=True so we get stderr
  - one retry on empty/missing JSON; second subprocess in a fresh
    Python interpreter clears whatever state crashed the first
  - on final failure, error message includes the last 3 lines of the
    worker's stderr instead of the cryptic "Expecting value" string

Retries print a "retry [adapter/op n=N]: ..." line so the user can see
flakiness even when it eventually succeeded. Other adapters never hit
this path in practice; the cost is negligible.
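A minimal sketch of the retry idea; here the child's stdout stands in for the worker's --result JSON file, and the helper name mirrors the commit message:

```python
import subprocess
import sys

def run_worker_with_retry(cmd: list[str], retries: int = 1) -> str:
    """Run a worker subprocess, retrying once in a fresh interpreter when
    it dies before producing output. On final failure, report the tail
    of stderr instead of a cryptic JSON-decode error."""
    last = None
    for attempt in range(retries + 1):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode == 0 and proc.stdout.strip():
            return proc.stdout
        last = proc
        if attempt < retries:
            print(f"retry: rc={proc.returncode}")  # surface flakiness
    tail = "\n".join(last.stderr.splitlines()[-3:])
    raise RuntimeError(f"worker failed after {retries + 1} attempts; "
                       f"stderr tail:\n{tail}")
```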

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
For each benchmark (and for scaling, each (op, size) pair) all adapters
that successfully returned a result should agree on the number of output
rows. Disagreement = either a SQL-semantics bug in one adapter (e.g.
NULL handling, distinct vs. not-distinct join) or a real engine
difference worth knowing about.

Both runner.py and scaling_runner.py now print a "Row-count validation"
block at the end:

  Row-count validation:
    OK — all 11 benchmark(s) returned the same row count from every adapter

or, when somebody disagrees:

  Row-count validation:
    WARNING — 2 benchmark(s) disagree across adapters:
      groupby_q7: chdb=10000, duckdb=9998, polars=10000

So far our existing data passes this check on canonical H2O 10M; if a
new schema or query introduces a divergence, the bench loudly says so
instead of silently averaging incompatible results.
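The cross-adapter check reduces to a few lines; the results data shape here is an assumption:

```python
def validate_row_counts(results: dict) -> list[str]:
    """results[benchmark][adapter] -> output row count. Returns one
    warning line per benchmark whose adapters disagree; an empty list
    means the suite-wide check passed."""
    warnings = []
    for bench, counts in sorted(results.items()):
        if len(set(counts.values())) > 1:
            detail = ", ".join(f"{a}={n}" for a, n in sorted(counts.items()))
            warnings.append(f"{bench}: {detail}")
    return warnings
```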

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add 10m to the default scaling-curve sweep (was 10..1m). Lets
make bench-scaling cover the full size range — including the
real-world 10M-row scenario — without callers having to spell out
SIZES=... manually. Override remains available for shorter runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ser-vasilich ser-vasilich changed the title prototype: H2O bench framework + cross-adapter check prototype: H2O bench framework May 8, 2026
@singaraiona singaraiona merged commit aa4b402 into RayforceDB:master May 8, 2026