Multi-page PDF extraction: undocumented `pages: []` schema works great, plus 7 specific asks
Hi team — first off, the model is genuinely impressive on a hard real-world use case (more on that below). Filing this as one consolidated issue because everything I found is connected to a single theme: the docs and a couple of small API ergonomics gaps don't yet support the long-document use case that your model is already capable of. If even half of these get picked up, Interfaze becomes a no-brainer choice for a category of work (regulated long-document extraction) that other providers do poorly.
The use case I'm building for
A PDF→HTML pipeline for certification documents in the sustainability industry — think compliance standards, methodology specs, and certification scheme documents. They share a few properties that make them hard:
- 30–200+ pages, structured by section/sub-section
- Dense with mathematical formulas (subscripted variables: `e_{ec}`, `η_h`, `EM_f`, etc.) that must round-trip to LaTeX
- Right-margin annotation columns with short callout labels that explain adjacent paragraphs
- Figures often embed tables/dashboards as images, not native PDF tables
- Per-page anchoring matters because every variable, formula, and value needs a citation back to the specific page it was stated on (auditability is a hard requirement)
I evaluated Interfaze against Reducto Extract v2 on a 4-page subset (real production sample). Results below show Interfaze matches Reducto's fidelity, runs ~3–5× faster, and the LaTeX subscript correction is honestly the killer feature — your OCR layer correctly flags '7h' (a misread of η_h) at confidence 0.26, and your schema layer rewrites it to \eta_h in the output. That correction loop is exactly what compliance-doc extraction needs, and I haven't seen another vendor do it cleanly.
So this issue is "please make a few small things explicit so I can ship Interfaze into a production pipeline" rather than "your stuff is broken."
Headline finding: wrapped schema works, but it's hidden
The only PDF example in your docs (https://interfaze.ai/docs/vision/ocr — the arxiv research paper one) uses a flat top-level schema (formulas: List[FormulaItem]). This implicitly suggests that one document = one flat object. For long compliance docs, we need per-page anchoring, so I tried wrapping:
```json
{
  "type": "object",
  "properties": {
    "pages": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "page_number": { "type": "integer" },
          "section_title": { "type": "string" },
          "formulas": { "type": "array", "items": { /* ... */ } },
          "variable_definitions": { "type": "array", "items": { /* ... */ } },
          "sidebar_annotations": { "type": "array", "items": { /* ... */ } }
        }
      }
    }
  },
  "required": ["pages"]
}
```
It just works. The model returned 4 pages in correct order with page numbers [9, 15, 16, 23] honored and each page's content properly partitioned. This needs to be a first-class documented example, not something developers discover by trying it.
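For reference, this is roughly how the wrapped schema goes over the wire. A minimal sketch: the per-page item schema is abbreviated, the schema `name` is my own, and the `base_url` in the usage comment is illustrative (it assumes Interfaze exposes an OpenAI-compatible endpoint, which is how I'm calling it via the OpenAI Python SDK):

```python
# Sketch of the wrapped-schema request payload (per-page item abbreviated).
PAGE_ITEM = {
    "type": "object",
    "properties": {
        "page_number": {"type": "integer"},
        "section_title": {"type": "string"},
        "formulas": {"type": "array", "items": {"type": "object"}},
    },
}

WRAPPED_SCHEMA = {
    "type": "object",
    "properties": {"pages": {"type": "array", "items": PAGE_ITEM}},
    "required": ["pages"],
}

def build_response_format(schema: dict) -> dict:
    """response_format payload for structured outputs (json_schema mode)."""
    return {
        "type": "json_schema",
        "json_schema": {"name": "per_page_extraction", "schema": schema},
    }

# Usage with the OpenAI SDK (base_url is illustrative):
#   client = OpenAI(base_url="https://api.interfaze.ai/v1", api_key=...)
#   client.chat.completions.create(
#       model="interfaze-beta",
#       messages=[...],
#       response_format=build_response_format(WRAPPED_SCHEMA),
#   )
```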
Empirical comparison — 3 approaches on the same 4-page PDF
| Approach | Per-page? | Formulas | Variables | Sidebars | Wall-clock |
|---|---|---|---|---|---|
| Flat top-level schema (your docs' pattern) | ❌ merged | 10 | 27 (dedup) | 9 | 57.3s |
| A. Wrapped `pages: []` schema | ✅ correct order | 10 | 39 | 9 | 64.6s |
| B. 4× parallel single-page calls | ✅ by construction | 10 | 40 | 9 | 66.0s |
| Reducto Extract v2 (baseline) | ✅ | — | 40 | 9 | minutes |
Test A nearly matches Test B (and Reducto), confirming wrapped schemas are production-viable.
What I'd ask for, in priority order
1. Document the pages: [PageObject] schema pattern as a first-class long-document example
Add it to https://interfaze.ai/docs/vision/ocr alongside the arxiv example. Suggested copy: "For multi-page PDFs where you need per-page anchoring (compliance documents, technical specs, multi-section reports), wrap your per-page extraction schema in a top-level pages: [] array. The model honors page order and fills one entry per physical page." That's it — one paragraph + a working snippet would unlock a whole category of use cases.
2. Add an explicit page_number field to precontext.ocr.result.sections[]
Today the mapping is inferential — sections[] happens to be the same length as total_pages, and the first line of each section happens to be the printed page number. That's brittle. One explicit page_number (1-indexed, physical page in the input PDF) on each section eliminates the ambiguity and lets us build reliable bbox-to-page mappings.
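To illustrate how brittle this is, here's roughly the mapping we have to build today. A hypothetical helper, written against the two undocumented coincidences described above (section index ↔ physical page, first line ↔ printed page number):

```python
def map_sections_to_pages(sections: list[dict]) -> list[dict]:
    """Infer page anchoring from precontext sections.

    Relies on two undocumented coincidences:
      1. len(sections) == total_pages, in input order
      2. the first line of each section is the printed page number
    An explicit page_number field on each section would make this a no-op.
    """
    mapped = []
    for i, section in enumerate(sections):
        first_line = section["lines"][0]["text"].strip()
        printed = int(first_line) if first_line.isdigit() else None
        mapped.append({
            "physical_page": i + 1,   # inferred from position only
            "printed_page": printed,  # inferred from layout only
            "section": section,
        })
    return mapped
```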
3. Document the coordinate frame
Top-level result.width: 1190, result.height: 6732 (on a 4-page input) suggests a stacked canvas. But per-section lines[].bounds.y starts fresh at ~73 for each section, suggesting page-local. Both interpretations are valid for different use cases (cross-page visuals vs. per-page overlays) — please document which is authoritative or, ideally, expose both: page-local on each section + a stacked canvas at top-level only when needed.
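Here's a sketch of the conversion I'd like to stop guessing at. It assumes equal page heights, i.e. stacked canvas height divided by page count, which held on my sample (6732 / 4 = 1683):

```python
def page_local_to_stacked(page_index: int, y_local: float, page_height: float) -> float:
    """Convert a per-section (page-local) y into the stacked-canvas frame."""
    return page_index * page_height + y_local

def stacked_to_page_local(y_stacked: float, page_height: float) -> tuple[int, float]:
    """Convert a stacked-canvas y back to (page_index, y_local)."""
    page_index = int(y_stacked // page_height)
    return page_index, y_stacked - page_index * page_height

# On my sample: result.height 6732 over 4 pages -> page_height 1683.
```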
4. Surface why wrapped-schema variable count (39) is lower than parallel single-page (40)
Probably deduplication when the same variable_definitions[] array appears in multiple page objects in the response. If that's tunable, an explicit flag like "deduplicate_repeated_items": false would help. If it's an artifact of attention/context, surfacing a hint in the schema description ("treat each page's array independently") would be enough. For us, every duplicate is information — a variable that's defined on three different pages should appear three times so we can cite each occurrence back to its page.
5. run_tasks precontext-only mode that accepts a multi-page PDF
The "Lower costs" page recommends <task>ocr</task> for raw OCR — but it's not clear from the docs whether that mode works on a multi-page PDF and returns the per-section precontext directly. For our use case we'd love to skip the full reasoning pass when all we need is raw text + bboxes + confidence. A worked example of <task>ocr</task> on a 10+ page PDF with precontext returned at top-level would be valuable.
6. Document actual concurrent-request capacity
With 4 concurrent requests on the same API key I saw only 2.6× speedup vs serial (66s wall-clock vs 171s sum-of-latencies). That's much lower than the 50 req/sec rate limit implies. Either there's an undocumented per-account concurrent-request cap, or there's shared internal pipeline contention. Knowing the real concurrency ceiling would let us size client-side fan-out correctly. A line in the rate-limits page like "Up to N concurrent requests per account; queue beyond that" would solve this.
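Until the ceiling is documented, the only option is measuring it empirically. This is the shape of the harness I used (a sketch; `call_api` is a stand-in for the actual request function):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def measure_fanout(call_api, payloads, workers: int):
    """Compare wall-clock under fan-out vs. the sum of per-call latencies.

    A ratio well below `workers` suggests an upstream concurrency cap or
    shared pipeline contention (my run: 4 workers, only ~2.6x speedup).
    """
    latencies = []

    def timed(payload):
        t0 = time.perf_counter()
        result = call_api(payload)
        latencies.append(time.perf_counter() - t0)
        return result

    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(timed, payloads))
    wall = time.perf_counter() - t0
    return wall, sum(latencies)
```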
7. Long-doc strategy: 5-min cap + 80 MB URL cap don't fit a 200-page standard
A real certification standard at 300 DPI in this category is ~200 pages and >80 MB. Either (a) a chunked/streamed PDF endpoint, or (b) an authoritative recommendation in the docs: "for PDFs longer than N pages or larger than M MB, split client-side into chunks of K pages and call in parallel." Right now we have to derive the right chunk size empirically.
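In the meantime we're deriving (b) ourselves. The chunking math is trivial; the annoying part is that N, M, and K are guesses. A sketch of the client-side split (the actual sub-PDF for each range would be written with a library like pypdf, then each chunk called in parallel and stitched back in range order):

```python
def chunk_ranges(total_pages: int, pages_per_chunk: int) -> list[tuple[int, int]]:
    """1-indexed inclusive (start, end) page ranges for client-side chunking."""
    return [
        (start, min(start + pages_per_chunk - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_chunk)
    ]
```

For a 200-page standard with K=10 this yields 20 sub-documents, each comfortably under the 5-minute and 80 MB caps — but K is currently an empirical guess, which is exactly the problem.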
8. (Possible bug) Wrapped schema returned empty section_title for 2 of 4 pages
Same model, same prompt, same page content — when I ran each page as a separate single-page request (Test B), all 4 pages got non-empty section_title. When I ran the same 4 pages inside a wrapped pages: [] schema (Test A), pages 15 and 23 returned section_title: "". This suggests the schema-following layer treats per-page object fields differently when the page is one element of an array vs. when it's the top-level object. Probably worth a look from your model team.
The killer feature you should advertise more
On page 16 of my test doc, the raw OCR layer returns the line '7h' at confidence 0.26. That's an OCR misread of η_h (Greek eta with subscript h — a variable used hundreds of times in this standard). Your schema-extraction layer then writes out \eta_h in the LaTeX field of the formula. The OCR layer correctly flagged it as low-confidence and the schema layer fixed it using context.
This is the magic. No other OCR vendor I've tested (Reducto, Mistral OCR, GLM-OCR, PaddleOCR-VL) does both halves of this correction loop in one call. It's a real differentiator for any vertical that involves mathematical or scientific notation. Please put it in a blog post.
Reproduction
Full repro available — let me know if it's useful and I can share the script + raw responses privately. Summary:
- Input: 4-page subset of a real sustainability certification standard
- Model: `interfaze-beta`
- SDK: OpenAI Python SDK 2.36.0
- Three runs: flat top-level schema (baseline), wrapped `pages: []` schema, 4× parallel single-page calls
- All structured outputs requested via `response_format: {type: "json_schema"}`
Happy to provide raw response JSON and the runner script on request — I just didn't want to leak the source document publicly.
Thanks for building this. Genuinely the first OCR stack that gets math + layout right in one pass — I'd love to ship it into production once these gaps close.
Context: I'm building a startup in this space (regulated-document extraction for the sustainability/certification industry) and Interfaze is currently my preferred extraction layer. Happy to be a design partner — if any of the items above are easier to scope with a concrete user on the line, just let me know.