Multi-page PDF extraction: undocumented `pages: []` schema works great, plus 7 specific asks
Hi team — first off, the model is genuinely impressive on a hard real-world use case (more on that below). Filing this as one consolidated issue because everything I found is connected to a single theme: the docs and a couple of small API ergonomics gaps don't yet support the long-document use case that your model is already capable of. If even half of these get picked up, Interfaze becomes a no-brainer choice for a category of work (regulated long-document extraction) that other providers do poorly.
The use case I'm building for
A PDF→HTML pipeline for certification documents in the sustainability industry — think compliance standards, methodology specs, and certification scheme documents. They share a few properties that make them hard:
- 30–200+ pages, structured by section/sub-section
- Dense with mathematical formulas (subscripted variables: `e_{ec}`, `η_h`, `EM_f`, etc.) that must round-trip to LaTeX
- Right-margin annotation columns with short callout labels that explain adjacent paragraphs
- Figures often embed tables/dashboards as images, not native PDF tables
- Per-page anchoring matters because every variable, formula, and value needs a citation back to the specific page it was stated on (auditability is a hard requirement)
I evaluated Interfaze against Reducto Extract v2 on a 4-page subset (real production sample). Results below show Interfaze matches Reducto's fidelity, runs ~3–5× faster, and the LaTeX subscript correction is honestly the killer feature — your OCR layer correctly flags '7h' (a misread of η_h) at confidence 0.26, and your schema layer rewrites it to \eta_h in the output. That correction loop is exactly what compliance-doc extraction needs, and I haven't seen another vendor do it cleanly.
So this issue is "please make a few small things explicit so I can ship Interfaze into a production pipeline" rather than "your stuff is broken."
Headline finding: wrapped schema works, but it's hidden
The only PDF example in your docs (https://interfaze.ai/docs/vision/ocr — the arxiv research paper one) uses a flat top-level schema (formulas: List[FormulaItem]). This implicitly suggests that one document = one flat object. For long compliance docs, we need per-page anchoring, so I tried wrapping:
```json
{
  "type": "object",
  "properties": {
    "pages": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "page_number": { "type": "integer" },
          "section_title": { "type": "string" },
          "formulas": { "type": "array", "items": { /* ... */ } },
          "variable_definitions": { "type": "array", "items": { /* ... */ } },
          "sidebar_annotations": { "type": "array", "items": { /* ... */ } }
        }
      }
    }
  },
  "required": ["pages"]
}
```
It just works. The model returned 4 pages in correct order with page numbers [9, 15, 16, 23] honored and each page's content properly partitioned. This needs to be a first-class documented example, not something developers discover by trying it.
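For reference, this is roughly how the wrapped schema goes over the wire. A minimal sketch: the per-page item schema is abbreviated, the schema `name` is my own, and the `base_url` in the usage comment is illustrative (it assumes Interfaze exposes an OpenAI-compatible endpoint, which is how I'm calling it via the OpenAI Python SDK):

```python
# Sketch of the wrapped-schema request payload (per-page item abbreviated).
PAGE_ITEM = {
    "type": "object",
    "properties": {
        "page_number": {"type": "integer"},
        "section_title": {"type": "string"},
        "formulas": {"type": "array", "items": {"type": "object"}},
    },
}

WRAPPED_SCHEMA = {
    "type": "object",
    "properties": {"pages": {"type": "array", "items": PAGE_ITEM}},
    "required": ["pages"],
}

def build_response_format(schema: dict) -> dict:
    """response_format payload for structured outputs (json_schema mode)."""
    return {
        "type": "json_schema",
        "json_schema": {"name": "per_page_extraction", "schema": schema},
    }

# Usage with the OpenAI SDK (base_url is illustrative):
#   client = OpenAI(base_url="https://api.interfaze.ai/v1", api_key=...)
#   client.chat.completions.create(
#       model="interfaze-beta",
#       messages=[...],
#       response_format=build_response_format(WRAPPED_SCHEMA),
#   )
```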
Empirical comparison — 3 approaches on the same 4-page PDF
| Approach | Per-page? | Formulas | Variables | Sidebars | Wall-clock |
|---|---|---|---|---|---|
| Flat top-level schema (your docs' pattern) | ❌ merged | 10 | 27 (dedup) | 9 | 57.3s |
| A. Wrapped `pages: []` schema | ✅ correct order | 10 | 39 | 9 | 64.6s |
| B. 4× parallel single-page calls | ✅ by construction | 10 | 40 | 9 | 66.0s |
| Reducto Extract v2 (baseline) | ✅ | — | 40 | 9 | minutes |
Test A nearly matches Test B (and Reducto), confirming wrapped schemas are production-viable.
What I'd ask for, in priority order
1. Document the pages: [PageObject] schema pattern as a first-class long-document example
Add it to https://interfaze.ai/docs/vision/ocr alongside the arxiv example. Suggested copy: "For multi-page PDFs where you need per-page anchoring (compliance documents, technical specs, multi-section reports), wrap your per-page extraction schema in a top-level pages: [] array. The model honors page order and fills one entry per physical page." That's it — one paragraph + a working snippet would unlock a whole category of use cases.
2. Add an explicit page_number field to precontext.ocr.result.sections[]
Today the mapping is inferential — sections[] happens to be the same length as total_pages, and the first line of each section happens to be the printed page number. That's brittle. One explicit page_number (1-indexed, physical page in the input PDF) on each section eliminates the ambiguity and lets us build reliable bbox-to-page mappings.
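To illustrate how brittle this is, here's roughly the mapping we have to build today. A hypothetical helper, written against the two undocumented coincidences described above (section index ↔ physical page, first line ↔ printed page number):

```python
def map_sections_to_pages(sections: list[dict]) -> list[dict]:
    """Infer page anchoring from precontext sections.

    Relies on two undocumented coincidences:
      1. len(sections) == total_pages, in input order
      2. the first line of each section is the printed page number
    An explicit page_number field on each section would make this a no-op.
    """
    mapped = []
    for i, section in enumerate(sections):
        first_line = section["lines"][0]["text"].strip()
        printed = int(first_line) if first_line.isdigit() else None
        mapped.append({
            "physical_page": i + 1,   # inferred from position only
            "printed_page": printed,  # inferred from layout only
            "section": section,
        })
    return mapped
```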
3. Document the coordinate frame
Top-level result.width: 1190, result.height: 6732 (on a 4-page input) suggests a stacked canvas. But per-section lines[].bounds.y starts fresh at ~73 for each section, suggesting page-local. Both interpretations are valid for different use cases (cross-page visuals vs. per-page overlays) — please document which is authoritative or, ideally, expose both: page-local on each section + a stacked canvas at top-level only when needed.
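Here's a sketch of the conversion I'd like to stop guessing at. It assumes equal page heights, i.e. stacked canvas height divided by page count, which held on my sample (6732 / 4 = 1683):

```python
def page_local_to_stacked(page_index: int, y_local: float, page_height: float) -> float:
    """Convert a per-section (page-local) y into the stacked-canvas frame."""
    return page_index * page_height + y_local

def stacked_to_page_local(y_stacked: float, page_height: float) -> tuple[int, float]:
    """Convert a stacked-canvas y back to (page_index, y_local)."""
    page_index = int(y_stacked // page_height)
    return page_index, y_stacked - page_index * page_height

# On my sample: result.height 6732 over 4 pages -> page_height 1683.
```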
4. Surface why wrapped-schema variable count (39) is lower than parallel single-page (40)
Probably deduplication when the same variable_definitions[] array appears in multiple page objects in the response. If that's tunable, an explicit flag like "deduplicate_repeated_items": false would help. If it's an artifact of attention/context, surfacing a hint in the schema description ("treat each page's array independently") would be enough. For us, every duplicate is information — a variable that's defined on three different pages should appear three times so we can cite each occurrence back to its page.
5. run_tasks precontext-only mode that accepts a multi-page PDF
The "Lower costs" page recommends <task>ocr</task> for raw OCR — but it's not clear from the docs whether that mode works on a multi-page PDF and returns the per-section precontext directly. For our use case we'd love to skip the full reasoning pass when all we need is raw text + bboxes + confidence. A worked example of <task>ocr</task> on a 10+ page PDF with precontext returned at top-level would be valuable.
6. Document actual concurrent-request capacity
With 4 concurrent requests on the same API key I saw only 2.6× speedup vs serial (66s wall-clock vs 171s sum-of-latencies). That's much lower than the 50 req/sec rate limit implies. Either there's an undocumented per-account concurrent-request cap, or there's shared internal pipeline contention. Knowing the real concurrency ceiling would let us size client-side fan-out correctly. A line in the rate-limits page like "Up to N concurrent requests per account; queue beyond that" would solve this.
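Until the ceiling is documented, the only option is measuring it empirically. This is the shape of the harness I used (a sketch; `call_api` is a stand-in for the actual request function):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def measure_fanout(call_api, payloads, workers: int):
    """Compare wall-clock under fan-out vs. the sum of per-call latencies.

    A ratio well below `workers` suggests an upstream concurrency cap or
    shared pipeline contention (my run: 4 workers, only ~2.6x speedup).
    """
    latencies = []

    def timed(payload):
        t0 = time.perf_counter()
        result = call_api(payload)
        latencies.append(time.perf_counter() - t0)
        return result

    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(timed, payloads))
    wall = time.perf_counter() - t0
    return wall, sum(latencies)
```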
7. Long-doc strategy: 5-min cap + 80 MB URL cap don't fit a 200-page standard
A real certification standard at 300 DPI in this category is ~200 pages and >80 MB. Either (a) a chunked/streamed PDF endpoint, or (b) an authoritative recommendation in the docs: "for PDFs longer than N pages or larger than M MB, split client-side into chunks of K pages and call in parallel." Right now we have to derive the right chunk size empirically.
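In the meantime we're deriving (b) ourselves. The chunking math is trivial; the annoying part is that N, M, and K are guesses. A sketch of the client-side split (the actual sub-PDF for each range would be written with a library like pypdf, then each chunk called in parallel and stitched back in range order):

```python
def chunk_ranges(total_pages: int, pages_per_chunk: int) -> list[tuple[int, int]]:
    """1-indexed inclusive (start, end) page ranges for client-side chunking."""
    return [
        (start, min(start + pages_per_chunk - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_chunk)
    ]
```

For a 200-page standard with K=10 this yields 20 sub-documents, each comfortably under the 5-minute and 80 MB caps — but K is currently an empirical guess, which is exactly the problem.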
8. (Possible bug) Wrapped schema returned empty section_title for 2 of 4 pages
Same model, same prompt, same page content — when I ran each page as a separate single-page request (Test B), all 4 pages got non-empty section_title. When I ran the same 4 pages inside a wrapped pages: [] schema (Test A), pages 15 and 23 returned section_title: "". This suggests the schema-following layer treats per-page object fields differently when the page is one element of an array vs. when it's the top-level object. Probably worth a look from your model team.
The killer feature you should advertise more
On page 16 of my test doc, the raw OCR layer returns the line '7h' at confidence 0.26. That's an OCR misread of η_h (Greek eta with subscript h — a variable used hundreds of times in this standard). Your schema-extraction layer then writes out \eta_h in the LaTeX field of the formula. The OCR layer correctly flagged it as low-confidence and the schema layer fixed it using context.
This is the magic. No other OCR vendor I've tested (Reducto, Mistral OCR, GLM-OCR, PaddleOCR-VL) does both halves of this correction loop in one call. It's a real differentiator for any vertical that involves mathematical or scientific notation. Please put it in a blog post.
Reproduction
Full repro available — let me know if it's useful and I can share the script + raw responses privately. Summary:
- Input: 4-page subset of a real sustainability certification standard
- Model: `interfaze-beta`
- SDK: OpenAI Python SDK 2.36.0
- Three runs: flat top-level schema (baseline), wrapped `pages: []` schema, 4× parallel single-page calls
- All structured outputs requested via `response_format: {type: "json_schema"}`
Happy to provide raw response JSON and the runner script on request — I just didn't want to leak the source document publicly.
Thanks for building this. Genuinely the first OCR stack that gets math + layout right in one pass — I'd love to ship it into production once these gaps close.
Context: I'm building a startup in this space (regulated-document extraction for the sustainability/certification industry) and Interfaze is currently my preferred extraction layer. Happy to be a design partner — if any of the items above are easier to scope with a concrete user on the line, just let me know.