feat(huggingface): add qa and ranking tasks #5574
…d media proxy
Introduces a new Jersey REST resource exposing endpoints used by the upcoming HuggingFace operator UI:
- GET /api/huggingface/models: browse / search models per task
- GET /api/huggingface/tasks: list HF pipeline tags with hosted inference
- POST /api/huggingface/upload-audio: upload audio for HF audio tasks
- GET /api/huggingface/audio-preview: stream uploaded audio (path-validated)
- GET /api/huggingface/media-proxy: proxy remote media URLs to bypass CORS
This is the first PR in a stacked series landing the HF operator end-to-end. No operator code yet; this resource is independently useful and lets the frontend integrate with HF before the operator class lands.
Addresses xuang7's review on PR apache#5124: both endpoints previously buffered the full payload into a heap-resident byte[] with no upper bound, leaving the JVM open to OOM on a hostile or buggy upstream response (/media-proxy) or an out-of-band write into the audio temp dir (/audio-preview).
- /media-proxy: switch from Unirest.asBytes() to asObject(Function<RawResponse, T>), streaming the upstream body in 8 KiB chunks with a running byte counter. Aborts with 413 if the declared Content-Length exceeds the cap (pre-check) or if the body crosses the cap mid-read (defends against missing/lying Content-Length). New MAX_MEDIA_PROXY_BYTES = 50 MiB, sized for HF inference media (text-to-image ~5 MiB, text-to-video ~30 MiB) with headroom.
- /audio-preview: add Files.size() defense-in-depth check before readAllBytes. /upload-audio already enforces MAX_AUDIO_BYTES on ingest; this catches the case where a bug or out-of-band write puts an oversized file in the temp dir.
Adds a spec covering the audio-preview cap using a sparse-file fixture so the test stays fast (87/87 spec passes). The media-proxy cap path is exercised via the existing input-validation suite plus the new streamMediaWithCap helper; a follow-up can add a fake-RawResponse unit test if reviewers want explicit coverage of the chunked-read cap.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
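The capped streaming read described in this commit can be illustrated with a minimal Python sketch. This is not the Scala/Unirest implementation from the PR: `read_with_cap` and `PayloadTooLarge` are hypothetical names, and only the 8 KiB chunk size and the 50 MiB cap are taken from the commit message.

```python
import io

CHUNK_SIZE = 8 * 1024                      # 8 KiB chunks, as in the commit message
MAX_MEDIA_PROXY_BYTES = 50 * 1024 * 1024   # 50 MiB cap, as in the commit message


class PayloadTooLarge(Exception):
    """Hypothetical error that a caller would map to HTTP 413."""


def read_with_cap(stream, declared_length=None, cap=MAX_MEDIA_PROXY_BYTES):
    # Pre-check: reject early when the declared Content-Length already exceeds the cap.
    if declared_length is not None and declared_length > cap:
        raise PayloadTooLarge(f"declared length {declared_length} exceeds cap {cap}")

    total = 0
    chunks = []
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            break
        total += len(chunk)
        # Mid-read check: covers a missing or understated Content-Length.
        if total > cap:
            raise PayloadTooLarge(f"body exceeded cap of {cap} bytes")
        chunks.append(chunk)
    return b"".join(chunks)
```

The running counter is what bounds memory use: at most `cap` bytes plus one chunk are ever held, regardless of what the upstream declares.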
Per review on apache#5124 (xuang7, Ma77Ball): mark the resource with @RolesAllowed(Array("REGULAR", "ADMIN")) to document that all five endpoints require an authenticated user. The annotation isn't enforced yet (that's coming with the auth-enforcement PR @Yicong-Huang and @Ma77Ball are working on), but adding it now means no follow-up change is needed when enforcement lands, and it matches the convention used by UserConfigResource / AdminSettingsResource.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eration
Splits the monolithic 1,278-line HuggingFaceInferenceOpDesc from the team's feature branch into a dispatcher + per-task codegen architecture and ships the first task family (text-generation) end-to-end.
- TaskCodegen trait + CodegenContext model the per-task variation
- PythonCodegenBase emits the shared provider-fallback / process_table / _parse_response infrastructure with two holes for the per-task payload and parse snippets
- TextGenCodegen supplies text-generation's chat-completions payload and the body["choices"][0]["message"]["content"] parse branch
- HuggingFaceInferenceOpDesc becomes a thin dispatcher (~180 lines) holding @JsonProperty fields and the registeredCodegens map
User-input string fields are typed as EncodableString and emitted via the pyb"..." macro so values reach Python as self.decode_python_template('<base64>') rather than raw literals; class constants are assigned in open(self) so self is in scope for the decode call. Generated process_table runs a defensive _HF_MODEL_ID_PATTERN check at runtime before any HF URL is composed.
PR 2 of a stacked 9-PR series. PR 1 (apache#5124) ships the supporting REST resource; PRs 3-5 will add image, audio + media-gen, and QA/ranking task families by registering new *Codegen objects in the dispatcher.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
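The two safety measures in this commit, base64-embedding of user strings and a runtime model-ID check, can be sketched as follows. Everything here is an assumption for illustration: the regular expression is not the PR's `_HF_MODEL_ID_PATTERN` (which is not shown), and `encode_user_value` / `decode_user_value` are hypothetical stand-ins for what the `pyb"..."` macro and `decode_python_template` are described as doing.

```python
import base64
import re

# Assumed pattern for illustration only; the PR's _HF_MODEL_ID_PATTERN is not shown here.
MODEL_ID_PATTERN = re.compile(
    r"[A-Za-z0-9][A-Za-z0-9._-]*(/[A-Za-z0-9][A-Za-z0-9._-]*)?"
)


def encode_user_value(value: str) -> str:
    # Codegen side: embed user text as base64 so it cannot terminate a Python literal.
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


def decode_user_value(encoded: str) -> str:
    # Runtime side: hypothetical stand-in for a helper that decodes the embedded value.
    return base64.b64decode(encoded).decode("utf-8")


def require_valid_model_id(model_id: str) -> str:
    # Accept only "name" or "owner/name" before the value is placed in a URL.
    if not MODEL_ID_PATTERN.fullmatch(model_id):
        raise ValueError(f"Invalid model id: {model_id!r}")
    return model_id
```

Because the base64 alphabet contains no quote or backslash characters, the encoded form can be placed inside a single-quoted literal in generated code without escaping.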
…degen specs
Addresses Codecov's 66.85% patch coverage warning by exercising the defensive null-handling branches in HuggingFaceInferenceOpDesc.scala and the TextGenCodegen contract that previously had no spec hits.
- null-tolerance: feed null into every @JsonProperty (token, model, prompt col, system prompt, result col, task, maxNewTokens, temperature) and assert generatePythonCode still emits a parseable ProcessTableOperator with sane defaults (TASK falls back to text-generation, MAX_NEW_TOKENS clamps to 256, TEMPERATURE to 0.7). Covers the `if (x == null) ... else x` branches that previously had no test that took the null side.
- TextGenCodegen.task: trivial canonical-value check.
- TextGenCodegen ctx-independence: pass an "irrelevant"-filled ctx and assert payloadPython / parsePython still reference self.MODEL_ID and body["choices"]…. Catches a future refactor that accidentally splices ctx fields into the static snippets.
13/13 in HuggingFaceInferenceOpDescSpec, 2/2 in PythonCodeRawInvalidTextSpec (117/117 descriptors still py_compile cleanly).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…NAI_COMPATIBLE_PROVIDERS to class constants
Plugs the 9-task image family into the dispatcher pattern established
in PR 2:
  image-only:      image-classification, object-detection,
                   image-segmentation, image-to-text
  image + prompt:  visual-question-answering, document-question-answering,
                   zero-shot-image-classification, image-text-to-text,
                   image-to-image
- ImageTaskCodegen supplies payload + parse Python for all 9 tasks
- TaskCodegen trait gains a `tasks: Set[String]` default method so a
single codegen can register under multiple task strings; the
dispatcher map in HuggingFaceInferenceOpDesc is built from
registeredCodegens.tasks.flatMap(...)
- CodegenContext extended with imageInput + inputImageColumn
(EncodableString)
- HuggingFaceInferenceOpDesc gains 2 new @JsonProperty fields and
registers ImageTaskCodegen
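The multi-task registration described in the list above can be sketched in Python. The PR implements this in Scala; the class bodies, the subset of image tasks, the duplicate-registration check, and the assumption that `tasks` defaults to the single canonical `task` are all illustrative and not taken from the PR's source.

```python
class TaskCodegen:
    task = ""

    @property
    def tasks(self):
        # Assumed default: a codegen handles only its single canonical task.
        return {self.task}


class TextGenCodegen(TaskCodegen):
    task = "text-generation"


class ImageTaskCodegen(TaskCodegen):
    task = "image-classification"

    @property
    def tasks(self):
        # Illustrative subset of the image family; one codegen, many task strings.
        return {"image-classification", "object-detection", "image-to-text"}


def build_dispatch(codegens):
    # Flatten every codegen's task set into one task -> codegen lookup table.
    dispatch = {}
    for codegen in codegens:
        for task in codegen.tasks:
            if task in dispatch:
                raise ValueError(f"Task registered twice: {task}")
            dispatch[task] = codegen
    return dispatch
```

With this shape, adding a task family only requires appending one object to the registered list; the lookup table is derived rather than maintained by hand.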
PythonCodegenBase grows to host the shared image infrastructure:
- image_only_tasks / image_prompt_tasks / image_tasks tuples and
image_headers in process_table
- per-row image bytes resolution from upload (self._read_image_input)
or input column (self._read_binary_value + self._compress_image_bytes)
- use_raw_binary_body / raw_binary_headers state threaded through
_post_with_fallback (signature extended)
- _post_with_fallback adds the image-text-to-text chat-completions
branch and the model-author vision branch
- _call_provider adds branches for zai-org's custom API, Replicate
predictions + polling, Fal-ai, Wavespeed submit+poll, and image
embedding in OpenAI-compatible / unknown-provider fallbacks
- image-content-type response handling returns data:image URLs
- image helpers added: _read_image_input, _compress_image_bytes,
_image_input_as_base64, _read_binary_value, _looks_like_html,
_html_to_image_bytes, _extract_json_arg, _url_to_data_url
User-input strings continue to flow through pyb"..." + EncodableString
so they reach Python as self.decode_python_template('<base64>') rather
than raw literals. PythonCodeRawInvalidTextSpec still passes
(117/117 descriptors py_compile cleanly).
Frontend integration adds only the HF lines (no agent / dataset
noise from the source branch):
- HuggingFaceImageUploadComponent declared in app.module.ts
- huggingface-image-upload formly type registered in formly-config.ts
- Image upload component .ts/.html/.scss cherry-picked from huggingFace
- HuggingFace.png + sample-image.png assets
PR 3 of a stacked 9-PR series. Stacks on hf/02-operator-textgen.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report
❌ Patch coverage is
@@ Coverage Diff @@
## main #5574 +/- ##
============================================
+ Coverage 51.88% 52.33% +0.44%
- Complexity 2472 2520 +48
============================================
Files 1067 1078 +11
Lines 41258 41566 +308
Branches 4437 4467 +30
============================================
+ Hits 21408 21752 +344
+ Misses 18591 18544 -47
- Partials 1259 1270 +11
/request-review @Ma77Ball
"""    if task == "question-answering":
  |        return body.get("answer", json.dumps(body))
  |    elif task == "table-question-answering":
  |        return body.get("answer", json.dumps(body))
body.get(...) assumes body is a dict, but some providers return a list for QA. On a non-dict, .get raises AttributeError, which is not caught by _parse_response's except (KeyError, IndexError, TypeError), so instead of the intended json.dumps(body) fallback it surfaces as a misleading "Request failed" for that row. Guard with isinstance:
Suggested change:
"""    if task == "question-answering":
  |        return body.get("answer", json.dumps(body))
  |    elif task == "table-question-answering":
  |        return body.get("answer", json.dumps(body))
becomes
"""    if task == "question-answering":
  |        return body.get("answer", json.dumps(body)) if isinstance(body, dict) else json.dumps(body)
  |    elif task == "table-question-answering":
  |        return body.get("answer", json.dumps(body)) if isinstance(body, dict) else json.dumps(body)
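The failure mode in this review comment can be reproduced in isolation. The sketch below assumes only what the comment states: the parse branch calls `body.get(...)`, and the surrounding handler catches `(KeyError, IndexError, TypeError)`. The function names are hypothetical, and `parse_response` is a simplified stand-in for the generated `_parse_response`.

```python
import json


def parse_answer_unguarded(body):
    # Original branch: assumes body is a dict.
    return body.get("answer", json.dumps(body))


def parse_answer_guarded(body):
    # Suggested branch: fall back to the raw JSON for any non-dict body.
    if isinstance(body, dict):
        return body.get("answer", json.dumps(body))
    return json.dumps(body)


def parse_response(parser, body):
    # Simplified stand-in using the except clause quoted in the review.
    try:
        return parser(body)
    except (KeyError, IndexError, TypeError):
        return json.dumps(body)
```

Calling `parse_response(parse_answer_unguarded, [{"answer": "Paris"}])` raises `AttributeError`, because a list has no `.get` and `AttributeError` is not in the caught tuple. The guarded version returns the serialized list instead.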
  |        f"Prompt column '{prompt_col}' not found in input table. "
  |        f"Available columns: {list(table.columns)}"
  |    )
  |    if task == "question-answering":
zero-shot-classification has no check that CANDIDATE_LABELS is non-empty, so an empty value sends candidate_labels: [] and produces a confusing provider-side error instead of an actionable one (unlike the context and sentences checks here). Add an upfront assert:
Suggested change:
  |    if task == "question-answering":
becomes
  |    if task == "zero-shot-classification":
  |        assert self.CANDIDATE_LABELS and self.CANDIDATE_LABELS.strip(), (
  |            "Candidate Labels are required for zero-shot-classification. "
  |            "Provide a comma-separated list of labels."
  |        )
  |    if task == "question-answering":
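A slightly stricter variant of this check validates the parsed list instead of the raw string. This is a sketch rather than the PR's code: `parse_candidate_labels` is a hypothetical helper, and the comma-separated format is taken from the error message in the suggestion.

```python
def parse_candidate_labels(raw):
    # Split on commas, trim whitespace, and drop empty entries.
    labels = [part.strip() for part in (raw or "").split(",")]
    labels = [label for label in labels if label]
    if not labels:
        raise ValueError(
            "Candidate Labels are required for zero-shot-classification. "
            "Provide a comma-separated list of labels."
        )
    return labels
```

Checking after parsing also rejects inputs such as `" , "`, which pass a plain `.strip()` test on the raw string but still produce an empty label list.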
hf/04-audio-mediagen. Until that lands, the diff below may also include earlier HuggingFace task-family changes depending on which base GitHub is showing. The new code in this PR is codegen/QaRankingCodegen.scala, the QA/ranking-related additions to codegen/PythonCodegenBase.scala, the new QA/ranking fields on HuggingFaceInferenceOpDesc.scala, and the QA/ranking task tests in HuggingFaceInferenceOpDescSpec.scala. Once PR 4 merges and this PR is retargeted to main, the diff should auto-clean to the PR 5 QA/ranking changes only.
What changes were proposed in this PR?
Adds the QA/ranking/classification task family (5 HF pipeline tasks) as a new TaskCodegen plugged into the dispatcher established by the text-generation PR:
- QA tasks: question-answering, table-question-answering
- classification/ranking tasks: zero-shot-classification, sentence-similarity, text-ranking
codegen/QaRankingCodegen.scala supplies the per-task payload + parse Python branches for all 5 tasks.
CodegenContext is extended with contextColumn, candidateLabels, and sentencesColumn (EncodableString).
HuggingFaceInferenceOpDesc.scala gains 3 new @JsonProperty fields and registers QaRankingCodegen in the dispatcher.
PythonCodegenBase.scala grows to host the shared QA/ranking infrastructure:
- question-answering payload handling with prompt + context.
- table-question-answering payload handling with table data.
- zero-shot-classification payload handling with candidate labels.
- sentence-similarity and text-ranking payload handling with sentence inputs.
User-input strings continue to flow through pyb"..." + EncodableString so they reach Python as self.decode_python_template('<base64>') rather than raw literals. PythonCodeRawInvalidTextSpec still passes with 117/117 descriptors py_compile cleanly.
Any related issues, documentation, or discussions?
Tracking issue: Add HuggingFace question answering and ranking tasks #5292
Closes #5292
Stacked on: PR 4 audio/media generation tasks / hf/04-audio-mediagen
Parent issue: Add Hugging Face inference operator #5041
Closed sibling issue: Add HuggingFaceModelResource REST endpoints for HF operator UI #5134
How was this PR tested?
- sbt "WorkflowOperator/compile; WorkflowOperator/Test/compile" clean.
- sbt "WorkflowOperator/testOnly org.apache.texera.amber.operator.huggingFace.HuggingFaceInferenceOpDescSpec org.apache.texera.amber.util.PythonCodeRawInvalidTextSpec": 31 focused tests pass, including HuggingFace QA/ranking task coverage and the raw Python descriptor scan.
- sbt "WorkflowOperator / scalafmtCheck" clean.
- sbt "WorkflowOperator / Test / scalafmtCheck" clean.
- PythonCodeRawInvalidTextSpec: 117/117 descriptors py_compile cleanly with the new operator code paths, no marker leaks.
Was this PR authored or co-authored using generative AI tooling?
Yes, co-authored with generative AI tooling (Codex).