feat(huggingFace): refactor operator into per-task codegen + text-generation #5278
PG1204 wants to merge 23 commits into
…d media proxy

Introduces a new Jersey REST resource exposing endpoints used by the upcoming HuggingFace operator UI:

- GET /api/huggingface/models: browse / search models per task
- GET /api/huggingface/tasks: list HF pipeline tags with hosted inference
- POST /api/huggingface/upload-audio: upload audio for HF audio tasks
- GET /api/huggingface/audio-preview: stream uploaded audio (path-validated)
- GET /api/huggingface/media-proxy: proxy remote media URLs to bypass CORS

This is the first PR in a stacked series landing the HF operator end-to-end. No operator code yet; this resource is independently useful and lets the frontend integrate with HF before the operator class lands.
Addresses xuang7's review on PR apache#5124: both endpoints previously buffered the full payload into a heap-resident byte[] with no upper bound, leaving the JVM open to OOM on a hostile or buggy upstream response (/media-proxy) or an out-of-band write into the audio temp dir (/audio-preview).

- /media-proxy: switch from Unirest.asBytes() to asObject(Function<RawResponse, T>), streaming the upstream body in 8 KiB chunks with a running byte counter. Aborts with 413 if the declared Content-Length exceeds the cap (pre-check) or if the body crosses the cap mid-read (defends against a missing or lying Content-Length). New MAX_MEDIA_PROXY_BYTES = 50 MiB, sized for HF inference media (text-to-image ~5 MiB, text-to-video ~30 MiB) with headroom.
- /audio-preview: add a Files.size() defense-in-depth check before readAllBytes. /upload-audio already enforces MAX_AUDIO_BYTES on ingest; this catches the case where a bug or out-of-band write puts an oversized file in the temp dir.

Adds a spec covering the audio-preview cap using a sparse-file fixture so the test stays fast (87/87 specs pass). The media-proxy cap path is exercised via the existing input-validation suite plus the new streamMediaWithCap helper; a follow-up can add a fake-RawResponse unit test if reviewers want explicit coverage of the chunked-read cap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
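The capped read described for /media-proxy can be illustrated with a minimal sketch. It is written in Python rather than the resource's own JVM code, and `read_with_cap` and `PayloadTooLarge` are hypothetical names chosen for illustration; only the 8 KiB chunk size, the 50 MiB cap, and the two-stage check (declared length first, running counter second) come from the commit message.

```python
import io
from typing import BinaryIO, Optional

MAX_MEDIA_PROXY_BYTES = 50 * 1024 * 1024  # 50 MiB cap, as stated in the commit message
CHUNK_SIZE = 8 * 1024                     # 8 KiB chunks, as stated in the commit message


class PayloadTooLarge(Exception):
    """Hypothetical error type; a REST layer would map it to HTTP 413."""


def read_with_cap(body: BinaryIO,
                  declared_length: Optional[int],
                  cap: int = MAX_MEDIA_PROXY_BYTES) -> bytes:
    # Pre-check: reject early when the upstream declares an oversized body.
    if declared_length is not None and declared_length > cap:
        raise PayloadTooLarge(f"declared {declared_length} bytes exceeds cap {cap}")

    out = bytearray()
    while True:
        chunk = body.read(CHUNK_SIZE)
        if not chunk:
            break
        # Running counter: covers a missing or understated Content-Length.
        if len(out) + len(chunk) > cap:
            raise PayloadTooLarge(f"body exceeded cap {cap} mid-read")
        out.extend(chunk)
    return bytes(out)
```

Checking the counter before appending each chunk keeps the buffer from ever growing past the cap, so the worst-case heap use is bounded by the cap itself.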
Codecov Report
❌ Patch coverage is
Additional details and impacted files

Coverage Diff

| Metric | main | #5278 | +/- |
|---|---|---|---|
| Coverage | 52.42% | 52.17% | -0.25% |
| Complexity | 2481 | 2503 | +22 |
| Files | 1070 | 1071 | +1 |
| Lines | 41359 | 41329 | -30 |
| Branches | 4441 | 4445 | +4 |
| Hits | 21682 | 21564 | -118 |
| Misses | 18406 | 18495 | +89 |
| Partials | 1271 | 1270 | -1 |
/request-review @Ma77Ball
Per review on apache#5124 (xuang7, Ma77Ball): mark the resource with @RolesAllowed(Array("REGULAR", "ADMIN")) to document that all five endpoints require an authenticated user. The annotation isn't enforced yet (that's coming with the auth-enforcement PR @Yicong-Huang and @Ma77Ball are working on), but adding it now means no follow-up change is needed when enforcement lands, and it matches the convention used by UserConfigResource / AdminSettingsResource.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eration

Splits the monolithic 1,278-line HuggingFaceInferenceOpDesc from the team's feature branch into a dispatcher + per-task codegen architecture and ships the first task family (text-generation) end-to-end.

- TaskCodegen trait + CodegenContext model the per-task variation
- PythonCodegenBase emits the shared provider-fallback / process_table / _parse_response infrastructure with two holes for the per-task payload and parse snippets
- TextGenCodegen supplies text-generation's chat-completions payload and the body["choices"][0]["message"]["content"] parse branch
- HuggingFaceInferenceOpDesc becomes a thin dispatcher (~180 lines) holding @JsonProperty fields and the registeredCodegens map

User-input string fields are typed as EncodableString and emitted via the pyb"..." macro so values reach Python as self.decode_python_template('<base64>') rather than raw literals; class constants are assigned in open(self) so self is in scope for the decode call. Generated process_table runs a defensive _HF_MODEL_ID_PATTERN check at runtime before any HF URL is composed.

PR 2 of a stacked 9-PR series. PR 1 (apache#5124) ships the supporting REST resource; PRs 3-5 will add image, audio + media-gen, and QA/ranking task families by registering new *Codegen objects in the dispatcher.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
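The dispatcher-plus-holes shape described above can be sketched in a few lines. This is a language-neutral illustration written in Python, not the PR's Scala code: `TaskCodegen`, `registeredCodegens`, and the `body["choices"][0]["message"]["content"]` parse branch mirror names in the commit message, while the template text, the `build_payload` method, and the payload shape are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class TaskCodegen:
    """One task family: the two snippets spliced into the shared template."""
    tasks: FrozenSet[str]
    payload_python: str
    parse_python: str


TEXT_GEN = TaskCodegen(
    tasks=frozenset({"text-generation"}),
    # Assumed payload shape; the PR only says it is a chat-completions payload.
    payload_python='payload = {"model": self.MODEL_ID, "messages": messages}',
    parse_python='return body["choices"][0]["message"]["content"]',
)

# Dispatcher table: each codegen registers under every task string it serves.
REGISTERED_CODEGENS: Dict[str, TaskCodegen] = {
    task: codegen for codegen in (TEXT_GEN,) for task in codegen.tasks
}

# Shared template with two holes, {payload} and {parse} (illustrative only).
TEMPLATE = '''\
class ProcessTableOperator:
    def build_payload(self, messages):
        {payload}
        return payload

    def _parse_response(self, body):
        {parse}
'''


def generate_python_code(task: str) -> str:
    codegen = REGISTERED_CODEGENS.get(task)
    if codegen is None:
        raise ValueError(f"unsupported task: {task}")
    # Only the template's own braces are interpreted; braces inside the
    # substituted snippets are inserted verbatim.
    return TEMPLATE.format(payload=codegen.payload_python,
                           parse=codegen.parse_python)
```

Under this layout, adding a task family means defining one more `TaskCodegen` value and listing it in the registry; the shared template and the dispatcher stay untouched.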
…degen specs

Addresses Codecov's 66.85% patch coverage warning by exercising the defensive null-handling branches in HuggingFaceInferenceOpDesc.scala and the TextGenCodegen contract that previously had no spec hits.

- null-tolerance: feed null into every @JsonProperty (token, model, prompt col, system prompt, result col, task, maxNewTokens, temperature) and assert generatePythonCode still emits a parseable ProcessTableOperator with sane defaults (TASK falls back to text-generation, MAX_NEW_TOKENS clamps to 256, TEMPERATURE to 0.7). Covers the `if (x == null) ... else x` branches that previously had no test that took the null side.
- TextGenCodegen.task: trivial canonical-value check.
- TextGenCodegen ctx-independence: pass an "irrelevant"-filled ctx and assert payloadPython / parsePython still reference self.MODEL_ID and body["choices"]…. Catches a future refactor that accidentally splices ctx fields into the static snippets.

13/13 in HuggingFaceInferenceOpDescSpec, 2/2 in PythonCodeRawInvalidTextSpec (117/117 descriptors still py_compile cleanly).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
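The null-tolerance behaviour these specs assert can be pictured with a small sketch. It is Python rather than the descriptor's Scala, and `with_defaults` is a hypothetical helper; only the three fallback values (text-generation, 256, 0.7) are taken from the commit message.

```python
from typing import Optional, Tuple

# Fallback values named in the commit message.
DEFAULT_TASK = "text-generation"
DEFAULT_MAX_NEW_TOKENS = 256
DEFAULT_TEMPERATURE = 0.7


def with_defaults(task: Optional[str],
                  max_new_tokens: Optional[int],
                  temperature: Optional[float]) -> Tuple[str, int, float]:
    # Mirrors the `if (x == null) ... else x` branches: a missing field
    # falls back to its default, any supplied value passes through unchanged.
    return (
        DEFAULT_TASK if task is None else task,
        DEFAULT_MAX_NEW_TOKENS if max_new_tokens is None else max_new_tokens,
        DEFAULT_TEMPERATURE if temperature is None else temperature,
    )
```

A spec for the null side then reduces to calling the helper with every field missing and comparing against the three defaults.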
61e6c41 to 8350eb9
Ma77Ball left a comment
Please look over the suggestions below.
…NAI_COMPATIBLE_PROVIDERS to class constants
Hi @PG1204, what is the status of this PR? I'm not sure whether it is ready for review. Given the note about stacked PRs, is the current diff accurate?
@Yicong-Huang The comments given by @Ma77Ball have been resolved, awaiting further review.
@Yicong-Huang This is PR-2 in the stacked PRs for the HuggingFace operator. PR-1 was merged a while back.
Ma77Ball left a comment
Overall LGTM! I think the below can be implemented or left as is.
What changes were proposed in this PR?
Refactors the monolithic 1,278-line `HuggingFaceInferenceOpDesc` from the team's feature branch into a dispatcher + per-task codegen architecture and ships the first task family (text-generation):

- `codegen/TaskCodegen.scala` introduces the trait + `CodegenContext` that model per-task variation.
- `codegen/PythonCodegenBase.scala` emits the shared provider-fallback / `process_table` / `_parse_response` infrastructure with two holes for the per-task payload and parse snippets.
- `codegen/TextGenCodegen.scala` supplies text-generation's chat-completions payload and the `body["choices"][0]["message"]["content"]` parse branch.
- `HuggingFaceInferenceOpDesc.scala` becomes a thin (~180-line) dispatcher holding the `@JsonProperty` fields and the `registeredCodegens` map.

User-input string fields are typed `EncodableString` and emitted via the `pyb"..."` macro so values reach Python as `self.decode_python_template('<base64>')` rather than raw literals. Class constants are assigned in `open(self)` so `self` is in scope for the decode call. The generated `process_table` runs a defensive `_HF_MODEL_ID_PATTERN` check at runtime before any HF URL is composed.

The `TaskCodegen` trait also exposes a `tasks: Set[String]` default so a single codegen can register under multiple task strings; this becomes relevant in PR 3 (image family).

Any related issues, documentation, or discussions?
Tracked in #5277 & #5041 (umbrella issue for the HuggingFace operator end-to-end implementation).
Closes #5277
Stacked on #5124 (PR 1 - REST resource).
This is PR 2 of a multi-PR series landing the HuggingFace operator end-to-end. The full plan and umbrella issue live separately; this PR's scope is exactly the dispatcher pattern + text-generation codegen.
How was this PR tested?
- `sbt "WorkflowOperator/compile; WorkflowOperator/Test/compile"` clean.
- `sbt scalafmtCheck` clean.
- `sbt "WorkflowOperator/testOnly org.apache.texera.amber.operator.huggingFace.HuggingFaceInferenceOpDescSpec"`: 10/10 pass (operator info, validation, codegen wiring, MODEL_ID runtime check, leak-prevention, clamping, schema).
- `sbt "WorkflowOperator/testOnly org.apache.texera.amber.util.PythonCodeRawInvalidTextSpec"`: 117/117 descriptors `py_compile` cleanly, no raw-text leaks. The new operator is included in this scan.
- `python3 -m py_compile` on a sample output.

Was this PR authored or co-authored using generative AI tooling?
Co-authored with Claude Opus 4.7