feat(huggingface): add audio and media generation tasks #5570
anishshiva7 wants to merge 25 commits into
…d media proxy

Introduces a new Jersey REST resource exposing endpoints used by the upcoming HuggingFace operator UI:
- GET /api/huggingface/models: browse / search models per task
- GET /api/huggingface/tasks: list HF pipeline tags with hosted inference
- POST /api/huggingface/upload-audio: upload audio for HF audio tasks
- GET /api/huggingface/audio-preview: stream uploaded audio (path-validated)
- GET /api/huggingface/media-proxy: proxy remote media URLs to bypass CORS

This is the first PR in a stacked series landing the HF operator end-to-end. No operator code yet; this resource is independently useful and lets the frontend integrate with HF before the operator class lands.
Addresses xuang7's review on PR apache#5124: both endpoints previously buffered the full payload into a heap-resident byte[] with no upper bound, leaving the JVM open to OOM on a hostile or buggy upstream response (/media-proxy) or an out-of-band write into the audio temp dir (/audio-preview).
- /media-proxy: switch from Unirest.asBytes() to asObject(Function<RawResponse, T>), streaming the upstream body in 8 KiB chunks with a running byte counter. Aborts with 413 if the declared Content-Length exceeds the cap (pre-check) or if the body crosses the cap mid-read (defends against missing/lying Content-Length). New MAX_MEDIA_PROXY_BYTES = 50 MiB, sized for HF inference media (text-to-image ~5 MiB, text-to-video ~30 MiB) with headroom.
- /audio-preview: add Files.size() defense-in-depth check before readAllBytes. /upload-audio already enforces MAX_AUDIO_BYTES on ingest; this catches the case where a bug or out-of-band write puts an oversized file in the temp dir.

Adds a spec covering the audio-preview cap using a sparse-file fixture so the test stays fast (87/87 spec passes). The media-proxy cap path is exercised via the existing input-validation suite plus the new streamMediaWithCap helper; a follow-up can add a fake-RawResponse unit test if reviewers want explicit coverage of the chunked-read cap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
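The capped read described above can be sketched in Python as follows. This is only an illustration of the technique, not the PR's Scala/Unirest code: `read_with_cap` and `PayloadTooLarge` are hypothetical names, and only the 8 KiB chunk size and 50 MiB cap come from the commit message.

```python
import io

CHUNK_SIZE = 8 * 1024                      # 8 KiB chunks, as in the commit message
MAX_MEDIA_PROXY_BYTES = 50 * 1024 * 1024   # 50 MiB cap, as in the commit message


class PayloadTooLarge(Exception):
    """Hypothetical signal that the caller should answer with HTTP 413."""


def read_with_cap(stream, declared_length=None, cap=MAX_MEDIA_PROXY_BYTES):
    """Read a binary stream fully, refusing to buffer more than `cap` bytes."""
    # Pre-check: reject early when the upstream declares an oversized body.
    if declared_length is not None and declared_length > cap:
        raise PayloadTooLarge(f"declared {declared_length} bytes exceeds cap {cap}")
    buffer = io.BytesIO()
    total = 0
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            break
        total += len(chunk)
        # Mid-read check: covers a missing or understated Content-Length.
        if total > cap:
            raise PayloadTooLarge(f"body exceeded cap {cap} after {total} bytes")
        buffer.write(chunk)
    return buffer.getvalue()
```

The running counter is checked before each chunk is appended, so the buffer never holds more than `cap` bytes even when the declared length is wrong.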
Per review on apache#5124 (xuang7, Ma77Ball): mark the resource with @RolesAllowed(Array("REGULAR", "ADMIN")) to document that all five endpoints require an authenticated user. The annotation isn't enforced yet (that's coming with the auth-enforcement PR @Yicong-Huang and @Ma77Ball are working on), but adding it now means no follow-up change is needed when enforcement lands, and it matches the convention used by UserConfigResource / AdminSettingsResource.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eration

Splits the monolithic 1,278-line HuggingFaceInferenceOpDesc from the team's feature branch into a dispatcher + per-task codegen architecture and ships the first task family (text-generation) end-to-end.
- TaskCodegen trait + CodegenContext model the per-task variation
- PythonCodegenBase emits the shared provider-fallback / process_table / _parse_response infrastructure with two holes for the per-task payload and parse snippets
- TextGenCodegen supplies text-generation's chat-completions payload and the body["choices"][0]["message"]["content"] parse branch
- HuggingFaceInferenceOpDesc becomes a thin dispatcher (~180 lines) holding @JsonProperty fields and the registeredCodegens map

User-input string fields are typed as EncodableString and emitted via the pyb"..." macro so values reach Python as self.decode_python_template('<base64>') rather than raw literals; class constants are assigned in open(self) so self is in scope for the decode call. Generated process_table runs a defensive _HF_MODEL_ID_PATTERN check at runtime before any HF URL is composed.

PR 2 of a stacked 9-PR series. PR 1 (apache#5124) ships the supporting REST resource; PRs 3-5 will add image, audio + media-gen, and QA/ranking task families by registering new *Codegen objects in the dispatcher.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
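The idea of emitting user input as a base64-wrapped decode call rather than a raw literal can be sketched as below. This is an illustration under assumptions: `encode_for_codegen` and `FakeOperator` are hypothetical, and the real `decode_python_template` implementation is not shown in this PR text, so its body here is a guess at the simplest behavior consistent with the description.

```python
import base64


def encode_for_codegen(user_value: str) -> str:
    """Return a Python expression that decodes `user_value` at runtime.

    The base64 alphabet contains no quotes, backslashes, or newlines, so the
    value cannot break out of the single-quoted literal it is embedded in.
    """
    b64 = base64.b64encode(user_value.encode("utf-8")).decode("ascii")
    return f"self.decode_python_template('{b64}')"


class FakeOperator:
    """Stand-in for the generated operator; the method body is an assumption."""

    def decode_python_template(self, b64: str) -> str:
        return base64.b64decode(b64).decode("utf-8")
```

Evaluating the emitted expression with `self` bound to an operator instance returns the original string unchanged, including any quotes or statement separators it contained.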
…degen specs

Addresses Codecov's 66.85% patch coverage warning by exercising the defensive null-handling branches in HuggingFaceInferenceOpDesc.scala and the TextGenCodegen contract that previously had no spec hits.
- null-tolerance: feed null into every @JsonProperty (token, model, prompt col, system prompt, result col, task, maxNewTokens, temperature) and assert generatePythonCode still emits a parseable ProcessTableOperator with sane defaults (TASK falls back to text-generation, MAX_NEW_TOKENS clamps to 256, TEMPERATURE to 0.7). Covers the `if (x == null) ... else x` branches that previously had no test that took the null side.
- TextGenCodegen.task: trivial canonical-value check.
- TextGenCodegen ctx-independence: pass an "irrelevant"-filled ctx and assert payloadPython / parsePython still reference self.MODEL_ID and body["choices"]…. Catches a future refactor that accidentally splices ctx fields into the static snippets.

13/13 in HuggingFaceInferenceOpDescSpec, 2/2 in PythonCodeRawInvalidTextSpec (117/117 descriptors still py_compile cleanly).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
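The null-tolerance behavior the spec asserts can be pictured with a small Python sketch. The default values (text-generation, 256, 0.7) come from the commit message; the function name `resolve_settings` and the dictionary shape are hypothetical and do not mirror the Scala descriptor.

```python
DEFAULT_TASK = "text-generation"
DEFAULT_MAX_NEW_TOKENS = 256
DEFAULT_TEMPERATURE = 0.7


def resolve_settings(task=None, max_new_tokens=None, temperature=None):
    """Replace missing (None) settings with defaults, keeping explicit values."""
    # `is None` is used instead of truthiness so an explicit 0.0 temperature
    # is preserved rather than silently replaced by the default.
    return {
        "TASK": DEFAULT_TASK if task is None else task,
        "MAX_NEW_TOKENS": DEFAULT_MAX_NEW_TOKENS if max_new_tokens is None else max_new_tokens,
        "TEMPERATURE": DEFAULT_TEMPERATURE if temperature is None else temperature,
    }
```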
…NAI_COMPATIBLE_PROVIDERS to class constants
Plugs the 9-task image family into the dispatcher pattern established
in PR 2:
  image-only      image-classification, object-detection,
                  image-segmentation, image-to-text
  image + prompt  visual-question-answering, document-question-answering,
                  zero-shot-image-classification, image-text-to-text,
                  image-to-image
- ImageTaskCodegen supplies payload + parse Python for all 9 tasks
- TaskCodegen trait gains a `tasks: Set[String]` default method so a
single codegen can register under multiple task strings; the
dispatcher map in HuggingFaceInferenceOpDesc is built from
registeredCodegens.tasks.flatMap(...)
- CodegenContext extended with imageInput + inputImageColumn
(EncodableString)
- HuggingFaceInferenceOpDesc gains 2 new @JsonProperty fields and
registers ImageTaskCodegen
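The multi-task registration described above (one codegen listed under several task strings, with the dispatcher map built by flattening each codegen's `tasks` set) can be sketched in Python. The class and function names below are illustrative stand-ins for the Scala trait and objects; the duplicate-task check is an added assumption, not something the commit message states.

```python
class TaskCodegen:
    """Illustrative base: by default a codegen handles only its single task."""

    task = ""

    @property
    def tasks(self):
        return {self.task}


class TextGenCodegen(TaskCodegen):
    task = "text-generation"


class ImageTaskCodegen(TaskCodegen):
    task = "image-classification"

    @property
    def tasks(self):
        # The 9 image tasks named in the commit message.
        return {
            "image-classification", "object-detection",
            "image-segmentation", "image-to-text",
            "visual-question-answering", "document-question-answering",
            "zero-shot-image-classification", "image-text-to-text",
            "image-to-image",
        }


def build_dispatch(registered_codegens):
    """Flatten each codegen's task set into a task -> codegen lookup."""
    dispatch = {}
    for codegen in registered_codegens:
        for task in codegen.tasks:
            if task in dispatch:  # assumption: overlapping claims are an error
                raise ValueError(f"task {task!r} registered twice")
            dispatch[task] = codegen
    return dispatch
```

With this shape, adding a task family means appending one codegen to the registered list; the lookup table follows automatically.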
PythonCodegenBase grows to host the shared image infrastructure:
- image_only_tasks / image_prompt_tasks / image_tasks tuples and
image_headers in process_table
- per-row image bytes resolution from upload (self._read_image_input)
or input column (self._read_binary_value + self._compress_image_bytes)
- use_raw_binary_body / raw_binary_headers state threaded through
_post_with_fallback (signature extended)
- _post_with_fallback adds the image-text-to-text chat-completions
branch and the model-author vision branch
- _call_provider adds branches for zai-org's custom API, Replicate
predictions + polling, Fal-ai, Wavespeed submit+poll, and image
embedding in OpenAI-compatible / unknown-provider fallbacks
- image-content-type response handling returns data:image URLs
- image helpers added: _read_image_input, _compress_image_bytes,
_image_input_as_base64, _read_binary_value, _looks_like_html,
_html_to_image_bytes, _extract_json_arg, _url_to_data_url
User-input strings continue to flow through pyb"..." + EncodableString
so they reach Python as self.decode_python_template('<base64>') rather
than raw literals. PythonCodeRawInvalidTextSpec still passes
(117/117 descriptors py_compile cleanly).
Frontend integration adds only the HF lines (no agent / dataset
noise from the source branch):
- HuggingFaceImageUploadComponent declared in app.module.ts
- huggingface-image-upload formly type registered in formly-config.ts
- Image upload component .ts/.html/.scss cherry-picked from huggingFace
- HuggingFace.png + sample-image.png assets
PR 3 of a stacked 9-PR series. Stacks on hf/02-operator-textgen.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/request-review @Ma77Ball
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #5570 +/- ##
============================================
+ Coverage 51.88% 52.31% +0.42%
- Complexity 2472 2516 +44
============================================
Files 1067 1077 +10
Lines 41258 41537 +279
Branches 4437 4464 +27
============================================
+ Hits 21408 21731 +323
+ Misses 18591 18533 -58
- Partials 1259 1273 +14
Ma77Ball left a comment:
Please look over the comments below.
    |    elif task == "text-to-speech":
    |        payload = {"inputs": prompt_value}
    |    else:
    |        payload = {"inputs": prompt_value}""".stripMargin
The else is unreachable: tasks only contains the three audio tasks, all covered above, and it duplicates the text-to-speech payload (which will drift if edited). Fix:
-   |    elif task == "text-to-speech":
-   |        payload = {"inputs": prompt_value}
-   |    else:
-   |        payload = {"inputs": prompt_value}""".stripMargin
+   |    elif task == "text-to-speech":
+   |        payload = {"inputs": prompt_value}""".stripMargin
    """    if task in ("text-to-image", "text-to-video"):
      |        payload = {"inputs": prompt_value}
      |    else:
      |        payload = {"inputs": prompt_value}""".stripMargin
tasks is exactly {"text-to-image", "text-to-video"}, so the if always matches and this else is dead, duplicating the same payload. Fix:
-   """    if task in ("text-to-image", "text-to-video"):
-     |        payload = {"inputs": prompt_value}
-     |    else:
-     |        payload = {"inputs": prompt_value}""".stripMargin
+   """    payload = {"inputs": prompt_value}""".stripMargin
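The reviewer's point can be checked with a small Python sketch: when the task set is closed and both branches build the same payload, the branching version and the single-assignment version are indistinguishable. The function names are hypothetical; only the two task strings and the payload shape come from the quoted snippet.

```python
MEDIA_GEN_TASKS = ("text-to-image", "text-to-video")


def payload_with_dead_branch(task, prompt_value):
    # Mirrors the original generated snippet: both arms build the same dict.
    if task in MEDIA_GEN_TASKS:
        payload = {"inputs": prompt_value}
    else:
        payload = {"inputs": prompt_value}
    return payload


def payload_simplified(task, prompt_value):
    # Mirrors the suggested fix: a single unconditional assignment.
    return {"inputs": prompt_value}
```

Because the two functions agree for every task, removing the branch changes no behavior and leaves one place to edit if the payload shape ever changes.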
    |    elif task in ("text-to-image", "text-to-video"):
    |        inp = {"prompt": prompt_value}
    |    elif task == "automatic-speech-recognition" and img_b64:
    |        inp = {"audio": f"data:audio/wav;base64,{img_b64}"}
In the Replicate branch the audio data URL MIME type is hardcoded to audio/wav, but current_audio_bytes may be mp3/flac/ogg, and the real type is already computed by _get_audio_content_type(). A non-wav payload labeled audio/wav can be rejected or mis-decoded by the model. The audio type is not currently threaded into _call_provider, so this is a small refactor.
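One way to picture the suggested refactor is to build the data URL from a detected content type instead of a fixed audio/wav. The sketch below is an assumption-laden illustration: the body of `_get_audio_content_type()` is not shown in this PR text, so `sniff_audio_content_type` is a hypothetical stand-in that checks well-known container magic bytes, and `audio_data_url` is likewise an invented helper name.

```python
import base64


def sniff_audio_content_type(data: bytes) -> str:
    """Guess an audio MIME type from leading magic bytes (hypothetical helper)."""
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "audio/wav"
    if data[:4] == b"fLaC":
        return "audio/flac"
    if data[:4] == b"OggS":
        return "audio/ogg"
    # MP3: either an ID3v2 tag or an MPEG frame sync (11 set bits).
    if data[:3] == b"ID3" or (len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0):
        return "audio/mpeg"
    return "application/octet-stream"


def audio_data_url(data: bytes, content_type=None) -> str:
    """Build a data URL whose MIME label matches the actual audio bytes."""
    mime = content_type or sniff_audio_content_type(data)
    b64 = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{b64}"
```

Passing the already-computed content type as an argument (rather than sniffing again) would correspond to threading the audio type into `_call_provider`, as the comment proposes.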
…hf/03-image-tasks. Until that lands, the diff below may also include PR 3's image-task operator + codegen + spec changes depending on which base GitHub is showing. The new code in this PR is codegen/AudioTaskCodegen.scala, codegen/MediaGenCodegen.scala, the audio/media-related additions to codegen/PythonCodegenBase.scala, the new audio fields on HuggingFaceInferenceOpDesc.scala, and the audio/media-task tests in HuggingFaceInferenceOpDescSpec.scala. Once PR 3 merges and this PR is retargeted to main, the diff should auto-clean to the PR 4 audio/media changes only.

What changes were proposed in this PR?

Adds the audio and media-generation task families (5 HF pipeline tasks) as new TaskCodegens plugged into the dispatcher established by the text-generation PR:
- audio tasks: automatic-speech-recognition, audio-classification, text-to-speech
- media-generation tasks: text-to-image, text-to-video

codegen/AudioTaskCodegen.scala supplies the per-task payload + parse Python branches for the 3 audio tasks.
codegen/MediaGenCodegen.scala supplies the per-task payload + parse Python branches for the 2 media-generation tasks.
CodegenContext is extended with audioInput + inputAudioColumn (EncodableString).
HuggingFaceInferenceOpDesc.scala gains 2 new @JsonProperty fields and registers AudioTaskCodegen + MediaGenCodegen in the dispatcher.

PythonCodegenBase.scala grows to host the shared audio/media infrastructure:
- …audio_only_tasks) in process_table.
- …automatic-speech-recognition and audio-classification.
- …text-to-speech.
- …_call_provider, including OpenAI-compatible image/audio endpoints where supported.
- …data:image/..., data:audio/..., or data:video/... URLs where needed.

User-input strings continue to flow through pyb"..." + EncodableString so they reach Python as self.decode_python_template('<base64>') rather than raw literals. PythonCodeRawInvalidTextSpec still passes with 117/117 descriptors py_compile cleanly.

Any related issues, documentation, or discussions?
Tracking issue: Add audio and media-generation task families to HuggingFace operator #5288
Closes #5288
Stacked on: Add image task family (ImageTaskCodegen) to HuggingFace operator / hf/03-image-tasks
Parent issue: Add Hugging Face inference operator #5041
Closed sibling issue: Add HuggingFaceModelResource REST endpoints for HF operator UI #5134
How was this PR tested?
- sbt "WorkflowOperator/compile; WorkflowOperator/Test/compile": clean.
- sbt scalafmtCheck: clean.
- sbt "WorkflowOperator/testOnly org.apache.texera.amber.operator.huggingFace.HuggingFaceInferenceOpDescSpec org.apache.texera.amber.util.PythonCodeRawInvalidTextSpec": 26 focused tests pass, including HuggingFace audio/media task coverage and the raw Python descriptor scan.
- sbt "WorkflowOperator/testOnly org.apache.texera.amber.util.PythonCodeRawInvalidTextSpec": 117/117 descriptors py_compile cleanly with the new operator code paths, no marker leaks.

Was this PR authored or co-authored using generative AI tooling?
Yes, co-authored with generative AI tooling (Codex).