[Bug] ModelTrainer drops TrainingJobName for PipelineSession, breaking use_custom_job_prefix on TrainingStep #5776

@rojo1997

PySDK Version

  • PySDK V2 (2.x)
  • PySDK V3 (3.x)

Describe the bug
When a ModelTrainer is executed under a PipelineSession (i.e. produces a TrainingStep), ModelTrainer._create_training_job_args explicitly removes training_job_name from the request before serializing it to PascalCase:

# sagemaker/train/model_trainer.py (sagemaker-train 1.8.0)
if boto3 or isinstance(self.sagemaker_session, PipelineSession):
    if isinstance(self.sagemaker_session, PipelineSession):
        training_request.pop("training_job_name", None)
    # Convert snake_case to PascalCase for AWS API
    pipeline_request = {to_pascal_case(k): v for k, v in training_request.items()}
    serialized_request = serialize(pipeline_request)
    return serialized_request

Because the key is popped, the resulting request dict has no TrainingJobName. Downstream, TrainingStep.arguments relies on TrainingJobName being present in the request: with PipelineDefinitionConfig(use_custom_job_prefix=True) the name is kept so that the prefix is preserved, and with use_custom_job_prefix=False trim_request_dict removes it.

The net effect is that use_custom_job_prefix=True is silently ignored for TrainingStep when the step is built from a ModelTrainer: every pipeline execution produces a random auto-generated training job name instead of the configured base_job_name prefix.
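The interaction can be illustrated with a toy model that does not import the SDK. Everything in it is a stand-in written for this illustration: `to_pascal_case`, `build_request` and `trim` are simplified, hypothetical re-implementations, not the SDK's actual code, and the job name and role ARN are made-up values.

```python
def to_pascal_case(key: str) -> str:
    """Stand-in: convert a snake_case key to PascalCase."""
    return "".join(part.capitalize() for part in key.split("_"))


def build_request(training_request: dict, pop_name: bool) -> dict:
    """Mimic the key conversion, optionally dropping the job name first."""
    request = dict(training_request)  # work on a copy
    if pop_name:
        request.pop("training_job_name", None)
    return {to_pascal_case(k): v for k, v in request.items()}


def trim(request: dict, use_custom_job_prefix: bool) -> dict:
    """Assumed downstream behaviour: keep the name only for custom prefixes."""
    trimmed = dict(request)
    if not use_custom_job_prefix:
        trimmed.pop("TrainingJobName", None)
    return trimmed


# Made-up request values, for illustration only.
training_request = {
    "training_job_name": "my-prefix-20240101",
    "role_arn": "arn:aws:iam::123456789012:role/example",
}

# Current behaviour: the name is popped before the step ever sees it.
current = trim(build_request(training_request, pop_name=True), use_custom_job_prefix=True)

# Proposed behaviour: the name survives when a custom prefix is requested...
fixed = trim(build_request(training_request, pop_name=False), use_custom_job_prefix=True)

# ...and is still stripped downstream when it is not.
fixed_default = trim(build_request(training_request, pop_name=False), use_custom_job_prefix=False)
```

In this sketch only `fixed` still carries `TrainingJobName`. With the pop in place, the `use_custom_job_prefix=True` branch has nothing to preserve.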

This is the same class of bug as #3991 and #4590 (which covered TransformStep and ProcessingStep), but for the new V3 path in which a TrainingStep is built from a ModelTrainer.

To reproduce

from sagemaker.core.workflow.pipeline import Pipeline
from sagemaker.core.workflow.pipeline_context import PipelineSession
from sagemaker.core.workflow.pipeline_definition_config import PipelineDefinitionConfig
from sagemaker.train.model_trainer import ModelTrainer
# ... build a ModelTrainer `trainer` with base_job_name="my-prefix" ...

pipeline_session = PipelineSession()
trainer.sagemaker_session = pipeline_session

step_args = trainer._create_training_job_args()
assert "TrainingJobName" in step_args, step_args  # FAILS: key was popped

pipeline = Pipeline(
    name="repro",
    steps=[...],  # TrainingStep built from trainer
    sagemaker_session=pipeline_session,
    pipeline_definition_config=PipelineDefinitionConfig(use_custom_job_prefix=True),
)
# Pipeline executions will NOT use "my-prefix-..." as the training job name.

Expected behavior
TrainingJobName should remain in the request dict so that PipelineDefinitionConfig(use_custom_job_prefix=True) produces training jobs named with the configured prefix. When use_custom_job_prefix=False, TrainingStep.arguments/trim_request_dict will strip the key as usual.

A minimal fix is to stop popping the key:

if boto3 or isinstance(self.sagemaker_session, PipelineSession):
    pipeline_request = {to_pascal_case(k): v for k, v in training_request.items()}
    serialized_request = serialize(pipeline_request)
    return serialized_request

As a workaround we currently monkey-patch _create_training_job_args to re-insert TrainingJobName = _get_unique_name(self.base_job_name) when the session is a PipelineSession.
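For anyone needing the same workaround, its shape is roughly as follows. This is a sketch under stated assumptions rather than our exact patch: `reinsert_job_name` is a hypothetical helper, `FakeTrainer` is a stub standing in for ModelTrainer so the snippet runs without the SDK, and the returned request is assumed to be a plain mutable dict.

```python
import functools
import time


def reinsert_job_name(cls, method_name, should_patch, make_name):
    """Wrap cls.<method_name> so a missing TrainingJobName is put back.

    should_patch(instance) decides whether the instance is affected
    (for the real class: whether its session is a PipelineSession).
    make_name(base_job_name) builds the name to insert.
    """
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        request = original(self, *args, **kwargs)
        # Assumption: the request is a mutable dict keyed in PascalCase.
        if should_patch(self) and "TrainingJobName" not in request:
            request["TrainingJobName"] = make_name(self.base_job_name)
        return request

    setattr(cls, method_name, wrapper)


class FakeTrainer:
    """Stub standing in for ModelTrainer; it reproduces the missing key."""

    base_job_name = "my-prefix"

    def __init__(self, in_pipeline: bool):
        self.in_pipeline = in_pipeline

    def _create_training_job_args(self):
        return {"RoleArn": "arn:aws:iam::123456789012:role/example"}


reinsert_job_name(
    FakeTrainer,
    "_create_training_job_args",
    should_patch=lambda trainer: trainer.in_pipeline,
    make_name=lambda base: f"{base}-{int(time.time())}",  # stand-in for _get_unique_name
)

step_args = FakeTrainer(in_pipeline=True)._create_training_job_args()
plain_args = FakeTrainer(in_pipeline=False)._create_training_job_args()
```

Against the real SDK the same helper would presumably be applied to ModelTrainer, with an isinstance check against PipelineSession as `should_patch` and `_get_unique_name` as `make_name`. That wiring is not verified here, and since it patches a private method it may break on any release.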

Screenshots or logs
N/A — silent misbehavior; the pipeline executes but job names use the default random name instead of the configured prefix.

System information

  • SageMaker Python SDK version: sagemaker-train 1.8.0, sagemaker-core 2.8.0, sagemaker-mlops 1.8.0, sagemaker-serve 1.8.0 (also reproduces on 1.7.1 / 2.7.1)
  • Framework name or algorithm: custom (source_code via ModelTrainer)
  • Framework version: N/A
  • Python version: 3.13
  • CPU or GPU: CPU (irrelevant, bug is SDK-side)
  • Custom Docker image (Y/N): Y

Additional context
Related closed issues for other step types: #3991 (TransformStep), #4590 (TransformStep/ProcessingStep).

Labels

  • component: pipelines (relates to the SageMaker Pipeline Platform)