PySDK Version
Describe the bug
FrameworkProcessor.run(requirements=...) in the SageMaker v3 sagemaker-core Processing API accepts a custom requirements file name, but the generated runtime script does not use it.
For example, the following call is accepted without error:

```python
framework_processor.run(
    code="run_entrypoint.py",
    source_dir=".",
    requirements="cpu-requirements.txt",
)
```
However, the generated runproc.sh only checks for a hardcoded file named requirements.txt:
```shell
if [[ -f 'requirements.txt' ]]; then
    pip uninstall --yes typing
    pip install -r requirements.txt
fi
```
As a result, cpu-requirements.txt is packaged into sourcedir.tar.gz but is never installed at runtime.
Relevant file:
sagemaker-core/src/sagemaker/core/processing.py
Relevant method:
FrameworkProcessor._generate_framework_script()
Reference:
https://github.com/aws/sagemaker-python-sdk/blob/master/sagemaker-core/src/sagemaker/core/processing.py#L1187
To reproduce
Create the following project structure:
```text
.
├── run_entrypoint.py
├── cpu-requirements.txt
└── scripts/
    └── entrypoints/
        └── step_preprocess_jobs.py
```
Example cpu-requirements.txt:
```text
requests==2.32.3
```
Example run_entrypoint.py:
```python
import importlib
import sys

module_name = sys.argv[1]
module = importlib.import_module(module_name)
if hasattr(module, "main"):
    module.main()
```
Example scripts/entrypoints/step_preprocess_jobs.py:
```python
def main():
    import requests

    print("requests imported successfully")
```
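As a side note, the dispatch pattern used by run_entrypoint.py can be sanity-checked locally with a standard-library module standing in for scripts.entrypoints.step_preprocess_jobs (which only exists in the reproduction project):

```python
import importlib

# Same dynamic dispatch as run_entrypoint.py: import a module by its
# dotted name and check whether it exposes a main() entry point.
module = importlib.import_module("json.tool")
print(hasattr(module, "main"))  # True: json.tool defines main()
```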
Then create a SageMaker Processing step using FrameworkProcessor.run(...) with a custom requirements file name:
```python
from sagemaker.core.processing import FrameworkProcessor
from sagemaker.core.workflow.pipeline_context import PipelineSession
from sagemaker.core.shapes import (
    ProcessingInput,
    ProcessingOutput,
    ProcessingS3Input,
    ProcessingS3Output,
)

role = "<SAGEMAKER_EXECUTION_ROLE_ARN>"
image_uri = "<PROCESSING_IMAGE_URI>"
bucket = "<S3_BUCKET>"

pipeline_session = PipelineSession()

framework_processor = FrameworkProcessor(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    sagemaker_session=pipeline_session,
)

step_args = framework_processor.run(
    code="run_entrypoint.py",
    source_dir=".",
    requirements="cpu-requirements.txt",
    arguments=["scripts.entrypoints.step_preprocess_jobs"],
    inputs=[
        ProcessingInput(
            input_name="dummy_input",
            s3_input=ProcessingS3Input(
                s3_uri=f"s3://{bucket}/dummy-input/",
                local_path="/opt/ml/processing/input/dummy",
                s3_data_type="S3Prefix",
                s3_input_mode="File",
            ),
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name="dummy_output",
            s3_output=ProcessingS3Output(
                s3_uri=f"s3://{bucket}/dummy-output/",
                local_path="/opt/ml/processing/output",
                s3_upload_mode="EndOfJob",
            ),
        )
    ],
)
```
The generated processing runtime script only installs requirements.txt, not the file passed through the requirements argument.
The relevant generated script logic is:
```shell
if [[ -f 'requirements.txt' ]]; then
    pip uninstall --yes typing
    pip install -r requirements.txt
fi
```
As a result, dependencies listed in cpu-requirements.txt are not installed.
Expected behavior
When a user passes `requirements="cpu-requirements.txt"` to FrameworkProcessor.run(...), the generated runtime script should install dependencies from that file:

```shell
pip install -r cpu-requirements.txt
```

The requirements argument should be respected.
For backward compatibility, if requirements is not provided, the implementation may continue checking for and installing from requirements.txt.
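A minimal sketch of one possible fix, assuming the method builds the script from a template string (the real `_generate_framework_script` internals are not reproduced here, and the function name below is illustrative):

```python
from typing import Optional
import textwrap


def generate_runproc_script(requirements: Optional[str] = None) -> str:
    """Sketch of the proposed behavior: interpolate the user-supplied
    requirements file name into the generated script, falling back to
    requirements.txt when the argument is omitted (preserving the
    current backward-compatible default)."""
    requirements_file = requirements or "requirements.txt"
    return textwrap.dedent(
        f"""\
        #!/bin/bash
        cd /opt/ml/processing/input/code/
        tar -xzf sourcedir.tar.gz
        if [[ -f '{requirements_file}' ]]; then
            pip install -r {requirements_file}
        fi
        """
    )


print("pip install -r cpu-requirements.txt" in generate_runproc_script("cpu-requirements.txt"))  # True
print("pip install -r requirements.txt" in generate_runproc_script())  # True
```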
Current generated script behavior:
```shell
cd /opt/ml/processing/input/code/
if [ -f sourcedir.tar.gz ]; then
    tar -xzf sourcedir.tar.gz
else
    echo "ERROR: sourcedir.tar.gz not found!"
    exit 1
fi
if [[ -f 'requirements.txt' ]]; then
    pip uninstall --yes typing
    pip install -r requirements.txt
fi
```
Expected generated script behavior when requirements="cpu-requirements.txt" is provided:
```shell
if [[ -f 'cpu-requirements.txt' ]]; then
    pip uninstall --yes typing
    pip install -r cpu-requirements.txt
else
    echo "WARNING: requirements file not found: cpu-requirements.txt"
fi
```
System information
- SageMaker Python SDK version: PySDK V3 / sagemaker-core
- Framework name (eg. PyTorch) or algorithm (eg. KMeans): FrameworkProcessor
- Framework version: N/A
- Python version: N/A
- CPU or GPU: CPU
- Custom Docker image (Y/N): Y
Additional context
The requirements parameter is accepted by FrameworkProcessor.run(...) and passed into internal methods such as _pack_and_upload_code(...) and _package_code(...), but it is not used when generating the runtime shell script.
The root cause appears to be that _generate_framework_script(...) hardcodes requirements.txt instead of using the requirements argument passed by the user.
I would like to work on this issue and submit a PR to fix the behavior.