
Commit 063b871

FAQ update: detect low gpu utilization
1 parent 1a855d4 commit 063b871


1 file changed

+94
-1
lines changed

FAQ.md

Lines changed: 94 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -15,6 +15,8 @@ We often see a lot of questions that surface repeatedly. This repository is an a
1515
* [How to troubleshoot jobs that are failing with the exception or error?](#how-to-troubleshoot-jobs-that-are-failing-with-the-exception-or-error)
1616
* [I see folders like Desktop, Documents, Downloads, Pictures in SageMaker Studio, is it fine?](#i-see-folders-like-desktop-documents-downloads-pictures-in-sagemaker-studio-is-it-fine)
1717
* [I'm running SageMaker in a VPC. Do I need to make extra configuration?](#im-running-sagemaker-in-a-vpc-do-i-need-to-make-extra-configuration)
18+
* [When debugging a training job with SageMaker SSH Helper and train_placeholder.py, I want to automatically stop the job when there are no users connected and there's no GPU utilization. How to do that?](#when-debugging-a-training-job-with-sagemaker-ssh-helper-and-training-placeholder-i-want-to-automatically-stop-the-job-when-there-are-no-users-connected-and-theres-no-gpu-utilization-how-to-do-that)
19+
* [I want to send users the SMS or email notification when the placeholder training job has issues with low GPU utilization. How to do that?](#i-want-to-send-users-the-sms-or-email-notification-when-the-placeholder-training-job-has-issues-with-low-gpu-utilization-how-to-do-that)
1820
* [API Questions](#api-questions)
1921
* [I'm using boto3 Python SDK instead of SageMaker Python SDK, how can I use SageMaker SSH Helper?](#im-using-boto3-python-sdk-instead-of-sagemaker-python-sdk-how-can-i-use-sagemaker-ssh-helper)
2022
* [How can I change the SSH authorized keys bucket and location when running sm-local-ssh-* commands?](#how-can-i-change-the-ssh-authorized-keys-bucket-and-location-when-running-sm-local-ssh--commands)
@@ -149,6 +151,97 @@ Yes, it's fine. They don't contain any of your local data. These are the freshly
149151
### I'm running SageMaker in a VPC. Do I need to make extra configuration?
150152
You may optionally configure [AWS PrivateLink for Session Manager endpoints](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-getting-started-privatelink.html). Be aware, however, that SageMaker SSH Helper needs Internet access to download and install extra packages inside SageMaker, such as the AWS CLI and the Session Manager Agent. For this to work in a VPC, you will need a NAT gateway.
151153

154+
### When debugging a training job with SageMaker SSH Helper and training placeholder, I want to automatically stop the job when there are no users connected and there's no GPU utilization. How to do that?
155+
156+
To stop the job when no users are connected, consider using the `sagemaker_ssh_helper.is_last_session_timeout(timedelta)` method, as already described in the section [Remote code execution](README.md#remote-code-execution-with-pycharm-vscode-over-ssh). The method counts active SSM sessions and times out when there are no sessions left.
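
To make the idea of a "last session timeout" concrete, here is a minimal sketch of one way such a check could be structured. It is not the implementation used by SageMaker SSH Helper: the `IdleSessionTimer` class, the `count_sessions` callback and the injectable clock are hypothetical names introduced only for illustration.

```python
from datetime import datetime, timedelta
from typing import Callable


class IdleSessionTimer:
    """Hypothetical helper: reports a timeout once no sessions were seen for `timeout`."""

    def __init__(self, count_sessions: Callable[[], int],
                 now: Callable[[], datetime] = datetime.now) -> None:
        self._count_sessions = count_sessions  # assumed to return the number of active sessions
        self._now = now
        self._last_active = now()  # treat start-up as the last moment of activity

    def is_timed_out(self, timeout: timedelta) -> bool:
        current = self._now()
        if self._count_sessions() > 0:
            # Somebody is connected: remember the time and never time out.
            self._last_active = current
            return False
        # Nobody is connected: time out once the idle period reaches the limit.
        return current - self._last_active >= timeout
```

Calling `is_timed_out(timedelta(minutes=5))` in a polling loop then returns `True` only after five minutes have passed since a session was last observed.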
157+
158+
To stop the job when there's no GPU utilization, consider using SageMaker Profiler and the [LowGPUUtilization rule](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html#low-gpu-utilization), in combination with the `sagemaker_ssh_helper.is_profiler_issues_found()` method.
159+
160+
Your `train_placeholder.py` script may then look like the one below. It stops, and the job completes, if the container is left unattended and the profiler has found issues such as low GPU utilization:
161+
```python
import time
import os
from datetime import timedelta

from sagemaker_ssh_helper import setup_and_start_ssh, is_last_session_timeout, is_profiler_issues_found

setup_and_start_ssh()

os.environ["AWS_DEFAULT_REGION"] = os.environ.get("AWS_REGION", "")  # for boto3

while True:
    time.sleep(10)
    if is_last_session_timeout(timedelta(minutes=5)) and is_profiler_issues_found():
        break
```
178+
179+
To execute this script with SageMaker Profiler, pass extra configuration parameters to the job:
180+
```python
import os
from datetime import timedelta

from sagemaker.pytorch import PyTorch
from sagemaker.debugger import ProfilerRule, rule_configs, ProfilerConfig

from sagemaker_ssh_helper.wrapper import SSHEstimatorWrapper


profiler_config = ProfilerConfig(
    system_monitor_interval_millis=100,  # grab metrics 10 times per second
)
rules = [
    ProfilerRule.sagemaker(rule_configs.LowGPUUtilization(
        scan_interval_us=60 * 1000 * 1000,  # scan every minute
        patience=2,  # skip the first 2 minutes
        threshold_p95=50,  # GPU should be at least 50% utilized, 95% of the time
        threshold_p5=0,  # skip detecting accidental drops
        window=1200,  # take the last 1200 readings, i.e., the last 2 minutes
    )),
]

estimator = PyTorch(
    entry_point='train_placeholder.py',
    source_dir='source_dir/training_placeholder/',
    dependencies=[SSHEstimatorWrapper.dependency_dir()],
    base_job_name='ssh-training-low-gpu',
    framework_version='1.9.1',
    py_version='py38',
    instance_count=1,
    instance_type='ml.g4dn.xlarge',
    max_run=int(timedelta(minutes=15).total_seconds()),
    profiler_config=profiler_config,
    rules=rules
)

ssh_wrapper = SSHEstimatorWrapper.create(estimator, connection_wait_time_seconds=0)

estimator.fit(wait=False)

status = ssh_wrapper.wait_training_job_with_status()
```
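
The `window=1200` value above follows from the monitoring interval: at one reading every 100 ms, two minutes of data is 1200 readings. If you change `system_monitor_interval_millis`, the window needs to change with it. The helper below is a small sketch of that arithmetic; `readings_in_window` is a hypothetical name, not part of the SageMaker SDK.

```python
def readings_in_window(monitor_interval_millis: int, window_seconds: int) -> int:
    """Number of system-metric readings collected over `window_seconds`."""
    if monitor_interval_millis <= 0:
        raise ValueError("monitor_interval_millis must be positive")
    # Convert the window to milliseconds, then count how many intervals fit into it.
    return window_seconds * 1000 // monitor_interval_millis


# Settings from the example above: 100 ms interval, 2-minute window.
print(readings_in_window(100, 120))  # 1200
```

For instance, with a 500 ms interval the same two-minute window would hold 240 readings.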
224+
225+
*Tip:* To avoid unnecessary disruption of users' work, consider sending them notification emails instead of stopping their jobs automatically. See [the related notification question](FAQ.md#i-want-to-send-users-the-sms-or-email-notification-when-the-placeholder-training-job-has-issues-with-low-gpu-utilization-how-to-do-that).
226+
227+
228+
### I want to send users the SMS or email notification when the placeholder training job has issues with low GPU utilization. How to do that?
229+
230+
First, you need to detect low GPU utilization [with SageMaker Profiler](FAQ.md#when-debugging-a-training-job-with-sagemaker-ssh-helper-and-training-placeholder-i-want-to-automatically-stop-the-job-when-there-are-no-users-connected-and-theres-no-gpu-utilization-how-to-do-that).
231+
232+
When the SageMaker Profiler job, which is a SageMaker Processing job running the SageMaker Debugger container, finds issues with a Training job, it stops.
233+
234+
So the trick here is to listen for the [Amazon EventBridge events](https://docs.aws.amazon.com/sagemaker/latest/dg/automating-sagemaker-with-eventbridge.html) emitted when a SageMaker Processing job changes its status. Then check whether the event is for a completed SageMaker Debugger job and, if so, check the profiler issues of the related training job. If issues are found, send a notification to the [Amazon SNS](https://aws.amazon.com/sns/) topic.
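
The flow just described can be sketched roughly as follows. This is not the Lambda code from the repository; it is a minimal illustration under stated assumptions. The function names are hypothetical, the `sagemaker-debugger-rules` image marker is an assumption about how the Debugger container can be recognized, and accepting both `Completed` and `Stopped` as final statuses is a guess. How the related training job is looked up is left to the injected `describe_related_training_job` callable, which is expected to return a `DescribeTrainingJob`-style response containing `ProfilerRuleEvaluationStatuses`.

```python
from typing import Any, Callable, Dict

# Assumption: the Debugger container image URI contains this marker.
DEBUGGER_IMAGE_MARKER = "sagemaker-debugger-rules"
# Assumption: either status may indicate that the rule job has finished.
FINAL_STATUSES = {"Completed", "Stopped"}


def is_finished_debugger_job(event: Dict[str, Any]) -> bool:
    """True if the event describes a finished Processing job running the Debugger image."""
    detail = event.get("detail", {})
    image_uri = detail.get("AppSpecification", {}).get("ImageUri", "")
    return (
        event.get("detail-type") == "SageMaker Processing Job State Change"
        and detail.get("ProcessingJobStatus") in FINAL_STATUSES
        and DEBUGGER_IMAGE_MARKER in image_uri
    )


def has_profiler_issues(training_job: Dict[str, Any]) -> bool:
    """True if any profiler rule of the training job reports IssuesFound."""
    statuses = training_job.get("ProfilerRuleEvaluationStatuses", [])
    return any(s.get("RuleEvaluationStatus") == "IssuesFound" for s in statuses)


def notify_on_profiler_issues(
    event: Dict[str, Any],
    describe_related_training_job: Callable[[Dict[str, Any]], Dict[str, Any]],
    publish: Callable[[str], None],
) -> bool:
    """Publish a notification if the finished Debugger job found issues; return whether it did."""
    if not is_finished_debugger_job(event):
        return False
    training_job = describe_related_training_job(event["detail"])
    if not has_profiler_issues(training_job):
        return False
    name = training_job.get("TrainingJobName", "<unknown>")
    publish(f"SageMaker Profiler found issues in training job {name}")
    return True
```

In a real Lambda function, `publish` could wrap the boto3 `sns.publish(TopicArn=..., Message=...)` call, and `describe_related_training_job` could wrap `sagemaker.describe_training_job(TrainingJobName=...)`. Keeping both as parameters makes the decision logic easy to test without AWS access.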
235+
236+
See the [CDK code to deploy the Lambda and the EventBridge rule](https://github.com/aws-samples/sagemaker-ssh-helper/tree/main/sagemaker_ssh_helper/cdk/low_gpu) for more details.
237+
238+
To deploy the stack into your environment with CDK, run these commands:
239+
```bash
APP="python -m sagemaker_ssh_helper.cdk.low_gpu_lambda_app"
AWS_REGION=$REGION cdk -a "$APP" deploy Low-GPU-Lambda-Stack \
  -c sns_notification_topic_arn="$SNS_NOTIFICATION_TOPIC_ARN"
```
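
The deploy command expects `$SNS_NOTIFICATION_TOPIC_ARN` to point at an existing topic. If you don't have one yet, a sketch like the following could create it and subscribe an email address using the standard SNS `create_topic` and `subscribe` calls. The function name, the topic name and the address are placeholders for illustration, and the client is passed in as a parameter so the function does not depend on how you configure boto3.

```python
from typing import Any


def create_topic_with_email(sns_client: Any, topic_name: str, email: str) -> str:
    """Create (or reuse) an SNS topic, subscribe an email address and return the topic ARN."""
    topic_arn = sns_client.create_topic(Name=topic_name)["TopicArn"]
    # The recipient has to confirm the subscription before notifications are delivered.
    sns_client.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint=email)
    return topic_arn
```

With boto3 installed, this could be called as `create_topic_with_email(boto3.client("sns"), "low-gpu-notifications", "user@example.com")`. For SMS notifications, the same `subscribe` call accepts `Protocol="sms"` with a phone number as the endpoint.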
152245

153246
## API Questions
154247

@@ -473,4 +566,4 @@ or AWS config file:
473566
474567
```text
475568
region eu-west-1 config-file ~/.aws/config
476-
```
569+
```
