FAQ.md
Lines changed: 94 additions & 1 deletion
Original file line number
Diff line number
Diff line change
@@ -15,6 +15,8 @@ We often see a lot of questions that surface repeatedly. This repository is an a
15
15
*[How to troubleshoot jobs that are failing with the exception or error?](#how-to-troubleshoot-jobs-that-are-failing-with-the-exception-or-error)
16
16
*[I see folders like Desktop, Documents, Downloads, Pictures in SageMaker Studio, is it fine?](#i-see-folders-like-desktop-documents-downloads-pictures-in-sagemaker-studio-is-it-fine)
17
17
*[I'm running SageMaker in a VPC. Do I need to make extra configuration?](#im-running-sagemaker-in-a-vpc-do-i-need-to-make-extra-configuration)
18
+
*[When debugging a training job with SageMaker SSH Helper and train_placeholder.py, I want to automatically stop the job when there are no users connected and there's no GPU utilization. How to do that?](#when-debugging-a-training-job-with-sagemaker-ssh-helper-and-training-placeholder-i-want-to-automatically-stop-the-job-when-there-are-no-users-connected-and-theres-no-gpu-utilization-how-to-do-that)
19
+
*[I want to send users the SMS or email notification when the placeholder training job has issues with low GPU utilization. How to do that?](#i-want-to-send-users-the-sms-or-email-notification-when-the-placeholder-training-job-has-issues-with-low-gpu-utilization-how-to-do-that)
18
20
*[API Questions](#api-questions)
19
21
*[I'm using boto3 Python SDK instead of SageMaker Python SDK, how can I use SageMaker SSH Helper?](#im-using-boto3-python-sdk-instead-of-sagemaker-python-sdk-how-can-i-use-sagemaker-ssh-helper)
20
22
*[How can I change the SSH authorized keys bucket and location when running sm-local-ssh-* commands?](#how-can-i-change-the-ssh-authorized-keys-bucket-and-location-when-running-sm-local-ssh--commands)
@@ -149,6 +151,97 @@ Yes, it's fine. They don't contain any of your local data. These are the freshly
149
151
### I'm running SageMaker in a VPC. Do I need to make extra configuration?
150
152
You may optionally configure [AWS PrivateLink for Session Manager endpoints](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-getting-started-privatelink.html). Be aware, however, that SageMaker SSH Helper needs Internet access to download and install extra packages inside SageMaker, such as the AWS CLI and the Session Manager agent. For this to work, you will need a NAT gateway.
151
153
154
+
### When debugging a training job with SageMaker SSH Helper and training placeholder, I want to automatically stop the job when there are no users connected and there's no GPU utilization. How to do that?
155
+
156
+
To stop the job when no users are connected, consider using the `sagemaker_ssh_helper.is_last_session_timeout(timedelta)` method, as described in the section [Remote code execution](README.md#remote-code-execution-with-pycharm-vscode-over-ssh). The method counts active SSM sessions and times out when there are no sessions left.
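The idea behind such an idle timeout can be sketched in plain Python. This is only an illustration, not the library's implementation: the `IdleSessionTimer` class is hypothetical, and the number of active sessions is passed in as an argument instead of being queried from SSM.

```python
from datetime import datetime, timedelta


class IdleSessionTimer:
    """Hypothetical helper: reports a timeout once no sessions were seen for `timeout`."""

    def __init__(self, now: datetime):
        # Treat the start time as the last moment of activity.
        self.last_active = now

    def is_timeout(self, active_sessions: int, timeout: timedelta, now: datetime) -> bool:
        if active_sessions > 0:
            # Someone is connected, so reset the idle clock.
            self.last_active = now
            return False
        return now - self.last_active >= timeout


start = datetime(2024, 1, 1, 12, 0, 0)
timer = IdleSessionTimer(now=start)

# One user is connected two minutes in: no timeout, and the clock resets.
connected = timer.is_timeout(1, timedelta(minutes=5), start + timedelta(minutes=2))
# The user has left, but only three idle minutes have passed.
too_early = timer.is_timeout(0, timedelta(minutes=5), start + timedelta(minutes=5))
# Six idle minutes have passed: the timer reports a timeout.
expired = timer.is_timeout(0, timedelta(minutes=5), start + timedelta(minutes=8))
```

Passing `now` explicitly keeps the sketch deterministic. A real loop would use the current time and poll every few seconds.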
157
+
158
+
To stop the job when there's no GPU utilization, consider using SageMaker Profiler and the [LowGPUUtilization rule](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html#low-gpu-utilization), in combination with the `sagemaker_ssh_helper.is_profiler_issues_found()` method.
159
+
160
+
Your `train_placeholder.py` script may then look like the example below. The script exits, and the job completes, when the container has been left unattended and the profiler has found issues such as low GPU utilization:
161
+
162
+
```python
163
+
import time
164
+
import os
165
+
from datetime import timedelta
166
+
167
+
from sagemaker_ssh_helper import setup_and_start_ssh, is_last_session_timeout, is_profiler_issues_found
168
+
169
+
setup_and_start_ssh()
170
+
171
+
os.environ["AWS_DEFAULT_REGION"] = os.environ.get("AWS_REGION", "") # for boto3
172
+
173
+
while True:
174
+
    time.sleep(10)
175
+
    if is_last_session_timeout(timedelta(minutes=5)) and is_profiler_issues_found():
176
+
        break
177
+
```
178
+
179
+
To execute this script with SageMaker Profiler, pass extra configuration parameters to the job:
180
+
181
+
```python
182
+
import os
183
+
from datetime import timedelta
184
+
185
+
from sagemaker.pytorch import PyTorch
186
+
from sagemaker.debugger import ProfilerRule, rule_configs, ProfilerConfig
187
+
188
+
from sagemaker_ssh_helper.wrapper import SSHEstimatorWrapper
189
+
190
+
191
+
profiler_config = ProfilerConfig(
192
+
    system_monitor_interval_millis=100,  # grab metrics 10 times per second
status = ssh_wrapper.wait_training_job_with_status()
223
+
```
224
+
225
+
*Tip:* To avoid unnecessary disruption of users' work, consider sending them notification emails instead of stopping their jobs automatically. See [the related notification question](FAQ.md#i-want-to-send-users-the-sms-or-email-notification-when-the-placeholder-training-job-has-issues-with-low-gpu-utilization-how-to-do-that).
226
+
227
+
228
+
### I want to send users the SMS or email notification when the placeholder training job has issues with low GPU utilization. How to do that?
229
+
230
+
First, you need to detect the low GPU utilization [with SageMaker Profiler](FAQ.md#when-debugging-a-training-job-with-sagemaker-ssh-helper-and-training-placeholder-i-want-to-automatically-stop-the-job-when-there-are-no-users-connected-and-theres-no-gpu-utilization-how-to-do-that).
231
+
232
+
When the SageMaker Profiler job, which is a SageMaker Processing job running the SageMaker Debugger container, finds issues with a Training job, it stops.
233
+
234
+
So the trick is to listen for the [Amazon EventBridge events](https://docs.aws.amazon.com/sagemaker/latest/dg/automating-sagemaker-with-eventbridge.html) emitted when a SageMaker Processing job changes its status. Then check whether the event is for a completed SageMaker Debugger job and, if so, check the profiler issues of the related training job. If issues are found, send a notification to an [Amazon SNS](https://aws.amazon.com/sns/) topic.
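The decision logic of such a Lambda function might look like the sketch below. It is not the code from the CDK sample. The function names are hypothetical, and the sample event and training job description are made up for illustration. It assumes the event carries the `SageMaker Processing Job State Change` detail type with a `ProcessingJobStatus` field, and that the training job description was already fetched, for example with boto3 `describe_training_job`.

```python
from typing import List, Optional


def is_completed_processing_job(event: dict) -> bool:
    """True for an EventBridge event about a SageMaker Processing job that completed."""
    detail = event.get("detail", {})
    return (
        event.get("source") == "aws.sagemaker"
        and event.get("detail-type") == "SageMaker Processing Job State Change"
        and detail.get("ProcessingJobStatus") == "Completed"
    )


def rules_with_issues(training_job: dict) -> List[str]:
    """Names of profiler rules that reported issues in a training job description."""
    return [
        status.get("RuleConfigurationName", "")
        for status in training_job.get("ProfilerRuleEvaluationStatuses", [])
        if status.get("RuleEvaluationStatus") == "IssuesFound"
    ]


def build_notification(event: dict, training_job: dict) -> Optional[str]:
    """Return the text to send to the SNS topic, or None if nothing should be sent."""
    if not is_completed_processing_job(event):
        return None
    rules = rules_with_issues(training_job)
    if not rules:
        return None
    name = training_job.get("TrainingJobName")
    return f"Training job {name} has profiler issues: {', '.join(rules)}"


# Made-up inputs for illustration only.
event = {
    "source": "aws.sagemaker",
    "detail-type": "SageMaker Processing Job State Change",
    "detail": {"ProcessingJobStatus": "Completed"},
}
training_job = {
    "TrainingJobName": "placeholder-job",
    "ProfilerRuleEvaluationStatuses": [
        {"RuleConfigurationName": "LowGPUUtilization", "RuleEvaluationStatus": "IssuesFound"},
    ],
}
message = build_notification(event, training_job)
```

In a deployed function, a non-empty `message` would then be passed to the SNS `publish` call with the ARN of your topic. Keeping the decision logic free of AWS calls makes it easy to test locally.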
235
+
236
+
See the [CDK code to deploy the Lambda and the EventBridge rule](https://github.com/aws-samples/sagemaker-ssh-helper/tree/main/sagemaker_ssh_helper/cdk/low_gpu) for more details.
237
+
238
+
To deploy the stack into your environment with CDK, run these commands: