Commit 6752eb1

Merge branch 'main' into geniac
2 parents 0d327b3 + d5b7dcd commit 6752eb1

69 files changed, +6212 -394 lines changed

0.docs/fsx-lustre-template.png

-47.1 KB

1.architectures/3.aws-batch/0.aws-batch-distributed-training-p5.yaml

+699
Large diffs are not rendered by default.

1.architectures/3.aws-batch/README.md

+5 -5
@@ -42,11 +42,11 @@ The templates takes parameters that are mandatory and optional, see below for mo
 If you'd like to deploy through the AWS CLI instead of the quick create link above, the command to deploy the template is shown below. Please edit the parameters values with your own configuration.

 ```bash
-aws cloudformation create-stack --stack-name batch-distributed-training \
-    --template-body file://0.aws-batch-distributed-training.yaml \
-    --parameters ParameterKey=VPCStackParameter,ParameterValue="vpc-stack-ml" \
-    ParameterKey=CapacityReservationId,ParameterValue="cr-123567890abc" \
-    --capabilities CAPABILITY_IAM
+aws cloudformation create-stack --stack-name aws-batch-p5 \
+    --template-body file://0.aws-batch-distributed-training-p5.yaml \
+    --parameters ParameterKey=VPCStackParameter,ParameterValue="aws-batch-vpc" \
+    ParameterKey=CapacityReservationId,ParameterValue="cr-1234567890" \
+    --capabilities CAPABILITY_NAMED_IAM
 ```

 ## Gotchas
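
For scripted deployments, the updated command in this hunk could be assembled programmatically. The following is a minimal sketch, not part of the commit: `build_create_stack_command` is a hypothetical helper that only builds the argument list and does not call AWS. It defaults to `CAPABILITY_NAMED_IAM`, which CloudFormation requires when a template creates IAM resources with custom names; that is presumably why the README switched away from `CAPABILITY_IAM`.

```python
import shlex


def build_create_stack_command(stack_name, template_path, parameters,
                               capabilities=("CAPABILITY_NAMED_IAM",)):
    """Assemble the `aws cloudformation create-stack` argument list.

    Hypothetical helper: it only builds the command, it never runs it.
    """
    cmd = [
        "aws", "cloudformation", "create-stack",
        "--stack-name", stack_name,
        "--template-body", f"file://{template_path}",
        "--parameters",
    ]
    # One ParameterKey=...,ParameterValue=... token per stack parameter.
    for key, value in parameters.items():
        cmd.append(f"ParameterKey={key},ParameterValue={value}")
    cmd += ["--capabilities", *capabilities]
    return cmd


# Values copied from the updated README lines in this hunk.
command = build_create_stack_command(
    "aws-batch-p5",
    "0.aws-batch-distributed-training-p5.yaml",
    {"VPCStackParameter": "aws-batch-vpc",
     "CapacityReservationId": "cr-1234567890"},
)
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` once the parameter values match your own VPC stack and capacity reservation.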

1.architectures/4.amazon-eks/README.md

+1 -1
@@ -171,7 +171,7 @@ kubectl get nodes
 5. Apply [K8 Nvidia CNI Plugin](https://github.com/NVIDIA/k8s-device-plugin):

 ```bash
-kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.15.0/deployments/static/nvidia-device-plugin.yml
+kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.0/deployments/static/nvidia-device-plugin.yml
 ```

 6. If using EFA, make sure to install the [EFA CNI Plugin](https://docs.aws.amazon.com/eks/latest/userguide/node-efa.html).

1.architectures/4.amazon-eks/amazon-eks-nodegroup.yaml

+13 -8
@@ -1,6 +1,6 @@
 AWSTemplateFormatVersion: "2010-09-09"

-Description: Amazon EKS - Create an unmanaged P4d/P5 node group for Capacity Blocks for ML.
+Description: Amazon EKS - Create an unmanaged P4d/P5/P5e node group for Capacity Blocks for ML.

 Metadata:
   "AWS::CloudFormation::Interface":
@@ -75,7 +75,7 @@ Parameters:

   NodeImageIdSSMParam:
     Type: "AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>"
-    Default: /aws/service/eks/optimized-ami/1.29/amazon-linux-2-gpu/recommended/image_id
+    Default: /aws/service/eks/optimized-ami/1.31/amazon-linux-2-gpu/recommended/image_id
     Description: AWS Systems Manager Parameter Store parameter of the AMI ID for the worker node instances. Change this value to match the version of Kubernetes you are using.

   DisableIMDSv1:
@@ -89,6 +89,7 @@ Parameters:
     Type: String
     Default: p5.48xlarge
     AllowedValues:
+      - p5e.48xlarge
       - p5.48xlarge
       - p4d.24xlarge
     Description: EC2 instance type for the node instances
@@ -134,12 +135,16 @@ Conditions:
     - "Fn::Equals":
         - !Ref NodeImageId
         - ""
-  isP5: !Equals
-    - !Ref NodeInstanceType
-    - "p5.48xlarge"
   isP4d: !Equals
     - !Ref NodeInstanceType
     - "p4d.24xlarge"
+  isP5Family: !Or
+    - !Equals
+      - !Ref NodeInstanceType
+      - "p5.48xlarge"
+    - !Equals
+      - !Ref NodeInstanceType
+      - "p5e.48xlarge"

   IMDSv1Disabled:
     "Fn::Equals":
@@ -280,7 +285,7 @@ Resources:

   NodeLaunchTemplateP5:
     Type: "AWS::EC2::LaunchTemplate"
-    Condition: isP5
+    Condition: isP5Family
     Properties:
       LaunchTemplateData:
         InstanceMarketOptions:
@@ -659,11 +664,11 @@ Resources:
       DesiredCapacity: !Ref NodeAutoScalingGroupDesiredCapacity
       LaunchTemplate:
         LaunchTemplateId: !If
-          - isP5
+          - isP5Family
           - !Ref NodeLaunchTemplateP5
           - !Ref NodeLaunchTemplateP4
         Version: !If
-          - isP5
+          - isP5Family
           - !GetAtt NodeLaunchTemplateP5.LatestVersionNumber
           - !GetAtt NodeLaunchTemplateP4.LatestVersionNumber
       MaxSize: !Ref NodeAutoScalingGroupMaxSize
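
The effect of replacing `isP5` with `isP5Family` can be sketched outside CloudFormation. The Python below is an illustrative model only: the function names are hypothetical, and it mirrors the `!Or` of two `!Equals` checks plus the `!If` that picks a launch template, using the instance types listed under `AllowedValues`.

```python
# Instance types accepted by the NodeInstanceType parameter after this change.
ALLOWED_INSTANCE_TYPES = ("p5e.48xlarge", "p5.48xlarge", "p4d.24xlarge")

# Mirrors isP5Family: !Or of two !Equals comparisons.
P5_FAMILY = frozenset({"p5.48xlarge", "p5e.48xlarge"})


def is_p5_family(instance_type: str) -> bool:
    """Model of the isP5Family condition."""
    return instance_type in P5_FAMILY


def select_launch_template(instance_type: str) -> str:
    """Model of the !If that chooses between the two launch templates."""
    if instance_type not in ALLOWED_INSTANCE_TYPES:
        raise ValueError(f"unsupported instance type: {instance_type}")
    if is_p5_family(instance_type):
        return "NodeLaunchTemplateP5"
    return "NodeLaunchTemplateP4"


for instance_type in ALLOWED_INSTANCE_TYPES:
    print(instance_type, "->", select_launch_template(instance_type))
```

Under the old `isP5` condition, `p5e.48xlarge` would have fallen through to the P4 launch template; grouping both P5 variants under one condition avoids that.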

1.architectures/5.sagemaker-hyperpod/3.FSxLustre.yaml

+31 -29
@@ -1,7 +1,6 @@
 AWSTemplateFormatVersion: '2010-09-09'
-Description: Creates an FSxL filesystem of PERSISENT_2 type plus the Security Group needed for use with SageMaker
+Description: This template deploys an FSx for Luster File System

-### Stack metadata
 Metadata:
   AWS::CloudFormation::Interface:
     ParameterGroups:
@@ -13,15 +12,18 @@ Metadata:
           - Compression
           - LustreVersion

+      - Label:
+          default: Networking Options
+        Parameters:
+          - SecurityGroup
+          - Subnet
+
 Parameters:
-  NetworkStack:
-    Description: Name of the Networking stack
-    Type: String
-    Default: SageMakerVPC
   Capacity:
     Description: Storage capacity in GiB (1200 or increments of 2400)
     Type: Number
     Default: 1200
+
   PerUnitStorageThroughput:
     Description: Provisioned Read/Write (MB/s/TiB)
     Type: Number
@@ -31,13 +33,15 @@ Parameters:
       - 250
       - 500
       - 1000
+
   Compression:
     Description: Data compression type
     Type: String
     AllowedValues:
       - "LZ4"
       - "NONE"
     Default: "LZ4"
+
   LustreVersion:
     Description: Lustre software version
     Type: String
@@ -46,27 +50,18 @@ Parameters:
       - "2.12"
     Default: "2.15"

-Resources:
+  SecurityGroup:
+    Description: Security group ID
+    Type: String
+    Default: ""
+
+  Subnet:
+    Description: Subnet ID
+    Type: String
+    Default: ""

-  LambdaExecutionRole:
-    Type: "AWS::IAM::Role"
-    Properties:
-      AssumeRolePolicyDocument:
-        Version: 2012-10-17
-        Statement:
-          - Effect: Allow
-            Principal:
-              Service:
-                - lambda.amazonaws.com
-            Action:
-              - "sts:AssumeRole"
-      Path: /
-      ManagedPolicyArns:
-        - 'arn:aws:iam::aws:policy/AmazonEC2ReadOnlyAccess'
-        - 'arn:aws:iam::aws:policy/AmazonS3FullAccess'
-        - 'arn:aws:iam::aws:policy/IAMFullAccess'
-        - 'arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole'

+Resources:
   FSxLFilesystem:
     Type: AWS::FSx::FileSystem
     DeletionPolicy: Delete
@@ -77,22 +72,29 @@ Resources:
       FileSystemTypeVersion: !Ref LustreVersion
       StorageCapacity: !Ref Capacity
       SecurityGroupIds:
-        - Fn::ImportValue:
-            !Sub "${NetworkStack}-SecurityGroup"
+        - !Ref SecurityGroup
       SubnetIds:
-        - Fn::ImportValue:
-            !Sub "${NetworkStack}-PrivateSubnet"
+        - !Ref Subnet
       LustreConfiguration:
         DataCompressionType: !Ref Compression
         DeploymentType: PERSISTENT_2
         PerUnitStorageThroughput: !Ref PerUnitStorageThroughput
+        MetadataConfiguration:
+          Mode: AUTOMATIC

 Outputs:
   FSxLustreFilesystemMountname:
     Description: The ID of the FSxL filesystem that has been created
     Value: !GetAtt FSxLFilesystem.LustreMountName
     Export:
       Name: !Sub ${AWS::StackName}-FSxLustreFilesystemMountname
+
+  FSxLustreFilesystemDNSname:
+    Description: The DNS of the FSxL filesystem that has been created
+    Value: !GetAtt FSxLFilesystem.DNSName
+    Export:
+      Name: !Sub ${AWS::StackName}-FSxLustreFilesystemDNSname
+
   FSxLustreFilesystemId:
     Description: The ID of the FSxL filesystem that has been created
     Value: !Ref FSxLFilesystem
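
With `NetworkStack` removed, callers now pass the security group and subnet IDs directly. Both new parameters default to an empty string, so leaving them unset would presumably fail at stack creation. The sketch below shows one way to check inputs before deploying. It is an assumption-laden illustration, not code from the commit: the helper names are hypothetical, the IDs are placeholders, and the capacity rule is taken from the `Capacity` parameter description (1200 or increments of 2400).

```python
def is_valid_capacity(capacity_gib: int) -> bool:
    """True for 1200 GiB or any positive multiple of 2400 GiB."""
    return capacity_gib == 1200 or (capacity_gib > 0 and capacity_gib % 2400 == 0)


def build_fsx_parameters(security_group: str, subnet: str,
                         capacity_gib: int = 1200) -> list:
    """Build a CloudFormation-style parameter list for the FSx template.

    Hypothetical helper; it validates inputs and returns plain dicts.
    """
    if not security_group or not subnet:
        raise ValueError("SecurityGroup and Subnet must both be set")
    if not is_valid_capacity(capacity_gib):
        raise ValueError("Capacity must be 1200 or a multiple of 2400 GiB")
    values = {
        "SecurityGroup": security_group,
        "Subnet": subnet,
        "Capacity": str(capacity_gib),
    }
    return [{"ParameterKey": k, "ParameterValue": v} for k, v in values.items()]


# Placeholder IDs for illustration only.
params = build_fsx_parameters("sg-0123456789abcdef0", "subnet-0123456789abcdef0")
for param in params:
    print(param)
```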

1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py

+11 -3
@@ -2,7 +2,7 @@
 # Basic configuration parameters
 class Config:

-    # Set true if you want to install Docker/Enroot/Pyxis.
+    # Default is true to install Docker/Enroot/Pyxis.
     enable_docker_enroot_pyxis = True

     # Set true if you want to install metric exporter software and Prometheus for observability
@@ -24,8 +24,16 @@ class Config:
     # You need to configure parameters in SssdConfig as well.
     enable_sssd = False

-    # Set true to install quality-of-live improvements
-    enable_initsmhp = False
+    # Set true if you want to use mountpoint for s3 on cluster nodes.
+    # If enabled, a systemctl mount-s3.service file will be writen that will mount at /mnt/<BucketName>.
+    # requires s3 permissions to be added to cluster execution role.
+    enable_mount_s3 = False
+
+    s3_bucket = "" # required when enable_mount_s3 = True, replace with your actual data bucket name in quotes, ie. "my-dataset-bucket"
+
+    if enable_mount_s3 and not s3_bucket:
+        raise ValueError("Error: A bucket name must be specified when enable_mount_s3 is True")
+

 # Configuration parameters for ActiveDirectory/LDAP/SSSD
 class SssdConfig:
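
The new `enable_mount_s3` / `s3_bucket` pair can be exercised in isolation. The sketch below restates the added check as a standalone function; `resolve_s3_mount_point` is a hypothetical name, and the `/mnt/<BucketName>` path comes from the comment added in this hunk. If the check sits in the `Config` class body, as the hunk position suggests, it would run as soon as `config.py` is imported.

```python
from typing import Optional


def resolve_s3_mount_point(enable_mount_s3: bool, s3_bucket: str) -> Optional[str]:
    """Return the expected mount point, or None when mounting is disabled.

    Raises ValueError in the same situation as the check added to Config.
    """
    if not enable_mount_s3:
        return None
    if not s3_bucket:
        raise ValueError(
            "Error: A bucket name must be specified when enable_mount_s3 is True"
        )
    return f"/mnt/{s3_bucket}"


print(resolve_s3_mount_point(False, ""))
print(resolve_s3_mount_point(True, "my-dataset-bucket"))
```

With the defaults shipped in this commit (`enable_mount_s3 = False`, `s3_bucket = ""`), nothing is mounted and no error is raised.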

1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/initsmhp.sh

-32
This file was deleted.

1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/initsmhp/fix-profile.sh

-19
This file was deleted.

1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/initsmhp/gen-keypair-ubuntu.sh

-21
This file was deleted.

1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/initsmhp/howto-miniconda.sh

-30
This file was deleted.

1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/initsmhp/install-git-remote-codecommit.sh

-6
This file was deleted.

1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/initsmhp/install-mount-s3.sh

-9
This file was deleted.

1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/initsmhp/install-pkgs.sh

-21
This file was deleted.
