Deploying the Llama-2-Chat Model on Ray Serve
With both node pools provisioned, we can now proceed to deploy the Llama2 chatbot infrastructure.
Let's begin by deploying the ray-service-llama2.yaml file:
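Assuming the manifest is in the current working directory, the apply step looks like this:
kubectl apply -f ray-service-llama2.yaml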
namespace/llama2 created
rayservice.ray.io/llama2 created
Creating the Ray Service Pods for Inference
The ray-service-llama2.yaml file defines the Kubernetes configuration for deploying the Ray Serve service for the Llama2 chatbot:
apiVersion: v1
kind: Namespace
metadata:
  name: llama2
---
# target_num_ongoing_requests_per_replica will be deprecated soon
# and will be updated when Data on EKS updates
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: llama2
  namespace: llama2
spec:
  serviceUnhealthySecondThreshold: 900
  deploymentUnhealthySecondThreshold: 300
  serveConfigV2: |
    applications:
      - name: llama2
        import_path: "ray_serve_llama2:entrypoint"
        runtime_env:
          env_vars:
            MODEL_ID: "NousResearch/Llama-2-13b-chat-hf"
            NEURON_CC_FLAGS: "-O1"
            LD_LIBRARY_PATH: "/home/ray/anaconda3/lib:$LD_LIBRARY_PATH"
            NEURON_CORES: "24"
        deployments:
          - name: Llama-2-13b-chat-hf
            autoscaling_config:
              metrics_interval_s: 0.2
              min_replicas: 1
              max_replicas: 1
              look_back_period_s: 2
              downscale_delay_s: 30
              upscale_delay_s: 2
              target_num_ongoing_requests_per_replica: 1
            graceful_shutdown_timeout_s: 5
            ray_actor_options:
              num_cpus: 180
              resources: {"neuron_cores": 24}
              runtime_env:
                env_vars:
                  LD_LIBRARY_PATH: "/home/ray/anaconda3/lib:$LD_LIBRARY_PATH"
  rayClusterConfig:
    rayVersion: 2.22.0
    headGroupSpec:
      headService:
        metadata:
          name: llama2
          namespace: llama2
      rayStartParams:
        dashboard-host: "0.0.0.0"
      template:
        spec:
          containers:
            - name: head
              image: public.ecr.aws/data-on-eks/ray2.22.0-py310-llama2-13b-neuron:latest # Image created using the Dockerfile attached in the folder
              imagePullPolicy: Always # Ensure the image is always pulled when updated
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh", "-c", "ray stop"]
              ports:
                - containerPort: 6379
                  name: gcs
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
              volumeMounts:
                - mountPath: /tmp/ray
                  name: ray-logs
              resources:
                limits:
                  cpu: 1
                  memory: 2Gi
                requests:
                  cpu: 1
                  memory: 2Gi
              env:
                - name: LD_LIBRARY_PATH
                  value: "/home/ray/anaconda3/lib:$LD_LIBRARY_PATH"
          nodeSelector:
            instanceType: mixed-x86
            provisionerType: Karpenter
            workload: rayhead
          volumes:
            - name: ray-logs
              emptyDir: {}
    workerGroupSpecs:
      - groupName: inf2
        replicas: 1
        minReplicas: 1
        maxReplicas: 1
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: worker
                image: public.ecr.aws/data-on-eks/ray2.22.0-py310-llama2-13b-neuron:latest
                imagePullPolicy: Always # Ensure the image is always pulled when updated
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh", "-c", "ray stop"]
                resources:
                  limits:
                    cpu: "180"
                    memory: "700G"
                    aws.amazon.com/neuron: "12"
                  requests:
                    cpu: "180"
                    memory: "700G"
                    aws.amazon.com/neuron: "12"
                env:
                  - name: LD_LIBRARY_PATH
                    value: "/home/ray/anaconda3/lib:$LD_LIBRARY_PATH"
            nodeSelector:
              instanceType: inferentia-inf2
              provisionerType: Karpenter
            tolerations:
              - key: "aws.amazon.com/neuron"
                operator: "Exists"
                effect: "NoSchedule"
              - key: "hub.jupyter.org/dedicated"
                operator: "Equal"
                value: "user"
                effect: "NoSchedule"
This configuration accomplishes the following:
- Creates a Kubernetes namespace named llama2 for resource isolation
- Deploys a RayService named llama2 that uses a Python script (ray_serve_llama2.py, referenced by the import_path) to create the Ray Serve application
- Provisions a head pod and worker pods that pull their container image from Amazon Elastic Container Registry (ECR)
After applying the configurations, we'll monitor the progress of the head and worker pods:
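A plain kubectl get against the namespace covers this (exact flags may differ):
kubectl get pods -n llama2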
NAME READY STATUS RESTARTS AGE
pod/llama2-raycluster-fcmtr-head-bf58d 1/1 Running 0 67m
pod/llama2-raycluster-fcmtr-worker-inf2-lgnb2 1/1 Running 0 5m30s
It may take up to 15 minutes for both pods to be ready.
We can wait for the pods to become ready with kubectl wait:
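The timeout below is illustrative; adjust it to how long your nodes take to provision and pull the image:
kubectl -n llama2 wait pod --all --for=condition=Ready --timeout=900s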
pod/llama2-raycluster-fcmtr-head-bf58d condition met
pod/llama2-raycluster-fcmtr-worker-inf2-lgnb2 condition met
Once the pods are fully deployed, we'll verify that everything is in place:
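One way is to list the pods, services, and Ray custom resources in the namespace; the exact set of resources queried here is an assumption:
kubectl get pod,svc,raycluster,rayservice -n llama2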
NAME READY STATUS RESTARTS AGE
pod/llama2-raycluster-fcmtr-head-bf58d 1/1 Running 0 67m
pod/llama2-raycluster-fcmtr-worker-inf2-lgnb2 1/1 Running 0 5m30s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/llama2 ClusterIP 172.20.118.243 <none> 10001/TCP,8000/TCP,8080/TCP,6379/TCP,8265/TCP 67m
service/llama2-head-svc ClusterIP 172.20.168.94 <none> 8080/TCP,6379/TCP,8265/TCP,10001/TCP,8000/TCP 57m
service/llama2-serve-svc ClusterIP 172.20.61.167 <none> 8000/TCP 57m
NAME DESIRED WORKERS AVAILABLE WORKERS CPUS MEMORY GPUS STATUS AGE
raycluster.ray.io/llama2-raycluster-fcmtr 1 1 184 704565270Ki 0 ready 67m
NAME SERVICE STATUS NUM SERVE ENDPOINTS
rayservice.ray.io/llama2 Running 2
Configuring the RayService may take up to 10 minutes.
We can wait for it to report a Running service status with kubectl wait:
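The jsonpath condition and timeout below are assumptions about where the status is surfaced (kubectl wait supports jsonpath conditions in recent versions):
kubectl wait rayservice/llama2 -n llama2 --for=jsonpath='{.status.serviceStatus}'=Running --timeout=600s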
rayservice.ray.io/llama2 condition met
With everything properly deployed, we can now proceed to create the web interface for the chatbot.