Deploying the Llama-2-Chat Model on Ray Serve

With both node pools provisioned, we can now proceed to deploy the Llama2 chatbot infrastructure.

Let's begin by deploying the ray-service-llama2.yaml file:

~$kubectl apply -k ~/environment/eks-workshop/modules/aiml/chatbot/ray-service-llama2-chatbot
namespace/llama2 created
rayservice.ray.io/llama2 created
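
Before walking through the manifest, you can optionally confirm that the RayService resource was accepted by the cluster. This is just a quick sanity check; the status columns take several minutes to populate while the underlying RayCluster starts up:

~$kubectl get rayservice -n llama2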

Creating the Ray Service Pods for Inference

The ray-service-llama2.yaml file defines the Kubernetes configuration for deploying the Ray Serve service for the Llama2 chatbot:

~/environment/eks-workshop/modules/aiml/chatbot/ray-service-llama2-chatbot/ray-service-llama2.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: llama2

---
# target_num_ongoing_requests_per_replica will be deprecated soon
# and will be updated when Data on EKS updates
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: llama2
  namespace: llama2
spec:
  serviceUnhealthySecondThreshold: 900
  deploymentUnhealthySecondThreshold: 300
  serveConfigV2: |
    applications:
      - name: llama2
        import_path: "ray_serve_llama2:entrypoint"
        runtime_env:
          env_vars:
            MODEL_ID: "NousResearch/Llama-2-13b-chat-hf"
            NEURON_CC_FLAGS: "-O1"
            LD_LIBRARY_PATH: "/home/ray/anaconda3/lib:$LD_LIBRARY_PATH"
            NEURON_CORES: "24"
        deployments:
          - name: Llama-2-13b-chat-hf
            autoscaling_config:
              metrics_interval_s: 0.2
              min_replicas: 1
              max_replicas: 1
              look_back_period_s: 2
              downscale_delay_s: 30
              upscale_delay_s: 2
              target_num_ongoing_requests_per_replica: 1
            graceful_shutdown_timeout_s: 5
            ray_actor_options:
              num_cpus: 180
              resources: {"neuron_cores": 24}
              runtime_env:
                env_vars:
                  LD_LIBRARY_PATH: "/home/ray/anaconda3/lib:$LD_LIBRARY_PATH"
  rayClusterConfig:
    rayVersion: 2.22.0
    headGroupSpec:
      headService:
        metadata:
          name: llama2
          namespace: llama2
      rayStartParams:
        dashboard-host: "0.0.0.0"
      template:
        spec:
          containers:
            - name: head
              image: public.ecr.aws/data-on-eks/ray2.22.0-py310-llama2-13b-neuron:latest # Image created using the Dockerfile attached in the folder
              imagePullPolicy: Always # Ensure the image is always pulled when updated
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh", "-c", "ray stop"]
              ports:
                - containerPort: 6379
                  name: gcs
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
              volumeMounts:
                - mountPath: /tmp/ray
                  name: ray-logs
              resources:
                limits:
                  cpu: 1
                  memory: 2Gi
                requests:
                  cpu: 1
                  memory: 2Gi
              env:
                - name: LD_LIBRARY_PATH
                  value: "/home/ray/anaconda3/lib:$LD_LIBRARY_PATH"
          nodeSelector:
            instanceType: mixed-x86
            provisionerType: Karpenter
            workload: rayhead
          volumes:
            - name: ray-logs
              emptyDir: {}
    workerGroupSpecs:
      - groupName: inf2
        replicas: 1
        minReplicas: 1
        maxReplicas: 1
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: worker
                image: public.ecr.aws/data-on-eks/ray2.22.0-py310-llama2-13b-neuron:latest
                imagePullPolicy: Always # Ensure the image is always pulled when updated
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh", "-c", "ray stop"]
                resources:
                  limits:
                    cpu: "180"
                    memory: "700G"
                    aws.amazon.com/neuron: "12"
                  requests:
                    cpu: "180"
                    memory: "700G"
                    aws.amazon.com/neuron: "12"
                env:
                  - name: LD_LIBRARY_PATH
                    value: "/home/ray/anaconda3/lib:$LD_LIBRARY_PATH"
            nodeSelector:
              instanceType: inferentia-inf2
              provisionerType: Karpenter
            tolerations:
              - key: "aws.amazon.com/neuron"
                operator: "Exists"
                effect: "NoSchedule"
              - key: "hub.jupyter.org/dedicated"
                operator: "Equal"
                value: "user"
                effect: "NoSchedule"

This configuration accomplishes the following:

  1. Creates a Kubernetes namespace named llama2 for resource isolation
  2. Deploys a RayService named llama2 that uses the ray_serve_llama2 Python module (the import_path above) to create the Ray Serve application
  3. Provisions a head pod on an x86 node and a worker pod on an Inferentia inf2 node, both pulling their container image from Amazon Elastic Container Registry (ECR); a quick way to confirm this placement is shown below
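
Because the worker pod selects nodes carrying the labels from the nodeSelector above and tolerates the Neuron taint, you can confirm the placement once the pods are up. This is an optional check; the label keys used here (instanceType, provisionerType) are taken directly from the manifest:

~$kubectl get pods -n llama2 -o wide
~$kubectl get nodes -l instanceType=inferentia-inf2,provisionerType=Karpenter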

After applying the configurations, we'll monitor the progress of the head and worker pods:

~$kubectl get pod -n llama2
NAME                                        READY   STATUS    RESTARTS   AGE
llama2-raycluster-fcmtr-head-bf58d          1/1     Running   0          67m
llama2-raycluster-fcmtr-worker-inf2-lgnb2   1/1     Running   0          5m30s
Caution: It may take up to 15 minutes for both pods to be ready.

We can wait for the pods to be ready using the following command:

~$kubectl wait pod \
--all \
--for=condition=Ready \
--namespace=llama2 \
--timeout=15m
pod/llama2-raycluster-fcmtr-head-bf58d condition met
pod/llama2-raycluster-fcmtr-worker-inf2-lgnb2 condition met

Once the pods are fully deployed, we'll verify that everything is in place:

~$kubectl get all -n llama2
NAME                                            READY   STATUS    RESTARTS   AGE
pod/llama2-raycluster-fcmtr-head-bf58d          1/1     Running   0          67m
pod/llama2-raycluster-fcmtr-worker-inf2-lgnb2   1/1     Running   0          5m30s
 
NAME                       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                         AGE
service/llama2             ClusterIP   172.20.118.243   <none>        10001/TCP,8000/TCP,8080/TCP,6379/TCP,8265/TCP   67m
service/llama2-head-svc    ClusterIP   172.20.168.94    <none>        8080/TCP,6379/TCP,8265/TCP,10001/TCP,8000/TCP   57m
service/llama2-serve-svc   ClusterIP   172.20.61.167    <none>        8000/TCP                                        57m
 
NAME                                        DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY        GPUS   STATUS   AGE
raycluster.ray.io/llama2-raycluster-fcmtr   1                 1                   184    704565270Ki   0      ready    67m
 
NAME                       SERVICE STATUS   NUM SERVE ENDPOINTS
rayservice.ray.io/llama2   Running          2
Caution: Configuring the RayService may take up to 10 minutes.
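
If the RayService takes noticeably longer than that, describing the resource and checking the head pod logs usually reveals what Ray Serve is still waiting on (substitute the head pod name from the earlier kubectl get pod output):

~$kubectl describe rayservice llama2 -n llama2
~$kubectl logs llama2-raycluster-fcmtr-head-bf58d -n llama2 --tail=50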

We can wait for the RayService to be running with this command:

~$kubectl wait --for=jsonpath='{.status.serviceStatus}'=Running rayservice/llama2 -n llama2 --timeout=10m
rayservice.ray.io/llama2 condition met
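
Optionally, before building the web interface, you can inspect the Ray dashboard, which the llama2 head service exposes on port 8265 (see the service listing above). Run the port-forward in a separate terminal and open http://localhost:8265 in a browser to view the Serve application and its replicas:

~$kubectl port-forward svc/llama2 8265:8265 -n llama2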

With everything properly deployed, we can now proceed to create the web interface for the chatbot.