Multiple Schedulers
Concept and Usage of Multiple Schedulers
Reference (opens in a new tab)
Multiple schedulers let you schedule pods on different nodes based on your requirements. By default, Kubernetes uses the default scheduler, which distributes pods evenly across nodes using its scheduling algorithm. In some cases, however, you may want to set up your own scheduling algorithm or custom conditions for placing pods on nodes.
Therefore, Kubernetes allows you to write and deploy your own scheduler, either as the default scheduler or as an additional one. You can then use your custom scheduler to place specific pods (applications) on specific nodes based on your requirements, while all other pods are still scheduled by the default scheduler.
The default scheduler's name is default-scheduler, and every scheduler name must be unique in the cluster. You can find the default scheduler configuration on the master node at /etc/kubernetes/manifests/kube-scheduler.yaml.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler # unique name
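For instance, a pod that does not set spec.schedulerName is picked up by default-scheduler automatically (the pod name and image below are just placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  # no schedulerName set, so default-scheduler handles this pod
  containers:
  - name: demo
    image: nginx
```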
Steps to set up and use Multiple Schedulers
Step 1: Create a new scheduler configuration file
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: my-new-scheduler
leaderElection:
  leaderElect: true
  resourceNamespace: kube-system
  resourceName: lock-object-my-scheduler
leaderElect - ensures that only one instance of the scheduler is active at a time. If multiple instances of the scheduler run on different master nodes as a high-availability setup, only one instance is elected as the leader and schedules pods.
resourceName - when you have multiple masters, you need to specify the name of the resource object used for leader election. This ensures that only one instance of the scheduler is active at a time and avoids conflicts between the scheduler instances.
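If you run only a single instance of the custom scheduler, leader election can simply be disabled; a minimal sketch:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: my-new-scheduler
leaderElection:
  leaderElect: false # no leader election needed for a single instance
```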
Step 2: Deploy Additional Scheduler
You may choose to deploy the scheduler as a Pod or Deployment.
Deploy as a Pod
apiVersion: v1
kind: Pod
metadata:
  name: my-new-scheduler
  namespace: kube-system
spec:
  containers:
  - name: my-new-scheduler
    image: k8s.gcr.io/kube-scheduler:v1.22.0
    command:
    - kube-scheduler
    # make sure these files exist on the host
    - --kubeconfig=/etc/kubernetes/scheduler.conf # this file has the authentication information to access the API server
    - --config=/etc/kubernetes/my-new-scheduler.yaml
    volumeMounts:
    - name: kubeconfig
      mountPath: /etc/kubernetes
      readOnly: true
  volumes:
  - name: kubeconfig
    hostPath:
      path: /etc/kubernetes
kubectl apply -f my-new-scheduler.yaml
Deploy as a Deployment
Package the Scheduler
git clone https://github.com/kubernetes/kubernetes.git
cd kubernetes
make
Create a new container image containing the kube-scheduler binary
FROM busybox
ADD ./_output/local/bin/linux/amd64/kube-scheduler /usr/local/bin/kube-scheduler
Build the Dockerfile and push the image
docker build -t gcr.io/my-gcp-project/my-kube-scheduler:1.0 . # the image name and registry used here are just examples
gcloud docker -- push gcr.io/my-gcp-project/my-kube-scheduler:1.0
Define a Kubernetes Deployment for the scheduler
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-scheduler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: my-scheduler-as-kube-scheduler
subjects:
- kind: ServiceAccount
  name: my-scheduler
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: system:kube-scheduler
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: my-scheduler-as-volume-scheduler
subjects:
- kind: ServiceAccount
  name: my-scheduler
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: system:volume-scheduler
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-scheduler-extension-apiserver-authentication-reader
  namespace: kube-system
roleRef:
  kind: Role
  name: extension-apiserver-authentication-reader
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: ServiceAccount
  name: my-scheduler
  namespace: kube-system
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-scheduler-config
  namespace: kube-system
data:
  my-scheduler-config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1beta2
    kind: KubeSchedulerConfiguration
    profiles:
      - schedulerName: my-scheduler
    leaderElection:
      leaderElect: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    component: scheduler
    tier: control-plane
  name: my-scheduler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      component: scheduler
      tier: control-plane
  replicas: 1
  template:
    metadata:
      labels:
        component: scheduler
        tier: control-plane
        version: second
    spec:
      serviceAccountName: my-scheduler
      containers:
      - command:
        - /usr/local/bin/kube-scheduler
        - --config=/etc/kubernetes/my-scheduler/my-scheduler-config.yaml
        image: gcr.io/my-gcp-project/my-kube-scheduler:1.0
        livenessProbe:
          httpGet:
            path: /healthz
            port: 10259
            scheme: HTTPS
          initialDelaySeconds: 15
        name: kube-second-scheduler
        readinessProbe:
          httpGet:
            path: /healthz
            port: 10259
            scheme: HTTPS
        resources:
          requests:
            cpu: '0.1'
        securityContext:
          privileged: false
        volumeMounts:
        - name: config-volume
          mountPath: /etc/kubernetes/my-scheduler
      hostNetwork: false
      hostPID: false
      volumes:
      - name: config-volume
        configMap:
          name: my-scheduler-config
kubectl apply -f my-scheduler.yaml
Step 3: Verify the new scheduler is running
kubectl get pods -n kube-system
# output
NAME READY STATUS RESTARTS AGE
....
my-scheduler-lnf4s-4744f 1/1 Running 0 2m
...
Step 4: Create a Pod with the new scheduler
apiVersion: v1
kind: Pod
metadata:
  name: sample-pod
spec:
  containers:
  - name: sample-pod
    image: ubuntu
  schedulerName: my-new-scheduler # the name of the scheduler
When you create a pod with the schedulerName field set, the pod is scheduled by the specified scheduler. You can see the pod assignment events by running the following command:
# method 1 ---> view the events of the pod
kubectl get events -o wide
# output
LAST SEEN TYPE REASON OBJECT SUBOBJECT SOURCE MESSAGE FIRST SEEN COUNT NAME
10s Normal Scheduled pod/ubuntu custom-scheduler, custom-scheduler-kind-cluster-control-plane Successfully assigned default/ubuntu to kind-cluster-control-plane 10s 1 ubuntu.18159790928c8e97
10s Normal Pulling pod/ubuntu spec.containers{ubuntu} kubelet, kind-cluster-control-plane Pulling image "ubuntu" 10s 1 ubuntu.18159790b760375c
# method 2 ---> view the logs of the scheduler
kubectl logs <your-new-scheduler-pod-name> -n kube-system
kubectl logs my-new-scheduler -n kube-system
# output
I1229 05:00:45.515663 1 serving.go:380] Generated self-signed cert in-memory
I1229 05:00:47.196685 1 server.go:154] "Starting Kubernetes Scheduler" version="v1.30.0"
I1229 05:00:47.196790 1 server.go:156] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
I1229 05:00:47.208976 1 secure_serving.go:213] Serving securely on 127.0.0.1:10259
I1229 05:00:47.209676 1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I1229 05:00:47.212183 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I1229 05:00:47.212327 1 shared_informer.go:313] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I1229 05:00:47.212336 1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I1229 05:00:47.212350 1 shared_informer.go:313] Waiting for caches to sync for RequestHeaderAuthRequestController
I1229 05:00:47.212578 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I1229 05:00:47.212627 1 shared_informer.go:313] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I1229 05:00:47.311593 1 leaderelection.go:250] attempting to acquire leader lease kube-system/my-new-scheduler...
I1229 05:00:47.312578 1 shared_informer.go:320] Caches are synced for RequestHeaderAuthRequestController
I1229 05:00:47.312641 1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I1229 05:00:47.312773 1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I1229 05:01:05.150116 1 leaderelection.go:260] successfully acquired lease kube-system/my-new-scheduler
Scheduler Priority and Plugins
We know that pods in the scheduling queue are sorted based on the priority defined on them; you can read more on the kube-scheduler page.
To set a priority for a pod, you first need to create a priority class and then reference it from the pod. Reference (opens in a new tab)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000 # the higher the value, the higher the priority
globalDefault: false
description: "This priority class should be used for XYZ service pods only."
apiVersion: v1
kind: Pod
metadata:
  name: sample-pod
spec:
  priorityClassName: high-priority
  containers:
  - name: sample
    image: ubuntu
priorityClassName - determines the scheduling priority of the Pod. Pods with higher priority are scheduled before Pods with lower priority.
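A PriorityClass also supports a preemptionPolicy field. A sketch of a non-preempting class (the class name here is illustrative):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting
value: 1000000
preemptionPolicy: Never # do not evict lower-priority pods; just sit ahead of them in the queue
globalDefault: false
description: "High priority without preemption."
```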
Steps of how a pod is scheduled:
- Pods with higher priority are scheduled first: the scheduling queue is sorted by priority value, so they sit at the beginning of the queue.
- The pod then enters the filtering phase, where the scheduler checks the nodes against the pod's node selector and affinity rules, and verifies that the nodes have enough resources (CPU, memory) to run the pod.
- Next comes the scoring phase, where the scheduler scores the remaining nodes based on their resources. The node with the highest score is selected for the pod.
- Finally, the pod is bound to the node with the highest score.
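These four steps map onto extension points in the scheduler configuration; a sketch using the default plugin names for each point:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    queueSort:         # step 1: sorting the queue
      enabled:
      - name: PrioritySort
    filter:            # step 2: filtering out unsuitable nodes
      enabled:
      - name: NodeResourcesFit
    score:             # step 3: scoring the remaining nodes
      enabled:
      - name: NodeResourcesFit
    bind:              # step 4: binding the pod to the chosen node
      enabled:
      - name: DefaultBinder
```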
Scheduler Plugins
Reference (opens in a new tab)
Each of the steps mentioned above has its own plugins. For example:

Scheduling Queue plugins
- PrioritySort - sorts the pods based on their priority.

Filtering
- NodeResourcesFit - identifies the nodes that have enough resources to run the pod.
- NodeName - checks whether the pod has a specific node name mentioned in its spec.
- NodeUnschedulable - checks whether the node is marked unschedulable. You can check a node's unschedulable status with this command:
kubectl describe node <node-name>

Scoring
- NodeResourcesFit - scores the nodes based on their resources. Remember, a single plugin can be used in multiple phases.
- ImageLocality - scores the nodes based on the container image the pod runs; it prefers nodes that already have the image cached. What if no such node is available? The pod is still placed on a node that doesn't have the image cached.

Binding
- DefaultBinder - binds the pod to the node.
You can also write your own plugins to extend the scheduler's functionality; the places where plugins hook into the scheduling cycle are called extension points. For example, you could write a plugin that checks node health during the filtering phase. Reference (opens in a new tab)
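Once such a plugin is compiled into your scheduler binary, it is enabled per profile at its extension point. A sketch, where NodeHealthCheck is a hypothetical custom filter plugin:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: my-new-scheduler
  plugins:
    filter:
      enabled:
      - name: NodeHealthCheck # hypothetical custom plugin, not a built-in
```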
Scheduling Profiles
Reference (opens in a new tab)
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: my-new-scheduler-1
- schedulerName: my-new-scheduler-2
- schedulerName: my-new-scheduler-3
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: test-scheduler
Suppose we deployed separate schedulers, each with its own scheduler binary and configuration file. Managing these separate processes is a lot of work, and because they run independently, one scheduler may schedule a pod on a node without considering another scheduler's decision (a race condition).
Instead, we can use scheduling profiles to run multiple schedulers in a single process.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
- schedulerName: my-new-scheduler-1
  plugins:
    score:
      disabled:
      - name: TaintToleration
      enabled:
      - name: CustomPlugin1
      - name: CustomPlugin2
      - name: CustomPlugin3
- schedulerName: no-scoring-scheduler
  plugins:
    preScore:
      disabled:
      - name: '*'
    score:
      disabled:
      - name: '*'
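A pod then selects one of these profiles by name, exactly as it would select a separately deployed scheduler (the pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: profile-demo
spec:
  schedulerName: no-scoring-scheduler # one profile of the single scheduler process
  containers:
  - name: demo
    image: nginx
```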