Kubernetes Pod Preemption Architecture and Logic Flow
Pod priority enables the scheduler to order pods within the queue and facilitates preempsion when cluster resources are constrained. High-priority pods can displace lower-priority workloads to ensure critical services remain operational. This analysis examines the internal mechanisms governing preemptive scheduling within the Kubernetes scheduler.
Priority Configuration
Preemption relies on the PriorityClass resource to define relative importance. Pods reference these classes to inherit a numeric priority value.
PriorityClass Definition
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: critical-tier
value: 1000000
globalDefault: false
description: "Reserved for essential infrastructure services."
Pod Specification
apiVersion: v1
kind: Pod
metadata:
name: app-core
labels:
component: backend
spec:
containers:
- name: server
image: core-service:v1
imagePullPolicy: IfNotPresent
priorityClassName: critical-tier
Preemption Trigger Point
The scheduling cycle initiates in scheduleOne. When standard predicate checking fails to find a suitable node, the scheduler evaluates preemption possibilities.
// pkg/scheduler/scheduler.go
func (sched *Scheduler) scheduleOne() {
// ... retrieval logic ...
suggestedHost, err := sched.schedule(pod)
if err != nil {
if fitError, ok := err.(*core.FitError); ok {
// Attempt preemption if standard scheduling fails
sched.preempt(pod, fitError)
metrics.PreemptionAttempts.Inc()
} else {
klog.Errorf("error selecting node for pod: %v", err)
}
return
}
// ... binding logic ...
}
If sched.schedule returns an error indicating no nodes fit, sched.preempt is invoked. This function orchestrates the victim selection and nomination process.
func (sched *Scheduler) preempt(preemptor *v1.Pod, scheduleErr error) (string, error) {
if !util.PodPriorityEnabled() || sched.config.DisablePreemption {
return "", nil
}
// Refresh pod state
preemptor, err := sched.config.PodPreemptor.GetUpdatedPod(preemptor)
// Execute core preemption algorithm
node, victims, nominatedPodsToClear, err := sched.config.Algorithm.Preempt(
preemptor, sched.config.NodeLister, scheduleErr)
if node != nil {
// Update queue with nominated node
sched.config.SchedulingQueue.UpdateNominatedPodForNode(preemptor, node.Name)
// Annotate pod status
err = sched.config.PodPreemptor.SetNominatedNodeName(preemptor, node.Name)
// Evict selected victims
for _, victim := range victims {
sched.config.PodPreemptor.DeletePod(victim)
}
}
// Cleanup stale nominations
for _, p := range nominatedPodsToClear {
sched.config.PodPreemptor.RemoveNominatedNodeName(p)
}
return node.Name, err
}
Queue Management Structures
The scheduler maintains pods in a SchedulingQueue. Two primary implementations exist: FIFO and PriorityQueue.
PriorityQueue Implementation
The PriorityQueue manages active and unschedulable pods separately, ensuring high-priority items are processed first.
type PriorityQueue struct {
lock sync.RWMutex
activeQ *Heap // Pods ready for scheduling
unschedulableQ *UnschedulablePodsMap // Pods previously failed
nominatedPods *nominatedPodMap // Pods assigned to specific nodes
receivedMoveRequest bool
}
Key operations include adding pods to the active queue and managing nominations.
func (p *PriorityQueue) Add(pod *v1.Pod) error {
p.lock.Lock()
defer p.lock.Unlock()
err := p.activeQ.Add(pod)
if err == nil {
// Remove from unschedulable if present
p.unschedulableQ.delete(pod)
// Reset nomination
p.nominatedPods.add(pod, "")
p.cond.Broadcast()
}
return err
}
Nominated pods are tracked to prevent resource contention during the preemption window.
type nominatedPodMap struct {
nominatedPods map[string][]*v1.Pod // Node -> Pods
nominatedPodToNode map[ktypes.UID]string // Pod UID -> Node
}
Preemption Algorithm Details
The core logic resides in genericScheduler.Preempt. It follows a multi-stage filtering process to identify the optimal node and victims.
Interface Definition
type ScheduleAlgorithm interface {
Preempt(*v1.Pod, NodeLister, error) (
selectedNode *v1.Node,
preemptedPods []*v1.Pod,
cleanupNominatedPods []*v1.Pod,
err error)
}
Execution Flow
- Eligibility Check: Determines if the pod should attempt preemption.
- Node Filtering: Identifies nodes where preemption could potentially succeed.
- Victim Selection: Calculates which pods to evict on candidate nodes.
- Node Selection: Chooses the best node based on victim cost.
func (g *genericScheduler) Preempt(pod *v1.Pod, nodeLister algorithm.NodeLister, scheduleErr error) (*v1.Node, []*v1.Pod, []*v1.Pod, error) {
// 1. Check eligibility
if !podEligibleToPreemptOthers(pod, g.cachedNodeInfoMap) {
return nil, nil, nil, nil
}
allNodes, err := nodeLister.List()
// 2. Find potential nodes
potentialNodes := nodesWherePreemptionMightHelp(allNodes, fitError.FailedPredicates)
if len(potentialNodes) == 0 {
return nil, nil, []*v1.Pod{pod}, nil
}
// 3. Calculate victims per node
nodeToVictims, err := selectNodesForPreemption(pod, g.cachedNodeInfoMap, potentialNodes, ...)
// 4. Pick the best node
candidateNode := pickOneNodeForPreemption(nodeToVictims)
// Cleanup lower priority nominations on the selected node
nominatedPods := g.getLowerPriorityNominatedPods(pod, candidateNode.Name)
return candidateNode, nodeToVictims[candidateNode].Pods, nominatedPods, nil
}
Eligibility Verification
A pod is ineligible if its already nominated on a node where lower-priority pods are terminating.
func podEligibleToPreemptOthers(pod *v1.Pod, nodeNameToInfo map[string]*schedulercache.NodeInfo) bool {
nomNodeName := pod.Status.NominatedNodeName
if len(nomNodeName) > 0 {
if nodeInfo, found := nodeNameToInfo[nomNodeName]; found {
for _, p := range nodeInfo.Pods() {
if p.DeletionTimestamp != nil && util.GetPodPriority(p) < util.GetPodPriority(pod) {
return false
}
}
}
}
return true
}
Candidate Node Identification
Nodes are filtered based on failure reasons. Hard constraints like node selectors or taints cannot be resolved by preemption.
func nodesWherePreemptionMightHelp(nodes []*v1.Node, failedPredicatesMap FailedPredicateMap) []*v1.Node {
var potentialNodes []*v1.Node
for _, node := range nodes {
unresolvable := false
failedPredicates, _ := failedPredicatesMap[node.Name]
for _, reason := range failedPredicates {
if isUnresolvableReason(reason) {
unresolvable = true
break
}
}
if !unresolvable {
potentialNodes = append(potentialNodes, node)
}
}
return potentialNodes
}
Victim Selection Strategy
For each candidate node, the scheduler calculates the minimal set of pods to evict. It prioritizes preserving pods that do not violate PodDisruptionBudgets (PDB).
func selectVictimsOnNode(...) ([]*v1.Pod, int, bool) {
// Collect lower priority pods
var potentialVictims util.SortableList
for _, p := range nodeInfo.Pods() {
if util.GetPodPriority(p) < util.GetPodPriority(pod) {
potentialVictims.Items = append(potentialVictims.Items, p)
}
}
potentialVictims.Sort()
// Simulate removal of all lower priority pods
// Check if pod fits after removal
if !fitsAfterRemoval(pod, potentialVictims) {
return nil, 0, false
}
// Attempt to reprieve pods starting from highest priority
// Separate into PDB-violating and non-violating groups
violating, nonViolating := filterPodsWithPDBViolation(potentialVictims.Items, pdbs)
var victims []*v1.Pod
// Try to keep violating pods first
for _, p := range violating {
if !reprievePod(p) {
victims = append(victims, p)
}
}
// Then try to keep non-violating pods
for _, p := range nonViolating {
reprievePod(p)
}
return victims, countViolations, true
}
Final Node Scoring
The scheduler selects the node requiring the least disruptive preemption. Criteria include PDB violations, highest victim priority, total priority sum, and victim count.
func pickOneNodeForPreemption(nodesToVictims map[*v1.Node]*schedulerapi.Victims) *v1.Node {
// 1. Minimize PDB violations
minNodes := findNodesWithMinPDBViolations(nodesToVictims)
if len(minNodes) == 1 { return minNodes[0] }
// 2. Minimize highest priority victim
minNodes = findNodesWithMinHighestPriority(minNodes, nodesToVictims)
if len(minNodes) == 1 { return minNodes[0] }
// 3. Minimize total priority sum
minNodes = findNodesWithMinPrioritySum(minNodes, nodesToVictims)
if len(minNodes) == 1 { return minNodes[0] }
// 4. Minimize total victim count
minNodes = findNodesWithMinVictimCount(minNodes, nodesToVictims)
if len(minNodes) > 0 {
return minNodes[0]
}
return nil
}