Home > Tech > Content

Kubernetes Pod Preemption Architecture and Logic Flow

Tech May 17 2

Pod priority enables the scheduler to order pods within the queue and facilitates preempsion when cluster resources are constrained. High-priority pods can displace lower-priority workloads to ensure critical services remain operational. This analysis examines the internal mechanisms governing preemptive scheduling within the Kubernetes scheduler.

Priority Configuration

Preemption relies on the PriorityClass resource to define relative importance. Pods reference these classes to inherit a numeric priority value.

PriorityClass Definition

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-tier
value: 1000000
globalDefault: false
description: "Reserved for essential infrastructure services."

Pod Specification

apiVersion: v1
kind: Pod
metadata:
  name: app-core
  labels:
    component: backend
spec:
  containers:
  - name: server
    image: core-service:v1
    imagePullPolicy: IfNotPresent
  priorityClassName: critical-tier

Preemption Trigger Point

The scheduling cycle initiates in scheduleOne. When standard predicate checking fails to find a suitable node, the scheduler evaluates preemption possibilities.

// pkg/scheduler/scheduler.go
func (sched *Scheduler) scheduleOne() {
    // ... retrieval logic ...
    
    suggestedHost, err := sched.schedule(pod)
    if err != nil {
        if fitError, ok := err.(*core.FitError); ok {
            // Attempt preemption if standard scheduling fails
            sched.preempt(pod, fitError)
            metrics.PreemptionAttempts.Inc()
        } else {
            klog.Errorf("error selecting node for pod: %v", err)
        }
        return
    }
    // ... binding logic ...
}

If sched.schedule returns an error indicating no nodes fit, sched.preempt is invoked. This function orchestrates the victim selection and nomination process.

func (sched *Scheduler) preempt(preemptor *v1.Pod, scheduleErr error) (string, error) {
    if !util.PodPriorityEnabled() || sched.config.DisablePreemption {
        return "", nil
    }
    
    // Refresh pod state
    preemptor, err := sched.config.PodPreemptor.GetUpdatedPod(preemptor)
    
    // Execute core preemption algorithm
    node, victims, nominatedPodsToClear, err := sched.config.Algorithm.Preempt(
        preemptor, sched.config.NodeLister, scheduleErr)

    if node != nil {
        // Update queue with nominated node
        sched.config.SchedulingQueue.UpdateNominatedPodForNode(preemptor, node.Name)
        
        // Annotate pod status
        err = sched.config.PodPreemptor.SetNominatedNodeName(preemptor, node.Name)
        
        // Evict selected victims
        for _, victim := range victims {
            sched.config.PodPreemptor.DeletePod(victim)
        }
    }
    
    // Cleanup stale nominations
    for _, p := range nominatedPodsToClear {
        sched.config.PodPreemptor.RemoveNominatedNodeName(p)
    }
    
    return node.Name, err
}

Queue Management Structures

The scheduler maintains pods in a SchedulingQueue. Two primary implementations exist: FIFO and PriorityQueue.

PriorityQueue Implementation

The PriorityQueue manages active and unschedulable pods separately, ensuring high-priority items are processed first.

type PriorityQueue struct {
    lock  sync.RWMutex
    activeQ       *Heap                // Pods ready for scheduling
    unschedulableQ *UnschedulablePodsMap // Pods previously failed
    nominatedPods *nominatedPodMap     // Pods assigned to specific nodes
    receivedMoveRequest bool
}

Key operations include adding pods to the active queue and managing nominations.

func (p *PriorityQueue) Add(pod *v1.Pod) error {
    p.lock.Lock()
    defer p.lock.Unlock()
    
    err := p.activeQ.Add(pod)
    if err == nil {
        // Remove from unschedulable if present
        p.unschedulableQ.delete(pod)
        // Reset nomination
        p.nominatedPods.add(pod, "")
        p.cond.Broadcast()
    }
    return err
}

Nominated pods are tracked to prevent resource contention during the preemption window.

type nominatedPodMap struct {
    nominatedPods      map[string][]*v1.Pod // Node -> Pods
    nominatedPodToNode map[ktypes.UID]string // Pod UID -> Node
}

Preemption Algorithm Details

The core logic resides in genericScheduler.Preempt. It follows a multi-stage filtering process to identify the optimal node and victims.

Interface Definition

type ScheduleAlgorithm interface {
    Preempt(*v1.Pod, NodeLister, error) (
        selectedNode *v1.Node, 
        preemptedPods []*v1.Pod, 
        cleanupNominatedPods []*v1.Pod, 
        err error)
}

Execution Flow

Eligibility Check: Determines if the pod should attempt preemption.
Node Filtering: Identifies nodes where preemption could potentially succeed.
Victim Selection: Calculates which pods to evict on candidate nodes.
Node Selection: Chooses the best node based on victim cost.

func (g *genericScheduler) Preempt(pod *v1.Pod, nodeLister algorithm.NodeLister, scheduleErr error) (*v1.Node, []*v1.Pod, []*v1.Pod, error) {
    // 1. Check eligibility
    if !podEligibleToPreemptOthers(pod, g.cachedNodeInfoMap) {
        return nil, nil, nil, nil
    }
    
    allNodes, err := nodeLister.List()
    
    // 2. Find potential nodes
    potentialNodes := nodesWherePreemptionMightHelp(allNodes, fitError.FailedPredicates)
    if len(potentialNodes) == 0 {
        return nil, nil, []*v1.Pod{pod}, nil
    }
    
    // 3. Calculate victims per node
    nodeToVictims, err := selectNodesForPreemption(pod, g.cachedNodeInfoMap, potentialNodes, ...)
    
    // 4. Pick the best node
    candidateNode := pickOneNodeForPreemption(nodeToVictims)
    
    // Cleanup lower priority nominations on the selected node
    nominatedPods := g.getLowerPriorityNominatedPods(pod, candidateNode.Name)
    
    return candidateNode, nodeToVictims[candidateNode].Pods, nominatedPods, nil
}

Eligibility Verification

A pod is ineligible if its already nominated on a node where lower-priority pods are terminating.

func podEligibleToPreemptOthers(pod *v1.Pod, nodeNameToInfo map[string]*schedulercache.NodeInfo) bool {
    nomNodeName := pod.Status.NominatedNodeName
    if len(nomNodeName) > 0 {
        if nodeInfo, found := nodeNameToInfo[nomNodeName]; found {
            for _, p := range nodeInfo.Pods() {
                if p.DeletionTimestamp != nil && util.GetPodPriority(p) < util.GetPodPriority(pod) {
                    return false
                }
            }
        }
    }
    return true
}

Candidate Node Identification

Nodes are filtered based on failure reasons. Hard constraints like node selectors or taints cannot be resolved by preemption.

func nodesWherePreemptionMightHelp(nodes []*v1.Node, failedPredicatesMap FailedPredicateMap) []*v1.Node {
    var potentialNodes []*v1.Node
    for _, node := range nodes {
        unresolvable := false
        failedPredicates, _ := failedPredicatesMap[node.Name]
        
        for _, reason := range failedPredicates {
            if isUnresolvableReason(reason) {
                unresolvable = true
                break
            }
        }
        
        if !unresolvable {
            potentialNodes = append(potentialNodes, node)
        }
    }
    return potentialNodes
}

Victim Selection Strategy

For each candidate node, the scheduler calculates the minimal set of pods to evict. It prioritizes preserving pods that do not violate PodDisruptionBudgets (PDB).

func selectVictimsOnNode(...) ([]*v1.Pod, int, bool) {
    // Collect lower priority pods
    var potentialVictims util.SortableList
    for _, p := range nodeInfo.Pods() {
        if util.GetPodPriority(p) < util.GetPodPriority(pod) {
            potentialVictims.Items = append(potentialVictims.Items, p)
        }
    }
    potentialVictims.Sort()
    
    // Simulate removal of all lower priority pods
    // Check if pod fits after removal
    if !fitsAfterRemoval(pod, potentialVictims) {
        return nil, 0, false
    }
    
    // Attempt to reprieve pods starting from highest priority
    // Separate into PDB-violating and non-violating groups
    violating, nonViolating := filterPodsWithPDBViolation(potentialVictims.Items, pdbs)
    
    var victims []*v1.Pod
    // Try to keep violating pods first
    for _, p := range violating {
        if !reprievePod(p) {
            victims = append(victims, p)
        }
    }
    // Then try to keep non-violating pods
    for _, p := range nonViolating {
        reprievePod(p)
    }
    
    return victims, countViolations, true
}

Final Node Scoring

The scheduler selects the node requiring the least disruptive preemption. Criteria include PDB violations, highest victim priority, total priority sum, and victim count.

func pickOneNodeForPreemption(nodesToVictims map[*v1.Node]*schedulerapi.Victims) *v1.Node {
    // 1. Minimize PDB violations
    minNodes := findNodesWithMinPDBViolations(nodesToVictims)
    if len(minNodes) == 1 { return minNodes[0] }
    
    // 2. Minimize highest priority victim
    minNodes = findNodesWithMinHighestPriority(minNodes, nodesToVictims)
    if len(minNodes) == 1 { return minNodes[0] }
    
    // 3. Minimize total priority sum
    minNodes = findNodesWithMinPrioritySum(minNodes, nodesToVictims)
    if len(minNodes) == 1 { return minNodes[0] }
    
    // 4. Minimize total victim count
    minNodes = findNodesWithMinVictimCount(minNodes, nodesToVictims)
    
    if len(minNodes) > 0 {
        return minNodes[0]
    }
    return nil
}

Tags: Kubernetes scheduling preemption golang

Back to List

Prev: Hybrid Segmentation Pipeline for Cerebellum and Brainstem Extraction in Brain MRI

Next: Getting Started with MyBatis and Core Configuration

Fading Coder

Kubernetes Pod Preemption Architecture and Logic Flow

Priority Configuration

PriorityClass Definition

Pod Specification

Preemption Trigger Point

Queue Management Structures

PriorityQueue Implementation

Preemption Algorithm Details

Interface Definition

Execution Flow

Eligibility Verification

Candidate Node Identification

Victim Selection Strategy

Final Node Scoring

Related Articles

Understanding Strong and Weak References in Java

Comprehensive Guide to SSTI Explained with Payload Bypass Techniques

Implement Image Upload Functionality for Django Integrated TinyMCE Editor

Leave a Comment

Copyright © fadingcoder.top

Fading Coder

Kubernetes Pod Preemption Architecture and Logic Flow

Priority Configuration

PriorityClass Definition

Pod Specification

Preemption Trigger Point

Queue Management Structures

PriorityQueue Implementation

Preemption Algorithm Details

Interface Definition

Execution Flow

Eligibility Verification

Candidate Node Identification

Victim Selection Strategy

Final Node Scoring

Related Articles

Understanding Strong and Weak References in Java

Comprehensive Guide to SSTI Explained with Payload Bypass Techniques

Implement Image Upload Functionality for Django Integrated TinyMCE Editor

Leave a CommentCancel Reply

Copyright © fadingcoder.top

Leave a Comment