Fading Coder

One Final Commit for the Last Sprint

Home > Tech > Content

Kubernetes Pod Preemption Architecture and Logic Flow

Tech May 17 2

Pod priority enables the scheduler to order pods within the queue and facilitates preempsion when cluster resources are constrained. High-priority pods can displace lower-priority workloads to ensure critical services remain operational. This analysis examines the internal mechanisms governing preemptive scheduling within the Kubernetes scheduler.

Priority Configuration

Preemption relies on the PriorityClass resource to define relative importance. Pods reference these classes to inherit a numeric priority value.

PriorityClass Definition

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-tier
value: 1000000
globalDefault: false
description: "Reserved for essential infrastructure services."

Pod Specification

apiVersion: v1
kind: Pod
metadata:
  name: app-core
  labels:
    component: backend
spec:
  containers:
  - name: server
    image: core-service:v1
    imagePullPolicy: IfNotPresent
  priorityClassName: critical-tier

Preemption Trigger Point

The scheduling cycle initiates in scheduleOne. When standard predicate checking fails to find a suitable node, the scheduler evaluates preemption possibilities.

// pkg/scheduler/scheduler.go
func (sched *Scheduler) scheduleOne() {
    // ... retrieval logic ...
    
    suggestedHost, err := sched.schedule(pod)
    if err != nil {
        if fitError, ok := err.(*core.FitError); ok {
            // Attempt preemption if standard scheduling fails
            sched.preempt(pod, fitError)
            metrics.PreemptionAttempts.Inc()
        } else {
            klog.Errorf("error selecting node for pod: %v", err)
        }
        return
    }
    // ... binding logic ...
}

If sched.schedule returns an error indicating no nodes fit, sched.preempt is invoked. This function orchestrates the victim selection and nomination process.

func (sched *Scheduler) preempt(preemptor *v1.Pod, scheduleErr error) (string, error) {
    if !util.PodPriorityEnabled() || sched.config.DisablePreemption {
        return "", nil
    }
    
    // Refresh pod state
    preemptor, err := sched.config.PodPreemptor.GetUpdatedPod(preemptor)
    
    // Execute core preemption algorithm
    node, victims, nominatedPodsToClear, err := sched.config.Algorithm.Preempt(
        preemptor, sched.config.NodeLister, scheduleErr)

    if node != nil {
        // Update queue with nominated node
        sched.config.SchedulingQueue.UpdateNominatedPodForNode(preemptor, node.Name)
        
        // Annotate pod status
        err = sched.config.PodPreemptor.SetNominatedNodeName(preemptor, node.Name)
        
        // Evict selected victims
        for _, victim := range victims {
            sched.config.PodPreemptor.DeletePod(victim)
        }
    }
    
    // Cleanup stale nominations
    for _, p := range nominatedPodsToClear {
        sched.config.PodPreemptor.RemoveNominatedNodeName(p)
    }
    
    return node.Name, err
}

Queue Management Structures

The scheduler maintains pods in a SchedulingQueue. Two primary implementations exist: FIFO and PriorityQueue.

PriorityQueue Implementation

The PriorityQueue manages active and unschedulable pods separately, ensuring high-priority items are processed first.

type PriorityQueue struct {
    lock  sync.RWMutex
    activeQ       *Heap                // Pods ready for scheduling
    unschedulableQ *UnschedulablePodsMap // Pods previously failed
    nominatedPods *nominatedPodMap     // Pods assigned to specific nodes
    receivedMoveRequest bool
}

Key operations include adding pods to the active queue and managing nominations.

func (p *PriorityQueue) Add(pod *v1.Pod) error {
    p.lock.Lock()
    defer p.lock.Unlock()
    
    err := p.activeQ.Add(pod)
    if err == nil {
        // Remove from unschedulable if present
        p.unschedulableQ.delete(pod)
        // Reset nomination
        p.nominatedPods.add(pod, "")
        p.cond.Broadcast()
    }
    return err
}

Nominated pods are tracked to prevent resource contention during the preemption window.

type nominatedPodMap struct {
    nominatedPods      map[string][]*v1.Pod // Node -> Pods
    nominatedPodToNode map[ktypes.UID]string // Pod UID -> Node
}

Preemption Algorithm Details

The core logic resides in genericScheduler.Preempt. It follows a multi-stage filtering process to identify the optimal node and victims.

Interface Definition

type ScheduleAlgorithm interface {
    Preempt(*v1.Pod, NodeLister, error) (
        selectedNode *v1.Node, 
        preemptedPods []*v1.Pod, 
        cleanupNominatedPods []*v1.Pod, 
        err error)
}

Execution Flow

  1. Eligibility Check: Determines if the pod should attempt preemption.
  2. Node Filtering: Identifies nodes where preemption could potentially succeed.
  3. Victim Selection: Calculates which pods to evict on candidate nodes.
  4. Node Selection: Chooses the best node based on victim cost.
func (g *genericScheduler) Preempt(pod *v1.Pod, nodeLister algorithm.NodeLister, scheduleErr error) (*v1.Node, []*v1.Pod, []*v1.Pod, error) {
    // 1. Check eligibility
    if !podEligibleToPreemptOthers(pod, g.cachedNodeInfoMap) {
        return nil, nil, nil, nil
    }
    
    allNodes, err := nodeLister.List()
    
    // 2. Find potential nodes
    potentialNodes := nodesWherePreemptionMightHelp(allNodes, fitError.FailedPredicates)
    if len(potentialNodes) == 0 {
        return nil, nil, []*v1.Pod{pod}, nil
    }
    
    // 3. Calculate victims per node
    nodeToVictims, err := selectNodesForPreemption(pod, g.cachedNodeInfoMap, potentialNodes, ...)
    
    // 4. Pick the best node
    candidateNode := pickOneNodeForPreemption(nodeToVictims)
    
    // Cleanup lower priority nominations on the selected node
    nominatedPods := g.getLowerPriorityNominatedPods(pod, candidateNode.Name)
    
    return candidateNode, nodeToVictims[candidateNode].Pods, nominatedPods, nil
}

Eligibility Verification

A pod is ineligible if its already nominated on a node where lower-priority pods are terminating.

func podEligibleToPreemptOthers(pod *v1.Pod, nodeNameToInfo map[string]*schedulercache.NodeInfo) bool {
    nomNodeName := pod.Status.NominatedNodeName
    if len(nomNodeName) > 0 {
        if nodeInfo, found := nodeNameToInfo[nomNodeName]; found {
            for _, p := range nodeInfo.Pods() {
                if p.DeletionTimestamp != nil && util.GetPodPriority(p) < util.GetPodPriority(pod) {
                    return false
                }
            }
        }
    }
    return true
}

Candidate Node Identification

Nodes are filtered based on failure reasons. Hard constraints like node selectors or taints cannot be resolved by preemption.

func nodesWherePreemptionMightHelp(nodes []*v1.Node, failedPredicatesMap FailedPredicateMap) []*v1.Node {
    var potentialNodes []*v1.Node
    for _, node := range nodes {
        unresolvable := false
        failedPredicates, _ := failedPredicatesMap[node.Name]
        
        for _, reason := range failedPredicates {
            if isUnresolvableReason(reason) {
                unresolvable = true
                break
            }
        }
        
        if !unresolvable {
            potentialNodes = append(potentialNodes, node)
        }
    }
    return potentialNodes
}

Victim Selection Strategy

For each candidate node, the scheduler calculates the minimal set of pods to evict. It prioritizes preserving pods that do not violate PodDisruptionBudgets (PDB).

func selectVictimsOnNode(...) ([]*v1.Pod, int, bool) {
    // Collect lower priority pods
    var potentialVictims util.SortableList
    for _, p := range nodeInfo.Pods() {
        if util.GetPodPriority(p) < util.GetPodPriority(pod) {
            potentialVictims.Items = append(potentialVictims.Items, p)
        }
    }
    potentialVictims.Sort()
    
    // Simulate removal of all lower priority pods
    // Check if pod fits after removal
    if !fitsAfterRemoval(pod, potentialVictims) {
        return nil, 0, false
    }
    
    // Attempt to reprieve pods starting from highest priority
    // Separate into PDB-violating and non-violating groups
    violating, nonViolating := filterPodsWithPDBViolation(potentialVictims.Items, pdbs)
    
    var victims []*v1.Pod
    // Try to keep violating pods first
    for _, p := range violating {
        if !reprievePod(p) {
            victims = append(victims, p)
        }
    }
    // Then try to keep non-violating pods
    for _, p := range nonViolating {
        reprievePod(p)
    }
    
    return victims, countViolations, true
}

Final Node Scoring

The scheduler selects the node requiring the least disruptive preemption. Criteria include PDB violations, highest victim priority, total priority sum, and victim count.

func pickOneNodeForPreemption(nodesToVictims map[*v1.Node]*schedulerapi.Victims) *v1.Node {
    // 1. Minimize PDB violations
    minNodes := findNodesWithMinPDBViolations(nodesToVictims)
    if len(minNodes) == 1 { return minNodes[0] }
    
    // 2. Minimize highest priority victim
    minNodes = findNodesWithMinHighestPriority(minNodes, nodesToVictims)
    if len(minNodes) == 1 { return minNodes[0] }
    
    // 3. Minimize total priority sum
    minNodes = findNodesWithMinPrioritySum(minNodes, nodesToVictims)
    if len(minNodes) == 1 { return minNodes[0] }
    
    // 4. Minimize total victim count
    minNodes = findNodesWithMinVictimCount(minNodes, nodesToVictims)
    
    if len(minNodes) > 0 {
        return minNodes[0]
    }
    return nil
}

Related Articles

Understanding Strong and Weak References in Java

Strong References Strong reference are the most prevalent type of object referencing in Java. When an object has a strong reference pointing to it, the garbage collector will not reclaim its memory. F...

Comprehensive Guide to SSTI Explained with Payload Bypass Techniques

Introduction Server-Side Template Injection (SSTI) is a vulnerability in web applications where user input is improper handled within the template engine and executed on the server. This exploit can r...

Implement Image Upload Functionality for Django Integrated TinyMCE Editor

Django’s Admin panel is highly user-friendly, and pairing it with TinyMCE, an effective rich text editor, simplifies content management significantly. Combining the two is particular useful for bloggi...

Leave a Comment

Anonymous

◎Feel free to join the discussion and share your thoughts.