Understanding the Linux Completely Fair Scheduler (CFS)


Overview of the CFS Scheduler

The Completely Fair Scheduler (CFS) is the default process scheduler for non-real-time tasks in the Linux kernel, introduced in version 2.6.23. Its primary design goal is to approximate an idealized, perfectly multitasking processor on real hardware.

At its core, CFS operates on the concept of virtual runtime (vruntime). It imagines a virtual CPU with a total capacity of 1. If there are N runnable tasks, each task should ideally receive 1/N of this virtual CPU's capacity over time. CFS strives to equalize the vruntime of all tasks, meaning each task progresses through its virtual timeline at a fair rate relative to others.

Task priority, expressed through the nice value, is integrated by weighting the distribution of physical CPU time. Each nice value maps to a weight, and a task's vruntime advances more slowly the larger its weight. Because CFS always equalizes vruntime, higher-priority (heavier) tasks therefore accumulate more real CPU time while still appearing "fair" in virtual time.

Core Design Principles

  • Idealized Processor Model: CFS attempts to emulate a processor that can run multiple tasks simultaneously, granting each an equal fraction of its power.
  • Target Latency & Time Slices: CFS defines a target latency—the maximum acceptable period within which every runnable task should run at least once (default: 6ms). The ideal time slice for a task is target_latency / N. This slice is dynamic, changing with the number of runnable tasks (N).
  • Minimum Granularity: To prevent excessive overhead from frequent context switches, a minimum granularity (default: 0.75ms) defines the shortest duration a task will run before being eligible for preemption.
  • Scheduling Decision: CFS always selects the task with the smallest vruntime to run next. A task runs for its calculated weighted time slice (or until it blocks) before being preempted to allow other tasks to catch up in vruntime.
  • Tracking Fairness: CFS maintains the vruntime for every task, both runnable and blocked. A lower vruntime indicates a task is more "deserving" of CPU time.

Key Implementation Details

Virtual Runtime and Task Structure

The vruntime for a task is stored within its scheduling entity, struct sched_entity, which is embedded in the main struct task_struct.

/* include/linux/sched.h */
struct sched_entity {
    // ... other fields ...
    u64 vruntime; // Virtual runtime of the task in nanoseconds
    // ...
};

struct task_struct {
    // ...
    const struct sched_class *sched_class;
    struct sched_entity se; // CFS scheduling parameters
    // ...
};

The Runqueue: A Red-Black Tree

CFS does not use a traditional FIFO runqueue. Instead, each CPU has its own CFS runqueue (struct cfs_rq) which uses a red-black tree keyed by p->se.vruntime to maintain a timeline of runnable tasks.

/* kernel/sched/sched.h */
struct cfs_rq {
    // ...
    u64 min_vruntime; // Monotonically increasing minimum vruntime in the tree
    struct rb_root_cached tasks_timeline; // Root of the red-black tree
    // ...
};

struct rq {
    // ...
    struct cfs_rq cfs; // The CFS-specific part of the per-CPU runqueue
    // ...
};

The tree is self-balancing, ensuring efficient O(log N) insertion and deletion. The task with the smallest vruntime is always the leftmost node in the tree, which the kernel caches (rb_root_cached) so it can be retrieved in O(1) time.

  • Insertion Complexity: O(log N)
  • Selection Complexity (pick next task): O(1)
  • Overall Scheduling Complexity: O(log N)

This structure naturally prioritizes I/O-bound tasks (which sleep often and have low vruntime) by placing them towards the left, while CPU-bound tasks gradually move to the right.

CFS Parameters and Policies

Tunable Parameters

  • sysctl_sched_latency (Target Latency): Default 6,000,000 ns (6 ms). The period aimed at for scheduling all runnable tasks.
  • sysctl_sched_min_granularity: Default 750,000 ns (0.75 ms). The minimum time a task runs before being preempted.
  • sched_prio_to_weight: An array mapping nice values (from -20 to +19) to load weights. A lower nice value (higher priority) corresponds to a larger weight.

When the number of runnable tasks causes the ideal slice (latency / N) to fall below min_granularity, the scheduler period is effectively extended to N * min_granularity to preserve efficiency.

Scheduling Policies

CFS implements three scheduling policies for non-real-time tasks:

  1. SCHED_NORMAL (formerly SCHED_OTHER): The default policy for regular tasks.
  2. SCHED_BATCH: For CPU-intensive background tasks that do not require low latency. The scheduler allows them to run longer batches to improve cache utilization and throughput.
  3. SCHED_IDLE: For very low-priority tasks that should only run when the system is otherwise idle.

The Scheduler Class Interface

Linux supports multiple scheduler classes (e.g., CFS, Real-Time). Each is represented by a struct sched_class containing callback function pointers for key scheduling events.

/* kernel/sched/sched.h */
struct sched_class {
    const struct sched_class *next;
    void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
    void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
    void (*yield_task) (struct rq *rq);
    struct task_struct * (*pick_next_task) (struct rq *rq,
                                            struct task_struct *prev,
                                            struct rq_flags *rf);
    void (*put_prev_task) (struct rq *rq, struct task_struct *p);
    void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
    void (*update_curr) (struct rq *rq);
    // ... many other callbacks ...
};

The CFS scheduler class is defined as:

const struct sched_class fair_sched_class = {
    .next           = &idle_sched_class,
    .enqueue_task   = enqueue_task_fair,
    .dequeue_task   = dequeue_task_fair,
    .pick_next_task = pick_next_task_fair,
    .put_prev_task  = put_prev_task_fair,
    .task_tick      = task_tick_fair,
    .update_curr    = update_curr_fair,
    // ...
};

Group Scheduling and Features

CFS supports group scheduling via control groups (cgroups). This allows administrators to allocate CPU shares hierarchically—for example, fair shares first between users, then between tasks within a user's group.

Key configuration options:

  • CONFIG_CGROUP_SCHED: Enables task grouping for scheduling.
  • CONFIG_FAIR_GROUP_SCHED: Enables group scheduling for CFS tasks.
  • CONFIG_RT_GROUP_SCHED: Enables group scheduling for real-time tasks.

With group scheduling enabled, a cpu.shares file appears in each cgroup directory, allowing control over the relative CPU bandwidth allocated to that group of tasks.

Example of setting up groups:

# Mount cgroup filesystem
mount -t cgroup -o cpu none /sys/fs/cgroup/cpu
cd /sys/fs/cgroup/cpu

# Create groups for multimedia and browser tasks
mkdir multimedia
mkdir browser

# Allocate CPU shares (relative weights)
echo 2048 > multimedia/cpu.shares  # 2x share
echo 1024 > browser/cpu.shares     # 1x share

# Assign processes to groups
echo $FIREFOX_PID > browser/tasks
echo $PLAYER_PID > multimedia/tasks

CFS also provides strong support for Symmetric Multiprocessing (SMP), including sophisticated load-balancing logic across CPU cores and scheduling domains for configuring processor groupings.
