Fading Coder

One Final Commit for the Last Sprint

Process Management in Linux

Tech May 19 2

Process Overview

A process is an instance of a program in execution. Beyond the executable code (the text section), a process includes open files, pending signals, internal kernel data, processor state, a memory address space with one or more mappings, one or more threads of execution, and a data section containing global variables.

Threads of execution are the objects of activity within a process and the units the kernel schedules. A process contains one or more threads. Linux does not treat threads as a separate kind of entity: to the kernel, a thread is just a special kind of process.

Most Linux processes are created through the fork() system call, which duplicates an existing process. The initiating process is termed the parent, and the newly created one is the child.

The fork() system call returns twice: once to the parent and once to the child.

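A new process usually does not keep running the parent's program for long. Often, right after fork() returns, the child calls one of the exec() family of functions, which creates a new address space and loads a new program into it.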
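
The fork-then-exec sequence can be sketched from user space as follows. This is a minimal illustration rather than anything taken from the kernel sources; the helper name run_and_wait() is hypothetical, and the exit code 127 for a failed exec is only a common shell convention adopted here.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child, replace its image with argv[0] via execvp(), and wait for it.
 * Returns the child's exit status, or -1 if fork/wait failed or the child
 * was killed by a signal.
 */
int run_and_wait(char *const argv[])
{
    pid_t pid = fork();              /* returns twice on success */

    if (pid < 0)
        return -1;                   /* fork failed: no child exists */

    if (pid == 0) {
        /* Child: fork() returned 0. Load the new program. */
        execvp(argv[0], argv);
        _exit(127);                  /* reached only if execvp() failed */
    }

    /* Parent: fork() returned the child's PID. */
    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_and_wait() with an argument vector such as { "sh", "-c", "exit 7", NULL } should return 7, the exit status of the program that replaced the child.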

Linux refers to processes as tasks from a kernel perspective.

Process Descriptor and Task Structure

The kernel maintains a list of all processes in a doubly linked circular list called the task list. Each list item is a task_struct (process descriptor) defined in , containing all necessary information for kernel process management: open files, address space, pending signals, process state, and more.

Figure 3-1 Process Descriptor and Task Queue

Process Descriptor Allocation

Prior to kernel version 2.6, the task_struct was stored at the end of each process's kernel stack. Since 2.6, task_struct is allocated dynamically by the slab allocator, and only a small struct thread_info is kept at the end of the stack: at the bottom for stacks that grow down, at the top for stacks that grow up.

In x86 architecture, struct thread_info is defined in :

/* version 2.6, x86 */
struct thread_info {
    struct task_struct    *task;
    struct exec_domain    *exec_domain;
    __u32                 flags;
    __u32                 status;
    __u32                 cpu;
    int                   preempt_count;
    mm_segment_t          addr_limit;
    struct restart_block  restart_block;
    void                  *sysenter_return;
    int                   uaccess_err;
};

Figure 3-2 Process Descriptor and Kernel Stack

In kernel version 5.10.220, x86 struct thread_info is defined as:

/* version 5.10.220, x86 */
struct thread_info {
    unsigned long    flags;    /* low level flags */
    u32              status;   /* thread synchronous flags */
};

This differs from the 2.6 version, as there is no direct pointer to struct task_struct. In 5.10.220, struct task_struct defines thread_info as its first member:

/* version 5.10.220, x86 */
struct task_struct{
#ifdef CONFIG_THREAD_INFO_IN_TASK
    /*
     * For reasons of header soup (see current_thread_info()), this
     * must be the first element of task_struct.
     */
    struct thread_info    thread_info;
#endif
        ...........
};

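Thus, in 5.10.220 on x86, thread_info no longer lives on the kernel stack. It is embedded at offset 0 of the slab-allocated task_struct, so a pointer to a task_struct can be cast directly to a pointer to its thread_info.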
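
Because thread_info is the first member of task_struct in the listing above, converting between the two pointers needs no arithmetic. The sketch below shows the idea in user space with heavily trimmed, hypothetical stand-in structures (mock_thread_info, mock_task_struct); it is not the kernel's definition.

```c
#include <stddef.h>

/* Hypothetical, heavily trimmed stand-ins for the kernel structures. */
struct mock_thread_info {
    unsigned long flags;
    unsigned int  status;
};

struct mock_task_struct {
    struct mock_thread_info thread_info;   /* must stay the first member */
    int pid;
};

/* A pointer to a struct also points at its first member, so a cast suffices. */
struct mock_thread_info *mock_task_thread_info(struct mock_task_struct *t)
{
    return (struct mock_thread_info *)t;
}
```

If another member were placed in front of thread_info, offsetof(struct mock_task_struct, thread_info) would stop being 0 and the cast would point at the wrong data, which is why the kernel comment insists on the ordering.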

In the same kernel version, ARM's struct thread_info is defined in :

/* version 5.10.220, arm */
/*
 * low level task data that entry.S needs immediate access to.
 * __switch_to() assumes cpu_context follows immediately after cpu_domain.
 */
struct thread_info {
	unsigned long		flags;		/* low level flags */
	int			preempt_count;	/* 0 => preemptable, <0 => bug */
	mm_segment_t		addr_limit;	/* address limit */
	struct task_struct	*task;		/* main task structure */
	__u32			cpu;		/* cpu */
	__u32			cpu_domain;	/* cpu domain */
#ifdef CONFIG_STACKPROTECTOR_PER_TASK
	unsigned long		stack_canary;
#endif
	struct cpu_context_save	cpu_context;	/* cpu context */
	__u32			syscall;	/* syscall number */
	__u8			used_cp[16];	/* thread used copro */
	unsigned long		tp_value[2];	/* TLS registers */
#ifdef CONFIG_CRUNCH
	struct crunch_state	crunchstate;
#endif
	union fp_state		fpstate __attribute__((aligned(8)));
	union vfp_state		vfpstate;
#ifdef CONFIG_ARM_THUMBEE
	unsigned long		thumbee_state;	/* ThumbEE Handler Base register */
#endif
};

Process Descriptor Storage

Each process is identified by a unique process ID (PID), a numeric value of the opaque type pid_t, which is typically an int. The upper bound on PIDs is set by the PID_MAX_LIMIT macro. The lower the limit, the sooner PID values wrap around and are reused, so a larger PID is no longer guaranteed to belong to a later process.

/*
 * A maximum of 4 million PIDs should be enough for a while.
 * [NOTE: PID/TIDs are limited to 2^30 ~= 1 billion, see FUTEX_TID_MASK.]
 */
#define PID_MAX_LIMIT (CONFIG_BASE_SMALL ? PAGE_SIZE * 8 : \
	(sizeof(long) > 4 ? 4 * 1024 * 1024 : PID_MAX_DEFAULT))

The limit currently in effect can be changed at run time through /proc/sys/kernel/pid_max, up to PID_MAX_LIMIT.

Figure 3-3 Ubuntu24.04 x86_64 PID_MAX=4194304
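
The arithmetic behind PID_MAX_LIMIT can be checked with a small function that makes the macro's inputs explicit. This is only a sketch: pid_max_limit() is a hypothetical helper, and its pid_max_default parameter stands in for PID_MAX_DEFAULT, which the excerpt above does not define, so the caller has to supply a value.

```c
#include <stddef.h>

/*
 * Mirror of the PID_MAX_LIMIT expression with its inputs made explicit.
 * base_small       stands in for CONFIG_BASE_SMALL
 * page_size        stands in for PAGE_SIZE
 * long_size        stands in for sizeof(long) on the target
 * pid_max_default  stands in for PID_MAX_DEFAULT (not shown in the excerpt)
 */
long pid_max_limit(int base_small, long page_size, size_t long_size,
                   long pid_max_default)
{
    if (base_small)
        return page_size * 8;
    return long_size > 4 ? 4L * 1024 * 1024 : pid_max_default;
}
```

For a 64-bit build without CONFIG_BASE_SMALL, the function returns 4 * 1024 * 1024 = 4194304, which agrees with the value in Figure 3-3.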

Code that operates on a task usually needs a pointer to its task_struct. The current macro returns the process descriptor of the currently executing task. Some architectures keep this pointer in a dedicated register; x86 in the 2.6 series instead locates the thread_info structure at the end of the kernel stack by masking the stack pointer.

On x86 in 2.6, current_thread_info() masks out the 13 least-significant bits of the stack pointer (12 bits for 4 KB stacks) to obtain the address of thread_info:

movl $-8192, %eax    # kernel stack size is 8 KB
# movl $-4096, %eax  # use this mask instead for 4 KB stacks
andl %esp, %eax      # %eax = start of the stack area = thread_info

Then, it retrieves the task_struct address from the task field:

current_thread_info()->task;
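
The masking step can be reproduced with ordinary integer arithmetic. The sketch below is a user-space illustration with a hypothetical helper name, thread_info_base(), and a made-up stack pointer value; it assumes the stack size is a power of two, as both sizes in the assembly above are.

```c
#include <stdint.h>

/*
 * Round a kernel stack pointer down to the start of its stack area.
 * thread_size must be a power of two (8192 or 4096 in the text).
 * ANDing with ~(thread_size - 1) is the same operation as ANDing with
 * -thread_size, which is what the assembly does.
 */
uintptr_t thread_info_base(uintptr_t sp, uintptr_t thread_size)
{
    return sp & ~(thread_size - 1);
}
```

With an 8 KB stack, a stack pointer of 0xc0123abc maps to 0xc0122000; with a 4 KB stack, the same pointer maps to 0xc0123000.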

In kernel 5.10.220, current_thread_info() is implemented as:

/* <linux/thread_info.h> */
#ifdef CONFIG_THREAD_INFO_IN_TASK
/*
 * For CONFIG_THREAD_INFO_IN_TASK kernels we need <asm/current.h> for the
 * definition of current, but for !CONFIG_THREAD_INFO_IN_TASK kernels,
 * including <asm/current.h> can cause a circular dependency on some platforms.
 */
#include <asm/current.h>
#define current_thread_info() ((struct thread_info *)current)
#endif


/* <asm/current.h> */
DECLARE_PER_CPU(struct task_struct *, current_task); 
static __always_inline struct task_struct *get_current(void)
{
	return this_cpu_read_stable(current_task);
}

#define current get_current()

/* arch/x86/kernel/cpu/common.c */
/*
 * The following percpu variables are hot.  Align current_task to
 * cacheline size such that they fall in the same cacheline.
 */
DEFINE_PER_CPU(struct task_struct *, current_task) ____cacheline_aligned =
	&init_task;
EXPORT_PER_CPU_SYMBOL(current_task);

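In kernel 5.10.220 on x86, current_thread_info() therefore reads the per-CPU current_task pointer and casts it to struct thread_info *. The cast is valid because thread_info is the first member of task_struct.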

Process States

There are five process states:

  1. TASK_RUNNING (Running) — Process is runnable: either executing or waiting in the run queue.
  2. TASK_INTERRUPTIBLE (Interruptible) — Sleeping (blocked) while waiting for a condition. The kernel sets the state back to TASK_RUNNING when the condition occurs; the task also wakes early if it receives a signal.
  3. TASK_UNINTERRUPTIBLE (Uninterruptible) — Similar to interruptible, but not awakened by signals.
  4. __TASK_STOPPED — Process suspended, typically due to SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU.
  5. __TASK_TRACED — Process being traced by another process, e.g., via ptrace.

Figure 3-4 Process State Transitions

/* <linux/sched.h> */
/* Used in tsk->state: */
#define TASK_RUNNING			0x0000
#define TASK_INTERRUPTIBLE		0x0001
#define TASK_UNINTERRUPTIBLE    0x0002
#define __TASK_STOPPED			0x0004
#define __TASK_TRACED			0x0008
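
Since these values are bit flags, with TASK_RUNNING being the absence of any flag, a state value can be decoded as sketched below. The EX_ prefixed names and task_state_name() are hypothetical; only the numeric values are taken from the excerpt above.

```c
/* Values copied from the <linux/sched.h> excerpt above; the EX_ prefix is ours. */
#define EX_TASK_RUNNING          0x0000
#define EX_TASK_INTERRUPTIBLE    0x0001
#define EX_TASK_UNINTERRUPTIBLE  0x0002
#define EX_TASK_STOPPED          0x0004
#define EX_TASK_TRACED           0x0008

/* Map a state value to the name of the first matching state. */
const char *task_state_name(unsigned int state)
{
    /* TASK_RUNNING is 0, so it must be compared for equality, not bit-tested. */
    if (state == EX_TASK_RUNNING)        return "TASK_RUNNING";
    if (state & EX_TASK_INTERRUPTIBLE)   return "TASK_INTERRUPTIBLE";
    if (state & EX_TASK_UNINTERRUPTIBLE) return "TASK_UNINTERRUPTIBLE";
    if (state & EX_TASK_STOPPED)         return "__TASK_STOPPED";
    if (state & EX_TASK_TRACED)          return "__TASK_TRACED";
    return "unknown";
}
```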

Setting Process State

To modify a process's state, older kernels such as the 2.6 series provide set_task_state(task, state):

set_task_state(task, state);    // Set task state to state

This function updates the specified task's state. It enforces memory barriers if needed (typically on SMP systems), otherwise equivalent to:

task->state = state;

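set_current_state(state) is equivalent to set_task_state(current, state). The 5.10 excerpt below shows only the current-task variants, one without and one with a memory barrier: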

/* <linux/sched.h> */
#define __set_current_state(state_value)				\
	current->state = (state_value)

#define set_current_state(state_value)					\
	smp_store_mb(current->state, (state_value))
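
The difference between the two macros is only the barrier after the store. A rough user-space analogy using C11 atomics is sketched below; mock_task and both function names are hypothetical, and the mapping of smp_store_mb() to a relaxed store followed by a sequentially consistent fence is an approximation, not the kernel's implementation.

```c
#include <stdatomic.h>

/* Hypothetical user-space stand-in for the state member of task_struct. */
struct mock_task {
    atomic_long state;
};

/* Analogue of __set_current_state(): a plain store with no ordering. */
void mock_set_state_relaxed(struct mock_task *t, long value)
{
    atomic_store_explicit(&t->state, value, memory_order_relaxed);
}

/*
 * Analogue of set_current_state(): the store is followed by a full fence,
 * so later loads (for example, re-checking a wake-up condition) cannot be
 * reordered before the state change.
 */
void mock_set_state_mb(struct mock_task *t, long value)
{
    atomic_store_explicit(&t->state, value, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
}
```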

Process Context

Executing program code is one of the most important parts of a process. Normal execution happens in user space. When a program makes a system call or triggers an exception, it enters kernel mode, and the kernel is said to be executing on behalf of the process, in process context. The current macro is valid in process context.

System calls and exception handlers are kernel-defined interfaces. A process can begin executing in kernel space only through one of these interfaces; all access to the kernel goes through them.

Process Family Tree

Linux processes form a hierarchy. All processes are descendants of init (PID 1). The kernel starts init in the last step of the boot process; init then reads the system's init scripts and executes further programs to complete booting.

Each process has a parent and zero or more children. Processes with the same parent are siblings. Relationships are stored in task_struct, which includes a parent pointer and a children list.

struct task_struct *my_parent = current->parent;
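
Following parent pointers repeatedly leads from any task up to init. The sketch below models this in user space with a hypothetical two-field mock_task structure; in this mock, init is recognized by its PID of 1, which is an assumption of the example rather than a statement about how the kernel terminates the walk.

```c
#include <stddef.h>

/* Hypothetical stand-in for the two task_struct fields used here. */
struct mock_task {
    int pid;
    struct mock_task *parent;
};

/* Count how many parent links separate a task from init (PID 1). */
int generations_to_init(const struct mock_task *task)
{
    int depth = 0;

    while (task->pid != 1) {
        task = task->parent;   /* step one level up the family tree */
        depth++;
    }
    return depth;
}
```

For a chain init -> shell -> child, generations_to_init() returns 2 for the child and 0 for init itself.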

Process Creation

Unix separates process creation and execution into two distinct functions: fork() and exec(). First, fork() copies the current process to create a child, differing only in PID, PPID, and certain resources. Then, exec() loads and runs the executable file.

Copy-on-Write

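Linux fork() uses copy-on-write pages. Instead of duplicating the entire address space, parent and child initially share a single copy. The shared pages are marked so that a write to one of them triggers a duplicate, giving the writer its own private copy. Copying is thus deferred until a write actually occurs, and pages that are never written are never copied.

The only overhead fork() incurs is duplicating the parent's page tables and creating a new task_struct for the child.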
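
Copy-on-write itself is invisible to programs, but its guarantee is observable: a write in the child never changes what the parent sees. The sketch below checks that guarantee from user space. The function name and the global variable are hypothetical, and the example shows only the visible effect, not the deferred page copy.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int value_in_data_segment = 1;

/*
 * Returns 1 if the child's write stayed private to the child, 0 if the
 * parent observed it, and -1 on fork/wait failure.
 */
int cow_write_is_private(void)
{
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: this write lands on the child's own copy of the page. */
        value_in_data_segment = 99;
        _exit(value_in_data_segment == 99 ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    /* Parent: its copy still holds the original value. */
    return value_in_data_segment == 1;
}
```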

fork()

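Linux implements fork() via the clone() system call, which takes flags specifying which resources parent and child share. In the 2.6 kernels, the fork(), vfork(), and __clone() library calls all invoke clone(), which in turn calls do_fork(). In 5.10 the same role is played by kernel_clone(), shown later in this section.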

/*
 * cloning flags:
 */
#define CSIGNAL		0x000000ff	/* signal mask to be sent at exit */
#define CLONE_VM	0x00000100	/* set if VM shared between processes */
#define CLONE_FS	0x00000200	/* set if fs info shared between processes */
#define CLONE_FILES	0x00000400	/* set if open files shared between processes */
#define CLONE_SIGHAND	0x00000800	/* set if signal handlers and blocked signals shared */
#define CLONE_PIDFD	0x00001000	/* set if a pidfd should be placed in parent */
#define CLONE_PTRACE	0x00002000	/* set if we want to let tracing continue on the child too */
#define CLONE_VFORK	0x00004000	/* set if the parent wants the child to wake it up on mm_release */
#define CLONE_PARENT	0x00008000	/* set if we want to have the same parent as the cloner */
#define CLONE_THREAD	0x00010000	/* Same thread group? */
#define CLONE_NEWNS	0x00020000	/* New mount namespace group */
#define CLONE_SYSVSEM	0x00040000	/* share system V SEM_UNDO semantics */
#define CLONE_SETTLS	0x00080000	/* create a new TLS for the child */
#define CLONE_PARENT_SETTID	0x00100000	/* set the TID in the parent */
#define CLONE_CHILD_CLEARTID	0x00200000	/* clear the TID in the child */
#define CLONE_DETACHED		0x00400000	/* Unused, ignored */
#define CLONE_UNTRACED		0x00800000	/* set if the tracing process can't force CLONE_PTRACE on this clone */
#define CLONE_CHILD_SETTID	0x01000000	/* set the TID in the child */
#define CLONE_NEWCGROUP		0x02000000	/* New cgroup namespace */
#define CLONE_NEWUTS		0x04000000	/* New utsname namespace */
#define CLONE_NEWIPC		0x08000000	/* New ipc namespace */
#define CLONE_NEWUSER		0x10000000	/* New user namespace */
#define CLONE_NEWPID		0x20000000	/* New pid namespace */
#define CLONE_NEWNET		0x40000000	/* New network namespace */
#define CLONE_IO		0x80000000	/* Clone io context */

do_fork() does the bulk of the work. It calls copy_process() and then starts the new process running. copy_process() performs these steps:

  1. dup_task_struct() allocates a new kernel stack, thread_info, and task_struct for the child, identical to the parent.
  2. Checks resource limits for the new process count.
  3. Differentiates parent and child by clearing or initializing members, mostly inheriting from the parent.
  4. Sets child state to TASK_UNINTERRUPTIBLE to prevent execution.
  5. copy_process() updates task_struct flags, clearing PF_SUPERPRIV and setting PF_FORKNOEXEC.
  6. Allocates a new PID with alloc_pid().
  7. Shares or copies resources based on clone() flags.
  8. Finalizes and returns a pointer to the child.

If copy_process() succeeds, the new child is woken up and run. The kernel aims to run the child first: the child usually calls exec() right away, and running it first avoids the copy-on-write copies that would occur if the parent ran first and started writing to the address space.

SYSCALL_DEFINE0(fork)
{
#ifdef CONFIG_MMU
	struct kernel_clone_args args = {
		.exit_signal = SIGCHLD,
	};

	return kernel_clone(&args);
#else
	/* can not support in nommu mode */
	return -EINVAL;
#endif
}

vfork()

vfork() has the same effect as fork(), except that the parent's page table entries are not copied. The child runs in the parent's address space, and the parent is blocked until the child exits or calls exec(). The child must not write to the address space.

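The implementation uses clone() with the CLONE_VFORK flag:

  1. During copy_process(), the vfork_done member of task_struct is set to NULL.
  2. In do_fork(), if CLONE_VFORK was given, vfork_done is pointed at a completion owned by the parent.
  3. The child runs first; the parent waits on that completion until the child signals it.
  4. In mm_release(), which runs when a task stops using its address space, vfork_done is checked and, if it is not NULL, the parent is signaled.
  5. Back in do_fork(), the parent wakes up and returns.

#ifdef __ARCH_WANT_SYS_VFORK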
SYSCALL_DEFINE0(vfork)
{
	struct kernel_clone_args args = {
		.flags		= CLONE_VFORK | CLONE_VM,
		.exit_signal	= SIGCHLD,
	};

	return kernel_clone(&args);
}
#endif

/*
 *  Ok, this is the main fork-routine.
 *
 * It copies the process, and if successful kick-starts
 * it and waits for it to finish using the VM if required.
 *
 * args->exit_signal is expected to be checked for sanity by the caller.
 */
pid_t kernel_clone(struct kernel_clone_args *args)
{
	u64 clone_flags = args->flags;
	struct completion vfork;
	struct pid *pid;
	struct task_struct *p;
	int trace = 0;
	pid_t nr;

	/*
	 * For legacy clone() calls, CLONE_PIDFD uses the parent_tid argument
	 * to return the pidfd. Hence, CLONE_PIDFD and CLONE_PARENT_SETTID are
	 * mutually exclusive. With clone3() CLONE_PIDFD has grown a separate
	 * field in struct clone_args and it still doesn't make sense to have
	 * them both point at the same memory location. Performing this check
	 * here has the advantage that we don't need to have a separate helper
	 * to check for legacy clone().
	 */
	if ((args->flags & CLONE_PIDFD) &&
	    (args->flags & CLONE_PARENT_SETTID) &&
	    (args->pidfd == args->parent_tid))
		return -EINVAL;

	/*
	 * Determine whether and which event to report to ptracer.  When
	 * called from kernel_thread or CLONE_UNTRACED is explicitly
	 * requested, no event is reported; otherwise, report if the event
	 * for the type of forking is enabled.
	 */
	if (!(clone_flags & CLONE_UNTRACED)) {
		if (clone_flags & CLONE_VFORK)
			trace = PTRACE_EVENT_VFORK;
		else if (args->exit_signal != SIGCHLD)
			trace = PTRACE_EVENT_CLONE;
		else
			trace = PTRACE_EVENT_FORK;

		if (likely(!ptrace_event_enabled(current, trace)))
			trace = 0;
	}

	p = copy_process(NULL, trace, NUMA_NO_NODE, args);
	add_latent_entropy();

	if (IS_ERR(p))
		return PTR_ERR(p);

	/*
	 * Do this prior waking up the new thread - the thread pointer
	 * might get invalid after that point, if the thread exits quickly.
	 */
	trace_sched_process_fork(current, p);

	pid = get_task_pid(p, PIDTYPE_PID);
	nr = pid_vnr(pid);

	if (clone_flags & CLONE_PARENT_SETTID)
		put_user(nr, args->parent_tid);

	if (clone_flags & CLONE_VFORK) {
		p->vfork_done = &vfork;
		init_completion(&vfork);
		get_task_struct(p);
	}

	wake_up_new_task(p);

	/* forking complete and child started to run, tell ptracer */
	if (unlikely(trace))
		ptrace_event_pid(trace, pid);

	if (clone_flags & CLONE_VFORK) {
		if (!wait_for_vfork_done(p, &vfork))
			ptrace_event_pid(PTRACE_EVENT_VFORK_DONE, pid);
	}

	put_pid(pid);
	return nr;
}
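
From user space, the safe way to use vfork() is to have the child do nothing except call an exec() function or _exit(). The sketch below follows that rule; vfork_and_exec() is a hypothetical helper, and the exit code 127 for a failed exec is a convention chosen for this example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run argv[0] in a child created with vfork() and return its exit status,
 * or -1 if vfork/wait failed or the child was killed by a signal.
 */
int vfork_and_exec(char *const argv[])
{
    pid_t pid = vfork();             /* parent is suspended until exec/_exit */

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: borrows the parent's address space, so it only execs. */
        execvp(argv[0], argv);
        _exit(127);                  /* reached only if execvp() failed */
    }

    /* Parent: resumes here once the child has called exec or exited. */
    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The parent's return from vfork() corresponds to the point where wait_for_vfork_done() completes in kernel_clone() above.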

Thread Implementation in Linux

Linux implements threads by treating them as regular processes sharing memory spaces. Threads can share open files and other resources. This mechanism supports concurrent programming and enables true parallelism on multi-processor systems.

From the kernel perspective, Linux does not distinguish threads from processes. Each thread has its own unique task_struct, making them appear as normal processes.

Creating Threads

Creating threads is similar to creating processes, differing only in parameters passed to clone():

clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, 0);

Standard fork():

clone(SIGCHLD, 0);

vfork():

clone(CLONE_VFORK | CLONE_VM | SIGCHLD, 0);

Table 3-1 clone() Flags

Flag                   Description
CLONE_FILES            Share open files
CLONE_FS               Share filesystem info
CLONE_IDLETASK         Set PID to 0 (idle task only)
CLONE_NEWNS            Create a new mount namespace
CLONE_PARENT           Same parent as cloner
CLONE_PTRACE           Continue tracing the child
CLONE_SETTID           Write the TID to userspace
CLONE_SETTLS           Create new TLS for the child
CLONE_SIGHAND          Share signal handlers and blocked signals
CLONE_SYSVSEM          Share System V SEM_UNDO semantics
CLONE_THREAD           Same thread group
CLONE_VFORK            Parent sleeps until child exits or execs
CLONE_UNTRACED         Prevent the tracer from forcing CLONE_PTRACE
CLONE_STOP             Start in TASK_STOPPED state
CLONE_CHILD_CLEARTID   Clear the TID in the child
CLONE_CHILD_SETTID     Set the TID in the child
CLONE_PARENT_SETTID    Set the TID in the parent
CLONE_VM               Share address space

CLONE_IDLETASK, CLONE_SETTID, and CLONE_STOP come from older kernel sources and do not appear in the 5.10 flag list shown in the fork() section.
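
The three clone() calls above differ only in their flag words, so a flag word can be classified with a few mask tests. The sketch below is illustrative: the EX_ prefixed constants reuse the numeric values from the cloning-flags listing earlier in the article, while classify_clone() and clone_exit_signal() are hypothetical helpers, not kernel functions.

```c
#include <signal.h>   /* SIGCHLD, the exit signal callers usually pass */

/* Numeric values copied from the cloning-flags listing; the EX_ prefix is ours. */
#define EX_CSIGNAL        0x000000ffUL
#define EX_CLONE_VM       0x00000100UL
#define EX_CLONE_FS       0x00000200UL
#define EX_CLONE_FILES    0x00000400UL
#define EX_CLONE_SIGHAND  0x00000800UL
#define EX_CLONE_VFORK    0x00004000UL

/* Classify a flag word as one of the three call patterns shown in the text. */
const char *classify_clone(unsigned long flags)
{
    const unsigned long thread_mask =
        EX_CLONE_VM | EX_CLONE_FS | EX_CLONE_FILES | EX_CLONE_SIGHAND;
    const unsigned long vfork_mask = EX_CLONE_VFORK | EX_CLONE_VM;

    if ((flags & thread_mask) == thread_mask)
        return "thread";
    if ((flags & vfork_mask) == vfork_mask)
        return "vfork";
    if ((flags & ~EX_CSIGNAL) == 0)
        return "fork";               /* nothing shared, only an exit signal */
    return "custom";
}

/* The low byte of the flag word carries the signal sent to the parent on exit. */
int clone_exit_signal(unsigned long flags)
{
    return (int)(flags & EX_CSIGNAL);
}
```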

Kernel Threads

Kernel threads are standard processes running in kernel space, performing background tasks. Unlike regular processes, they lack independent address spaces (mm pointer is NULL). They are schedulable and preemptible.

New kernel threads are spawned from kthreadd. The interface is declared in :

static __printf(4, 0)
struct task_struct *__kthread_create_on_node(int (*threadfn)(void *data),
						    void *data, int node,
						    const char namefmt[],
						    va_list args)
{
	DECLARE_COMPLETION_ONSTACK(done);
	struct task_struct *task;
	struct kthread_create_info *create = kmalloc(sizeof(*create),
						     GFP_KERNEL);

	if (!create)
		return ERR_PTR(-ENOMEM);
	create->threadfn = threadfn;
	create->data = data;
	create->node = node;
	create->done = &done;

	spin_lock(&kthread_create_lock);
	list_add_tail(&create->list, &kthread_create_list);
	spin_unlock(&kthread_create_lock);

	wake_up_process(kthreadd_task);
	/*
	 * Wait for completion in killable state, for I might be chosen by
	 * the OOM killer while kthreadd is trying to allocate memory for
	 * new kernel thread.
	 */
	if (unlikely(wait_for_completion_killable(&done))) {
		/*
		 * If I was SIGKILLed before kthreadd (or new kernel thread)
		 * calls complete(), leave the cleanup of this structure to
		 * that thread.
		 */
		if (xchg(&create->done, NULL))
			return ERR_PTR(-EINTR);
		/*
		 * kthreadd (or new kernel thread) will call complete()
		 * shortly.
		 */
		wait_for_completion(&done);
	}
	task = create->result;
	if (!IS_ERR(task)) {
		static const struct sched_param param = { .sched_priority = 0 };
		char name[TASK_COMM_LEN];

		/*
		 * task is already visible to other tasks, so updating
		 * COMM must be protected.
		 */
		vsnprintf(name, sizeof(name), namefmt, args);
		set_task_comm(task, name);
		/*
		 * root may have changed our (kthreadd's) priority or CPU mask.
		 * The kernel thread should not inherit these properties.
		 */
		sched_setscheduler_nocheck(task, SCHED_NORMAL, &param);
		set_cpus_allowed_ptr(task,
				     housekeeping_cpumask(HK_FLAG_KTHREAD));
	}
	kfree(create);
	return task;
}

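kthread_create() has kthreadd create the new task with clone(). The new thread will run threadfn, which is passed data, and it is named using the printf-style namefmt. The thread is created in an unrunnable state and does not start until it is explicitly woken with wake_up_process(). The kthread_run() macro combines both steps: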

#define kthread_run(threadfn, data, namefmt, ...)			   \
({									   \
	struct task_struct *__k						   \
		= kthread_create(threadfn, data, namefmt, ## __VA_ARGS__); \
	if (!IS_ERR(__k))						   \
		wake_up_process(__k);					   \
	__k;								   \
})

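Once started, a kernel thread runs until it calls do_exit() or until another part of the kernel calls kthread_stop() with the task_struct pointer returned by kthread_create():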

/**
 * kthread_stop - stop a thread created by kthread_create().
 * @k: thread created by kthread_create().
 *
 * Sets kthread_should_stop() for @k to return true, wakes it, and
 * waits for it to exit. This can also be called after kthread_create()
 * instead of calling wake_up_process(): the thread will exit without
 * calling threadfn().
 *
 * If threadfn() may call kthread_exit() itself, the caller must ensure
 * task_struct can't go away.
 *
 * Returns the result of threadfn(), or %-EINTR if wake_up_process()
 * was never called.
 */
int kthread_stop(struct task_struct *k)
{
	struct kthread *kthread;
	int ret;

	trace_sched_kthread_stop(k);

	get_task_struct(k);
	kthread = to_kthread(k);
	set_bit(KTHREAD_SHOULD_STOP, &kthread->flags);
	kthread_unpark(k);
	wake_up_process(k);
	wait_for_completion(&kthread->exited);
	ret = k->exit_code;
	put_task_struct(k);

	trace_sched_kthread_stop_ret(ret);
	return ret;
}

Process Termination

When a process terminates, the kernel must release its resources and notify the parent.

The main work is handled by do_exit() in :

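  1. Sets the PF_EXITING flag in the flags member of task_struct;
  2. Calls del_timer_sync() to remove any kernel timers, so that none is queued or running;
  3. Writes out accounting information if process accounting is enabled;
  4. Calls exit_mm() to release the mm_struct held by the process; if no other process shares it, the kernel destroys it;
  5. Calls exit_sem() to dequeue the process if it is waiting on an IPC semaphore;
  6. Calls exit_files() and exit_fs() to decrement the usage counts of the file descriptor and filesystem data;
  7. Stores the exit code in the exit_code member of task_struct for later retrieval by the parent;
  8. Calls exit_notify() to notify the parent, reparent the exiting task's children, and set the exit state to EXIT_ZOMBIE;
  9. Calls schedule() to switch to another process; do_exit() never returns.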

Upon reaching EXIT_ZOMBIE, the process is no longer scheduled. It exists solely to provide information to its parent. After the parent retrieves the info, the remaining memory is freed.

Removing Process Descriptors

After do_exit(), the descriptor remains to allow retrieval of process information. Cleanup and descriptor deletion are separated. The task_struct is released after the parent gets the child's information.

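The wait() family of functions is implemented through a single system call, wait4(). By default it suspends the caller until one of its children exits, then returns the PID of the exited child; a pointer argument receives the child's exit code.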

When ready to free the descriptor, release_task() is invoked:

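  1. Calls __exit_signal(), which unhashes the process, removing it from the pidhash and from the task list.
  2. __exit_signal() also releases any remaining resources used by the now-dead process.
  3. If the task was the last member of its thread group and the group leader is a zombie, notifies the zombie leader's parent.
  4. Calls put_task_struct() to free the pages holding the kernel stack and thread_info and to return the task_struct to the slab cache.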

Orphaned Processes

If a parent exits before its children, the children must be reparented to a new process; otherwise they would remain zombies forever after terminating, wasting system memory. The solution is to reparent each child to another thread in the exiting task's thread group or, if there is none, to init. This happens in exit_notify(), which calls forget_original_parent(), which in turn calls find_new_reaper().
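
Reparenting can be watched from user space by orphaning a grandchild and letting it report its parent PID. The sketch below is an illustration with a hypothetical function name. It does not assume the new parent is PID 1, because on some systems another process is designated to adopt orphans; it checks only that the parent PID changed.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Returns 1 if an orphaned grandchild observed a new parent PID after its
 * original parent exited, 0 if not, and -1 on setup failure.
 */
int orphan_gets_new_parent(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        close(fds[0]);
        pid_t original_parent = getpid();
        pid_t grandchild = fork();

        if (grandchild == 0) {
            /* Grandchild: poll (up to about 5 s) until reparented. */
            struct timespec pause = { 0, 1000000 };      /* 1 ms */
            pid_t now = getppid();
            for (int i = 0; i < 5000 && now == original_parent; i++) {
                nanosleep(&pause, NULL);
                now = getppid();
            }
            if (write(fds[1], &now, sizeof now) != (ssize_t)sizeof now)
                _exit(1);
            _exit(0);
        }
        /* Middle process: exit at once, orphaning the grandchild. */
        _exit(grandchild < 0 ? 1 : 0);
    }

    close(fds[1]);
    int status;
    if (waitpid(child, &status, 0) != child) {
        close(fds[0]);
        return -1;
    }

    /* Read the parent PID the grandchild saw after being orphaned. */
    pid_t reported = child;
    ssize_t n = read(fds[0], &reported, sizeof reported);
    close(fds[0]);
    if (n != (ssize_t)sizeof reported)
        return -1;
    return reported != child;
}
```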
