Process Management in Linux
Process Overview
A process is a running instance of a program. Beyond the executable code (the text section), a process includes open files, pending signals, internal kernel data, processor state, a virtual address space with one or more memory mappings, one or more threads of execution, and a data section containing global variables.
Threads are the units the kernel schedules. A process contains one or more threads. Linux does not treat threads as a separate concept: to the kernel, a thread is just a special kind of process.
Most Linux processes are created through the fork() system call, which duplicates an existing process. The initiating process is termed the parent, and the newly created one is the child.
The fork() system call returns twice: once to the parent and once to the child.
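The two return values can be observed from user space. The following is a minimal sketch, not kernel code; the function name fork_return_demo and the exit code 42 are illustrative choices.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork once and let each side act on the value fork() handed it.
 * Returns the child's exit status as seen by the parent, or -1 on error.
 * (fork_return_demo and the exit code 42 are hypothetical, for illustration.)
 */
int fork_return_demo(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;              /* fork failed, no child exists */

    if (pid == 0) {
        /* Child: fork() returned 0 here. */
        _exit(42);
    }

    /* Parent: fork() returned the child's PID, which is always positive. */
    assert(pid > 0);
    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The child uses _exit() rather than exit() so that it does not flush stdio buffers inherited from the parent.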
After fork(), the child often executes a new, different program right away. The exec() family of functions does this by creating a new address space and loading the new program into it.
Linux refers to processes as tasks from a kernel perspective.
Process Descriptor and Task Structure
The kernel maintains a list of all processes in a doubly linked circular list called the task list. Each list item is a task_struct (process descriptor) defined in , containing all necessary information for kernel process management: open files, address space, pending signals, process state, and more.
Figure 3-1 Process Descriptor and Task Queue
Process Descriptor Allocation
Prior to the 2.6 kernel series, task_struct was stored at the end of each process's kernel stack. Since 2.6, the slab allocator creates task_struct dynamically, so only a small struct thread_info is placed at the end of the stack: at the bottom for stacks that grow down, at the top for stacks that grow up.
In x86 architecture, struct thread_info is defined in :
/* version 2.6, x86 */
struct thread_info {
struct task_struct *task;
struct exec_domain *exec_domain;
__u32 flags;
__u32 status;
__u32 cpu;
int preempt_count;
mm_segment_t addr_limit;
struct restart_block restart_block;
void *sysenter_return;
int uaccess_err;
};
Figure 3-2 Process Descriptor and Kernel Stack
In kernel version 5.10.220, x86 struct thread_info is defined as:
/* version 5.10.220, x86 */
struct thread_info {
unsigned long flags; /* low level flags */
u32 status; /* thread synchronous flags */
};
This differs from the 2.6 version, as there is no direct pointer to struct task_struct. In 5.10.220, struct task_struct defines thread_info as its first member:
/* version 5.10.220, x86 */
struct task_struct{
#ifdef CONFIG_THREAD_INFO_IN_TASK
/*
* For reasons of header soup (see current_thread_info()), this
* must be the first element of task_struct.
*/
struct thread_info thread_info;
#endif
...........
};
Thus, in 5.10.220 on x86, thread_info no longer lives on the kernel stack. It is embedded at the start of task_struct, so a pointer to a task_struct is also a pointer to its thread_info.
In the same kernel version, ARM's struct thread_info is defined in :
/* version 5.10.220, arm */
/*
* low level task data that entry.S needs immediate access to.
* __switch_to() assumes cpu_context follows immediately after cpu_domain.
*/
struct thread_info {
unsigned long flags; /* low level flags */
int preempt_count; /* 0 => preemptable, <0 => bug */
mm_segment_t addr_limit; /* address limit */
struct task_struct *task; /* main task structure */
__u32 cpu; /* cpu */
__u32 cpu_domain; /* cpu domain */
#ifdef CONFIG_STACKPROTECTOR_PER_TASK
unsigned long stack_canary;
#endif
struct cpu_context_save cpu_context; /* cpu context */
__u32 syscall; /* syscall number */
__u8 used_cp[16]; /* thread used copro */
unsigned long tp_value[2]; /* TLS registers */
#ifdef CONFIG_CRUNCH
struct crunch_state crunchstate;
#endif
union fp_state fpstate __attribute__((aligned(8)));
union vfp_state vfpstate;
#ifdef CONFIG_ARM_THUMBEE
unsigned long thumbee_state; /* ThumbEE Handler Base register */
#endif
};
Process Descriptor Storage
Processes are uniquely identified by their process ID (PID), a numeric identifier of type pid_t (effectively an int). The maximum PID value is defined in , and the smaller it is, the sooner PID values wrap around and are reused.
/*
* A maximum of 4 million PIDs should be enough for a while.
* [NOTE: PID/TIDs are limited to 2^30 ~= 1 billion, see FUTEX_TID_MASK.]
*/
#define PID_MAX_LIMIT (CONFIG_BASE_SMALL ? PAGE_SIZE * 8 : \
(sizeof(long) > 4 ? 4 * 1024 * 1024 : PID_MAX_DEFAULT))
Users can adjust the upper limit via /proc/sys/kernel/pid_max.
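The current limit can also be read programmatically. This is a small user-space sketch; the helper name read_pid_max is an assumption made for illustration, and the function returns -1 when procfs is unavailable.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read the current PID limit from procfs.
 * Returns the limit, or -1 if the file cannot be opened or parsed.
 * (read_pid_max is a hypothetical helper name.)
 */
long read_pid_max(void)
{
    FILE *f = fopen("/proc/sys/kernel/pid_max", "r");
    if (!f)
        return -1;

    long value = -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

Writing a new value to the same file requires root privileges.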
Figure 3-3 Ubuntu24.04 x86_64 PID_MAX=4194304
Accessing a task usually requires a pointer to its task_struct, so the current macro, which finds the descriptor of the currently executing process, must be fast. Some architectures keep this pointer in a dedicated register. 2.6-era x86, which has few registers to spare, instead stores thread_info at the end of the kernel stack and computes its address from the stack pointer.
On x86, current_thread_info() masks out the 13 least-significant bits of the stack pointer to obtain the thread_info address:
movl $-8192, %eax # kernel stack size is 8 KB
# movl $-4096, %eax # kernel stack size is 4 KB
andl %esp, %eax
Then, it retrieves the task_struct address from the task field:
current_thread_info()->task;
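The masking step is plain integer arithmetic: -8192 has the same bit pattern as ~8191, so the AND clears the low 13 bits and rounds the stack pointer down to the start of its 8 KB stack. A user-space sketch of the same calculation follows; DEMO_THREAD_SIZE and demo_stack_base are hypothetical names, and the addresses used to exercise it are arbitrary.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constant: an 8 KB kernel stack, matching the assembly above. */
#define DEMO_THREAD_SIZE 8192u

/*
 * Round a stack address down to the base of its DEMO_THREAD_SIZE-aligned
 * stack, which is where thread_info sat on 2.6-era x86.
 */
uintptr_t demo_stack_base(uintptr_t sp)
{
    return sp & ~((uintptr_t)DEMO_THREAD_SIZE - 1);
}
```

For example, a stack pointer of 0xC0123ABC maps to the base 0xC0122000. The trick only works because kernel stacks are allocated with an alignment equal to their size.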
In kernel 5.10.220, current_thread_info() is implemented as:
/* <linux/thread_info.h> */
#ifdef CONFIG_THREAD_INFO_IN_TASK
/*
* For CONFIG_THREAD_INFO_IN_TASK kernels we need <asm/current.h> for the
* definition of current, but for !CONFIG_THREAD_INFO_IN_TASK kernels,
* including <asm/current.h> can cause a circular dependency on some platforms.
*/
#include <asm/current.h>
#define current_thread_info() ((struct thread_info *)current)
#endif
/* <asm/current.h> */
DECLARE_PER_CPU(struct task_struct *, current_task);
static __always_inline struct task_struct *get_current(void)
{
return this_cpu_read_stable(current_task);
}
#define current get_current()
/* arch/x86/kernel/cpu/common.c */
/*
* The following percpu variables are hot. Align current_task to
* cacheline size such that they fall in the same cacheline.
*/
DEFINE_PER_CPU(struct task_struct *, current_task) ____cacheline_aligned =
&init_task;
EXPORT_PER_CPU_SYMBOL(current_task);
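On x86 in kernel 5.10.220, current reads the per-CPU variable current_task, and current_thread_info() simply casts that pointer to struct thread_info *, which is valid because thread_info is the first member of task_struct.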
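The cast relies on a C guarantee: a pointer to a structure, suitably converted, points to its first member. The sketch below shows this with stand-in types; demo_task, demo_thread_info, and demo_thread_info_of are hypothetical and only mimic the layout, not the real kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; only the layout matters. */
struct demo_thread_info {
    unsigned long flags;
    unsigned int status;
};

struct demo_task {
    struct demo_thread_info thread_info;   /* must stay the first member */
    int pid;
};

/* Fail the build if someone moves thread_info away from offset 0. */
static_assert(offsetof(struct demo_task, thread_info) == 0,
              "thread_info must be the first member");

/* Mirrors the cast in current_thread_info(): task pointer to thread_info pointer. */
struct demo_thread_info *demo_thread_info_of(struct demo_task *t)
{
    return (struct demo_thread_info *)t;
}
```

The compile-time check plays the role of the "must be the first element" comment in the kernel source.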
Process States
There are five process states:
- TASK_RUNNING (Running) — Process is runnable: either executing or waiting in the run queue.
- TASK_INTERRUPTIBLE (Interruptible) — Sleeping (blocked) waiting for conditions; kernel sets to running upon condition fulfillment.
- TASK_UNINTERRUPTIBLE (Uninterruptible) — Similar to interruptible, but not awakened by signals.
- __TASK_STOPPED — Process suspended, typically due to SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU.
- __TASK_TRACED — Process being traced by another process, e.g., via ptrace.
Figure 3-3 Process State Transitions
/* <linux/sched.h> */
/* Used in tsk->state: */
#define TASK_RUNNING 0x0000
#define TASK_INTERRUPTIBLE 0x0001
#define TASK_UNINTERRUPTIBLE 0x0002
#define __TASK_STOPPED 0x0004
#define __TASK_TRACED 0x0008
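User space sees these states as one-letter codes in /proc/<pid>/stat: R for TASK_RUNNING, S for TASK_INTERRUPTIBLE, D for TASK_UNINTERRUPTIBLE, T for stopped, t for traced, and Z for a zombie. The sketch below reads that code; proc_state and own_state are hypothetical helper names, and '?' is returned when procfs cannot be read.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Return the one-letter state code of a process from /proc/<pid>/stat,
 * or '?' if it cannot be read. (proc_state is a hypothetical helper.)
 */
char proc_state(pid_t pid)
{
    char path[64], buf[512];
    snprintf(path, sizeof(path), "/proc/%ld/stat", (long)pid);

    FILE *f = fopen(path, "r");
    if (!f)
        return '?';
    size_t n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';

    /* The command name is parenthesised and may contain spaces or
     * parentheses, so locate its end by searching from the right. */
    char *close_paren = strrchr(buf, ')');
    if (!close_paren || close_paren[1] != ' ' || close_paren[2] == '\0')
        return '?';
    return close_paren[2];      /* the state field follows ") " */
}

/* A process inspecting itself is on a CPU, so this normally yields 'R'. */
char own_state(void)
{
    return proc_state(getpid());
}
```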
Setting Process State
In 2.6-era kernels, a process's state is changed with set_task_state(task, state):
set_task_state(task, state); /* set task 'task' to state 'state' */
This sets the given task's state and inserts a memory barrier where needed to enforce ordering on other processors (only required on SMP systems). Otherwise, it is equivalent to:
task->state = state;
set_current_state(state) is synonymous with set_task_state(current, state). Later kernels removed set_task_state(); 5.10 keeps only the current-task variants:
/* <linux/sched.h> */
#define __set_current_state(state_value) \
current->state = (state_value)
#define set_current_state(state_value) \
smp_store_mb(current->state, (state_value))
Process Context
Executing program code is the central job of a process. Normal execution happens in user space; when a program makes a system call or triggers an exception, it enters kernel mode, and the kernel is said to be executing on behalf of the process, in process context. The current macro is valid in this context.
System calls and exception handlers are well-defined interfaces into the kernel. A process can begin executing in kernel space only through one of them, so all access to the kernel goes through these interfaces.
Process Family Tree
Linux processes form a hierarchy. All processes are descendants of init (PID 1). The kernel starts init in the last step of the boot process; init then reads the system init scripts and executes further programs to complete booting.
Each process has a parent and zero or more children. Processes with the same parent are siblings. Relationships are stored in task_struct, which includes a parent pointer and a children list.
struct task_struct *my_parent = current->parent;
Process Creation
Unix separates process creation and execution into two distinct functions: fork() and exec(). First, fork() copies the current process to create a child, differing only in PID, PPID, and certain resources. Then, exec() loads and runs the executable file.
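The two-step pattern looks like this from user space. This is a minimal sketch that assumes /bin/sh exists; run_shell_command is a hypothetical helper name, and 127 is used as the conventional "could not execute" status.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run "/bin/sh -c <command>" in a child and return its exit status,
 * or -1 if the child could not be created or did not exit normally.
 * (run_shell_command is a hypothetical helper name.)
 */
int run_shell_command(const char *command)
{
    pid_t pid = fork();                 /* step 1: duplicate the caller */
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* step 2 (child only): replace the process image with the shell */
        execl("/bin/sh", "sh", "-c", command, (char *)NULL);
        _exit(127);                     /* reached only if execl() failed */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_shell_command("exit 7") returns 7: the child keeps the PID that fork() gave it, while exec() only swaps the program running inside it.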
Copy-on-Write
Linux fork() uses copy-on-write pages. Instead of duplicating the entire address space, parent and child share a single read-only copy. A page is duplicated only when one of them writes to it, so the copy is deferred until it is actually needed and never happens for pages that are only read.
The only overhead fork() incurs is duplicating the parent's page tables and creating a unique task_struct for the child.
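Copy-on-write is invisible to programs; what they observe is that a write in the child never reaches the parent. The sketch below demonstrates that observable behaviour (it cannot show the page sharing itself). The names cow_demo and demo_value are hypothetical.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int demo_value = 1;      /* lives in a page shared after fork() */

/*
 * Let the child overwrite demo_value, then return the parent's view of it.
 * Returns -1 on error. (cow_demo and demo_value are hypothetical names.)
 */
int cow_demo(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* The first write makes the kernel give the child a private page. */
        demo_value = 99;
        _exit(demo_value == 99 ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return demo_value;          /* the parent's copy is untouched */
}
```

The function returns 1, the value the parent had before the fork.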
fork()
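Linux implements fork() via clone(), which takes flags specifying which resources parent and child share. In 2.6 kernels, fork(), vfork(), and __clone() all invoke the clone() system call, which in turn calls do_fork(). In 5.10, the common entry point is kernel_clone(), listed later in this section.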
/*
* cloning flags:
*/
#define CSIGNAL 0x000000ff /* signal mask to be sent at exit */
#define CLONE_VM 0x00000100 /* set if VM shared between processes */
#define CLONE_FS 0x00000200 /* set if fs info shared between processes */
#define CLONE_FILES 0x00000400 /* set if open files shared between processes */
#define CLONE_SIGHAND 0x00000800 /* set if signal handlers and blocked signals shared */
#define CLONE_PIDFD 0x00001000 /* set if a pidfd should be placed in parent */
#define CLONE_PTRACE 0x00002000 /* set if we want to let tracing continue on the child too */
#define CLONE_VFORK 0x00004000 /* set if the parent wants the child to wake it up on mm_release */
#define CLONE_PARENT 0x00008000 /* set if we want to have the same parent as the cloner */
#define CLONE_THREAD 0x00010000 /* Same thread group? */
#define CLONE_NEWNS 0x00020000 /* New mount namespace group */
#define CLONE_SYSVSEM 0x00040000 /* share system V SEM_UNDO semantics */
#define CLONE_SETTLS 0x00080000 /* create a new TLS for the child */
#define CLONE_PARENT_SETTID 0x00100000 /* set the TID in the parent */
#define CLONE_CHILD_CLEARTID 0x00200000 /* clear the TID in the child */
#define CLONE_DETACHED 0x00400000 /* Unused, ignored */
#define CLONE_UNTRACED 0x00800000 /* set if the tracing process can't force CLONE_PTRACE on this clone */
#define CLONE_CHILD_SETTID 0x01000000 /* set the TID in the child */
#define CLONE_NEWCGROUP 0x02000000 /* New cgroup namespace */
#define CLONE_NEWUTS 0x04000000 /* New utsname namespace */
#define CLONE_NEWIPC 0x08000000 /* New ipc namespace */
#define CLONE_NEWUSER 0x10000000 /* New user namespace */
#define CLONE_NEWPID 0x20000000 /* New pid namespace */
#define CLONE_NEWNET 0x40000000 /* New network namespace */
#define CLONE_IO 0x80000000 /* Clone io context */
do_fork() (kernel_clone() in 5.10) handles most of the work: it calls copy_process() to create the new process and then starts it running. copy_process() proceeds as follows:
- dup_task_struct() allocates a new kernel stack, thread_info, and task_struct for the child, identical to the parent.
- Checks resource limits for the new process count.
- Differentiates parent and child by clearing or initializing members, mostly inheriting from the parent.
- Sets child state to TASK_UNINTERRUPTIBLE to prevent execution.
- copy_process() updates task_struct flags, clearing PF_SUPERPRIV and setting PF_FORKNOEXEC.
- Allocates a new PID with alloc_pid().
- Shares or copies resources based on clone() flags.
- Finalizes and returns a pointer to the child.
If copy_process() succeeds, the new child is woken up and run. The 2.6 design deliberately tried to run the child first: because the child typically calls exec() immediately, this avoids the copy-on-write faults that would occur if the parent ran first and wrote to the shared pages.
SYSCALL_DEFINE0(fork)
{
#ifdef CONFIG_MMU
struct kernel_clone_args args = {
.exit_signal = SIGCHLD,
};
return kernel_clone(&args);
#else
/* can not support in nommu mode */
return -EINVAL;
#endif
}
vfork()
vfork() differs from fork() in that the parent's page table entries are not copied. The child runs as the sole thread in the parent's address space, and the parent is blocked until the child either calls exec() or exits. The child must not write to the address space.
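A user-space sketch of the usual vfork() pattern follows, in which the child does nothing except exec or _exit. It assumes /bin/sh exists; vfork_exec_demo and the exit code 3 are illustrative choices.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Start a shell through vfork() and return its exit status, or -1 on error.
 * (vfork_exec_demo is a hypothetical name.)
 */
int vfork_exec_demo(void)
{
    pid_t pid = vfork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: running on borrowed memory, so only exec or _exit is safe. */
        execl("/bin/sh", "sh", "-c", "exit 3", (char *)NULL);
        _exit(127);             /* reached only if execl() failed */
    }

    /* The parent gets here only after the child has called exec or _exit. */
    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because fork() already uses copy-on-write, the saving over fork() is limited to not copying the page tables.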
The implementation passes the CLONE_VFORK flag to clone():
- During copy_process(), the task_struct member vfork_done is set to NULL.
- In do_fork(), if CLONE_VFORK is set, vfork_done is pointed at a completion on the caller's stack.
- Child starts first; parent waits until signaled by vfork_done.
- mm_release() checks vfork_done and signals parent.
- Parent resumes and returns.
#ifdef __ARCH_WANT_SYS_VFORK
SYSCALL_DEFINE0(vfork)
{
struct kernel_clone_args args = {
.flags = CLONE_VFORK | CLONE_VM,
.exit_signal = SIGCHLD,
};
return kernel_clone(&args);
}
#endif
/*
* Ok, this is the main fork-routine.
*
* It copies the process, and if successful kick-starts
* it and waits for it to finish using the VM if required.
*
* args->exit_signal is expected to be checked for sanity by the caller.
*/
pid_t kernel_clone(struct kernel_clone_args *args)
{
u64 clone_flags = args->flags;
struct completion vfork;
struct pid *pid;
struct task_struct *p;
int trace = 0;
pid_t nr;
/*
* For legacy clone() calls, CLONE_PIDFD uses the parent_tid argument
* to return the pidfd. Hence, CLONE_PIDFD and CLONE_PARENT_SETTID are
* mutually exclusive. With clone3() CLONE_PIDFD has grown a separate
* field in struct clone_args and it still doesn't make sense to have
* them both point at the same memory location. Performing this check
* here has the advantage that we don't need to have a separate helper
* to check for legacy clone().
*/
if ((args->flags & CLONE_PIDFD) &&
(args->flags & CLONE_PARENT_SETTID) &&
(args->pidfd == args->parent_tid))
return -EINVAL;
/*
* Determine whether and which event to report to ptracer. When
* called from kernel_thread or CLONE_UNTRACED is explicitly
* requested, no event is reported; otherwise, report if the event
* for the type of forking is enabled.
*/
if (!(clone_flags & CLONE_UNTRACED)) {
if (clone_flags & CLONE_VFORK)
trace = PTRACE_EVENT_VFORK;
else if (args->exit_signal != SIGCHLD)
trace = PTRACE_EVENT_CLONE;
else
trace = PTRACE_EVENT_FORK;
if (likely(!ptrace_event_enabled(current, trace)))
trace = 0;
}
p = copy_process(NULL, trace, NUMA_NO_NODE, args);
add_latent_entropy();
if (IS_ERR(p))
return PTR_ERR(p);
/*
* Do this prior waking up the new thread - the thread pointer
* might get invalid after that point, if the thread exits quickly.
*/
trace_sched_process_fork(current, p);
pid = get_task_pid(p, PIDTYPE_PID);
nr = pid_vnr(pid);
if (clone_flags & CLONE_PARENT_SETTID)
put_user(nr, args->parent_tid);
if (clone_flags & CLONE_VFORK) {
p->vfork_done = &vfork;
init_completion(&vfork);
get_task_struct(p);
}
wake_up_new_task(p);
/* forking complete and child started to run, tell ptracer */
if (unlikely(trace))
ptrace_event_pid(trace, pid);
if (clone_flags & CLONE_VFORK) {
if (!wait_for_vfork_done(p, &vfork))
ptrace_event_pid(PTRACE_EVENT_VFORK_DONE, pid);
}
put_pid(pid);
return nr;
}
Thread Implementation in Linux
Linux implements threads by treating them as regular processes sharing memory spaces. Threads can share open files and other resources. This mechanism supports concurrent programming and enables true parallelism on multi-processor systems.
From the kernel perspective, Linux does not distinguish threads from processes. Each thread has its own unique task_struct, making them appear as normal processes.
Creating Threads
Creating threads is similar to creating processes, differing only in parameters passed to clone():
clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, 0);
Standard fork():
clone(SIGCHLD, 0);
vfork():
clone(CLONE_VFORK | CLONE_VM | SIGCHLD, 0);
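The calls above are schematic. With glibc's clone() wrapper, which takes a function and a stack for the child, the effect of CLONE_VM can be sketched as follows. This is a simplified illustration and not how a threading library creates threads: it passes only CLONE_VM | SIGCHLD so the child can be reaped with waitpid(), and clone_vm_demo, demo_child, and DEMO_STACK_SIZE are hypothetical names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define DEMO_STACK_SIZE (64 * 1024)   /* hypothetical size, ample for this child */

/* Child entry point: runs in the parent's address space because of CLONE_VM. */
static int demo_child(void *arg)
{
    *(int *)arg = 1234;         /* this write is visible to the parent */
    return 0;
}

/*
 * Create a child that shares our address space and return the value it
 * wrote into our local variable, or -1 on failure.
 */
int clone_vm_demo(void)
{
    int shared = 0;
    char *stack = malloc(DEMO_STACK_SIZE);
    if (!stack)
        return -1;

    /* Stacks grow down on x86 and ARM, so pass the top of the buffer. */
    pid_t pid = clone(demo_child, stack + DEMO_STACK_SIZE,
                      CLONE_VM | SIGCHLD, &shared);
    if (pid < 0) {
        free(stack);
        return -1;
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;              /* leave the stack alone: it may still be in use */
    free(stack);
    return shared;
}
```

The function returns 1234. Replacing CLONE_VM | SIGCHLD with plain SIGCHLD gives fork()-like behaviour, and the parent's variable would stay 0.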
Table 3-1 clone() Flags
| Flag | Description |
|---|---|
| CLONE_FILES | Share open files |
| CLONE_FS | Share filesystem info |
| CLONE_IDLETASK | Set PID to 0 (idle only) |
| CLONE_NEWNS | Create new namespace |
| CLONE_PARENT | Same parent as cloner |
| CLONE_PTRACE | Continue tracing child |
| CLONE_SETTID | Write TID to userspace |
| CLONE_SETTLS | Create new TLS |
| CLONE_SIGHAND | Share signal handlers and blocked signals |
| CLONE_SYSVSEM | Share System V SEM_UNDO semantics |
| CLONE_THREAD | Same thread group |
| CLONE_VFORK | Parent sleeps until child exits or execs |
| CLONE_UNTRACED | Prevent ptrace from forcing CLONE_PTRACE |
| CLONE_STOP | Start with TASK_STOPPED |
| CLONE_CHILD_CLEARTID | Clear child TID |
| CLONE_CHILD_SETTID | Set child TID |
| CLONE_PARENT_SETTID | Set parent TID |
| CLONE_VM | Share address space |
Kernel Threads
Kernel threads are standard processes running in kernel space, performing background tasks. Unlike regular processes, they lack independent address spaces (mm pointer is NULL). They are schedulable and preemptible.
New kernel threads are spawned from kthreadd. The interface is declared in :
static __printf(4, 0)
struct task_struct *__kthread_create_on_node(int (*threadfn)(void *data),
void *data, int node,
const char namefmt[],
va_list args)
{
DECLARE_COMPLETION_ONSTACK(done);
struct task_struct *task;
struct kthread_create_info *create = kmalloc(sizeof(*create),
GFP_KERNEL);
if (!create)
return ERR_PTR(-ENOMEM);
create->threadfn = threadfn;
create->data = data;
create->node = node;
create->done = &done;
spin_lock(&kthread_create_lock);
list_add_tail(&create->list, &kthread_create_list);
spin_unlock(&kthread_create_lock);
wake_up_process(kthreadd_task);
/*
* Wait for completion in killable state, for I might be chosen by
* the OOM killer while kthreadd is trying to allocate memory for
* new kernel thread.
*/
if (unlikely(wait_for_completion_killable(&done))) {
/*
* If I was SIGKILLed before kthreadd (or new kernel thread)
* calls complete(), leave the cleanup of this structure to
* that thread.
*/
if (xchg(&create->done, NULL))
return ERR_PTR(-EINTR);
/*
* kthreadd (or new kernel thread) will call complete()
* shortly.
*/
wait_for_completion(&done);
}
task = create->result;
if (!IS_ERR(task)) {
static const struct sched_param param = { .sched_priority = 0 };
char name[TASK_COMM_LEN];
/*
* task is already visible to other tasks, so updating
* COMM must be protected.
*/
vsnprintf(name, sizeof(name), namefmt, args);
set_task_comm(task, name);
/*
* root may have changed our (kthreadd's) priority or CPU mask.
* The kernel thread should not inherit these properties.
*/
sched_setscheduler_nocheck(task, SCHED_NORMAL, &param);
set_cpus_allowed_ptr(task,
housekeeping_cpumask(HK_FLAG_KTHREAD));
}
kfree(create);
return task;
}
kthreadd creates the new thread via clone(); it runs threadfn with data as its argument and is named according to namefmt. The thread is created in a non-runnable state and does not start until it is woken with wake_up_process(). The kthread_run() macro combines both steps:
#define kthread_run(threadfn, data, namefmt, ...) \
({ \
struct task_struct *__k \
= kthread_create(threadfn, data, namefmt, ## __VA_ARGS__); \
if (!IS_ERR(__k)) \
wake_up_process(__k); \
__k; \
})
Once started, a kernel thread runs until it calls do_exit() or until another part of the kernel calls kthread_stop() on it.
/**
* kthread_stop - stop a thread created by kthread_create().
* @k: thread created by kthread_create().
*
* Sets kthread_should_stop() for @k to return true, wakes it, and
* waits for it to exit. This can also be called after kthread_create()
* instead of calling wake_up_process(): the thread will exit without
* calling threadfn().
*
* If threadfn() may call kthread_exit() itself, the caller must ensure
* task_struct can't go away.
*
* Returns the result of threadfn(), or %-EINTR if wake_up_process()
* was never called.
*/
int kthread_stop(struct task_struct *k)
{
struct kthread *kthread;
int ret;
trace_sched_kthread_stop(k);
get_task_struct(k);
kthread = to_kthread(k);
set_bit(KTHREAD_SHOULD_STOP, &kthread->flags);
kthread_unpark(k);
wake_up_process(k);
wait_for_completion(&kthread->exited);
ret = k->exit_code;
put_task_struct(k);
trace_sched_kthread_stop_ret(ret);
return ret;
}
Process Termination
When a process terminates, the kernel must release its resources and notify the parent.
The main work is handled by do_exit() in :
- Sets PF_EXITING flag in task_struct;
- Deletes kernel timers with del_timer_sync();
- Outputs accounting info if enabled;
- Releases mm_struct with exit_mm();
- Dequeues the process from any IPC semaphore it is waiting on with exit_sem();
- Decrements file reference counts with exit_files() and exit_fs();
- Stores exit code in exit_code;
- Calls exit_notify() to notify the parent, reparent the task's children, and set the exit state to EXIT_ZOMBIE;
- Calls schedule() to switch to another process.
Upon reaching EXIT_ZOMBIE, the process is no longer scheduled. It exists solely to provide information to its parent. After the parent retrieves the info, the remaining memory is freed.
Removing Process Descriptors
After do_exit(), the descriptor remains to allow retrieval of process information. Cleanup and descriptor deletion are separated. The task_struct is released after the parent gets the child's information.
The wait() family of functions is implemented via a single system call, wait4(). The call suspends the caller until one of its children exits, then returns the child's PID and makes its exit code available.
When ready to free the descriptor, release_task() is invoked:
- Calls __exit_signal(), which unhashes the process and removes it from the pidhash and task list.
- Releases remaining resources of the zombie process.
- If the task was the last member of a thread group and the group leader is a zombie, notifies the leader's parent.
- Frees the kernel stack, thread_info, and task_struct from the slab cache.
Orphaned Processes
If a parent exits before its children, the children must be given a new parent. Otherwise, once they exit they would remain zombies forever, wasting memory. The solution is to reparent them to another thread in the parent's thread group or, failing that, to init. This happens in exit_notify(), which calls forget_original_parent(), which in turn uses find_new_reaper() to pick the new parent.
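Reparenting can be seen from user space by watching getppid() in a grandchild whose parent exits. The sketch below is illustrative only: orphan_demo is a hypothetical name, the 5-second polling limit is arbitrary, and the new parent may be init or another designated reaper, so the code checks only that the parent PID changed.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Returns 1 if a grandchild saw its parent PID change after its parent
 * exited, 0 if it did not, -1 on error. (orphan_demo is a hypothetical name.)
 */
int orphan_demo(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t child = fork();
    if (child < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (child == 0) {
        pid_t me = getpid();
        pid_t grandchild = fork();
        if (grandchild == 0) {
            /* Grandchild: poll for up to about 5 s until we are reparented. */
            struct timespec ms = { 0, 1000000 };
            for (int i = 0; i < 5000 && getppid() == me; i++)
                nanosleep(&ms, NULL);
            int reparented = (getppid() != me);
            if (write(fds[1], &reparented, sizeof(reparented)) !=
                (ssize_t)sizeof(reparented))
                _exit(1);
            _exit(0);
        }
        /* Child: exit straight away, orphaning the grandchild. */
        _exit(grandchild < 0 ? 1 : 0);
    }

    close(fds[1]);              /* keep only the read end so read() can hit EOF */
    int status = 0, reparented = 0;
    if (waitpid(child, &status, 0) != child) {
        close(fds[0]);
        return -1;
    }
    ssize_t n = read(fds[0], &reparented, sizeof(reparented));
    close(fds[0]);
    return n == (ssize_t)sizeof(reparented) ? reparented : -1;
}
```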