ARM Linux Memory Layout and MMU Configuration: Device Tree, Memblock, and Page Tables

Kernel virtual address map on 32‑bit ARM

  • PAGE_OFFSET: base of kernel virtual space at 0xC000_0000
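  • lowmem: direct linear mapping of RAM into the kernel (virt = phys − PHYS_OFFSET + PAGE_OFFSET); its size is bounded by the space left below the vmalloc area (about 760 MiB with the default 240 MiB vmalloc on a 3G/1G split; the often‑quoted 896 MiB is the x86 figure)
  • high_memory: first virtual address not linearly mapped (end of lowmem)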
  • pkmap: permanent mappings for highmem pages (kmap/kunmap), typically ~2 MiB but platform specific
  • guard gap: unmapped padding between regions to reduce overrun risk (often ~8 MiB)
  • vmalloc: non‑contiguous kernel virtual region for vmalloc()/ioremap()
  • DMA: region used for consistent DMA allocations depending on platform constraints
  • fixmap: fixed virtual slots for atomic mappings (kmap_atomic), usable in interrupt context
  • vector: CPU exception vectors mapping
  • modules: loadable kernel modules region (about 16 MiB, ~14 MiB when HIGHMEM is enabled)

Lowmem is reduced by subtracting reserved areas described by firmware or device tree (peripherals, co‑processors, secure world, display splash, etc.). The discussion below traces how those reservations are discovered and applied.

Bootloader populates the device tree memory node

A bootloader commonly patches the /memory node in the flattened device tree to reflect discovered DRAM windows before transferring control to the kernel. A typical flow:

int fdt_update_memory(void *blob)
{
    int off = fdt_path_offset(blob, "/memory");
    if (off < 0)
        return off;
    return target_populate_mem(blob, off);
}

The skeleton DTS (64‑bit example) provides the /memory placeholder so the bootloader can fill in reg later:

/ {
    #address-cells = <2>;
    #size-cells = <2>;
    cpus { };
    soc { };
    chosen { };
    aliases { };
    memory { device_type = "memory"; reg = <0 0 0 0>; };
};
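
The board helper target_populate_mem() is not shown in the source. As a minimal sketch of what such a helper has to produce, the code below packs DRAM banks into the big‑endian cell layout that a reg property uses when #address-cells and #size-cells are both 2. The names dram_bank, put_cell_be32 and pack_reg_cells are hypothetical; a real bootloader would then hand the resulting buffer to libfdt's fdt_setprop() for the "reg" property.

```c
#include <stddef.h>
#include <stdint.h>

/* Store one 32-bit cell in big-endian byte order, as the FDT format requires. */
static void put_cell_be32(uint8_t *dst, uint32_t v)
{
    dst[0] = (uint8_t)(v >> 24);
    dst[1] = (uint8_t)(v >> 16);
    dst[2] = (uint8_t)(v >> 8);
    dst[3] = (uint8_t)v;
}

/* Hypothetical description of one DRAM window found by the bootloader. */
struct dram_bank {
    uint64_t base;
    uint64_t size;
};

/*
 * Hypothetical helper: encode each bank as <addr_hi addr_lo size_hi size_lo>
 * (#address-cells = 2, #size-cells = 2, so 16 bytes per bank).
 * Returns the number of bytes written, or 0 if the buffer is too small.
 */
static size_t pack_reg_cells(const struct dram_bank *banks, size_t n,
                             uint8_t *out, size_t out_len)
{
    size_t need = n * 16;

    if (need > out_len)
        return 0;

    for (size_t i = 0; i < n; i++) {
        uint8_t *p = out + i * 16;

        put_cell_be32(p + 0,  (uint32_t)(banks[i].base >> 32));
        put_cell_be32(p + 4,  (uint32_t)banks[i].base);
        put_cell_be32(p + 8,  (uint32_t)(banks[i].size >> 32));
        put_cell_be32(p + 12, (uint32_t)banks[i].size);
    }
    return need;
}
```

For the two chip‑selects used later in this article, the buffer would be 32 bytes long and begin with the cells 0x00000000 0x80000000 0x00000000 0x30000000.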

Kernel imports memory from DT into memblock

During early boot the kernel scans the flat DT for memory descriptors and seeds memblock with the discovered ranges:

static void __init parse_fdt_memory(void)
{
    of_scan_flat_dt(early_init_dt_scan_memory, NULL);
}

Simplified logic of the memory scanner:

int __init early_init_dt_scan_memory(unsigned long node, const char *name,
                                     int depth, void *arg)
{
    const char *t = of_get_flat_dt_prop(node, "device_type", NULL);
    const __be32 *p;
    int len;

    /* accept "memory" nodes (or legacy /memory@0 on some platforms) */
    if ((!t && depth == 1 && !strcmp(name, "memory@0")) ||
        (t && !strcmp(t, "memory"))) {
        p = of_get_flat_dt_prop(node, "linux,usable-memory", &len);
        if (!p)
            p = of_get_flat_dt_prop(node, "reg", &len);
        if (!p)
            return 0;

        const __be32 *end = p + (len / sizeof(*p));
        while ((end - p) >= (dt_root_addr_cells + dt_root_size_cells)) {
            u64 base = dt_mem_next_cell(dt_root_addr_cells, &p);
            u64 size = dt_mem_next_cell(dt_root_size_cells, &p);
            if (!size)
                continue;
            early_init_dt_add_memory_arch(base, size);
        }
    }
    return 0;
}
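
The cell decoding done by dt_mem_next_cell() can be modelled outside the kernel. The sketch below is a userspace approximation, not the kernel implementation: it works on a byte buffer rather than __be32 pointers, and the names get_cell_be32 and next_cells are made up for illustration.

```c
#include <stdint.h>

/* Read one big-endian 32-bit cell from a byte buffer. */
static uint32_t get_cell_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/*
 * Userspace model of dt_mem_next_cell(): fold `ncells` consecutive cells
 * into one 64-bit value and advance the cursor past them.
 */
static uint64_t next_cells(int ncells, const uint8_t **cursor)
{
    uint64_t v = 0;

    while (ncells-- > 0) {
        v = (v << 32) | get_cell_be32(*cursor);
        *cursor += 4;
    }
    return v;
}
```

Calling next_cells() twice per tuple, first with the address cell count and then with the size cell count, mirrors the base/size pair read in the loop above.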

Memblock state before platform reservations

Immediately after importing DT memory (before platform‑specific carveouts and CMA regions are removed), a sample trace might look like:

[    0.000000] memblock.memory.cnt = 2
[    0.000000] vmalloc_limit       = 0xa9c00000
[    0.000000] region#1: base=0x80000000 size=0x2fd00000
[    0.000000] region#2: base=0xb0000000 size=0x30000000
[    0.000000] arm_lowmem_limit    = 0xa9c00000

The example platform has two chip‑selects:

  • CS0: 0x8000_0000 .. +0x3000_0000
  • CS1: 0xB000_0000 .. +0x3000_0000

A gap of ~3 MiB near the end of CS0 may be pre‑reserved by the boot chain. Debug logs often reveal a small secure debug carveout:

[    0.000000] sec_dbg_setup: secdbg_paddr = 0xaff00008
[    0.000000] sec_dbg_setup: secdbg_size  = 0x00080000

Reserving CMA and device carveouts from DT

During architecture setup, the kernel removes regions tagged for CMA/peripheral use, typically via:

start_kernel()
  -> setup_arch()
     -> arm_memblock_init()
        -> dma_contiguous_reserve()
           -> dma_contiguous_early_removal_fixup()

Example CMA reservation discovery from DT:

[    0.000000] cma: Found external_image__region@0 base 0x85500000 size 19 MiB
[    0.000000] cma: Found modem_adsp_region@0    base 0x86800000 size 88 MiB
[    0.000000] cma: Found pheripheral_region@0   base 0x8c000000 size 6 MiB
[    0.000000] cma: Found venus_region@0         base 0x8c600000 size 5 MiB
[    0.000000] cma: Found secure_region@0        base 0x00000000 size 109 MiB
[    0.000000] cma: Found qseecom_region@0       base 0x00000000 size 13 MiB
[    0.000000] cma: Found audio_region@0         base 0x00000000 size 3 MiB
[    0.000000] cma: Found splash_region@8E000000 base 0x8e000000 size 20 MiB

The corresponding DT reserved-memory style entries are usually embedded below the memory node. Example (values illustrative):

memory {
    #address-cells = <2>;
    #size-cells = <2>;

    external_image_mem: external_image__region@0 {
        linux,reserve-contiguous-region;
        linux,reserve-region;
        linux,remove-completely;
        reg = <0x0 0x85500000 0x0 0x01300000>; /* 19 MiB */
        label = "external_image_mem";
    };

    modem_adsp_mem: modem_adsp_region@0 {
        linux,reserve-contiguous-region;
        linux,reserve-region;
        linux,remove-completely;
        reg = <0x0 0x86800000 0x0 0x05800000>; /* 88 MiB */
        label = "modem_adsp_mem";
    };

    peripheral_mem: pheripheral_region@0 {
        linux,reserve-contiguous-region;
        linux,reserve-region;
        linux,remove-completely;
        reg = <0x0 0x8C000000 0x0 0x0600000>;  /* 6 MiB */
        label = "peripheral_mem";
    };

    venus_mem: venus_region@0 {
        linux,reserve-contiguous-region;
        linux,reserve-region;
        linux,remove-completely;
        reg = <0x0 0x8C600000 0x0 0x0500000>;  /* 5 MiB */
        label = "venus_mem";
    };

    secure_mem: secure_region@0 {
        linux,reserve-contiguous-region;
        reg = <0 0 0 0x06D00000>;               /* 109 MiB */
        label = "secure_mem";
    };

    qseecom_mem: qseecom_region@0 {
        linux,reserve-contiguous-region;
        reg = <0 0 0 0x00D00000>;               /* 13 MiB */
        label = "qseecom_mem";
    };

    audio_mem: audio_region@0 {
        linux,reserve-contiguous-region;
        reg = <0 0 0 0x00314000>;               /* ~3 MiB */
        label = "audio_mem";
    };

    cont_splash_mem: splash_region@8E000000 {
        linux,reserve-contiguous-region;
        linux,reserve-region;
        reg = <0x0 0x8E000000 0x0 0x01400000>; /* 20 MiB */
        label = "cont_splash_mem";
    };
};

After removal fixups, memblock is revalidated (sanity_check_meminfo() is re‑run), yielding new memory segments for the kernel:

[    0.000000] vmalloc_limit       = 0xa9c00000
[    0.000000] region#1: base=0x80000000 size=0x05500000
[    0.000000] region#2: base=0x8cb00000 size=0x23200000
[    0.000000] region#3: base=0xb0000000 size=0x30000000
[    0.000000] arm_lowmem_limit    = 0xb1200000

From the two snapshots you can infer which ranges were removed from CS0:

  • external_image_mem: 0x8550_0000 – 0x8680_0000 (19 MiB)
  • modem_adsp_mem: 0x8680_0000 – 0x8C00_0000 (88 MiB)
  • peripheral_mem: 0x8C00_0000 – 0x8C60_0000 (6 MiB)
  • venus_mem: 0x8C60_0000 – 0x8CB0_0000 (5 MiB)
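
The effect of the removal on the region list can be reproduced with a small range‑subtraction routine. This is only a sketch of the idea, assuming a sorted, non‑overlapping region array; it is not how memblock_remove() is implemented, and the names region and remove_range are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct region {
    uint64_t base;
    uint64_t size;
};

/*
 * Remove [cut_base, cut_base + cut_size) from every region in `in` and write
 * the surviving pieces to `out`. Returns the number of regions written, or 0
 * if `out` (max_out slots) is too small to hold the result.
 */
static size_t remove_range(const struct region *in, size_t n,
                           uint64_t cut_base, uint64_t cut_size,
                           struct region *out, size_t max_out)
{
    uint64_t cut_end = cut_base + cut_size;
    size_t k = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t start = in[i].base;
        uint64_t end = in[i].base + in[i].size;

        /* Piece of this region that lies below the cut. */
        uint64_t lo_end = end < cut_base ? end : cut_base;
        if (start < lo_end) {
            if (k == max_out)
                return 0;
            out[k++] = (struct region){ start, lo_end - start };
        }

        /* Piece of this region that lies above the cut. */
        uint64_t hi_start = start > cut_end ? start : cut_end;
        if (hi_start < end) {
            if (k == max_out)
                return 0;
            out[k++] = (struct region){ hi_start, end - hi_start };
        }
    }
    return k;
}
```

Because the four removed ranges are back to back, they can be treated as one cut from 0x85500000 of size 0x07600000 (118 MiB). Applied to the two regions of the first snapshot, the routine yields the three regions of the second snapshot.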

Additional reservations (secure region, qseecom, audio, splash, default carveouts) may be held back by other subsystems such as ION or secure monitor firmware and thus not reflected in the early memblock delta. Example adjusted carveouts observed later on some platforms:

  • secure_mem: 0xD900_0000 – 0xE000_0000 (112 MiB)
  • qseecom_mem: 0xD800_0000 – 0xD900_0000 (16 MiB)
  • audio_mem: 0xD7C0_0000 – 0xD800_0000 (4 MiB)
  • cont_splash_mem: 0x8E00_0000 – 0x8F40_0000 (20 MiB)
  • default: 0xA940_0000 – 0xA9C0_0000 (8 MiB)

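The boundary between lowmem and vmalloc is arm_lowmem_limit, a physical address. It may change as reservations are applied and also depends on kernel parameters (for example, a cmdline vmalloc= value). With vmalloc sized at 340 MiB, the lowest vmalloc virtual address is 0xFF000000 − 0x15400000 = 0xE9C00000. On this board (PAGE_OFFSET 0xC0000000, RAM starting at 0x80000000) that corresponds to physical 0xA9C00000, which matches the logged vmalloc_limit. The arm_lowmem_limit of 0xB1200000 in the second snapshot is 0x07600000 (118 MiB) higher, exactly the total size of the four ranges removed from CS0, which suggests that this kernel extends lowmem by the amount of memory removed beneath it.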
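
The vmalloc arithmetic is easy to get wrong by hand, so here is a small check. It is a sketch under stated assumptions: VMALLOC_END of 0xFF000000 and PAGE_OFFSET of 0xC0000000 are taken from the text, PHYS_OFFSET of 0x80000000 is inferred from the CS0 base in the logs, and the function names are illustrative only.

```c
#include <stdint.h>

#define PAGE_OFFSET  0xC0000000u  /* kernel virtual base (3G/1G split) */
#define PHYS_OFFSET  0x80000000u  /* assumed: start of CS0 on the example board */
#define VMALLOC_END  0xFF000000u  /* top of the vmalloc area used in the text */

/* Lowest virtual address the vmalloc area may reach for a given vmalloc= size. */
static uint32_t vmalloc_min_virt(uint32_t vmalloc_bytes)
{
    return VMALLOC_END - vmalloc_bytes;
}

/* Physical address behind a linearly mapped (lowmem) virtual address. */
static uint32_t lowmem_virt_to_phys(uint32_t virt)
{
    return virt - PAGE_OFFSET + PHYS_OFFSET;
}
```

With vmalloc_bytes = 340 MiB (0x15400000), vmalloc_min_virt() gives 0xE9C00000 and lowmem_virt_to_phys() maps that to 0xA9C00000, the vmalloc_limit value printed in both snapshots.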

Example virtual layout dump

[    0.000000] Memory: 1243908K/1448960K available (10539K text, 1363K rwdata, 4472K rodata, 1417K init, 5844K bss, 205052K reserved, 632832K highmem)
[    0.000000] Virtual kernel memory layout:
[    0.000000]   vector : 0xFFFF0000 - 0xFFFF1000   (   4 kB)
[    0.000000]   fixmap : 0xFFF00000 - 0xFFFE0000   ( 896 kB)
[    0.000000]   arm_lowmem_limit = 0xF1200000

[    0.000000]   vmalloc: 0xF1200000 - 0xFF000000   ( 222 MB)
[    0.000000]   lowmem : 0xF0000000 - 0xF1200000   (  18 MB)
[    0.000000]   vmalloc: 0xEFD00000 - 0xF0000000   (   3 MB)
[    0.000000]   lowmem : 0xCCB00000 - 0xEFD00000   ( 562 MB)
[    0.000000]   vmalloc: 0xC5500000 - 0xCCB00000   ( 118 MB)
[    0.000000]   lowmem : 0xC0000000 - 0xC5500000   (  85 MB)
[    0.000000]   pkmap  : 0xBFE00000 - 0xC0000000   (   2 MB)
[    0.000000]   modules: 0xBF000000 - 0xBFE00000   (  14 MB)
[    0.000000]   .text  : 0xC0008000 - 0xC0FA8EC4   (16004 kB)
[    0.000000]   .init  : 0xC1000000 - 0xC1162480   ( 1418 kB)
[    0.000000]   .data  : 0xC1164000 - 0xC12B8DE4   ( 1364 kB)
[    0.000000]   .bss   : 0xC12C1B3C - 0xC1876B78   ( 5845 kB)

Normal and HighMem zones in contig_page_data align with this split:

  • ZONE_NORMAL: zone_start_pfn is the PFN of the physical address mapped at PAGE_OFFSET (0x80000000 in this example); its span ends at arm_lowmem_limit
  • ZONE_HIGHMEM: zone_start_pfn begins where lowmem stops (arm_lowmem_limit) and spans to the end of highmem (0xE0000000 on the example board, the end of CS1)
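
To make the zone boundaries concrete, the following sketch converts the example's physical boundaries into page frame numbers. The addresses come from the logs in this article; the zone_span structure and helper names are hypothetical and much simpler than the kernel's struct zone.

```c
#include <stdint.h>

#define PAGE_SHIFT 12  /* 4 KiB pages */

/* Page frame number of a physical address. */
static uint32_t phys_to_pfn(uint64_t phys)
{
    return (uint32_t)(phys >> PAGE_SHIFT);
}

/* Hypothetical, simplified stand-in for the PFN fields of a zone. */
struct zone_span {
    uint32_t start_pfn;
    uint32_t end_pfn;   /* exclusive */
};

static struct zone_span make_span(uint64_t phys_start, uint64_t phys_end)
{
    struct zone_span z = { phys_to_pfn(phys_start), phys_to_pfn(phys_end) };
    return z;
}
```

For the example board, make_span(0x80000000, 0xB1200000) describes ZONE_NORMAL as PFNs 0x80000 to 0xB1200, and make_span(0xB1200000, 0xE0000000) describes ZONE_HIGHMEM as PFNs 0xB1200 to 0xE0000. The highmem span is 192000 pages (768000 KiB); subtracting the 132 MiB of secure, qseecom and audio carveouts listed at 0xD7C0_0000 – 0xE000_0000 leaves 632832 KiB, which is consistent with the "632832K highmem" figure in the Memory: line above. A span therefore counts holes as well as usable pages.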

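high_memory is computed as __va(arm_lowmem_limit − 1) + 1. VMALLOC_START is high_memory + VMALLOC_OFFSET, rounded down to a VMALLOC_OFFSET boundary (VMALLOC_OFFSET is typically 8 MiB), so the vmalloc area begins above lowmem behind a guard gap. Because both regions share the same 1 GiB kernel window, increasing vmalloc shrinks lowmem.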

Page table selection and sizes

On ARM, non‑LPAE kernels use a 2‑level translation scheme, while LPAE enables a 3‑level layout:

#ifdef CONFIG_ARM_LPAE
#include <asm/pgtable-3level.h>
#else
#include <asm/pgtable-2level.h>
#endif

Base page size is defined by PAGE_SHIFT:

#define PAGE_SHIFT  12
#define PAGE_SIZE   (1UL << PAGE_SHIFT) /* 4 KiB */

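For a 2‑level setup, Linux exposes counts for directory entries. The Linux view in asm/pgtable-2level.h pairs two hardware first‑level entries into one pgd entry, so each pgd entry covers 2 MiB:

#define PTRS_PER_PTE 512   /* 2nd level entries */
#define PTRS_PER_PMD 1     /* pmd level is folded into the pgd */
#define PTRS_PER_PGD 2048  /* 1st level entries */

The ARM hardware (short‑descriptor) view of the same tables, where each first‑level entry covers 1 MiB, is: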

#define PTRS_PER_PTE 256
#define PTRS_PER_PMD 1
#define PTRS_PER_PGD 4096
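
Both sets of constants describe the same 32‑bit address space; they differ only in how much one first‑level entry covers. The short check below illustrates that, using two illustrative helper functions that are not kernel APIs.

```c
#include <stdint.h>

#define PAGE_SHIFT 12  /* 4 KiB pages */

/* Bytes of virtual address space covered by a pgd x pte layout. */
static uint64_t va_coverage(uint64_t ptrs_per_pgd, uint64_t ptrs_per_pte)
{
    return (ptrs_per_pgd * ptrs_per_pte) << PAGE_SHIFT;
}

/* Bytes mapped by a single first-level entry. */
static uint64_t pgd_entry_span(uint64_t ptrs_per_pte)
{
    return ptrs_per_pte << PAGE_SHIFT;
}
```

va_coverage(2048, 512) and va_coverage(4096, 256) both evaluate to 4 GiB, while pgd_entry_span() returns 2 MiB for the 512‑entry layout and 1 MiB for the 256‑entry layout.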

Building mappings early in boot

Static mappings used before the MM subsystem fully initializes are installed by a helper that walks (or allocates) the kernel page tables given virtual and physical ranges:

- devicemaps_init()  -> create_mapping(...)
- map_lowmem()       -> create_mapping(...)
- iotable_init()     -> create_mapping(...)
- debug_ll_io_init() -> create_mapping(...)

These paths construct 1:1 or fixed offset mappings for lowmem and essential IO ranges.

Translating a user virtual address to a physical address (2‑level, 4 KiB pages)

Given a user task’s page directory and a 32‑bit virtual address VA, the translation follows these index/offset extractions (assuming the hardware 4096/256 layout above):

  • pgd_index = VA[31:20] (12 bits)
  • pte_index = VA[19:12] (8 bits)
  • page_offset = VA[11:0] (12 bits)

Steps:

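  1. Read the current task’s PGD base from task_struct->mm->pgd (or active_mm for kernel threads). This is a kernel virtual address; its physical counterpart is what the hardware holds in TTBR0 (CP15 c2).
  2. L1 descriptor addr = pgd_base + (pgd_index * 4). Each hardware L1 descriptor is 4 bytes; Linux’s pgd_t is 8 bytes because it covers two of them.
  3. L1 entry yields the L2 table base (after masking off descriptor bits).
  4. L2 descriptor addr = l2_base + (pte_index * 4).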
  5. L2 entry gives the page frame base (mask low attribute bits), then add page_offset.

Example with VA 0x01206000 and a particular task’s pgd:

  • pgd_index = 0x12
  • pte_index = 0x06
  • offset = 0x000
  • Locate PGD[0x12] → L2 base
  • Locate PTE[0x06] → PFN
  • Physical = (PFN << PAGE_SHIFT) | offset
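
The walk above can be simulated in plain C. This is a sketch, not kernel code: physical memory is modelled as a word array (the hypothetical sim_mem structure), only L1 page‑table descriptors and L2 small‑page descriptors of the ARM short‑descriptor format are handled, and sections, large pages and permission bits are ignored.

```c
#include <stddef.h>
#include <stdint.h>

#define L1_TYPE_MASK   0x3u
#define L1_TYPE_TABLE  0x1u          /* descriptor points to an L2 table */
#define L1_TABLE_MASK  0xFFFFFC00u   /* L2 table base, 1 KiB aligned */
#define L2_TYPE_SMALL  0x2u          /* bit 1 set: 4 KiB small page */
#define L2_PAGE_MASK   0xFFFFF000u   /* page frame base */

/* Hypothetical model of a window of physical memory holding the tables. */
struct sim_mem {
    uint32_t  phys_base;  /* physical address of words[0] */
    uint32_t *words;
    size_t    nwords;
};

/* Read one aligned 32-bit word at a simulated physical address. */
static int sim_read32(const struct sim_mem *m, uint32_t phys, uint32_t *out)
{
    uint32_t off;

    if (phys < m->phys_base)
        return -1;
    off = phys - m->phys_base;
    if ((off & 3) || off / 4 >= m->nwords)
        return -1;
    *out = m->words[off / 4];
    return 0;
}

/* Translate va using the table at physical address ttbr; 0 on success. */
static int walk_2level(const struct sim_mem *m, uint32_t ttbr,
                       uint32_t va, uint32_t *pa)
{
    uint32_t l1_desc, l2_desc;

    /* Steps 1-2: L1 descriptor = table base + VA[31:20] * 4 bytes. */
    if (sim_read32(m, ttbr + (va >> 20) * 4, &l1_desc))
        return -1;
    if ((l1_desc & L1_TYPE_MASK) != L1_TYPE_TABLE)
        return -1;  /* fault or section mapping: not handled in this sketch */

    /* Steps 3-4: L2 descriptor = L2 base + VA[19:12] * 4 bytes. */
    if (sim_read32(m, (l1_desc & L1_TABLE_MASK) + ((va >> 12) & 0xFF) * 4,
                   &l2_desc))
        return -1;
    if (!(l2_desc & L2_TYPE_SMALL))
        return -1;  /* fault or large page: not handled in this sketch */

    /* Step 5: frame base plus the 12-bit page offset. */
    *pa = (l2_desc & L2_PAGE_MASK) | (va & 0xFFF);
    return 0;
}
```

To try it with VA 0x01206000, place an L1 page‑table descriptor at index 0x12 of the first‑level table and a small‑page descriptor at index 0x06 of the second‑level table it points to; the returned address is the frame base from that descriptor with the page offset OR‑ed in.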

Kernel virtual addresses inside lowmem are simpler: virt_to_phys is a constant offset calculation (phys = virt − PAGE_OFFSET + PHYS_OFFSET, where PHYS_OFFSET is the physical start of RAM, 0x80000000 in this example), while highmem and vmalloc areas require page table walks similar to user mappings.

Core per‑process memory structures

  • task_struct: holds scheduling and memory pointers for a task
  • mm_struct: per‑process memory manager; contains:
    • pgd: base of the process’s top‑level page table
    • mmap: head of the VMA list representing user virtual regions (in kernels before 6.1; newer kernels keep VMAs in a maple tree)
  • vm_area_struct: describes a contiguous virtual memory region
    • vm_start, vm_end: virtual address bounds; vm_start is inclusive, vm_end is exclusive

The VMA list and page tables together determine how a particular VA maps to a physical page frame, with faults handled by the MM subsystem to populate or adjust mappings as needed.
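
As a rough illustration of the lookup side, the sketch below searches a sorted singly linked VMA list. The three‑field struct vma is a hypothetical reduction of vm_area_struct, and lookup_vma only imitates the contract of the kernel's find_vma() (first VMA whose vm_end is above the address); it is not the kernel implementation.

```c
#include <stddef.h>

/* Hypothetical, reduced stand-in for vm_area_struct. */
struct vma {
    unsigned long vm_start;  /* inclusive */
    unsigned long vm_end;    /* exclusive */
    struct vma   *vm_next;   /* next VMA, sorted by address */
};

/* Return the first VMA with addr < vm_end, or NULL if there is none. */
static struct vma *lookup_vma(struct vma *head, unsigned long addr)
{
    for (struct vma *v = head; v; v = v->vm_next)
        if (addr < v->vm_end)
            return v;
    return NULL;
}

/* True only if addr really lies inside [vm_start, vm_end). */
static int vma_contains(const struct vma *v, unsigned long addr)
{
    return v && addr >= v->vm_start && addr < v->vm_end;
}
```

The second check matters because the VMA returned by the lookup may start above the address. A fault handler has to distinguish an address inside a VMA from one that falls in a gap below it.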
