Author Jean Vincent
Category Exploitation
Tags kernel, exploit, linux, CVE-2022-0995, vulnerability, 2026
PageJack is a Linux kernel exploitation technique useful to generate a Use After Free (UAF) in the page allocator. In this article we provide a detailed example of how to use it to exploit a Linux kernel vulnerability from 2022.
Introduction
In this article, we will explore how a relatively old CVE can be exploited using PageJack, a modern kernel exploitation technique introduced in 2024 by Zhiyun Qian at Black Hat USA.
You can find a link to the full exploit at the end of this article.
The vulnerability (CVE-2022-0995)
CVE-2022-0995 is an out-of-bounds (OOB) write vulnerability caused by an incorrect bounds check in the watch_queue event notification mechanism of the Linux kernel. It affects kernel versions prior to the fix released in 5.17 and can lead to privilege escalation.
Root cause analysis
In Linux systems, the kernel needs a mechanism to notify user space about various events. To achieve this, it implements an internal pipe-backed ring buffer used to store messages generated by the kernel. These messages can then be retrieved from user space using the read() system call.
A process can specify which event sources it wants to monitor through an ioctl. Filters can also be applied so that only selected source types and sub-events are delivered, thus allowing certain types of notifications to be ignored.
When a process adds a filter, the kernel invokes the watch_queue_set_filter() function. However, in kernels prior to the 5.17 fix, a flaw in this function can lead to an out-of-bounds write in the kernel heap.
watch_queue_set_filter() implementation
If a user wants to set a filter for kernel messages, they must provide a list of filters that the kernel will use. To do so, the user supplies two structures:
struct watch_notification_filter {
__u32 nr_filters;
__u32 __reserved;
struct watch_notification_type_filter filters[];
};
struct watch_notification_type_filter {
__u32 type;
__u32 info_filter;
__u32 info_mask;
__u32 subtype_filter[8];
};
The user can specify the number of filters they want to apply, as well as the type of each filter, and those filters are passed to the kernel through the ioctl IOC_WATCH_QUEUE_SET_FILTER.
The kernel-side handler for this ioctl is the watch_queue_set_filter() function. It takes two parameters:
- a pipe_inode_info structure (which represents the pipe in the kernel)
- the filter list provided by the user
The purpose of this function is to copy all the filters set in user space into the kernel. To do this, the kernel first copies the filter from user space, counts the number of valid filters provided by the user, and then copies these filters into the kernel heap.
This is done with two for loops:
long watch_queue_set_filter(struct pipe_inode_info *pipe,
struct watch_notification_filter __user *_filter)
{
struct watch_notification_type_filter *tf; // Filter list
struct watch_notification_filter filter;
struct watch_type_filter *q;
struct watch_filter *wfilter;
int ret, nr_filter = 0, i;
...
if (copy_from_user(&filter, _filter, sizeof(filter)) != 0)
return -EFAULT;
...
tf = memdup_user(_filter->filters, filter.nr_filters * sizeof(*tf));
...
for (i = 0; i < filter.nr_filters; i++) {// Count the number of filters
if ((tf[i].info_filter & ~tf[i].info_mask) ||
tf[i].info_mask & WATCH_INFO_LENGTH)
goto err_filter;
/* Ignore any unknown types */
if (tf[i].type >= sizeof(wfilter->type_filter) * 8)
continue;
nr_filter++;
}
...
wfilter = kzalloc(struct_size(wfilter, filters, nr_filter), GFP_KERNEL);// Alloc enough space for the filters
...
q = wfilter->filters;
for (i = 0; i < filter.nr_filters; i++) {// Copy filters
if (tf[i].type >= sizeof(wfilter->type_filter) * BITS_PER_LONG)
continue;
q->type = tf[i].type;
q->info_filter = tf[i].info_filter;
q->info_mask = tf[i].info_mask;
q->subtype_filter[0] = tf[i].subtype_filter[0];
__set_bit(q->type, wfilter->type_filter);
q++;
}
...
}
Here, tf is a copy of the filter list provided by the user.
The first for loop counts the number of valid filters. In this loop, the validity of a filter type is checked using:
if (tf[i].type >= sizeof(wfilter->type_filter) * 8)
After counting the valid filters, the function allocates enough memory to store them. Here, kzalloc() allocates a kernel object whose size depends on the value of nr_filter. Since the filters come from user-space, we can control the number of filters and, consequently, the size of the allocation.
In the second for loop, the filter values are copied into kernel heap memory. The function checks if the user provided filter type is valid, using:
if (tf[i].type >= sizeof(wfilter->type_filter) * BITS_PER_LONG)
Out-of-bounds bug
This code is the root cause of the out-of-bounds write vulnerability. The problem is that sizeof(wfilter->type_filter) * BITS_PER_LONG is not equal to sizeof(wfilter->type_filter) * 8. More precisely, in the first loop the type is checked to be less than 128, while in the second loop it is checked to be less than 1024.
Because of this bug, the second loop can accept a filter type that was not accounted for during the allocation in the first loop.
Here we have two out-of-bounds (OOB) issues:
- The second loop can write out of the allocated object with:
q->type = tf[i].type;
q->info_filter = tf[i].info_filter;
q->info_mask = tf[i].info_mask;
q->subtype_filter[0] = tf[i].subtype_filter[0];
After each copy, the destination pointer is advanced with q++. If a filter type is between 128 and 1023, no space was allocated for that filter during the first loop, so these stores land past the end of the object. However, because of the previous checks in watch_queue_set_filter(), we can only use specific values for .type, .info_filter, .info_mask and .subtype_filter[0], which is not convenient for exploitation (some of these values must be 0x00).
- The second OOB is less obvious and occurs in __set_bit(q->type, wfilter->type_filter). In x86-64 assembly, this corresponds to:
bts [wfilter->type_filter], q->type
This means that the bit at offset q->type is set to 1 in wfilter->type_filter. For example, if we have:
wfilter->type_filter = 0x00000000
q->type = 16
This results in wfilter->type_filter being set to 0x00010000.
In our case, possible values for q->type are between 128 and 1023. As a result, we can set a bit to 1 that lies outside the bounds of the wfilter->type_filter object.
In summary, due to this out-of-bounds (OOB) vulnerability, we can:
- write 0x00 to memory beyond the allocated structure
- set a single bit at an offset in the range from 128 to 1023
Exploitation – PageJack
With the first bug, we can set a byte after our watch_type_filter structure to 0x00. This is not very interesting on its own. However, the second bug allows us to set a single bit to 1 in the next object.
To exploit this vulnerability, we will use the PageJack technique. This technique abuses the struct pipe_buffer to create a use-after-free (UAF) in a page.
To start the exploit, we only use two types of structures:
- watch_type_filter, to trigger the out-of-bounds condition
- pipe_buffer, to create the UAF
Creating the UAF
To create a UAF in the kernel page we will use pipe_buffer:
struct pipe_buffer {
struct page * page; /* 0 8 */
unsigned int offset; /* 8 4 */
unsigned int len; /* 12 4 */
const struct pipe_buf_operations * ops; /* 16 8 */
unsigned int flags; /* 24 4 */
/* XXX 4 bytes hole, try to pack */
long unsigned int private; /* 32 8 */
};
Here, struct page *page refers to the structure representing the physical page that stores a pipe’s data. It is allocated via alloc_page() rather than through the SLUB allocator. When we write to a pipe, the data is written to page + offset (this is not exactly true, but for the sake of simplicity we will assume it is). Our goal is to have two pipe_buffer structures that reference the same page. When one pipe is closed, the page is freed (returned to the buddy allocator) and becomes available for other allocations, while we still retain a pointer to it. As a result, we are able to generate a use-after-free (UAF) on an entire page.
The question is: how can we make a pipe_buffer point to the same page by setting a single bit to 1?
Let us assume we have two pipe_buffer structures whose pages are located at:
- 0xffffffff0000
- 0xffffffff1000
Using the second bug, we can set a bit in the first pipe_buffer.page pointer, for instance, flipping the address from 0xffffffff0000 to 0xffffffff1000, which results in two pipe_buffer structures that reference the same page.

Next, we need to determine how to place a pipe_buffer structure adjacent to a watch_type_filter structure.
To ease exploitation, we will use Linux kernel version 5.13.
SLUB Allocator
In the Linux kernel, the SLUB allocator manages memory by grouping objects of the same size into caches (kmem_cache). Each cache is backed by one or more slabs, which are contiguous blocks of memory (one or more pages) divided into multiple fixed-size objects.
When an object is allocated, the allocator searches for a slab in the corresponding cache that contains free objects and returns one of them. For example, if we call kmalloc(80, flags), the kernel will allocate memory from the kmalloc-96 cache. If a slab in that cache has free slots (i.e. it is not full), a pointer to one of those slots is returned.
OOB in pipe_buffer
watch_type_filter allocation
We now need to find a way to trigger an out-of-bounds write into a pipe_buffer structure using a watch_type_filter structure. To achieve this, both structures must be placed within the same slab, meaning they must belong to the same kmem_cache.
In other words, they must be allocated from the same size class.
watch_type_filter structures are allocated in watch_queue_set_filter() using:
wfilter = kzalloc(struct_size(wfilter, filters, nr_filter), GFP_KERNEL);
// Allocate an object that contains a watch_filter and an array of watch_type_filter
As mentioned earlier, the kzalloc() size depends on the value of nr_filter. Since the user can specify between 1 and 15 valid filters, the allocation size ranges from 0x18 to 0x118. This means the resulting object can fall into slab caches ranging from kmalloc-32 up to kmalloc-512.
pipe_buffer allocation
When a pipe is created, the kernel allocates a pipe_inode_info structure, which represents the pipe. This structure contains a field called bufs, which is an array of pipe_buffer structures. By default, this array contains 16 pipe_buffer structures and is allocated as follows:
pipe->bufs = kcalloc(16, sizeof(struct pipe_buffer),
GFP_KERNEL_ACCOUNT);
This means the kernel allocates an object of 640 bytes (16 × 40), which unfortunately lands in the kmalloc-1k cache.
However, if we look at the source code in /fs/pipe.c, we can see a function called pipe_resize_ring() that allows us to resize the number of pipe_buffer entries in a pipe_inode_info. Internally, it reallocates the pipe_buffer array as follows:
bufs = kcalloc(nr_slots, sizeof(*bufs),
GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
We can use the following small program to resize the pipe_buffer array:
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void setup_pipe(int pipefd[2]) {
    int nb_pipe_buffer = 2;
    long pagesize = sysconf(_SC_PAGESIZE);

    /* F_SETPIPE_SZ returns the actual ring size; asking for 2 pages
       makes the kernel reallocate a 2-slot pipe_buffer array */
    if (fcntl(pipefd[0], F_SETPIPE_SZ,
              pagesize * nb_pipe_buffer) != pagesize * nb_pipe_buffer) {
        perror("fcntl");
        exit(EXIT_FAILURE);
    }
}
It is important to note that you don’t need to be root to resize a pipe’s buffer ring.
Our pipe_buffer and watch_type_filter structures could be in the same cache. However, that is not guaranteed (we will see why).
If we examine the allocation flags used for these two objects, we can see that they use GFP_KERNEL and GFP_KERNEL_ACCOUNT respectively. The GFP_KERNEL_ACCOUNT flag instructs the allocator to use a kmalloc-cg cache instead of the default kmalloc cache, placing the objects in separate caches: kmalloc-x and kmalloc-cg-x respectively.
However, these dedicated accounted caches do not exist in Linux kernel versions between 5.9 and 5.13, so both objects can land in the same kmalloc cache there.
For our exploit, we will use the kmalloc-96 cache.
Heap spraying
We know our target structures can be in the same cache, but still need to find a way to place them next to each other. More precisely, we need a watch_type_filter structure to be allocated just before a pipe_buffer structure in order to effectively use our oob write.
However, due to freelist randomization, we do not know the exact state of the heap and the order in which objects will be allocated. We cannot even guarantee that both objects will end up in the same slab of the same cache.
To increase our chances of success, we spray the kernel heap with a large number of pipe_buffer array objects, so that when we allocate our watch_type_filter, the next object in memory will most likely be:
- a free object
- a pipe_buffer array object
Even with heap spraying, we still cannot guarantee the OOB write actually targets a pipe_buffer and we may need to repeatedly run the exploit until the OOB successfully corrupts the intended structure.
Exploiting a UAF in pages
We now need to determine whether our attempt was successful, i.e., one of our pipe_buffer.page pointers was located right after the watch_type_filter on which the OOB was triggered, the targeted bit was 0, and the resulting address matches the page of another of our opened pipes.
In the heap spraying stage, we sprayed the heap with a large number of pipes. If the memory corruption was successful, two different pipes may now reference the same page. To identify them, we wrote a unique identifier into each pipe before triggering the OOB. After the corruption, if two pipes contain the same identifier, we know they are sharing the same page, this means we have found the pipes we were targeting.
Once we have identified these two pipes, we can close one of them, causing the page referenced by its pipe_buffer to be freed and returned to the buddy allocator. The second pipe still holds a reference to that same page, giving us a UAF primitive on it.
Our next goal is to have the buddy allocator reassign this freed page to a sensitive kernel object, so that we can overwrite its contents through the dangling pipe reference.
This exploit could target struct cred (the structure that contains the process UID/GID/etc.), but we chose struct file instead, to explore another approach.
This structure represents the kernel-side state of a file opened from user space in Linux kernel 5.13 (some fields may change in different versions):
struct file {// Order of the fields depends on the kernel version
union {
struct llist_node fu_llist; /* 0 8 */
struct callback_head fu_rcuhead __attribute__((__aligned__(8))); /* 0 16 */
} f_u __attribute__((__aligned__(8))); /* 0 16 */
struct path f_path; /* 16 16 */
struct inode * f_inode; /* 32 8 */
const struct file_operations * f_op; /* 40 8 */
spinlock_t f_lock; /* 48 4 */
enum rw_hint f_write_hint; /* 52 4 */
atomic_long_t f_count; /* 56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */
unsigned int f_flags; /* 64 4 */
fmode_t f_mode; /* 68 4 */
struct mutex f_pos_lock; /* 72 32 */
loff_t f_pos; /* 104 8 */
struct fown_struct f_owner; /* 112 32 */
/* --- cacheline 2 boundary (128 bytes) was 16 bytes ago --- */
const struct cred * f_cred; /* 144 8 */
struct file_ra_state f_ra; /* 152 32 */
u64 f_version; /* 184 8 */
/* --- cacheline 3 boundary (192 bytes) --- */
void * f_security; /* 192 8 */
void * private_data; /* 200 8 */
struct hlist_head * f_ep; /* 208 8 */
struct address_space * f_mapping; /* 216 8 */
errseq_t f_wb_err; /* 224 4 */
errseq_t f_sb_err; /* 228 4 */
/* size: 232, cachelines: 4, members: 21 */
}
The f_mode field represents the access mode of the file, i.e., how it was opened (read-only, write-only, read/write, etc.).
If we manage to place a struct file object inside our reclaimed UAF page, we can exploit this by opening a sensitive file in read-only mode (for example /etc/passwd), then overwriting its f_mode field to grant read/write permissions and modify its contents.
To achieve this, we need the slab cache that stores struct file objects to reclaim our freed page from the buddy allocator. To do so, there is only one technique, an old and difficult one… no, just kidding. Just spray.
By spraying a large number of struct file objects (repeatedly opening the same file), we force the corresponding slab cache to consume all available slabs that can hold this structure. When the cache needs more space, it requests a new page from the buddy allocator. Eventually, it will receive our previously freed vulnerable page.
When writing to pipe_buffer.page, we effectively write at *(page + offset). Unfortunately, this offset does not naturally align with the f_mode field inside struct file, so writing directly would corrupt unrelated fields.
To solve this, before triggering the UAF, we carefully prepare the pipe by writing a specific amount of data (68 bytes on kernel 5.13). This ensures that the next write into the pipe lands exactly at the offset corresponding to f_mode, allowing us to overwrite only the intended field.
Conclusion
In the Linux kernel, even a subtle bug combined with a very constrained write primitive can ultimately be leveraged to fully compromise a system.
In this article, we explored how the new technique called PageJack can trigger a Use After Free on a page. In this exploit, the main challenge lies in placing our objects next to each other in the same cache. Once that condition is achieved, gaining full control of the system becomes relatively straightforward.
In other POCs I have studied, multiple structures such as msg_msg and sk_buff are typically used to obtain arbitrary read and write primitives and in the end, they generally create a ROP/JOP chain to get root. But as you can see, with only our pipe_buffer structure, we manage to get root without that many steps.
POC || GTFO
The video below demonstrates exploitation:
Exploit running.
The full Proof of Concept exploit is available here
WARNING: Our exploit sets a bit in the kernel heap to 1, but we are not sure if we are flipping the correct bit. This may result in a kernel panic or cause the system to freeze. YMMV.
