Ten years ago, we published a Clang Hardening Cheat Sheet. Since then, both the threat landscape and the Clang toolchain have evolved significantly. This blog post presents the new mitigations available in Clang to improve the security of your applications.


Introduction

Ten years ago, we published on this blog a Clang Hardening Cheat Sheet. The original post walked through the essential hardening techniques available at the time: FORTIFY_SOURCE checks, ASLR via position-independent code, stack protection (canaries and SafeStack), Control Flow Integrity (CFI), GOT protection with RELRO/now, as well as warnings about format-string usage that could enable attacks.

Since that article was published in early 2016, both the threat landscape and the Clang toolchain have evolved significantly.

To celebrate the 10th anniversary of the initial article, here is an updated cheat sheet covering the hardening flags that have appeared since.

TL;DR

The OpenSSF Best Practices Working Group maintains a Compiler Options Hardening Guide for C and C++. They recommend using the following set of options:

-O2 -Wall -Wformat -Wformat=2 -Wconversion -Wimplicit-fallthrough \
-Werror=format-security \
-U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=3 \
-D_GLIBCXX_ASSERTIONS \
-fstrict-flex-arrays=3 \
-fstack-clash-protection -fstack-protector-strong \
-Wl,-z,nodlopen -Wl,-z,noexecstack \
-Wl,-z,relro -Wl,-z,now \
-Wl,--as-needed -Wl,--no-copy-dt-needed-entries

The options we recommended in our original blog post are still present, even if some have evolved (-Wformat=2, -D_FORTIFY_SOURCE=3). Others have been added:

  • to enable some compiler warnings: -Wconversion, -Wimplicit-fallthrough;
  • to limit the attack surface by only linking against necessary libraries: -Wl,--as-needed -Wl,--no-copy-dt-needed-entries;
  • to harden the code against attacks: -D_GLIBCXX_ASSERTIONS, -fstrict-flex-arrays=3, -fstack-clash-protection, -Wl,-z,nodlopen, -Wl,-z,noexecstack.

In this post, we present the additional hardening options recommended by the OpenSSF, as well as more specialized options that mitigate newer classes of exploits. We first go through several general protections related to the standard C/C++ libraries and library loading. Then we present mitigations against stack-based memory corruption and against Return-Oriented Programming (ROP) and Jump-Oriented Programming (JOP) attacks. Finally, we cover defenses against speculative execution attacks.

General Protections

Fortify

Since our original post, the -D_FORTIFY_SOURCE macro has evolved. It now provides a level 3, which includes all the security features of levels 1 and 2, along with additional checks for potentially dangerous code patterns (note that FORTIFY checks only kick in when optimization is enabled, hence the -O2 in the recommended options). These additional checks are designed to detect a wider range of security issues, including:

  • Unsafe usage of memcpy and memmove
  • Risky use of snprintf, vsnprintf, and related functions
  • Potentially dangerous string manipulation functions such as strtok, strncat, and strpbrk

C++ Library Hardening

In addition to -D_FORTIFY_SOURCE, C++ developers can enable extra runtime checks in the standard library by defining -D_GLIBCXX_ASSERTIONS.

This macro is one of several hardening knobs supported by libstdc++; it enables lightweight runtime checks such as bounds checking on std::vector and std::string subscripting and null-pointer precondition checks, turning some silent memory corruptions into clean aborts.

nodlopen

The -Wl,-z,nodlopen linker flag is used to prevent a shared object from being dynamically loaded at runtime using dlopen().

This can help in reducing an attacker’s ability to load and manipulate shared objects.

To test this flag, let's compile a simple shared object:

#include <stdio.h>

void example_function(void)
{
    printf("Hello from the dynamically loaded library!\n");
}

We compile it with the nodlopen linker flag:

clang -fPIC -c src/libexample.c -o libexample.o
clang -Wl,-z,nodlopen -shared libexample.o -o libexample.so

Now we can attempt to load it at runtime using dlopen:

#include <stdio.h>
#include <dlfcn.h>

int main(void)
{
    void *handle;
    void (*func)(void);

    /* Load the shared library */
    handle = dlopen("./libexample.so", RTLD_NOW);
    if (!handle) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    /* Resolve the symbol */
    func = (void (*)(void))dlsym(handle, "example_function");
    if (!func) {
        fprintf(stderr, "dlsym failed: %s\n", dlerror());
        dlclose(handle);
        return 1;
    }

    /* Call the function */
    func();

    /* Close the library */
    dlclose(handle);

    return 0;
}

When we run this program, we get the following error:

╭─trikkss@archlinux ~/work/clang-hardening/tests/nodlopen  ‹main*›
╰─➤  ./main
dlopen failed: ./libexample.so: shared object cannot be dlopen()ed

This demonstrates that the nodlopen flag effectively prevents the shared library from being loaded dynamically at runtime. Note that this does not prevent the library from being linked at build time:

╭─trikkss@archlinux ~/work/clang-hardening/tests/nodlopen  ‹main*›
╰─➤  cat src/main2.c
#include <stdio.h>

/* declaration of the library function */
void example_function(void);

int main(void)
{
    example_function();
    return 0;
}

╭─trikkss@archlinux ~/work/clang-hardening/tests/nodlopen  ‹main*›
╰─➤  clang -o main2 src/main2.c $PWD/libexample.so
╭─trikkss@archlinux ~/work/clang-hardening/tests/nodlopen  ‹main*›
╰─➤  ldd main2
    linux-vdso.so.1 (0x00007fec50fe3000)
    /home/trikkss/work/clang-hardening/tests/nodlopen/libexample.so (0x00007fec50fd1000)
    libc.so.6 => /usr/lib/libc.so.6 (0x00007fec50c00000)
    /lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007fec50fe5000)
╭─trikkss@archlinux ~/work/clang-hardening/tests/nodlopen  ‹main*›
╰─➤  ./main2
Hello from the dynamically loaded library!

If we inspect the dynamic section of the library, and compare it to the dynamic section of the same library generated without nodlopen, we can see that the main difference is the presence of the flag NOOPEN:

2c2
< Dynamic section at offset 0x2e08 contains 24 entries:
---
> Dynamic section at offset 0x2df8 contains 25 entries:
7c7
<  0x0000000000000019 (INIT_ARRAY)         0x3df8
---
>  0x0000000000000019 (INIT_ARRAY)         0x3de8
9c9
<  0x000000000000001a (FINI_ARRAY)         0x3e00
---
>  0x000000000000001a (FINI_ARRAY)         0x3df0
22a23
>  0x000000006ffffffb (FLAGS_1)            Flags: NOOPEN

The protection may therefore be easy to bypass: if an attacker has write access to the library in question, they can easily patch it.

Defenses Against Stack-Based Memory Corruption

Non-Executable Stack (NX)

One of the oldest and simplest exploit mitigation mechanisms is the non-executable stack. The idea is to prevent code execution directly from the stack, which was a common exploitation technique in early stack-based buffer overflow attacks.

When a binary is compiled or linked with the noexecstack flag, the stack memory region is marked as non-executable. As a result, even if an attacker manages to inject shellcode onto the stack and redirect execution to it, the CPU will refuse to execute it and will raise a fault.

Today, this protection is enabled by default on most modern systems and toolchains.

Stack clash vulnerabilities

During its execution, a program uses the stack memory region to store data. This memory region is special because it grows automatically as the program needs more stack memory. To prevent uncontrolled growth, modern operating systems place guard pages below the stack, which trigger a fault when the stack grows too far.

A stack clash occurs when the stack grows by a very large amount at once and skips over these guard pages. As a result, the stack can collide with another memory region, such as the heap or a memory-mapped area, without triggering an immediate fault. This can cause stack writes to corrupt adjacent memory regions, or allow other memory regions to overlap with the stack, leading to memory corruption and potential exploitation.

The Stack Clash vulnerability was publicly disclosed on June 19, 2017 by the Qualys research team.

Stack Clash Protection

LLVM's solution to this problem is to divide large allocations into smaller ones of size PAGE_SIZE as described in this blog post. This can be done by adding the -fstack-clash-protection compilation flag.

To observe this behavior, we can compile a simple C program that allocates a large buffer on the stack:

#include <stdio.h>
#include <sys/user.h>

int main(void)
{
    char buffer[PAGE_SIZE * 10];
    printf("hello world\n");

    return 0;
}

Without stack clash protection enabled, the generated assembly looks like this:

0000000000401130 <main>:
  401130:   55                      push   rbp
  401131:   48 89 e5                mov    rbp,rsp
  401134:   48 81 ec 10 a0 00 00    sub    rsp,0xa010
  40113b:   c7 45 fc 00 00 00 00    mov    DWORD PTR [rbp-0x4],0x0
  401142:   48 8d 3d bb 0e 00 00    lea    rdi,[rip+0xebb]        # 402004 <_IO_stdin_used+0x4>
  401149:   b0 00                   mov    al,0x0
  40114b:   e8 e0 fe ff ff          call   401030 <printf@plt>
  401150:   31 c0                   xor    eax,eax
  401152:   48 81 c4 10 a0 00 00    add    rsp,0xa010
  401159:   5d                      pop    rbp
  40115a:   c3                      ret

Here, we can clearly identify a single large stack allocation performed by the instruction sub rsp, 0xa010, which decreases the stack pointer by more than ten pages at once.

Now, we can take a look at the hardened version of this program:

0000000000401140 <main>:
  401140:   55                      push   rbp
  401141:   48 89 e5                mov    rbp,rsp
  401144:   49 89 e3                mov    r11,rsp
  401147:   49 81 eb 00 a0 00 00    sub    r11,0xa000
  40114e:   48 81 ec 00 10 00 00    sub    rsp,0x1000
  401155:   48 c7 04 24 00 00 00    mov    QWORD PTR [rsp],0x0
  40115c:   00
  40115d:   4c 39 dc                cmp    rsp,r11
  401160:   75 ec                   jne    40114e <main+0xe>
  401162:   48 83 ec 20             sub    rsp,0x20
  401166:   64 48 8b 04 25 28 00    mov    rax,QWORD PTR fs:0x28
  40116d:   00 00
  40116f:   48 89 45 f8             mov    QWORD PTR [rbp-0x8],rax
  401173:   c7 85 ec 5f ff ff 00    mov    DWORD PTR [rbp-0xa014],0x0
  40117a:   00 00 00
  40117d:   48 8d 3d 80 0e 00 00    lea    rdi,[rip+0xe80]        # 402004 <_IO_stdin_used+0x4>
  401184:   31 c0                   xor    eax,eax
  401186:   e8 b5 fe ff ff          call   401040 <printf@plt>
  40118b:   64 48 8b 04 25 28 00    mov    rax,QWORD PTR fs:0x28
  401192:   00 00
  401194:   48 8b 4d f8             mov    rcx,QWORD PTR [rbp-0x8]
  401198:   48 39 c8                cmp    rax,rcx
  40119b:   75 0b                   jne    4011a8 <main+0x68>
  40119d:   31 c0                   xor    eax,eax
  40119f:   48 81 c4 20 a0 00 00    add    rsp,0xa020
  4011a6:   5d                      pop    rbp
  4011a7:   c3                      ret
  4011a8:   e8 83 fe ff ff          call   401030 <__stack_chk_fail@plt>

We can see that the single stack allocation (sub rsp, 0xa010) has been replaced by a loop (address 40114e) that grows the stack by 0x1000 bytes (the size of a memory page) per iteration until the stack pointer reaches the target address, touching each page along the way (mov QWORD PTR [rsp],0x0) so that the guard page cannot be skipped over.

Defenses Against Code-Reuse Attacks (ROP/JOP)

Code-Reuse Attacks

Modern exploitation techniques rarely rely on injecting new code. Instead, attackers increasingly reuse existing code already present in the binary or in linked libraries. This class of attacks, commonly referred to as code-reuse attacks, bypasses traditional defenses such as non-executable stack memory (NX) by chaining together short instruction sequences called gadgets that end with indirect control-flow transfers.

The most well-known form is Return-Oriented Programming (ROP), where execution is redirected through a sequence of RET instructions by overwriting the return address of a function. Variants such as Jump-Oriented Programming (JOP) and Call-Oriented Programming (COP) rely on indirect jumps or calls instead.

To mitigate these issues, modern defenses focus on controlling where indirect branches are allowed to land, protecting return addresses, or reducing the usefulness of available gadgets. Rather than preventing memory corruption itself, these mechanisms aim to constrain how corrupted control flow can be exploited.

In the following sections, we will examine how Clang compilation flags implement these strategies through hardware-assisted features and compiler-driven transformations, and how they can be combined to harden binaries against code-reuse attacks in practice.

If you want to learn more about how ROP works, I recommend the following blog post: ROP - Return-Oriented Programming by Pixis.

Control-Flow Protection (x86)

Modern x86_64 CPUs support Control-flow Enforcement Technology (CET), a hardware-assisted mechanism introduced by Intel to protect against code-reuse attacks. CET combines two complementary features:

  • IBT (Indirect Branch Tracking): protects against Jump/Call-Oriented Programming (JOP) attacks.
  • Shadow Stack (SHSTK): protects against Return-Oriented Programming (ROP) attacks.

Clang exposes CET through the -fcf-protection flag, which can take several values:

  • -fcf-protection=return: enables shadow stack protection (SHSTK);
  • -fcf-protection=branch: enables endbranch (ENDBR) generation (IBT);
  • -fcf-protection=full: enables both shadow stack protection and endbranch generation; this is the default when the flag is passed without a value;
  • -fcf-protection=none: disables Intel CET protection.

To study this security mechanism, consider the following trivial function:

#include <stdio.h>

void hello_world(void)
{
    printf("hello world\n");
}

When compiled without CET, the assembly code looks like:

0000000000401130 <hello_world>:
  401130:   55                      push   rbp
  401131:   48 89 e5                mov    rbp,rsp
  401134:   48 8d 3d c9 0e 00 00    lea    rdi,[rip+0xec9]        # 402004 <_IO_stdin_used+0x4>
  40113b:   b0 00                   mov    al,0x0
  40113d:   e8 ee fe ff ff          call   401030 <printf@plt>
  401142:   5d                      pop    rbp
  401143:   c3                      ret

and compiled with -fcf-protection=full, it looks like:

0000000000001150 <hello_world>:
    1150:   f3 0f 1e fa             endbr64
    1154:   55                      push   rbp
    1155:   48 89 e5                mov    rbp,rsp
    1158:   48 8d 3d a5 0e 00 00    lea    rdi,[rip+0xea5]        # 2004 <_IO_stdin_used+0x4>
    115f:   b0 00                   mov    al,0x0
    1161:   e8 da fe ff ff          call   1040 <printf@plt>
    1166:   5d                      pop    rbp
    1167:   c3                      ret

The only difference is the ENDBR64 instruction at the beginning, which acts as a landing pad for indirect branches under IBT.

If we look at the properties of our ELFs, the hardened binary declares support for IBT and SHSTK:

╭─trikkss@archlinux ~/work/clang-hardening/tests/fcf-protection  main*
╰─➤  readelf -n main-hardened | grep Properties
      Properties: x86 feature: IBT, SHSTK
╭─trikkss@archlinux ~/work/clang-hardening/tests/fcf-protection  main*
╰─➤  readelf -n main | grep Properties
      Properties: x86 ISA needed: x86-64-baseline

⚠️ Today, the 64-bit Linux kernel supports only the userspace shadow stack and kernel IBT (kernel.org - Control-flow Enforcement Technology (CET) Shadow Stack).

Windows supports the CET shadow stack (shipped as "Hardware-enforced Stack Protection") but does not support Indirect Branch Tracking.

Indirect Branch Tracking

When IBT is enabled, the compiler inserts landing pads at the beginning of functions that may be the target of an indirect branch. These landing pads are represented by the end branch instructions (ENDBR32 and ENDBR64). On processors where this feature is enabled, an indirect jump or call may only land on an end branch instruction; anything else raises a control-protection fault. This defeats JOP gadgets, because indirect branches can only reach addresses marked by ENDBR instructions.

It is still possible to hijack the program’s control flow, but the options available to an attacker are limited. Additionally, this protection does not prevent ROP attacks.

Shadow Stack

Shadow stack, also referred to as SHSTK, is a backward-edge control-flow integrity feature available in both Intel and AMD processors. Its main purpose is to protect the integrity of return addresses on the call stack, preventing return-oriented programming (ROP) attacks.

The way it works is conceptually straightforward: every time a CALL instruction is executed, the CPU pushes the return address not only onto the regular stack but also onto a separate, protected “shadow” stack. Later, when a RET instruction is executed, the CPU checks that the return address on the normal stack matches the one on the shadow stack. If the two addresses differ, a fault is triggered, preventing the program from returning to an attacker-controlled location.

Control-Flow Protection (ARM)

Just as x86 systems can be protected using CET, ARM provides equivalent hardware-assisted mechanisms through PAC (Pointer Authentication Codes) and BTI (Branch Target Identification). The Clang flag -mbranch-protection enables these protections, which are designed to prevent code-reuse attacks such as ROP, JOP, and COP on ARM systems.

Clang exposes PAC and BTI through the -mbranch-protection flag, which can take several values:

  • none: disables all types of branch protection.
  • standard: enables Branch Target Identification (BTI) and Pointer Authentication Code (PAC) branch protection.
  • bti: enables branch protection using BTI.
  • pac-ret: enables branch protection using PAC.

For more information on these flags, you can consult the ARM Developer Manual.

Branch Target Identification (BTI)

Branch Target Identification (BTI) is an architectural hardware security feature that restricts the set of valid destination addresses for indirect branch instructions. By enforcing where indirect calls and jumps are allowed to land, BTI helps mitigate Jump-Oriented Programming (JOP) and Call-Oriented Programming (COP) attacks.

To examine how this mechanism works in practice, we can once again compile a trivial function and inspect the generated assembly code.

Here is the assembly code of a simple hello world compiled without BTI:

STP             X29, X30, [SP,#-0x10+var_s0]!
MOV             X29, SP
ADRL            X0, aHelloWorld ; "hello world\n"
BL              .printf
LDP             X29, X30, [SP+var_s0],#0x10
RET

And here is the hardened version compiled with BTI enabled:

BTI             c
STP             X29, X30, [SP,#-0x10+var_s0]!
MOV             X29, SP
ADRL            X0, aHelloWorld ; "hello world\n"
BL              .printf
LDP             X29, X30, [SP+var_s0],#0x10
RET

The only visible difference is the addition of a BTI instruction at the beginning of the function. This instruction plays a role similar to the ENDBR instruction on x86: every indirect branch must land on a valid BTI instruction, otherwise execution is aborted.

Unlike ENDBR, the BTI instruction takes an operand that specifies which types of indirect branches are allowed to target this location.

The operand is checked against PSTATE.BTYPE, a 2-bit field that the CPU sets according to the type of indirect branch that was taken:

  • 00 - none
  • 01 - CALLS (c)
  • 10 - JUMPS (j)
  • 11 - JUMPS and CALLS (jc)

In this example, the BTI c instruction indicates that only indirect calls are allowed to branch to this function entry point. Indirect jumps targeting this address would result in a fault.

By enforcing these constraints, BTI significantly reduces the set of valid gadget entry points available to an attacker. While it does not completely prevent control-flow hijacking, it limits exploitation to a restricted set of legitimate targets, making JOP and COP chains much harder to construct reliably.

As with similar mechanisms on x86 (IBT), BTI doesn't protect against return-oriented programming. To achieve full control-flow integrity on ARM systems, BTI can be combined with PAC.

Pointer Authentication Codes (PAC)

Pointer Authentication Codes (PAC) is a hardware-assisted mechanism on ARM64 that protects against ROP by signing return addresses.

When a function is called, PAC generates a cryptographic signature using:

  • A key register (A or B) provided by the CPU.
  • A modifier, which can include context such as the stack pointer (SP) to prevent reuse of signed pointers across different functions.

The return address is then signed and stored on the stack. When the function returns, the CPU verifies the signature using the same key and modifier. If the signature is invalid, the program triggers a fault, effectively preventing the execution of an attacker-controlled return address.

Pointer authentication takes advantage of the fact that pointers are stored in a 64-bit format, but not all those bits are needed to represent the address. The virtual address space layout is the following:

  • kernel space : 0xFFF0_0000_0000_0000 to 0xFFFF_FFFF_FFFF_FFFF
  • user space : 0x0000_0000_0000_0000 to 0x000F_FFFF_FFFF_FFFF

Any address that falls outside of both ranges is always invalid and results in a fault if accessed.

You can see that any valid virtual address has its top 12 bits as 0x000 or 0xFFF. When pointer authentication is enabled, the upper bits are used to store a signature and are not treated as part of the address. This signature is referred to as a Pointer Authentication Code (PAC). Its size can change depending on the architecture.

Here is an example taken from the ARM Developer Manual - Return-Oriented Programming:

SIGNED POINTER = PAC | POINTER

e.g:

PAC = 0x123
POINTER = 0x0007FFFF5678
SIGNED POINTER = 0x1237FFFF5678

To analyse this security mechanism, we can use our previous hello world example:

STP             X29, X30, [SP,#-0x10+var_s0]!
MOV             X29, SP
ADRL            X0, aHelloWorld ; "hello world\n"
BL              .printf
LDP             X29, X30, [SP+var_s0],#0x10
RET

and compile it with PAC enabled:

PACIASP
STP             X29, X30, [SP,#-0x10+var_s0]!
MOV             X29, SP
ADRL            X0, aHelloWorld ; "hello world\n"
BL              .printf
LDP             X29, X30, [SP+var_s0],#0x10
RETAA

As we can see, two things have changed between the two versions. A PACIASP instruction has been added at the beginning, and the RET instruction has been changed to a RETAA instruction.

The PACIASP instruction signs the return address (held in the link register X30) with key A, using SP as the modifier. The RETAA instruction authenticates the return address with the same key and modifier before returning; an invalid signature triggers a fault.

Several PAC bypasses exist, depending on how pointer authentication is implemented. In some cases, PAC is signed with a null modifier, allowing an attacker to reuse an old authenticated pointer since the signature is not bound to any execution context. Additionally, because the number of bits allocated to the PAC signature varies across architectures, it may sometimes be possible to brute-force the authentication code, among other implementation-specific weaknesses.

If you want to go further, here are some very interesting slides by Brandon Azad on PAC bypasses on iOS: iOS Kernel PAC, One Year Later.

Register Zeroing

Register zeroing is a hardening technique designed to reduce the number of useful gadgets an attacker can reuse after hijacking the control flow of a program. By explicitly clearing registers before a function returns, register zeroing limits what an attacker can carry from one gadget to the next. Instead of inheriting a rich execution context, each return leaves the program in a mostly clean state, making exploit construction significantly more complex. This helps mitigate Return-Oriented Programming exploits.

Clang provides this mitigation through the -fzero-call-used-regs compilation flag which tells the compiler to zero out certain registers before the function returns.

The options fall into two top-level categories:

  • used: Zero out used registers.
  • all: Zero out all registers, whether used or not.

The individual options are:

  • skip: Don't zero out any registers. This is the default.
  • used: Zero out all used registers.
  • used-arg: Zero out used registers that are used for arguments.
  • used-gpr: Zero out used registers that are GPRs.
  • used-gpr-arg: Zero out used GPRs that are used as arguments.
  • all: Zero out all registers.
  • all-arg: Zero out all registers used for arguments.
  • all-gpr: Zero out all GPRs.
  • all-gpr-arg: Zero out all GPRs used for arguments.

General-Purpose Registers (GPRs) are CPU registers that programs can use freely to hold temporary values, addresses, or intermediate results during execution. For example, on x86-64, registers like RAX or RCX can be zeroed or reused without breaking the program, while RSP is not freely usable because it must always point to the top of the stack.

Let's take a look at a simple hello world program:

0000000000001140 <hello_world>:
    1140:   55                      push   rbp
    1141:   48 89 e5                mov    rbp,rsp
    1144:   48 8d 3d b9 0e 00 00    lea    rdi,[rip+0xeb9]        # 2004 <_IO_stdin_used+0x4>
    114b:   b0 00                   mov    al,0x0
    114d:   e8 de fe ff ff          call   1030 <printf@plt>
    1152:   5d                      pop    rbp
    1153:   c3                      ret

If we compile it with -fzero-call-used-regs=all:

0000000000001140 <hello_world>:
    1140:   55                      push   rbp
    1141:   48 89 e5                mov    rbp,rsp
    1144:   48 8d 3d b9 0e 00 00    lea    rdi,[rip+0xeb9]        # 2004 <_IO_stdin_used+0x4>
    114b:   b0 00                   mov    al,0x0
    114d:   e8 de fe ff ff          call   1030 <printf@plt>
    1152:   5d                      pop    rbp
    1153:   d9 ee                   fldz
    1155:   d9 ee                   fldz
    1157:   d9 ee                   fldz
    1159:   d9 ee                   fldz
    115b:   d9 ee                   fldz
    115d:   d9 ee                   fldz
    115f:   d9 ee                   fldz
    1161:   d9 ee                   fldz
    1163:   dd d8                   fstp   st(0)
    1165:   dd d8                   fstp   st(0)
    1167:   dd d8                   fstp   st(0)
    1169:   dd d8                   fstp   st(0)
    116b:   dd d8                   fstp   st(0)
    116d:   dd d8                   fstp   st(0)
    116f:   dd d8                   fstp   st(0)
    1171:   dd d8                   fstp   st(0)
    1173:   31 c0                   xor    eax,eax
    1175:   31 c9                   xor    ecx,ecx
    1177:   31 ff                   xor    edi,edi
    1179:   31 d2                   xor    edx,edx
    117b:   31 f6                   xor    esi,esi
    117d:   45 31 c0                xor    r8d,r8d
    1180:   45 31 c9                xor    r9d,r9d
    1183:   45 31 d2                xor    r10d,r10d
    1186:   45 31 db                xor    r11d,r11d
    1189:   0f 57 c0                xorps  xmm0,xmm0
    118c:   0f 57 c9                xorps  xmm1,xmm1
    118f:   0f 57 d2                xorps  xmm2,xmm2
    1192:   0f 57 db                xorps  xmm3,xmm3
    1195:   0f 57 e4                xorps  xmm4,xmm4
    1198:   0f 57 ed                xorps  xmm5,xmm5
    119b:   0f 57 f6                xorps  xmm6,xmm6
    119e:   0f 57 ff                xorps  xmm7,xmm7
    11a1:   45 0f 57 c0             xorps  xmm8,xmm8
    11a5:   45 0f 57 c9             xorps  xmm9,xmm9
    11a9:   45 0f 57 d2             xorps  xmm10,xmm10
    11ad:   45 0f 57 db             xorps  xmm11,xmm11
    11b1:   45 0f 57 e4             xorps  xmm12,xmm12
    11b5:   45 0f 57 ed             xorps  xmm13,xmm13
    11b9:   45 0f 57 f6             xorps  xmm14,xmm14
    11bd:   45 0f 57 ff             xorps  xmm15,xmm15
    11c1:   c3                      ret

Here we can clearly see the downside of using the -fzero-call-used-regs=all variant. In order to guarantee that no potentially useful state remains after the function returns, the compiler aggressively clears all classes of registers: general-purpose registers, SIMD registers (XMM), and even the x87 floating-point stack (fldz, fstp st(0)). While this significantly reduces the availability of useful ROP gadgets, it comes at a cost in both code size and runtime overhead.

In practice, such an aggressive mode is often overkill. A more balanced approach is to use -fzero-call-used-regs=used-gpr, which only clears the general‑purpose registers that were actually used by the function.

Here is the code hardened with -fzero-call-used-regs=used-gpr:

0000000000001140 <hello_world>:
    1140:   55                      push   rbp
    1141:   48 89 e5                mov    rbp,rsp
    1144:   48 8d 3d b9 0e 00 00    lea    rdi,[rip+0xeb9]        # 2004 <_IO_stdin_used+0x4>
    114b:   b0 00                   mov    al,0x0
    114d:   e8 de fe ff ff          call   1030 <printf@plt>
    1152:   5d                      pop    rbp
    1153:   31 c0                   xor    eax,eax
    1155:   31 ff                   xor    edi,edi
    1157:   c3                      ret

As you can see, the difference is much more subtle. Only the general‑purpose registers that were actually used by the function (RAX and RDI in this case) are explicitly cleared before returning. This drastically reduces the amount of additional instructions compared to the all variant, while still removing valuable attacker‑controlled state that could otherwise be reused as part of a ROP chain.

If a binary relies on any library (libc, etc.) not compiled with this flag, an attacker can still find plenty of gadgets there. That's why this mitigation is most effective for self-contained code such as kernels or statically linked binaries, where every object is built with the flag.

Defenses Against Speculative Execution Attacks

Speculative Attacks

Modern CPUs use speculative execution to improve performance. When the processor encounters a conditional branch, it may predict the outcome and execute instructions from the predicted path before the branch condition is fully resolved. If the prediction is correct, the results are committed; if it is wrong, the architectural effects are discarded.

However, even when speculative execution is rolled back architecturally, microarchitectural side effects remain, such as changes in cache state. Speculation attacks exploit this behavior by deliberately causing mispredicted branches so that the CPU speculatively accesses sensitive data. An attacker can then infer this data through side-channel measurements, leading to vulnerabilities.

This class of attacks became public with the coordinated disclosure of Spectre and Meltdown in January 2018, notably by Google Project Zero.

Speculative Load Hardening (SLH)

To mitigate speculative execution attacks, Clang provides the compilation flag -mspeculative-load-hardening. For x86 targets, this flag supports two main strategies, indirect masking and fencing; in practice, only indirect masking is commonly used for production code. Other options are also available, which we do not present here.

To analyse these mitigations, we will use the following simple C program:

#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc == 2)
        printf("%s", argv[1]);
    return 0;
}

This code is not vulnerable to any speculative execution attack. It is only used as a minimal example that contains a conditional branch.

Here is the corresponding assembly code:

0000000000001140 <main>:
    1140:   55                      push   rbp
    1141:   48 89 e5                mov    rbp,rsp
    1144:   48 83 ec 10             sub    rsp,0x10
    1148:   c7 45 fc 00 00 00 00    mov    DWORD PTR [rbp-0x4],0x0
    114f:   89 7d f8                mov    DWORD PTR [rbp-0x8],edi
    1152:   48 89 75 f0             mov    QWORD PTR [rbp-0x10],rsi
    1156:   83 7d f8 02             cmp    DWORD PTR [rbp-0x8],0x2
    115a:   75 16                   jne    1172 <main+0x32>
    115c:   48 8b 45 f0             mov    rax,QWORD PTR [rbp-0x10]
    1160:   48 8b 70 10             mov    rsi,QWORD PTR [rax+0x10]
    1164:   48 8d 3d 99 0e 00 00    lea    rdi,[rip+0xe99]        # 2004 <_IO_stdin_used+0x4>
    116b:   b0 00                   mov    al,0x0
    116d:   e8 be fe ff ff          call   1030 <printf@plt>
    1172:   31 c0                   xor    eax,eax
    1174:   48 83 c4 10             add    rsp,0x10
    1178:   5d                      pop    rbp
    1179:   c3                      ret

Indirect Masking

As described in the High Level Mitigation Approach section of the LLVM documentation on Speculative Load Hardening, one way to mitigate these attacks is to check loads using branchless code, so that they only proceed along a valid control flow path. To select this strategy, Clang provides the -mllvm -x86-slh-indirect SLH option (enabled by default).

The LLVM documentation illustrates this approach with the following example:

void leak(int data);
void example(int* pointer1, int* pointer2) {
  if (condition) {
    // ... lots of code ...
    leak(*pointer1);
  } else {
    // ... more code ...
    leak(*pointer2);
  }
}

This code can be transformed into a hardened version such as:

uintptr_t all_ones_mask = std::numeric_limits<uintptr_t>::max();
uintptr_t all_zeros_mask = 0;
void leak(int data);
void example(int* pointer1, int* pointer2) {
  uintptr_t predicate_state = all_ones_mask;
  if (condition) {
    // Assuming ?: is implemented using branchless logic...
    predicate_state = !condition ? all_zeros_mask : predicate_state;
    // ... lots of code ...
    //
    // Harden the pointer so it can't be loaded
    pointer1 &= predicate_state;
    leak(*pointer1);
  } else {
    predicate_state = condition ? all_zeros_mask : predicate_state;
    // ... more code ...
    //
    // Alternative: Harden the loaded value
    int value2 = *pointer2 & predicate_state;
    leak(value2);
  }
}

The key aspect here is that the update of predicate_state is performed in a branchless manner. This forces the processor to wait for the branch condition to be resolved before the mask can be correctly computed. The resulting mask is then applied to either the pointer or the loaded value, ensuring that speculative execution cannot access or propagate sensitive data when the control flow is mispredicted.

If we compile our example with this hardening method, the resulting assembly code becomes significantly more verbose. We can see that pointers inside conditional branches are masked using conditional move instructions (CMOV). The key point is that the CPU does not speculate on the outcome of conditional moves: the pointer values become usable only once the branch condition is fully resolved, preventing speculative execution from using or leaking sensitive data along mispredicted paths.

0000000000001140 <main>:
    1140:   55                      push   rbp
    1141:   48 89 e5                mov    rbp,rsp
    1144:   48 83 ec 30             sub    rsp,0x30
    1148:   48 c7 c0 ff ff ff ff    mov    rax,0xffffffffffffffff
    114f:   48 89 45 e0             mov    QWORD PTR [rbp-0x20],rax
    1153:   48 89 e0                mov    rax,rsp
    1156:   48 c1 f8 3f             sar    rax,0x3f
    115a:   48 89 45 e8             mov    QWORD PTR [rbp-0x18],rax
    115e:   c7 45 fc 00 00 00 00    mov    DWORD PTR [rbp-0x4],0x0
    1165:   89 7d f8                mov    DWORD PTR [rbp-0x8],edi
    1168:   48 89 75 f0             mov    QWORD PTR [rbp-0x10],rsi
    116c:   83 7d f8 02             cmp    DWORD PTR [rbp-0x8],0x2
    1170:   75 02                   jne    1174 <main+0x34>
    1172:   eb 12                   jmp    1186 <main+0x46>
    1174:   48 8b 4d e0             mov    rcx,QWORD PTR [rbp-0x20]
    1178:   48 8b 45 e8             mov    rax,QWORD PTR [rbp-0x18]
    117c:   48 0f 44 c1             cmove  rax,rcx
    1180:   48 89 45 d8             mov    QWORD PTR [rbp-0x28],rax
    1184:   eb 59                   jmp    11df <main+0x9f>
    1186:   48 8b 45 e0             mov    rax,QWORD PTR [rbp-0x20]
    118a:   48 8b 4d e8             mov    rcx,QWORD PTR [rbp-0x18]
    118e:   48 0f 45 c8             cmovne rcx,rax
    1192:   48 8b 45 f0             mov    rax,QWORD PTR [rbp-0x10]
    1196:   48 8b 40 10             mov    rax,QWORD PTR [rax+0x10]
    119a:   48 89 ce                mov    rsi,rcx
    119d:   48 09 c6                or     rsi,rax
    11a0:   48 8d 3d 5d 0e 00 00    lea    rdi,[rip+0xe5d]        # 2004 <_IO_stdin_used+0x4>
    11a7:   31 c0                   xor    eax,eax
    11a9:   48 c1 e1 2f             shl    rcx,0x2f
    11ad:   48 09 cc                or     rsp,rcx
    11b0:   e8 7b fe ff ff          call   1030 <printf@plt>
    11b5:   48 8b 55 e0             mov    rdx,QWORD PTR [rbp-0x20]
    11b9:   48 8b 74 24 f8          mov    rsi,QWORD PTR [rsp-0x8]
    11be:   48 89 e1                mov    rcx,rsp
    11c1:   48 c1 f9 3f             sar    rcx,0x3f
    11c5:   48 8d 3d e9 ff ff ff    lea    rdi,[rip+0xffffffffffffffe9]        # 11b5 <main+0x75>
    11cc:   48 39 fe                cmp    rsi,rdi
    11cf:   48 0f 45 ca             cmovne rcx,rdx
    11d3:   48 89 4d d0             mov    QWORD PTR [rbp-0x30],rcx
    11d7:   48 8b 45 d0             mov    rax,QWORD PTR [rbp-0x30]
    11db:   48 89 45 d8             mov    QWORD PTR [rbp-0x28],rax
    11df:   48 8b 4d d8             mov    rcx,QWORD PTR [rbp-0x28]
    11e3:   31 c0                   xor    eax,eax
    11e5:   48 c1 e1 2f             shl    rcx,0x2f
    11e9:   48 09 cc                or     rsp,rcx
    11ec:   48 83 c4 30             add    rsp,0x30
    11f0:   5d                      pop    rbp
    11f1:   c3                      ret
    11f2:   66 2e 0f 1f 84 00 00    cs nop WORD PTR [rax+rax*1+0x0]
    11f9:   00 00 00
    11fc:   0f 1f 40 00             nop    DWORD PTR [rax+0x0]

To better understand this mechanism, let us take a closer look at what this code does in practice.

1. Mask initialization
  mov    rax,0xffffffffffffffff
  mov    QWORD PTR [rbp-0x20],rax 

The compiler initializes a mask with all bits set (0xFFFFFFFFFFFFFFFF). This value represents a speculation-dependent mask that is later used to either preserve or deliberately corrupt pointers, ensuring that mis-speculated execution paths cannot access meaningful data.

2. Extracting the most significant bit of RSP
  mov    rax,rsp
  sar    rax,0x3f
  mov    QWORD PTR [rbp-0x18],rax

This sequence extracts the most significant bit of the stack pointer by performing an arithmetic right shift. Since RSP is expected to hold a canonical user-space address, this bit should normally be zero.

3. First conditional branch

After the if (argc == 2) check, the CPU may speculatively execute one of the branches.

  mov    rax,QWORD PTR [rbp-0x20]
  mov    rcx,QWORD PTR [rbp-0x18]
  cmovne rcx,rax

Here, a conditional move (CMOVNE) is used to update the mask stored in RCX.

The important property is that the CPU does not speculate on the result of a conditional move: it must wait for the condition flags to be resolved before applying the move, which creates a data dependency on the branch condition.

Once the condition is resolved, speculative execution may continue, but the mask will now correctly reflect the control-flow outcome. Depending on the condition, the mask will be either 0xFFFF_FFFF_FFFF_FFFF or 0x0000_0000_0000_0000.

4. Applying the mask to the pointer

At this point, RCX holds the computed mask.

  mov    rax,QWORD PTR [rbp-0x10]
  mov    rax,QWORD PTR [rax+0x10]
  mov    rsi,rcx                         # copy the mask into RSI
  or     rsi,rax                         # OR the pointer to argv[1] with the mask
  lea    rdi,[rip+0xe5d]        # 2004 <_IO_stdin_used+0x4>
  xor    eax,eax
  shl    rcx,0x2f
  or     rsp,rcx
  call   1030 <printf@plt>

Here, the pointer to argv[1] is combined with the mask using a bitwise OR. If the execution path is valid, the mask preserves the pointer value. If the path is invalid due to mis-speculation, the mask corrupts the pointer, preventing it from being used to access meaningful data.

⚠️ Enabling this feature has a notable performance impact, but it is still less costly than inserting LFENCE barriers everywhere (see next section). An attribute [3] can be used to restrict hardening to specific functions, avoiding the need to slow down the entire program [4].

LFENCE barriers

Another way to avoid speculative execution on branches is to insert LFENCE instructions after all conditional jumps to act as speculation barriers. To select this strategy, Clang provides the SLH option -mllvm -x86-slh-lfence. The LFENCE instruction does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes. As such, it acts as a speculation barrier: the instructions following the branch cannot execute before the branch condition is resolved.

push   rbp
mov    rbp,rsp
sub    rsp,0x10
mov    DWORD PTR [rbp-0x4],0x0
mov    DWORD PTR [rbp-0x8],edi
mov    QWORD PTR [rbp-0x10],rsi
cmp    DWORD PTR [rbp-0x8],0x2
jne    1175 <main+0x35>
lfence
mov    rax,QWORD PTR [rbp-0x10]
mov    rsi,QWORD PTR [rax+0x10]
lea    rdi,[rip+0xe96]        # 2004 <_IO_stdin_used+0x4>
xor    eax,eax
call   1030 <printf@plt>
lfence
xor    eax,eax
add    rsp,0x10
pop    rbp
ret

⚠️ We do not recommend using this compilation flag as its impact on program performance is very significant. It is mentioned here for completeness only.

Conclusion

Over the years, the list of recommended compiler hardening options has significantly evolved, driven by the continuous discovery of new classes of vulnerabilities and exploitation techniques. New compilation flags have been introduced to mitigate threats such as control-flow hijacking, speculative execution attacks, and information leaks at both the architectural and micro-architectural levels.

While these mitigations greatly improve the security posture of compiled binaries, they often come with a non-negligible performance cost. As a result, enabling them requires careful consideration and an informed trade-off between security and performance, depending on the threat model and deployment context.

The growing awareness that security is a first-class concern in modern software development suggests that these compiler options will continue to evolve. New mitigations will be added, existing ones will be refined, and defaults may change over time. Staying up to date with compiler developments and security recommendations is therefore essential.

Finally, compiler hardening options, while powerful, are not sufficient on their own. They must be combined with good programming practices, regular code reviews and audits. In addition, software protection techniques such as obfuscation, integrity checks or Runtime Application Self Protection (RASP) make the identification and exploitation of vulnerabilities more difficult and slow down attackers.

Acknowledgments

We would like to thank our colleagues Laurent Laubin and Rémy Salim for their thorough review of this article as well as their suggestions.


  1. A developer’s guide to secure coding with FORTIFY_SOURCE, Sandipan Roy, July 2023. 

  2. The Stack Clash, Qualys Research Team, December 2022. 

  3. Attributes in Clang

  4. Comparing GCC and Clang security features, Jonathan Corbet, September 2019. 


If you would like to learn more about our security audits and explore how we can help you, get in touch with us!