A glance at compiler internals: Keep my memset

Why do some memset calls get optimized away by the compiler? Let's investigate!

The C11 standard introduced the memset_s(3) function, a secure version of memset(3) that:

cannot be optimized away by the compiler
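For reference, here is a minimal sketch of what a call to the C11 function can look like. It assumes the optional Annex K interface, which many C libraries do not ship, so the call is guarded by the __STDC_LIB_EXT1__ feature macro. The wrapper name wipe_with_memset_s is hypothetical and only serves the illustration. Note that the real memset_s takes four arguments (destination, destination size, fill byte, count).

```c
#define __STDC_WANT_LIB_EXT1__ 1 /* request Annex K; must precede the includes */
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper (not part of any library).
   Returns 0 if the buffer was wiped with memset_s,
   -1 if Annex K is unavailable or the call failed. */
int wipe_with_memset_s(void *buf, size_t len) {
#ifdef __STDC_LIB_EXT1__
  /* errno_t memset_s(void *s, rsize_t smax, int c, rsize_t n); */
  return memset_s(buf, len, 0, len) == 0 ? 0 : -1;
#else
  (void)buf;
  (void)len;
  return -1; /* no Annex K here: the caller needs another technique */
#endif
}
```

On a platform without Annex K the function simply reports -1, which is why the rest of this post looks at what the compiler does with a plain memset.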

Why would a compiler optimize my memset away? Let's have a look at the following example, adapted from "MSC06-C. Beware of compiler optimizations":

#include <string.h>
extern int read_password(char[], int);
extern void process(char const [], int);
void get_password(void) {
  char pwd[64];
  if(read_password(pwd, sizeof pwd)) {
    // Checking of password, secure operations, etc
    process(pwd, sizeof pwd);
  }
  memset(pwd, 0, sizeof pwd);
}

The memset is here to overwrite the buffer holding the secret password with zero bytes. How come a compiler would optimize it away? Let's use Clang and LLVM to follow what happens inside the compiler. (We use the LLVM infrastructure because its intermediate representation is easy to work with, but the ideas are similar for other compilers such as GCC.)

$ clang -S -O0 -emit-llvm getpwd.c

The -S -emit-llvm flags let clang emit LLVM IR in textual form instead of assembly, so we can have a look at the unoptimized (-O0) translation:

define void @get_password() #0 {
  %pwd = alloca [64 x i8], align 16
  %1 = getelementptr inbounds [64 x i8]* %pwd, i32 0, i32 0
  %2 = call i32 @read_password(i8* %1, i32 64)
  %3 = icmp ne i32 %2, 0
  br i1 %3, label %4, label %6
; <label>:4                                       ; preds = %0
  %5 = getelementptr inbounds [64 x i8]* %pwd, i32 0, i32 0
  call void @process(i8* %5, i32 64)
  br label %6
; <label>:6                                       ; preds = %4, %0
  %7 = bitcast [64 x i8]* %pwd to i8*
  call void @llvm.memset.p0i8.i64(i8* %7, i8 0, i64 64, i32 16, i1 false)
  ret void
}

Nothing terrible there: LLVM uses a Static Single Assignment (SSA) form and a control flow graph (CFG) made of basic blocks (BB), identified by labels, with jumps between blocks (e.g. the br instruction). One interesting thing, though, is that memset got parsed as a special LLVM intrinsic, llvm.memset.*. See the documentation for a detailed description of its behavior. This means that LLVM is likely to model the behavior of the function accurately. Indeed, if we run the LLVM optimizer, opt, with one specific optimization pass:

$ opt -dse getpwd.ll -S

we get:

define void @get_password() #0 {
  %pwd = alloca [64 x i8], align 16
  %1 = getelementptr inbounds [64 x i8]* %pwd, i32 0, i32 0
  %2 = call i32 @read_password(i8* %1, i32 64)
  %3 = icmp ne i32 %2, 0
  br i1 %3, label %4, label %6
; <label>:4                                       ; preds = %0
  %5 = getelementptr inbounds [64 x i8]* %pwd, i32 0, i32 0
  call void @process(i8* %5, i32 64)
  br label %6
; <label>:6                                       ; preds = %4, %0
  ret void
}

The llvm.memset is gone! A quick glance at the pass description tells us more about Dead Store Elimination (DSE): it is a compiler optimization that removes memory operations that are dead, typically a store that is never read afterward. That is exactly the case in our function: pwd is a local array that nobody reads after the memset. To confirm this, let's use the result of the memset in the C function:

memset(pwd, 0, sizeof pwd);
return pwd[0];
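Spelled out in full, the modified function might look like the sketch below. The name get_password_checked and the two stub bodies are assumptions made only so that the example compiles and runs on its own. To reproduce the experiment from this post, keep read_password and process as extern declarations instead.

```c
#include <string.h>

/* Hypothetical stand-ins for the post's extern functions. */
static int read_password(char buf[], int len) {
  memset(buf, 'x', (size_t)len); /* pretend a secret was read */
  return 1;
}

static void process(char const buf[], int len) {
  (void)buf;
  (void)len;
}

/* Variant of get_password whose return value depends on the memset. */
int get_password_checked(void) {
  char pwd[64];
  if (read_password(pwd, sizeof pwd)) {
    /* Checking of password, secure operations, etc. */
    process(pwd, sizeof pwd);
  }
  memset(pwd, 0, sizeof pwd);
  return pwd[0]; /* reads the buffer after the memset */
}
```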

Eeny, meeny, miny, moe... the memset is no longer optimized away. LLVM even fails to simplify the function so that it always returns 0... Trying other optimization passes still keeps the memset! Although very pedagogical, this example is unlikely to find its way into real code. The careful reader of the llvm.memset documentation may have noticed that the intrinsic can be made volatile. A few RTFMs later, we know that we just need to make llvm.memset volatile! I don't know how to do this from C, but a common solution to the problem is to use a volatile function pointer:

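#include <string.h>
extern int read_password(char[], int);
extern void process(char const [], int);
/* Volatile pointer to memset: it must be reloaded before each call.
   Note: this memset_s is our own pointer, not the C11 Annex K function. */
void *(* volatile memset_s)(void *s, int c, size_t n) = memset;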
void get_password(void) {
  char pwd[64];
  if(read_password(pwd, sizeof pwd)) {
    /* Checking of password, secure operations, etc. */
    process(pwd, sizeof pwd);
  }
  memset_s(pwd, 0, sizeof pwd);
}

The meaning of volatile in C is that the content of a volatile variable can change behind the compiler's back, as described more accurately in http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2016.html. When translated to LLVM IR, the load of the function pointer is marked as volatile, and the compiler can no longer optimize the call away: it has no way to prove that the pointer still points to the memset function.

@memset_s = global i8* (i8*, i32, i64)* @memset, align 8
; Function Attrs: nounwind
declare i8* @memset(i8*, i32, i64) #0
; Function Attrs: nounwind uwtable
define void @get_password() #1 {
  %pwd = alloca [64 x i8], align 16
  %1 = getelementptr inbounds [64 x i8]* %pwd, i32 0, i32 0
  %2 = call i32 @read_password(i8* %1, i32 64)
  %3 = icmp ne i32 %2, 0
  br i1 %3, label %4, label %6
; <label>:4                                       ; preds = %0
  %5 = getelementptr inbounds [64 x i8]* %pwd, i32 0, i32 0
  call void @process(i8* %5, i32 64)
  br label %6
; <label>:6                                       ; preds = %4, %0
  %7 = load volatile i8* (i8*, i32, i64)** @memset_s, align 8
  %8 = getelementptr inbounds [64 x i8]* %pwd, i32 0, i32 0
  %9 = call i8* %7(i8* %8, i32 0, i64 64)
  ret void
}

Alternatives to the volatile function pointer trick are described on this great site. Read it!
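One commonly seen alternative, sketched here under assumptions, is to write the zeros through a volatile-qualified pointer, so that every byte store is a volatile access. The helper name secure_zero is hypothetical. Strictly speaking, the C standard only makes guarantees for objects that are themselves declared volatile, so treat this as a technique that mainstream compilers honor in practice, not as a hard guarantee.

```c
#include <stddef.h>

/* Hypothetical helper (not from any library): zero a buffer by
   storing each byte through a volatile-qualified pointer. */
void secure_zero(void *buf, size_t len) {
  volatile unsigned char *p = buf;
  while (len--) {
    *p++ = 0; /* volatile store: not treated as a dead store */
  }
}
```

In get_password, the final memset(pwd, 0, sizeof pwd) would then become secure_zero(pwd, sizeof pwd).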
