Golang is one of the most widely used programming languages for developing cloud technologies. Tools such as Kubernetes, Docker, Containerd and gVisor are all written in Go. Although the code of these programs is open source, there is no obvious way to analyze and extend their behaviour dynamically (for example through binary instrumentation) without recompiling them. Is this due to the complex internals of the language or is there something else? In this third blog post, we will demonstrate how to dynamically instrument Golang code by implementing the function hooks described in the first blog post. Furthermore, we will tackle the limitations of this approach using the FFI (Foreign Function Interface) of Golang, which we studied in the second blog post of this series.

A Golang gopher motorcycling from one Go world to another (by Ashley McNamara)

Introduction

Hooking, also known as a "detour", is a mechanism for unconditionally redirecting the execution flow of a program. There is a lot of literature on the Internet on how this can be done for languages such as C and C++. However, hooking Go code at runtime is not a straightforward process. It gets even more interesting when one tries to hook Go code with Go code, which leads down a deep rabbit hole. One would think that it should be more natural to manipulate Golang data structures using Golang, right? In this series of blog posts, we’ll present several rather interesting strategies that we’ve developed at Quarkslab to achieve that. Before going back into the rabbit hole, let’s briefly recall why we got interested in implementing detours for Go programs and why it is more complicated than for programs written in C or C++.

Why Hook Golang Programs During Runtime and What are the Challenges?

Nowadays, most modern cloud technologies are written in Golang - Kubernetes, Docker, Containerd, runc and gVisor to name a few. All of them have big and complex architectures which are cumbersome to analyze statically. It would be great to have the means to examine these tools dynamically alongside the static analysis. Unfortunately, at the time of writing, we are not aware of any solution besides eBPF for dynamic analysis that does not require recompiling the source code of the programs. However, we are not interested in eBPF because, as we described in the first blog post, manipulating Go types and structures in C is not fun. Sometimes we can’t modify the source code of Go programs, and we have to interact directly with the process that is already executing the code. But why aren’t there any tools in the wild which allow the insertion of arbitrary logic inside a running Go program? We suppose that one of the problems could be that Gopl (the Golang programming language) has a different ABI (Application Binary Interface) than the one used in C and C++ (hence, Frida does not work out of the box :( ). In addition, Golang incorporates a language-specific runtime which is responsible for complex procedures such as garbage collection and the scheduling of goroutines. This runtime, and the way it is placed inside the program, completely changes the way we construct and insert hooks. Last but not least, Gopl was initially intended to be self-contained: it was not designed to be extensible during runtime (for example, by loading shared libraries). Fortunately, this changed, but Go programs are still statically linked unless they use the net or os/user packages or make use of cgo. However, with some tweaking we were able to circumvent these problems.

In the first blog post, we demonstrated how to hook Golang programs at runtime using C and pure assembly on the x86-64 CPU architecture, and we discussed the difficulties and limitations of this approach. In the second blog post, we studied how FFI works in Golang and what happens when a Go function invokes another Go function using CGO. Today, we are going to use FFI to try to circumvent the limitations of the approach described in the first blog post and achieve our goal - dynamically hook Go code using Go!

Note: The content below was produced on x86_64 Linux with go version go1.21.1

Defining the Target

We are going to use the very same example, that we call the target or the host, from our first blog post:

// secret.go
package main

import (
    "fmt"
    "os"
    "strings"
)
import "C"
// the "import C" statement is needed for the compiler to produce a binary 
// which is going to load libc.so when launched. This is needed 
// for the side-loading of the hook module.
var SECRET string = "VALIDATEME"

func theGuessingGame(s string) bool {
    if s == SECRET {
        fmt.Println("Authorized")
        return true
    } else {
        fmt.Println("Unauthorised")
        return false
    }

}

func main() {
    var s string

    for {
        if _, err := fmt.Scanf("%s", &s); err != nil {
            panic(err)
        }
        s = strings.ToLower(s)
        if theGuessingGame(s) {
            os.Exit(0)
        }
    }
}

Our goal is the same: hijack the execution flow of theGuessingGame and redirect it to another Go function that is external to the program. The latter should uppercase the letters of the UTF-8 encoded string so that we take the os.Exit branch and terminate the program. In the first part, we called this procedure a trampoline hook. Why do we need to uppercase the letters of the string rather than just return true when the execution flow enters theGuessingGame? Because the ultimate goal of this series of blog posts is to explore how two separate pieces of Go code can be coupled together when they are not compiled as one. How do we introduce this external function into the environment of an already compiled and running program? There are different ways to do that, but we are going to stick with the one that we used in the first part of the series - we will place it in a shared library which we will side-load into the target program while the latter is running. We will not cover the process of dynamically loading this library as there are plenty of examples of how this can be done.

Compiling Go Code Into a Shared Library

The next question is: how do we compile Go code into a shared library? Fortunately, the answer is simple - CGO and the c-shared build mode. What comes next is the content of this shared library. Let us define the logic with which we want to extend theGuessingGame:

// hook.go
package main

import (
    "C"
    "strings"
)

// The directive below "//export UpperCaseString" is going to
// make the Go function callable by C code
// by generating a C adapter interface

//export UpperCaseString
func UpperCaseString(s string) {
    // Strings in Go are immutable hence,
    // uppercasing involves creating a new string
    // with the modified contents of the previous one
    s = strings.ToUpper(s)
}

// An (empty) main function is required by the c-shared build mode
func main() {}

The above program can be compiled into a Linux shared library using the build frontend as follows:

$ go build -buildmode c-shared -o /tmp/hooks.so hook.go

After being compiled, we can find the following symbols inside the library which are associated to the UpperCaseString function:

$ go tool nm /tmp/hooks.so | grep UpperCaseString
6f6a0 T UpperCaseString
6f520 T _cgoexp_7b357ce951e1_UpperCaseString

In the second blog post of this series, we provided an overview of what CGO does to make Go code callable by C programs - it generates a bunch of Go and C boilerplate to glue the two languages and their different calling conventions together. The exported text symbol (T) UpperCaseString is a compiled C function generated by CGO which allows the function identified by the _cgoexp_7b357ce951e1_UpperCaseString text symbol (the actual code we have presented above) to be called from a C environment. As we are going to redirect the execution flow of a Go program, the question is: to which of the two functions should we redirect it? It is time to Go back into the rabbit hole ;)

First Try - Jump Directly to the Replacement Function

First, we are going to explore what happens if the flow is redirected to _cgoexp_7b357ce951e1_UpperCaseString. As we mentioned, this is the Go code which is going to extend the theGuessingGame function and uppercase the letters of the string argument before the verification performed inside the function. Throughout this third part, we will use GDB for our experiments. We will stop (break) at the very beginning of our target function and set the instruction pointer (RIP) to the beginning of _cgoexp_7b357ce951e1_UpperCaseString. This is not a trampoline hook as we will not return and execute the original function code. Nevertheless, it is important to make sure that the logic executes without any problem before implementing the actual hooking procedure.

The _cgoexp_7b357ce951e1_UpperCaseString function was side-loaded in the target program, after it had been launched, as part of the shared library that we have compiled above. To redirect the execution flow after loading the shared library, we have to identify the location of the replacement function.

Finding the Address of UpperCaseString

This can be done, for example, by outputting its address inside a constructor function of the shared library:

// hook.go
...
// The constructor below will be executed when the Linux dynamic loader (`ld.so`) loads the library
func init() {
    // Note: this is not the actual address but a pointer to where we can find the address
    // in gdb it can be extracted with x/2xg @address and then taking the first word
    println("Address: ", UpperCaseString)
}
...

After loading the library we retrieve the address 0x7f26941274e0. Additionally, in GDB we retrieve the following information:

# inspect where the library was loaded in the VAS (virtual address space) of the target
(gdb) info proc mappings
...
0x7f26940b8000     0x7f26940cd000    0x15000        0x0  r--p   /tmp/hooks.so
0x7f26940cd000     0x7f2694129000    0x5c000    0x15000  r-xp   /tmp/hooks.so
0x7f2694129000     0x7f2694143000    0x1a000    0x71000  r--p   /tmp/hooks.so
0x7f2694143000     0x7f2694144000     0x1000    0x8b000  ---p   /tmp/hooks.so
0x7f2694144000     0x7f269419f000    0x5b000    0x8b000  r--p   /tmp/hooks.so
0x7f269419f000     0x7f26941a6000     0x7000    0xe6000  rw-p   /tmp/hooks.so

# inspect the function contents at the address
(gdb) disas 0x7f26941274e0
Dump of assembler code for function <redacted>/main.UpperCaseString:
   0x00007f26941274e0 <+0>: cmp    rsp,QWORD PTR [r14+0x10]
   0x00007f26941274e4 <+4>: jbe    0x7f26941274fe <main.UpperCaseString+30>
   0x00007f26941274e6 <+6>: push   rbp
   0x00007f26941274e7 <+7>: mov    rbp,rsp
   0x00007f26941274ea <+10>:    sub    rsp,0x10
   0x00007f26941274ee <+14>:    mov    QWORD PTR [rsp+0x20],rax
   0x00007f26941274f3 <+19>:    call   0x7f2694127020 <strings.ToUpper>
   0x00007f26941274f8 <+24>:    add    rsp,0x10
   0x00007f26941274fc <+28>:    pop    rbp
   0x00007f26941274fd <+29>:    ret
   0x00007f26941274fe <+30>:    mov    QWORD PTR [rsp+0x8],rax
   0x00007f2694127503 <+35>:    mov    QWORD PTR [rsp+0x10],rbx
   0x00007f2694127508 <+40>:    call   0x7f2694121e60 <runtime.morestack_noctxt>
   0x00007f269412750d <+45>:    mov    rax,QWORD PTR [rsp+0x8]
   0x00007f2694127512 <+50>:    mov    rbx,QWORD PTR [rsp+0x10]
   0x00007f2694127517 <+55>:    jmp    0x7f26941274e0 <main.UpperCaseString>

The above assembly code looks pretty much like what we expected to see. It is contained in a shared library, which is compiled as position-independent code (PIC), so its load address is only known at runtime.
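Because the library can be loaded anywhere, it is handy to translate a runtime address back into an offset relative to the library. Here is a minimal sketch using the values from the GDB session above; `libOffset` is a hypothetical helper, and we assume that the first mapping of /tmp/hooks.so (file offset 0x0) is the base against which the symbol values of the same build are expressed:

```go
package main

import "fmt"

// libOffset converts a runtime address into an offset relative to the
// start of the first mapping of the library.
func libOffset(runtimeAddr, mappingBase uint64) uint64 {
	return runtimeAddr - mappingBase
}

func main() {
	const (
		funcAddr = 0x7f26941274e0 // address retrieved after loading the library
		libBase  = 0x7f26940b8000 // first /tmp/hooks.so mapping (file offset 0x0)
	)
	fmt.Printf("%#x\n", libOffset(funcAddr, libBase)) // prints 0x6f4e0
}
```

The resulting offset falls inside the r-xp mapping (file offsets 0x15000 to 0x71000), which is consistent with it pointing at code.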

Finding the Address of theGuessingGame

On the other hand, the target program was not compiled as PIE (Position Independent Executable); hence, the addresses of its symbols are fixed. Thus, we can easily identify the location of the theGuessingGame function:

$ go tool nm target | grep "theGuessingGame"
4939c0 t main.theGuessingGame

(gdb) disassemble 0x4939c0
Dump of assembler code for function main.theGuessingGame:
   0x00000000004939c0 <+0>: cmp    rsp,QWORD PTR [r14+0x10]
   0x00000000004939c4 <+4>: jbe    0x493a94 <main.theGuessingGame+212>
   0x00000000004939ca <+10>:    sub    rsp,0x50
   0x00000000004939ce <+14>:    mov    QWORD PTR [rsp+0x48],rbp
   0x00000000004939d3 <+19>:    lea    rbp,[rsp+0x48]
   0x00000000004939d8 <+24>:    mov    QWORD PTR [rsp+0x58],rax
   0x00000000004939dd <+29>:    mov    rdx,QWORD PTR [rip+0xa2f9c]        # 0x536980 <main.SECRET>
   0x00000000004939e4 <+36>:    cmp    QWORD PTR [rip+0xa2f9d],rbx
...

Changing the Instruction Pointer

We will break at the very beginning of the function and not after the stack resizing prologue (i.e. 0x4939ca) which is where GDB will insert a breakpoint if you do break main.theGuessingGame.

(gdb)  break *0x04939c0
(gdb)  continue
Continuing.
# in another TTY we supply a string input to reach the break point
...
(gdb)
Thread 1 "target" hit Breakpoint 1, main.theGuessingGame (s=..., ~r0=<optimized out>)
    at <redacted> ...
(gdb) disassemble $rip
Dump of assembler code for function main.theGuessingGame:
=> 0x00000000004939c0 <+0>: cmp    rsp,QWORD PTR [r14+0x10]
...

Let us set the instruction pointer to the beginning of UpperCaseString (associated to symbol _cgoexp_7b357ce951e1_UpperCaseString):

(gdb) set $rip=0x7f26941274e0
(gdb) continue
Continuing.
Thread 1 "target" received signal SIGSEGV, Segmentation fault.
runtime.morestack () at /usr/local/go/src/runtime/asm_amd64.s:555
555     MOVQ    g_m(BX), BX // 0x0000000000000000 RBX is a NULL pointer

We received a segmentation fault because of a NULL pointer dereference. To understand the reason for it, we have to dig deeper:

(gdb) backtrace
#0  runtime.morestack () at /usr/local/go/src/runtime/asm_amd64.s:555
#1  0x00007f269412750d in <redacted>/main.UpperCaseString (s=...) at <redacted>/hook.go:13
#2  0x0000000000493c51 in main.main () at <redacted>/secret.go:35
(gdb) print $rip
$1 = (void (*)()) 0x7f2694121dcb <runtime.morestack+11>
(gdb) disassemble $rip
Dump of assembler code for function runtime.morestack:
   0x00007f2694121dc0 <+0>: mov    rcx,QWORD PTR [rip+0x7d201]        # 0x7f269419efc8
   0x00007f2694121dc7 <+7>: mov    rbx,QWORD PTR fs:[rcx]
=> 0x00007f2694121dcb <+11>:    mov    rbx,QWORD PTR [rbx+0x30]
...
(gdb) print $rbx
$4 = 0
(gdb) p/x $rcx
$4 = 0xffffffffffffff58 # -168

The code fails in runtime.morestack because the value of RBX is 0. There are several interesting things in the above GDB log:

we are executing runtime code contained in the shared library that we have just loaded and not in the target program which also has its own one;
the offset (-168) in the TLS register (fs) is taken from a memory location inside the shared library (0x7f269419efc8).

When compiling Go code as a shared library using CGO, the Go runtime is included inside it. This is expected, as FFI makes Go code accessible from C, where there is no Go runtime. So by loading a CGO shared library inside a Go program, we end up with two separate runtimes! And this, as we will see later, can cause a lot of headaches. However, the above does not explain the reason for the segmentation violation. Let us dig deeper into the code of the runtime.morestack function, but this time in the target program:

# address obtained with "go tool nm target | grep 'runtime.morestack'"
(gdb) disassemble 0x45d820
Dump of assembler code for function runtime.morestack:
   0x000000000045d820 <+0>: mov    rbx,QWORD PTR fs:0xfffffffffffffff8 # <===== here we encounter a different offset taken from FS
   0x000000000045d829 <+9>: mov    rbx,QWORD PTR [rbx+0x30]

The two seemingly identical procedures do not access the TLS in the same way. Go uses the TLS register as a base to store a pointer to the current goroutine (G) executing on the OS thread (M). The runtime code in the shared library reads the slot at offset -168 from fs (an offset which is itself loaded from an address inside the library), and this slot contains 0. On the other hand, the runtime code inside the target program reads the slot at offset -8 from fs. We think that these offsets differ because of how CGO and the shared libraries it creates work.
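The two offsets are easier to compare once the raw 64-bit displacements shown by GDB are read as signed values. A minimal sketch of that conversion, where `signedOffset` is a hypothetical helper name:

```go
package main

import "fmt"

// signedOffset reinterprets a raw 64-bit displacement as a
// two's complement signed value.
func signedOffset(raw uint64) int64 {
	return int64(raw)
}

func main() {
	fmt.Println(signedOffset(0xffffffffffffff58)) // runtime in the shared library: -168
	fmt.Println(signedOffset(0xfffffffffffffff8)) // runtime in the target: -8
}
```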

In the previous blog post, we saw that the _cgoexp_7b357ce951e1_UpperCaseString procedure is not supposed to be called directly even though it contains the actual code that we wrote and want to execute. There is a C-generated interface called UpperCaseString which, when invoked, is going to either retrieve an existing Go context (G, P, etc.) or initialise a new one, and then execute the _cgoexp_7b357ce951e1_UpperCaseString function. When jumping directly to _cgoexp_7b357ce951e1_UpperCaseString, we end up with an uninitialized/unretrieved Go context thus, we receive a segmentation fault. To fix that, we will have to play by the CGO rules.

Second Try - Jump to the CGO generated C interface

To solve the above problem, we could try calling UpperCaseString directly. But there are several problems that we need to solve before that:

the interface is supposed to be called on a system thread which has a much larger stack than a goroutine;
it is also supposed to be called using the System-V ABI and not the Go custom ABI;

What steps can we take to address these issues? We can do the same thing as we did in the first blog post:

hijack the execution flow;
use a mutex or similar to create an atomic section;
pivot the stack;
save the caller-saved registers (preserve the Go context);
perform an ABI switch from Go to C;
call the C-generated interface;
switch back the ABI from C to Go;
restore back the caller-saved registers;
pivot back the stack;
release the lock;
resume the execution.

But by doing that, we are not solving the problems we have identified in the first blog post:

the scheduler still will not be able to preempt the hijacked goroutine hence, it will starve the other Gs waiting on the same P;
the GC could not perform the sweep cycle because it will not be able to preempt the hijacked goroutine either.

The above will introduce global program latency and increased memory consumption. Can we circumvent all that? We could try to reproduce by hand the part of the Go and C code generated by CGO which calls into runtime.cgocall, as we have seen in the second article of this series. This should be enough as the shared library compiled with CGO contains all that is necessary to safely execute _cgoexp_7b357ce951e1_UpperCaseString.

In most Go programs, we can find the runtime.cgocall function.

It receives as arguments a pointer to an argument frame and a pointer to a function. The latter is automatically generated by CGO, and will serve to unpack the argument frame following the C ABI. This frame is created by another generated CGO procedure which is also the one calling into runtime.cgocall.
After runtime.cgocall executes and the argument frame is unpacked, it is safe to call a CGO-generated C interface behind which there is a Go function.

In our case, the function which the frame unpacker procedure should call is UpperCaseString.

There are two components which are missing in our example - the function that constructs the argument frame and calls into runtime.cgocall and the one unpacking the frame and calling into the UpperCaseString function.

Creating the Argument Frame Constructor

An argument frame is a notion from the old Go stack-based ABI. It is a region on the stack where the caller of a function stores the arguments and allocates space for the return values. The first argument is stored at the highest address while the last return value is at the lowest.

In our case, we should pivot the arguments of the target from the CPU registers to the stack. As the argument we want to pass to the replacement procedure is a Go string, we need to store in the frame the two elements representing a string in Go - its size and a pointer to where the actual UTF-8 encoded bytes are stored in memory. We are going to construct the argument frame using Plan9 assembly. This is an intermediate assembly representation used by the Go compiler which is architecture-independent and it facilitates cross-compilation:

Reminder: When hijacking the execution flow at the beginning of the target function, a pointer to the UTF-8 bytes of the string is stored in RAX while its size is stored in RBX

// func FrameConstructor() 
TEXT ·FrameConstructor(SB), NOSPLIT|NOFRAME, $32-0 

    // "$32-0" - declares required stack space of 32 bytes - three slots of size 8 bytes and one 8 byte slot for the return address of the caller of the routine ==> 32 bytes; 
    // NOFRAME - indicates that there is no need for a local argument frame (push rbp; mov rbp, rsp; ... pop rbp)
    // NOSPLIT - indicates that there is no need for a stack resizing prologue and epilogue

    MOVQ  AX, frameSlotArg0-8(SP)    // store the pointer to the UTF-8 bytes onto the stack; this is the beginning of the frame
    MOVQ  BX, frameSlotArg1-16(SP)   // store the size of the UTF-8 bytes array onto the stack
    LEAQ  frameptr-8(SP), BX         // get the beginning of the argument frame in BX, this is the second argument of runtime.cgocall
    MOVQ  BX, frame-24(SP)           // the beginning of the frame should also be stored on the stack
    LEAQ  ·FrameUnpacker+0(SB), AX   // get the address of the frame unpacker, this is the first argument of runtime.cgocall
    XORPS X15, X15                   // X15 should be set to 0 
    // below is the address of runtime.cgocall in the host. It is obtained with: go tool nm target | rg runtime.cgocall
    MOVQ  $0x405060, R12           // R12 is a scratch register so we can use it without backing it up
    CALL  R12
    // Restore the stack frame back to the registers following the Go ABI
    MOVQ  frameSlotArg0-8(SP), AX
    MOVQ  frameSlotArg1-16(SP), BX
    RET

If you are not familiar with Plan 9 assembly, we suggest that you take a look at the resources related to Go assembly in the reference section.

We can summarise what the above abstract assembly code does as follows:

stores the register arguments (the pointer to the UTF-8 byte array as well as its length) consecutively on the stack;
loads the address of the beginning of the frame in BX (resp. RBX on x86-64). This is the second argument of runtime.cgocall and it should also be stored on the stack after the two arguments. We are not sure why, but that is how CGO does it so we are going to stick with it;
loads the address of the frame unpacker into AX (resp. RAX on x86-64). This is the first argument of runtime.cgocall.
loads the address of runtime.cgocall extracted from the target program into R12 and calls it;
restores the two values stored into the stack frame back to their registers.

If you are more into architecture-specific assembly, here is the representation for x86-64 of the above abstract code:

0x00007f7098351160 <+0>:    sub    rsp,0x20
0x00007f7098351164 <+4>:    mov    QWORD PTR [rsp+0x18],rax
0x00007f7098351169 <+9>:    mov    QWORD PTR [rsp+0x10],rbx
0x00007f709835116e <+14>:   lea    rbx,[rsp+0x18]
0x00007f7098351173 <+19>:   mov    QWORD PTR [rsp+0x8],rbx
0x00007f7098351178 <+24>:   lea    rax,[rip+0x21]        # 0x7f70983511a0 <main.FrameUnpacker>
0x00007f709835117f <+31>:   xorps  xmm15,xmm15
0x00007f7098351183 <+35>:   mov    r12,0x405060
0x00007f709835118a <+42>:   call   r12
0x00007f709835118d <+45>:   mov    rax,QWORD PTR [rsp+0x18]
0x00007f7098351192 <+50>:   mov    rbx,QWORD PTR [rsp+0x10]
0x00007f7098351197 <+55>:   add    rsp,0x20
0x00007f709835119b <+59>:   ret

Normally, before calling into runtime.cgocall, CGO will ensure that the pointer to the current goroutine is stored in R14. However, as we are hijacking the execution flow at the very beginning of the function, we do not need to modify R14. If you look back at the stack resizing prologue, you will see that the stack pointer is compared against the stack guard of the current goroutine, which is read through R14 (cmp rsp,QWORD PTR [r14+0x10]). The Go compiler always puts in R14 a pointer to the current G before calling a function (except for NOSPLIT assembly routines). But how can we be sure that we will have 32 bytes on the stack when executing the assembly code? Fortunately, as we explained in the first blog post, the Go compiler reserves additional stack space for each function so that NOSPLIT assembly functions in the runtime can execute safely. Furthermore, in the previous blog post we have already described what happens after the call to runtime.cgocall. In the end, the first argument of the latter is called using the x86-64 System V ABI.

Creating the Argument Frame Unpacker

The frame unpacker (as we call it) is the function given as first argument to runtime.cgocall. It is the ABI adapter which unpacks the argument frame stored on the stack into registers following the System V ABI and then calls into a function using the same ABI:

// func FrameUnpacker()
TEXT ·FrameUnpacker(SB), NOSPLIT|NOFRAME, $56-0
    // Here we are about to indirectly transition from C to Go ABI after calling into "UpperCaseString" function
    // where all registers are considered caller-saved.
    // Hence, we will save the callee-saved registers to preserve the System V calling convention
    MOVQ BX, backupreg0-8(SP)
    MOVQ BP, backupreg1-16(SP)
    MOVQ R12, backupreg2-24(SP)
    MOVQ R13, backupreg3-32(SP)
    MOVQ R14, backupreg4-40(SP)
    MOVQ R15, backupreg5-48(SP)
    // DI already holds the address of the argument frame (the first System V argument);
    // SI receives the address of the next slot, 8 bytes lower
    MOVQ DI, BX 
    SUBQ $0x08, BX
    MOVQ BX, SI
    LEAQ UpperCaseString+0(SB), R12
    CALL R12
    // Restore the callee-saved registers
    MOVQ backupreg5-48(SP), R15
    MOVQ backupreg4-40(SP), R14
    MOVQ backupreg3-32(SP), R13
    MOVQ backupreg2-24(SP), R12
    MOVQ backupreg1-16(SP), BP
    MOVQ backupreg0-8(SP), BX
    RET

We can summarise what the above assembly snippet does as follows:

stores the callee-saved registers on the stack. These are BX, BP, R12, R13, R14, R15 (resp. RBX, RBP, R12, R13, R14, R15);
puts in DI (resp. RDI) the pointer to the UTF-8 byte array and in SI (resp. RSI) the length of the array;
puts in R12 the address of the CGO-generated C interface and calls into it;
restores the callee-saved registers from the stack.

In x86-64 this would look like:

# the listing below is taken from a different execution hence the address of "UpperCaseString" is different than before
0x00007f685813b6a0 <+0>:    sub    rsp,0x38
0x00007f685813b6a4 <+4>:    mov    QWORD PTR [rsp+0x30],rbx
0x00007f685813b6a9 <+9>:    mov    QWORD PTR [rsp+0x28],rbp
0x00007f685813b6ae <+14>:   mov    QWORD PTR [rsp+0x20],r12
0x00007f685813b6b3 <+19>:   mov    QWORD PTR [rsp+0x18],r13
0x00007f685813b6b8 <+24>:   mov    QWORD PTR [rsp+0x10],r14
0x00007f685813b6bd <+29>:   mov    QWORD PTR [rsp+0x8],r15
0x00007f685813b6c2 <+34>:   mov    rbx,rdi
0x00007f685813b6c5 <+37>:   sub    rbx,0x8
0x00007f685813b6c9 <+41>:   mov    rsi,rbx
0x00007f685813b6cc <+44>:   lea    r12,[rip+0x4d]        # 0x7f685813b720 <UpperCaseString>
0x00007f685813b6d3 <+51>:   call   r12
0x00007f685813b6d6 <+54>:   mov    r15,QWORD PTR [rsp+0x8]
0x00007f685813b6db <+59>:   mov    r14,QWORD PTR [rsp+0x10]
0x00007f685813b6e0 <+64>:   mov    r13,QWORD PTR [rsp+0x18]
0x00007f685813b6e5 <+69>:   mov    r12,QWORD PTR [rsp+0x20]
0x00007f685813b6ea <+74>:   mov    rbp,QWORD PTR [rsp+0x28]
0x00007f685813b6ef <+79>:   mov    rbx,QWORD PTR [rsp+0x30]
0x00007f685813b6f4 <+84>:   add    rsp,0x38
0x00007f685813b6f8 <+88>:   ret

This looks good but there is still a problem. If you look carefully in the FrameConstructor you will notice that in BX we put a pointer to the beginning of the argument frame (LEAQ). Hence, in RDI and RSI we end up storing a pointer to a pointer to the bytes array and a pointer to the length of the array. But this does not correspond to a Go string, thus we have to patch our hook function:

// requires the "reflect", "strings" and "unsafe" packages to be imported
//export UpperCaseString
func UpperCaseString(s string) {
    // we need to extract the internal components of a string 
    // by additionally dereferencing them
    strHeader := (*reflect.StringHeader)(unsafe.Pointer(&s))
    data := (**byte)(unsafe.Pointer(strHeader.Data))
    length := *(**uintptr)(unsafe.Pointer(&strHeader.Len))
    // we construct a Go string with the above
    strFromHeader := unsafe.String(*data, *length)
    // we uppercase it
    uppercaseStr := strings.ToUpper(strFromHeader)
    strHeader = (*reflect.StringHeader)(unsafe.Pointer(&uppercaseStr))
    // store the new string components back to the argument frame
    *data = (*byte)(unsafe.Pointer(strHeader.Data))
    *length = uintptr(strHeader.Len)
}

When we run the above we validate the challenge:

validateme
Unauthorised
...
<Shared library was injected>
...
validateme
Authorized

By using our knowledge of how strings are represented in Golang and how the Go ABI is defined, we can rewrite the above as:

//export UpperCaseString
func UpperCaseString(data **byte, length *int) {
    // for simplification we consider that all of the
    // characters of the string are lowercase ASCII letters
    strSlice := unsafe.Slice(*data, *length)
    dif := byte('a' - 'A')
    for i := range strSlice {
        strSlice[i] -= dif
    }
}

The above two hook functions will uppercase the hijacked string argument. However, they work in a completely different way: the second one is modifying the string bytes in-place, while the first one is going to allocate a new array on the heap, copy the content of the original string, and modify it in the new location (remember that strings in Go are immutable). Although the two solutions work, the first one should not and we have just hacked our way around the CGO rules ;)

Memory Allocations inside CGO

To understand why we broke the rules of CGO, let us consider the following replacement function:

//export UpperCaseString
func UpperCaseString(s string) string {
    strHeader := (*reflect.StringHeader)(unsafe.Pointer(&s))
    data := (**byte)(unsafe.Pointer(strHeader.Data))
    length := *(**uintptr)(unsafe.Pointer(&strHeader.Len))
    // we construct a Go string with the above
    strFromHeader := unsafe.String(*data, *length)
    // we uppercase it
    return strings.ToUpper(strFromHeader)
}

For the sake of simplicity, consider that the above code is a valid replacement of our target function. If we extend the original execution flow with the above function, our program is going to get aborted by the side-loaded runtime:

# we side-load the replacement function in a shared library, extend the execution flow
# and attach with a debugger
(gdb) continue
Continuing.
Thread 1 "target" received signal SIGABRT, Aborted.
runtime.raise ()
    at /usr/local/go/src/runtime/sys_linux_amd64.s:154
154     RET
(gdb) backtrace
#0  runtime.raise () at /usr/local/go/src/runtime/sys_linux_amd64.s:154
#1  0x00007f84883357c5 in runtime.dieFromSignal (sig=6) at /usr/local/go/src/runtime/signal_unix.go:903
#2  0x00007f8488321486 in runtime.crash () at /usr/local/go/src/runtime/signal_unix.go:985
#3  runtime.fatalpanic (msgs=<optimized out>) at /usr/local/go/src/runtime/panic.go:1215
#4  0x00007f8488320bcc in runtime.gopanic (e=...) at /usr/local/go/src/runtime/panic.go:1017
#5  0x00007f84882f737a in runtime.cgoCheckArg (t=0x7f84883700e0, p=0x2c000110000, indir=<optimized out>, top=false,
    msg="cgo result has Go pointer") at /usr/local/go/src/runtime/cgocall.go:538
#6  0x00007f84882f780b in runtime.cgoCheckResult (val=...) at /usr/local/go/src/runtime/cgocall.go:656
#7  0x00007f848834f0ff in _cgoexp_7b357ce951e1_UpperCaseString (a=0x7fff548916c0) at _cgo_gotypes.go:40
#8  0x00007f84882f6e02 in runtime.cgocallbackg1 (fn=0x7f848834f0a0 <_cgoexp_7b357ce951e1_UpperCaseString>,
    frame=0x7fff548916c0, ctxt=0) at /usr/local/go/src/runtime/cgocall.go:329
#9  0x00007f84882f6aab in runtime.cgocallbackg (fn=0x7f848834f0a0 <_cgoexp_7b357ce951e1_UpperCaseString>, frame=0x7fff548916c0,
    ctxt=0) at /usr/local/go/src/runtime/cgocall.go:245
#10 0x00007f848834bfcb in runtime.cgocallbackg (fn=0x7f848834f0a0 <_cgoexp_7b357ce951e1_UpperCaseString>, frame=0x7fff548916c0,
    ctxt=0) at <autogenerated>:1
#11 0x00007f8488349a6d in runtime.cgocallback () at /usr/local/go/src/runtime/asm_amd64.s:1037
#12 0x00007f8488349ca1 in runtime.goexit () at /usr/local/go/src/runtime/asm_amd64.s:1652

From the above snippet, we see that the Go runtime from the shared library detected that we are trying to return a pointer allocated by its own memory allocator and aborted the execution.

This is a good moment to mention that CGO imposes many constraints, especially on pointers. The design of Golang's FFI feature establishes strict rules on how pointers can be passed from Go to C and vice versa.

More precisely, all allocations done in an exported Golang function, available through CGO or a function reachable by it in the same CGO scope, must not leave the scope. This is related to the volatile execution context we mentioned earlier - if a CGO exported Go function is called from C, a new execution environment is created (memory allocator, scheduler, etc.). However, after the end of the execution of the function, this context is deleted. This means that all memory allocations created by Go code called through CGO will be released. Thus, the memory that was valid throughout the execution of the function should be considered invalid when the function returns.

To enforce the right usage of this feature, the Go developers integrated checks under the hood of the runtime to ensure that no pointer leaves CGO. We suppose that the first example using the strings library worked because we were dealing with nested pointers which the CGO checks could not detect escaping the CGO scope. Furthermore, we can confirm that the memory allocated in the hook function escapes to the target world by simply outputting the addresses of the argument string and the result of strings.ToUpper:

//export UpperCaseString
func UpperCaseString(s string) {
    strHeader := (*reflect.StringHeader)(unsafe.Pointer(&s))
    data := (**byte)(unsafe.Pointer(strHeader.Data))
    length := *(**uintptr)(unsafe.Pointer(&strHeader.Len))

    println("Data address received from the target", data)
    strFromHeader := unsafe.String(*data, *length)
    uppercaseStr := strings.ToUpper(strFromHeader)
    strHeader = (*reflect.StringHeader)(unsafe.Pointer(&uppercaseStr))

    println("Data address after calling ToUpper method",
        (*byte)(unsafe.Pointer(strHeader.Data)))
    *data = (*byte)(unsafe.Pointer(strHeader.Data))
    *length = uintptr(strHeader.Len)
}

...
Data address received from the target 0xc000051ed8
Data address after calling ToUpper method 0x2c0000140a8
...

We can identify the new memory pages allocated by the host runtime by inspecting the memory mappings of the target program before and after loading the Go shared library and executing the hook function:

# memory mapping before the loading of the Go library
$ cat /proc/2487077/maps
...
00550000-00582000 rw-p 00000000 00:00 0                                  [heap]
00582000-005a3000 rw-p 00000000 00:00 0                                  [heap]
c000000000-c000400000 rw-p 00000000 00:00 0
c000400000-c004000000 ---p 00000000 00:00 0
7f97e4000000-7f97e4021000 rw-p 00000000 00:00 0

# memory mapping after loading the Go library and executing the function
$ cat /proc/2487077/maps
...
00550000-00582000 rw-p 00000000 00:00 0                                  [heap]
00582000-005a3000 rw-p 00000000 00:00 0                                  [heap]
c000000000-c000400000 rw-p 00000000 00:00 0
c000400000-c004000000 ---p 00000000 00:00 0
1c000000000-1c000800000 rw-p 00000000 00:00 0
1c000800000-1c004000000 ---p 00000000 00:00 0
2c000000000-2c000400000 rw-p 00000000 00:00 0 # <----------- we see the heap addresses allocated by the side-loaded runtime
2c000400000-2c004000000 ---p 00000000 00:00 0

To avoid unpleasant surprises which, additionally, could also be quite hard to debug, we should modify the string argument directly in place as we did in our second example.

Putting it All Together

We have deviated a bit from our primary goal: implementing a trampoline hook to extend the execution flow of our target program. But this detour was important for understanding the limits and constraints of the current approach. Now it is time to put it all together. As we are hooking the very same program as in the first blog post, we will apply similar actions:

we are going to hook the theGuessingGame function at offset +10, just after the stack-resizing prologue;
we are going to insert a jump stub of 14 bytes, so we must back up at least 14 bytes of instructions from that code region. Fortunately, in our example, backing up exactly 14 bytes does not split any instruction at this location;
we are going to extend the execution flow using CGO and what we presented above;
we are going to execute the saved original instructions and then continue the original execution flow at offset 10+14=24.

Here is what the jump assembly stub looks like in the body of the target function (the stub is inserted after the hook library was loaded):

# the body of the "theGuessingGame" function before the insertion of the jump stub
(gdb) disassemble 0x4939c0
Dump of assembler code for function main.theGuessingGame:
   0x00000000004939c0 <+0>: cmp    0x10(%r14),%rsp
   0x00000000004939c4 <+4>: jbe    0x493a94 <main.theGuessingGame+212>
   0x00000000004939ca <+10>:    sub    $0x50,%rsp       
   0x00000000004939ce <+14>:    mov    %rbp,0x48(%rsp)
   0x00000000004939d3 <+19>:    lea    0x48(%rsp),%rbp
   0x00000000004939d8 <+24>:    mov    %rax,0x58(%rsp)  # the range [+10; +24) is 14 bytes long and holds exactly 3 whole instructions
   0x00000000004939dd <+29>:    mov    0xa2f9c(%rip),%rdx        # 0x536980 <main.SECRET>
   0x00000000004939e4 <+36>:    cmp    %rbx,0xa2f9d(%rip)        # 0x536988 <main.SECRET+8>
    ...

# the body of the "theGuessingGame" function after the insertion of the jump stub
(gdb) disassemble 0x4939c0
   0x00000000004939c0 <+0>: cmp    rsp,QWORD PTR [r14+0x10]
   0x00000000004939c4 <+4>: jbe    0x493a94 <main.theGuessingGame+212>
   0x00000000004939ca <+10>:    push   0x2433e020
   0x00000000004939cf <+15>:    mov    DWORD PTR [rsp+0x4],0x7f42
   0x00000000004939d7 <+23>:    ret
   0x00000000004939d8 <+24>:    mov    QWORD PTR [rsp+0x58],rax
   0x00000000004939dd <+29>:    mov    rdx,QWORD PTR [rip+0xa2f9c]        # 0x536980 <main.SECRET>
   0x00000000004939e4 <+36>:    cmp    QWORD PTR [rip+0xa2f9d],rbx        # 0x536988 <main.SECRET+8>
   ...

We redirect the execution flow to a section we call the "trampoline" (located at 0x7f422433e020 above). It holds a copy of the instructions that we overwrite with the redirection stub. These saved instructions are executed after the ·FrameConstructor routine returns. The execution flow is then redirected back to 0x00000000004939d8 (the instruction in the target function immediately after the 14-byte redirection stub):

// The global variable below holds the address of the instruction to jump back to (0x00000000004939d8)
DATA jumpBackNext_UpperCaseString<>+0(SB)/8, $0x00000000004939d8
GLOBL jumpBackNext_UpperCaseString<>(SB), NOPTR, $8 

// func TrampolineStub_UpperCaseString()
TEXT ·TrampolineStub_UpperCaseString(SB), NOSPLIT|NOFRAME, $0
    LEAQ ·FrameConstructor+0(SB), R12
    CALL R12
    // The below sequence of RET instructions is just a buffer for the instructions that we are going to back up.
    // They are going to be filled when the library is loaded and just before inserting the assembly jump stub by the loaded library itself
    RET
    RET
    RET
    RET
    RET
    RET
    RET
    RET
    RET
    RET
    RET
    RET
    RET
    RET
    LEAQ jumpBackNext_UpperCaseString<>+0(SB), R12
    MOVQ (R12), R12
    SUBQ $0x00000008, SP
    MOVQ R12, (SP)
    RET

The x86-64 architecture-specific representation of the above is the following:

(gdb) disas 0x7f422433e020
   0x00007f422433e020 <+0>:    lea    r12,[rip+0x39]        # 0x7f422433e060 <main.FrameConstructor>
   0x00007f422433e027 <+7>: call   r12
   # below can be found the original instructions copied from the body of the target function
   0x00007f422433e02a <+10>:    sub    rsp,0x50
   0x00007f422433e02e <+14>:    mov    QWORD PTR [rsp+0x48],rbp
   0x00007f422433e033 <+19>:    lea    rbp,[rsp+0x48]
   0x00007f422433e038 <+24>:    lea    r12,[rip+0x79469]        # 0x7f42243b74a8 <jumpBackNext_UpperCaseString>
   0x00007f422433e03f <+31>:    mov    r12,QWORD PTR [r12]
   0x00007f422433e043 <+35>:    sub    rsp,0x8
   0x00007f422433e047 <+39>:    mov    QWORD PTR [rsp],r12
   0x00007f422433e04b <+43>:    ret

Now you have the whole picture and the result from the procedure is...

validateme
Authorized
# inspect the return code of the program
$ echo $?
0 # we have won :P

A validated challenge! ;)

Take a Step Back

Let us take a step back and summarise what has happened so far in this series of blog posts. In the first part, we hooked a Go function using a trampoline hook and redirected the execution flow to a C function. To do that, we had to use architecture-specific assembly and translate the data types of the function arguments from Go to C. Furthermore, we introduced concurrency problems and global program latency by not allowing the hijacked goroutine to be preempted, either cooperatively or asynchronously, by the Go scheduler. With preemption blocked, other goroutines could not execute on the same processor and the garbage collector could not free memory when the used-memory threshold was reached, which resulted in longer collection cycles later on.

By using CGO, we were able to define the logic of the hook function in Go and thus manipulate Golang types using Go. Through the FFI feature of Go and the associated runtime primitives, we solved the latency issues of the previous approach by allowing the set of goroutines bound to the current thread (the processor P) to be handed over to another OS thread for execution. Also, the garbage collector could free the program memory without increasing the length of the collection cycles. Unfortunately, this approach introduced other significant problems.

Limitations

With the approach described in this blog post, we encountered the following limitations:

no memory allocations should leave CGO;
no nested pointers can be passed to the hook function. runtime.cgocall will not ensure that they will not be garbage collected while the hook executes;
all Go types should again manually be translated to C following the System V ABI (see the FrameUnpacker assembly stub);
there is a massive overhead when using CGO to switch from Go to C. Interesting comparisons can be found in the reference section.

And sadly, there could be even more... The main disadvantage of this approach is that we end up with two separate runtimes, and more specifically two memory allocators, two schedulers, etc. In addition, Go data types again had to be translated into the C ABI, as in our first approach. It seems that in Go there is only room for one runtime.

Conclusion

In this blog post, we illustrated our second effort to define a hooking scheme for Go programs, this time using CGO, the FFI feature of Golang. We solved some of the limitations identified in our first approach but introduced new and arguably more complex ones. In the end, we wanted to manipulate Go types with more ease, but we ended up translating them to C again. In the next article, we will focus on how to share a single runtime between our target and our Go hooks, thereby bypassing CGO. Stay tuned, the rabbit hole goes deeper ;)

Resources
