HOW-TO: Implementing a custom directive processor in clang to drive the compilation process of our LLVM-based code obfuscator, while maintaining backward compatibility if another compiler is used. What a good opportunity for a journey through the first compiler stages!
Pragma and Annotations
The C and C++ languages have a built-in mechanism for extending the language with compiler-specific extensions: compiler directives, a.k.a. #pragma.
They are not the sole means of extending the language: Clang and GCC support the annotate attribute, which can be used to attach extra information to some declarations:
int foo(int param) {
  int __attribute__((annotate("my annotation"))) var = param;
  return var;
}
Compiled with clang -S -emit-llvm, this turns into the following bitcode:
@.str = private unnamed_addr constant [14 x i8] c"my annotation\00", section "llvm.metadata"
@.str.1 = private unnamed_addr constant [10 x i8] c"/tmp/rr.c\00", section "llvm.metadata"
define i32 @foo(i32 %param) {
%var = alloca i32, align 4
%1 = bitcast i32* %var to i8*
call void @llvm.var.annotation(i8* %1, i8* getelementptr inbounds ([14 x i8], [14 x i8]* @.str, i64 0, i64 0), i8* getelementptr inbounds ([10 x i8], [10 x i8]* @.str.1, i64 0, i64 0), i32 2)
ret i32 %param
}
The call to the LLVM intrinsic @llvm.var.annotation [0] attaches extra information to the variable, and a compiler pass can be used to consume it.
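As a rough illustration of that consumption (this snippet is our own sketch, not part of Clang or of the original tooling, and it assumes the typed-pointer IR shown above), an LLVM pass could locate those intrinsic calls and recover the annotation string along these lines:

#include "llvm/IR/Constants.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/IntrinsicInst.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

// Extract the C string behind the i8* argument of @llvm.var.annotation.
static StringRef getAnnotationString(Value *StrArg) {
  // The argument is a GEP constant expression into a private constant array
  // living in the "llvm.metadata" section.
  auto *GV = dyn_cast<GlobalVariable>(StrArg->stripPointerCasts());
  if (!GV || !GV->hasInitializer())
    return StringRef();
  if (auto *CDA = dyn_cast<ConstantDataArray>(GV->getInitializer()))
    if (CDA->isCString())
      return CDA->getAsCString();
  return StringRef();
}

// Walk a function and report every @llvm.var.annotation call.
static void visitVarAnnotations(Function &F) {
  for (Instruction &I : instructions(F))
    if (auto *II = dyn_cast<IntrinsicInst>(&I))
      if (II->getIntrinsicID() == Intrinsic::var_annotation)
        errs() << "annotated value: " << *II->getArgOperand(0) << " -> "
               << getAnnotationString(II->getArgOperand(1)) << "\n";
}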
A similar annotation can be put on a function declaration:
int foo(int param) __attribute__((annotate("my annotation"))) {
  return param;
}
The LLVM bitcode equivalent is:
@.str = private unnamed_addr constant [14 x i8] c"my annotation\00", section "llvm.metadata"
@.str.1 = private unnamed_addr constant [10 x i8] c"/tmp/gg.c\00", section "llvm.metadata"
@llvm.global.annotations = appending global [1 x { i8*, i8*, i8*, i32 }] [{ i8*, i8*, i8*, i32 } { i8* bitcast (i32 (i32)* @foo to i8*), i8* getelementptr inbounds ([14 x i8], [14 x i8]* @.str, i32 0, i32 0), i8* getelementptr inbounds ([10 x i8], [10 x i8]* @.str.1, i32 0, i32 0), i32 1 }], section "llvm.metadata"
define i32 @foo(i32 %param) {
ret i32 %param
}
The annotation is now located in a global table @llvm.global.annotations with appending linkage. This has the benefit of allowing several compilation units to define it; in the end, they'll all be linked together into a single table. Again, a compiler pass can consume this information to perform its routine.
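To make that a bit more concrete, here is a minimal sketch of such a consumer (again ours, not part of the original tooling, and assuming the typed-pointer IR shown above) that dumps the global annotation table:

#include "llvm/IR/Constants.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

// Dump every entry of the @llvm.global.annotations table:
// each { i8*, i8*, i8*, i32 } entry holds the annotated value,
// the annotation string, the file name and the line number.
static void dumpGlobalAnnotations(Module &M) {
  GlobalVariable *GA = M.getGlobalVariable("llvm.global.annotations");
  if (!GA || !GA->hasInitializer())
    return;
  auto *Entries = cast<ConstantArray>(GA->getInitializer());
  for (Value *Op : Entries->operands()) {
    auto *Entry = cast<ConstantStruct>(Op);
    auto *F = dyn_cast<Function>(Entry->getOperand(0)->stripPointerCasts());
    auto *StrGV = cast<GlobalVariable>(Entry->getOperand(1)->stripPointerCasts());
    auto *Str = cast<ConstantDataArray>(StrGV->getInitializer());
    if (F && Str->isCString())
      errs() << F->getName() << " is annotated with: " << Str->getAsCString() << "\n";
  }
}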
There's still an issue: typing __attribute__((annotate("my annotation"))) is far from what I'd call an intuitive interface. We could put this into a macro:
#define STRINGIFY_2(value) #value
#define STRINGIFY(value) STRINGIFY_2(value)
#define my_stuff(value) __attribute__((annotate(STRINGIFY(value))))
int foo(int param) my_stuff(my annotation) {
  return param;
}
Piped through the C preprocessor, we get our annotation back:
int foo(int param) __attribute__((annotate("my annotation"))) {
  return param;
}
If we decide to put the macros in a header, we'll introduce a header dependency. This is neither convenient nor acceptable, as we want to be as unintrusive as possible: the original source should compile without a hitch on an unmodified system that may not have this header.
So here comes #pragma. As hinted by the #, these are preprocessor directives that can be used to implement custom (as in non-standard) behavior. A compiler that does not know a particular directive should be able to safely ignore it (a precious property that's not always respected, see #pragma once).
First Attempt: Preprocessor-only Implementation
Preprocessor directives are handled at the preprocessing step (d'oh). At that step, the compiler knows nothing about the code: it only sees a stream of lexed tokens. But Clang lets us define a specific handler for our directive, using a PragmaHandler. This handler processes the token stream and can do several things with it: read until the end of the directive, implement a small language based on these tokens or, more importantly, add new tokens to the stream.
Let's introduce a simple idea: whenever we meet a #pragma annotate ... extra tokens ..., we replace it with new tokens that correspond to the equivalent attribute. For example:
#pragma annotate my annotation
int var = 1;
We'll record all the tokens until the end of the directive (there's a special token for it, namely tok::eod), then process an extra token (the type) and finally paste the required tokens to recreate the annotation. On the surface, this method ought to work, but we'll see that this technique is too simplistic.
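Concretely, for a well-placed directive like the one above, the handler would effectively rewrite the stream into something equivalent to:

int __attribute__((annotate("my annotation"))) var = 1;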
For more information about this approach, you can take a look at the following code:
https://github.com/quarkslab/clang/commit/0163f52f70e4781ce99710575bb66943125357b2
It uses a similar technique to replace #pragma by function calls.
The astute reader may consider this approach «as frail as a twig». Indeed, what if we get an ill-placed directive:
#pragma annotate my annotation
g = 1;
We'll get an unhelpful error message, as we have generated an invalid C sequence (namely g __attribute__((annotate("my annotation"))) = 1).
As the good ol' macro used to say: "As funny as it may be, don't play too much with the preprocessor!".
Second Attempt: Lexical Analysis, Semantic Analysis, Code Generation
For proper code generation with error recovery, we need to go through all the compilation steps. The very same ones we learnt at school:
Lexical analysis, to split the input into words and associate a category with each chunk (keyword, identifier, etc.);
Semantic Analysis, to build structured and grammatically correct sentences from what the lexing step gave us;
Code Generation, to turn the abstract sentences into something more relevant for automated processing, here LLVM bitcode.
A first thing to note is that Clang does not use the old-school Flex/Bison duo, but a custom parser, which is fortunately very well structured. Still, there's no way diving into code that can successfully parse a language like C++ could be an easy task ;-)
Minimal Preprocessor Step
Clang allows us to define custom tokens that can be generated as the result of parsing a directive, in the file include/clang/Basic/TokenKinds.def, for instance:
ANNOTATION(pragma_my_annotate)
An annotation is just a token that can hold any piece of data in addition to the mandatory source location info. This additional piece of information is used to ensure correct error reporting.
So our directive processor just reads tokens until the end of the directive and builds a string out of them. This newly created string is then attached to a new annotation token, which in turn gets inserted back into the token stream:
void PragmaMyAnnotateHandler::HandlePragma(Preprocessor &PP,
                                           PragmaIntroducerKind Introducer,
                                           Token &FirstTok)
{
  Token Tok;
  Tok.startToken(); // initialize Tok before the first kind check
  // first slurp the directive content into a string.
  std::ostringstream MyAnnotateDirective;
  while (Tok.isNot(tok::eod)) {
    PP.Lex(Tok);
    if (Tok.isNot(tok::eod))
      MyAnnotateDirective << PP.getSpelling(Tok);
  }
  Tok.startToken();
  Tok.setKind(tok::annot_pragma_my_annotate);
  Tok.setLocation(FirstTok.getLocation());
  Tok.setAnnotationEndLoc(FirstTok.getLocation());
  // there should be something better than this strdup :-/
  Tok.setAnnotationValue(strdup(MyAnnotateDirective.str().c_str()));
  PP.EnterToken(Tok);
}
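One detail glossed over above: the handler class itself has to be declared and registered with the preprocessor, otherwise HandlePragma is never called. A minimal sketch, with the pragma name and member naming being our own choice (this needs clang/Lex/Pragma.h and clang/Lex/Preprocessor.h):

// The string passed to the PragmaHandler base class is the word expected
// right after `#pragma` (here, `annotate`).
class PragmaMyAnnotateHandler : public PragmaHandler {
public:
  PragmaMyAnnotateHandler() : PragmaHandler("annotate") {}
  void HandlePragma(Preprocessor &PP, PragmaIntroducerKind Introducer,
                    Token &FirstTok) override;
};

// Registration, typically next to the in-tree handlers in
// Parser::initializePragmaHandlers (lib/Parse/ParsePragma.cpp), where
// MyAnnotateHandler would be a std::unique_ptr<PragmaHandler> member:
MyAnnotateHandler.reset(new PragmaMyAnnotateHandler());
PP.AddPragmaHandler(MyAnnotateHandler.get());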
Now we have a new token generated by the lexer. What will the semantic analyzer do with it? What suspense!
Semantic Analysis
Now our token is going to be part of a potentially valid C or C++ sentence. It's up to the semantic analysis to decide! We want it to be attached to a variable declaration inside a function (not at global scope). The function that handles this part lives in lib/Parse/ParseStmt.cpp and is called ParseStatementOrDeclarationAfterAttributes. We add a case to handle our token:
case tok::annot_pragma_my_annotate:
  ProhibitAttributes(Attrs);
  return HandlePragmaMyAnnotate(Stmts);
where HandlePragmaMyAnnotate consumes the tokens and inserts the statements into Stmts. HandlePragmaMyAnnotate checks the validity of the underlying statement and calls extra code generation on it. It also reports an error if we're in trouble. Clang has a great error reporting system: one defines a diagnostic template in include/clang/Basic/DiagnosticParseKinds.td and uses it through the Diag member function.
// custom warning
def warn_pragma_my_annotate : Warning<"Invalid annotate directive: %0 - ignored">;
Note that invalid directives are just ignored in our case, so we use a warning.
StmtResult Parser::HandlePragmaMyAnnotate(StmtVector &Stmts)
{
  assert(Tok.is(tok::annot_pragma_my_annotate));
  // handle stacked directives by merging them
  SmallVector<char *, 8> Annotations;
  auto Where = Tok.getLocation();
  while (Tok.is(tok::annot_pragma_my_annotate)) {
    Annotations.push_back(static_cast<char *>(Tok.getAnnotationValue()));
    ConsumeToken();
  }
  // start handling
  if (isDeclarationStatement()) {
    StmtResult S = ParseStatementOrDeclaration(Stmts, false);
    if (S.isUsable()) {
      StmtResult SR = Actions.ActOnPragmaMyAnnotate(S.get(), Annotations,
                                                    Tok.getLocation());
      Stmts.push_back(SR.get());
    } else {
      Diag(Where, diag::warn_pragma_my_annotate)
          << "attached to unexpected declaration statement";
    }
  } else {
    Diag(Where, diag::warn_pragma_my_annotate)
        << "attached to unexpected statement";
  }
  // here is the consequence of our strdup :-/
  // one always pays for one's sins
  std::for_each(Annotations.begin(), Annotations.end(), free);
  return StmtEmpty();
}
The magic goes on with the call to ActOnPragmaMyAnnotate which is responsible for the actual code generation :-)
CodeGen
We're now in the middle of the generation process: in a function, probably in the middle of a basic block. The member function ActOnPragmaMyAnnotate has been implemented in lib/CodeGen/CGStmt.cpp.
Note
Did you notice how the directory layout of the edited files scrupulously follows the different compilation steps? Amazing!
This function has access to the LLVM Module object and to a properly initialized llvm::IRBuilder<>. We can use them to insert LLVM instructions into the current basic block, doing whatever we want based on the content of our array of annotations, and see it reflected in Clang's output; for instance, filling the global variable @llvm.global.annotations that we met at the beginning of this post.
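As a rough sketch of what such a hook can emit (the helper below is ours, it targets the typed-pointer IR shown earlier, and it skips the llvm.metadata section placement Clang normally uses for these strings), one could recreate the @llvm.var.annotation call produced by the attribute, given the alloca of the annotated variable:

#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Intrinsics.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Emit a call to @llvm.var.annotation on `Storage` (the variable's alloca),
// mimicking what __attribute__((annotate(...))) produces.
static void emitVarAnnotation(Module &M, IRBuilder<> &Builder, Value *Storage,
                              StringRef Annotation, StringRef FileName,
                              unsigned Line) {
  Function *AnnotateFn =
      Intrinsic::getDeclaration(&M, Intrinsic::var_annotation);
  Value *AnnotationStr = Builder.CreateGlobalStringPtr(Annotation);
  Value *FileStr = Builder.CreateGlobalStringPtr(FileName);
  Value *Casted = Builder.CreateBitCast(Storage, Builder.getInt8PtrTy());
  Builder.CreateCall(AnnotateFn,
                     {Casted, AnnotationStr, FileStr, Builder.getInt32(Line)});
}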
Application to Code Obfuscation
When obfuscating code, we don't want to obfuscate the whole application, but only relevant pieces of code. For a developer, being able to mark a code location as a sensitive piece of code is mandatory. Compiler directives are the natural way to do so, as showcased by the following code (extracted from Hacker's Delight [1]):
#pragma epona obfuscate MBA()
#pragma epona obfuscate ControlFlowGraphFlattening()
unsigned int crc32(const unsigned char* message)
{
  int i, j;
  unsigned int byte, crc, mask;
  i = 0;
  crc = 0xFFFFFFFF;
  // main_loop
  while (message[i] != 0) {
    byte = message[i]; // Get next byte.
    crc = crc ^ byte;
    for (j = 7; j >= 0; j--) { // Do eight times.
      mask = -(crc & 1);
#pragma epona obfuscate OpaqueConstants()
      const unsigned int tmp = 0xEDB88320 & mask;
      crc = (crc >> 1) ^ tmp;
    }
    i = i + 1;
  }
  return ~crc;
}
The whole function will be obfuscated using Mixed Boolean Arithmetic [2] then Control Flow Graph Flattening, and the special constant 0xEDB88320 gets special handling through the Opaque Constants transformation. This effectively controls the obfuscation process through compiler directives, and the very same code can still be compiled with GCC, which will simply ignore the directives.
Thanks
Thanks a lot to w1gzizi for the proofreading. Your nickname is ridiculous, but your proofreading is invaluable :-)
[0] http://llvm.org/docs/LangRef.html#llvm-var-annotation-intrinsic
[1] http://www.hackersdelight.org/hdcodetxt/crc.c.txt
[2] On that subject, read the excellent article by my colleague Ninon