Easy::jit: Just-In-Time compilation for C++

Easy::jit is a library that brings just-in-time compilation to C++ codes. It allows developers to jit-compile some functions and specializing (part of) their parameters. Just-in-time compilation is done on-demand and controlled by the developer. The project is available on github .

The performance of some codes is tied to the values taken by certain parameters whose value is only known at runtime: loop trip counts, values read from configuration files, user-input, etc. If these values are known, the code can be specialized and extra optimizations become possible, leading to better performance.

Easy::jit is a library that brings just-in-time compilation to C++ codes to achieve that goal. It hides complicated concepts and is presented as a simple abstraction; no specific compilation knowledge is required to use it.

Example

To understand the library, consider the following kernel that processes a frame from a video stream:

static void kernel(const char* mask, unsigned mask_size, unsigned mask_area,
                   const unsigned char* in, unsigned char* out,
                   unsigned rows, unsigned cols, unsigned channels) {

  unsigned mask_middle = (mask_size/2+1);
  unsigned middle = (cols+1)*mask_middle;

  for(unsigned i = 0; i != rows-mask_size; ++i) {
    for(unsigned j = 0; j != cols-mask_size; ++j) {
      for(unsigned ch = 0; ch != channels; ++ch) {
        long out_val = 0;
        for(unsigned ii = 0; ii != mask_size; ++ii)
          for(unsigned jj = 0; jj != mask_size; ++jj)
            out_val += mask[ii*mask_size+jj] * in[((i+ii)*cols+j+jj)*channels+ch];
        out[(i*cols+j+middle)*channels+ch] = out_val / mask_area;
      }
    }
  }
}

and its invocation,

static void apply_filter(const char *mask, unsigned mask_size, unsigned mask_area,
                         cv::Mat &image, cv::Mat *&out) {
  kernel(mask, mask_size, mask_area,
         image.ptr(0,0), out->ptr(0,0),
         image.rows, image.cols, image.channels());
}

In this code, the values of the parameters mask, mask_size and mask_area, depend on user's input and occasionally change. A priori, it's impossible to know which values are going to be taken.

On the other hand, the frame dimensions, rows, cols and channels, depend on the input device and tend to remain constant during the entire execution.

Knowing their values can enable some new optimizations: if the mask dimensions are known and small enough, for example, the optimizer could decide to fully unroll the innermost loops.

Using Easy::jit

The main functionalities of the library are present in the easy/jit.h header file, and the code function of the library is--guess the name--the easy::jit function.

This function takes as input a function to specialize (or a function pointer), and a series of parameter values or placeholders. Parameter values are used to specialize the code of the function passed as parameter; placeholders are used to forward the parameters of the specialized function to the original function. The library tries to mimic the interface of a standard C++ construct: std::bind. Just-in-time compilation takes place as soon as the easy::jit function is called, to give control to the user when and where the compilation is launched.

For example, in the code below, the kernel function is specialized with the mask and the frame dimensions. The first parameter of the specialized function is forwarded as the input frame and the second parameter as the output frame.

The returned object hides the LLVM related data-structures, mimics the signature of the specialized function, and is in charge of freeing the allocated resources.

#include <easy/jit.h>

static void apply_filter(const char *mask, unsigned mask_size, unsigned mask_area,
                         cv::Mat &image, cv::Mat *&out) {
  using namespace std::placeholders; // for _1, _2, ...

  auto kernel_opt = easy::jit(kernel, mask, mask_size, mask_area,
                              _1, _2,
                              image.rows, image.cols, image.channels());
  kernel_opt(image.ptr(0,0), out->ptr(0,0));
}

With this implementation, just-in-time compilation takes place at every frame of the video. To avoid useless re-compilations, the library ships a code cache built using a hash table and the easy::jit function. This way, if a specialized version has already been generated, it is not necessary to go through the entire code generation step again.

The previous code adapted to use the code cache is presented below.

#include <easy/code_cache.h>

static void apply_filter(const char *mask, unsigned mask_size, unsigned mask_area,
                         cv::Mat &image, cv::Mat *&out) {
  using namespace std::placeholders; // for _1, _2, ...

  static easy::cache<> cache;
  auto const &kernel_opt = cache.jit(kernel, mask, mask_size, mask_area,
                                     _1, _2,
                                     image.rows, image.cols, image.channels());
  kernel_opt(image.ptr(0,0), out->ptr(0,0));
}

How?

This library relies on C++ meta-programming and a compiler plugin to achieve its objectives.

Template meta-programming is used to initialize the Context from the call to easy::jit. This Context contains all the information required to perform the specialization: values taken by the parameters, pointer to the required function to specialize, special options, etc.

Additionally, the low-level LLVM objects are wrapped together into an opaque object. The operator() is specialized with the appropriate types derived from the easy::jit call, such that type checking is performed on every argument and their values are casted if necessary.

However, template meta-programming is not enough to provide JIT support: special compiler help is required to identify which functions will be specialized at runtime, and to embed their bitcode implementation in the final executable. This bitcode is used later on, at runtime for specialization and optimization.

Benchmarks

The results were obtained using Google Benchmark. We compiled two kernels, a convolution kernel, similar to the one presented in the previous sections, and a C-style quicksort kernel, were the function performing the comparison can be inlined thanks to just-in-time compilation. The input sizes were changed from a matrix of 16x16 elements up to 1024x1024 elements for the convolution kernel, and an array of 16 elements up to 1024 elements for the quicksort kernel. The times taken by the code generation process and by a cache hit are also measured. The time corresponds to the average time of #Iterations executions.

The sources can be found in the project's repository if you wish to reproduce. The benchmark is running on an Intel(R) Core(TM) i7-8550U CPU @ 4.00GHz.

The kernel using Easy::jit executes roughly 4.5x faster than the original version for the convolution kernel, and around 2x faster for the quicksort kernel. These times do not take into account the time taken to generate the code, only the kernel execution. The time required for the code generation process remains big in comparison to the kernel execution time: for the 1024 input sizes, at least 2 iterations of the kernel convolution are required to compensate the cost of code generation, and 29 iterations for the quicksort kernel. Still the time taken by a cache hit is negligible. In its current version, the cache is not persistent, and the generated codes remain in memory. This time may change if the generated code were to be read from a file (as expected in future versions). We can conclude that to profit from the compilation strategy proposed by Easy::jit, being able to reuse the generated code versions is critical.

-----------------------------------------------------------------------------------------------
Benchmark                      Time JIT (#Iterations)    Time Original (#Iterations)    Speedup
-----------------------------------------------------------------------------------------------
BM_convolve_jit/16               314 ns (2307965)              1559 ns (449056)          3.69 x
BM_convolve_jit/32              1674 ns (438450)               7780 ns (90692)           4.64 x
BM_convolve_jit/64              7221 ns (96520)               34714 ns (19811)           4.80 x
BM_convolve_jit/128            30487 ns (22970)              144905 ns (4820)            4.75 x
BM_convolve_jit/256           126170 ns (5533)               597026 ns (1180)            4.73 x
BM_convolve_jit/512           512088 ns (1372)              2417549 ns (289)             4.72 x
BM_convolve_jit/1024         2073025 ns (338)               9737191 ns (72)              4.69 x
BM_convolve_compile_jit     13147455 ns (54)
BM_convolve_cache_hit_jit        218 ns (3199323)
-----------------------------------------------------------------------------------------------
Benchmark                      Time JIT (#Iterations)    Time Original (#Iterations)    Speedup
-----------------------------------------------------------------------------------------------
BM_qsort_jit/16                  171 ns (4082824)               279 ns (2500416)         1.63 x
BM_qsort_jit/32                  545 ns (1284095)              1032 ns (677563)          1.89 x
BM_qsort_jit/64                 1860 ns (374257)               3658 ns (191657)          1.96 x
BM_qsort_jit/128                6844 ns (101668)              13604 ns (51633)           1.98 x
BM_qsort_jit/256               25009 ns (27731)               52199 ns (13417)           2.08 x
BM_qsort_jit/512               95446 ns (7317)               205984 ns (3427)            2.15 x
BM_qsort_jit/1024             369910 ns (1855)               810159 ns (866)             2.19 x
BM_qsort_compile_jit        13013694 ns (54)
BM_qsort_cache_hit_jit           174 ns (3979212)

What's next

The library remains on its early stages. Among the missing basic features we can list:

  • struct parameters are not correctly supported
  • functions returning a struct are not well supported.

In the midterm, the first objective is to provide the mechanisms to perform profile-guided-optimization and optimization using speculation. We also aim at handling a set of common constructs and patterns (caching and threading, for example) to simplify the usage of the library.

In the very end, the objective is to be able to perform partial evaluation of codes. There already exist some work on this area, in particular LLPE is an implementation based on LLVM.

Thanks to Serge Guelton for his contributions to the project and many discussions. Thanks to Quarkslab for supporting us in working on personal projects.

Comments