
LuaJIT bad FFI callback issue

LuaJIT's FFI is very fast. It is even faster than a dynamic library call made from C/C++, and 40 times faster than Golang's CGO (reference). But one thing is not allowed:

letting an FFI call into a C function get JIT-compiled, which in turn calls a callback, calling into Lua again.

When this happens, the program fails with a "bad callback" error.

What is a LuaJIT FFI callback?

A LuaJIT FFI callback is a Lua function that is called from C code through the FFI.

Here is an example that triggers the bad callback issue.

-- save to file test.lua
local ffi = require("ffi")

ffi.cdef [[
    typedef int (*my_fn_t)(int);
    int f2();
    void setup(my_fn_t f, int);
]]

local lib = ffi.load("test")

function setup(cb, a)
  lib.setup(cb, a)
end

function f0()
  -- The FFI call to f2(), which is defined in the C library libtest.so.
  return lib.f2() + 1
end

do
  local cb = ffi.cast("my_fn_t",
                      -- This is the FFI callback function.
                      function(a)
                        return a
  end)

  for i=1,100 do
    if i == 80 then setup(cb, 10) end
    f0()
  end
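end

// save to file lib.c
// and compile it to a shared library:
//   gcc -Wall -O -g -o libtest.so -fpic -shared lib.c
// put libtest.so in the same directory as test.lua
// (on Linux the dynamic loader may not search that directory by default;
//  if ffi.load("test") fails, try: LD_LIBRARY_PATH=. luajit test.lua)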
typedef int (*my_fn_t)(int);

my_fn_t gf = 0;
int ga;

void setup(my_fn_t f, int a) {
  gf = f;
  ga = a;
}

int f2() {
  if (gf == 0) { // this is necessary to escape the auto-detection.
    return 3;
  } else {
    return gf(ga) + 1;
  }
}  

Running it then gives us the "bad callback" error:

$ luajit test.lua
PANIC: unprotected error in call to Lua API (bad callback)

Running it with the trace dump enabled (for example, luajit -jdump=b test.lua) shows that the f0() call is compiled by the JIT compiler:

---- TRACE 1 start test.lua:27
0022  ISNEN    6   0      ; 80
0023  JMP      7 => 0028
0028  GGET     7   9      ; "f0"
0029  CALL     7   1   1
0000  . FUNCF    2          ; test.lua:15
0001  . UGET     0   0      ; lib
0002  . TGETS    0   0   0  ; "f2"
0000  . . . FUNCC               ; ffi.clib.__index
0003  . CALL     0   2   1
0000  . . FUNCC               ; ffi.meta.__call // FFI call at line: 17 in test.lua is compiled.
0004  . ADDVN    0   0   0  ; 1
0005  . RET1     0   2
0030  FORL     3 => 0022  

The call chain is: Trace 1 -> lib.f2() -> Lua callback function.

LuaJIT's bad callback auto detection feature

In some cases, LuaJIT can automatically detect a bad callback and disable JIT compilation for the related FFI call. Here is a slightly modified version of the previous example.

local ffi = require("ffi")

ffi.cdef [[
    typedef int (*my_fn_t)(int);
    int f2();
    void setup(my_fn_t f, int);
]]

local lib = ffi.load("test")

function setup(cb, a)
  lib.setup(cb, a)
end

function f0()
  -- The FFI call to f2(), which is defined in the C library libtest.so.
  return lib.f2() + 1
end

do
  local cb = ffi.cast("my_fn_t",
                      -- This is the FFI callback function.
                      function(a)
                        return a
  end)

  local cb2 = ffi.cast("my_fn_t",
                      -- This is the FFI callback function.
                      function(a)
                        return a+1
  end)

  setup(cb, 10)

  for i=1,100 do
    if i == 80 then setup(cb2, 10) end
    f0()
  end
end

// lib.c for this example
typedef int (*my_fn_t)(int);

my_fn_t gf = 0;
int ga;

void setup(my_fn_t f, int a) {
  gf = f;
  ga = a;
}

int f2() {
  // f2 will always call a Lua callback.
  return gf(ga) + 1;
}

This example runs without a bad callback error. The dumped trace shows that JIT compilation of the FFI call in f0() is aborted and the call is added to the blacklist. Since no FFI call is JIT-compiled, it is now safe for the C code to call back into Lua.

---- TRACE 1 start test1.lua:35
0030  ISNEN    7   0      ; 80
0031  JMP      8 => 0036
0036  GGET     8   9      ; "f0"
0037  CALL     8   1   1
0000  . FUNCF    2          ; test1.lua:15
0001  . UGET     0   0      ; lib
0002  . TGETS    0   0   0  ; "f2"
0000  . . . FUNCC               ; ffi.clib.__index
0003  . CALL     0   2   1
0000  . . FUNCC               ; ffi.meta.__call
---- TRACE 1 abort test1.lua:17 -- blacklisted

The same effect can be achieved in the first test case by manually turning off JIT compilation for f0():

do
  local cb = ffi.cast("my_fn_t",
                      -- This is the FFI callback function.
                      function(a)
                        return a
  end)

  jit.off(f0)

  for i=1,100 do
    if i == 80 then setup(cb, 10) end
    f0()
  end
end   

Why can't the auto detection always catch the LuaJIT FFI callback case?

Because the auto detection only takes effect while LuaJIT compiles the trace, and in the first case no FFI callback has happened by the time the trace is compiled: we deliberately set the callback in the Lua loop only when i == 80. In the later iterations of the loop (i >= 80), f2() invokes the callback, which the compiled trace does not expect, and the compiled code cannot be changed at that point. The rule “not allow an FFI call into a C function get JIT-compiled” is broken, and the program runs into the “bad callback” error.
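The timing argument above can be sketched as a toy model in C. This is only an illustration of the explanation in this post, not LuaJIT's real implementation: the names call_state, HOT_THRESHOLD, callback_fired, lua_cb and run_loop are all hypothetical, and the "compile" step is reduced to a single check of whether the callback fired during the iteration in which the loop became hot.

```c
/* Toy model only: none of these names exist in LuaJIT. */
typedef int (*my_fn_t)(int);

enum call_state { INTERPRETED, COMPILED, BLACKLISTED };

/* Hypothetical iteration count after which the loop counts as hot. */
#define HOT_THRESHOLD 56

static my_fn_t gf = 0;          /* callback slot, as in lib.c */
static int callback_fired = 0;  /* set whenever the callback runs */

/* Stand-in for the Lua callback function. */
int lua_cb(int a) { callback_fired = 1; return a; }

/* Same shape as f2() in lib.c: only calls back once gf is set. */
int f2(void) { return gf ? gf(10) + 1 : 3; }

/*
 * Runs the loop and installs the callback at iteration set_at.
 * Returns the iteration at which a "bad callback" would occur,
 * or 0 if the loop finishes without one.
 */
int run_loop(int iterations, int set_at) {
  enum call_state state = INTERPRETED;
  gf = 0;
  for (int i = 1; i <= iterations; i++) {
    if (i == set_at) gf = lua_cb;
    callback_fired = 0;
    f2();
    if (state == COMPILED && callback_fired)
      return i;  /* callback entered from compiled code */
    if (state == INTERPRETED && i >= HOT_THRESHOLD) {
      /* The check happens once, when the trace is compiled. */
      state = callback_fired ? BLACKLISTED : COMPILED;
    }
  }
  return 0;
}
```

In this model, run_loop(100, 80) reports the failure at iteration 80, because the call was already compiled before the callback existed, while run_loop(100, 1) returns 0, because the callback is seen at compile time and the call is blacklisted instead.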

How do other VMs solve this issue?

Section 6.5 of the paper "Trace-based Just-in-Time Type Specialization for Dynamic Languages" by Andreas Gal et al. describes a similar situation:

Another problem is that external functions can reenter the interpreter by calling scripts, which in turn again might want to access the call stack or global variables. To address this problem, we made the VM set a flag whenever the interpreter is reentered while a compiled trace is running. Every call to an external function then checks this flag and exits the trace immediately after returning from the external function call if it is set. There are many external functions that seldom or never reenter, and they can be called without problem, and will cause trace exit only if necessary.
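The flag scheme described in the quote can be sketched in C roughly as follows. This is a minimal toy model under my own assumptions, not code from the paper or from any VM: trace_running, reentered, interp_enter, trace_call_external and run_trace are hypothetical names, and "running a script" is reduced to returning the argument.

```c
#include <stdbool.h>

/* Toy model of the flag scheme; all names are hypothetical. */
static bool trace_running = false; /* true while compiled trace code runs */
static bool reentered = false;     /* set if the interpreter is reentered */

typedef int (*ext_fn_t)(int);

/* Called whenever an external function calls back into the interpreter. */
int interp_enter(int arg) {
  if (trace_running) reentered = true;
  return arg; /* stand-in for running a script */
}

/* Two sample external functions: one never reenters, one always does. */
int ext_plain(int x) { return x + 1; }
int ext_reentrant(int x) { return interp_enter(x) + 1; }

/*
 * Calls an external function from a trace. After the call returns,
 * *must_exit tells the trace whether it has to exit.
 */
int trace_call_external(ext_fn_t fn, int arg, bool *must_exit) {
  reentered = false;
  int ret = fn(arg);
  *must_exit = reentered;
  return ret;
}

/*
 * Runs a "trace" that calls fn up to n times. Returns how many calls
 * were made on the trace before it finished or had to exit.
 */
int run_trace(ext_fn_t fn, int n) {
  int done = 0;
  trace_running = true;
  while (done < n) {
    bool must_exit;
    trace_call_external(fn, done, &must_exit);
    done++;
    if (must_exit) break; /* leave the trace, continue in the interpreter */
  }
  trace_running = false;
  return done;
}
```

With this model, run_trace(ext_plain, 5) stays on the trace for all 5 calls, while run_trace(ext_reentrant, 5) leaves the trace right after the first call, which matches the "trace exit only if necessary" behavior the paper describes.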

This approach seems likely to cause problems in LuaJIT. Suppose the FFI call is compiled into Trace 1 and the callback function becomes hot enough to be compiled into Trace 2. Trace 2 will then modify some state in a way that Trace 1 probably does not expect, so returning from Trace 2 to Trace 1 may fail because of corrupted state. However, it is worth giving it a try.

Date: 2024-05-04 Sat 00:00