Troubleshooting an intermittent error by RR

(this post is converted from https://konghq.com/blog/engineering/troubleshooting-an-intermittent-failure-on-arm64)

The Kong Gateway CI was failing intermittently (about once every 100 runs) on the ARM64 platform with a strange error: attempt to perform arithmetic on local 'i' (a function value). The variable i in the context is an integer but at runtime, it was sometimes a function value.

This is caused by an error in the LuaJIT ARM64 JIT compiler. This document describes how it is fixed by leveraging the Mozilla rr tool.

Narrow down the scope: Is the error in Lua code or elsewhere?

The error when encountered was on line 3 in Kong's db init.lua code file.

for i, primary_key_name in ipairs(primary_key_names) do
   insert(argn_next, primary_key_name)
   insert(pk_placeholders, placeholder(i + #fk_names))
end

Looking at the above code, the value i should be an integer, but the error says i is a function value. This error was only ever on ARM64 runners used in Kong CI. The first troubleshooting step was to check whether this error was caused by Lua code or other parts of Gateway — like the Nginx, LuaJIT, etc. This can be determined by disabling the JIT compilation for the function that contains the above code with: jit.off(new). After disabling JIT for the above function, the error was no longer reproducible. That means the error was caused by the underlying LuaJIT machine code generation.

Get deterministic results from random failure

One hurdle to debugging the failure was the random nature of this error. It was seen in about one out of every 100 CI runs. A minimal reproducible test case was needed to help speed up fixing the issue. The tool we used is called reverse debugger: Mozilla rr. If we use the analogy of a core dump being akin to taking a picture of the program when it fails, the record file generated by rr is like taking a video of the program when it fails. Like a real video file, you can play it back and forth, with the record file from rr, you can run the failure program back and forth and always get the same result. This helps tremendously when the failure happens randomly.The CI runs our unit tests with the busted framework and is invoked normally like busted spec/02-integration/03-db/01-db_spec.lua

unset KONG_VENV_ENV_FILE; unset KONG_VENV; unset KONG_PREFIX; export KONG_BUSTED_RESPAWNED=1; export KONG_IS_TESTING=1;

count=0
while resty --rr -j dump -c 4096 --http-conf 'lua_ssl_trusted_certificate ~/projects/kong/spec/fixtures/kong_spec.crt;'  'bin/busted' 'spec/02-integration/03-db/01-db_spec.lua' > jit_dump_with_rr_bad_demo.log; do 
  count=$((count+1));
  echo $count; 
done

The above commands run the test file again and again until it fails. And at each run, it is recorded by rr (by the –rr switch in resty) and the LuaJIT dump (by the -j dump switch and the output is redirected to a file) logging is also turned on to help identify which JIT trace caused the failure. Other switches like -c , –http-conf and environment variables are needed to run the test itself.

Debug the error with rr replay

Once the above command fails, we will have a file generated by rr. It usually is saved at ~/.local/share/rr/. And to replay this error, just run:

rr replay

It will run in gdb like interface:

Reading symbols from ~/.local/share/rr/nginx-589/mmap_hardlink_4_nginx...
(gdb)

And we can run the “continue” command in gdb to run to the error:

0 successes / 0 failures / 1 error / 0 pending : 9.803452 seconds

Error -> ./kong/db/strategies/postgres/init.lua @ 1441
suite spec/02-integration/03-db/01-db_spec.lua
./kong/db/strategies/postgres/init.lua:1441: attempt to perform arithmetic on local 'i' (a function value)

stack traceback:
        ./kong/db/strategies/postgres/init.lua:1441: in function 'new'
        ./kong/db/strategies/init.lua:59: in function 'new'
        ./kong/db/init.lua:95: in function 'new'
        spec/02-integration/03-db/01-db_spec.lua:770: in main chunk


Program received signal SIGKILL, Killed.
0x0000000070000004 in syscall_traced ()
(gdb)

In addition to running the “continue” command, we can also run the “reverse-continue” command to play back forwards, which will stop at any breakpoint. In this case, after searching for the error message in the LuaJIT source, we set a breakpoint at lj_err_optype and will give us more information about the error details:

(gdb) break lj_err_optype
Breakpoint 1 at 0xffffbb0b6cf8: file lj_err.c, line 911.
(gdb) reverse-continue 
Continuing.

Breakpoint 1, lj_err_optype (L=0x4517e33a1980, o=0x4517e57bcf80, opm=LJ_ERR_OPARITH) at lj_err.c:911
911	  const char *tname = lj_typename(o);
(gdb)

And at this point, we can get information by printing tname and o and o’s value in this lj_err_optype function.

void lj_err_optype(lua_State *L, cTValue *o, ErrMsg opm)
{
  const char *tname = lj_typename(o);
  const char *opname = err2msg(opm);
  if (curr_funcisL(L)) {
    GCproto *pt = curr_proto(L);
    const BCIns *pc = cframe_Lpc(L) - 1;
    const char *oname = NULL;
    const char *kind = lj_debug_slotname(pt, pc, (BCReg)(o-L->base), &oname);
    if (kind)
      err_msgv(L, LJ_ERR_BADOPRT, opname, kind, oname, tname);
  }
  err_msgv(L, LJ_ERR_BADOPRV, opname, tname);
}

This confirms the value under pointer o is a function value already. It is an error to perform arithmetic operations on a function value in Lua language.

The question now became where this value was overridden from a number value to a function value. This question could be answered in various ways. One approach is to set up a watch point in the gdb to the memory address that o is pointing to and do the reverse-continue.

In our case, there is a bug in the rr tool itself, the reverse-continue does not work with the watch point. The original rr paper did not support ARM. The porting to ARM64 was only done recently after the ARM64 LSE platform became available on the market, which rr depends on. We were able to use the tool anyway by running the “next” command again and again to check when the value is overridden. We set up several breakpoints to speed up this process:

lj_trace_exit, this function does the transformation from JITed code to interpreter in LuaJIT. It is an important point that marks whether the error happens in JITed code or in the interpreter.
TRACE_939, this the JITed code trace that contains the line init.lua:1441 in the error message. We can get this information from the JIT dump log file.

After setting up these breakpoints, we can run “continue” or “reverse-continue” in the gdb and examine the memory address that contains the wrong value regularly to search the code that writes the wrong value into that memory address.

Using this technique we found the location and it is in the following JITed code:

---- TRACE 1464 start 957/2 init.lua:1439
1141  ITERC   62   3   3       (init.lua:1439)
0000  . FUNCC               ; ipairs_aux
1142  JITERL  62 939       (init.lua:1439)
---- TRACE 1464 IR
0001  fun SLOAD  #61   T
0002  tab SLOAD  #62   T
0003  int SLOAD  #63   T
0004  fun EQ     0001  ipairs_aux
0005  int ADD    0003  +1  
0006  int FLOAD  0002  tab.asize
0007  int ABC    0006  0005
0008  p64 FLOAD  0002  tab.array
0009  p64 AREF   0008  0005
0010  str ALOAD  0009
---- TRACE 1464 mcode 176
ffffbb551d9c  mov   x25, #35480
ffffbb551da0  movk  x25, #58165, lsl #16
ffffbb551da4  movk  x25, #17687, lsl #32
...
ffffbb551e34  add   x30, x1, w28, uxtw
ffffbb551e38  str   x30, [x19, #496]
ffffbb551e3c  add   x30, x1, w28, uxtw
ffffbb551e40  stp   x30, x0, [x19, #488] // wrong value is written here
ffffbb551e44  add   sp, sp, #64
ffffbb551e48  b     0xbb5b0520
---- TRACE 1464 stop -> 939

The stp x30, x0, [x19, #488] is the line of ARM64 assembly code that corrupts the memory. To fix this error, we need to deep dive into the LuaJIT JIT compiler and see what happens during the code generation.

Fix the error in the code generation

After searching the LuaJIT source code, there is only one place that can generate the “stp” instruction, the emit_lso function.

static void emit_lso(ASMState *as, A64Ins ai, Reg rd, Reg rn, int64_t ofs)
{
  int ot = emit_checkofs(ai, ofs), sc = (ai >> 30) & 3;
  lj_assertA(ot, "load/store offset %d out of range", ofs);
  /* Combine LDR/STR pairs to LDP/STP. */
  if ((sc == 2 || sc == 3) &&
      (!(ai & 0x400000) || rd != rn) &&
      as->mcp != as->mcloop) {
    uint32_t prev = *as->mcp & ~A64F_D(31);
    int ofsm = ofs - (1<<sc), ofsp = ofs + (1<<sc);
    A64Ins aip;
    if (prev == (ai | A64F_N(rn) | A64F_U12(ofsm>>sc)) ||
        prev == ((ai^A64I_LS_U) | A64F_N(rn) | A64F_S9(ofsm&0x1ff))) {
      aip = (A64F_A(rd) | A64F_D(*as->mcp & 31));
    } else if (prev == (ai | A64F_N(rn) | A64F_U12(ofsp>>sc)) ||
               prev == ((ai^A64I_LS_U) | A64F_N(rn) | A64F_S9(ofsp&0x1ff))) {
      aip = (A64F_D(rd) | A64F_A(*as->mcp & 31));
      ofsm = ofs;
    } else {
      goto nopair;
    }
    if (ofsm >= (int)((unsigned int)-64<<sc) && ofsm <= (63<<sc)) {
      *as->mcp = aip | A64F_N(rn) | ((ofsm >> sc) << 15) |
        (ai ^ ((ai == A64I_LDRx || ai == A64I_STRx) ? 0x50000000 : 0x90000000));
      return;
    }
  }
nopair:
  if (ot == 1)
    *--as->mcp = ai | A64F_D(rd) | A64F_N(rn) | A64F_U12(ofs >> sc);
  else
    *--as->mcp = (ai^A64I_LS_U) | A64F_D(rd) | A64F_N(rn) | A64F_S9(ofs & 0x1ff);
}

We can set a breakpoint on this function and check how the “stp” instruction is generated.

Basically, this function is doing two things:

Generate regular load/store instruction for the load/store Intermediate Representation in the LuaJIT compiler.
Try to merge consecutive load/store instructions when some conditions meet, like if the 2 stores have the same index register and the offset is adjacent, which is also called peephole optimization in compiler optimization.

In our failure case, this function has merged the following 2 store instructions:

str x30, [x19, #488] // merge into ==> stp x30, x0, [x19, #488]
stur x0, [x19, #-16]

This is not correct!

Because the offset 488 and -16 are not adjacent. The next store position of 488 is:

488 + 8 = 496 // 8 for 64 bit data.

The code treats real offset -16 the same as 496, which is wrong. The following explains why this happens.

And according to ARM64 instruction definition, the stur instruction is encoded like:

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
1	x	1	1	1	0	0	0	0	0	0	imm9									0	0	Rn					Rt

The offset (like -16) is stored in field imm9, which is short for immediate integer with 9 bits long. Imm9 can be either signed or unsigned depending how you interpret it. And 496 in the imm9 unsigned integer format is 1 1111 000. -16 in the imm9 signed integer format is also 1 1111 000. They are the same! This leads to passing the check at this line in above emit_lso function:

prev == ((ai^A64I_LS_U) | A64F_N(rn) | A64F_S9(ofsp & 0x1ff))

This causes the generation of the wrong instruction: stp x30, x0, [x19, #488]. To fix this error, we can add additional checks for the offset range before merging as we did in the submitted PR to LuaJIT upstream.

In retrospect, peephole optimization errors should always be triggered if there are such pairs of instructions. So why does it happen intermittently in our CI, which always has the same input in every run?

The reason is Lua language itself has non-deterministic behaviors like the order of table iteration is undefined. And LuaJIT itself also has non-deterministic behaviors in the implementation, like the hot loop count, side trace penalty counter, etc. All of them will cause different JIT compilation behaviors even with the same input, especially if the test is large with thousands of JIT traces (in the above case, there are about 1500 JIT traces). These non-deterministic behaviors lead to intermittent failure.

hacker news link.