Why fork+exec Takes 100ms on My Mac: Debugging Slow Emacs with Instruments
Summary
Magit was taking 3 seconds on a small repo. The culprit: OCLP's system
library patches invalidate macOS dyld shared cache closures, causing
every fork+exec to fall back to JustInTimeLoader at ~100ms per
process launch. Magit spawns ~25 git processes per magit-status.
Introduction
I found my fresh Emacs install is slow on macOS 15, especially when
running Magit on a tiny git repo. For example, it takes 2~3 seconds to
run the magit-status command.
I tried several tweaks like DNS resolving and switching to a plain
Emacs config (emacs -Q), but none of them worked. I decided to do
further investigation by profiling.
First Profiling Tool: the Emacs Built-in Profiler
Emacs has a built-in profiler. It can be used as:
profiler-start ;; run profiler-report to get the result.
Here is the report when running magit-status on the small git repo:
95% - ... 95% - list 95% - cons 95% - cons 95% - cons 95% - cons 95% - cons 95% - cons 95% - cons 95% - eval-expression--debug 95% - debug 95% - recursive-edit 95% - command-execute 95% - call-interactively 78% - byte-code 78% - read-extended-command 78% - read-extended-command-1 78% - completing-read 78% - completing-read-default 78% - read-from-minibuffer 31% + redisplay_internal (C function) 15% + icomplete-post-command-hook 0% + command-execute 0% + timer-event-handler 0% + minibuffer-inactive-mode 16% + funcall-interactively 0% + redisplay_internal (C function) 4% Automatic GC
Nothing hot on the CPU is related to Magit. This makes sense — the CPU profiler only shows on-CPU time, but Magit's slowness may be in waiting for I/O (off-CPU). This is why off-CPU(Off-CPU profiling https://www.brendangregg.com/offcpuanalysis.html) profiling matters.
I turned to the ELP profiler, which is a tracing tool that helps identify I/O issues, such as when the code is scheduled out and not running on CPU. It simply records the timestamp when a function starts and when it ends, reporting the wall-clock time it takes.
The magit-toggle-profiling command can do the work. Run
magit-toggle-profiling, then run magit-status. After it finishes,
run magit-toggle-profiling again to show the following result:
;; time is in seconds. Function Name Call Count Elapsed Time Average Time magit-run-section-hook 2 4.043413 2.0217065 magit-process-git 25 3.587962 0.14351848 magit-process-file 25 3.587085 0.1434834 magit-status 1 3.373008 3.373008 ...
The above result is a little confusing if you are not familiar with
Magit's implementation. Each line is independent, and
magit-process-git actually calls magit-process-file internally,
so the "Elapsed Time" and "Call Count" are almost the same. At least
we know the slowness is caused by magit-process-file. It would be better
to use a real Off-CPU profiler to get the whole picture of what is going on.
Profiling with Instruments
The Instruments profiler included in Xcode can do Off-CPU profiling by using the "Time Profiler" with "Record Waiting Threads" enabled in the Time Profiler's config.
Figure 1: Emacs Profiling Flame Graph
The above flame graph was captured while the magit-status command
was running. There is a big portion of the graph related to the read
syscall, as highlighted. Other parts like pselect are
irrelevant to the slowness in Magit, because from the call stack we
can tell they come from the read_key_sequence function, which handles
user input.
Besides the "Time Profiler", Instruments can also record a
"System Call Trace". So if we click the read in the flame graph,
we can see the time spent in it and the read syscall's arguments,
as shown below. It is reading file descriptor 8, and it spends 250.4ms
waiting in the first call. The total time spent in the read syscall
is around 3 seconds and matches the time to run magit-status.
So the slow read syscall is causing the issue.
CPU Time=1.41 ms Wait Time=250.40 ms Emacs-x86_64-10_14 (33755) Main Thread 0x304972 read(fd:8, buf:0x7ff7bc6a4f40, len:16.00 KiB) = 16.00 KiB 0x8 0x7ff7bc6a4f40 0x4000 0x10 0x4000 0x0 n/a CPU Time=272.95 µs Wait Time=122.76 ms Emacs-x86_64-10_14 (33755) Main Thread 0x304972 read(fd:8, buf:0x7ff7bc6a4ee0, len:16.00 KiB) = 0 Bytes0x8 0x7ff7bc6a4ee0 0x4000 0x10 0x0 0x4 end-of-file was encountered CPU Time=343.33 µs Wait Time=120.95 ms Emacs-x86_64-10_14 (33755) Main Thread 0x304972 read(fd:8, buf:0x7ff7bc6a4f10, len:16.00 KiB) = 0 Bytes 0x8 0x7ff7bc6a4f10 0x4000 0x10 0x0 0x0 end-of-file was encountered . . . about 30 more similar read calls with the same fd:8
Connecting the Dots between ELP Magit Tracing and Instruments Tracing
Unfortunately, Instruments does not understand Elisp
interpreter frames and only shows C functions in the call
stack. So just from the call stack in the flame graph, we cannot tell which
Elisp function calls into the read syscall.
So let's check what fd:8 is in the above read syscall. Is it
a file in the git repo being read by magit-status that causes the
slowness?
Running the following command while magit-status runs shows that fd:8
is actually a PIPE:
# capture what is fd:8 while true; do lsof -p $(pgrep -f "emacs") | grep -E "8u"; sleep 0.1; done
This makes sense. From the flame graph's call stack of the
read syscall, it comes from Emacs's call_process function, and the
ELP results show that the slow function is magit-process-file, which
internally calls call_process. It turns out Magit uses fd:8,
a PIPE, to read the output of the git command. As shown by the
following log, if we capture magit-process-file's arguments by:
(defun my-magit-time-process-file (orig-fn program &rest args) (let ((start (current-time))) (prog1 (apply orig-fn program args) (let ((elapsed (float-time (time-subtract (current-time) start)))) (when (> elapsed 0.01) ;; Extract just the git arguments (skip nil and file handles) (let* ((git-args (seq-filter #'stringp (flatten-list args))) (cmd (string-join (cons program git-args) " "))) (message "\n[MAGIT] %.3fs:\n %s\n" elapsed cmd))))))) (advice-add 'magit-process-file :around #'my-magit-time-process-file)
We will see what magit-process-file is doing during magit-status:
[MAGIT] 0.104s:
/usr/local/bin/git --no-pager --literal-pathspecs -c core.preloadIndex=true -c log.showSignature=false -c color.ui=false -c color.diff=false -c diff.noPrefix=false config --list -z
[MAGIT] 0.097s:
/usr/local/bin/git --no-pager --literal-pathspecs -c core.preloadIndex=true -c log.showSignature=false -c color.ui=false -c color.diff=false -c diff.noPrefix=false log --no-walk --format=%h %s HEAD^{commit} --
[MAGIT] 0.092s:
/usr/local/bin/git --no-pager --literal-pathspecs -c core.preloadIndex=true -c log.showSignature=false -c color.ui=false -c color.diff=false -c diff.noPrefix=false symbolic-ref --short HEAD
Why Is the read Syscall Slow?
As we saw in the last section, the read syscall reads from a
PIPE to get git's output. So it is not the read syscall itself that
is slow; the slowness comes from the forked child process, and the parent
process just uses read to wait for the child process to produce output.
A test like below shows how slow creating a new process is when running an executable:
#include <stdio.h> #include <stdlib.h> #include <string.h> #include <unistd.h> #include <sys/time.h> #include <sys/wait.h> double get_time_ms(void) { struct timeval tv; gettimeofday(&tv, NULL); return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0; } int main(int argc, char *argv[]) { int iterations = 5; /* Test 1: fork + _exit (no exec) */ printf("=== Test 1: fork + _exit (no exec) ===\n"); for (int i = 0; i < iterations; i++) { double start = get_time_ms(); pid_t pid = fork(); if (pid == 0) { _exit(0); } int status; waitpid(pid, &status, 0); printf(" %d: %.3f ms\n", i + 1, get_time_ms() - start); } /* Test 2: fork + exec /usr/bin/true */ printf("\n=== Test 2: fork + exec /usr/bin/true ===\n"); for (int i = 0; i < iterations; i++) { double start = get_time_ms(); pid_t pid = fork(); if (pid == 0) { char *args[] = {"/usr/bin/true", NULL}; execv("/usr/bin/true", args); _exit(1); } int status; waitpid(pid, &status, 0); printf(" %d: %.3f ms\n", i + 1, get_time_ms() - start); } return 0; }
=== Test 1: fork + _exit (no exec) === 1: 0.864 ms 2: 0.643 ms 3: 0.620 ms 4: 0.601 ms 5: 0.621 ms === Test 2: fork + exec /usr/bin/true === 1: 108.289 ms 2: 100.957 ms 3: 100.923 ms 4: 95.505 ms 5: 92.196 ms
The fork alone takes less than 1ms, but fork + exec takes ~100ms.
The Instruments profiler cannot follow the forked process during profiling, so the following tweak is needed to enable profiling on the forked process:
/* * 1. fork() * 2. Child: stop itself with SIGSTOP before exec * 3. Parent: print the child PID and wait for you to attach Instruments * 4. After you press Enter, parent sends SIGCONT to resume the child * 5. Child does exec() */ #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <signal.h> #include <sys/wait.h> int main(int argc, char *argv[]) { if (argc < 2) { fprintf(stderr, "Usage: %s <program> [args...]\n", argv[0]); fprintf(stderr, "Example: %s /usr/local/bin/git --version\n", argv[0]); return 1; } pid_t pid = fork(); if (pid == 0) { /* Child: stop ourselves, then exec */ raise(SIGSTOP); execvp(argv[1], &argv[1]); perror("exec failed"); _exit(1); } /* Parent: wait for child to stop */ int status; waitpid(pid, &status, WUNTRACED); printf("Child PID: %d\n", pid); printf("Attach Instruments to PID %d now.\n", pid); printf(" In Instruments: File -> Record Options -> Attach to Process -> %d\n", pid); printf(" Enable 'Record Waiting Threads' in Time Profiler settings.\n"); printf("Press Enter to resume the child and start exec...\n"); getchar(); kill(pid, SIGCONT); waitpid(pid, &status, 0); printf("Child exited with status %d\n", WEXITSTATUS(status)); return 0; }
And here is the flame graph:
Figure 2: Test Fork and Exec Profiling Flame Graph
For even a simple program like /usr/bin/true, most of the time is spent in
JustInTimeLoader. This is slow and unusual.
macOS 15 is installed on my old Intel MacBook. Version 15 is not
officially supported on this hardware, and I installed it using the
OCLP tool. OCLP modifies system libraries to patch the OS for
unsupported hardware. This likely invalidates the dyld shared cache
closures stored in /var/db/dyld/, forcing dyld to fall back from
PrebuiltLoader (which is fast) to JustInTimeLoader for every
process launch. Normally dyld uses pre-computed closures in
/var/db/dyld/ to load shared libraries fast, but when the libraries
are modified (as OCLP does), these closures become invalid and dyld
must resolve everything at runtime.
If you run the fork+exec test takes >50ms, you may have a similar dyld cache issue.
The Solution
So far, there is no perfect solution. There is a project called magit-libgit, which avoids creating new processes, but it is not well maintained and is incomplete.
And tried a "static" git build to workaround the dynamic loader issue. But the "static" build still contains a bunch of dynamic lib loading. Finally it doesn't improve the performance at all.
Another try is to reduce the number of git calls by:
(with-eval-after-load 'magit (setq magit-status-sections-hook '(magit-insert-head-branch-header magit-insert-unstaged-changes magit-insert-staged-changes magit-insert-untracked-files)) (setq magit-status-headers-hook '(magit-insert-head-branch-header)) (setq magit-refresh-status-buffer nil))
This reduces the 2~3 seconds to about 1 second.