
Conversation

@nsingl00 nsingl00 commented Jul 16, 2025

Problem

Fixes #975 and #1418

Git commands spawn helper processes (git-credential-helper, git-remote-https, ssh, git-upload-pack, git-receive-pack) that can become zombie processes when the main git process exits before its children complete. This is particularly
problematic in cloud/container environments where:

  • The application runs as PID 1 or under minimal init systems
  • Standard init process zombie reaping may be unreliable or slow
  • Resource constraints can cause zombie accumulation over time

Root Cause

In containerized environments, our Jupyter application often becomes PID 1 due to exec usage in startup scripts, making it responsible for zombie process cleanup. Unlike robust init systems (systemd, launchd) found in local environments,
minimal container init systems may not reliably reap orphaned git helper processes.

Solution

Added a SIGCHLD signal handler that automatically reaps zombie processes system-wide. The handler:

  • Uses non-blocking os.waitpid(-1, os.WNOHANG) to reap any zombie children
  • Runs whenever any child process terminates (SIGCHLD signal)
  • Prevents zombie accumulation without affecting normal git operations
  • Logs reaped processes at debug level for monitoring
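A minimal sketch of such a SIGCHLD handler (the names here are illustrative, not the exact code from this PR; the real handler logs at debug level instead of collecting PIDs) might look like:

```python
import os
import signal

# Track reaped PIDs; in the real handler this would be a debug log line.
reaped_pids = []

def reap_zombies(signum, frame):
    """SIGCHLD handler: reap any exited children without blocking."""
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children left to wait for
        if pid == 0:
            break  # children exist, but none have exited yet
        reaped_pids.append(pid)

# Register the handler so it runs whenever any child terminates (Unix only).
signal.signal(signal.SIGCHLD, reap_zombies)
```

Because `waitpid(-1, ...)` waits on any child, this reaps git helper processes system-wide, not just children the extension spawned directly.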

Testing

  • Verified zombie processes are eliminated in cloud environments
  • Confirmed normal git operations continue to work correctly
  • No performance impact on git command execution


@nsingl00
Author

Hey @krassowski can you help review the PR?

@ellisonbg
Contributor

Thanks for working on this @nsingl00 - wonder if this is related to #975

@nsingl00 nsingl00 changed the title from "Fix git subprocess termination issues to prevent process accumulation" to "Fix zombie process accumulation from git operations in cloud environments" Jul 24, 2025
  Git commands spawn helper processes (git-credential-helper, git-remote-https, ssh)
  that become zombies when the main git process exits before children complete. This
  is problematic in cloud/container environments where the application runs as PID 1
  or under minimal init systems that don't reliably reap orphaned processes.

  Added SIGCHLD signal handler to automatically reap zombie processes system-wide
  using non-blocking waitpid(), preventing resource leaks without affecting normal
  git operations.
@nsingl00
Author

Thanks for working on this @nsingl00 - wonder if this is related to #975

Yes, it is related. The problem is that in enterprise settings we run Jupyter in container environments where PID 1 is not a signal-handling init process like tini, so zombie process cleanup doesn't happen. Adding an extra SIGCHLD handler to reap the child processes in those scenarios works; this PR adds that handling for those cases.

Member

@krassowski krassowski left a comment


Thank you for opening this PR @nsingl00 and sorry for the late reply - I missed the notification after coming back from holidays.

I believe that signal.SIGCHLD will raise an AttributeError exception on Windows and break the extension; it is only available on Unix, per the documentation: https://docs.python.org/3/library/signal.html#signal.SIGCHLD. At a minimum, this would need to be wrapped in a check for whether signal.SIGCHLD is available.

In general, it feels like maybe there is something we could improve in the execute() function to make sure we avoid leaving zombies?

async def execute(
    cmdline: "List[str]",
    cwd: "str",
    timeout: "float" = 20,
    env: "Optional[Dict[str, str]]" = None,
    username: "Optional[str]" = None,
    password: "Optional[str]" = None,
    is_binary=False,
) -> "Tuple[int, str, str]":
    """Asynchronously execute a command.

    Args:
        cmdline (List[str]): Command line to be executed
        cwd (Optional[str]): Current working directory
        env (Optional[Dict[str, str]]): Defines the environment variables for the new process
        username (Optional[str]): User name
        password (Optional[str]): User password
    Returns:
        (int, str, str): (return code, stdout, stderr)
    """

    async def call_subprocess_with_authentication(
        cmdline: "List[str]",
        username: "str",
        password: "str",
        cwd: "Optional[str]" = None,
        env: "Optional[Dict[str, str]]" = None,
    ) -> "Tuple[int, str, str]":
        try:
            p = pexpect.spawn(
                cmdline[0],
                cmdline[1:],
                cwd=cwd,
                env=env,
                encoding="utf-8",
                timeout=None,
            )
            # We expect a prompt from git
            # In most of cases git will prompt for username and
            # then for password
            # In some cases (Bitbucket) username is included in
            # remote URL, so git will not ask for username
            i = await p.expect(["Username for .*: ", "Password for .*:"], async_=True)
            if i == 0:  # ask for username then password
                p.sendline(username)
                await p.expect("Password for .*:", async_=True)
                p.sendline(password)
            elif i == 1:  # only ask for password
                p.sendline(password)
            await p.expect(pexpect.EOF, async_=True)
            response = p.before
            returncode = p.wait()
            p.close()
            return returncode, "", response
        except pexpect.exceptions.EOF:  # In case of pexpect failure
            response = p.before
            returncode = p.exitstatus
            p.close()  # close process
            return returncode, "", response

    def call_subprocess(
        cmdline: "List[str]",
        cwd: "Optional[str]" = None,
        env: "Optional[Dict[str, str]]" = None,
        is_binary=is_binary,
    ) -> "Tuple[int, str, str]":
        process = subprocess.Popen(
            cmdline, stdout=subprocess.PIPE, stderr=subprocess.PIPE, cwd=cwd, env=env
        )
        output, error = process.communicate()
        if is_binary:
            return (
                process.returncode,
                base64.encodebytes(output).decode("ascii"),
                error.decode("utf-8"),
            )
        else:
            return (process.returncode, output.decode("utf-8"), error.decode("utf-8"))

    try:
        await execution_lock.acquire(timeout=datetime.timedelta(seconds=timeout))
    except tornado.util.TimeoutError:
        return (1, "", "Unable to get the lock on the directory")

    try:
        # Ensure our execution operation will succeed by first checking and waiting for the lock to be removed
        time_slept = 0
        lockfile = os.path.join(cwd, ".git", "index.lock")
        while os.path.exists(lockfile) and time_slept < MAX_WAIT_FOR_LOCK_S:
            await tornado.gen.sleep(CHECK_LOCK_INTERVAL_S)
            time_slept += CHECK_LOCK_INTERVAL_S
        # If the lock still exists at this point, we will likely fail anyway, but let's try anyway

        get_logger().debug("Execute {!s} in {!s}.".format(cmdline, cwd))
        if username is not None and password is not None:
            code, output, error = await call_subprocess_with_authentication(
                cmdline,
                username,
                password,
                cwd,
                env,
            )
        else:
            current_loop = tornado.ioloop.IOLoop.current()
            code, output, error = await current_loop.run_in_executor(
                None, call_subprocess, cmdline, cwd, env
            )
        log_output = (
            output[:MAX_LOG_OUTPUT] + "..." if len(output) > MAX_LOG_OUTPUT else output
        )
        log_error = (
            error[:MAX_LOG_OUTPUT] + "..." if len(error) > MAX_LOG_OUTPUT else error
        )
        get_logger().debug(
            "Code: {}\nOutput: {}\nError: {}".format(code, log_output, log_error)
        )
    except BaseException:
        code, output, error = -1, "", traceback.format_exc()
        get_logger().warning("Fail to execute {!s}".format(cmdline), exc_info=True)
    finally:
        execution_lock.release()

    return code, output, error
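One hypothetical direction for improving execute() itself, rather than adding a global handler (a sketch only; run_git is an illustrative name, not part of the extension): asyncio's subprocess API always awaits the child, and on Unix the event loop's child watcher reaps it, so processes started this way cannot linger as zombies even when killed on timeout.

```python
import asyncio

async def run_git(cmdline, cwd=None, env=None, timeout=20):
    """Hypothetical sketch: run a command via asyncio so the event loop
    awaits (and therefore reaps) the child, even when it is killed."""
    proc = await asyncio.create_subprocess_exec(
        *cmdline,
        cwd=cwd,
        env=env,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        out, err = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()  # explicitly reap the killed child
        return -1, "", "timed out"
    return proc.returncode, out.decode("utf-8"), err.decode("utf-8")
```

This would not cover the pexpect authentication path, which would still need its own cleanup, but it removes the executor-thread Popen as a source of unreaped children.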
