-
Notifications
You must be signed in to change notification settings - Fork 337
Restart async callback dispatcher thread after fork #1117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Restart async callback dispatcher thread after fork #1117
Conversation
8576d92
to
8bdfaf7
Compare
:( I accidently pushed a few versions of this PR which hang forever, and github seems to be waiting for all the previous enqueued jobs to time out (after six hours by default! :o) before running the tests on the latest version of this PR. So I guess I'm going to have to wait till tomorrow to get the test results here. |
8bdfaf7
to
7bc583d
Compare
If we don't do this, callbacks registered before fork will never actually fire after forking, and just hang forever. We explicitly re-initialize the entire context (including re-initializing the mutexes and clearing out the list of callbacks), because: * If a thread in the parent was in the process of calling a callback when we fork, the mutex would otherwise be "stuck on" * I don't think we _want_ to actually execute any pending callbacks from other threads in the child process, so clearing the list makes sense anyway.
7bc583d
to
1f4989f
Compare
I canceled the CI jobs. Looks good now! |
Thanks! Looks like I ticked all the old rubies the right way, I think |
/* n.b. we _used_ to try and destroy the mutex/cond before initializing here, | ||
* but it's undefined what happens if you try and destory an unitialized cond. | ||
* glibc in particular seems to wait for any concurrent waiters to finish before | ||
* destroying a condvar, trying to destroy a condvar after fork that someone was | ||
* waiting on pre-fork won't work. Just re-init he memory directly. */ | ||
pthread_mutex_init(&ctx->async_cb_mutex, NULL); | ||
pthread_cond_init(&ctx->async_cb_cond, NULL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: I wonder if, given the comment, it would be cleaner to use
ctx->async_cb_mutex = PTHREAD_MUTEX_INITIALIZER;
ctx->async_cb_cond = PTHREAD_COND_INITIALIZER;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess they're equivalent, yeah. I can change this if we prefer.
Hello 👋 Just to let you know that we have deploy this branch in our test environment and this fix a fork safety issue we had with one of our internal library that use FFI, without any other side effects as of today Thanks a lot @KJTsanaktsidis 🙏 |
@beauraF you can use the karafka rdkafka fix meanwhile (rebuild the needed classes post-fork) to mitigate this. FYI I am also waiting on this. |
We're waiting on a PR on the upstream Ruby FFI repo for fork-safety: ffi/ffi#1117 This simple workaround also prevents the issue, so we don't have to wait.
I'd like to come back on this, one of our CI pipelines is blocked when we use this branch. Unfortunately I haven't had time to investigate further yet. No issues with @mensfeld workaround |
Thank you @KJTsanaktsidis for this great work! I think this can be merged, but the blocking CI should be investigated first. @beauraF It would be great if you can investigate it, or you can provide more details. I didn't notice any side effects on my own, but I don't have any use cases with fork() other than the ffi specs. |
Yeah @beauraF if you're able to provide some details, even if not a full reproduction, I can have a little stare at the code and think about what might be happening. |
So.. I just investigated, and the issue was fully on our side. Sorry about that.. |
Great news, thanks for confirming! |
If we don't do this, callbacks registered before fork will never actually fire after forking, and just hang forever.
We explicitly re-initialize the entire context (including re-initializing the mutexes and clearing out the list of callbacks), because:
Fixes: #1114