-
Notifications
You must be signed in to change notification settings - Fork 5.6k
Fix thread unsafety with rb_register_postponed_job #8856
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix thread unsafety with rb_register_postponed_job #8856
Conversation
rb_register_postponed_job is designed to be both threadsafe and async-signal-safe. That is to say, it should be legal to call it in all of the following circumstances: * From a thread holding the GVL * From a Ruby thread not holding the GVL * From a native thread not otherwise known to the Ruby VM * From a signal handler, interrupting non-ruby code * From a signal handler, interrupting the Ruby VM * From a signal handler, interrupting an execution of rb_register_postponed_job (i.e. it must be _reentrant_) * From a signal handler, interrupting rb_postponed_job_flush The current implementation is unfortunately racy when called from a thread whilst rb_postponed_job_flush is running in another thread. It is possible for one thread executing postponed jobs to decrement vm->postponed_job_index, and then for another thread to re-incremenet it & overwrite the relevant slot in vm->postponed_job_buffer before the first thread goes to execute the original job. This can lead to a function being called with an argument intented for a _different_ function. This commit introduces a new implementation of rb_register_postponed_job and rb_postponed_job_flush that attempts to address these safety issues. The postponed job buffer is now protected by a semaphore. A semaphore is used instead of a mutex, because POSIX requires that sem_get and sem_post be async-signal-safe, whilst pthread_mutex's are not required to be so. To protect against the postponed job buffer being mutated in a signal handler whilst rb_postponed_job_flush is executing, it also masks signals around the buffer access. This is slightly slower than using a mutex (it requires a system call both to mask signals & acquire the semaphore, whereas a pthread mutex usually requires no system calls when uncontended). Hopefully, however, it should be more correct! This commit also re-implements rb_workqueue_register in terms of rb_register_postponed_job, since it should be async-signal-safe and threadsafe now. Fixes #19991
342f30f to
64c666a
Compare
| jobs = malloc(job_bytes); | ||
| if (jobs) { | ||
| memcpy(jobs, vm->postponed_job_buffer->jobs, job_bytes); | ||
| postponed_job_count = vm->postponed_job_buffer->job_count; | ||
| vm->postponed_job_buffer->job_count = 0; | ||
| } | ||
| rb_native_async_safe_mutex_unlock(&vm->postponed_job_buffer->lock); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you missed freeing the jobs after going through them :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops! great catch, thank you!
|
Thank you for the summary.
So it violates 2 and 3. can we make them safe by making the buffer as a queue? I'm afraid that using semaphore introduce non-ignorable overhead, for example for the profilers. |
|
@ko1 I hear what you're saying - the semaphores aren't free. I opened a new PR with a different approach: #8949. This keeps the workqueue/postponed_job infrastructure separate, and uses a lock-free ringbuffer to manage the signal-safe postponed job list. It's a lot more complicated than this PR, but it also seems to perform better too. WDYT? |
|
We merged #9041 instead. |
rb_register_postponed_job is designed to be both threadsafe and async-signal-safe. That is to say, it should be legal to call it in all of the following circumstances:
The current implementation is unfortunately racy when called from a thread whilst rb_postponed_job_flush is running in another thread. It is possible for one thread executing postponed jobs to decrement vm->postponed_job_index, and then for another thread to re-incremenet it & overwrite the relevant slot in vm->postponed_job_buffer before the first thread goes to execute the original job. This can lead to a function being called with an argument intented for a different function.
This commit introduces a new implementation of rb_register_postponed_job and rb_postponed_job_flush that attempts to address these safety issues. The postponed job buffer is now protected by a semaphore. A semaphore is used instead of a mutex, because POSIX requires that sem_wait and sem_post be async-signal-safe, whilst pthread_mutex's are not required to be so. To protect against the postponed job buffer being mutated in a signal handler whilst rb_postponed_job_flush is executing, it also masks signals around the buffer access.
This is slightly slower than using a mutex (it requires a system call both to mask signals & acquire the semaphore, whereas a pthread mutex usually requires no system calls when uncontended). Hopefully, however, it should be more correct!
This commit also re-implements rb_workqueue_register in terms of rb_register_postponed_job, since it should be async-signal-safe and threadsafe now.
Fixes #19991 (https://bugs.ruby-lang.org/issues/19991)