Skip to content

Conversation

@KJTsanaktsidis
Copy link
Contributor

@KJTsanaktsidis KJTsanaktsidis commented Nov 7, 2023

rb_register_postponed_job is designed to be both threadsafe and async-signal-safe. That is to say, it should be legal to call it in all of the following circumstances:

  • From a thread holding the GVL
  • From a Ruby thread not holding the GVL
  • From a native thread not otherwise known to the Ruby VM
  • From a signal handler, interrupting non-ruby code
  • From a signal handler, interrupting the Ruby VM
  • From a signal handler, interrupting an execution of rb_register_postponed_job (i.e. it must be reentrant)
  • From a signal handler, interrupting rb_postponed_job_flush

The current implementation is unfortunately racy when called from a thread whilst rb_postponed_job_flush is running in another thread. It is possible for one thread executing postponed jobs to decrement vm->postponed_job_index, and then for another thread to re-incremenet it & overwrite the relevant slot in vm->postponed_job_buffer before the first thread goes to execute the original job. This can lead to a function being called with an argument intented for a different function.

This commit introduces a new implementation of rb_register_postponed_job and rb_postponed_job_flush that attempts to address these safety issues. The postponed job buffer is now protected by a semaphore. A semaphore is used instead of a mutex, because POSIX requires that sem_wait and sem_post be async-signal-safe, whilst pthread_mutex's are not required to be so. To protect against the postponed job buffer being mutated in a signal handler whilst rb_postponed_job_flush is executing, it also masks signals around the buffer access.

This is slightly slower than using a mutex (it requires a system call both to mask signals & acquire the semaphore, whereas a pthread mutex usually requires no system calls when uncontended). Hopefully, however, it should be more correct!

This commit also re-implements rb_workqueue_register in terms of rb_register_postponed_job, since it should be async-signal-safe and threadsafe now.

Fixes #19991 (https://bugs.ruby-lang.org/issues/19991)

rb_register_postponed_job is designed to be both threadsafe and
async-signal-safe. That is to say, it should be legal to call it in all
of the following circumstances:

* From a thread holding the GVL
* From a Ruby thread not holding the GVL
* From a native thread not otherwise known to the Ruby VM
* From a signal handler, interrupting non-ruby code
* From a signal handler, interrupting the Ruby VM
* From a signal handler, interrupting an execution of
  rb_register_postponed_job (i.e. it must be _reentrant_)
* From a signal handler, interrupting rb_postponed_job_flush

The current implementation is unfortunately racy when called from a
thread whilst rb_postponed_job_flush is running in another thread. It is
possible for one thread executing postponed jobs to decrement
vm->postponed_job_index, and then for another thread to re-incremenet it
& overwrite the relevant slot in vm->postponed_job_buffer before the
first thread goes to execute the original job. This can lead to a
function being called with an argument intented for a _different_
function.

This commit introduces a new implementation of rb_register_postponed_job
and rb_postponed_job_flush that attempts to address these safety issues.
The postponed job buffer is now protected by a semaphore. A semaphore is
used instead of a mutex, because POSIX requires that sem_get and
sem_post be async-signal-safe, whilst pthread_mutex's are not required
to be so. To protect against the postponed job buffer being mutated in a
signal handler whilst rb_postponed_job_flush is executing, it also masks
signals around the buffer access.

This is slightly slower than using a mutex (it requires a system call
both to mask signals & acquire the semaphore, whereas a pthread mutex
usually requires no system calls when uncontended). Hopefully, however,
it should be more correct!

This commit also re-implements rb_workqueue_register in terms of
rb_register_postponed_job, since it should be async-signal-safe and
threadsafe now.

Fixes #19991
@KJTsanaktsidis KJTsanaktsidis force-pushed the ktsanaktsidis/postponed_job_thread_safety branch from 342f30f to 64c666a Compare November 7, 2023 09:24
Comment on lines +1750 to +1756
jobs = malloc(job_bytes);
if (jobs) {
memcpy(jobs, vm->postponed_job_buffer->jobs, job_bytes);
postponed_job_count = vm->postponed_job_buffer->job_count;
vm->postponed_job_buffer->job_count = 0;
}
rb_native_async_safe_mutex_unlock(&vm->postponed_job_buffer->lock);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you missed freeing the jobs after going through them :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops! great catch, thank you!

@ko1
Copy link
Contributor

ko1 commented Nov 16, 2023

Thank you for the summary.

1. From a thread holding the GVL
2. From a Ruby thread not holding the GVL
3. From a native thread not otherwise known to the Ruby VM
4. From a signal handler, interrupting non-ruby code
5. From a signal handler, interrupting the Ruby VM
6. From a signal handler, interrupting an execution of rb_register_postponed_job (i.e. it must be reentrant)
7. From a signal handler, interrupting rb_postponed_job_flush

The current implementation is unfortunately racy when called from a thread whilst rb_postponed_job_flush is running in another thread. It is possible for one thread executing postponed jobs to decrement vm->postponed_job_index, and then for another thread to re-incremenet it & overwrite the relevant slot in vm->postponed_job_buffer before the first thread goes to execute the original job. This can lead to a function being called with an argument intented for a different function.

So it violates 2 and 3.

can we make them safe by making the buffer as a queue?
(or introducing spin-lock for such critical sections)

I'm afraid that using semaphore introduce non-ignorable overhead, for example for the profilers.

@KJTsanaktsidis
Copy link
Contributor Author

@ko1 I hear what you're saying - the semaphores aren't free.

I opened a new PR with a different approach: #8949. This keeps the workqueue/postponed_job infrastructure separate, and uses a lock-free ringbuffer to manage the signal-safe postponed job list.

It's a lot more complicated than this PR, but it also seems to perform better too. WDYT?

@KJTsanaktsidis
Copy link
Contributor Author

We merged #9041 instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants