Fix thread unsafety with rb_register_postponed_job #8856

KJTsanaktsidis · 2023-11-07T09:19:43Z

rb_register_postponed_job is designed to be both threadsafe and async-signal-safe. That is to say, it should be legal to call it in all of the following circumstances:

From a thread holding the GVL
From a Ruby thread not holding the GVL
From a native thread not otherwise known to the Ruby VM
From a signal handler, interrupting non-ruby code
From a signal handler, interrupting the Ruby VM
From a signal handler, interrupting an execution of rb_register_postponed_job (i.e. it must be reentrant)
From a signal handler, interrupting rb_postponed_job_flush

The current implementation is unfortunately racy when called from a thread whilst rb_postponed_job_flush is running in another thread. It is possible for one thread executing postponed jobs to decrement vm->postponed_job_index, and then for another thread to re-incremenet it & overwrite the relevant slot in vm->postponed_job_buffer before the first thread goes to execute the original job. This can lead to a function being called with an argument intented for a different function.

This commit introduces a new implementation of rb_register_postponed_job and rb_postponed_job_flush that attempts to address these safety issues. The postponed job buffer is now protected by a semaphore. A semaphore is used instead of a mutex, because POSIX requires that sem_wait and sem_post be async-signal-safe, whilst pthread_mutex's are not required to be so. To protect against the postponed job buffer being mutated in a signal handler whilst rb_postponed_job_flush is executing, it also masks signals around the buffer access.

This is slightly slower than using a mutex (it requires a system call both to mask signals & acquire the semaphore, whereas a pthread mutex usually requires no system calls when uncontended). Hopefully, however, it should be more correct!

This commit also re-implements rb_workqueue_register in terms of rb_register_postponed_job, since it should be async-signal-safe and threadsafe now.

Fixes #19991 (https://bugs.ruby-lang.org/issues/19991)

rb_register_postponed_job is designed to be both threadsafe and async-signal-safe. That is to say, it should be legal to call it in all of the following circumstances: * From a thread holding the GVL * From a Ruby thread not holding the GVL * From a native thread not otherwise known to the Ruby VM * From a signal handler, interrupting non-ruby code * From a signal handler, interrupting the Ruby VM * From a signal handler, interrupting an execution of rb_register_postponed_job (i.e. it must be _reentrant_) * From a signal handler, interrupting rb_postponed_job_flush The current implementation is unfortunately racy when called from a thread whilst rb_postponed_job_flush is running in another thread. It is possible for one thread executing postponed jobs to decrement vm->postponed_job_index, and then for another thread to re-incremenet it & overwrite the relevant slot in vm->postponed_job_buffer before the first thread goes to execute the original job. This can lead to a function being called with an argument intented for a _different_ function. This commit introduces a new implementation of rb_register_postponed_job and rb_postponed_job_flush that attempts to address these safety issues. The postponed job buffer is now protected by a semaphore. A semaphore is used instead of a mutex, because POSIX requires that sem_get and sem_post be async-signal-safe, whilst pthread_mutex's are not required to be so. To protect against the postponed job buffer being mutated in a signal handler whilst rb_postponed_job_flush is executing, it also masks signals around the buffer access. This is slightly slower than using a mutex (it requires a system call both to mask signals & acquire the semaphore, whereas a pthread mutex usually requires no system calls when uncontended). Hopefully, however, it should be more correct! This commit also re-implements rb_workqueue_register in terms of rb_register_postponed_job, since it should be async-signal-safe and threadsafe now. Fixes #19991

ivoanjo · 2023-11-15T11:40:13Z

vm_trace.c

+    jobs = malloc(job_bytes);
+    if (jobs) {
+        memcpy(jobs, vm->postponed_job_buffer->jobs, job_bytes);
+        postponed_job_count = vm->postponed_job_buffer->job_count;
+        vm->postponed_job_buffer->job_count = 0;
+    }
+    rb_native_async_safe_mutex_unlock(&vm->postponed_job_buffer->lock);


I think you missed freeing the jobs after going through them :)

Oops! great catch, thank you!

ko1 · 2023-11-16T16:49:14Z

Thank you for the summary.

1. From a thread holding the GVL
2. From a Ruby thread not holding the GVL
3. From a native thread not otherwise known to the Ruby VM
4. From a signal handler, interrupting non-ruby code
5. From a signal handler, interrupting the Ruby VM
6. From a signal handler, interrupting an execution of rb_register_postponed_job (i.e. it must be reentrant)
7. From a signal handler, interrupting rb_postponed_job_flush

The current implementation is unfortunately racy when called from a thread whilst rb_postponed_job_flush is running in another thread. It is possible for one thread executing postponed jobs to decrement vm->postponed_job_index, and then for another thread to re-incremenet it & overwrite the relevant slot in vm->postponed_job_buffer before the first thread goes to execute the original job. This can lead to a function being called with an argument intented for a different function.

So it violates 2 and 3.

can we make them safe by making the buffer as a queue?
(or introducing spin-lock for such critical sections)

I'm afraid that using semaphore introduce non-ignorable overhead, for example for the profilers.

KJTsanaktsidis · 2023-11-20T06:27:11Z

@ko1 I hear what you're saying - the semaphores aren't free.

I opened a new PR with a different approach: #8949. This keeps the workqueue/postponed_job infrastructure separate, and uses a lock-free ringbuffer to manage the signal-safe postponed job list.

It's a lot more complicated than this PR, but it also seems to perform better too. WDYT?

KJTsanaktsidis · 2023-12-10T23:34:08Z

We merged #9041 instead.

KJTsanaktsidis force-pushed the ktsanaktsidis/postponed_job_thread_safety branch from 342f30f to 64c666a Compare November 7, 2023 09:24

ivoanjo reviewed Nov 15, 2023

View reviewed changes

KJTsanaktsidis mentioned this pull request Nov 20, 2023

Take 2: Fix thread unsafety with rb_register_postponed_job (without locking) #8949

Merged

KJTsanaktsidis closed this Dec 10, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix thread unsafety with rb_register_postponed_job #8856

Fix thread unsafety with rb_register_postponed_job #8856

Uh oh!

KJTsanaktsidis commented Nov 7, 2023 •

edited

Loading

Uh oh!

ivoanjo Nov 15, 2023

Uh oh!

KJTsanaktsidis Nov 15, 2023

Uh oh!

ko1 commented Nov 16, 2023

Uh oh!

KJTsanaktsidis commented Nov 20, 2023

Uh oh!

KJTsanaktsidis commented Dec 10, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Fix thread unsafety with rb_register_postponed_job #8856

Fix thread unsafety with rb_register_postponed_job #8856

Uh oh!

Conversation

KJTsanaktsidis commented Nov 7, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ivoanjo Nov 15, 2023

Choose a reason for hiding this comment

Uh oh!

KJTsanaktsidis Nov 15, 2023

Choose a reason for hiding this comment

Uh oh!

ko1 commented Nov 16, 2023

Uh oh!

KJTsanaktsidis commented Nov 20, 2023

Uh oh!

KJTsanaktsidis commented Dec 10, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

KJTsanaktsidis commented Nov 7, 2023 •

edited

Loading