Skip to content

Fix a crash when using glshader = crt-auto-machine#3077

Merged
kcgen merged 2 commits intomainfrom
wd/shader
Nov 1, 2023
Merged

Fix a crash when using glshader = crt-auto-machine#3077
kcgen merged 2 commits intomainfrom
wd/shader

Conversation

@weirddan455
Copy link
Copy Markdown
Collaborator

@weirddan455 weirddan455 commented Oct 31, 2023

Description

This was a weird one. I got an assert failure in the shader code because the machine global was invalid. Removing the assert gave the same divide by 0 crash that @FeralChild64 got so I assume he was just running a Release build.

My solution is admittedly kind of a hack. I'm just reading the machine type from the config (extracted code from DOSBOX_RealInit into a separate function) because the shader code gets run before Dosbox get initialized.

As @johnnovak mentioned in the issue thread, init ordering is kind of a mess although in this case I am not seeing anything platform specific. I haven't tested on Windows but I would wager a guess that at least the assertion failure happens there as well.

I had to bust out gdb to figure out what was really going on here. If you're interested in the details, read on. The first time the shader code gets run, it's from this stack trace:

Thread 1 "dosbox" hit Breakpoint 1, ShaderManager::FindShaderAutoMachine[abi:cxx11]() const (
    this=0x555557c68680 <get_shader_manager()::shader_manager>) at ../../src/gui/shader_manager.cpp:531
531             if (video_mode.color_depth == ColorDepth::Composite) {
(gdb) bt
#0  ShaderManager::FindShaderAutoMachine[abi:cxx11]() const (this=0x555557c68680 <get_shader_manager()::shader_manager>)
    at ../../src/gui/shader_manager.cpp:531
#1  0x0000555555f6189d in ShaderManager::MaybeAutoSwitchShader (this=0x555557c68680 <get_shader_manager()::shader_manager>)
    at ../../src/gui/shader_manager.cpp:380
#2  0x0000555555f5f66e in ShaderManager::NotifyGlshaderSettingChanged (
    this=0x555557c68680 <get_shader_manager()::shader_manager>, shader_name="crt-auto-machine")
    at ../../src/gui/shader_manager.cpp:72
#3  0x0000555555f164d2 in handle_shader_changes () at ../../src/gui/render.cpp:898
#4  0x0000555555f16905 in RENDER_Init (sec=0x555558a270c0) at ../../src/gui/render.cpp:947
#5  0x0000555555f14550 in RENDER_Reinit () at ../../src/gui/render.cpp:338
#6  0x0000555555f5504b in set_output (sec=0x555558a18de0, wants_aspect_ratio_correction=true)
    at ../../src/gui/sdlmain.cpp:3407
#7  0x0000555555f56046 in GUI_StartUp (sec=0x555558a18de0) at ../../src/gui/sdlmain.cpp:3555
#8  0x0000555555bfcf43 in Section::ExecuteInit (this=0x555558a18de0, init_all=true) at ../../src/misc/setup.cpp:1234
#9  0x0000555555bfcd58 in Config::Init (this=0x555558a14950) at ../../src/misc/setup.cpp:1206
#10 0x0000555555f5bda2 in sdl_main (argc=1, argv=0x7fffffffe4f8) at ../../src/gui/sdlmain.cpp:4858
#11 0x0000555555bd1ce6 in main (argc=1, argv=0x7fffffffe4f8) at ../../src/main.cpp:30

As far as I can tell, the SDL section is always the first to get initialized (its init function is GUI_StartUp). This in turn calls RENDER_Init which calls the shader code. This all happens before the Dosbox section gets initialized so the shader manager doesn't know what to do about the machine type (triggering the assert).

Later on, the Render section gets initialized (calling for a 2nd time RENDER_Init and repeating this process.)

Thread 1 "dosbox" hit Breakpoint 1, ShaderManager::FindShaderAutoMachine[abi:cxx11]() const (
    this=0x555557c68680 <get_shader_manager()::shader_manager>) at ../../src/gui/shader_manager.cpp:531
531             if (video_mode.color_depth == ColorDepth::Composite) {
(gdb) bt
#0  ShaderManager::FindShaderAutoMachine[abi:cxx11]() const (this=0x555557c68680 <get_shader_manager()::shader_manager>)
    at ../../src/gui/shader_manager.cpp:531
#1  0x0000555555f6189d in ShaderManager::MaybeAutoSwitchShader (this=0x555557c68680 <get_shader_manager()::shader_manager>)
    at ../../src/gui/shader_manager.cpp:380
#2  0x0000555555f5f66e in ShaderManager::NotifyGlshaderSettingChanged (
    this=0x555557c68680 <get_shader_manager()::shader_manager>, shader_name="crt-auto-machine")
    at ../../src/gui/shader_manager.cpp:72
#3  0x0000555555f164d2 in handle_shader_changes () at ../../src/gui/render.cpp:898
#4  0x0000555555f16905 in RENDER_Init (sec=0x555558a270c0) at ../../src/gui/render.cpp:947
#5  0x0000555555bfcf43 in Section::ExecuteInit (this=0x555558a270c0, init_all=true) at ../../src/misc/setup.cpp:1234
#6  0x0000555555bfcd58 in Config::Init (this=0x555558a14950) at ../../src/misc/setup.cpp:1206
#7  0x0000555555f5bda2 in sdl_main (argc=1, argv=0x7fffffffe4f8) at ../../src/gui/sdlmain.cpp:4858
#8  0x0000555555bd1ce6 in main (argc=1, argv=0x7fffffffe4f8) at ../../src/main.cpp:30

By this point (if you're running a release build and the assert didn't trigger), the Dosbox section has been initialized and the shader manager is happy. However, the shader manager reports that there has been a shader change (from "empty string" to something valid). This makes the 2nd run of RENDER_Init take a branch to call into VGA_SetupDrawing which runs into a divide by 0 bug due to more uninitialized global variables.

auto shader_changed = handle_shader_changes();
setup_scan_and_pixel_doubling();
const auto needs_reinit =
((force_square_pixels != prev_force_square_pixels) ||
(render.scale.size != prev_scale_size) ||
(GFX_GetIntegerScalingMode() != prev_integer_scaling_mode) ||
shader_changed ||
(prev_force_vga_single_scan != force_vga_single_scan) ||
(prev_force_no_pixel_doubling != force_no_pixel_doubling));
if (running && needs_reinit) {
render_callback(GFX_CallBackReset);
VGA_SetupDrawing(0);
}

This PR kind of sidesteps that 2nd issue because now the shader manager gets it right the first time and the needs_reinit variable will be false.

Related issues

Fixes #3059

Manual testing

Checklist

Please tick the items as you have addressed them. Don't remove items; leave the ones that are not applicable unchecked.

I have:

  • followed the project's contributing guidelines and code of conduct.
  • performed a self-review of my code.
  • commented on the particularly hard-to-understand areas of my code.
  • split my work into well-defined, bisectable commits, and I named my commits well.
  • applied the appropriate labels (bug, enhancement, refactoring, documentation, etc.)
  • checked that all my commits can be built.
  • confirmed that my code does not cause performance regressions (e.g., by running the Quake benchmark).
  • added unit tests where applicable to prove the correctness of my code and to avoid future regressions.
  • made corresponding changes to the documentation or the website according to the documentation guidelines.
  • locally verified my website or documentation changes.

@weirddan455 weirddan455 added the bug Something isn't working label Oct 31, 2023
@weirddan455 weirddan455 self-assigned this Oct 31, 2023
@weirddan455
Copy link
Copy Markdown
Collaborator Author

Added an MCH_INVALID = 0 type to the machine type enum to make PVS Studio happy.


// DOSBOX_RealInit may have not been run yet.
// If not, go ahead and set the globals from the config.
if (machine == MCH_INVALID) {
Copy link
Copy Markdown
Member

@johnnovak johnnovak Nov 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a comment that this is a workaround for the inconsistent/random/platform & config dependent init order problem.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this problem is going to be consistently reproducible across all platforms. I'm going to test on Windows here in a little bit to confirm.

@johnnovak
Copy link
Copy Markdown
Member

johnnovak commented Nov 1, 2023

As @johnnovak mentioned in the issue thread, init ordering is kind of a mess although in this case I am not seeing anything platform specific. I haven't tested on Windows but I would wager a guess that at least the assertion failure happens there as well.

Yeah, maybe this specific case is not platform-specific. My original adaptive shader stuff was fine on macOS, but when raising the final PR I tested it on Windows, and I got hard crashes in release builds. So I hit at least one platform-specific init order problem.

My solution is admittedly kind of a hack. I'm just reading the machine type from the config (extracted code from DOSBOX_RealInit into a separate function) because the shader code gets run before Dosbox get initialized.

This is fine and it works, I'm just wondering if there's a less invasive / simpler solution. In my case, the fix for the problem I mentioned above was simply to return early from some functions when the render dimensions were still 0. That got me past the problem because that init function somehow got called twice, for the first time with zero render dimensions, then for the second time with the proper values... Yeah, it's a mess 🤷🏻 E.g., you could try checking if the divisor in VGA_SetupDrawing is 0 that causes the problem, then return early if that's the case, and see if that fixes the problem...

But if that's not possible, your current fix will be fine. Can you also regression test it on Windows? Then I'll test on macOS at the end.

@weirddan455
Copy link
Copy Markdown
Collaborator Author

weirddan455 commented Nov 1, 2023

This is fine and it works, I'm just wondering if there's a less invasive / simpler solution. In my case, the fix for the problem I mentioned above was simply to return early from some functions when the render dimensions were still 0. That got me past the problem because that init function somehow got called twice, for the first time with zero render dimensions, then for the second time with the proper values... Yeah, it's a mess 🤷🏻 E.g., you could try checking if the divisor in VGA_SetupDrawing is 0 that causes the problem, then return early if that's the case, and see if that fixes the problem...

That is basically what's going on. I should have included more details about the divide by 0 crash, although my original description was already getting quite long.

Thread 1 "dosbox" received signal SIGFPE, Arithmetic exception.
0x00005555560afc71 in setup_drawing () at ../../src/hardware/vga_draw.cpp:2704
2704                             final_render_height / video_mode.height);
(gdb) bt
#0  0x00005555560afc71 in setup_drawing () at ../../src/hardware/vga_draw.cpp:2704
#1  0x00005555560aff3a in VGA_SetupDrawing () at ../../src/hardware/vga_draw.cpp:2792
#2  0x0000555555f16955 in RENDER_Init (sec=0x555558a270c0) at ../../src/gui/render.cpp:961
#3  0x0000555555bfceff in Section::ExecuteInit (this=0x555558a270c0, init_all=true) at ../../src/misc/setup.cpp:1234
#4  0x0000555555bfcd14 in Config::Init (this=0x555558a14950) at ../../src/misc/setup.cpp:1206
#5  0x0000555555f5bd5e in sdl_main (argc=1, argv=0x7fffffffe4f8) at ../../src/gui/sdlmain.cpp:4858
#6  0x0000555555bd1ce6 in main (argc=1, argv=0x7fffffffe4f8) at ../../src/main.cpp:30

const auto render_per_video_mode_scale =
Fraction(final_render_width / video_mode.width,
final_render_height / video_mode.height);

This is from the 2nd time RENDER_Init gets called when the "render" section is actually initialized from the Config calling init on everything.

I dug a bit deeper and found the root cause is from this calculate_vga_timings function. It's reading from a global vga struct which is plain old data with everything initialized to 0 at program start but valid data hasn't been filled out at the time this gets called. I stepped through with a debugger to verify this.

The VGA timings get used to calculate video_mode.height and therefore, it also ends up being 0.

// This function reads various VGA registers to calculate the display timings,
// but does not modify any of them as a side-effect.
static VgaTimings calculate_vga_timings()
{
uint32_t clock = 0;
DisplayTimings horiz = {};
DisplayTimings vert = {};
if (IS_EGAVGA_ARCH) {
horiz.total = vga.crtc.horizontal_total;
horiz.display_end = vga.crtc.horizontal_display_end;
horiz.blanking_end = vga.crtc.end_horizontal_blanking & 0x1F;
horiz.blanking_start = vga.crtc.start_horizontal_blanking;
horiz.retrace_start = vga.crtc.start_horizontal_retrace;

I wasn't sure how to fix this problem, and doing so won't fix the assertion failure anyway because that happens before this divide by 0 bug does.

I think I understand the problem you were saying about indeterminate/platform specific init and destroy if that's the same thing @kcgen was trying to fix in that PR was closed for getting too hard to maintain. If I understand correctly, the problem there was global and file scoped objects with constructors and destructors getting automatically called by the C++ runtime library at program start and exit.

This I believe is a different problem because we're dealing with plain old data. It's just a problem of various parts of the code reaching out to global variables before they get properly initialized by their respective modules. Ideally we'd want to lock some of these things behind API boundaries or something so this can't happen (open to suggestions but this seems like a systematic issue with these global variables).

Can you also regression test it on Windows? Then I'll test on macOS at the end.

Yeah, I'll test on Windows here in a little bit.

@weirddan455
Copy link
Copy Markdown
Collaborator Author

Confirmed the same behavior happens in Windows. Both the assertion failure and, if the assert is removed, the divide by 0 crash.

divide-error

@kcgen
Copy link
Copy Markdown
Member

kcgen commented Nov 1, 2023

This I believe is a different problem because we're dealing with plain old data.

Yes; that's my understanding too, @weirddan455.

The setup/control class is in charge of starting up each module in-order of their registration (all those _Init() functions along with their conf key-val descriptions in dosbox.cpp).

This includes the VGA and Render modules.

However, I think chunks of sdlmain's code expects/uses the VGA or Render modules even before they're init'd.

I'm all for as much re-ordering as needed of startup pieces (combined with asserts to catch unexpected empty/null values), to ensure the ordering is robust and correct.

I don't like the idea of quiet skips or repeated calls until eventually everything is valid; the more we can get to the correct serialized startup sequence, the better!

@weirddan455
Copy link
Copy Markdown
Collaborator Author

However, I think chunks of sdlmain's code expects/uses the VGA or Render modules even before they're init'd.

Yes, that's exactly what's happening. sdlmain in GUI_StartUp is explicitly calling RENDER_ReInit before it gets initalized through the config system. RENDER_Init is also (sometimes) calling into the VGA layer before it gets initialized as well.

I'm all for as much re-ordering as needed of startup pieces (combined with asserts to catch unexpected empty/null values), to ensure the ordering is robust and correct.

I don't like the idea of quiet skips or repeated calls until eventually everything is valid; the more we can get to the correct serialized startup sequence, the better!

Agree with all of that. We've got modules calling into each other in their init code all willy-nilly and probably some circular dependencies as well. This certainty needs to be cleaned up at some point.

What are you thoughts on the 0.81 feature freeze though? I'm not sure what exactly we should be merging right now. This PR (IMO) is a low risk, quick fix of a hard crash. Ideally we'd fix the root cause by cleaning up the init code as you said but that is a larger, more risky change.

@kcgen
Copy link
Copy Markdown
Member

kcgen commented Nov 1, 2023

What are you thoughts on the 0.81 feature freeze though? I'm not sure what exactly we should be merging right now. This PR (IMO) is a low risk, quick fix of a hard crash. Ideally we'd fix the root cause by cleaning up the init code as you said but that is a larger, more risky change.

I think your changes are better than the baseline and necessary - so absolutely should go in.

I also think that every possible code path this PR touches is going to be tested almost immediately, so there's little risk of a regression slipping through the cracks.

Copy link
Copy Markdown
Member

@kcgen kcgen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice and small. Passing tests here on Linux and macOS.

@kcgen
Copy link
Copy Markdown
Member

kcgen commented Nov 1, 2023

I don't see why we would want to carry this known crash into 0.81; merging. Thanks for the fix, @weirddan455.

@kcgen kcgen merged commit e91cae4 into main Nov 1, 2023
@weirddan455
Copy link
Copy Markdown
Collaborator Author

Yeah, I figured we would need to fix the crash before release. I was more talking about how the bigger job of cleaning up the init ordering and dependencies might have to wait until after 0.81 (unless you think it's important enough to try to solve now).

@kcgen
Copy link
Copy Markdown
Member

kcgen commented Nov 1, 2023

Yeah, I figured we would need to fix the crash before release. I was more talking about how the bigger job of cleaning up the init ordering and dependencies might have to wait until after 0.81 (unless you think it's important enough to try to solve now).

Oh - yeah, I agree with you and @johnnovak to hold off on this larger re-ordering pursuit, especially if's not understood or what it will entail
(I was thinking that this PR itself was the item being considered to hold-off till post-0.81).

@weirddan455 weirddan455 deleted the wd/shader branch November 1, 2023 21:32
@johnnovak
Copy link
Copy Markdown
Member

johnnovak commented Nov 1, 2023

Yeah, I figured we would need to fix the crash before release. I was more talking about how the bigger job of cleaning up the init ordering and dependencies might have to wait until after 0.81 (unless you think it's important enough to try to solve now).

Absolutely, cleaning up the init order stuff is a huge tasks, now is not the time. One can work on that for many many months to get it right.

The global and static variables problem is well-understood and what you guys wrote above about the various issues that inject randomness into the init order is all 100% true @kcgen & @weirddan455.

I think there is another source of indeterminism that is harder to see, although I haven't systematically verified it. I think the different window managers on different OSes send UI events in different order in various circumstances. E.g., maybe on macOS you get these UI events through SDL on init:

  • Window Create, Resize, Show, Activate

Then on Windows:

  • Windows Create, Activate, Show (so different order and no initial resize)

I'm completely making the example event streams up; I'm just trying to illustrate the theorethical problem here.

I'm pretty sure SDL just translates OS-issued events into SDL events (GLFW works this way that I know for sure), but does not remove events or inserts extra "synthetic" events to make the event stream exactly the same across all supported platforms and platform versions on init and in response to various other actions (fullscreen toggling, minimize/maximize, etc.; for example, on Windows 7 and macOS you might get a resize event after a windows maximize, on Windows 10 and some Linux you might not get it... I did encounter such differences during UI dev work with GLFW at the very least).

Because different SDL events at startup trigger different call chains, this can also add a significant platform dependent "randomness" to the init call chain.

I brought this up several times, I think this is the 3rd or 4th time, but never got a reaction, maybe because the problem and how it can affect the init order is hard to see. In any case, if I'm right, this is important.

I thought "normalising" the events the app sees is the solution, i.e. remove events or insert synthetic SDL events so Staging "sees" the same events in the same order on all supported OSes. But now I'm starting to think that's too hard, a moving target with the millions of Linux window managers out there, not future proof, and ultimately brittle. The best approach would be that our init functions, at least for the graphics subsystem, assume no fixed init order of incoming SDL events. That requires a different thinking about the problem and slightly different code that involves "lazy initing" many things (or most, or maybe all).

Hope that all makes some sense at last @kcgen and @weirddan455 😅

@weirddan455
Copy link
Copy Markdown
Collaborator Author

I think the different window managers on different OSes send UI events in different order in various circumstances.

Yeah, almost certainly that is true.

I thought "normalising" the events the app sees is the solution, i.e. remove events or insert synthetic SDL events so Staging "sees" the same events in the same order on all supported OSes.

It's... complicated. SDL does do some amount of normalizing but you're never going to guarantee exactly the same events in the same order across platforms.

But now I'm starting to think that's too hard, a moving target with the millions of Linux window managers out there, not future proof, and ultimately brittle.

Differences in Linux window managers are the least of the problems with this approach. SDL also supports Android, iPhone, Web via emscripten, and even running directly from a Linux console (Dosbox Staging actually runs in this last mode). These platforms have no concept of a "window" at all but SDL has to fake it. You can't expect the same stream of events though.

The best approach would be that our init functions, at least for the graphics subsystem, assume no fixed init order of incoming SDL events. That requires a different thinking about the problem and slightly different code that involves "lazy initing" many things (or most, or maybe all).

From what I saw reading the code, our startup code does have a fixed ordering. We have a data structure in the Config section that has function pointers to each module's Init + Destroy functions. It looks like this is run before we even start polling for SDL events:

void Config::Init() const
{
for (const auto& sec : sectionlist) {
sec->ExecuteInit();
}
}

Perhaps what you're seeing though is certain modules re-initalizing themselves in response to certain events. Ex. maybe the renderer has to do some re-initalizing when there's a change in window size.

I think the crux of the problem is that there's a lot of cross-talk between the different modules and some incorrect assumptions being made as to their state. Ex. With this divide by 0 bug, the renderer is assuming based on its own state that VGA is ready to go which it turns out is an incorrect assumption.

I brought this up several times, I think this is the 3rd or 4th time, but never got a reaction, maybe because the problem and how it can affect the init order is hard to see. In any case, if I'm right, this is important.

This is the first I'm reading about this in detail. I've probably just missed the other times you've brought it up. I think it would be useful if you could raise a ticket containing "code smells" as a result of this problem so that we can go over these when it's time to clean some of this stuff up. You mentioned you had to do some platform specific workarounds for the shader code. Maybe add a reference to that code so that we can have something specific to investigate.

@johnnovak
Copy link
Copy Markdown
Member

johnnovak commented Nov 2, 2023

It's... complicated. SDL does do some amount of normalizing but you're never going to guarantee exactly the same events in the same order across platforms.

Differences in Linux window managers are the least of the problems with this approach. SDL also supports Android, iPhone, Web via emscripten, and even running directly from a Linux console (Dosbox Staging actually runs in this last mode). These platforms have no concept of a "window" at all but SDL has to fake it. You can't expect the same stream of events though.

Interesting, and I'm not that familiar with SDL as I'm with GLFW. That console use-case is particularly interesting, I never considered that special scenario but it makes perfect sense SDL has to fake some events in that case.

Perhaps what you're seeing though is certain modules re-initalizing themselves in response to certain events. Ex. maybe the renderer has to do some re-initalizing when there's a change in window size.

Yeah, most likely, and it's a confusing combination of things. I'm dead convinced there are many things during init that just work literally by fluke. In other words, no one understands why certain things work because it's a complex interaction of DOSBox-side code triggered by SDL events, then if we change something in the code or something in SDL changes, things break... then we add more band-aids to patch things up because we don't even understand what's really happening. Worse yet, we don't quite have a plan to make this more deterministic (maybe perfection is not achievable, but we could surely improve the current undefined-and-works-by-fluke-mostly mess). Barring from @kcgen's init re-work, of course—I'm talking about the event handling here.

I think the crux of the problem is that there's a lot of cross-talk between the different modules and some incorrect assumptions being made as to their state. Ex. With this divide by 0 bug, the renderer is assuming based on its own state that VGA is ready to go which it turns out is an incorrect assumption.

100%, and this cross-talk is not well understood, yeah. I had to do a lot of debug tracing and experimentation when figuring out where to wedge in my adaptive-shader stuff. You simply cannot reason about it by reading the code; it's impossible. You just gotta figure it out essentially by trial and error.

This is the first I'm reading about this in detail. I've probably just missed the other times you've brought it up. I think it would be useful if you could raise a ticket containing "code smells" as a result of this problem so that we can go over these when it's time to clean some of this stuff up. You mentioned you had to do some platform specific workarounds for the shader code. Maybe add a reference to that code so that we can have something specific to investigate.

Yeah, I always just wrote my similar comments in response to some ticket when something triggered me 😄 Raising a ticket for this is a good idea and I'll do that once I'm back in full capacity 🚀

Appreciate your input @weirddan455 ❤️

@johnnovak johnnovak changed the title Fix a crash when using glshader=crt-auto-machine Fix a crash when using glshader = crt-auto-machine Dec 11, 2023
@johnnovak johnnovak added video Graphics and video related issues shaders Issues related to shaders labels Dec 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working shaders Issues related to shaders video Graphics and video related issues

Projects

No open projects
Status: Done

Development

Successfully merging this pull request may close these issues.

Crash at startup with crt-auto-machine shader

3 participants