Conversation
c74251a to
3344c52
Compare
|
Added an |
3344c52 to
e74553d
Compare
|
|
||
| // DOSBOX_RealInit may have not been run yet. | ||
| // If not, go ahead and set the globals from the config. | ||
| if (machine == MCH_INVALID) { |
There was a problem hiding this comment.
Maybe add a comment that this is a workaround for the inconsistent/random/platform & config dependent init order problem.
There was a problem hiding this comment.
I believe this problem is going to be consistently reproducible across all platforms. I'm going to test on Windows here in a little bit to confirm.
Yeah, maybe this specific case is not platform-specific. My original adaptive shader stuff was fine on macOS, but when raising the final PR I tested it on Windows, and I got hard crashes in release builds. So I hit at least one platform-specific init order problem.
This is fine and it works, I'm just wondering if there's a less invasive / simpler solution. In my case, the fix for the problem I mentioned above was simply to return early from some functions when the render dimensions were still 0. That got me past the problem because that init function somehow got called twice, for the first time with zero render dimensions, then for the second time with the proper values... Yeah, it's a mess 🤷🏻 E.g., you could try checking if the divisor in But if that's not possible, your current fix will be fine. Can you also regression test it on Windows? Then I'll test on macOS at the end. |
That is basically what's going on. I should have included more details about the divide by 0 crash, although my original description was already getting quite long. dosbox-staging/src/hardware/vga_draw.cpp Lines 2702 to 2704 in 55a6f98 This is from the 2nd time I dug a bit deeper and found the root cause is from this The VGA timings get used to calculate dosbox-staging/src/hardware/vga_draw.cpp Lines 1519 to 1532 in 55a6f98 I wasn't sure how to fix this problem, and doing so won't fix the assertion failure anyway because that happens before this divide by 0 bug does. I think I understand the problem you were saying about indeterminate/platform specific init and destroy if that's the same thing @kcgen was trying to fix in that PR was closed for getting too hard to maintain. If I understand correctly, the problem there was global and file scoped objects with constructors and destructors getting automatically called by the C++ runtime library at program start and exit. This I believe is a different problem because we're dealing with plain old data. It's just a problem of various parts of the code reaching out to global variables before they get properly initialized by their respective modules. Ideally we'd want to lock some of these things behind API boundaries or something so this can't happen (open to suggestions but this seems like a systematic issue with these global variables).
Yeah, I'll test on Windows here in a little bit. |
Yes; that's my understanding too, @weirddan455. The setup/control class is in charge of starting up each module in-order of their registration (all those _Init() functions along with their conf key-val descriptions in This includes the VGA and Render modules. However, I think chunks of I'm all for as much re-ordering as needed of startup pieces (combined with asserts to catch unexpected empty/null values), to ensure the ordering is robust and correct. I don't like the idea of quiet skips or repeated calls until eventually everything is valid; the more we can get to the correct serialized startup sequence, the better! |
Yes, that's exactly what's happening. sdlmain in
Agree with all of that. We've got modules calling into each other in their init code all willy-nilly and probably some circular dependencies as well. This certainty needs to be cleaned up at some point. What are you thoughts on the 0.81 feature freeze though? I'm not sure what exactly we should be merging right now. This PR (IMO) is a low risk, quick fix of a hard crash. Ideally we'd fix the root cause by cleaning up the init code as you said but that is a larger, more risky change. |
I think your changes are better than the baseline and necessary - so absolutely should go in. I also think that every possible code path this PR touches is going to be tested almost immediately, so there's little risk of a regression slipping through the cracks. |
kcgen
left a comment
There was a problem hiding this comment.
Nice and small. Passing tests here on Linux and macOS.
|
I don't see why we would want to carry this known crash into 0.81; merging. Thanks for the fix, @weirddan455. |
|
Yeah, I figured we would need to fix the crash before release. I was more talking about how the bigger job of cleaning up the init ordering and dependencies might have to wait until after 0.81 (unless you think it's important enough to try to solve now). |
Oh - yeah, I agree with you and @johnnovak to hold off on this larger re-ordering pursuit, especially if's not understood or what it will entail |
Absolutely, cleaning up the init order stuff is a huge tasks, now is not the time. One can work on that for many many months to get it right. The global and static variables problem is well-understood and what you guys wrote above about the various issues that inject randomness into the init order is all 100% true @kcgen & @weirddan455. I think there is another source of indeterminism that is harder to see, although I haven't systematically verified it. I think the different window managers on different OSes send UI events in different order in various circumstances. E.g., maybe on macOS you get these UI events through SDL on init:
Then on Windows:
I'm completely making the example event streams up; I'm just trying to illustrate the theorethical problem here. I'm pretty sure SDL just translates OS-issued events into SDL events (GLFW works this way that I know for sure), but does not remove events or inserts extra "synthetic" events to make the event stream exactly the same across all supported platforms and platform versions on init and in response to various other actions (fullscreen toggling, minimize/maximize, etc.; for example, on Windows 7 and macOS you might get a resize event after a windows maximize, on Windows 10 and some Linux you might not get it... I did encounter such differences during UI dev work with GLFW at the very least). Because different SDL events at startup trigger different call chains, this can also add a significant platform dependent "randomness" to the init call chain. I brought this up several times, I think this is the 3rd or 4th time, but never got a reaction, maybe because the problem and how it can affect the init order is hard to see. In any case, if I'm right, this is important. I thought "normalising" the events the app sees is the solution, i.e. remove events or insert synthetic SDL events so Staging "sees" the same events in the same order on all supported OSes. But now I'm starting to think that's too hard, a moving target with the millions of Linux window managers out there, not future proof, and ultimately brittle. The best approach would be that our init functions, at least for the graphics subsystem, assume no fixed init order of incoming SDL events. That requires a different thinking about the problem and slightly different code that involves "lazy initing" many things (or most, or maybe all). Hope that all makes some sense at last @kcgen and @weirddan455 😅 |
Yeah, almost certainly that is true.
It's... complicated. SDL does do some amount of normalizing but you're never going to guarantee exactly the same events in the same order across platforms.
Differences in Linux window managers are the least of the problems with this approach. SDL also supports Android, iPhone, Web via emscripten, and even running directly from a Linux console (Dosbox Staging actually runs in this last mode). These platforms have no concept of a "window" at all but SDL has to fake it. You can't expect the same stream of events though.
From what I saw reading the code, our startup code does have a fixed ordering. We have a data structure in the Config section that has function pointers to each module's Init + Destroy functions. It looks like this is run before we even start polling for SDL events: dosbox-staging/src/misc/setup.cpp Lines 1203 to 1208 in e91cae4 Perhaps what you're seeing though is certain modules re-initalizing themselves in response to certain events. Ex. maybe the renderer has to do some re-initalizing when there's a change in window size. I think the crux of the problem is that there's a lot of cross-talk between the different modules and some incorrect assumptions being made as to their state. Ex. With this divide by 0 bug, the renderer is assuming based on its own state that VGA is ready to go which it turns out is an incorrect assumption.
This is the first I'm reading about this in detail. I've probably just missed the other times you've brought it up. I think it would be useful if you could raise a ticket containing "code smells" as a result of this problem so that we can go over these when it's time to clean some of this stuff up. You mentioned you had to do some platform specific workarounds for the shader code. Maybe add a reference to that code so that we can have something specific to investigate. |
Interesting, and I'm not that familiar with SDL as I'm with GLFW. That console use-case is particularly interesting, I never considered that special scenario but it makes perfect sense SDL has to fake some events in that case.
Yeah, most likely, and it's a confusing combination of things. I'm dead convinced there are many things during init that just work literally by fluke. In other words, no one understands why certain things work because it's a complex interaction of DOSBox-side code triggered by SDL events, then if we change something in the code or something in SDL changes, things break... then we add more band-aids to patch things up because we don't even understand what's really happening. Worse yet, we don't quite have a plan to make this more deterministic (maybe perfection is not achievable, but we could surely improve the current undefined-and-works-by-fluke-mostly mess). Barring from @kcgen's init re-work, of course—I'm talking about the event handling here.
100%, and this cross-talk is not well understood, yeah. I had to do a lot of debug tracing and experimentation when figuring out where to wedge in my adaptive-shader stuff. You simply cannot reason about it by reading the code; it's impossible. You just gotta figure it out essentially by trial and error.
Yeah, I always just wrote my similar comments in response to some ticket when something triggered me 😄 Raising a ticket for this is a good idea and I'll do that once I'm back in full capacity 🚀 Appreciate your input @weirddan455 ❤️ |
glshader = crt-auto-machine
Description
This was a weird one. I got an assert failure in the shader code because the machine global was invalid. Removing the assert gave the same divide by 0 crash that @FeralChild64 got so I assume he was just running a Release build.
My solution is admittedly kind of a hack. I'm just reading the machine type from the config (extracted code from DOSBOX_RealInit into a separate function) because the shader code gets run before Dosbox get initialized.
As @johnnovak mentioned in the issue thread, init ordering is kind of a mess although in this case I am not seeing anything platform specific. I haven't tested on Windows but I would wager a guess that at least the assertion failure happens there as well.
I had to bust out gdb to figure out what was really going on here. If you're interested in the details, read on. The first time the shader code gets run, it's from this stack trace:
As far as I can tell, the SDL section is always the first to get initialized (its init function is
GUI_StartUp). This in turn callsRENDER_Initwhich calls the shader code. This all happens before the Dosbox section gets initialized so the shader manager doesn't know what to do about the machine type (triggering the assert).Later on, the Render section gets initialized (calling for a 2nd time
RENDER_Initand repeating this process.)By this point (if you're running a release build and the assert didn't trigger), the Dosbox section has been initialized and the shader manager is happy. However, the shader manager reports that there has been a shader change (from "empty string" to something valid). This makes the 2nd run of
RENDER_Inittake a branch to call intoVGA_SetupDrawingwhich runs into a divide by 0 bug due to more uninitialized global variables.dosbox-staging/src/gui/render.cpp
Lines 947 to 962 in 55a6f98
This PR kind of sidesteps that 2nd issue because now the shader manager gets it right the first time and the
needs_reinitvariable will be false.Related issues
Fixes #3059
Manual testing
crt-auto-machineshader #3059Checklist
Please tick the items as you have addressed them. Don't remove items; leave the ones that are not applicable unchecked.
I have: