Good evening, ladies and gentlemen! Your friendly neighbourhood Kazinsal here.
I have been working away at this issue thanks to the wonders of coffee and No-Doz, and I believe I have solved the problem. I can either get into deep technical detail or I can summarize the problem and why it looked incredibly similar to memory/controller failure, and while I'm sure it would make me look like a poster at /r/iamverysmart if I did the former, I prefer my reputation to be not completely tarnished.
The issue seemed to be a collision between the different versions of the Visual C++ runtime used by FLHook and FLHook's plugins, with the internal memory structures of various runtime components having changed between versions. We had a few different versions of the runtime in use in the FLHook plugins, including 2012, 2015, and 2017. The memory shared between these plugins is pretty open, so each plugin can technically read and write to every other plugin's data. This is great as long as each plugin is certain as to the layout of the memory used by each other plugin, and they usually are! The problem shows up when something internal to the runtime changes, because the internals of the runtime are undocumented and are subject to change at any time. When you're mixing runtimes like this, you generally don't have issues, but since we do a lot of manual fiddling with memory in FLHook (kind of a requirement for working around software as old as FLServer) we have a chance to run into problems like this.
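As a rough illustration of the general failure mode (not the actual FLHook code, and not the real runtime internals, which are undocumented), here is a minimal C++ sketch of two modules disagreeing about the layout of the same bytes. The struct names, the `debug_tag` field, and every function here are hypothetical, made up purely for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical layout of a shared object as an older build understands it.
struct HookListOld {
    std::uint32_t count;
    std::uint32_t capacity;
};

// Hypothetical layout after a newer build adds a field at the front.
struct HookListNew {
    std::uint32_t debug_tag;
    std::uint32_t count;
    std::uint32_t capacity;
};

// A module built against the new layout writes into shared memory.
void write_as_new(unsigned char* shared, std::uint32_t count) {
    HookListNew list{0, count, 16};
    std::memcpy(shared, &list, sizeof list);
}

// The same module reads its own data back correctly.
std::uint32_t read_count_as_new(const unsigned char* shared) {
    HookListNew list;
    std::memcpy(&list, shared, sizeof list);
    return list.count;
}

// A module built against the old layout reads the same bytes,
// but what it calls `count` is really the writer's `debug_tag`.
std::uint32_t read_count_as_old(const unsigned char* shared) {
    HookListOld list;
    std::memcpy(&list, shared, sizeof list);
    return list.count;
}

std::uint32_t demo_mismatched_count() {
    unsigned char shared[sizeof(HookListNew)] = {};
    write_as_new(shared, 5);
    assert(read_count_as_new(shared) == 5);  // the writer's view is consistent
    return read_count_as_old(shared);        // the old-layout view is not
}
```

In this toy case the old-layout reader sees a count of 0 even though the writer stored 5, because the bytes it treats as `count` belong to a different field. Nothing is physically wrong with the memory; the two sides simply disagree about what lives where.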
Why didn't we run into this sooner? Well, we don't know what changed between Visual C++ 2012/2015 and Visual C++ 2017, which we've started using to modernize the internals of FLHook and make it a bit more optimized on modern hardware. Visual C++ 2012's compiler doesn't know how to optimize code paths for anything newer than Sandy Bridge, for example, and our server's CPU is three generations newer than that. A lot changes in four or five years of compiler and CPU development!
What led me to believe it was hardware failure is that the result of this looked almost identical to DRAM and DRAM controller failure: bits being written to wrong locations, data disappearing from data structures and class structures, and my personal (least) favourite, code paths not executing because the hooks have disappeared from the master hook list. I have tested my resolution for this issue and will be bringing the server back up shortly once I do an integrity check on the player and POB databases. Once I get the go-ahead from the admins we'll re-enable POB construction and siege mechanics.
Thank you for your patience during this outage. We're glad it wasn't as bad as we expected, and we do not anticipate any player data loss. See you in space!
It was almost indistinguishable from hardware failure and manifested in very specific circumstances that only existed on the live server, but yes, it was not actually hardware failure. Just incredibly similar to it.
I want to thank all those who worked so frantically to get this fixed. I wasn't sure we would be able to get back so soon, or at all, after the initial reports, so awesome job, people.