Posted by davis on 9/9/2010 11:59:00 AM (view original):
I would assume that it is difficult to figure out, because that is true of most software/hardware problems that "randomly" pop up. If it happened every night, it'd be easy to figure out. The random ones are always the hardest to pin down, in my experience.
I understand the frustration, and in a way it is good - it shows how emotionally invested people are in their teams. I just think that these problems are harder to solve than is generally acknowledged, and that people can be pretty unreasonable in terms of what they think they are entitled to in exchange for $10 or $12.
Nothing is "random" when you're talking about programming. If you write the code, it'll execute the code... there isn't a lot of gray area. Especially when you're talking about an error that causes the entire process to either hang or terminate altogether - there should be plenty of error handling in place to either stop that from happening (by skipping over the problematic game(s) and flagging them to be simmed later) or to send a detailed log of what in the world happened to bring the entire process to a halt.
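To make that concrete, here's a rough sketch of the skip-and-flag idea in Python. I have no idea what their sim code actually looks like, so every name here (`run_nightly_sims`, `sim_game`, the `nightly_sim` logger) is made up for illustration. The point is just that one bad game gets logged and set aside instead of taking down the whole night's run.

```python
import logging
import traceback

# Hypothetical logger name; a real setup would attach a file or email handler.
log = logging.getLogger("nightly_sim")


def run_nightly_sims(games, sim_game):
    """Sim every game; skip and flag any game that raises instead of halting.

    `games` is any iterable of game identifiers and `sim_game` is a callable
    that sims one game. Both are assumptions, not the site's real interface.
    """
    completed = []
    flagged = []
    for game in games:
        try:
            sim_game(game)
            completed.append(game)
        except Exception:
            # Log the full traceback so there is a detailed record of the failure.
            log.error("sim failed for game %r\n%s", game, traceback.format_exc())
            # Set the game aside to be simmed later rather than stopping the run.
            flagged.append(game)
    return completed, flagged
```

Anything that ends up in `flagged` can be re-run by hand or on the next cycle. Note that this only catches errors that raise; a true hang would also need some kind of timeout around each game.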
Now if they are just overworking their servers or something along those lines, then they can just check their error/performance logs and come up with a plan for reducing the stress.
Regardless, it really isn't that difficult to find and correct an issue that has been happening "for a while", especially one that happens two nights in a row... back-to-back failures should give you all of the info you need to figure out the cause, and then you can get to work on the solution immediately.