Why it went wrong
April 11th, 2011As many of you experienced, this morning the Kilg.us daily boxscores went horribly wrong and began sending piles of duplicate mails to everyone. If I haven’t apologized to you yet: I’m sorry. I try very hard to make sure any changes I make to Kilg.us won’t have an adverse effect on users. I bungled this one, but at least I think I’ve figured out what happened.
I mentioned previously that I moved the daily boxscore email script to the new Kilg.us server. That is why emails went out to anyone. On the new server PHP is installed differently from the old server. As a result, when I run the CRON job to generate the emails, I needed to use an application to call the script rather than just calling PHP to execute it. I chose to use a program called wget. The idea is that wget makes a call to a URL and that URL (a PHP file) generates and sends the emails. Before scheduling the morning’s emails, I tested to ensure the process worked. When I tested, though, I only used a sub-set of data. I didn’t really need 350 emails coming into my inbox, so I tested with a couple emails each for those sent to a team owner and those sent to a team viewer. That worked great.
When the CRON job ran this morning, everything seemed to go well until duplicates started showing up. A second round, then a third, then a fourth and so on. Interestingly, each wave was 15 minutes apart. As it turns out, if wget can’t complete a request (in this case a VERY long request for 350 emails), it tries again. By default it will try up to 20 times to fetch a file. Because the script takes so long to run, I believe it exceeded the server time-out. When the script timed out, wget requested it a second time, then a third, then a fourth and so on. I believe this is what caused the duplicate emails.
I’ve made three adjustments to address this. First, I increased the server time-out for this process. Second, I have changed the CRON job to tell wget never to retry the fetch if it fails. Third, I’ve broken the massive emails process into multiple chunks. This is a temporary fix until I convert to using PEAR::Mail to more intelligently manage the process. I’m done for tonight, but that will probably be the priority tomorrow.
So that’s all of today’s work. Six hours sunk, but I think the emails will work in the morning. I’ve pushed back the time that the emails go out by a couple hours to they’ll line up with when I roll out of bed in the morning. If things go off-track, I’ll be able to curtail things faster than today. If all goes well, I’ll move the schedule back to the early morning hour so everyone has the boxscore email when they get up in the morning.
Keep your fingers crossed…