Timing Problem
Yesterday I was working on an interesting timing problem. One of the applications I develop at work stopped working at a customer yesterday morning.
Nobody told us until after lunch, so the whole evening was filled with debugging. After more than two hours we found out the app (service) lost its user-connection at the main application.
Sidenote: Every Application that uses the Main Application via API needs a “login”.
Interestingly the kick happened 158 milliseconds after the login. Shorter than a blink of the eye. But it gets stranger.
Every service that has a connection to the main app needs to send a “keep-alive” at least every 29 seconds (30 is the timeout). When the service would log in, the first keep-alive was sent by the main server itself, so it was surely done.
And that is where the problem was. When the service logged in, its last keep-alive timestamp was set to -1. Then the login was saved to the database. After that, the first keep-alive was sent by the server and so the keep-alive timestamp was updated.
The saving to the database must have taken longer and so the worker, that kills of connections that are not sending a keep-alive was running.
The worker checked the last keep-alive of the newly logged-in service: -1. And proceeded to kick it, because -1 is way smaller than the current timestamp. The logged-in service didn’t even notice that it was kicked because it was still in the login process and didn’t even receive a response for that.
The simple solution is to set the initial keep-alive value to the current timestamp. After a few minutes of discussion, we think that the problem even occurred is a chance of one in a million or more. It probably will never ever happen ever again, but because it could potentially, we fixed it anyway. :)