Tales from a Developer

Today I was working on a really daunting bug. It all started with requests from the frontend (a VueJS app) not being served anymore.

Since I am the backend guy (even though I'm full stack with a focus on the backend) and the one who wrote the code, I got to fix the bugs.

The whole app is pretty small, and so is the codebase. Currently it's just me, my coworker (another full-stack dev with a focus on the frontend), and occasionally people from the Java team (who work on the large monolith [client & server] whose API we use as the web team).

After a large amount of debugging I finally got to the place that was causing the problems. It turned out I had forgotten to separate one part of the API connections on the backend, which leads to duplicated sequence numbers.

Since sequence numbers should be unique per connection, this was the first problem I figured out. A few lines of code later, that bug was fixed. Just in time for a second one to surface.
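To illustrate what I mean (with made-up names, this is not our actual code): each connection should hand out its own, strictly increasing sequence numbers. The bug was that two logically separate parts of the API shared one connection on the backend, so the same numbers showed up twice.

```ts
// Minimal sketch, hypothetical names: one sequence counter per connection.
class ApiConnection {
  private nextSeq = 0;

  send(payload: string): void {
    const seq = this.nextSeq++; // unique within *this* connection
    // ...write seq + payload to the underlying socket...
    console.log(`seq=${seq}`, payload);
  }
}

// Separating the two parts of the API onto their own connections gives each
// its own counter, so duplicates within a connection can no longer happen.
const ordersConnection = new ApiConnection();
const statusConnection = new ApiConnection();

ordersConnection.send("order request");  // seq=0 on the orders connection
statusConnection.send("status request"); // seq=0 again, but on a different connection, which is fine
```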

After a random amount of time and requests, the backend server would get stuck. Hours passed by, and I finally found the cause. This time, it is the Java server not responding to our requests.

Great. The Java guy had already left the office because his child came down with a fever, so we will continue debugging the problem tomorrow.

2020-02-18

As expected, I continued to work on the problem today. With the help of two Java guys (one of whom is called "The Brain") and some Wireshark magic, we managed to track parts of it down.

I built a few scripts to send requests like crazy, but even with them, it is hard to reproduce the problem. In Wireshark we could see that the Java app is in fact sending the responses to our requests.
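The scripts were nothing special, roughly something like this (endpoint, request count, and concurrency are made up for illustration):

```ts
// Hypothetical reproduction script: hammer the backend with requests.
const TARGET = "http://localhost:3000/api/some-endpoint"; // placeholder URL

async function hammer(totalRequests: number, concurrency: number): Promise<void> {
  let sent = 0;

  async function worker(): Promise<void> {
    while (sent < totalRequests) {
      sent++;
      try {
        await fetch(TARGET); // Node 18+ has fetch built in
      } catch (err) {
        console.error("request failed:", err);
      }
    }
  }

  // Run `concurrency` workers in parallel and wait for all of them.
  await Promise.all(Array.from({ length: concurrency }, () => worker()));
  console.log(`done, sent ${sent} requests`);
}

hammer(10_000, 50);
```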

We also found out that "our" (web-dev) side is receiving this data. But it looks like after an unknown and unpredictable amount of time something gets screwed up.

I'll have to dig a little deeper. The connection between our NodeJS backend and the Java app is a "plain old" socket. No, no fancy WebSocket, just a "simple" network socket.

I heard that in the past there were already problems with this connection. To parse the data we receive from the Java app, bytes are read, counted, and put into the correct order so that parseable data is produced.
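I don't want to paste the actual module here, but the general shape of such a byte-counting reader over a plain Node socket is roughly this (the 4-byte length header is an assumption for the sketch, not necessarily how our protocol frames its data):

```ts
import * as net from "net";

// Rough sketch of a length-prefixed reader over a plain TCP socket.
// Assumed framing for illustration: each message starts with a 4-byte
// big-endian length header.
const socket = net.connect({ host: "java-app.internal", port: 9000 });

let buffered = Buffer.alloc(0);

socket.on("data", (chunk: Buffer) => {
  // TCP hands us arbitrary slices of the stream, so we accumulate bytes
  // and only emit a message once we have counted enough of them.
  buffered = Buffer.concat([buffered, chunk]);

  while (buffered.length >= 4) {
    const messageLength = buffered.readUInt32BE(0);
    if (buffered.length < 4 + messageLength) break; // wait for more bytes

    const message = buffered.subarray(4, 4 + messageLength);
    buffered = buffered.subarray(4 + messageLength);
    handleMessage(message);
  }
});

function handleMessage(message: Buffer): void {
  // If the byte counting above is off by even one byte, every subsequent
  // "message" starts at the wrong offset and the stream is garbage until
  // the connection is torn down and rebuilt.
  console.log("parsed message:", message.toString("utf8"));
}
```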

There was already a flaw in that byte-counting-and-sorting thingy back in the day. It seems like there is again (or still) something fishy.

Since the same connection library is used for another product, this problem could have a severe impact if it occurred regularly.

The problem is that as soon as you miscount one packet of bytes, every packet afterwards is also broken, which renders the app useless until a restart.

So, tomorrow I will be building more scripts to reproduce the problem in the other app as well. If it is reproducible there, we have a big problem, at least once we want to scale to a few hundred users.

If the problem really exists in both apps, we will probably need to reimplement the connection module. Great.

2020-02-20

Finally. Yesterday I found the problem. Luckily it only affects one product. As "The Brain" suggested, we were miscounting bytes.

Since the buggy code sat only inside the else block of an if, and that if would normally resolve as true, the code was only triggered under special circumstances.

I found the problematic code only by chance, when I compared the new implementation with the old one and an even older one. A single character made the difference, so it was really hard to spot.

What happened was basically this: in the else branch of the if, I was setting a value instead of adding to the current one.

Instead of +=, I wrote =. As I said, a single missing character. One character that could break the app.
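A toy version of that kind of mistake, again with made-up names (see the reader sketch above): the running total of collected bytes has to be added to, not overwritten.

```ts
let bytesCollected = 0;

function onChunk(chunk: Buffer): void {
  if (bytesCollected === 0) {
    // Normal case: a fresh message, start counting.
    bytesCollected = chunk.length;
  } else {
    // Rare case: the message spans more than one chunk.
    // Buggy version was: bytesCollected = chunk.length;
    // which overwrites the running total, so the parser mis-judges how many
    // bytes it still needs and every message after that is misaligned.
    bytesCollected += chunk.length; // fixed: add to the running total
  }
}
```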

After I found that and verified it worked once fixed, I tested the bug on the older app. Almost the same result: if I remove the plus from the code, the old app breaks immediately. Not after some amount of time or some number of requests, but immediately.

In programming, even the smallest unit can cause problems. Still, NASA managed to send people to the moon and back with the help of software.