Logging system design for highly error-sensitive platforms
As a developer, logging is something you can’t run away from. You might not need logs most of the time, but in some cases you are doomed without them. Imagine a failure that can cost the business a lot of money. Missing a failure like that is something you simply can’t afford. You won’t even have time to run tests and work out a solution at your own pace. You have to find the fix right after noticing the problem and patch it immediately. For that, you need a really good logging system that notifies you of high-priority failures and gives you enough insight to act fast.
Story of a cloud platform
I was working on a cloud platform that gave customers a graphical interface to control their servers and to add or remove servers. An auto-upgrade/downgrade background service increased or decreased each server’s resources based on its load and needs. There was also a billing section that handled the billing calculations under a Pay As You Go policy. The problem was that everything had to go through another API and then be synced with the local database, even the pricing for all services. So you can probably imagine that if that communication fails, a lot of people can get into a lot of trouble unless the failure is fixed right away. It gets worse because the external API we had to communicate with was not under our control, and we had no fast way to reach the developer team on the other side either.
Designing the logger system
To be able to fix failures and notice API changes as soon as possible, we decided to extend our simple logger service with two layers: a notification layer, and a layer that generates a highly detailed description of the problem. That description consists of the input data, the exact time and conditions that caused the problem, every other detail that could be gathered about it and, at the end, the final output.
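To make the detailed-description layer more concrete, here is a minimal Python sketch of what such a report could look like. It is not the code we ran in production. The names build_failure_report, format_report and sync_price, and the field names in the dictionary, are all made up for illustration.

```python
import json
import traceback
from datetime import datetime, timezone


def build_failure_report(operation, input_data, error, context=None, output=None):
    """Collect everything known about a failure into one dictionary."""
    return {
        "operation": operation,
        # Exact time of the failure, in UTC so reports from all servers line up.
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        "input": input_data,
        # Conditions around the failure (step, retry count, ...).
        "context": context or {},
        "error_type": type(error).__name__,
        "error_message": str(error),
        "traceback": "".join(
            traceback.format_exception(type(error), error, error.__traceback__)
        ),
        # The final output, if the operation produced any before failing.
        "output": output,
    }


def format_report(report):
    """Render the report as readable JSON; default=str covers odd value types."""
    return json.dumps(report, indent=2, default=str)


def sync_price(service_id, fetch_price):
    """Hypothetical sync step. Returns (price, report); report is None on success."""
    try:
        return fetch_price(service_id), None
    except Exception as error:
        report = build_failure_report(
            operation="sync_price",
            input_data={"service_id": service_id},
            error=error,
            context={"step": "fetch price from external API"},
        )
        return None, report
```

Here fetch_price stands in for whatever function talks to the external API. Passing it in as an argument keeps the sketch runnable without a real API behind it.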
In the notification layer, we added a list of the errors and exceptions that are important to us and that we need to be notified about when they happen. But most importantly, we had to be notified when errors we didn’t specify occurred. If you think about it, those matter the most: when you know a certain kind of error might happen, you try as hard as you can to avoid it, or at least you have routines that stop it from causing any damage, or much damage, to the system and the business. An error nobody anticipated has no such safety net.
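The notification rule described above can be sketched roughly like this in Python. The exception classes, the KNOWN_HARMLESS group and the urgency labels are assumptions made for the example, not our real error list. The notify argument is any callable that delivers a message (email, chat, pager).

```python
class PricingSyncError(Exception):
    """Hypothetical: prices from the external API could not be synced."""


class BillingMismatchError(Exception):
    """Hypothetical: local billing totals disagree with the external API."""


# Errors we explicitly want to hear about (illustrative list).
WATCHED_ERRORS = (PricingSyncError, BillingMismatchError)

# Assumption: errors we know about and already handle safely, so no alert.
KNOWN_HARMLESS = (ConnectionResetError,)


def classify(error):
    """Sort an error into 'watched', 'known' or 'unexpected'."""
    if isinstance(error, WATCHED_ERRORS):
        return "watched"
    if isinstance(error, KNOWN_HARMLESS):
        return "known"
    # Anything nobody listed ends up here, and it must never be silent.
    return "unexpected"


def handle_error(error, notify):
    """Notify for watched and unexpected errors; return the category."""
    category = classify(error)
    if category == "known":
        return category
    # Unexpected errors get the highest urgency because no routine guards them.
    urgency = "URGENT" if category == "unexpected" else "HIGH"
    notify(f"[{urgency}] {category} error: {type(error).__name__}: {error}")
    return category
```

The important design choice is the last line of classify: the fallback is "unexpected", so forgetting to list an error makes the system louder, not quieter.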
Performance optimizations
Generating highly detailed logs and sending notifications can use a lot of resources. That is costly, and it can slow down or even crash the main services if the logger service isn’t separated from them. Separating it carries risks of its own, though, and creates a whole new set of problems to solve. What we ended up doing was finding a balance that kept resource use low while still generating detailed logs where we needed them. There is no point in generating detailed logs for outputs that are completely anticipated, so we routed those to a simpler logger.
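One way to get that balance, sketched in Python with the standard logging module: anticipated outcomes go to a cheap one-line logger, and the expensive detailed description is only built when the outcome is not anticipated. The logger names, the ANTICIPATED_OUTCOMES set and the log_outcome function are assumptions for this example.

```python
import logging

simple_log = logging.getLogger("platform.simple")
detailed_log = logging.getLogger("platform.detailed")

# Assumption: outcomes we fully expect and do not need details for.
ANTICIPATED_OUTCOMES = {"ok", "no_change", "retry_scheduled"}


def log_outcome(operation, outcome, build_details):
    """Route a result to the simple or the detailed logger.

    build_details is a callable, so the costly work of gathering details
    only runs when the outcome is not anticipated.
    """
    if outcome in ANTICIPATED_OUTCOMES:
        simple_log.info("%s: %s", operation, outcome)
        return "simple"
    detailed_log.error("%s: %s | %s", operation, outcome, build_details())
    return "detailed"
```

Passing a callable instead of a ready-made description is what saves the resources here: for the common, anticipated case nothing heavy is computed at all.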
Conclusion
You won’t always need such a detailed logging service, because you won’t often work on a service that depends on an external API for highly sensitive operations that directly affect the billing system. Keep in mind that a cloud platform, which provides the basic infrastructure other businesses need and can affect them directly, is far more error-sensitive than most other kinds of platforms. And of course, a good logging service alone can’t solve your problems. You need routines that are triggered when an error happens and keep it from causing serious damage. A logging system is there because we can’t anticipate everything and can’t expect that no one will ever make a mistake. It is not an excuse to skimp on implementing measures for when an error happens. We all hate getting a notification in the middle of the night or at a party telling us that we or someone else screwed up, or that someone made a change that broke some of our stuff. So let’s be smart about it and avoid those notifications by doing things as well as we can in the first place.