How Prezi replaced a homegrown Log Management System with Grafana Loki


Alex

Prezi Engineering

Prezi has quite a sophisticated engineering culture in which solutions are built that get the job done. Some of the solutions built in the past have stood out and aged well. In other areas, solutions have lost traction compared to industry standards.

In the second half of 2023, we modernized one of those areas that was no longer up to market standards: Prezi's log management system. This is our testimonial.

Photo by Álvaro Serrano on Unsplash

We traced the beginnings of the existing solution back to 2014, so it is safe to say that it was a stable solution.

The following depicts that solution. Every workload Prezi ran was instrumented with a special sidecar that took care of handling all log messages. That sidecar was built on top of two open-source components. The first, Scribe (https://github.com/facebookarchive/scribe), a tool built by Facebook and archived on GitHub in 2022, took care of receiving log events, aggregating them, and sending them downstream.

The second component, stunnel (https://www.stunnel.org/), took care of encrypting the communication from the workload systems to the central system.

Prezi collected log events from all environments in one central place and made them accessible to engineers.

legacy log management system

Yes, the picture is telling the truth: for a good part of 2023, collected log events were consumed over SSH and not through any UI.

That alone was reason enough to reimplement the whole solution and develop it with current market best practices in mind. Our goal was to make the user experience more accessible and query results easier to share.

Log shipping

With that in mind, we started the project’s first iteration. Our first goal was to provide a central system that could aggregate and display log events in a more user-friendly way. We also wanted to get rid of the sidecar to ease the operational load: while a sidecar per se is not a bad thing and is a battle-proven design pattern, it comes with certain costs when running thousands of pods.

The sidecar solution was born at a time when Prezi’s workloads ran on Elastic Beanstalk, where it was just an additional container on an often oversized EC2 instance.

With the shift to Kubernetes as the workload engine, the oversized EC2 instances vanished, but the sidecar remained. Kubernetes also offers a very standardized way to consume logs from containers: the container runtime writes each container’s stdout and stderr to files on the Kubernetes worker nodes, and those files can easily be consumed.

We did exactly that and used one of the established tools in this domain, Filebeat, which is capable of reading the mentioned files and enriching the resulting events with Kubernetes metadata such as pod name, container name, and namespace.

Details on the log shipping process
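To make the pattern concrete, here is a minimal sketch of what such a Filebeat configuration can look like. The paths and the NODE_NAME environment variable reflect the usual defaults for a Filebeat DaemonSet; this is an illustration, not a copy of our production setup.

```yaml
# filebeat.yml (sketch): read container log files written by the runtime
# and enrich each event with Kubernetes metadata (pod, container, namespace).
filebeat.inputs:
  - type: container
    paths:
      - /var/log/containers/*.log

processors:
  - add_kubernetes_metadata:
      host: ${NODE_NAME}
      matchers:
        - logs_path:
            logs_path: "/var/log/containers/"
```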

This was the first optimization. The second concerned how events would be sent to downstream systems.

Operating in cloud environments requires shipping events off the nodes quickly, as those nodes can vanish at any time.

A common design pattern for this is to use a message queue as the first persistence layer. This can protect downstream systems in case of event bursts. It also decouples the individual parts from each other, which can be helpful for maintenance or even the replacement of tools.

Most of the time, the message queue used for this is an Apache Kafka installation, which is capable of storing events at scale. As we already used a Kafka setup to store business events from multiple sources, we went that route without digging further into alternative persistence layers.

Sending events to a message queue
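Continuing the hypothetical filebeat.yml sketched above, the shipping side is a Kafka output; the broker addresses and the topic name below are placeholders.

```yaml
# filebeat.yml (sketch, continued): ship events to Kafka.
output.kafka:
  hosts: ["kafka-0.example.internal:9092", "kafka-1.example.internal:9092"]
  topic: "container-logs"
  partition.round_robin:
    reachable_only: true
  required_acks: 1
  compression: gzip
```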

Once the events are in the queue, they can be parsed and ingested into a central system.

In our first take on this, we planned to set up the central log management system inside our own cloud environment. When doing that, there are two major options: build something around Elasticsearch or use Grafana Loki as the backend.

We started with the AWS OpenSearch Service as the backend and Logstash to feed events from Kafka into our OpenSearch cluster.

As we run most of our software on Kubernetes, we also set up Logstash on Kubernetes and soon discovered all the joys of running a JVM inside containers: we suffered a lot of out-of-memory kills of that component.
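A common mitigation for this class of problem is to pin the JVM heap explicitly and keep it well below the container’s memory limit, roughly like the hypothetical Kubernetes snippet below. The values are examples only, and we ultimately moved away from Logstash instead.

```yaml
# Hypothetical Kubernetes container spec for Logstash: size the JVM heap
# explicitly so it stays comfortably below the container memory limit.
containers:
  - name: logstash
    image: docker.elastic.co/logstash/logstash:8.12.0
    env:
      - name: LS_JAVA_OPTS
        value: "-Xms2g -Xmx2g"   # fixed heap, well under the 3Gi limit
    resources:
      requests:
        memory: "3Gi"
      limits:
        memory: "3Gi"
```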

Storing and indexing a massive amount of data in OpenSearch led to massive indexes that soon were no longer manageable. This was caused by the vast number of non-standardized fields in the application logs. A lot of heterogeneity in the fields and their contents leads to a lot of parsing errors. The most prominent example is the time and date format: some applications logged Unix timestamps, whereas others used a string representation.
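To illustrate with two made-up events, a strict index mapping cannot reconcile a field that arrives as a number from one service and as a string from another:

```yaml
# Two hypothetical events that collide on the same field name.
service_a:
  timestamp: 1695821112              # Unix epoch seconds (numeric)
service_b:
  timestamp: "2023-09-27 14:05:12"   # human-readable string
```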

We discovered that if we don’t control the sources, a solution based on OpenSearch would not serve us well. Controlling the sources by evangelizing a common log scheme throughout all applications would have been the only way to make this work.

We started to look into an alternative to Logstash to get rid of the memory issues and began replacing it with Vector (vector.dev), which has a smaller footprint, a more flexible configuration, and can send events to more backends. Logstash, without any modification, is tied closely to the OpenSearch ecosystem. But as hinted above, there is another major option for storing log events: Grafana Loki.

With the replacement of Logstash, we got rid of the constant restarts but not of the constant indexing errors.

Soon we started to look into Loki as an alternative. We also considered a hosted option, as running and maintaining a log management system is not one of our core tasks. Running such a system is more or less a commodity and takes away precious time that could be spent otherwise.

Focusing on our core tasks as the SRE team is also beneficial for customers of Prezi.

Looking at log management systems is in most cases also a make-or-buy (host) decision: do we want to self-host the whole aggregation system, or can it be offloaded to a third-party vendor?

Security and compliance concerns aside, this mostly boils down to the question of “How much can we spend?”.

With the security clearance to send logs to a third-party vendor and the budget to do so, we started to look at the hosted version of Loki. It turned out to be within our cost range, and it serves us well: they had no issues with our ingestion rate. The way Loki stores log events as streams was a perfect fit, because it moved the problem OpenSearch had with the variety of field contents away from indexing time. With Loki, those differences surface at query time and can be tackled by predefined dashboards. This way, we don’t lose any events to parsing errors.

Events stored in Loki are consumed through a very common user interface: Grafana, a well-known dashboarding solution that was already in use at Prezi. With that, engineers can rely on existing tool knowledge.

Offloading logs to an external vendor also removes them from your direct control. To avoid any issues with retention periods, we additionally started to write logs to S3 as an archive. That way, we keep control over them and can use them in case we need them later.

Details on parsing and storing
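Put together, the parsing-and-storing stage can be sketched as a Vector configuration like the one below: a Kafka source, a small remap step, and two sinks (hosted Loki plus the S3 archive). The endpoints, topics, bucket, label names, and field names are placeholders and assumptions for the example, not our actual configuration.

```yaml
# vector.yaml (sketch): Kafka in, Loki and S3 out.
sources:
  kafka_logs:
    type: kafka
    bootstrap_servers: "kafka-0.example.internal:9092"
    group_id: "vector-log-consumers"
    topics: ["container-logs"]

transforms:
  parse_json:
    type: remap
    inputs: ["kafka_logs"]
    source: |
      # Try to parse the raw message as JSON and merge its fields into
      # the event; if parsing fails, keep the event unchanged.
      parsed, err = parse_json(string!(.message))
      if err == null {
        . = merge(., object!(parsed))
      }

sinks:
  loki:
    type: loki
    inputs: ["parse_json"]
    endpoint: "https://loki.example.com"   # hosted Loki endpoint (auth omitted)
    labels:
      # Label names and source fields are assumptions for the example.
      namespace: "{{ kubernetes.namespace }}"
      container: "{{ kubernetes.container.name }}"
    encoding:
      codec: json

  s3_archive:
    type: aws_s3
    inputs: ["parse_json"]
    bucket: "example-log-archive"
    region: "eu-west-1"
    key_prefix: "logs/%Y/%m/%d/"
    compression: gzip
    encoding:
      codec: json
```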

With that last piece in place, we were able to shut down the above-mentioned original log management system at the end of 2023.

Looking at the completely new log management system, we went from a very homegrown solution to a modern stack:

  • We consume logs via the container runtime’s standard logging mechanism (stdout and stderr written to files).
  • Sending the events to Kafka lets us consume them decoupled from their creation time. Kafka also retains events for a certain period, so downtime of downstream systems does not cause data loss.
  • Vector enables us to feed events into multiple sinks. Although not outlined above, it also lets us make certain parsing and routing decisions while processing events. But that is part of another story.
  • Loki enables us to consume event streams via the well-known Grafana UI and query a vast amount of data in real time.
The whole process

The whole project took us the better part of a year until we shut down the old solution. We took this amount of time to verify that everything was set up well, that all engineers were onboarded and familiar with the solution, and that it could handle all the different peak situations.

  • Keeping the old system running was a good decision. It allowed us to optimize the new system until it could handle the load and satisfy our needs.
  • Advertising a common logging scheme throughout the company is beneficial. Such a scheme makes collecting and analyzing events simpler and improves the user experience, too, because, for example, a timestamp is always in the same format (a sketch of such a scheme follows this list).
  • Controlling log levels and building a shared understanding of the various levels is also crucial. What one engineer treats as debug, another emits as info; creating a common understanding helps.
  • Decoupling the different components from one another lets us swap them out if requirements change or we find better solutions. For example, if we ever become unhappy with Vector, we can replace it without much hassle, as the interface between the log sources and Vector is Kafka.
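Such a shared scheme does not have to be complicated. As an illustration, a hypothetical minimal field set could look like the sketch below; the field names and formats are examples, not Prezi’s actual scheme.

```yaml
# Hypothetical common logging scheme (example only).
timestamp: "2023-11-02T14:31:07.123Z"   # always RFC 3339 / ISO 8601, UTC
level: "info"                           # one of: debug, info, warn, error
service: "example-service"
message: "presentation rendered"
trace_id: "3f2a9c0d1b7e4a58"            # optional correlation identifier
```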


