Azure Monitoring is no replacement for Prometheus + Grafana

Patrick Cornelissen
4 min read · Mar 29, 2022


Photo by Luke Chesser on Unsplash

In a current project I am tasked with enabling monitoring and metrics for a Spring Boot application that exposes its metrics in the Prometheus format, to be consumed by Azure Monitoring. The same problem arises when you try to run many other applications that have settled on offering Prometheus metrics, which is a de facto standard. The task turned out to be more interesting than expected…

Monitoring in the Modern Age

In modern applications, especially distributed ones, it’s important to have observability so you can detect problems (ideally early), and often you also want business metrics to see whether the application helps you achieve your business goals.

Prometheus, the De Facto Standard

The format coined by Prometheus has been the rising star in the metrics field for a couple of years now. A number of other monitoring products are able to consume Prometheus metrics. Monitoring vendors often also offer their own agents to gather metrics, but in the community as a whole they are not surpassing the usage of Prometheus. (Quite the contrary.)

Prometheus is a time series database which stores all metrics from a given point in time as one large row, and you then use PromQL to query the data and extract the information you need.

Example snippet of Prometheus metrics that are fetched in one request:

...
# HELP http_server_requests_seconds_max
# TYPE http_server_requests_seconds_max gauge
http_server_requests_seconds_max{exception="ClientAbortException",method="GET",outcome="SUCCESS",status="200",uri="/**",} 0.0
http_server_requests_seconds_max{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/logs",} 0.0
http_server_requests_seconds_max{exception="None",method="POST",outcome="SUCCESS",status="202",uri="/api/events",} 0.0
http_server_requests_seconds_max{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/api/forms",} 0.0
http_server_requests_seconds_max{exception="None",method="GET",outcome="REDIRECTION",status="302",uri="REDIRECTION",} 0.0
...
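
For context, this is the kind of output a Micrometer PrometheusMeterRegistry renders; in Spring Boot it normally sits behind the /actuator/prometheus endpoint once the micrometer-registry-prometheus dependency is on the classpath. A minimal, standalone sketch (the recorded request is made up, the tag names simply mirror Spring Boot’s http.server.requests metric):

import io.micrometer.core.instrument.Timer;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;

import java.time.Duration;

public class PrometheusScrapeSketch {
    public static void main(String[] args) {
        // Registry that renders its meters in the Prometheus text exposition format
        PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

        // Record one made-up request, tagged like Spring Boot's http.server.requests metric
        Timer timer = registry.timer("http.server.requests",
                "method", "GET", "uri", "/api/forms", "status", "200");
        timer.record(Duration.ofMillis(120));

        // This text is what a Prometheus scraper (or Azure Monitoring) fetches in one request
        System.out.println(registry.scrape());
    }
}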

Azure Monitoring

If you are running your applications in Azure via Apps or in a Kubernetes cluster via AKS, you are already on the Azure train, and it’s not far-fetched to use as many services from Azure as possible to avoid having to maintain them yourself. Azure Monitoring is a comprehensive monitoring solution for logs, metrics and other interesting stuff.
Azure can consume metric endpoints in the Prometheus format from Apps and containers running in AKS and store them in its internal database.

But Azure Monitoring basically runs on some kind of “relational database” (judging by the feature set and some blog articles I have read over the past days), which works quite differently from a time series database, which is basically schema-less. As a result, each metric gathered from the metrics endpoints is stored as a separate row in the monitoring database.

Table in Azure metrics for the given metric; you can see that each value is its own row

You can configure Azure to fetch the data from the Prometheus endpoints, but you will lose a lot of information in the process due to the way the data is stored.
Another possibility to get the data out of your application is the Azure agent, which pushes the data to Azure without the intermediate step of the Prometheus endpoint. That stores the data in a real time series database, which makes queries a lot easier, but you still lose a lot of information.

The Problem

If you are running, for example, a Spring Boot application that uses Micrometer, the app exposes a number of default metrics, for example http_server_requests_seconds_count and http_server_requests_seconds_sum.

The count metric contains the number of requests the server has handled and the sum metric contains the total number of seconds spent on these requests (both are cumulative, so in practice you look at how they change from one scrape to the next).

So for example count = 10 and sum = 2.5 (seconds)

This means that the average request time in that time bucket is 0.25 seconds, because we have to calculate sum/count to get a more meaningful metric (or to be able to gain any insight here, to be honest). This is slightly simplified, because these metrics appear multiple times in the same Prometheus scrape with different tags. These tags allow you, for example, to see the request duration per endpoint and HTTP result status, but we don’t want to make it even more complicated here.
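
Both series come from a single Micrometer Timer under the hood, so the sum/count relationship is easy to see in code. A small sketch of just the metric semantics (not the Prometheus or Azure side), using the made-up numbers from above:

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.time.Duration;
import java.util.concurrent.TimeUnit;

public class AverageDurationSketch {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();
        Timer requests = registry.timer("http.server.requests", "uri", "/api/events");

        // Ten requests totalling 2.5 seconds, matching the example above
        for (int i = 0; i < 10; i++) {
            requests.record(Duration.ofMillis(250));
        }

        long count = requests.count();                             // -> ..._seconds_count
        double sumSeconds = requests.totalTime(TimeUnit.SECONDS);  // -> ..._seconds_sum

        // The insight is in the ratio, which is what the PromQL division computes
        System.out.printf("count=%d sum=%.2fs avg=%.2fs%n", count, sumSeconds, sumSeconds / count);
    }
}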

In PromQL you could simply write http_server_requests_seconds_sum / http_server_requests_seconds_count to calculate the average duration.

In Kusto (which is the query language in Azure Monitoring) you can’t simply do that, because you’d have to join each row named http_server_requests_seconds_sum with another row named http_server_requests_seconds_count before you could do the calculation. That turns a simple one-liner into a 20–30 line Kusto query that is hard to read and, as far as I have tried, not that reliable either.
Sometimes you also need to calculate rates etc.; that is easy in PromQL, but it also seems to be pretty hard in Kusto when the data is gathered from Prometheus endpoints.

Using the direct Azure metrics connection makes life a little bit better, but, for example, you can’t filter or even access the information about the incoming HTTP server requests per endpoint; you just get one accumulated metric, which makes the metric basically useless.

Conclusion

If you are thinking about replacing your self-hosted Prometheus with Azure Monitoring, think again and verify that you can create the queries you need before you sink too much time and money into that decision!
You can sometimes work around the problem, for example by replacing the Prometheus endpoint with the Azure monitoring plugin for Spring, but sometimes that is not an option (for example when you just run the application and don’t develop it yourself).
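
For the cases where you do control the code, such a switch could look roughly like the sketch below, assuming the plugin in question is Micrometer’s Azure Monitor registry (io.micrometer:micrometer-registry-azure-monitor); the instrumentation key is a placeholder and the meter name is made up:

import io.micrometer.azuremonitor.AzureMonitorConfig;
import io.micrometer.azuremonitor.AzureMonitorMeterRegistry;
import io.micrometer.core.instrument.Clock;
import io.micrometer.core.instrument.MeterRegistry;

public class AzureMonitorRegistrySketch {
    public static void main(String[] args) {
        AzureMonitorConfig config = new AzureMonitorConfig() {
            @Override
            public String instrumentationKey() {
                // Placeholder; in a real setup this comes from your Application Insights resource
                return "00000000-0000-0000-0000-000000000000";
            }

            @Override
            public String get(String key) {
                return null; // fall back to the defaults for everything else
            }
        };

        // Pushes meters straight to Azure Monitor instead of exposing a Prometheus endpoint
        MeterRegistry registry = new AzureMonitorMeterRegistry(config, Clock.SYSTEM);
        registry.counter("app.events.received").increment(); // made-up meter name
    }
}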

This is written to help others who are also trying to switch. I may be wrong, and I’d be glad if you can show me that it is in fact easy to write queries in Kusto that are as simple as they are in PromQL. So please prove me wrong. :-)


Written by Patrick Cornelissen

CEO of orchit GmbH and software developer based in Germany. Focus: microservices/cloud architectures, Java/JVM, open source, agile methods and security.
