Infrastructure Monitoring: The Basics (New Series!)

In This New Series, Learn from My Insights and On the Job Experience When it Comes to Monitoring and Logging Your Environments

Ah, yes. Metrics and log collection: the tent pole of all modern systems and DevOps teams. Without metrics and logs, we all feel, well, exposed (I do at least). In this article, I will talk about the methods and overall mentality you should have when speaking to others about metrics, designing monitors and their alerting rules, and finally, which tooling would best suit you and your teams. Remember: no team in your company should be hiding metrics away from others. Instead, each team should build customized dashboards or monitoring screens, so that developers and engineers on any other team can monitor your stack's performance when they load test a new feature that calls your K8s cluster pods for real-time data.

Without a single, company-wide mandated platform for monitoring, log aggregation, and end-user monitoring or EUM (not to be confused with RUM, or real user monitoring), not only will the teams inside your company suffer, but your users will too.

Pick Your Monitoring Data Format

Time Field

There are a few ways to collect and monitor your infrastructure metrics. The older method is to use a database like MySQL with a time field. The time field can be collected at any frequency: per millisecond, second, minute, hour, and so on. However, when you try to graph time field data at a resolution it was not collected in (for example, showing the last day of a metric collected once per hour), you will get a graph showing an average or another derived value that likely does not show you the truth hidden behind that metric.

For the very technical out there, the best way to explain time series vs. time field datasets is that time series datasets track changes to the overall metric(s) as INSERTs, not UPDATEs, resulting in an append-only ingestion pattern. Think of writing code that either clobbers or appends to its log files. Same type of idea here: you can either keep ALL values as the log gets written, or you can wipe the log file and write only the last entry.
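The INSERT vs. UPDATE distinction can be sketched in a few lines of Python. This is only an illustration: the class names `TimeFieldRow` and `TimeSeries` and the readings are made up for this example and do not come from any particular database.

```python
from dataclasses import dataclass, field


@dataclass
class TimeFieldRow:
    """Hypothetical 'clobber' store: each new reading overwrites the last (UPDATE)."""
    value: float = 0.0
    updated_at: int = 0  # Unix seconds of the most recent write

    def write(self, ts: int, value: float) -> None:
        self.value = value
        self.updated_at = ts


@dataclass
class TimeSeries:
    """Hypothetical 'append' store: every reading is kept (INSERT)."""
    samples: list[tuple[int, float]] = field(default_factory=list)

    def write(self, ts: int, value: float) -> None:
        self.samples.append((ts, value))


# Made-up (timestamp, value) readings with a short spike in the middle.
readings = [(0, 12.0), (1, 95.0), (2, 13.0)]

row = TimeFieldRow()
series = TimeSeries()
for ts, value in readings:
    row.write(ts, value)
    series.write(ts, value)

print(row)                  # only the last reading survives
print(len(series.samples))  # all readings are still there
```

After the loop, the overwritten row has no memory of the 95.0 reading, while the appended series still holds it.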

Let's say metric Z spiked at 00:00:25 and you only have the average of the values for midnight and for one minute past midnight. How will you know from looking at the graph when the issue really took place?

You won't. The only way to really know is to then check two minutes of syslogs which itself can take 5 minutes each time you need to do it. Enter time series data to save the day!
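To make the problem concrete, here is a minimal sketch with made-up numbers: a metric that sits at 20.0 for a minute, except for a single spike to 500.0 at second 25. The baseline, the spike value, and the variable names are all assumptions for illustration.

```python
from statistics import mean

# Hypothetical per-second readings of metric Z for the minute starting at 00:00:00.
per_second = [20.0] * 60
per_second[25] = 500.0  # the spike at 00:00:25

# What a one-point-per-minute graph would show.
minute_average = mean(per_second)

# What the raw per-second readings can still tell you.
spike_second = max(range(60), key=lambda s: per_second[s])

print(f"one-minute average: {minute_average:.1f}")
print(f"spike at 00:00:{spike_second:02d} with value {per_second[spike_second]}")
```

The average comes out at 28.0, which looks like a mildly busy minute. Only the raw readings show that something jumped to 500.0 at exactly second 25.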

Here is a great blog all about how time series data can be used to quickly determine all types of useful averages and readings.

Time Series & Time Series Databases

Next up, time series data is a more open and standardized format that makes generating graphs and dashboards far simpler for the user and also allows for super-high-resolution graphs and metric collection. For instance, Prometheus was, and still is, among the most widely used time series databases in the open source and even cloud communities. AWS offers managed Prometheus deployments for EKS/ECS metrics and any other data you want to store in your database. Typically, people pair Grafana with Prometheus to create monitors, graphs, and dashboards that make monitoring and checking performance much simpler.

In Prometheus, each metric datapoint contains both of the following:

  • a float64 value
  • a millisecond-precision timestamp

Grafana pulls this data from the database and graphs each value, creating detailed line graphs or other graph types extremely quickly, since no math needs to be done to draw them unless the user requests averages or other derived values.
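As a rough picture of what one such datapoint looks like, here is a small Python sketch. The `Sample` class and its field names are hypothetical and are not part of any Prometheus client library; they only mirror the two fields listed above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Sample:
    """Illustrative stand-in for a single datapoint (not a Prometheus API)."""
    timestamp_ms: int  # milliseconds since the Unix epoch
    value: float       # Python floats are 64-bit, matching float64

    def time(self) -> datetime:
        # Convert the millisecond timestamp to a timezone-aware UTC datetime.
        return datetime.fromtimestamp(self.timestamp_ms / 1000, tz=timezone.utc)


# Made-up reading: a timestamp in late 2023 plus 250 ms, and an arbitrary value.
sample = Sample(timestamp_ms=1_700_000_000_250, value=0.73)
print(sample.time().isoformat(), sample.value)
```

Because each point already carries its own timestamp and value, a graphing tool can plot it directly without any extra math.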

Both methods would work, but when you want to monitor metric Z without false positives waking you up in the middle of the night, you want multiple readings within, say, a one-minute period. So instead of collecting 1 reading per second, averaging the 60 recorded values for the minute, and graphing that single point, what if you were able to keep all 60 values in that one-minute period, so you get detailed graphs that are simple to zoom into and show when and what really happened?
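One way to use those 60 readings for alerting is to page only when several of them breach a threshold, rather than on a single blip. The sketch below is an assumption about how such a rule could look, not how any specific platform implements it; `should_alert`, the threshold, and the sample values are made up.

```python
def should_alert(samples: list[float], threshold: float, min_breaches: int) -> bool:
    """Fire only if at least `min_breaches` readings in the window exceed the threshold."""
    breaches = sum(1 for value in samples if value > threshold)
    return breaches >= min_breaches


# A one-second blip: 59 normal readings and a single spike.
blip = [20.0] * 60
blip[25] = 500.0

# A sustained problem: the second half of the minute is over the threshold.
sustained = [20.0] * 30 + [500.0] * 30

print(should_alert(blip, threshold=100.0, min_breaches=10))       # False
print(should_alert(sustained, threshold=100.0, min_breaches=10))  # True
```

The single blip stays visible on the graph but does not wake anyone up, while the sustained breach does.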

Modern DevOps / SRE / SysEng teams I work with tend to lean into time series collection with Prometheus due to the pinpoint accuracy you get. You can still make graphs that average over the time scale, or show the SUM of 4 hours of the metric; it opens the door to all metric calculation and aggregation without needing to store data in another table or database. You can really get any graph you want by storing time series data in a good time series database. Yes, it may end up costing more in storage fees in the long run, but the sanity of your DevOps/SRE and SysEng teams is on the line. Be kind to them. Grafana is open source and has tons of add-ons to extend functionality, and since it is open source, you can make an add-on for custom actions and graphs.
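The point about deriving averages and sums from the raw series, with no second table, can be sketched like this. The `downsample` helper and the sample data are hypothetical; in practice the query layer of your time series database would do this work for you.

```python
from collections import defaultdict
from statistics import mean


def downsample(samples, bucket_seconds, aggregate):
    """Group (timestamp_seconds, value) pairs into fixed buckets and aggregate each one."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Round the timestamp down to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return {start: aggregate(values) for start, values in sorted(buckets.items())}


# Hypothetical raw series: one reading per second for two minutes.
raw = [(ts, 1.0 if ts < 60 else 3.0) for ts in range(120)]

per_minute_avg = downsample(raw, 60, mean)   # one averaged point per minute
four_hour_sum = downsample(raw, 4 * 3600, sum)  # SUM over a 4-hour bucket

print(per_minute_avg)
print(four_hour_sum)
```

Both views come from the same stored readings, so you can change the bucket size or the aggregate later without collecting anything again.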


Let's say you are purely in AWS and you use DataDog. Well, DataDog collects CloudWatch data, and customers grant the DataDog production account access to that data via an IAM policy. Think about this for a second: you will now be paying for metrics two times. Once with AWS CloudWatch, then again with DataDog. I will say it makes collection super simple, and DataDog will also take your CloudTrail logs the same way. But since you are already paying for detailed metric collection, why should you pay another company to display and alert you on data you already have? CloudWatch has exploded over the last two or three years and become super simple to use. You pay for CloudWatch and CloudTrail already: hook up a Grafana instance, and maybe even a dedicated Prometheus DB if you have a ton of people using Grafana at the same time, and you can save costs and avoid multi-million dollar contracts over the next 5 years with ${A_COMPANY}.

Below is a table I made showing what I like and dislike about some of the platforms I have used, deployed, and supported. I always want to handle alerting and data collection, as I love the insights I can get out of that data to help my team and other teams save time and effort.

Continual Improvement and Innovation

You can add all the monitors you have in your current system into Grafana and re-make all the dashboards you may need across each of your company's verticals. Since it is open source, there are no limits as to what you can make or borrow from others to display the data you've collected in the way that makes the most sense. You should work closely with your devs to learn what should be monitored, at what frequency, and what they want to see on a dashboard to pinpoint an issue when it happens. Also make sure you circle back with them after an incident does take place and see if the currently deployed version of that dashboard helped, or what should be changed or added to make it that much better.

Pick a Platform

DataDog, New Relic, Honeycomb: these are some of the big players in the field of monitoring and log aggregation. Honestly, the three tools I mentioned are all fantastic in their own ways. The choice of which platform fits your company best is not one that should be made by one person or team. What if you think Honeycomb is the best because of all that juicy ML they inject, which allows you to ask the console questions about what is happening in your environment so you don't have to query the data yourself?

If Honeycomb solves one team's requirements but not the other teams', then not only will you waste time, money, and energy getting the configuration files authored and the agent rolled out to all containers, servers, VMs, etc., but it will be even worse when you have to move to New Relic because it has a far better platform all the teams can use. As someone who loves data and looks for new ways to leverage all the data I collect, I can say I see this exact issue take place all the time.

I work with a company I will not name, but I can say they had been on DataDog for 5 years when a new incoming CTO wanted New Relic deployed and canceled their DataDog contract (purely because the CTO was friends with some of the higher ups inside New Relic!). Not only was this really frustrating to the cloud team, but it took them the majority of the year to rebuild dashboards, monitors, and agent configs and deploy them all once again. This is an extreme example, but it's not the first time and won't be the last time I see it happen.

One Table to Monitor Them All!

There are so many large monitoring companies, but here is a table of the ones I know well, covering the things each does extremely well and the things it does not do so well or does not do at all.

Platform: DataDog
Great At: Lightweight Agent monitoring, most built in integrations, easy to setup and build out huge enterprise accounts quickly, encourage logs, metrics and EUM to get the whole picture
Not So Great: Some of the newer features need better documentation and since they try to do all the things, not all are done well at launch. Sometimes it takes a year or more to get a well-made and easy to use product out of DataDog. Support is not ...

Platform: New Relic
Great At: RUM, EUM, Canaries, Web-Testing, AB testing results and detailed reporting on all website activities, very lightweight JS code loads the browser agent on page loading, great for web-apps or mobile apps. Can compete with AppDynamics
Not So Great: Since there is a large focus on APM/EUM, and even GUI testing, this is the tool for devs to build in testing. This removes syseng and SRE teams if your company is siloed. Can be a huge issue. Also, documentation is not the best and they push features out too fast. GUI is not great and is far harder to know how to get around unless you get forced into using it all the ...

Platform: Honeycomb.io
Great At: Only platform where you can ask a question to the app, and have it auto-generate graphs and answers to your questions. This makes it the most dev friendly. This platform has a huge focus on collecting all the data you can so the ML/AI can answer questions as best as it can. Super cool interface and best for troubleshooting in the midst of a live production issue. Some companies replicate data to Honeycomb just to be able to query the AI when they have a production issue reported.
Not So Great: The client and support docs are not the best, a lot of the docs assume you know a lot of Linux and Windows admin tasks. Company is newest on the list and is always ...

Platform: Logic Monitor
Great At: Hands down best support. Chat agents will setup complex items for you. This is the best platform for technical roles such as DevOps/SRE/SysEng - it is not dev friendly, but if they can live using dashboards made by SysEng teams to their spec, it can be a powerful tool. Logic Monitor is like Datadog and New Relic if they had a baby. Powerful system and app metrics paired with extensive log collection and analysis - AI provides insights as you are looking at data to try and help you nail the issue down.
Not So Great: Complex. Made by engineers and boy it is an engineering GUI and engineer tool. Some may consider this a pro not a con, which is why I list it on each side. Devs may get frustrated and not use the platform as they never want to learn new things. The other big con: Logic Monitor uses the old style of agent collection which requires a dedicated server in your environment all the agents can talk to and forward data to and through that single proxy server. This is a hard and fast requirement. SecOps may not be okay with it, or maybe they want a single ingress/egress point to ...

Closing Thoughts

Well, this is part one of a long series about monitoring systems, log collection and centralization, and more. This will be a recurring series until the end of the year, which I hope will teach you more than you ever knew before about data and the power it really has to make your job much simpler.

Next up: DataDog and What a Free Account Showed Me About my Home Environment

Did you find this article valuable?

Support Jeremy Hall 🖥️ ⌨️🖱️ by becoming a sponsor. Any amount is appreciated!