Any infrastructure – cloud or on-prem – that hosts and supports a business service at a technical level needs monitoring. Monitoring is an essential part of infrastructure maintenance and helps keep the components healthy. Nowadays we also have proactive solutions in place that flag potential issues before they turn into outages.
AWS CloudWatch is a monitoring service that integrates with most other AWS services. At its core, it is a metrics database for anything deployed on AWS. Note, however, that a few services are not integrated with CloudWatch. In this post, we take a look at the core concepts of how CloudWatch integrates and works with AWS services and cloud infrastructure.
Before we proceed, it is important to understand the difference between monitoring and logging. Logging is the process of collecting event-based logs, which may be system-generated or user-generated; it is data produced by the activities of the various components of a system. Monitoring, on the other hand, generates metrics based on data points. A metric is a value read at a point in time. When data points are collected over a period of time, various operations can be performed on them and graphs can be drawn against the timeline.
To avoid confusion: AWS CloudWatch is a monitoring solution. AWS CloudTrail, by contrast, records event-based API activity – generated by systems or users – and stores it in files in an S3 bucket. Metrics can also be derived from logs; thus, CloudTrail logs can be processed in CloudWatch as well.
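To illustrate the log-to-metric idea, here is a minimal sketch in plain Python of what a metric filter does conceptually: it scans log events for a pattern and emits a count. The log format and pattern below are made up for illustration; in practice this is done by CloudWatch metric filters, not hand-rolled code.

```python
# Sketch of a metric filter: count log lines matching a pattern,
# turning event-based logs into a numeric data point.

def count_matching_events(log_lines, pattern):
    """Count log lines containing the given pattern, as a metric filter would."""
    return sum(1 for line in log_lines if pattern in line)

# Hypothetical CloudTrail-style events, simplified for illustration.
logs = [
    '{"eventName": "ConsoleLogin", "errorMessage": "Failed authentication"}',
    '{"eventName": "RunInstances"}',
    '{"eventName": "ConsoleLogin", "errorMessage": "Failed authentication"}',
]

failed_logins = count_matching_events(logs, "Failed authentication")
print(failed_logins)  # 2
```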
Metrics generated in CloudWatch are huge in volume. CloudWatch metrics are enabled by default for some services, while others need explicit enablement. As a result, organizing this large number of metrics in a way that makes sense becomes a task in itself. CloudWatch addresses this with namespaces. Namespaces group metrics based on the service they belong to, making them easy to navigate.
For the default metrics, AWS provides namespaces in the format AWS/service. For example, all EC2-related metrics are managed under the AWS/EC2 namespace. Similarly, when publishing custom data points we can create our own namespaces. Keep in mind that there are rules and limits on which characters can be used and how long a namespace name can be.
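As a rough sketch of those naming rules, the helper below checks a candidate custom namespace. The allowed character set and the 255-character limit reflect my reading of the documented constraints and should be verified against the current CloudWatch documentation; the reserved `AWS/` prefix for AWS-provided namespaces is as described above.

```python
import re

# Hypothetical validator for a custom CloudWatch namespace:
# up to 255 characters from an assumed allowed set, and custom
# namespaces must not begin with the reserved "AWS/" prefix.
ALLOWED = re.compile(r'^[A-Za-z0-9._\-/#:]{1,255}$')

def is_valid_custom_namespace(ns):
    return bool(ALLOWED.match(ns)) and not ns.startswith("AWS/")

print(is_valid_custom_namespace("MyApp/Orders"))  # True
print(is_valid_custom_namespace("AWS/EC2"))       # False (reserved prefix)
```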
As you may have realized by now, metrics are fundamental to CloudWatch. Metrics are a set of data points recorded over a period of time. In monitoring, a value is read at a given frequency and recorded, and the recorded values are then used to derive insight in the form of graphs or estimates. Since they are based on a schedule and frequency, every data point collected carries a timestamp.
Every metric has a resolution. Resolution defines the frequency of data point collection. As a standard, AWS maintains a resolution of 1 minute – i.e., it reads the data value for the given attribute/property every minute. Any metric collected more frequently than once per minute – that is, with a period shorter than 1 minute – is regarded as a high-resolution metric.
Collecting data points (at any resolution) can generate a lot of data. AWS implements a data retention policy for these metrics: high-resolution metrics are retained for 3 hours. After 3 hours, high-resolution data points are aggregated to a 1-minute resolution, which is stored for the next 15 days. These 1-minute data points are then further aggregated to a 5-minute resolution for the next 63 days. Finally, they are aggregated to a 1-hour resolution and kept for 15 months before being deleted.
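The retention tiers above can be sketched as a simple lookup: given a data point's age, return the finest resolution at which it is still stored. This is a simplified model of the policy just described (15 months is taken as 455 days, per the tiers above), not an AWS API.

```python
# Sketch of CloudWatch's metric retention tiers: map a data point's
# age to the finest stored resolution, in seconds.

def stored_resolution(age_hours):
    """Return the stored resolution (seconds) for a data point of the
    given age, or None once it falls outside the retention window."""
    if age_hours <= 3:
        return 1            # high-resolution points kept for 3 hours
    if age_hours <= 15 * 24:
        return 60           # 1-minute resolution for 15 days
    if age_hours <= 63 * 24:
        return 300          # 5-minute resolution for 63 days
    if age_hours <= 455 * 24:
        return 3600         # 1-hour resolution for ~15 months
    return None             # expired

print(stored_resolution(2))        # 1  (still high-resolution)
print(stored_resolution(24 * 30))  # 300 (a month old: 5-minute tier)
```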
Every metric consists of a set of attributes – key/value pairs that describe various characteristics of the metric. Some of these pairs describe the behavior of the resource; others identify the resource itself. These identifiers are called the dimensions of the metric. A metric can have as many as 10 dimensions.
When we look at CloudWatch metrics for EC2, it displays the metrics for all EC2 instances collected over a specified period. To narrow this down to a specific EC2 instance, we can use the instance ID as a dimension and display only the metrics associated with the instance of interest.
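A toy illustration of this narrowing-by-dimension: metrics are identified by a name plus key/value dimensions, and filtering on a dimension such as `InstanceId` isolates a single resource. The instance IDs and metric records below are made up.

```python
# Metrics modeled as name + dimensions; filtering on one dimension
# narrows the listing to a single resource (IDs are illustrative).

metrics = [
    {"name": "CPUUtilization", "dimensions": {"InstanceId": "i-0abc"}},
    {"name": "CPUUtilization", "dimensions": {"InstanceId": "i-0def"}},
    {"name": "NetworkIn",      "dimensions": {"InstanceId": "i-0abc"}},
]

def filter_by_dimension(metrics, key, value):
    return [m for m in metrics if m["dimensions"].get(key) == value]

for m in filter_by_dimension(metrics, "InstanceId", "i-0abc"):
    print(m["name"])  # CPUUtilization, then NetworkIn
```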
Various statistical calculations can be performed on the collected data points for evaluation and presentation. Supported statistics include aggregate functions like sum, min, max, average, count, and percentiles. Applying these functions to the data points over a given period helps us generate graphs that reveal behavioral, functional, and performance patterns.
Every 2-dimensional graph has time on one of its axes (usually the X-axis). Thus, specifying the period when generating graphs based on statistics is very important. The number of data points present in a given period depends on the resolution of the metric. Such statistical calculations can also be used to trigger alarms.
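A small sketch of the aggregate statistics computed over the data points in one period. The values are made up, and the percentile uses one simple convention (nearest-rank); CloudWatch's exact percentile computation may differ.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile - one simple convention among several."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Illustrative data points collected within a single period.
datapoints = [12.0, 48.5, 33.1, 75.9, 20.4]

stats = {
    "SampleCount": len(datapoints),
    "Sum": sum(datapoints),
    "Minimum": min(datapoints),
    "Maximum": max(datapoints),
    "Average": sum(datapoints) / len(datapoints),
    "p90": percentile(datapoints, 90),
}
print(stats)
```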
Alarms are used in CloudWatch to trigger actions – an SNS notification, an Auto Scaling action, etc. Evaluation of an alarm's trigger condition is based on 3 parameters –
- Period – the time interval at which the alarm evaluates the given metric. Think of this as the alarm's monitoring frequency.
- Evaluation periods – the number of most recent data points to consider in the evaluation.
- Data points to alarm – the number of breaching data points within the evaluation period needed to trigger the alarm.
This can get a bit tricky to understand, so let us take an example. We have a metric with a resolution of 1 minute, which tracks CPU utilization. There is a requirement from the client which says – if CPU utilization crosses a given threshold 3 times in the last 5 minutes, autoscaling should trigger the provisioning of a new instance.
To achieve this, we set the evaluation periods to 5 and the data points to alarm to 3. This is interpreted as – if CPU utilization crosses the threshold in any 3 of the last 5 one-minute periods, the alarm is triggered. If it crosses the threshold only once or twice in the last 5 minutes, the alarm does not go off.
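The "M out of N" evaluation above can be sketched as a few lines of Python. This is a simplified model (real alarms also handle missing data and other subtleties); the CPU readings are illustrative.

```python
# Simplified "M out of N" alarm evaluation: the alarm fires when at
# least `datapoints_to_alarm` of the last `evaluation_periods` data
# points breach the threshold.

def alarm_state(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    recent = datapoints[-evaluation_periods:]
    breaches = sum(1 for v in recent if v > threshold)
    return "ALARM" if breaches >= datapoints_to_alarm else "OK"

cpu = [40, 85, 90, 30, 95]  # last five 1-minute CPU readings (%)

# Three of the five readings exceed 80, so 3-of-5 triggers the alarm.
print(alarm_state(cpu, threshold=80, evaluation_periods=5, datapoints_to_alarm=3))  # ALARM
# Only one reading exceeds 92, so the alarm stays OK.
print(alarm_state(cpu, threshold=92, evaluation_periods=5, datapoints_to_alarm=3))  # OK
```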
Alarms are of 2 types – metric alarms, which are based directly on a single metric, and composite alarms, which are based on a combination of multiple metric alarms. Alarms have states –
- OK – everything is fine as per the evaluation.
- ALARM – the evaluation indicates a breach.
- INSUFFICIENT_DATA – the alarm has just begun collecting data, or not enough data is available for the evaluation.
Sometimes, we may wish to collect metrics that cannot be generated from the data CloudWatch already has – for example, a metric that monitors the availability or latency of our endpoints. We can create canaries that run on a given schedule to generate such custom metric data. Canaries are configurable scripts written in Node.js or Python. They offer programmatic access to a headless Chrome browser via Puppeteer or Selenium WebDriver.
Apart from AWS services, we can also monitor the underlying infrastructure. At times, there is a requirement to collect system logs and metrics from EC2 instances or on-prem servers and process them in CloudWatch. We can do this with the help of the CloudWatch Agent, which collects internal metrics from EC2 instances, on-prem servers, containers, applications, etc. It can be configured with details like the kind of logs or metrics to be collected, the collection frequency, and so on. These settings live in the CloudWatch Agent configuration file, which is in JSON format.
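For a sense of what that configuration file looks like, here is a minimal sketch that collects memory and disk metrics into a custom namespace and ships one application log file. The namespace, log path, and log group name are placeholders; consult the CloudWatch Agent documentation for the full schema.

```json
{
  "metrics": {
    "namespace": "MyApp/Servers",
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent"],
        "metrics_collection_interval": 60
      },
      "disk": {
        "measurement": ["used_percent"],
        "resources": ["/"],
        "metrics_collection_interval": 60
      }
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/myapp/app.log",
            "log_group_name": "myapp-logs",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}
```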
CloudWatch offers Container Insights and Lambda Insights as its newest offerings. Metrics related to containerized applications, like ContainerInstanceCount, CpuUtilized, and DeploymentCount, and those related to Kubernetes, like cluster_failed_node_count, cluster_node_count, and node_cpu_limit, can be collected using the CloudWatch Agent. The agent used for generating these metrics and insights is containerized as well. Container Insights can be enabled on Elastic Container Service and Elastic Kubernetes Service.
Similarly, CloudWatch Lambda Insights is a monitoring solution for serverless architectures. It uses a CloudWatch Lambda extension that is delivered as a Lambda layer. This helps collect metrics like total CPU time, INIT duration, memory utilization, and a few more.