Goals for observability
The goal is to have good overall system observability. Otherwise the developers (Service Providers) and DevOps will be left in the dark in production.
What does high cardinality mean, in short
High cardinality means that a field has many distinct values; in this document it goes together with high dimensionality.
In simple terms, high dimensionality means this: you have correlated log records with many parameters, none of which can be derived from the others.
This means that every single parameter in your log records is a highly valuable dimension that you can use to filter, group and aggregate, and so extract valuable information about your system.
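To make the term concrete, here is a minimal sketch that counts the distinct values seen for each field in a handful of log records. The records and field names are invented for illustration, and the sketch assumes every value is hashable.

```python
from collections import defaultdict

def field_cardinality(records):
    """Return the number of distinct values seen for each field."""
    distinct = defaultdict(set)
    for record in records:
        for field, value in record.items():
            distinct[field].add(value)
    return {field: len(values) for field, values in distinct.items()}

# Hypothetical log records; the field names are illustrative only.
records = [
    {"level": "ERROR", "region": "eu-west-1", "user_id": 4561},
    {"level": "ERROR", "region": "eu-west-1", "user_id": 7002},
    {"level": "INFO", "region": "us-east-1", "user_id": 9913},
]

print(field_cardinality(records))
```

A field like user_id keeps growing with the number of users, which is what makes it a high-cardinality dimension, while level stays at a handful of values.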
The key observability features, which are especially important in microservices, are:
- High cardinality of logs
- (APM) Application Performance Metrics (end-to-end)
- (TT) Transaction tags in order to have telemetry correlation
What does High Cardinality mean and why does it matter
Running distributed applications in a k8s cluster without observability is like flying a plane without instruments: a crash is near and certain, and you will not even know when or how it happened.
On your local machine you have full observability of your application and its components, but this is not true in the cloud.
When distributing our system we are also distributing the places where things will go wrong.
So we need instruments that tell us more about our system.
Monitoring applies to known failure modes.
What about everything else?
What about the 3 pillars?
Metrics, Traces and Logs.
Metrics – aggregated numbers about things that happened, correlated by time
Traces – things that happened, correlated with each other and by time
Logs – raw information that can be correlated by type/category (if built in) and time
Those three are reduced derivatives of Events: something happened in your system in a specific, rich context.
Aggregating Events gives us Metrics; aggregating them over time gives us time series. Correlating Events with each other gives us distributed tracing. Keeping Events raw or indexed gives us raw data to analyze.
High-cardinality dimensions are fields with many possible values.
These fields provide the rich context necessary to explore UNKNOWN situations.
Some defects in a distributed system are UNKNOWNS.
Therefore we strive to get those context-rich Events in the first place and then derive the other aspects/views.
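The derivation of metrics, time series and traces from events can be sketched in a few lines. This is only an illustration under assumed field names (time, service, level, correlation_id); real tooling does the same kind of work at scale.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical events; the field names are assumptions for this sketch.
events = [
    {"time": "2021-02-19T12:22:05Z", "service": "gateway",
     "level": "INFO", "correlation_id": "a1"},
    {"time": "2021-02-19T12:22:06Z", "service": "registration-service",
     "level": "ERROR", "correlation_id": "a1"},
    {"time": "2021-02-19T12:23:10Z", "service": "gateway",
     "level": "INFO", "correlation_id": "b2"},
]

def parse_time(value):
    # The trailing "Z" is matched as a literal character in the format.
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")

def error_count(events):
    """Metric: events aggregated into a single number."""
    return sum(1 for e in events if e["level"] == "ERROR")

def events_per_minute(events):
    """Time series: the same kind of aggregation, bucketed by minute."""
    buckets = Counter()
    for e in events:
        buckets[parse_time(e["time"]).strftime("%H:%M")] += 1
    return dict(buckets)

def traces(events):
    """Trace: events sharing a correlation id, ordered by time."""
    grouped = defaultdict(list)
    for e in sorted(events, key=lambda e: parse_time(e["time"])):
        grouped[e["correlation_id"]].append(e["service"])
    return dict(grouped)

print(error_count(events))
print(events_per_minute(events))
print(traces(events))
```

All three views come from the same list of events, but none of them can be turned back into the events, which is why the events are collected first.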
KNOWN UNKNOWNS
- Health checks – works or not
- APM – Application Performance Metrics – numbers for things
- Time-series metrics
- Distributed tracing
- Logs
- Events
– High cardinality dimensions. The previous can be derived from them.
– Events give us the options to answer more questions
– Events give us Debugging and Exploration capabilities
– The further UP this list we are, the more we are into Monitoring and Resiliency capabilities.
– The further DOWN we go, the more we are into Debugging and Exploration capabilities.
– The Debugging and Exploration part cannot be automated: you cannot invent mitigations for UNKNOWNS in advance.
– The Monitoring and Resiliency part can be automated because you can create mitigations – thresholds, alerts, scripts for values of metrics and events, etc.
UNKNOWN UNKNOWNS
High cardinality log entry
Example of high cardinality Log Entry:
- What is it? Message: Read timed out
- How severe is it? Level: ERROR
- When does it happen? Timestamp: 2021-02-19T12:22:48Z
- What is it? Service: registration-service
- What is it? Team: Events
- What is running? Back-end-Commit: 123asdA321
- What is running? Back-end-Build: 123123.9
- What is running? Back-end-runtime: java-18.0/.NET 5
- What is running? Back-end-OS: Linux/Windows
- What is running? Front-end-Browser Type: Chrome/Safari
- What is running? Front-end-Commit: 123asdA321
- What is running? Front-end-Build: 123123.9
- Where is it? Region: eu-west-1
- Where is it? Node: node_e89123s52
- Who caused it? IP: 192.168.0.40
- Who caused it? Customer Id: 55123
- Who caused it? User Id: 4561
- Who caused it? Subscription Id: 4561
- How to correlate all? Correlation Id: G-U-I-D
- … more info
=> then serialize it as JSON and save it
One such example is the APIM HTTP call log.
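The "serialize as JSON and save it" step for the entry listed above could look roughly like the sketch below. The build_event helper and the snake_case field names are hypothetical; only the values are taken from the example entry.

```python
import json
from datetime import datetime, timezone

def build_event(message, level, **context):
    """Merge the message with every context field into one flat record."""
    event = {
        "message": message,
        "level": level,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    event.update(context)
    return event

event = build_event(
    "Read timed out",
    "ERROR",
    service="registration-service",
    team="Events",
    backend_commit="123asdA321",
    region="eu-west-1",
    node="node_e89123s52",
    customer_id=55123,
    user_id=4561,
    correlation_id="G-U-I-D",
)

# One JSON object per line, ready to be shipped or stored.
line = json.dumps(event)
print(line)
```

Keeping the record flat means every key can be used directly as a dimension for filtering and grouping later.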
Exploration power
When we have context-rich events, we have the exploration power to build all kinds of analytics on top of them.
- Activity spikes?
- Activity spikes in a time frame?
- Activity spikes for a user?
- Errors affecting a region?
- Errors affecting a specific user or customer?
- Errors coming from a specific build/commit?
- Performance changes between builds/commits/versions?
- Error rate changes between builds/commits/versions?
- Different load times in different browsers/regions/etc.?
- Are we getting DDoS'ed by our own customers inadvertently?
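Several of the questions above reduce to "group by one dimension, then compare". A minimal sketch, assuming hypothetical events with level, build and region fields:

```python
from collections import defaultdict

# Hypothetical events; only the fields needed for two of the questions.
events = [
    {"level": "ERROR", "build": "123123.9", "region": "eu-west-1"},
    {"level": "INFO", "build": "123123.9", "region": "eu-west-1"},
    {"level": "INFO", "build": "123123.8", "region": "eu-west-1"},
    {"level": "INFO", "build": "123123.8", "region": "us-east-1"},
]

def error_rate_by(events, dimension):
    """Group events by any field and return the share of ERROR events."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for e in events:
        key = e[dimension]
        totals[key] += 1
        if e["level"] == "ERROR":
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}

# Error rate changes between builds?
print(error_rate_by(events, "build"))
# Errors affecting a region?
print(error_rate_by(events, "region"))
```

The same function answers a new question for every field that was recorded, which is the practical payoff of capturing the context up front.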
- Who is responsible for putting the data there in the first place?
  - There should be an agreement between the teams on what is needed for this context, and then the teams put it in.
- How do we handle GDPR compliance in case we run in the EU?
  - We substitute personal identifiers such as email, IP, etc. with hashes in order to allow compliance and retain log value.
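One possible way to do the substitution is a keyed hash (HMAC-SHA256) from the Python standard library, so the same identifier always maps to the same token and events stay correlatable. The key, the field names and the choice of HMAC are assumptions of this sketch, not something the text prescribes, and whether this is sufficient for compliance is a legal question the sketch does not answer.

```python
import hashlib
import hmac

# Assumption: in a real system the key comes from a secret store.
SECRET_KEY = b"replace-with-a-secret-key"
PERSONAL_FIELDS = ("email", "ip")  # hypothetical field names

def pseudonymize(event, key=SECRET_KEY, fields=PERSONAL_FIELDS):
    """Return a copy of the event with personal fields replaced by keyed hashes."""
    cleaned = dict(event)
    for field in fields:
        if field in cleaned:
            value = str(cleaned[field]).encode("utf-8")
            cleaned[field] = hmac.new(key, value, hashlib.sha256).hexdigest()
    return cleaned

event = {"message": "Read timed out", "ip": "192.168.0.40", "user_id": 4561}
print(pseudonymize(event))
```

Because the hash is deterministic for a given key, questions such as "errors affecting a specific user" can still be answered on the hashed value.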
What technology stack allows that?
- Azure Monitor
- Elasticsearch / OpenSearch
- Honeycomb
- AWS CloudWatch
An example log record with high cardinality is the APIM HTTP call log (more context should be added down the road):
{
  "Level": 4,
  "isRequestSuccess": true,
  "time": "2021-02-16T13:03:53.8940820Z",
  "operationName": "Microsoft.ApiManagement/GatewayLogs",
  "category": "GatewayLogs",
  "durationMs": 56,
  "callerIpAddress": "99.99.999.9",
  "correlationId": "66db63c9-0b9f-432a-be0f-43e9a5a8330e",
  "location": "West Europe",
  "properties": {
    "method": "GET",
    "url": "https://xapim.azure-api.net/xapi/",
    "backendResponseCode": 200,
    "responseCode": 200,
    "responseSize": 777,
    "cache": "none",
    "backendTime": 88,
    "requestSize": 888,
    "apiId": "xapi",
    "operationId": "root__get",
    "apimSubscriptionId": "hamburger",
    "clientProtocol": "HTTP/1.1",
    "backendProtocol": "HTTP/1.1",
    "apiRevision": "1",
    "clientTlsVersion": "1.2",
    "backendMethod": "GET",
    "backendUrl": "https://xapi.azurewebsites.net/"
  },
  "resourceId": "/SUBSCRIPTIONS/XXX/RESOURCEGROUPS/X/PROVIDERS/MICROSOFT.APIMANAGEMENT/SERVICE/X-APIM"
}
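To use a nested record like this for filtering and grouping, it can help to flatten it so that each leaf field becomes one dimension. The sketch below works on a shortened copy of the record; the flatten helper and the dotted key names are illustrative choices, not part of APIM.

```python
import json

def flatten(record, prefix=""):
    """Turn nested objects into dotted keys so each field is one dimension."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

# A shortened copy of the record above.
raw = """{
  "durationMs": 56,
  "location": "West Europe",
  "properties": {"method": "GET", "responseCode": 200, "apiId": "xapi"}
}"""

dimensions = flatten(json.loads(raw))
print(dimensions)
```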
Types of APM (Application Performance Metrics)
- Service-oriented KPIs are about the performance of your services from the users’ perspective, for example response time, availability and mean time to repair.
- Efficiency-oriented KPIs indicate the number of handled events or transactions per time frame.
- Classic monitoring KPIs are usually resource-oriented ones like CPU and memory consumption, I/O, network load, database usage and the like.
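As a rough illustration of the first two KPI types, the sketch below derives them from a list of request events. The field names, the 60-second window, and the definition of availability as the share of non-5xx responses are all assumptions made for the example.

```python
from statistics import mean

# Hypothetical request events collected over a 60-second window.
requests = [
    {"duration_ms": 56, "status": 200},
    {"duration_ms": 120, "status": 200},
    {"duration_ms": 900, "status": 500},
    {"duration_ms": 64, "status": 200},
]
WINDOW_SECONDS = 60

def service_kpis(requests):
    """Service-oriented KPIs: what the user experiences."""
    ok = sum(1 for r in requests if r["status"] < 500)
    return {
        "mean_response_ms": mean(r["duration_ms"] for r in requests),
        "availability": ok / len(requests),
    }

def throughput_per_second(requests, window_seconds=WINDOW_SECONDS):
    """Efficiency-oriented KPI: handled requests per time frame."""
    return len(requests) / window_seconds

print(service_kpis(requests))
print(throughput_per_second(requests))
```

Resource-oriented KPIs such as CPU and memory are usually read from the host or the platform rather than computed from request events, so they are left out here.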
(APM) Application Performance Metrics use-cases
- Monitoring means watching KPIs and watermarks and reacting in case of a breach.
- Proactive production analysis goes a step further: instead of just reacting, we try to anticipate what will happen based on current KPIs. If memory consumption keeps rising with the number of users, what does it mean?
- Usage analysis: What kinds of users are using your application, and in which ways? What browsers are being used? From what geographic regions or locations? What pages are hit most frequently? How many parallel sessions are there?
- Post-mortem diagnosis: quickly triage, diagnose and pinpoint problems.
- Profiling is a classic developer task to find out about method performance, memory usage, CPU consumption, garbage collections and other metrics.
- Architecture validation aims at identifying anti-patterns and other problems before they create performance-relevant problems under load. This is most important for architects and DevOps because it can be done early and prevents unsuitable architectures. Performance cannot be bolted on top of your apps; it must be built into their foundation.
- Performance testing comprises various ways of measuring the application’s behavior under load. (This is not covered by the platform)
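The monitoring use-case (watch KPIs against watermarks, react on a breach) can be sketched as follows. The watermark values and the react placeholder are hypothetical; a real setup would raise an alert or run a mitigation instead of returning strings.

```python
# Hypothetical watermarks; real limits depend on the service.
WATERMARKS = {"cpu_percent": 85.0, "memory_percent": 90.0, "error_rate": 0.05}

def breaches(kpis, watermarks=WATERMARKS):
    """Return the KPIs whose current value exceeds its watermark."""
    return {
        name: value
        for name, value in kpis.items()
        if name in watermarks and value > watermarks[name]
    }

def react(breached):
    """Placeholder reaction: a real system would alert or run a mitigation."""
    return [f"ALERT {name}={value}" for name, value in sorted(breached.items())]

current = {"cpu_percent": 91.5, "memory_percent": 40.0, "error_rate": 0.02}
print(react(breaches(current)))
```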
(TT) Transaction tags
Transaction tagging puts a transaction tag/id on every transaction, from the UI to the server to the database.
The transaction id is sent to the monitoring server along with the other telemetry data. By correlating the data by transaction id, the server can answer questions like:
- Which user action caused the problem?
- Which user caused the problem?
- What AJAX requests were triggered by the “checkout” user action?
- What REST services were triggered by this user action?
- Why did a certain user action trigger 150 database calls, whereas others triggered only 10?
As the comparison shows, the questions that TT answers are different from the ones that APM answers. Both are important.
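A minimal sketch of the correlation step: every telemetry record carries the transaction id, and grouping by it shows how many calls each user action triggered. The records, the tier names and the assumption that each transaction has exactly one "ui" record are invented for the example.

```python
from collections import Counter, defaultdict

# Hypothetical telemetry; every record carries the transaction id
# that was assigned when the user action started in the UI.
telemetry = [
    {"transaction_id": "t1", "tier": "ui", "user": "4561", "action": "checkout"},
    {"transaction_id": "t1", "tier": "rest", "call": "POST /orders"},
    {"transaction_id": "t1", "tier": "db", "call": "INSERT orders"},
    {"transaction_id": "t1", "tier": "db", "call": "UPDATE stock"},
    {"transaction_id": "t2", "tier": "ui", "user": "7002", "action": "search"},
    {"transaction_id": "t2", "tier": "db", "call": "SELECT products"},
]

def correlate(telemetry):
    """Group all telemetry records by their transaction id."""
    transactions = defaultdict(list)
    for record in telemetry:
        transactions[record["transaction_id"]].append(record)
    return dict(transactions)

def calls_per_action(telemetry, tier):
    """Count the calls of one tier triggered by each user action."""
    counts = Counter()
    for records in correlate(telemetry).values():
        # Assumes each transaction contains one "ui" record naming the action.
        action = next(r["action"] for r in records if r["tier"] == "ui")
        counts[action] += sum(1 for r in records if r["tier"] == tier)
    return dict(counts)

print(calls_per_action(telemetry, "db"))
print(calls_per_action(telemetry, "rest"))
```

Without the shared transaction id, the database calls could not be attributed to the user action that caused them.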
(TT) Transaction tags use-cases
Technology that enables APM, TT, Structured Event logs in Azure