AKS Monitoring in 2026: Master Azure Kubernetes Service
Master AKS monitoring in 2026 with Azure Monitor & Managed Prometheus. Learn manual methods, then automate diagnostics with OpsSqad's AI.

Founder of OpsSqad.ai. Your AI on-call engineer — it connects to your servers, learns how they run, and helps your team resolve issues faster every time.

Mastering Azure Kubernetes Service (AKS) Monitoring in 2026: A Comprehensive Guide
As of 2026, Azure Kubernetes Service has become the backbone of enterprise containerized workloads, with organizations running increasingly complex microservices architectures across multi-cluster deployments. Without comprehensive monitoring, these environments become black boxes where issues cascade undetected until they impact end users. This guide provides a complete framework for implementing production-grade AKS monitoring that catches problems before they become incidents.
Key Takeaways
- Azure Kubernetes Service monitoring requires collecting metrics, logs, and events across five distinct layers: infrastructure, cluster control plane, node, pod/container, and application levels.
- Container insights, integrated with Azure Monitor, automatically collects performance metrics and logs from AKS clusters without requiring manual instrumentation of most Kubernetes components.
- Managed Prometheus and Managed Grafana in 2026 provide enterprise-grade observability without the operational overhead of self-hosting these open-source tools.
- Effective alerting strategies focus on actionable signals like resource saturation, pod restart loops, API server latency, and application-specific error rates rather than vanity metrics.
- The average cost of AKS monitoring in 2026 ranges from $150-$800 per month per cluster depending on log retention policies and metric granularity settings.
- Security monitoring extends beyond traditional health metrics to include audit log analysis, network traffic patterns, and runtime threat detection.
- AI-powered troubleshooting platforms can reduce mean time to resolution (MTTR) from 15-30 minutes of manual kubectl commands to under 2 minutes via conversational interfaces.
The Critical Need for AKS Monitoring in 2026
What is AKS Monitoring?
AKS monitoring encompasses the collection, analysis, and visualization of data related to the health, performance, and behavior of your Azure Kubernetes Service clusters and the applications running within them. This includes metrics (time-series numerical data), logs (text-based event records), events (Kubernetes state changes), and traces (request flow across services) from various components, from the Kubernetes control plane to individual pods and application code.
Unlike traditional server monitoring where you track a handful of static machines, AKS monitoring deals with ephemeral workloads that scale dynamically, pods that migrate across nodes, and services that communicate through complex networking layers. A single AKS cluster might contain hundreds of pods spread across dozens of nodes, with containers starting and stopping multiple times per hour during normal operations.
Why is AKS Monitoring Essential in 2026?
The landscape of cloud-native operations has evolved dramatically. In 2026, the importance of AKS monitoring is amplified by several converging factors:
Increased Complexity: The average enterprise AKS deployment in 2026 runs 40% more microservices than in 2023, according to current industry data. Microservices architectures, dynamic scaling, and multi-cluster deployments create intricate systems where a failure in one component can cascade across dozens of services. Without comprehensive visibility, identifying the root cause of an issue becomes nearly impossible.
Cost Optimization: With Azure compute costs representing 60-70% of total cloud spend for container-heavy workloads, understanding resource consumption is critical. 2026 data shows that organizations with mature AKS monitoring practices reduce their Kubernetes infrastructure costs by 25-35% by identifying over-provisioned resources, inefficient scaling policies, and resource-intensive pods.
Enhanced Reliability: Modern SLA requirements demand 99.95% or higher uptime for business-critical applications. Proactive detection of issues through monitoring prevents outages and ensures high availability. The difference between detecting a memory leak through monitoring alerts versus discovering it when your application crashes can mean the difference between zero downtime and a 30-minute outage affecting thousands of users.
Security Posture: The threat landscape in 2026 includes sophisticated attacks targeting Kubernetes environments specifically. Continuous monitoring helps identify and respond to security threats in real-time, from unauthorized API access attempts to privilege escalation attacks and cryptocurrency mining containers injected into your cluster.
Developer Productivity: Developers need insights into their application's performance and behavior within the AKS environment to iterate quickly and effectively. When developers can see exactly how their code performs in production—including resource consumption, error rates, and dependency latencies—they ship features 40% faster according to 2026 DevOps benchmarking studies.
Core Components of AKS Monitoring: Metrics, Logs, and Events
Understanding Kubernetes Monitoring Concepts
Before diving into AKS specifics, understanding the different layers of monitoring is essential. Each layer provides unique insights and requires different collection strategies:
Infrastructure Level: This monitors the underlying Azure infrastructure that hosts AKS, including virtual machine health, disk performance, and network throughput. For AKS, this includes the Azure Virtual Machine Scale Sets that comprise your node pools and the Azure networking components that enable cluster connectivity.
Cluster Level: This focuses on Kubernetes control plane components—the API server, etcd database, scheduler, and controller manager. These components orchestrate your entire cluster, and their health directly impacts your ability to deploy, scale, and manage workloads.
Node Level: This monitors the health and resource utilization of individual worker nodes. Each node runs a kubelet that manages pods, a container runtime that executes containers, and various system services. Node-level monitoring tracks CPU, memory, disk, and network usage for each physical or virtual machine in your cluster.
Pod/Container Level: This monitors the performance and resource consumption of pods and their containers. Since pods are the smallest deployable units in Kubernetes and containers are the actual processes running your code, this layer provides the most granular operational data.
Application Level: This monitors the internal metrics and behavior of your deployed applications, including business metrics, custom application metrics, and distributed traces showing how requests flow through your microservices.
AKS Data Plane: Metrics, Logs, and Events Explained
Metrics are time-series numerical data representing performance indicators. For AKS, critical metrics include:
- CPU and memory usage at node, pod, and container levels
- Network traffic (bytes sent/received, packet loss)
- Disk I/O operations and throughput
- Pod restart counts and container exit codes
- API server request latency and throughput
- etcd database size and performance
- Persistent volume usage and IOPS
Metrics are typically collected at 60-second intervals by default in Container insights, though you can configure more granular collection for specific workloads.
Logs are text-based records of events occurring within the cluster and applications. AKS generates several types of logs:
- Container logs (stdout/stderr from your applications)
- Control plane logs (API server, scheduler, controller manager, audit logs)
- Node system logs (kubelet, container runtime)
- Application-specific logs (custom logging from your code)
Unlike metrics which are structured numerical data, logs contain unstructured or semi-structured text that requires parsing and analysis to extract meaningful information.
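Because container logs are only semi-structured, a small amount of parsing turns them into queryable records. A minimal sketch — the log format, field names, and parse_log_line helper here are hypothetical, not a standard library API:

```python
import re

# Hypothetical semi-structured log line: timestamp, level, then a
# free-text message containing key=value pairs.
LOG_PATTERN = re.compile(r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$")

def parse_log_line(line: str) -> dict:
    """Extract timestamp, level, and key=value fields from a log line."""
    m = LOG_PATTERN.match(line)
    if not m:
        return {"raw": line}  # fall back to the unparsed text
    record = m.groupdict()
    # Pull key=value pairs out of the free-text message.
    record["fields"] = dict(re.findall(r"(\w+)=(\S+)", record["msg"]))
    return record

line = "2026-01-15T10:32:07Z ERROR payment failed order_id=8812 status=timeout"
parsed = parse_log_line(line)
print(parsed["level"])               # ERROR
print(parsed["fields"]["order_id"])  # 8812
```

Log Analytics performs a similar extraction at query time via KQL operators like parse and extract; doing it at ingestion keeps queries simpler but increases stored volume.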
Events are records of significant occurrences within the Kubernetes cluster. The Kubernetes API server generates events when state changes occur, such as:
- Pod scheduling decisions
- Image pulling operations
- Volume mounting
- Container startup failures
- Node status changes
- Resource quota violations
Events are stored in etcd for a limited time (typically one hour) and provide crucial context when troubleshooting issues.
Collecting Data: Azure Monitor and Container Insights
Azure Monitor serves as the central hub for collecting and analyzing telemetry data from AKS. The Container insights feature provides specialized capabilities designed specifically for monitoring containerized workloads.
When you enable Container insights for an AKS cluster, Azure deploys the containerized Azure Monitor agent (the ama-logs pods, successor to the legacy Log Analytics agent) as a DaemonSet, ensuring one monitoring pod runs on each node. This agent collects performance metrics, container logs, and Kubernetes events, then forwards them to a Log Analytics workspace for storage and analysis.
The data collection process works like this:
- The monitoring agent scrapes metrics from the kubelet on each node
- Container logs are collected from the container runtime
- Kubernetes events are retrieved from the API server
- All data is batched and sent to Log Analytics over HTTPS
- Azure Monitor processes and indexes the data for querying
This architecture ensures comprehensive data collection with minimal performance impact—typically less than 2% CPU overhead per node in 2026 implementations.
Leveraging Azure Monitor for Comprehensive AKS Insights
Azure Monitor is Microsoft's native solution for collecting, analyzing, and acting on telemetry from your cloud and on-premises environments. For AKS, Azure Monitor provides a powerful and integrated platform for gaining deep visibility into your clusters without requiring third-party tools for basic monitoring needs.
Enabling Monitoring for AKS Clusters
Enabling monitoring for a new AKS cluster is straightforward. When creating a cluster through the Azure portal, the "Integrations" tab includes an option to enable Container insights. For existing clusters, you can enable monitoring through the Azure portal, Azure CLI, or infrastructure-as-code tools.
For a new cluster via Azure CLI:
az aks create \
--resource-group myResourceGroup \
--name myAKSCluster \
--enable-addons monitoring \
--workspace-resource-id /subscriptions/<subscription-id>/resourceGroups/<workspace-rg>/providers/Microsoft.OperationalInsights/workspaces/<workspace-name>

For an existing cluster:
az aks enable-addons \
--resource-group myResourceGroup \
--name myAKSCluster \
--addons monitoring \
--workspace-resource-id /subscriptions/<subscription-id>/resourceGroups/<workspace-rg>/providers/Microsoft.OperationalInsights/workspaces/<workspace-name>

After enabling monitoring, verify the deployment by checking that the monitoring pods are running:
kubectl get pods -n kube-system | grep ama-logs

You should see output similar to:
ama-logs-4j8xz 2/2 Running 0 3m
ama-logs-rs-6d4c8b7f9d-xkp2l 1/1 Running 0 3m
ama-logs-t9k2m 2/2 Running 0 3m
Warning: The Log Analytics workspace should normally live in the same subscription and region as your AKS cluster. Cross-subscription workspace configurations are possible but require additional RBAC permissions on the workspace and are not supported in all Azure regions as of 2026.
Azure Monitor Platform Metrics and Resource Logs
Platform Metrics are metrics emitted by AKS itself, providing insights into the health and performance of the control plane and node pools. These metrics are automatically collected without any configuration and are available immediately when you create an AKS cluster.
Key platform metrics include:
- API Server Latency: Measures how long the Kubernetes API server takes to respond to requests. Sustained latency above 1 second indicates control plane stress.
- Node CPU/Memory Percentage: Shows resource utilization across your node pools, helping identify when to scale nodes.
- Pod Count: Tracks the number of running pods, which helps with capacity planning.
- Node Condition: Indicates whether nodes are in Ready, NotReady, or Unknown states.
You can view these metrics in the Azure portal under your AKS cluster's "Metrics" section. They're stored for 93 days by default and can be used for alerting without additional configuration.
Resource Logs provide detailed information about AKS operations and events. Unlike platform metrics which are automatically collected, resource logs require you to create a diagnostic setting to specify which log categories to collect and where to send them.
Critical log categories for AKS include:
- kube-apiserver: All requests to the Kubernetes API server
- kube-controller-manager: Controller manager operations
- kube-scheduler: Pod scheduling decisions
- kube-audit: Kubernetes audit logs showing who did what and when
- cluster-autoscaler: Autoscaling decisions and events
To enable resource logs via Azure CLI:
az monitor diagnostic-settings create \
--name aks-diagnostics \
--resource /subscriptions/<subscription-id>/resourceGroups/<rg>/providers/Microsoft.ContainerService/managedClusters/<cluster-name> \
--logs '[
{"category": "kube-apiserver", "enabled": true},
{"category": "kube-controller-manager", "enabled": true},
{"category": "kube-scheduler", "enabled": true},
{"category": "kube-audit", "enabled": true},
{"category": "cluster-autoscaler", "enabled": true}
]' \
--workspace /subscriptions/<subscription-id>/resourceGroups/<workspace-rg>/providers/Microsoft.OperationalInsights/workspaces/<workspace-name>

Note: Audit logs can be extremely verbose in busy clusters. In 2026, a typical production AKS cluster generates 5-15 GB of audit logs per day. Consider filtering audit logs to specific namespaces or operations to control costs.
Container Insights: Deep Dive into Cluster and Application Performance
Container insights offers a specialized experience for monitoring AKS that goes beyond raw metrics and logs. It provides pre-built visualizations, performance analysis, and health monitoring specifically designed for Kubernetes environments.
Metrics Collection: Container insights automatically collects key performance metrics from nodes, pods, and containers every 60 seconds. This includes:
- CPU usage (millicores and percentage)
- Memory usage (working set and RSS)
- Network bytes sent/received
- Disk reads/writes
- Container restart counts
These metrics are stored in the Perf and InsightsMetrics tables in Log Analytics, where you can query them using Kusto Query Language (KQL).
Log Collection: Container insights streams container logs from stdout and stderr to Log Analytics, storing them in the ContainerLog table. You can configure log collection to include or exclude specific namespaces, reducing data volume for noisy applications.
To exclude logs from a specific namespace, create a ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: container-azm-ms-agentconfig
  namespace: kube-system
data:
  schema-version: v1
  config-version: ver1
  log-data-collection-settings: |-
    [log_collection_settings]
       [log_collection_settings.stdout]
          enabled = true
          exclude_namespaces = ["kube-system", "gatekeeper-system"]
       [log_collection_settings.stderr]
          enabled = true
          exclude_namespaces = ["kube-system", "gatekeeper-system"]

Live Data: One of the most powerful features of Container insights in 2026 is the Live Data view, which provides real-time streaming of container logs, events, and metrics directly in the Azure portal. This is invaluable when troubleshooting active incidents, as you can see log output and metrics updating in real time without SSH access to nodes.
To access Live Data, navigate to your AKS cluster in the Azure portal, select "Insights" from the left menu, then choose the "Containers" tab and click "View live data" for any container.
Pro tip: For advanced log analysis and correlation, consider exporting your AKS logs from Log Analytics to a dedicated SIEM solution like Microsoft Sentinel or Splunk. This enables cross-platform correlation, advanced threat detection, and longer retention periods for compliance requirements. The export can be configured using Log Analytics workspace data export rules, which stream logs to Azure Storage or Event Hubs in near real-time.
Harnessing Prometheus and Managed Grafana for Advanced Visualization
While Azure Monitor provides a robust foundation, many organizations leverage open-source tools like Prometheus for metrics collection and Grafana for visualization. These tools have become industry standards in the Kubernetes ecosystem, with extensive community support and pre-built dashboards for common workloads.
Understanding Managed Prometheus with Azure Monitor
Managed Prometheus, integrated with Azure Monitor, allows you to collect Prometheus metrics from your AKS clusters without managing the Prometheus infrastructure yourself. This service, generally available in 2026, eliminates the operational burden of running Prometheus servers, managing storage, ensuring high availability, and handling upgrades.
The architecture works as follows:
- You enable Managed Prometheus on your AKS cluster
- Azure deploys a metrics collection agent as a DaemonSet
- The agent scrapes Prometheus metrics from your applications and Kubernetes components
- Metrics are sent to an Azure Monitor workspace (different from Log Analytics)
- Data is stored in a fully managed, scalable time-series database
To enable Managed Prometheus for an existing AKS cluster:
az aks update \
--resource-group myResourceGroup \
--name myAKSCluster \
--enable-azure-monitor-metrics \
--azure-monitor-workspace-resource-id /subscriptions/<subscription-id>/resourceGroups/<rg>/providers/Microsoft.Monitor/accounts/<workspace-name>

By default, Managed Prometheus scrapes metrics from:
- Kubernetes API server
- Kubelet on each node
- cAdvisor (container metrics)
- Node Exporter (node-level system metrics)
To scrape custom application metrics, you need to annotate your pods with Prometheus scrape annotations:
apiVersion: v1
kind: Pod
metadata:
  name: my-application
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
spec:
  containers:
  - name: app
    image: myapp:1.0
    ports:
    - containerPort: 8080

Warning: Managed Prometheus in 2026 has a default retention period of 18 months, but this comes with storage costs that scale with metric cardinality. High-cardinality metrics (those with many unique label combinations) can significantly increase costs. Monitor your metric cardinality using the Azure portal's cost analysis tools.
Building Custom Dashboards with Managed Grafana
Managed Grafana provides a fully managed Grafana service that integrates seamlessly with Azure Monitor and Managed Prometheus. As of 2026, it runs Grafana 10.x with enterprise features like RBAC, team management, and reporting included at no additional cost.
To connect Managed Grafana to your Managed Prometheus data:
- Create a Managed Grafana instance in the Azure portal
- Navigate to the Grafana endpoint provided (e.g., https://myinstance-abcd1234.grafana.azure.com)
- Add a data source: Configuration → Data Sources → Add data source → Prometheus
- Use the Azure Monitor workspace endpoint as the URL
- Select "Azure Authentication" and choose your managed identity
Once connected, you can create custom dashboards or import community dashboards from Grafana's dashboard repository. Popular AKS dashboards include:
- Kubernetes Cluster Monitoring (ID: 15760): Comprehensive cluster overview
- Kubernetes Pod Monitoring (ID: 15761): Detailed pod-level metrics
- Node Exporter Full (ID: 1860): In-depth node metrics
Here's an example PromQL query for visualizing pod CPU usage:
sum(rate(container_cpu_usage_seconds_total{namespace="production"}[5m])) by (pod)

And for memory usage:

sum(container_memory_working_set_bytes{namespace="production"}) by (pod) / 1024 / 1024

Key Prometheus Metrics for AKS
Understanding which Prometheus metrics to monitor is crucial for effective AKS observability. Here are the most important metrics as of 2026:
Resource Requests and Limits:
- kube_pod_container_resource_requests_cpu_cores: How much CPU each container has requested
- kube_pod_container_resource_limits_cpu_cores: The CPU limit for each container
- kube_pod_container_resource_requests_memory_bytes: Memory requests
- kube_pod_container_resource_limits_memory_bytes: Memory limits
These metrics help identify over-provisioned or under-provisioned containers.
Actual Resource Usage:
- container_cpu_usage_seconds_total: Total CPU time consumed (use rate() to get the usage rate)
- container_memory_working_set_bytes: Current memory usage
- node_cpu_seconds_total: Node-level CPU usage
- node_memory_MemAvailable_bytes: Available memory on nodes
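The rate() function referenced above converts an ever-increasing counter such as container_cpu_usage_seconds_total into a per-second rate. A deliberately simplified Python sketch of the idea — the real Prometheus implementation also handles counter resets and extrapolation at window boundaries:

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Approximate PromQL rate(): per-second increase between the first
    and last (timestamp, counter_value) samples in a window."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        return 0.0
    # Note: real rate() also detects counter resets; omitted here.
    return (v1 - v0) / (t1 - t0)

# container_cpu_usage_seconds_total sampled every 60 seconds:
samples = [(0, 120.0), (60, 123.0), (120, 126.0), (180, 129.0)]
print(simple_rate(samples))  # 0.05 -> the container used 5% of one core
```

A flat counter yields a rate of zero, which is why idle containers show no CPU usage even though their cumulative counter keeps its value.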
Kubernetes State Metrics:
- kube_pod_status_phase: Current phase of each pod (Running, Pending, Failed, etc.)
- kube_pod_container_status_restarts_total: Number of container restarts
- kube_deployment_status_replicas_available: Number of available replicas for deployments
- kube_node_status_condition: Node condition status (Ready, DiskPressure, MemoryPressure, etc.)
By tracking these metrics and setting appropriate alerts, you can proactively identify issues before they impact your applications.
Proactive Alerting Strategies for AKS in 2026
Collecting data is only half the battle; acting on it is crucial. Effective alerting ensures that you are notified of potential issues before they impact your users or your business. However, poorly configured alerts lead to alert fatigue, where teams ignore notifications because most are false positives.
What are the Key Signals for Configuring Alerts in AKS?
The most effective alerts focus on symptoms that directly impact user experience or system stability rather than low-level technical metrics. Here are the critical signals across different layers of your AKS deployment:
Resource Utilization:
- Node CPU above 80% for 5+ minutes: Indicates the need to scale node pools or optimize workloads
- Node memory above 85% for 5+ minutes: Risk of pod evictions and OOMKilled containers
- Pod CPU throttling: When containers hit their CPU limits, causing performance degradation
- Persistent Volume (PV) usage above 85%: Risk of applications failing due to disk space
Application Errors:
- Container exit code non-zero: Application crashes that need investigation
- Increased error rate in application logs: Pattern matching for ERROR or FATAL log levels
- HTTP 5xx response rate above threshold: Application-level failures affecting users
Pod Restarts:
- Pod restarted more than 3 times in 10 minutes: Indicates CrashLoopBackOff or recurring failures
- ImagePullBackOff events: Problems pulling container images from registries
- Liveness probe failures: Application health check failures
Node Health:
- Node in NotReady state for more than 2 minutes: Node connectivity or kubelet issues
- DiskPressure or MemoryPressure node conditions: Resource exhaustion on nodes
- High node count in node pool: Unexpected scaling events
API Server Latency:
- API server P95 latency above 1 second: Control plane performance degradation
- API server error rate above 1%: Problems with the Kubernetes control plane
Deployment Failures:
- Deployment not reaching desired replica count within 10 minutes: Rollout issues
- Pods stuck in Pending state for more than 5 minutes: Scheduling problems or resource constraints
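A rule like "pod restarted more than 3 times in 10 minutes" is, at its core, a sliding-window count over restart timestamps. A minimal sketch — restart_alert is a hypothetical helper, with times in seconds:

```python
def restart_alert(restart_times: list[float], now: float,
                  window_s: float = 600.0, threshold: int = 3) -> bool:
    """Fire when more than `threshold` restarts fall within the window."""
    recent = [t for t in restart_times if now - t <= window_s]
    return len(recent) > threshold

# Four restarts within the last 10 minutes -> alert fires.
print(restart_alert([100, 250, 400, 550], now=600))   # True
# Four restarts spread over an hour -> no alert.
print(restart_alert([0, 900, 1800, 2700], now=3600))  # False
```

The window and threshold are the two knobs worth tuning: a tight window catches CrashLoopBackOff quickly, while a wider one also surfaces slow, recurring failures.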
Configuring Alerts on Specific Metric Types in AKS
Azure Monitor provides several mechanisms for creating alerts based on metrics, logs, and resource health. As of 2026, the recommended approach is to use metric alerts for numerical thresholds and log query alerts for pattern matching in logs.
Creating a Metric Alert for High CPU Usage:
az monitor metrics alert create \
--name high-node-cpu \
--resource-group myResourceGroup \
--scopes /subscriptions/<subscription-id>/resourceGroups/<rg>/providers/Microsoft.ContainerService/managedClusters/<cluster-name> \
--condition "avg node_cpu_usage_percentage > 80" \
--window-size 5m \
--evaluation-frequency 1m \
--description "Alert when node CPU exceeds 80% for 5 minutes" \
--severity 2 \
--action-group /subscriptions/<subscription-id>/resourceGroups/<rg>/providers/Microsoft.Insights/actionGroups/<action-group-name>

Creating a Log Query Alert for Pod Restarts:
First, create the KQL query:
KubePodInventory
| where TimeGenerated > ago(10m)
| where Namespace == "production"
| summarize RestartCount = max(ContainerRestartCount) by Name, Namespace
| where RestartCount > 3

Then create the alert rule in the Azure portal or via ARM template. Log query alerts in 2026 support advanced features like dynamic thresholds that use machine learning to automatically adjust alert thresholds based on historical patterns.
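Conceptually, a dynamic threshold is a baseline learned from history plus a sensitivity margin. A deliberately simplified sketch of the idea — Azure's actual model is seasonality-aware and considerably more sophisticated:

```python
import statistics

def dynamic_threshold(history: list[float], sensitivity: float = 2.0) -> float:
    """Simplified dynamic threshold: mean of recent history plus k standard
    deviations. (Illustrative only -- not Azure Monitor's algorithm.)"""
    return statistics.mean(history) + sensitivity * statistics.stdev(history)

# CPU% baselines differ, so the learned threshold adapts to each:
quiet_hours = [20, 22, 21, 19, 23, 20]
busy_hours = [60, 65, 62, 58, 64, 61]
print(round(dynamic_threshold(quiet_hours), 1))  # 23.8
print(round(dynamic_threshold(busy_hours), 1))   # 66.8
```

The payoff is that the same alert definition works for both quiet and busy services without hand-tuning a static threshold per workload.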
Cluster Infrastructure Alerts
Infrastructure-level alerts focus on the health of nodes and the underlying Azure resources. Key alerts include:
Node Pool Scaling Alerts:
Monitor when your node pools scale up or down unexpectedly. This can indicate either legitimate load increases or potential issues like pod scheduling problems causing unnecessary scaling.
AzureMetrics
| where ResourceProvider == "MICROSOFT.CONTAINERSERVICE"
| where MetricName == "kube_node_status_allocatable_cpu_cores"
| summarize NodeCount = dcount(Resource) by bin(TimeGenerated, 5m)
| where NodeCount > 10 // Adjust threshold based on your environment

Network Performance Alerts:
In 2026, AKS supports advanced network observability through Azure Network Watcher integration. Monitor for:
- Packet loss above 1%
- Network latency spikes between pods
- Bandwidth saturation on node network interfaces
Kubernetes Control Plane Alerts
The Kubernetes control plane is critical for cluster operations. If the API server, scheduler, or controller manager experience issues, your entire cluster can become unstable.
API Server Availability Alert:
AzureDiagnostics
| where Category == "kube-apiserver"
| where TimeGenerated > ago(5m)
| summarize ErrorCount = countif(ResponseCode >= 500) by bin(TimeGenerated, 1m)
| where ErrorCount > 10

etcd Performance Alert:
etcd is the database backing Kubernetes. Slow etcd operations indicate control plane stress:
InsightsMetrics
| where Name == "etcd_disk_backend_commit_duration_seconds"
| summarize P95 = percentile(Val, 95) by bin(TimeGenerated, 5m)
| where P95 > 0.1 // 100 ms threshold

Application-Specific Alerts
Application-specific alerts require instrumenting your applications to expose custom metrics and structured logs. In 2026, most teams use OpenTelemetry for standardized instrumentation.
For a Node.js application exposing Prometheus metrics:
const promClient = require('prom-client');
const express = require('express');

const app = express();
const register = new promClient.Registry();

const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  registers: [register]
});

// Observe the duration of every request as it completes.
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () =>
    end({ method: req.method, route: req.path, status_code: res.statusCode })
  );
  next();
});

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

Then create an alert on elevated request duration:
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
) > 2

Advanced Troubleshooting and Monitoring Challenges
While native tools and open-source integrations provide excellent monitoring capabilities, complex scenarios and specific challenges can arise that require advanced techniques and careful planning.
Common AKS Monitoring Challenges
Data Volume and Cost: The most common challenge teams face in 2026 is managing the sheer volume of logs and metrics generated by AKS clusters. A medium-sized production cluster (50 nodes, 500 pods) can generate:
- 50-100 GB of container logs per day
- 5-15 GB of control plane logs per day
- 500,000-1,000,000 metric data points per hour
At 2026 Log Analytics pricing ($2.76 per GB for the first 100 GB), ingesting that volume unfiltered would cost thousands of dollars per month; with the filtering strategies below, typical ingestion bills land in the $150-300 per month range, plus additional costs for retention and queries.
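Ingestion cost scales linearly with daily volume, so it is worth sanity-checking the numbers. A quick sketch, assuming a flat per-GB price and a 30-day month (real bills add retention and query charges on top):

```python
def monthly_log_cost(gb_per_day: float, price_per_gb: float = 2.76) -> float:
    """Estimated Log Analytics ingestion cost for a 30-day month.
    Assumes a flat per-GB price; commitment tiers and retention are extra."""
    return gb_per_day * 30 * price_per_gb

# Unfiltered container logs quickly dominate the monitoring budget:
print(round(monthly_log_cost(50)))  # 4140 -- 50 GB/day ingested as-is
# After namespace filtering and sampling, a few GB/day is more typical:
print(round(monthly_log_cost(2)))   # 166
```

This gap between raw and filtered volume is precisely why the optimization strategies below matter.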
Strategies for cost optimization:
- Implement log filtering at the source: Configure container log collection to exclude verbose debug logs in production
- Use sampling for high-volume logs: Collect 10% of application logs for non-critical services
- Reduce metric cardinality: Avoid labels with high cardinality like user IDs or request IDs
- Set appropriate retention policies: Keep detailed logs for 30 days, aggregated data for 90 days
- Use workspace-based pricing: If ingesting more than 100 GB per day, commitment tiers offer 30-50% savings
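The sampling strategy above can be implemented deterministically, so the same log line is always kept or dropped no matter which agent or restart processes it. A hypothetical sketch (keep_log is an illustrative helper, not a Container insights setting):

```python
import zlib

def keep_log(line: str, sample_pct: int = 10) -> bool:
    """Deterministic sampler: hash the line and keep ~sample_pct percent.
    Hashing (rather than random.random) keeps a given line's fate
    stable across agent restarts and replicas."""
    return zlib.crc32(line.encode()) % 100 < sample_pct

lines = [f"GET /api/items/{i} 200" for i in range(1000)]
kept = sum(keep_log(l) for l in lines)
print(kept)  # keeps roughly 10% of the lines
```

In practice you would sample only low-severity lines and always keep warnings and errors, so the 90% you drop is the 90% you were least likely to query.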
Correlation Across Layers: When an issue occurs, correlating data across infrastructure, cluster, and application layers is challenging. A user-facing error might be caused by:
- An application bug (application layer)
- Resource exhaustion in a pod (container layer)
- Node memory pressure (node layer)
- Azure infrastructure issues (infrastructure layer)
The solution is implementing distributed tracing and correlation IDs. In 2026, Azure Monitor supports OpenTelemetry natively, allowing you to trace requests from the load balancer through multiple microservices and correlate with infrastructure metrics.
Distributed Tracing: Understanding the flow of requests across microservices requires distributed tracing capabilities. Azure Monitor Application Insights now supports OpenTelemetry tracing natively, providing end-to-end visibility into request flows.
Example OpenTelemetry instrumentation for a Go application:
import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	"go.opentelemetry.io/otel/sdk/trace"
)

func initTracer() {
	exporter, err := otlptracehttp.New(
		context.Background(),
		// Endpoint is host:port; otlptracehttp uses HTTPS by default.
		otlptracehttp.WithEndpoint("<app-insights-endpoint>"),
	)
	if err != nil {
		log.Fatalf("creating OTLP exporter: %v", err)
	}
	tp := trace.NewTracerProvider(
		trace.WithBatcher(exporter),
	)
	otel.SetTracerProvider(tp)
}

Advanced Troubleshooting Scenarios
Debugging Intermittent Pod Startup Failures:
When pods occasionally fail to start, the issue might not be captured in current logs. Use this approach:
- Check recent events across the namespace:
kubectl get events --sort-by='.lastTimestamp' -n production

- Look for ImagePullBackOff, CrashLoopBackOff, or Pending states
- Query historical pod status from Log Analytics:
KubePodInventory
| where TimeGenerated > ago(24h)
| where Namespace == "production"
| where PodStatus !in ("Running", "Succeeded")
| summarize FailureCount = count() by Name, PodStatus, bin(TimeGenerated, 1h)
| render timechart

- Examine previous container logs (for CrashLoopBackOff):

kubectl logs <pod-name> --previous -n production

Investigating Performance Degradation:
When users report slow application performance:
- Check pod CPU throttling:
Perf
| where ObjectName == "K8SContainer"
| where CounterName == "cpuThrottledNanoCores"
| where CounterValue > 0
| summarize ThrottledTime = sum(CounterValue) by InstanceName
| order by ThrottledTime desc

- Identify memory pressure:

kubectl top pods -n production --sort-by=memory

- Analyze application-level metrics for slow database queries or external API calls
- Review network latency between services using service mesh telemetry
Exploring AI/ML-Driven Anomaly Detection
As of 2026, Azure Monitor includes built-in anomaly detection capabilities that use machine learning to automatically identify unusual patterns in metrics and logs. This reduces the need for manual threshold configuration and adapts to changing application behavior.
Dynamic Thresholds automatically adjust alert thresholds based on historical data. For example, if your application typically sees higher traffic on weekdays versus weekends, dynamic thresholds account for this pattern and only alert when traffic deviates from the expected pattern.
To enable dynamic thresholds for an alert:
az monitor metrics alert create \
--name adaptive-cpu-alert \
--resource-group myResourceGroup \
--scopes <resource-id> \
--condition "avg node_cpu_usage_percentage > dynamic medium 2 of 4" \
--window-size 5m \
--evaluation-frequency 1m
The "medium" sensitivity and "2 of 4" parameters mean the alert triggers when the metric violates the learned threshold in 2 out of 4 consecutive evaluation periods, reducing false positives.
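The "2 of 4" rule is simple enough to sketch: count threshold violations over a window of recent evaluations. This is an illustrative model of the behavior, not Azure Monitor's actual implementation, and the fixed threshold here stands in for the learned dynamic one:

```python
def m_of_n_violation(samples, threshold, m=2, n=4):
    """True if at least m of the last n samples exceed the threshold."""
    window = samples[-n:]
    return sum(1 for s in window if s > threshold) >= m

# A single spike stays quiet; a sustained deviation fires.
print(m_of_n_violation([55, 92, 60, 58], threshold=80))  # False
print(m_of_n_violation([85, 92, 60, 88], threshold=80))  # True
```

This is why "M of N" evaluation produces fewer false positives than alerting on every individual breach.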
Security Monitoring Specific to AKS
Security is a critical aspect of any Kubernetes deployment. Beyond general health monitoring, specialized security monitoring is essential to protect your AKS clusters from threats ranging from misconfigured RBAC to runtime container compromises.
Beyond Microsoft Defender for Containers
While Microsoft Defender for Containers offers robust security features including vulnerability scanning, runtime threat detection, and Kubernetes-specific security recommendations, augmenting it with custom monitoring provides defense in depth.
Audit Log Analysis: Kubernetes audit logs record every API request to the cluster, including who made the request, what resource was accessed, and what action was performed. These logs are invaluable for security investigations but require careful analysis due to their volume.
Key security events to monitor in audit logs:
AzureDiagnostics
| where Category == "kube-audit"
| where log_s contains "authorization.k8s.io"
| extend auditLog = parse_json(log_s)
| where auditLog.responseStatus.code >= 403
| project TimeGenerated,
User = auditLog.user.username,
Verb = auditLog.verb,
Resource = auditLog.objectRef.resource,
Namespace = auditLog.objectRef.namespace,
ResponseCode = auditLog.responseStatus.code
| order by TimeGenerated desc
This query identifies failed authorization attempts, which could indicate:
- Misconfigured service accounts
- Attackers attempting privilege escalation
- Users accessing resources they shouldn't
Network Traffic Monitoring: Unusual network communication patterns can indicate compromised containers. For example:
- Containers initiating outbound connections to unknown IPs
- Internal services communicating with external cryptocurrency mining pools
- Excessive DNS queries (potential DNS tunneling)
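As a rough sketch of the "excessive DNS queries" heuristic, assume you can export per-pod query events from your DNS logs; the threshold and pod names below are invented for illustration:

```python
from collections import Counter

def flag_dns_anomalies(query_events, max_per_window=100):
    """query_events: (pod, domain) pairs observed in one time window.
    Returns pods whose query volume exceeds the threshold -- a possible
    sign of DNS tunneling or a misbehaving resolver loop."""
    counts = Counter(pod for pod, _ in query_events)
    return [pod for pod, n in counts.items() if n > max_per_window]

events = [("payment-svc", "db.internal")] * 30 + [("batch-job", "x.example.tld")] * 150
print(flag_dns_anomalies(events))  # ['batch-job']
```

In production you would bin events by time window and tune the threshold against your baseline, rather than use a fixed constant.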
With Azure Network Policy or Calico, you can log network policy violations:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: deny-external-egress
namespace: production
spec:
podSelector:
matchLabels:
tier: backend
policyTypes:
- Egress
egress:
- to:
- namespaceSelector: {}
ports:
- protocol: TCP
port: 443
Monitor denied traffic by reviewing network policy logs (for example, Calico's denied-packet logs) alongside NSG flow logs in Azure Network Watcher.
Vulnerability Scanning Integration: Microsoft Defender for Containers automatically scans container images for vulnerabilities, but you should also monitor for:
- Containers running with known critical CVEs
- Images pulled from untrusted registries
- Containers running as root when not necessary
Query for containers running as root:
KubePodInventory
| where PodStatus == "Running"
| extend SecurityContext = parse_json(ContainerSecurityContext)
| where SecurityContext.runAsUser == 0 or isempty(SecurityContext.runAsUser)
| project Name, Namespace, ContainerName, Image
Implementing Security Alerts
Configure alerts for critical security events:
Privileged Container Alert:
KubePodInventory
| where TimeGenerated > ago(5m)
| extend SecurityContext = parse_json(ContainerSecurityContext)
| where SecurityContext.privileged == true
| where Namespace !in ("kube-system", "gatekeeper-system") // Exclude system namespaces
| project TimeGenerated, Name, Namespace, ContainerName
Suspicious Process Execution Alert:
Using Microsoft Defender for Containers, you can create custom alerts for suspicious processes like cryptocurrency miners or network scanning tools being executed inside containers.
Secret Access Alert:
Monitor for unusual access to Kubernetes secrets:
AzureDiagnostics
| where Category == "kube-audit"
| extend auditLog = parse_json(log_s)
| where auditLog.objectRef.resource == "secrets"
| where auditLog.verb in ("get", "list")
| summarize AccessCount = count() by User = tostring(auditLog.user.username), bin(TimeGenerated, 5m)
| where AccessCount > 10 // Threshold for unusual access frequency
Skip the Manual Work: How OpsSqad Automates AKS Debugging
You've spent hours configuring Azure Monitor, setting up Prometheus exporters, building Grafana dashboards, and fine-tuning alert rules. Your monitoring is comprehensive—you know exactly when something goes wrong. But when an alert fires at 2 AM because pods are in CrashLoopBackOff, you're still faced with a barrage of manual kubectl commands to diagnose the problem.
What if you could simply ask an AI agent to do it for you?
The OpsSqad Approach to AKS Troubleshooting
OpsSqad's reverse TCP architecture means your AKS nodes connect out to OpsSqad cloud, eliminating the need for complex firewall configurations, VPN setups, or exposing your Kubernetes API server to the internet. AI agents, organized into specialized Squads like the K8s Squad, can then interact with your cluster via a simple chat interface.
Unlike traditional monitoring tools that only tell you what is wrong, OpsSqad's AI agents actively investigate why it's wrong and suggest remediation steps—all through natural language conversation.
Your 5-Step Journey to Effortless AKS Debugging with OpsSqad
1. Create account and Node: Sign up at app.opssquad.ai → Navigate to Nodes → Create Node with a descriptive name like "production-aks-cluster" → Get your unique Node ID and authentication token from the dashboard. This takes about 30 seconds.
2. Deploy agent: SSH to any node in your AKS cluster (or use Azure Cloud Shell with kubectl access) and run the installation commands using your Node ID and token from the dashboard:
curl -fsSL https://install.opssquad.ai/install.sh | bash
opssquad node install --node-id=<from-dashboard> --token=<from-dashboard>
opssquad node start
The lightweight agent establishes a secure reverse TCP connection to OpsSqad cloud. No inbound firewall rules are required; the connection is initiated from inside your network.
3. Browse Squad Marketplace: In the OpsSqad dashboard, navigate to Squad Marketplace → Find the "K8s Troubleshooting Squad" → Deploy Squad. This creates a private instance with specialized AI agents trained on Kubernetes diagnostics, AKS-specific issues, and common troubleshooting workflows.
4. Link agents to nodes: Open your deployed Squad → Go to the Agents tab → Give agents access to your Node. This grants permission for AI agents to execute whitelisted Kubernetes commands on your infrastructure. You control exactly which commands are allowed through the whitelist configuration.
5. Start debugging: Go to chat.opssquad.ai → Select your K8s Squad → Start chatting. The AI agents can now execute real kubectl commands on your cluster and provide intelligent analysis.
Realistic Chat Interaction: Debugging a Failing Pod
User: "My application pod 'payment-service-7d4f8b9c-xyz' in the production namespace is in CrashLoopBackOff. What's going on?"
K8s Squad: "I'll investigate the 'payment-service-7d4f8b9c-xyz' pod in the production namespace. First, let me check its current status and recent events."
kubectl get pod payment-service-7d4f8b9c-xyz -n production -o yaml
kubectl describe pod payment-service-7d4f8b9c-xyz -n production
K8s Squad: "The pod is indeed in CrashLoopBackOff with 5 restarts in the last 10 minutes. The kubectl describe output shows Back-off restarting failed container, and the last state was Terminated with exit code 137. Let me check the container logs to understand why it's terminating."
kubectl logs payment-service-7d4f8b9c-xyz -n production -c payment-service --previous
K8s Squad: "The logs show the container was killed with signal 9 (SIGKILL), and the last log entry before termination indicates java.lang.OutOfMemoryError: Java heap space. This is a memory issue. Let me check the current resource limits:"
kubectl get pod payment-service-7d4f8b9c-xyz -n production -o jsonpath='{.spec.containers[0].resources}'
K8s Squad: "The container has a memory limit of 512Mi, but your application is attempting to allocate more. Based on the logs, the Java heap is consuming all available memory. I have a few recommendations:
- Immediate fix: Increase the memory limit to at least 1Gi in your deployment
- Long-term optimization: Review your Java heap settings (-Xmx flag) and application memory usage
- Monitoring: Set up an alert for containers approaching their memory limits
Would you like me to show you how to update the deployment with increased memory limits?"
User: "Yes, show me how to update it."
K8s Squad: "I'll generate the patch command for you. Since I see this is a Deployment (based on the pod naming pattern), here's how to update it:"
kubectl patch deployment payment-service -n production -p '{"spec":{"template":{"spec":{"containers":[{"name":"payment-service","resources":{"limits":{"memory":"1Gi"},"requests":{"memory":"768Mi"}}}]}}}}'
K8s Squad: "This will trigger a rolling update. The new pods will have a 1Gi memory limit and a 768Mi memory request. Would you like me to monitor the rollout status?"
Benefits of the OpsSqad Model
Time Savings: What previously took 15 minutes of manual kubectl commands, log analysis, and documentation searching now takes 90 seconds via chat. During a production incident, this difference is measured in thousands of dollars of business impact.
Enhanced Security: OpsSqad's command whitelisting ensures AI agents can only execute approved commands. Sandboxed execution prevents destructive operations. Comprehensive audit logging records every command executed, who requested it, and the results—crucial for compliance and security investigations. And because the connection is reverse TCP (outbound from your network), you never expose your Kubernetes API server to the internet.
Reduced Complexity: No need to memorize hundreds of kubectl commands, jsonpath expressions, or KQL queries. The AI understands natural language requests like "show me pods using the most memory" and translates them to the appropriate commands.
Focus on Resolution: Spend less time on diagnostics and more time on fixing problems and building features. The AI handles the tedious investigation work, presenting you with actionable insights and suggested remediation steps.
Knowledge Sharing: Junior team members can troubleshoot complex issues with AI assistance, learning best practices through the AI's explanations. The chat history serves as documentation of how issues were resolved.
Prevention and Best Practices for Proactive AKS Monitoring
Proactive monitoring and a well-defined strategy are key to preventing issues before they impact your production environment. This section consolidates best practices for AKS monitoring based on lessons learned from thousands of production deployments in 2026.
Establishing a Monitoring Baseline
Define what "normal" looks like for your AKS clusters and applications. This baseline will help you quickly identify deviations and potential problems. Spend 2-4 weeks in a new environment collecting metrics before setting aggressive alert thresholds.
Key baseline metrics to establish:
- Average and P95 CPU/memory usage per node during business hours vs. off-hours
- Typical pod restart rate (even healthy applications occasionally restart)
- Normal API server request rate and latency
- Expected network traffic patterns between services
- Baseline error rates in application logs
Document these baselines and review them quarterly as your application evolves. What's normal in January might be abnormal in June after significant feature releases.
Implementing a Layered Monitoring Approach
Ensure you have visibility across all layers of your AKS deployment:
| Layer | What to Monitor | Tools | Retention |
|---|---|---|---|
| Infrastructure | Azure VM health, disk IOPS, network bandwidth | Azure Monitor platform metrics | 93 days |
| Cluster | Control plane logs, API server latency, etcd health | Container insights, diagnostic settings | 30-90 days |
| Node | CPU, memory, disk usage, kubelet health | Container insights, Managed Prometheus | 30 days |
| Pod/Container | Resource usage, restart counts, scheduling events | Container insights, Managed Prometheus | 30 days |
| Application | Business metrics, error rates, request latency, traces | Application Insights, custom metrics | 90 days |
Don't rely on a single tool or single layer. A complete picture requires correlation across layers.
Regular Review and Tuning of Alerts
Periodically review your alert rules to ensure they are still relevant and effective. Schedule monthly alert review meetings where you:
- Analyze alert frequency and false positive rate
- Review alerts that fired but didn't require action (tune thresholds)
- Identify incidents that occurred without alerts (create new alerts)
- Retire alerts that are no longer relevant
Alert fatigue is real: If your team receives more than 10 alerts per day on average, you have too many alerts or thresholds set too aggressively. In 2026 studies, teams that receive more than 15 alerts per day begin ignoring critical alerts, increasing MTTR by 40%.
Integrating Monitoring with CI/CD Pipelines
Automate the deployment of monitoring configurations and ensure that new deployments are properly monitored. This includes:
Infrastructure as Code for Monitoring:
Store alert rules, dashboard definitions, and log collection configurations in Git alongside your application code. Use tools like Terraform or Bicep to deploy monitoring configuration automatically.
Example Terraform for creating an AKS metric alert:
resource "azurerm_monitor_metric_alert" "high_cpu" {
name = "aks-high-cpu-alert"
resource_group_name = azurerm_resource_group.main.name
scopes = [azurerm_kubernetes_cluster.main.id]
description = "Alert when node CPU exceeds 80%"
severity = 2
frequency = "PT1M"
window_size = "PT5M"
criteria {
metric_namespace = "Microsoft.ContainerService/managedClusters"
metric_name = "node_cpu_usage_percentage"
aggregation = "Average"
operator = "GreaterThan"
threshold = 80
}
action {
action_group_id = azurerm_monitor_action_group.main.id
}
}
Automated Dashboard Creation:
When deploying a new microservice, automatically create a Grafana dashboard for it using the Grafana API:
curl -X POST https://myinstance.grafana.azure.com/api/dashboards/db \
-H "Authorization: Bearer ${GRAFANA_API_KEY}" \
-H "Content-Type: application/json" \
-d @dashboard-template.json
Pre-deployment Monitoring Checks:
Add a step to your CI/CD pipeline that verifies monitoring is configured before deploying to production:
- name: Verify monitoring configured
run: |
# Check that Prometheus scrape annotations exist
if ! grep -q "prometheus.io/scrape" k8s/deployment.yaml; then
echo "ERROR: Prometheus scrape annotations missing"
exit 1
fi
# Check that health check endpoints are defined
if ! grep -q "livenessProbe" k8s/deployment.yaml; then
echo "ERROR: Liveness probe missing"
exit 1
fi
Cost Optimization for AKS Monitoring
Monitoring costs can spiral out of control without careful management. As of 2026, implement these cost optimization strategies:
1. Use Log Analytics workspace commitment tiers: If you ingest more than 100 GB per day across all sources, commitment tiers offer 30-50% savings compared to pay-as-you-go pricing.
2. Implement retention policies based on data value:
- Keep detailed logs for 7-30 days for recent troubleshooting
- Aggregate and downsample older data to 5-minute resolution
- Archive compliance-required logs to Azure Storage ($0.01 per GB) after 90 days
3. Filter logs at the source: Don't send debug-level logs from production applications to Log Analytics. Filter them at the application level or using Container insights configuration.
4. Reduce metric cardinality: High-cardinality labels (user IDs, request IDs, timestamps) in Prometheus metrics exponentially increase storage costs. Keep label cardinality under 100 unique values per metric.
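The multiplication is easy to underestimate: the worst-case series count for a metric is the product of its label cardinalities. The label counts below are invented for illustration:

```python
from math import prod

def series_estimate(label_cardinalities):
    """Worst-case time-series count: one series per label-value combination."""
    return prod(label_cardinalities.values())

safe = series_estimate({"method": 7, "status": 8, "service": 40})
risky = series_estimate({"method": 7, "status": 8, "service": 40, "user_id": 50_000})
print(safe)   # 2240
print(risky)  # 112000000 -- one high-cardinality label turns thousands of series into millions
```

Actual series counts are lower because not every combination occurs, but the worst case is what you pay for when a runaway label ships to production.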
5. Use basic logs for high-volume data: Azure Monitor's basic logs tier (introduced in 2025) costs 60% less than analytics logs but with limited query capabilities—perfect for compliance logs you rarely query.
6. Monitor your monitoring costs: Set up alerts when Log Analytics ingestion exceeds expected thresholds:
Usage
| where TimeGenerated > ago(1d)
| where IsBillable == true
| summarize DataGB = sum(Quantity) / 1000 by bin(TimeGenerated, 1h)
| where DataGB > 5 // Alert if ingesting more than 5 GB per hour
With these strategies, teams typically reduce monitoring costs by 40-60% while maintaining comprehensive observability.
Frequently Asked Questions
How much does AKS monitoring cost in 2026?
AKS monitoring costs vary based on cluster size and data retention policies, but typically range from $150-$800 per month per production cluster. This includes Log Analytics ingestion ($2.76 per GB for the first 100 GB), Managed Prometheus storage ($0.20 per million samples), and Managed Grafana ($8.76 per active user per month). A medium-sized cluster (50 nodes, 500 pods) generating 75 GB of logs monthly would cost approximately $250/month for monitoring alone. Costs can be reduced by 40-60% through commitment tiers, log filtering, and appropriate retention policies.
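That $250 estimate works out roughly as follows. The unit prices are the figures quoted above, while the sample volume (100M Prometheus samples) and user count (3 Grafana users) are assumptions for the example; actual Azure pricing varies by region and tier:

```python
def monthly_monitoring_cost(log_gb, prom_million_samples, grafana_users,
                            log_price=2.76, prom_price=0.20, grafana_price=8.76):
    """Back-of-envelope monthly cost for one cluster's monitoring stack."""
    return (log_gb * log_price
            + prom_million_samples * prom_price
            + grafana_users * grafana_price)

print(round(monthly_monitoring_cost(75, 100, 3), 2))  # 253.28
```

Log ingestion dominates the total, which is why retention policies and source-side filtering are the highest-leverage cost controls.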
What's the difference between Container insights and Managed Prometheus?
Container insights is Azure's native monitoring solution that automatically collects metrics, logs, and events from AKS clusters and stores them in Log Analytics workspaces. Managed Prometheus is an Azure-managed implementation of the open-source Prometheus monitoring system that collects metrics in Prometheus format and stores them in Azure Monitor workspaces. Container insights is easier to set up and integrates tightly with Azure Monitor, while Managed Prometheus offers better compatibility with Prometheus-based tools, more flexible querying with PromQL, and is preferred by teams already using Prometheus. Many organizations use both: Container insights for logs and basic metrics, Managed Prometheus for detailed application metrics.
How do I monitor AKS clusters across multiple Azure subscriptions?
For multi-subscription AKS monitoring, create a centralized Log Analytics workspace in a dedicated monitoring subscription, then configure each AKS cluster to send data to this central workspace regardless of which subscription it resides in. This requires granting the AKS cluster's managed identity "Monitoring Metrics Publisher" role on the workspace. Use Azure Lighthouse for cross-tenant monitoring scenarios. In 2026, Azure Monitor workspaces support cross-subscription queries, allowing you to correlate data from clusters in different subscriptions using a single KQL query. For Grafana dashboards, use a single Managed Grafana instance connected to multiple data sources (one per subscription).
What are the most important metrics to alert on for AKS?
The most critical AKS metrics to alert on are: node CPU usage above 80% (indicates need to scale), node memory usage above 85% (risk of pod evictions), pod restart count exceeding 3 in 10 minutes (application instability), API server P95 latency above 1 second (control plane issues), persistent volume usage above 85% (disk space exhaustion), and pod pending state lasting more than 5 minutes (scheduling problems). For applications, monitor HTTP error rates above 5%, request latency P95 exceeding SLA thresholds, and application-specific business metrics like failed payment transactions. Focus on symptoms that impact users rather than low-level technical metrics to reduce alert fatigue.
How do I troubleshoot high memory usage in AKS pods?
To troubleshoot high memory usage in AKS pods, start by identifying which pods are consuming the most memory using kubectl top pods --sort-by=memory -A. Check if pods are approaching their memory limits with kubectl describe pod <pod-name> -n <namespace> and look for OOMKilled events. Examine container logs for OutOfMemoryError messages using kubectl logs <pod-name> --previous. Query historical memory usage in Log Analytics to identify trends and memory leaks. For Java applications, analyze heap dumps; for Node.js, use heap snapshots. Compare memory requests vs. actual usage to right-size resource allocations. Consider implementing horizontal pod autoscaling based on memory metrics if usage varies with load, or increase memory limits if the application legitimately requires more resources.
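The "compare requests vs. actual usage" step can be reduced to a simple heuristic. The 25% headroom and the 2x overprovisioning cutoff below are arbitrary example values, not recommendations from any tool:

```python
def rightsize_memory(request_mi, usage_p95_mi, headroom=1.25):
    """Suggest a memory request from P95 usage plus headroom, and flag
    containers whose current request is more than double the suggestion."""
    suggested = round(usage_p95_mi * headroom)
    overprovisioned = request_mi > 2 * suggested
    return suggested, overprovisioned

# A pod requesting 2Gi but using ~300Mi at P95 is heavily overprovisioned.
print(rightsize_memory(request_mi=2048, usage_p95_mi=300))  # (375, True)
```

Run this against P95 usage from Log Analytics rather than point-in-time kubectl top output, which can miss peaks.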
Conclusion: Elevating Your AKS Observability
Mastering Azure Kubernetes Service monitoring in 2026 is not just about collecting data—it's about gaining actionable insights that drive reliability, performance, and security. By implementing a layered monitoring approach with Azure Monitor, integrating open-source tools like Prometheus and Grafana, and adopting proactive alerting strategies focused on symptoms rather than noise, you can build highly observable AKS environments that catch problems before they impact users.
The key is treating monitoring as a first-class concern from day one, not an afterthought. Instrument your applications, configure comprehensive logging, set meaningful alerts, and regularly review your monitoring strategy as your systems evolve.
If you want to take your AKS operations to the next level and eliminate the tedious manual work of troubleshooting, OpsSqad's AI-powered approach transforms how you interact with your clusters. Instead of memorizing kubectl commands and spending hours diagnosing issues, you can simply describe problems in natural language and let specialized AI agents do the investigation work for you.
Ready to experience the future of AKS monitoring and troubleshooting? Create your free account at app.opssquad.ai and see how OpsSqad can transform your operational efficiency in minutes, not days.