Debugging Guide

This guide provides common commands and procedures for debugging the Kubernetes infrastructure, with a focus on Monitoring, Loki, and Grafana.

Prerequisites

Ensure you have your environment activated:

source ~/bin/activate-stg  # or activate-prod

Monitoring & Logging (Loki, Grafana, Prometheus)

1. Quick Status Check

Check if all pods are running in the relevant namespaces:

kubectl get pods -n monitoring
kubectl get pods -n loki
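In a long listing, the unhealthy pods are easy to miss. The following is a minimal sketch of a filter, assuming the default column layout of kubectl get pods --no-headers (NAME, READY, STATUS, ...). The function name unhealthy_pods is hypothetical.

```shell
# Reads `kubectl get pods --no-headers` output on stdin and prints the pods
# that look unhealthy: any status other than Running/Completed, or Running
# with fewer ready containers than expected.
#
# Usage: kubectl get pods -n monitoring --no-headers | unhealthy_pods
unhealthy_pods() {
  awk '
    $3 == "Completed" { next }          # finished Jobs are fine
    {
      split($2, ready, "/")             # READY column looks like "1/1"
      if ($3 != "Running" || ready[1] != ready[2]) print $1, $2, $3
    }
  '
}
```

No output means every pod in the listing is Running with all containers ready, or Completed.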

2. Verifying Loki Log Ingestion

Check if Loki is receiving logs: Tail the Loki logs. Counterintuitively, "duplicate entry" errors are a good sign: they show that logs are reaching Loki, and they only mean that the sender is retrying entries Loki already has.

kubectl -n loki logs -l app.kubernetes.io/name=loki --tail=50
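To separate the harmless "duplicate entry" noise from other errors, the log output can be summarized. This is a rough sketch that assumes Loki logs in logfmt with a level= key and that the message contains the phrase "duplicate entry" as quoted above. The function name is hypothetical.

```shell
# Reads Loki log lines on stdin and prints two counters: lines mentioning
# "duplicate entry", and all other lines logged at level=error.
#
# Usage: kubectl -n loki logs -l app.kubernetes.io/name=loki --tail=500 | summarize_loki_errors
summarize_loki_errors() {
  awk '
    tolower($0) ~ /duplicate entry/ { dup++; next }
    /level=error/                   { err++ }
    END { printf "duplicate_entry=%d other_errors=%d\n", dup, err }
  '
}
```

A non-zero other_errors count is the part worth reading in full.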

Check if the Grafana Agent is sending logs: The Agent runs as a DaemonSet, so there is one agent pod per node. The label selector below returns the last 50 lines from each matching agent pod:

kubectl -n loki logs -l app.kubernetes.io/name=grafana-agent -c grafana-agent --tail=50

Look for errors like 401 Unauthorized or 403 Forbidden, which indicate that the Agent's credentials are being rejected when it pushes logs.
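A small filter can pull those lines out of a noisy log. This is a sketch only: the exact wording of the Agent's error messages is an assumption, and the function name is hypothetical.

```shell
# Reads Agent log lines on stdin and prints only those that mention an
# authentication or authorization failure. Prints nothing if there are none.
#
# Usage: kubectl -n loki logs -l app.kubernetes.io/name=grafana-agent -c grafana-agent --tail=500 | agent_auth_errors
agent_auth_errors() {
  # `|| true` keeps the exit status at 0 when grep finds no match.
  grep -iE '401 Unauthorized|403 Forbidden' || true
}
```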

Inspect Agent Configuration: Verify the Agent is actually configured to scrape what you expect:

kubectl -n loki get secret loki-logs-config -o jsonpath='{.data.agent\.yml}' | base64 -d
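The decoded configuration can be long. One quick thing to check is where the Agent pushes to. The sketch below assumes the config lists endpoints under url: keys, as in a typical clients block. The function name is hypothetical.

```shell
# Reads the decoded agent.yml on stdin and prints the value of every
# `url:` key, whether it appears as "url: X" or as a list item "- url: X".
#
# Usage: kubectl -n loki get secret loki-logs-config -o jsonpath='{.data.agent\.yml}' | base64 -d | agent_push_urls
agent_push_urls() {
  awk '
    $1 == "url:"               { print $2 }
    $1 == "-" && $2 == "url:"  { print $3 }
  '
}
```

The printed URL should point at the Loki push endpoint you expect, for example a host such as loki.loki.svc:3100.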

3. Debugging Grafana

Check Grafana Logs: Look for datasource provisioning errors or plugin issues:

kubectl -n monitoring logs deployment/monitoring-grafana -c grafana --tail=100

Verify Datasource Provisioning: Grafana uses a sidecar to watch secrets and provision datasources. Check its logs:

kubectl -n monitoring logs deployment/monitoring-grafana -c grafana-sc-datasources --tail=100

Inspect Provisioned Datasource File: Check the actual file generated inside the Grafana pod to ensure uid, url, etc., are correct:

kubectl -n monitoring exec deployment/monitoring-grafana -c grafana -- cat /etc/grafana/provisioning/datasources/datasource.yaml
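To compare the fields that matter at a glance, the file can be reduced to its name, type, uid, and url lines. This is a sketch that assumes the usual one-key-per-line YAML layout of a provisioning file. The function name is hypothetical.

```shell
# Reads a datasource provisioning file on stdin and prints only the
# name/type/uid/url lines, with indentation and list dashes removed.
#
# Usage: kubectl -n monitoring exec deployment/monitoring-grafana -c grafana -- \
#          cat /etc/grafana/provisioning/datasources/datasource.yaml | datasource_summary
datasource_summary() {
  grep -E '^[[:space:]]*(-[[:space:]]*)?(name|type|uid|url):' |
    sed -E 's/^[[:space:]]*(-[[:space:]]*)?//'
}
```

If a datasource has no uid line at all, Grafana will have generated one, which is the situation described under "Datasource not found" in Dashboard.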

Restart Grafana: If you suspect configuration hasn't been picked up:

kubectl -n monitoring rollout restart deployment/monitoring-grafana
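The restart returns immediately, so it helps to wait for the rollout before re-checking the dashboard. A minimal sketch using kubectl rollout status follows. The function name and the 120s default timeout are arbitrary choices.

```shell
# Restarts the Grafana deployment and blocks until the new pod is ready,
# or until the timeout (first argument, default 120s) expires.
#
# Usage: restart_grafana_and_wait 180s
restart_grafana_and_wait() {
  kubectl -n monitoring rollout restart deployment/monitoring-grafana &&
    kubectl -n monitoring rollout status deployment/monitoring-grafana --timeout="${1:-120s}"
}
```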

4. Connectivity Verification (The "Nuclear" Option)

If the dashboard is empty but you think everything is working, run a query directly from the Grafana pod to Loki. This bypasses the UI and confirms network connectivity and data availability.

Test Connectivity:

kubectl -n monitoring exec deployment/monitoring-grafana -c grafana -- curl -s "http://loki.loki.svc:3100/loki/api/v1/labels"
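A successful response is a JSON object with "status":"success" and a data array of label names. The sketch below prints those names one per line without requiring jq. It assumes that response shape, and the function name is hypothetical.

```shell
# Reads the /loki/api/v1/labels response on stdin and prints each label
# name on its own line. Prints nothing if the response has no "data" array
# (for example, an error page from a proxy).
#
# Usage: kubectl -n monitoring exec deployment/monitoring-grafana -c grafana -- \
#          curl -s "http://loki.loki.svc:3100/loki/api/v1/labels" | loki_label_names
loki_label_names() {
  tr -d '[:space:]' |
    sed -n -E 's/.*"data":\[([^]]*)\].*/\1/p' |
    tr -d '"' | tr ',' '\n'
}
```

If the job label is missing from the output, the job-based query in the next step cannot match anything.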

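Query Actual Logs: This asks Loki for up to 10 of the most recent log lines from any job. Without start or end parameters, query_range only covers the last hour. If the response contains a non-empty result array, Loki is reachable and holds recent data. An empty result array means the query succeeded but nothing matched in that window.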

kubectl -n monitoring exec deployment/monitoring-grafana -c grafana -- curl -G -s "http://loki.loki.svc:3100/loki/api/v1/query_range" --data-urlencode 'query={job=~".+"}' --data-urlencode 'limit=10'
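A JSON reply alone does not prove that log lines came back, because a successful query with no matches is also JSON. The sketch below classifies the response. It assumes the standard query_range shape ("status":"success" with a data.result array), and the function name is hypothetical.

```shell
# Reads a /loki/api/v1/query_range response on stdin and prints a verdict.
#
# Usage: <the query_range command above> | loki_query_verdict
loki_query_verdict() {
  # Strip whitespace so the match does not depend on JSON formatting.
  body=$(tr -d '[:space:]')
  case "$body" in
    *'"status":"success"'*'"result":[]'*)
      echo "reachable, but no log lines matched" ;;
    *'"status":"success"'*)
      echo "reachable, log lines returned" ;;
    *)
      echo "unexpected response" ;;
  esac
}
```

"reachable, but no log lines matched" points at ingestion or the time window rather than at networking.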

Common Issues & Fixes

"Datasource not found" in Dashboard

  • Cause: The dashboard expects a specific Datasource UID (e.g., uid: loki), but Grafana generated a random one.
  • Fix: Ensure values.yaml explicitly sets the UID:
    additionalDataSources:
      - name: Loki
        uid: loki  # <--- Critical
    

Logs not showing in Loki

  • Cause: The PodLogs resource might be missing, or the Agent doesn't have permissions.
  • Check:
    1. Ensure the Agent's ClusterRole grants the pods/log permission.
    2. Ensure the LogsInstance selector matches your PodLogs definition (or use {} to match all).
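The permission check can be made concrete with kubectl auth can-i, impersonating the Agent's ServiceAccount. This is a sketch: the ServiceAccount name grafana-agent and the loki namespace default are assumptions, so substitute the names used in your cluster. Impersonation also requires that your own user is allowed to use --as.

```shell
# Asks the API server whether the Agent's ServiceAccount may read pod logs
# in all namespaces. kubectl prints "yes" or "no".
#
# Usage: agent_can_read_logs [namespace] [serviceaccount]
agent_can_read_logs() {
  ns=${1:-loki}            # assumed namespace of the Agent
  sa=${2:-grafana-agent}   # hypothetical ServiceAccount name
  kubectl auth can-i get pods --subresource=log \
    --as="system:serviceaccount:${ns}:${sa}" --all-namespaces
}
```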

ArgoCD Out of Sync

  • Cause: ArgoCD does not prune resources during automated sync unless pruning is enabled in the sync policy, and it sometimes fails to update Secrets.
  • Fix: Sync manually with "Prune" enabled, or delete the conflicting resource manually if safe.
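The manual sync can also be run from the argocd CLI, if it is installed and logged in. The sketch below shows the pending diff first and then syncs with pruning. The function name is hypothetical, and the application name is whatever your ArgoCD Application is called.

```shell
# Shows what would change, then syncs the application with pruning enabled.
#
# Usage: argocd_prune_sync <app-name>
argocd_prune_sync() {
  app=${1:?usage: argocd_prune_sync APP_NAME}
  # `argocd app diff` exits non-zero when differences exist, so its exit
  # status is ignored here.
  argocd app diff "$app" || true
  argocd app sync "$app" --prune
}
```

Review the diff output before relying on this, since pruning deletes resources that are no longer defined in Git.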