# Debugging Guide

This guide provides common commands and procedures for debugging the Kubernetes infrastructure, with a focus on Monitoring, Loki, and Grafana.

## Prerequisites

Ensure you have your environment activated:

```bash
source ~/bin/activate-stg  # or activate-prod
```

## Monitoring & Logging (Loki, Grafana, Prometheus)

### 1. Quick Status Check

Check if all pods are running in the relevant namespaces:

```bash
kubectl get pods -n monitoring
kubectl get pods -n loki
```

### 2. Verifying Loki Log Ingestion

**Check if Loki is receiving logs:**
Counterintuitively, "duplicate entry" errors are a *good* sign: they show that logs are reaching Loki and that the sender is simply retrying.

```bash
kubectl -n loki logs -l app.kubernetes.io/name=loki --tail=50
```

**Check if the Grafana Agent is sending logs:**
The Agent runs as a DaemonSet, so the label selector below returns the last 50 lines from each matching agent pod:

```bash
kubectl -n loki logs -l app.kubernetes.io/name=grafana-agent -c grafana-agent --tail=50
```

Look for errors such as `401 Unauthorized` or `403 Forbidden`, which point to a credentials or permissions problem.

**Inspect Agent Configuration:**
Verify the Agent is actually configured to scrape what you expect:

```bash
kubectl -n loki get secret loki-logs-config -o jsonpath='{.data.agent\.yml}' | base64 -d
```

### 3. Debugging Grafana

**Check Grafana Logs:**
Look for datasource provisioning errors or plugin issues:

```bash
kubectl -n monitoring logs deployment/monitoring-grafana --tail=100
```

**Verify Datasource Provisioning:**
Grafana uses a sidecar to watch secrets and provision datasources.
Check its logs:

```bash
kubectl -n monitoring logs deployment/monitoring-grafana -c grafana-sc-datasources --tail=100
```

**Inspect Provisioned Datasource File:**
Check the actual file generated inside the Grafana pod to ensure `uid`, `url`, etc., are correct:

```bash
kubectl -n monitoring exec deployment/monitoring-grafana -c grafana -- cat /etc/grafana/provisioning/datasources/datasource.yaml
```

**Restart Grafana:**
If you suspect the configuration hasn't been picked up:

```bash
kubectl -n monitoring rollout restart deployment/monitoring-grafana
```

### 4. Connectivity Verification (The "Nuclear" Option)

If the dashboard is empty but you think everything is working, run a query **directly from the Grafana pod** to Loki. This bypasses the UI and confirms network connectivity and data availability.

**Test Connectivity:**

```bash
kubectl -n monitoring exec deployment/monitoring-grafana -- \
  curl -s "http://loki.loki.svc:3100/loki/api/v1/labels"
```

**Query Actual Logs:**
This asks Loki for up to 10 of the most recent log lines for *any* job. If it returns JSON with a non-empty `result`, Loki is reachable from Grafana and holds data, so the problem lies in the datasource or dashboard configuration.

```bash
kubectl -n monitoring exec deployment/monitoring-grafana -- \
  curl -G -s "http://loki.loki.svc:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={job=~".+"}' \
  --data-urlencode 'limit=10'
```

## Common Issues & Fixes

### "Datasource not found" in Dashboard

* **Cause**: The dashboard expects a specific Datasource UID (e.g., `uid: loki`), but Grafana generated a random one.
* **Fix**: Ensure `values.yaml` explicitly sets the UID:

  ```yaml
  additionalDataSources:
    - name: Loki
      uid: loki  # <--- Critical
  ```

### Logs not showing in Loki

* **Cause**: The `PodLogs` resource might be missing, or the Agent doesn't have permissions.
* **Check**:
  1. Ensure the `ClusterRole` has the `pods/log` permission.
  2. Ensure the `LogsInstance` selector matches your `PodLogs` definition (or use `{}` to match all).

### ArgoCD Out of Sync

* **Cause**: Sometimes ArgoCD doesn't auto-prune resources or fails to update Secrets.
* **Fix**: Sync manually with "Prune" enabled, or delete the conflicting resource yourself if it is safe to do so.
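
The manual sync with pruning can also be run from the command line. The following is a minimal sketch, assuming the `argocd` CLI is installed and already logged in to your ArgoCD server. The function name `sync_with_prune`, the `DRY_RUN` switch, and the application name `monitoring` are hypothetical and only used for illustration; substitute your own application name.

```bash
# Sketch only: assumes the `argocd` CLI is installed and logged in.
# `sync_with_prune` and DRY_RUN are illustrative names, not part of ArgoCD.
sync_with_prune() {
  # First argument: the ArgoCD application name (required).
  local app="${1:?usage: sync_with_prune APP_NAME}"

  # With DRY_RUN=1, print the commands instead of executing them.
  local run=""
  if [ "${DRY_RUN:-0}" = "1" ]; then
    run="echo"
  fi

  # Show what differs between Git and the cluster before changing anything.
  # `argocd app diff` exits non-zero when differences are found, so ignore its status.
  $run argocd app diff "$app" || true

  # Sync and delete resources that are no longer defined in Git.
  $run argocd app sync "$app" --prune
}
```

Preview the commands first with `DRY_RUN=1 sync_with_prune monitoring`, then run `sync_with_prune monitoring` once the diff looks as expected. Pruning deletes resources from the cluster, so review the diff output before syncing.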