video-iac/k8s/DEBUGGING.md

# Debugging Guide

This guide provides common commands and procedures for debugging the Kubernetes infrastructure, with a focus on Monitoring, Loki, and Grafana.

## Prerequisites

Ensure you have your environment activated:
```bash
source ~/bin/activate-stg  # or activate-prod
```

## Monitoring & Logging (Loki, Grafana, Prometheus)

### 1. Quick Status Check
Check if all pods are running in the relevant namespaces:
```bash
kubectl get pods -n monitoring
kubectl get pods -n loki
```
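
When a namespace has many pods, a small filter makes the unhealthy ones stand out. The sketch below is only an illustration: `unhealthy_pods` is a made-up helper name, and it assumes the default column layout of `kubectl get pods --no-headers` (NAME, READY, STATUS, ...).

```bash
# Hypothetical helper: list pods that are not fully ready.
# Reads `kubectl get pods --no-headers` output on stdin and assumes the
# default columns: NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  awk '
    $3 == "Completed" { next }            # finished Jobs are fine
    {
      split($2, ready, "/")               # READY looks like "1/2"
      if ($3 != "Running" || ready[1] != ready[2])
        print $1, $2, $3
    }
  '
}

# Usage (needs cluster access):
#   kubectl get pods -n monitoring --no-headers | unhealthy_pods
#   kubectl get pods -n loki --no-headers | unhealthy_pods
```

No output means every pod is either `Running` with all containers ready or `Completed`.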

### 2. Verifying Loki Log Ingestion

**Check if Loki is receiving logs:**
Counterintuitively, "duplicate entry" errors in the Loki logs are a *good* sign: they show that logs are reaching Loki, and they only mean that a client is retrying entries Loki has already stored.
```bash
kubectl -n loki logs -l app.kubernetes.io/name=loki --tail=50
```

**Check if the Grafana Agent is sending logs:**
The Agent runs as a DaemonSet, so there are usually several agent pods. The label selector below returns the last 50 lines from each matching pod:
```bash
kubectl -n loki logs -l app.kubernetes.io/name=grafana-agent -c grafana-agent --tail=50
```
Look for errors like `401 Unauthorized` or `403 Forbidden`, which mean the Agent's requests are being rejected for authentication or authorization reasons.
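
On a busy cluster the agent logs are noisy, so it can help to filter for those status codes. This is a rough sketch with a made-up helper name (`auth_errors`). It matches the bare numbers 401 and 403, so expect occasional false positives from unrelated fields.

```bash
# Hypothetical helper: keep only log lines that mention 401 or 403 as a
# standalone number. Reads log text on stdin; prints nothing if none match.
auth_errors() {
  grep -E '(^|[^0-9])(401|403)([^0-9]|$)' || true
}

# Usage (needs cluster access):
#   kubectl -n loki logs -l app.kubernetes.io/name=grafana-agent \
#     -c grafana-agent --tail=500 | auth_errors
```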

**Inspect Agent Configuration:**
Verify the Agent is actually configured to scrape what you expect:
```bash
kubectl -n loki get secret loki-logs-config -o jsonpath='{.data.agent\.yml}' | base64 -d
```

### 3. Debugging Grafana

**Check Grafana Logs:**
Look for datasource provisioning errors or plugin issues:
```bash
kubectl -n monitoring logs deployment/monitoring-grafana -c grafana --tail=100
```

**Verify Datasource Provisioning:**
Grafana uses a sidecar to watch secrets and provision datasources. Check its logs:
```bash
kubectl -n monitoring logs deployment/monitoring-grafana -c grafana-sc-datasources --tail=100
```

**Inspect Provisioned Datasource File:**
Check the actual file generated inside the Grafana pod to ensure `uid`, `url`, etc., are correct:
```bash
kubectl -n monitoring exec deployment/monitoring-grafana -c grafana -- cat /etc/grafana/provisioning/datasources/datasource.yaml
```
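
To see just the fields that usually matter, the file can be piped through a small filter. The helper name `datasource_summary` is made up for this sketch, and it is a plain text match rather than a YAML parser, so it assumes one key per line.

```bash
# Hypothetical helper: print only the name, uid and url lines of a
# datasource provisioning file read from stdin.
datasource_summary() {
  grep -E '^[[:space:]-]*(name|uid|url):' || true
}

# Usage (needs cluster access):
#   kubectl -n monitoring exec deployment/monitoring-grafana -c grafana -- \
#     cat /etc/grafana/provisioning/datasources/datasource.yaml | datasource_summary
```

If the `uid` line is missing for a datasource, Grafana assigns one itself, which is the situation described under "Datasource not found" in Common Issues & Fixes.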

**Restart Grafana:**
If you suspect configuration hasn't been picked up:
```bash
kubectl -n monitoring rollout restart deployment/monitoring-grafana
```

### 4. Connectivity Verification (The "Nuclear" Option)

If the dashboard is empty but you think everything is working, run a query **directly from the Grafana pod** to Loki. This bypasses the UI and confirms network connectivity and data availability.

**Test Connectivity:**
```bash
kubectl -n monitoring exec deployment/monitoring-grafana -c grafana -- curl -s "http://loki.loki.svc:3100/loki/api/v1/labels"
```

**Query Actual Logs:**
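This asks Loki for up to 10 recent log lines from *any* job. With no time range given, `query_range` covers the last hour. A JSON response with `"status":"success"` and a non-empty `result` array confirms that Loki is storing and serving logs. An empty `result` means the query ran but matched nothing in that window.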
```bash
kubectl -n monitoring exec deployment/monitoring-grafana -c grafana -- curl -G -s "http://loki.loki.svc:3100/loki/api/v1/query_range" --data-urlencode 'query={job=~".+"}' --data-urlencode 'limit=10'
```
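
Loki also returns JSON when a query matches nothing, so the response body needs a closer look than "did I get JSON". The helper below is a rough sketch with a made-up name (`loki_result_state`). It uses text matching instead of a JSON parser, so treat its verdict as a hint.

```bash
# Hypothetical helper: classify a Loki query response read from stdin.
# Prints one of: error, empty, has-data.
loki_result_state() {
  local body
  body=$(cat)
  if ! printf '%s' "$body" | grep -q '"status"[[:space:]]*:[[:space:]]*"success"'; then
    echo "error"      # no success status: query rejected or Loki unreachable
  elif printf '%s' "$body" | grep -Eq '"result"[[:space:]]*:[[:space:]]*\[[[:space:]]*\]'; then
    echo "empty"      # query ran but matched no log streams
  else
    echo "has-data"   # at least one stream came back
  fi
}

# Usage (needs cluster access): pipe the curl output from the command above, e.g.
#   kubectl -n monitoring exec deployment/monitoring-grafana -- curl -G -s ... | loki_result_state
```

`error` points at connectivity or the query itself, `empty` points at ingestion or label selection, and `has-data` suggests the problem is on the Grafana side.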

## Common Issues & Fixes

### "Datasource not found" in Dashboard
*   **Cause**: The dashboard expects a specific Datasource UID (e.g., `uid: loki`), but Grafana generated a random one.
*   **Fix**: Ensure `values.yaml` explicitly sets the UID:
    ```yaml
    additionalDataSources:
      - name: Loki
        uid: loki  # <--- Critical
    ```

### Logs not showing in Loki
*   **Cause**: The `PodLogs` resource might be missing, or the Agent doesn't have permissions.
*   **Check**:
    1.  Ensure the `ClusterRole` bound to the Agent's service account includes the `pods/log` resource.
    2.  Ensure `LogsInstance` selector matches your `PodLogs` definition (or use `{}` to match all).
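
The permission check can be done from the command line with `kubectl auth can-i` and impersonation. This is a sketch under assumptions: the helper names are made up, the service account name depends on your install, and your own user needs permission to impersonate service accounts.

```bash
# Build the impersonation identity for a service account.
sa_user() {
  printf 'system:serviceaccount:%s:%s' "$1" "$2"
}

# Hypothetical helper: ask the API server whether a service account may
# read pod logs. Arguments: namespace and service account name.
# kubectl prints "yes" or "no". KUBECTL can be overridden
# (e.g. KUBECTL=echo) to print the command instead of running it.
agent_can_read_logs() {
  "${KUBECTL:-kubectl}" auth can-i get pods --subresource=log \
    --as="$(sa_user "$1" "$2")"
}

# Usage (needs cluster access; replace the placeholder with the real name):
#   agent_can_read_logs loki <agent-service-account>
```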

### ArgoCD Out of Sync
*   **Cause**: Sometimes ArgoCD doesn't auto-prune resources or fails to update Secrets.
*   **Fix**: Sync manually with "Prune" enabled, or delete the conflicting resource manually if safe.
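
The same fix can be applied from the `argocd` CLI instead of the UI. The sketch below assumes you are already logged in with `argocd login`. The helper name and the `ARGOCD` override are made up, and the Application name is whatever your setup uses.

```bash
# Hypothetical helper: show the pending diff, then sync with pruning enabled.
# Argument: the ArgoCD Application name.
# ARGOCD can be overridden (e.g. ARGOCD=echo) to print the commands
# instead of running them.
argocd_prune_sync() {
  local app="$1"
  # `argocd app diff` exits non-zero when differences exist, so ignore its status.
  "${ARGOCD:-argocd}" app diff "$app" || true
  "${ARGOCD:-argocd}" app sync "$app" --prune
}

# Usage (needs ArgoCD access; replace the placeholder with the real name):
#   argocd_prune_sync <application-name>
```

Review the diff output before relying on the sync, since pruning deletes resources that are no longer defined in Git.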

Adding loki support 2025-12-12 04:57:03 +00:00