Troubleshooting Guide ¶
This guide covers common issues and their solutions for the Mnemonica Kubernetes deployment.
Table of Contents ¶
- Pod Issues
- Storage Issues
- Networking Issues
- Service-Specific Issues
- Performance Issues
- Debugging Commands
Pod Issues ¶
Pods Not Starting (CrashLoopBackOff) ¶
Symptoms: Pods repeatedly restart and show CrashLoopBackOff status
Diagnosis:
kubectl get pods
kubectl describe pod <pod-name>
kubectl logs <pod-name> --previous
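When many pods are listed, the STATUS column can be filtered with awk. The snippet below runs against a captured sample of `kubectl get pods` output (the pod names are made up); on a live cluster, pipe the real command into the same filter:

```shell
# Print the NAME of every pod whose STATUS column is CrashLoopBackOff.
# A captured sample stands in for live `kubectl get pods` output here.
sample='NAME                   READY   STATUS             RESTARTS   AGE
backend-6d4f9c-abcde   0/1     CrashLoopBackOff   12         45m
frontend-7b8c-xyz12    1/1     Running            0          2d'
printf '%s\n' "$sample" | awk '$3 == "CrashLoopBackOff" {print $1}'
```

On a cluster this becomes `kubectl get pods | awk '$3 == "CrashLoopBackOff" {print $1}'`.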
Common Causes:
- Missing Secrets
  - Check if all required secrets exist:
    kubectl get secrets
  - Required secrets: uwsgi, centrifugo, fsa, regcred, juicefs-secret, juicefs-rabbitmq
  - Solution: Ensure secrets are properly linked from the secrets directory
- ConfigMap Issues
  - Verify ConfigMaps exist:
    kubectl get configmaps
  - Ensure uwsgi-env, frontend-env, and other ConfigMaps are applied
- Image Pull Errors
  - Check if regcred is valid:
    kubectl get secret regcred -o yaml
  - Verify the image exists: infomne/uwsgi-apps:v2.14.6-1-gab90065
  - Check pull status:
    kubectl describe pod <pod-name> | grep -A 5 Events
Pods Pending ¶
Symptoms: Pods stuck in Pending state
Diagnosis:
kubectl describe pod <pod-name>
Common Causes:
- Insufficient Resources
  - Check node resources:
    kubectl top nodes
    kubectl describe nodes
  - Solution: The cluster autoscaler should add nodes, or adjust resource requests
- PVC Not Bound
  - Check PVC status:
    kubectl get pvc
  - Solution: Ensure the JuiceFS CSI driver is installed and secrets are configured
- Anti-Affinity Rules
  - If you have only one node, anti-affinity rules will prevent pod scheduling
  - Temporarily remove anti-affinity for dev/testing, or add more nodes
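The stanza to look for in the deployment spec resembles the sketch below; the label value is illustrative, not taken from this deployment:

```yaml
# Illustrative podAntiAffinity stanza -- commenting this out (dev/testing only)
# lets multiple replicas schedule onto a single node.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: backend        # illustrative label
        topologyKey: kubernetes.io/hostname
```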
ImagePullBackOff ¶
Symptoms: Cannot pull container images
Solutions:
# Recreate registry credentials
kubectl delete secret regcred
kubectl create secret docker-registry regcred \
--docker-server=<registry-server> \
--docker-username=<username> \
--docker-password=<password> \
--docker-email=<email>
Storage Issues ¶
PVC Not Binding ¶
Symptoms: PersistentVolumeClaims stuck in Pending state
Diagnosis:
kubectl get pvc
kubectl describe pvc <pvc-name>
kubectl get pv
Common Causes:
- JuiceFS CSI Driver Not Installed
  - Check driver status:
    helm list -n kube-system | grep juicefs
    kubectl get pods -n kube-system | grep juicefs
  - Solution: Install the JuiceFS CSI driver:
    helm upgrade -i juicefs-csi-driver -n kube-system \
      -f custom-values/juicefs-custom-values-dev.yaml
- JuiceFS Secrets Missing
  - Check secrets:
    kubectl get secret juicefs-secret
    kubectl get secret juicefs-rabbitmq
  - Solution: Create JuiceFS secrets with proper credentials
- PV/PVC Label Mismatch
  - Check that PV labels match the PVC selector:
    kubectl get pv dev-media -o yaml | grep -A 3 labels
    kubectl get pvc dev-media -o yaml | grep -A 5 selector
Storage Access Issues ¶
Symptoms: Pods can't read/write to mounted volumes
Diagnosis:
# Exec into pod
kubectl exec -it <pod-name> -- ls -la /var/www/media/
# Check mount status
kubectl exec -it <pod-name> -- mount | grep juicefs
# Check JuiceFS mount pod logs
kubectl logs -n kube-system <juicefs-mount-pod>
Solutions:
- Verify JuiceFS credentials are correct
- Check S3 bucket permissions (JuiceFS backend)
- Verify cache-size mount options
- Check node connectivity to JuiceFS backend
Networking Issues ¶
Service Not Accessible ¶
Symptoms: Cannot reach services from within cluster
Diagnosis:
kubectl get svc
kubectl describe svc <service-name>
kubectl get endpoints <service-name>
Solutions:
- No Endpoints
  - The service selector doesn't match the pod labels
  - Verify labels:
    kubectl get pods --show-labels
    kubectl get svc <service-name> -o yaml | grep selector
- Port Mismatch
  - Verify the service port matches the container port:
    kubectl get svc <service-name> -o yaml
    kubectl get pod <pod-name> -o yaml | grep containerPort
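For reference, the selector on a Service must equal the labels on the Deployment's pod template; names, ports, and the image below are illustrative:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: backend            # illustrative
spec:
  selector:
    app: backend           # must equal the pod template label below
  ports:
    - port: 8000
      targetPort: 8000
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
spec:
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend       # matched by the Service selector
    spec:
      containers:
        - name: backend
          image: example/backend:latest   # illustrative image
          ports:
            - containerPort: 8000         # must match targetPort
```

If these labels drift apart, the Service shows no endpoints even though the pods are healthy.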
External Access Issues ¶
Symptoms: Cannot access application from devel.mnemonica.com
Diagnosis:
# Check NodePort service
kubectl get svc frontend
curl http://<node-ip>:30001
# Check AWS Load Balancer Controller
kubectl get pods -n kube-system | grep aws-load-balancer
kubectl logs -n kube-system <alb-controller-pod>
# Check Target Group Bindings
kubectl get targetgroupbindings
kubectl describe targetgroupbinding <tgb-name>
Solutions:
- Verify AWS ALB is created and healthy
- Check security groups allow traffic
- Verify DNS points to correct load balancer
- Check Target Group health checks
DNS Resolution Issues ¶
Symptoms: Services can't resolve DNS names
Diagnosis:
# Test DNS from pod
kubectl exec -it <pod-name> -- nslookup redis-master
kubectl exec -it <pod-name> -- nslookup backend.default.svc.cluster.local
# Check CoreDNS
kubectl get pods -n kube-system | grep coredns
kubectl logs -n kube-system <coredns-pod>
Solutions:
- Restart CoreDNS:
  kubectl rollout restart deployment/coredns -n kube-system
- Check the service exists:
  kubectl get svc
- Use the FQDN:
  <service>.<namespace>.svc.cluster.local
Service-Specific Issues ¶
Backend Issues ¶
Symptoms: Backend pod crashes or returns errors
Diagnosis:
kubectl logs -f deployment/backend
kubectl exec -it <backend-pod> -- python manage.py check
Common Issues:
- Database Connection Failed
  - Check PostgreSQL credentials in the uwsgi secret
  - Verify PGSSLCERT exists at /tmp/postgresql.crt
  - Test the connection from the pod:
    kubectl exec -it <backend-pod> -- psql <connection-string>
- Redis Connection Failed
  - Check Redis is running:
    kubectl get pods | grep redis
  - Test the connection:
    kubectl exec -it <backend-pod> -- redis-cli -h redis-master ping
- RabbitMQ Connection Failed
  - Check RabbitMQ:
    kubectl get pods | grep rabbitmq
  - Verify credentials in CELERY_BROKER_URL (uwsgi-env ConfigMap)
  - Test the connection:
    kubectl exec -it <backend-pod> -- curl http://guest:guest@rabbitmq:15672/api/whoami
Celery Issues ¶
Symptoms: Tasks not processing, workers offline
Diagnosis:
kubectl logs -f deployment/celery
kubectl logs -f deployment/celery-encoding
# Check Flower monitoring
kubectl port-forward deployment/flower 5555:5555
# Open http://localhost:5555
Common Issues:
- Workers Not Connecting to Broker
  - Verify CELERY_BROKER_URL in the uwsgi-env ConfigMap
  - Check RabbitMQ logs:
    kubectl logs <rabbitmq-pod>
- Tasks Failing
  - Check worker logs for exceptions
  - Verify shared storage is accessible
  - Check encoding parameters in the uwsgi-env ConfigMap
Encoding Pool Issues ¶
Symptoms: Video encoding not working
Diagnosis:
kubectl logs -f deployment/encoding-pool-master
kubectl get jobs | grep encoding-worker
kubectl logs job/<encoding-worker-job>
Common Issues:
- Workers Not Starting
  - Check encoding-pool-master has kubectl access:
    kubectl exec -it <master-pod> -- kubectl get pods
  - Verify the epmaster service account:
    kubectl get sa epmaster
  - Check the role binding:
    kubectl get rolebinding | grep epmaster
- Encoding Failures
  - Check GPU availability (if using h264_nvenc)
  - Verify ffmpeg is installed:
    kubectl exec -it <worker-pod> -- ffmpeg -version
  - Check encoding parameters in the uwsgi-env ConfigMap
  - Review worker job logs
TUSD Upload Issues ¶
Symptoms: File uploads failing or hanging
Diagnosis:
# Check TUSD pod
kubectl logs -f <tusd-pod>
# Check tus-hook-listener
kubectl logs -f deployment/tus-hook-listener
# Check lock files
kubectl exec -it <tusd-pod> -- find /var/www/media/mnemonica/storage/tus -name "*.lock"
Solutions:
- Stale Lock Files
  - The TUSD init container should remove them, but you can clean up manually:
    kubectl exec -it <tusd-pod> -- find /var/www/media/mnemonica/storage/tus -name "*.lock" -delete
- Storage Full
  - Check JuiceFS backend (S3) storage
  - Review the upload directory size
- Hook Listener Not Responding
  - Verify the tus-hook-listener service:
    kubectl get svc tus-hook-listener
  - Check the pod is running:
    kubectl get pods -l app=tus-hook-listener
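The `find ... -name "*.lock" -delete` invocation can be previewed locally before running it inside the pod. This sketch exercises the same pattern against a throwaway directory with fake lock files:

```shell
# Preview of the lock-file cleanup: create fake .lock files in a temp dir,
# then delete them with the same find pattern used inside the TUSD pod.
tmpdir=$(mktemp -d)
touch "$tmpdir/upload-a.lock" "$tmpdir/upload-b.lock" "$tmpdir/upload-a.bin"
find "$tmpdir" -name "*.lock" -delete
ls "$tmpdir"           # only upload-a.bin remains
rm -rf "$tmpdir"
```

Note that `-delete` only matches the `-name` pattern, so in-progress upload data (`.bin`, `.info` files) is left untouched.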
Centrifugo WebSocket Issues ¶
Symptoms: Real-time features not working
Diagnosis:
kubectl logs -f <centrifugo-pod>
# Check service
kubectl get svc centrifugo
# Test connection from backend
kubectl exec -it <backend-pod> -- curl http://centrifugo:9000/api
Solutions:
- Token Mismatch
  - Verify the centrifugo secret matches the backend configuration
  - Check tokenHmacSecretKey and apiKey are set correctly
- Connection Refused
  - Ensure the Centrifugo pod is running
  - Verify service endpoints:
    kubectl get endpoints centrifugo
Redis Issues ¶
Symptoms: Cache not working, sessions lost
Diagnosis:
kubectl logs -f <redis-pod>
kubectl exec -it <redis-pod> -- redis-cli ping
kubectl exec -it <redis-pod> -- redis-cli INFO
Solutions:
- Redis Out of Memory
  - Check memory usage:
    kubectl exec -it <redis-pod> -- redis-cli INFO memory
  - Increase resource limits or add an eviction policy
- Connection Refused
  - Verify the redis-master service:
    kubectl get svc redis-master
  - Check MNEMONICA_CACHE_URL in the uwsgi-env ConfigMap
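An eviction policy is set in the Redis configuration (or at runtime via `redis-cli CONFIG SET`); the values below are illustrative, not taken from this deployment:

```
# redis.conf fragment (illustrative values): cap memory and evict
# least-recently-used keys instead of failing writes when full
maxmemory 512mb
maxmemory-policy allkeys-lru
```

For a pure cache this is usually safe; for session data, prefer `volatile-lru` so only keys with a TTL are evicted.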
Performance Issues ¶
Slow Response Times ¶
Diagnosis:
# Check resource usage
kubectl top pods
kubectl top nodes
# Check pod resource limits
kubectl describe pod <pod-name> | grep -A 5 Limits
# Check for throttling
kubectl describe pod <pod-name> | grep -i throttling
Solutions:
- CPU/Memory Limits Too Low
  - Uncomment and adjust resource requests/limits in the deployment files
  - Example:
    resources:
      requests:
        cpu: 100m
        memory: 512Mi
      limits:
        cpu: 1000m
        memory: 2Gi
- Storage I/O Issues
  - Increase the JuiceFS cache-size mount option
  - Check S3 backend performance
  - Monitor JuiceFS mount pod metrics
- Too Few Replicas
  - Scale critical services:
    kubectl scale deployment backend --replicas=2
    kubectl scale deployment celery --replicas=3
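Beyond manual `kubectl scale`, a HorizontalPodAutoscaler can adjust replica counts automatically. The sketch below assumes metrics-server is installed; the target deployment and thresholds are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend              # illustrative
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
```

Note the HPA needs CPU requests set on the target pods to compute utilization.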
High Memory Usage ¶
Diagnosis:
kubectl top pods
kubectl exec -it <pod-name> -- free -h
Solutions:
- Enable memory limits
- Check for memory leaks in application code
- Restart pods periodically if needed
- Increase node instance size
Debugging Commands ¶
Get Resource Status ¶
# All resources overview
kubectl get all
# Specific resource types
kubectl get pods
kubectl get svc
kubectl get deployments
kubectl get pvc
kubectl get configmaps
kubectl get secrets
kubectl get jobs
# Across all namespaces
kubectl get pods -A
helm list -A
Describe Resources ¶
kubectl describe pod <pod-name>
kubectl describe deployment <deployment-name>
kubectl describe pvc <pvc-name>
kubectl describe node <node-name>
Logs ¶
# Current logs
kubectl logs <pod-name>
kubectl logs -f <pod-name> # Follow
kubectl logs <pod-name> -c <container-name> # Multi-container pod
# Previous crashed pod logs
kubectl logs <pod-name> --previous
# Deployment logs (any pod)
kubectl logs -f deployment/<deployment-name>
# Last N lines
kubectl logs <pod-name> --tail=100
# Logs with timestamps
kubectl logs <pod-name> --timestamps
Execute Commands in Pods ¶
# Interactive shell
kubectl exec -it <pod-name> -- /bin/bash
kubectl exec -it <pod-name> -- /bin/sh # If bash not available
# Single command
kubectl exec <pod-name> -- ls -la /var/www/media
kubectl exec <pod-name> -- env
kubectl exec <pod-name> -- ps aux
# Python management commands (backend)
kubectl exec -it <backend-pod> -- python manage.py shell
kubectl exec -it <backend-pod> -- python manage.py dbshell
kubectl exec -it <backend-pod> -- python manage.py check
Port Forwarding ¶
# Forward local port to pod
kubectl port-forward pod/<pod-name> 8080:80
kubectl port-forward deployment/<deployment-name> 8080:80
kubectl port-forward svc/<service-name> 8080:80
# Access at http://localhost:8080
Copy Files ¶
# From pod to local
kubectl cp <pod-name>:/path/to/file ./local-file
# From local to pod
kubectl cp ./local-file <pod-name>:/path/to/file
Resource Usage ¶
# Real-time resource monitoring
kubectl top pods
kubectl top nodes
# Specific pod
kubectl top pod <pod-name>
# Sort by CPU/Memory
kubectl top pods --sort-by=cpu
kubectl top pods --sort-by=memory
Events ¶
# Cluster events
kubectl get events --sort-by='.lastTimestamp'
# Specific resource events
kubectl describe pod <pod-name> | grep -A 20 Events
Restart Deployments ¶
# Graceful restart
kubectl rollout restart deployment/<deployment-name>
# Force delete and recreate pod
kubectl delete pod <pod-name>
# Restart all pods with specific label
kubectl delete pods -l app=backend
Config and Secrets ¶
# View ConfigMap
kubectl get configmap uwsgi-env -o yaml
# Edit ConfigMap (will auto-reload with Reloader)
kubectl edit configmap uwsgi-env
# Decode secret (base64)
kubectl get secret uwsgi -o jsonpath='{.data.DATABASE_URL}' | base64 -d
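The jsonpath output above is raw base64; the decode step can be sanity-checked locally with a made-up value (no cluster needed):

```shell
# Round-trip check of the base64 decode used above, with a made-up value.
encoded=$(printf 'postgres://user:pass@db:5432/app' | base64)
printf '%s' "$encoded" | base64 -d   # prints the original connection string
```

If the decoded value looks like more base64, the secret was double-encoded when it was created.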
Network Testing ¶
# Test service connectivity from pod
kubectl exec -it <pod-name> -- curl http://<service-name>:<port>
kubectl exec -it <pod-name> -- wget -O- http://<service-name>:<port>
kubectl exec -it <pod-name> -- nc -zv <service-name> <port>
# DNS testing
kubectl exec -it <pod-name> -- nslookup <service-name>
kubectl exec -it <pod-name> -- dig <service-name>.default.svc.cluster.local
Helm Debugging ¶
# List releases
helm list -A
# Check release status
helm status <release-name>
# View release values
helm get values <release-name>
# View all release details
helm get all <release-name>
# Rollback
helm rollback <release-name> <revision>
Node Debugging ¶
# Node details
kubectl describe node <node-name>
# Drain node (evict pods)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# Uncordon node (allow scheduling)
kubectl uncordon <node-name>
# Cordon node (prevent new pods)
kubectl cordon <node-name>
Emergency Procedures ¶
Complete Application Restart ¶
# Restart all application services
kubectl rollout restart deployment/backend
kubectl rollout restart deployment/frontend
kubectl rollout restart deployment/celery
kubectl rollout restart deployment/celerybeat
kubectl rollout restart deployment/celery-encoding
kubectl rollout restart deployment/encoding-pool-master
kubectl rollout restart deployment/tus-hook-listener
kubectl rollout restart deployment/traffic-log-listener
kubectl rollout restart deployment/flower
Clear All Pods ¶
# Delete all pods (will be recreated by deployments)
kubectl delete pods -l pdb=minAvail1
Reset Redis ¶
kubectl delete pod -l app=redis-master
# Or flush all data
kubectl exec -it <redis-pod> -- redis-cli FLUSHALL
Reset RabbitMQ ¶
# Restart RabbitMQ (tasks in queue will be lost)
kubectl delete pod -l app.kubernetes.io/name=rabbitmq
Getting Help ¶
If issues persist:
- Collect diagnostics:
  kubectl get all -o wide > cluster-state.txt
  kubectl describe pods >> cluster-state.txt
  kubectl get events --sort-by='.lastTimestamp' >> cluster-state.txt
- Check CloudWatch logs in the AWS Console
- Review Groundcover dashboards
- Contact support: