Troubleshooting Guide ¶
This guide covers common issues and their solutions for the Mnemonica Kubernetes deployment.
Table of Contents ¶
- Pod Issues
- Storage Issues
- Networking Issues
- Service-Specific Issues
- Performance Issues
- Debugging Commands
Pod Issues ¶
Pods Not Starting (CrashLoopBackOff) ¶
Symptoms: Pods repeatedly restart and show CrashLoopBackOff status
Diagnosis:
kubectl get pods
kubectl describe pod <pod-name>
kubectl logs <pod-name> --previous
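When many pods are listed, the STATUS column can be filtered with awk. The snippet below runs against a captured sample of `kubectl get pods` output (the pod names are made up); on a live cluster, pipe the real command into the same filter:

```shell
# Print the NAME of every pod whose STATUS column is CrashLoopBackOff.
# A captured sample stands in for live `kubectl get pods` output here.
sample='NAME                   READY   STATUS             RESTARTS   AGE
backend-6d4f9c-abcde   0/1     CrashLoopBackOff   12         45m
frontend-7b8c-xyz12    1/1     Running            0          2d'
printf '%s\n' "$sample" | awk '$3 == "CrashLoopBackOff" {print $1}'
```

On a cluster this becomes `kubectl get pods | awk '$3 == "CrashLoopBackOff" {print $1}'`.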
Common Causes:
- Missing Secrets
  - Check if all required secrets exist:
    kubectl get secrets
  - Required secrets: uwsgi, centrifugo, fsa, regcred, juicefs-secret, juicefs-rabbitmq
  - Solution: Ensure secrets are properly linked from the secrets directory
- ConfigMap Issues
  - Verify ConfigMaps exist:
    kubectl get configmaps
  - Ensure uwsgi-env, frontend-env, and other ConfigMaps are applied
- Image Pull Errors
  - Check if regcred is valid:
    kubectl get secret regcred -o yaml
  - Verify the image exists: infomne/uwsgi-apps:v2.14.6-1-gab90065
  - Check pull status:
    kubectl describe pod <pod-name> | grep -A 5 Events
Pods Pending ¶
Symptoms: Pods stuck in Pending state
Diagnosis:
kubectl describe pod <pod-name>
Common Causes:
- Insufficient Resources
  - Check node resources:
    kubectl top nodes
    kubectl describe nodes
  - Solution: The cluster autoscaler should add nodes, or adjust resource requests
- PVC Not Bound
  - Check PVC status:
    kubectl get pvc
  - Solution: Ensure the JuiceFS CSI driver is installed and secrets are configured
- Anti-Affinity Rules
  - If you have only one node, anti-affinity rules will prevent pod scheduling
  - Temporarily remove anti-affinity for dev/testing, or add more nodes
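The stanza to look for in the deployment spec resembles the sketch below; the label value is illustrative, not taken from this deployment:

```yaml
# Illustrative podAntiAffinity stanza -- commenting this out (dev/testing only)
# lets multiple replicas schedule onto a single node.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: backend        # illustrative label
        topologyKey: kubernetes.io/hostname
```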
ImagePullBackOff ¶
Symptoms: Cannot pull container images
Solutions:
# Recreate registry credentials
kubectl delete secret regcred
kubectl create secret docker-registry regcred \
--docker-server=<registry-server> \
--docker-username=<username> \
--docker-password=<password> \
--docker-email=<email>
Storage Issues ¶
PVC Not Binding ¶
Symptoms: PersistentVolumeClaims stuck in Pending state
Diagnosis:
kubectl get pvc
kubectl describe pvc <pvc-name>
kubectl get pv
Common Causes:
- JuiceFS CSI Driver Not Installed
  - Check driver status:
    helm list -n kube-system | grep juicefs
    kubectl get pods -n kube-system | grep juicefs
  - Solution: Install the JuiceFS CSI driver:
    helm upgrade -i juicefs-csi-driver -n kube-system \
      -f custom-values/juicefs-custom-values-dev.yaml
- JuiceFS Secrets Missing
  - Check secrets:
    kubectl get secret juicefs-secret
    kubectl get secret juicefs-rabbitmq
  - Solution: Create JuiceFS secrets with proper credentials
- PV/PVC Label Mismatch
  - Check that PV labels match the PVC selector:
    kubectl get pv dev-media -o yaml | grep -A 3 labels
    kubectl get pvc dev-media -o yaml | grep -A 5 selector
Storage Access Issues ¶
Symptoms: Pods can't read/write to mounted volumes
Diagnosis:
# Exec into pod
kubectl exec -it <pod-name> -- ls -la /var/www/media/
# Check mount status
kubectl exec -it <pod-name> -- mount | grep juicefs
# Check JuiceFS mount pod logs
kubectl logs -n kube-system <juicefs-mount-pod>
Solutions:
- Verify JuiceFS credentials are correct
- Check S3 bucket permissions (JuiceFS backend)
- Verify cache-size mount options
- Check node connectivity to JuiceFS backend
Networking Issues ¶
Service Not Accessible ¶
Symptoms: Cannot reach services from within cluster
Diagnosis:
kubectl get svc
kubectl describe svc <service-name>
kubectl get endpoints <service-name>
Solutions:
- No Endpoints
  - The service selector doesn't match the pod labels
  - Verify labels:
    kubectl get pods --show-labels
    kubectl get svc <service-name> -o yaml | grep selector
- Port Mismatch
  - Verify the service port matches the container port:
    kubectl get svc <service-name> -o yaml
    kubectl get pod <pod-name> -o yaml | grep containerPort
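For reference, the selector on a Service must equal the labels on the Deployment's pod template; names, ports, and the image below are illustrative:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: backend            # illustrative
spec:
  selector:
    app: backend           # must equal the pod template label below
  ports:
    - port: 8000
      targetPort: 8000
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
spec:
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend       # matched by the Service selector
    spec:
      containers:
        - name: backend
          image: example/backend:latest   # illustrative image
          ports:
            - containerPort: 8000         # must match targetPort
```

If these labels drift apart, the Service shows no endpoints even though the pods are healthy.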
External Access Issues ¶
Symptoms: Cannot access application from devel.mnemonica.com
Diagnosis:
# Check NodePort service
kubectl get svc frontend
curl http://<node-ip>:30001
# Check AWS Load Balancer Controller
kubectl get pods -n kube-system | grep aws-load-balancer
kubectl logs -n kube-system <alb-controller-pod>
# Check Target Group Bindings
kubectl get targetgroupbindings
kubectl describe targetgroupbinding <tgb-name>
Solutions:
- Verify AWS ALB is created and healthy
- Check security groups allow traffic
- Verify DNS points to correct load balancer
- Check Target Group health checks
DNS Resolution Issues ¶
Symptoms: Services can't resolve DNS names
Diagnosis:
# Test DNS from pod
kubectl exec -it <pod-name> -- nslookup redis-master
kubectl exec -it <pod-name> -- nslookup backend.default.svc.cluster.local
# Check CoreDNS
kubectl get pods -n kube-system | grep coredns
kubectl logs -n kube-system <coredns-pod>
Solutions:
- Restart CoreDNS:
  kubectl rollout restart deployment/coredns -n kube-system
- Check the service exists:
  kubectl get svc
- Use the FQDN:
  <service>.<namespace>.svc.cluster.local
Service-Specific Issues ¶
Backend Issues ¶
Symptoms: Backend pod crashes or returns errors
Diagnosis:
kubectl logs -f deployment/backend
kubectl exec -it <backend-pod> -- python manage.py check
Common Issues:
- Database Connection Failed
  - Check PostgreSQL credentials in the uwsgi secret
  - Verify PGSSLCERT exists at /tmp/postgresql.crt
  - Test the connection from the pod:
    kubectl exec -it <backend-pod> -- psql <connection-string>
- Redis Connection Failed
  - Check Redis is running:
    kubectl get pods | grep redis
  - Test the connection:
    kubectl exec -it <backend-pod> -- redis-cli -h redis-master ping
- RabbitMQ Connection Failed
  - Check RabbitMQ:
    kubectl get pods | grep rabbitmq
  - Verify credentials in CELERY_BROKER_URL (uwsgi-env ConfigMap)
  - Test the connection:
    kubectl exec -it <backend-pod> -- curl http://guest:guest@rabbitmq:15672/api/whoami
Celery Issues ¶
Symptoms: Tasks not processing, workers offline
Diagnosis:
kubectl logs -f deployment/celery
kubectl logs -f deployment/celery-encoding
# Check Flower monitoring
kubectl port-forward deployment/flower 5555:5555
# Open http://localhost:5555
Common Issues:
- Workers Not Connecting to Broker
  - Verify CELERY_BROKER_URL in the uwsgi-env ConfigMap
  - Check RabbitMQ logs:
    kubectl logs <rabbitmq-pod>
- Tasks Failing
  - Check worker logs for exceptions
  - Verify shared storage is accessible
  - Check encoding parameters in the uwsgi-env ConfigMap
Encoding Pool Issues ¶
Symptoms: Video encoding not working
Diagnosis:
kubectl logs -f deployment/encoding-pool-master
kubectl get jobs | grep encoding-worker
kubectl logs job/<encoding-worker-job>
Common Issues:
- Workers Not Starting
  - Check encoding-pool-master has kubectl access:
    kubectl exec -it <master-pod> -- kubectl get pods
  - Verify the epmaster service account:
    kubectl get sa epmaster
  - Check the role binding:
    kubectl get rolebinding | grep epmaster
- Encoding Failures
  - Check GPU availability (if using h264_nvenc)
  - Verify ffmpeg is installed:
    kubectl exec -it <worker-pod> -- ffmpeg -version
  - Check encoding parameters in the uwsgi-env ConfigMap
  - Review worker job logs
TUSD Upload Issues ¶
Symptoms: File uploads failing or hanging
Diagnosis:
# Check TUSD pod
kubectl logs -f <tusd-pod>
# Check tus-hook-listener
kubectl logs -f deployment/tus-hook-listener
# Check lock files
kubectl exec -it <tusd-pod> -- find /var/www/media/mnemonica/storage/tus -name "*.lock"
Solutions:
- Stale Lock Files
  - The TUSD init container should remove them, but you can clean up manually:
    kubectl exec -it <tusd-pod> -- find /var/www/media/mnemonica/storage/tus -name "*.lock" -delete
- Storage Full
  - Check JuiceFS backend (S3) storage
  - Review the upload directory size
- Hook Listener Not Responding
  - Verify the tus-hook-listener service:
    kubectl get svc tus-hook-listener
  - Check the pod is running:
    kubectl get pods -l app=tus-hook-listener
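The `find ... -name "*.lock" -delete` invocation can be previewed locally before running it inside the pod. This sketch exercises the same pattern against a throwaway directory with fake lock files:

```shell
# Preview of the lock-file cleanup: create fake .lock files in a temp dir,
# then delete them with the same find pattern used inside the TUSD pod.
tmpdir=$(mktemp -d)
touch "$tmpdir/upload-a.lock" "$tmpdir/upload-b.lock" "$tmpdir/upload-a.bin"
find "$tmpdir" -name "*.lock" -delete
ls "$tmpdir"           # only upload-a.bin remains
rm -rf "$tmpdir"
```

Note that `-delete` only matches the `-name` pattern, so in-progress upload data (`.bin`, `.info` files) is left untouched.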
Centrifugo WebSocket Issues ¶
Symptoms: Real-time features not working
Diagnosis:
kubectl logs -f <centrifugo-pod>
# Check service
kubectl get svc centrifugo
# Test connection from backend
kubectl exec -it <backend-pod> -- curl http://centrifugo:9000/api
Solutions:
- Token Mismatch
  - Verify the centrifugo secret matches the backend configuration
  - Check tokenHmacSecretKey and apiKey are set correctly
- Connection Refused
  - Ensure the Centrifugo pod is running
  - Verify service endpoints:
    kubectl get endpoints centrifugo
Redis Issues ¶
Symptoms: Cache not working, sessions lost
Diagnosis:
kubectl logs -f <redis-pod>
kubectl exec -it <redis-pod> -- redis-cli ping
kubectl exec -it <redis-pod> -- redis-cli INFO
Solutions:
- Redis Out of Memory
  - Check memory usage:
    kubectl exec -it <redis-pod> -- redis-cli INFO memory
  - Increase resource limits or add an eviction policy
- Connection Refused
  - Verify the redis-master service:
    kubectl get svc redis-master
  - Check MNEMONICA_CACHE_URL in the uwsgi-env ConfigMap
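An eviction policy is set in the Redis configuration (or at runtime via `redis-cli CONFIG SET`); the values below are illustrative, not taken from this deployment:

```
# redis.conf fragment (illustrative values): cap memory and evict
# least-recently-used keys instead of failing writes when full
maxmemory 512mb
maxmemory-policy allkeys-lru
```

For a pure cache this is usually safe; for session data, prefer `volatile-lru` so only keys with a TTL are evicted.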
Performance Issues ¶
Slow Response Times ¶
Diagnosis:
# Check resource usage
kubectl top pods
kubectl top nodes
# Check pod resource limits
kubectl describe pod <pod-name> | grep -A 5 Limits
# Check for throttling
kubectl describe pod <pod-name> | grep -i throttling
Solutions:
- CPU/Memory Limits Too Low
  - Uncomment and adjust resource requests/limits in the deployment files
  - Example:
    resources:
      requests:
        cpu: 100m
        memory: 512Mi
      limits:
        cpu: 1000m
        memory: 2Gi
- Storage I/O Issues
  - Increase the JuiceFS cache-size mount option
  - Check S3 backend performance
  - Monitor JuiceFS mount pod metrics
- Too Few Replicas
  - Scale critical services:
    kubectl scale deployment backend --replicas=2
    kubectl scale deployment celery --replicas=3
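Beyond manual `kubectl scale`, a HorizontalPodAutoscaler can adjust replica counts automatically. The sketch below assumes metrics-server is installed; the target deployment and thresholds are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend              # illustrative
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
```

Note the HPA needs CPU requests set on the target pods to compute utilization.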
High Memory Usage ¶
Diagnosis:
kubectl top pods
kubectl exec -it <pod-name> -- free -h
Solutions:
- Enable memory limits
- Check for memory leaks in application code
- Restart pods periodically if needed
- Increase node instance size
Debugging Commands ¶
Get Resource Status ¶
# All resources overview
kubectl get all
# Specific resource types
kubectl get pods
kubectl get svc
kubectl get deployments
kubectl get pvc
kubectl get configmaps
kubectl get secrets
kubectl get jobs
# Across all namespaces
kubectl get pods -A
helm list -A
Describe Resources ¶
kubectl describe pod <pod-name>
kubectl describe deployment <deployment-name>
kubectl describe pvc <pvc-name>
kubectl describe node <node-name>
Logs ¶
# Current logs
kubectl logs <pod-name>
kubectl logs -f <pod-name> # Follow
kubectl logs <pod-name> -c <container-name> # Multi-container pod
# Previous crashed pod logs
kubectl logs <pod-name> --previous
# Deployment logs (any pod)
kubectl logs -f deployment/<deployment-name>
# Last N lines
kubectl logs <pod-name> --tail=100
# Logs with timestamps
kubectl logs <pod-name> --timestamps
Execute Commands in Pods ¶
# Interactive shell
kubectl exec -it <pod-name> -- /bin/bash
kubectl exec -it <pod-name> -- /bin/sh # If bash not available
# Single command
kubectl exec <pod-name> -- ls -la /var/www/media
kubectl exec <pod-name> -- env
kubectl exec <pod-name> -- ps aux
# Python management commands (backend)
kubectl exec -it <backend-pod> -- python manage.py shell
kubectl exec -it <backend-pod> -- python manage.py dbshell
kubectl exec -it <backend-pod> -- python manage.py check
Port Forwarding ¶
# Forward local port to pod
kubectl port-forward pod/<pod-name> 8080:80
kubectl port-forward deployment/<deployment-name> 8080:80
kubectl port-forward svc/<service-name> 8080:80
# Access at http://localhost:8080
Copy Files ¶
# From pod to local
kubectl cp <pod-name>:/path/to/file ./local-file
# From local to pod
kubectl cp ./local-file <pod-name>:/path/to/file
Resource Usage ¶
# Real-time resource monitoring
kubectl top pods
kubectl top nodes
# Specific pod
kubectl top pod <pod-name>
# Sort by CPU/Memory
kubectl top pods --sort-by=cpu
kubectl top pods --sort-by=memory
Events ¶
# Cluster events
kubectl get events --sort-by='.lastTimestamp'
# Specific resource events
kubectl describe pod <pod-name> | grep -A 20 Events
Restart Deployments ¶
# Graceful restart
kubectl rollout restart deployment/<deployment-name>
# Force delete and recreate pod
kubectl delete pod <pod-name>
# Restart all pods with specific label
kubectl delete pods -l app=backend
Config and Secrets ¶
# View ConfigMap
kubectl get configmap uwsgi-env -o yaml
# Edit ConfigMap (will auto-reload with Reloader)
kubectl edit configmap uwsgi-env
# Decode secret (base64)
kubectl get secret uwsgi -o jsonpath='{.data.DATABASE_URL}' | base64 -d
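The jsonpath output above is raw base64; the decode step can be sanity-checked locally with a made-up value (no cluster needed):

```shell
# Round-trip check of the base64 decode used above, with a made-up value.
encoded=$(printf 'postgres://user:pass@db:5432/app' | base64)
printf '%s' "$encoded" | base64 -d   # prints the original connection string
```

If the decoded value looks like more base64, the secret was double-encoded when it was created.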
Network Testing ¶
# Test service connectivity from pod
kubectl exec -it <pod-name> -- curl http://<service-name>:<port>
kubectl exec -it <pod-name> -- wget -O- http://<service-name>:<port>
kubectl exec -it <pod-name> -- nc -zv <service-name> <port>
# DNS testing
kubectl exec -it <pod-name> -- nslookup <service-name>
kubectl exec -it <pod-name> -- dig <service-name>.default.svc.cluster.local
Helm Debugging ¶
# List releases
helm list -A
# Check release status
helm status <release-name>
# View release values
helm get values <release-name>
# View all release details
helm get all <release-name>
# Rollback
helm rollback <release-name> <revision>
Node Debugging ¶
# Node details
kubectl describe node <node-name>
# Drain node (evict pods)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# Uncordon node (allow scheduling)
kubectl uncordon <node-name>
# Cordon node (prevent new pods)
kubectl cordon <node-name>
Emergency Procedures ¶
Complete Application Restart ¶
# Restart all application services
kubectl rollout restart deployment/backend
kubectl rollout restart deployment/frontend
kubectl rollout restart deployment/celery
kubectl rollout restart deployment/celerybeat
kubectl rollout restart deployment/celery-encoding
kubectl rollout restart deployment/encoding-pool-master
kubectl rollout restart deployment/tus-hook-listener
kubectl rollout restart deployment/traffic-log-listener
kubectl rollout restart deployment/flower
Clear All Pods ¶
# Delete all pods (will be recreated by deployments)
kubectl delete pods -l pdb=minAvail1
Reset Redis ¶
kubectl delete pod -l app=redis-master
# Or flush all data
kubectl exec -it <redis-pod> -- redis-cli FLUSHALL
Reset RabbitMQ ¶
# Restart RabbitMQ (tasks in queue will be lost)
kubectl delete pod -l app.kubernetes.io/name=rabbitmq
Getting Help ¶
If issues persist:
- Collect diagnostics:
  kubectl get all -o wide > cluster-state.txt
  kubectl describe pods >> cluster-state.txt
  kubectl get events --sort-by='.lastTimestamp' >> cluster-state.txt
- Check CloudWatch logs in the AWS Console
- Review Groundcover dashboards
- Contact support: