Overview
Deploying a Kubernetes cluster to production is a complex undertaking. Many teams only discover missing configuration or absent monitoring once the cluster is already serving traffic, leading to outages, data loss, or security incidents. This article walks through a systematic checklist to help you identify and address those risks before you deploy.
1. Cluster Infrastructure Checks
1.1 Node Configuration
Hardware requirements
Control plane nodes (master):
- CPU: at least 2 cores; 4 or more recommended
- Memory: at least 2 GB; 8 GB or more recommended
- Disk: at least 20 GB (for etcd storage); 50 GB or more recommended
- Network: low-latency (< 10 ms) connectivity between nodes
Worker nodes:
- CPU: at least 1 core; size according to the workload (typically 2-8 cores)
- Memory: at least 512 MB (in practice usually 2 GB or more)
- Disk: at least 10 GB (container runtime plus image storage); 50 GB or more recommended
How to verify
```bash
# Inspect node inventory and details
kubectl get nodes -o wide
kubectl describe node <node-name>

# Check current resource usage (requires metrics-server)
kubectl top nodes
kubectl top pods --all-namespaces
```
1.2 Cluster Networking
CNI plugin selection
- Flannel: lightweight and easy to deploy; a good fit for small clusters
- Calico: feature-complete, supports network policies; suited to large production environments
- Cilium: high performance, eBPF-based, with advanced traffic-management capabilities
Verification test commands
```bash
# Launch a temporary pod for connectivity tests
kubectl run test-pod --image=busybox --restart=Never -- sleep 3600

# Verify in-cluster DNS resolution of the API service
kubectl exec -it test-pod -- nslookup kubernetes.default.svc.cluster.local

# Verify the pod can reach the API server (it serves HTTPS; an unauthenticated
# request returning 401/403 still proves network connectivity)
kubectl exec -it test-pod -- wget -O- --no-check-certificate https://kubernetes.default.svc.cluster.local/api/v1/

kubectl delete pod test-pod
```
1.3 Storage Configuration
Storage type selection

| Storage type | Use cases | Risks |
|---|---|---|
| Local storage | Temporary data, logs | Data loss on node failure |
| Network storage (NFS) | Shared data, moderate reliability needs | Single point of failure, performance bottleneck |
| Block storage (cloud disks) | Databases, critical applications | Higher cost, cross-zone latency |
| Distributed storage | High availability, large-scale applications | Complex to deploy, hard to tune for performance |

Verification checklist
```bash
kubectl get storageclass

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: <your-storage-class>
EOF

kubectl get pvc
kubectl delete pvc test-pvc
```
2. Security Checks
2.1 Authentication and Authorization
Checks
```bash
# List the API resources exposed by the cluster
kubectl api-resources

# Can the default ServiceAccount read pods? (it should not be over-privileged)
kubectl auth can-i get pods --as=system:serviceaccount:default:default

# Look for bindings granted to default accounts
kubectl get rolebinding,clusterrolebinding -A | grep default

# Check whether anonymous auth is enabled on the API server
ps aux | grep kube-apiserver | grep -i anonymous
```
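Beyond auditing the defaults, production namespaces are usually given narrowly scoped roles rather than broad bindings. A minimal least-privilege sketch; the `pod-reader` Role and `ci-deployer` ServiceAccount are illustrative names, not something defined elsewhere in this checklist:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader            # illustrative name
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: production
subjects:
  - kind: ServiceAccount
    name: ci-deployer         # illustrative ServiceAccount
    namespace: production
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```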
2.2 Network Security
NetworkPolicy configuration
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-app-traffic
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: myapp
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: production
        - podSelector:
            matchLabels:
              tier: frontend
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: production
      ports:
        - protocol: TCP
          port: 5432
    - to:
        - namespaceSelector: {}
      ports:
        # Allow DNS (UDP is required in practice; TCP covers large responses)
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```
2.3 Image Security
Checks
```bash
kubectl create secret docker-registry regcred \
  --docker-server=<registry-url> \
  --docker-username=<username> \
  --docker-password=<password>
```
Pod Security Standards
```yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
    - ALL
  runAsUser:
    rule: 'MustRunAsNonRoot'
  seLinux:
    rule: 'MustRunAs'
    seLinuxOptions:
      level: "s0:c123,c456"
  fsGroup:
    rule: 'MustRunAs'
    ranges:
      - min: 1000
        max: 65535
  readOnlyRootFilesystem: true
```
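Note that PodSecurityPolicy was deprecated in Kubernetes 1.21 and removed in 1.25, so the manifest above only applies to older clusters. On current releases the equivalent guardrail is Pod Security admission, driven by namespace labels; a minimal sketch using the same `production` namespace:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    # Reject pods that violate the "restricted" profile
    pod-security.kubernetes.io/enforce: restricted
    # Warn and audit as well, which eases migration of existing workloads
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```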
3. High-Availability Checks
3.1 Control Plane High Availability
Verification
```bash
# List control plane nodes (on newer clusters the label is node-role.kubernetes.io/control-plane)
kubectl get nodes -l node-role.kubernetes.io/master=

# Check etcd cluster membership (kubeadm clusters require the client certificates)
kubectl -n kube-system exec -it etcd-<node> -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  member list

# Probe API availability repeatedly (run from inside the cluster, where the service DNS name resolves)
for i in {1..10}; do
  curl -sk https://kubernetes.default.svc.cluster.local/api/v1 > /dev/null || echo "Failed"
done
```
3.2 Application High Availability
Pod replica configuration
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: production
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - myapp
                topologyKey: kubernetes.io/hostname
      containers:
        - name: app
          image: myapp:1.0.0
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
```
Check commands
```bash
# Confirm replicas are spread across nodes
kubectl get pods -o wide

# Simulate a node outage by cordoning and draining it
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets

# Watch the pods get rescheduled onto other nodes
watch kubectl get pods -o wide

# Remember to bring the node back afterwards
kubectl uncordon <node-name>
```
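Replica counts and anti-affinity protect against node failures; a PodDisruptionBudget additionally caps how many replicas a voluntary disruption, such as the `kubectl drain` above, may take down at once. A minimal sketch for the `myapp` Deployment shown earlier (the `minAvailable` value is illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
  namespace: production
spec:
  minAvailable: 2        # keep at least 2 of the 3 replicas running during drains
  selector:
    matchLabels:
      app: myapp
```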
4. Observability Checks
4.1 Log Collection
Deploy a log collection stack
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
  namespace: kube-system
data:
  fluent.conf: |
    <source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
      read_from_head true
      <parse>
        @type json
        time_format %Y-%m-%dT%H:%M:%S.%NZ
      </parse>
    </source>
    <filter kubernetes.**>
      @type kubernetes_metadata
      @id filter_kube_metadata
      kubernetes_url "#{ENV['FLUENT_FILTER_KUBERNETES_URL'] || 'https://' + ENV.fetch('KUBERNETES_SERVICE_HOST') + ':' + ENV.fetch('KUBERNETES_SERVICE_PORT') + '/api'}"
    </filter>
    <match **>
      @type elasticsearch
      @id out_es
      @log_level info
      include_tag_key true
      host "#{ENV['FLUENT_ELASTICSEARCH_HOST']}"
      port "#{ENV['FLUENT_ELASTICSEARCH_PORT']}"
      path "#{ENV['FLUENT_ELASTICSEARCH_LOGSTASH_PREFIX'] || 'logstash'}._doc"
      logstash_format true
      logstash_prefix "#{ENV['FLUENT_ELASTICSEARCH_LOGSTASH_PREFIX'] || 'logstash'}"
      logstash_prefix_separator _
      include_timestamp false
      type_name _doc
    </match>
```
Verify log collection
```bash
kubectl get pods -n kube-system | grep -i log

curl http://elasticsearch:9200/_cat/indices
```
4.2 Metrics Monitoring
Prometheus + Grafana setup
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace

kubectl get pods -n monitoring
kubectl get svc -n monitoring

kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
```
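With the port-forward to 9090 in place, a quick sanity check is to hit Prometheus' health and targets endpoints; a short sketch (requires `jq`, and assumes the port-forward above is still running):

```bash
# Liveness / readiness of the Prometheus server
curl -s http://localhost:9090/-/healthy
curl -s http://localhost:9090/-/ready

# Confirm scrape targets are up
curl -s http://localhost:9090/api/v1/targets | \
  jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
```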
4.3 Alerting Configuration
PrometheusRule example
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-alerts
  namespace: monitoring
spec:
  groups:
    - name: kubernetes
      interval: 30s
      rules:
        - alert: NodeNotReady
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 5m
          annotations:
            summary: "Node {{ $labels.node }} is not ready"
        - alert: PodRestartingTooOften
          expr: rate(kube_pod_container_status_restarts_total[1h]) > 0.1
          for: 5m
          annotations:
            summary: "Pod {{ $labels.pod }} is restarting too often"
        - alert: HighMemoryUsage
          expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
          for: 5m
          annotations:
            summary: "Container {{ $labels.container }} has high memory usage"
        - alert: APIServerHighLatency
          # histogram_quantile needs rates of the bucket counters, aggregated by le
          expr: histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le)) > 1
          for: 5m
          annotations:
            summary: "API Server latency is high"
```
5. Backup and Recovery Checks
5.1 etcd Backup
Automated backup script
```bash
#!/bin/bash

BACKUP_DIR="/backup/etcd"
RETENTION_DAYS=30
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

mkdir -p "$BACKUP_DIR"

# Take an etcd snapshot
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  snapshot save "$BACKUP_DIR/etcd-backup-$TIMESTAMP.db"

# Verify the snapshot
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  snapshot status "$BACKUP_DIR/etcd-backup-$TIMESTAMP.db"

# Prune backups older than the retention window
find "$BACKUP_DIR" -name "*.db" -mtime +"$RETENTION_DAYS" -delete

echo "Backup completed: $BACKUP_DIR/etcd-backup-$TIMESTAMP.db"
```
Schedule backups with cron
```bash
# Edit the crontab
crontab -e

# Hourly backups:
0 * * * * /usr/local/bin/backup-etcd.sh >> /var/log/etcd-backup.log 2>&1

# Or once a day at 02:00:
0 2 * * * /usr/local/bin/backup-etcd.sh >> /var/log/etcd-backup.log 2>&1
```
5.2 Application Backup
Back up applications with Velero
```bash
# Install the Velero CLI
wget https://github.com/vmware-tanzu/velero/releases/download/v1.12.0/velero-v1.12.0-linux-amd64.tar.gz
tar -xzf velero-v1.12.0-linux-amd64.tar.gz
sudo mv velero-v1.12.0-linux-amd64/velero /usr/local/bin/

# Install Velero into the cluster (AWS example)
velero install \
  --provider aws \
  --bucket velero-backup \
  --secret-file ./credentials-velero \
  --use-volume-snapshots=true \
  --snapshot-location-config region=us-east-1

# Create a backup of the production namespace
velero backup create my-backup --include-namespaces production

# Check backup status and logs
velero backup get
velero backup logs my-backup

# Restore from the backup
velero restore create --from-backup my-backup
```
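One-off backups are easy to forget, so most teams also define a schedule. A sketch that backs up the production namespace nightly; the schedule name, cron expression, and retention period are illustrative:

```bash
# Nightly backup of the production namespace, retained for 30 days
velero schedule create production-nightly \
  --schedule="0 2 * * *" \
  --include-namespaces production \
  --ttl 720h

# List schedules and the backups they have produced
velero schedule get
velero backup get
```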
5.3 Restore Verification
Run restore drills regularly
```bash
# Restore the backup into a separate namespace so production is untouched
velero restore create --from-backup my-backup \
  --namespace-mappings production:test-restore

# Confirm the restored resources
kubectl get all -n test-restore

# Smoke-test the restored application
kubectl run test-pod -n test-restore --image=busybox --restart=Never -- \
  wget -O- http://app-service:8080
kubectl logs -n test-restore deployment/myapp

# Clean up the test restore (use the name shown by `velero restore get`)
velero restore delete my-backup-restore
```
6. Performance and Capacity Planning
6.1 Resource Quotas
Namespace-level resource limits
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "100"
    requests.memory: "200Gi"
    limits.cpu: "200"
    limits.memory: "400Gi"
    pods: "500"
    services: "100"
    persistentvolumeclaims: "50"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: production-limits
  namespace: production
spec:
  limits:
    - type: Pod
      min:
        cpu: "10m"
        memory: "32Mi"
      max:
        cpu: "4"
        memory: "8Gi"
    - type: Container
      min:
        cpu: "10m"
        memory: "32Mi"
      max:
        cpu: "2"
        memory: "4Gi"
      default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
```
Verify quota usage
```bash
kubectl describe resourcequota -n production

kubectl get limits -n production
```
6.2 Capacity Planning
Checks
```bash
# Current resource usage
kubectl top nodes
kubectl top pods --all-namespaces

# How much of each node is already allocated
kubectl describe node <node-name> | grep -A 8 Allocated

# Install the cluster autoscaler so node capacity follows demand
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm install autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=<cluster-name>
```
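The cluster autoscaler handles node-level capacity; for pod-level capacity the usual companion is a HorizontalPodAutoscaler. A minimal sketch targeting the `myapp` Deployment from section 3.2 (the replica bounds and threshold are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out once average CPU exceeds 70% of requests
```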
7. Compliance and Auditing
7.1 Audit Logs
Enable API auditing
```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Log request and response bodies, skipping the RequestReceived stage
  - level: RequestResponse
    omitStages:
      - "RequestReceived"
  # Log writes to sensitive core resources at full detail
  - level: RequestResponse
    verbs: ["create", "update", "patch", "delete"]
    resources:
      - group: ""
        resources: ["pods", "services", "secrets"]
```
Check the audit log
```bash
# kube-apiserver flags that enable auditing (add to the API server manifest)
--audit-policy-file=/etc/kubernetes/audit-policy.yaml
--audit-log-path=/var/log/kubernetes/audit.log
--audit-log-maxage=30
--audit-log-maxsize=100

# Follow the audit log
tail -f /var/log/kubernetes/audit.log | jq '.'
```
7.2 RBAC Auditing
Review permissions regularly
```bash
# List every role binding in the cluster
kubectl get rolebindings,clusterrolebindings -A

# List what a given user is allowed to do
kubectl auth can-i --list --as=<user>

# Find every subject bound to cluster-admin
kubectl get clusterrolebindings -o json | \
  jq '.items[] | select(.roleRef.name=="cluster-admin") | .subjects'
```
8. Common Problems and Solutions
Q1: Pods fail to start after deployment
Diagnostic steps
```bash
# Check pod events and status
kubectl describe pod <pod-name>

# Check current and previous container logs
kubectl logs <pod-name>
kubectl logs <pod-name> --previous

# Check node capacity and conditions
kubectl describe node <node-name>

# Check admission webhook logs (the label depends on how your webhook is deployed)
kubectl logs -n kube-system -l component=admission-webhook
```
Q2: Service is unreachable
Diagnostic steps
```bash
# Check the Service definition and selectors
kubectl describe service <service-name>

# Confirm the Service has endpoints (no endpoints usually means a selector mismatch)
kubectl get endpoints <service-name>

# Test DNS resolution from inside the cluster
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup <service-name>

# Check whether a NetworkPolicy is blocking the traffic
kubectl get networkpolicies -A
```
Q3: Volume mounts fail
Diagnostic steps
```bash
# Check PVC status (Pending usually means provisioning failed)
kubectl get pvc

# Check the PVs and their binding status
kubectl get pv

# Events on the claim usually contain the underlying error
kubectl describe pvc <pvc-name>
```
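If the claim stays Pending, the events and the provisioner are usually where the answer is. Two additional checks, assuming a CSI-based StorageClass (driver pod names vary by provider):

```bash
# PVC events often contain the provisioning or attach error
kubectl get events --field-selector involvedObject.kind=PersistentVolumeClaim

# Confirm the CSI driver / provisioner pods are healthy
kubectl get pods -n kube-system | grep -i csi
```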
9. Summary and Recommendations
Ten things to do before every deployment
- ✅ Cluster planning: decide node counts, machine specs, and network topology
- ✅ Security hardening: configure RBAC, NetworkPolicy, and Pod security standards
- ✅ High-availability deployment: control plane HA, application replica counts, anti-affinity
- ✅ Storage planning: pick an appropriate storage solution and configure backups
- ✅ Observability: deploy logging, metrics, and tracing
- ✅ Alerting: define alerts on key metrics and make sure the alerting pipeline works end to end
- ✅ Capacity planning: estimate resource usage and plan a scaling strategy
- ✅ Backup and recovery: define a backup strategy and run restore drills regularly
- ✅ Documentation: record the cluster architecture, configuration parameters, and common troubleshooting steps
- ✅ Testing and validation: run load tests, failure drills, and security audits
Key metrics for ongoing operations

| Metric | Target | Review frequency |
|---|---|---|
| Availability | > 99.9% | Weekly |
| Mean time to recovery (MTTR) | < 30 minutes | Monthly |
| Mean time between failures (MTBF) | > 720 hours | Quarterly |
| Backup integrity | 100% | Weekly |
| Restore feasibility | Successful restore rate > 95% | Monthly |
| Security incidents | 0 severe incidents | Continuous monitoring |
Closing Thoughts
Getting a Kubernetes cluster ready for production takes thorough planning and verification across many areas. Following this checklist significantly reduces the risk you carry into production and improves the reliability and security of the system. Fold it into your deployment process and adapt it to your own environment.
Review and update the checklist regularly so that it keeps pace with your technology stack and business requirements. Finally, do not skip the recurring failure drills and security audits; they are what keep the system stable over the long run.