Kubernetes — Deploying a promtail + loki + grafana Logging Stack

Background

Deploy a promtail + loki + grafana logging stack on a Kubernetes cluster. Promtail collects logs from the cluster and ships them to Loki, attaching the following labels:

  • namespace: the namespace the Pod runs in
  • pod_name: the Pod name
  • deployment_name: the Deployment name (taken from the Pod's app label, or extracted from the Pod name)
  • container: the container name
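Once the stack is running, these labels can be used directly in LogQL selectors. A minimal sketch of querying Loki's HTTP API over a port-forward (the namespace and deployment names below are made-up examples):

```shell
# In another terminal: kubectl port-forward -n logging svc/loki 3100:3100
# Select error lines from one Deployment; query_range defaults to the last hour.
query='{namespace="default", deployment_name="my-app"} |= "error"'

curl -sG http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode "query=${query}" \
  --data-urlencode 'limit=20'
```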

Deploy loki

  • Deployment: the main Loki service
  • Service: in-cluster access
  • ConfigMap: the Loki configuration file
  • PVC: persistent storage for data (10Gi)
  • PV: the backing data volume

Create the YAML files

First create the namespace:

kubectl create namespace logging

Create a loki directory, then create pv.yaml:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: loki-data-pv
spec:
  capacity:
    storage: 10Gi # storage size
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: host-loki
  hostPath:
    path: /data/loki # storage location
    type: DirectoryOrCreate
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - worker-1 # change to the node that should store the data

Create pvc.yaml:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: loki-data
  namespace: logging
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: host-loki

Create configmap.yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-config
  namespace: logging
data:
  loki.yaml: |
    auth_enabled: false

    server:
      http_listen_port: 3100
      grpc_listen_port: 9096

    common:
      path_prefix: /loki
      storage:
        filesystem:
          chunks_directory: /loki/chunks
          rules_directory: /loki/rules
      replication_factor: 1
      ring:
        instance_addr: 127.0.0.1
        kvstore:
          store: inmemory

    schema_config:
      configs:
        - from: 2020-10-24
          store: boltdb-shipper
          object_store: filesystem
          schema: v11
          index:
            prefix: index_
            period: 24h

    ruler:
      alertmanager_url: http://localhost:9093

    analytics:
      reporting_enabled: false

    limits_config:
      ingestion_rate_mb: 16
      ingestion_burst_size_mb: 32
      max_query_length: 721h
      max_query_parallelism: 32
      max_streams_per_user: 10000
      max_line_size: 0
      max_query_series: 500
      reject_old_samples: true
      reject_old_samples_max_age: 168h

    chunk_store_config:
      max_look_back_period: 0s

    table_manager:
      retention_deletes_enabled: true
      retention_period: 720h

    compactor:
      working_directory: /loki/compactor
      shared_store: filesystem
      compaction_interval: 10m
      retention_enabled: true
      retention_delete_delay: 2h
      retention_delete_worker_count: 150

Create deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: loki
  namespace: logging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: loki
  template:
    metadata:
      labels:
        app: loki
    spec:
      securityContext:
        fsGroup: 10001
      initContainers:
        - name: init-storage
          image: busybox:1.36
          securityContext:
            runAsUser: 0
          command:
            - sh
            - -c
            - |
              mkdir -p /loki/chunks /loki/rules /loki/compactor
              chown -R 10001:10001 /loki
              chmod -R 755 /loki
          volumeMounts:
            - name: storage
              mountPath: /loki
      containers:
        - name: loki
          image: grafana/loki:2.9.2
          securityContext:
            runAsUser: 10001
            runAsNonRoot: true
            readOnlyRootFilesystem: false
          ports:
            - containerPort: 3100
              name: http
            - containerPort: 9096
              name: grpc
          args:
            - -config.file=/etc/loki/loki.yaml
          volumeMounts:
            - name: config
              mountPath: /etc/loki
            - name: storage
              mountPath: /loki
          livenessProbe:
            httpGet:
              path: /ready
              port: 3100
            initialDelaySeconds: 45
            periodSeconds: 30
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: 3100
            initialDelaySeconds: 15
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
      volumes:
        - name: config
          configMap:
            name: loki-config
        - name: storage
          persistentVolumeClaim:
            claimName: loki-data

Create service.yaml:

apiVersion: v1
kind: Service
metadata:
  name: loki
  namespace: logging
spec:
  type: ClusterIP
  selector:
    app: loki
  ports:
    - port: 3100
      targetPort: 3100
      protocol: TCP
      name: http
    - port: 9096
      targetPort: 9096
      protocol: TCP
      name: grpc

Deploy

  1. Create the PV and PVC:
kubectl apply -f pv.yaml
kubectl apply -f pvc.yaml
  2. Create the ConfigMap:
kubectl apply -f configmap.yaml
  3. Deploy the Deployment:
kubectl apply -f deployment.yaml
  4. Create the Service:
kubectl apply -f service.yaml
  5. Verify the deployment:
# Check the Loki Pod status
kubectl get pods -n logging -l app=loki

# View the Loki logs
kubectl logs -n logging -l app=loki --tail=50
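Because the PV pins its data to a single node via nodeAffinity, it is also worth confirming that the claim actually bound; an unbound PVC leaves the Loki Pod stuck in Pending. A quick check using the resource names defined above:

```shell
# STATUS should read "Bound" for both objects
kubectl get pvc loki-data -n logging
kubectl get pv loki-data-pv
```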

Deploy promtail

Promtail is Grafana Loki's log collection agent; it gathers container logs from the Kubernetes cluster and ships them to Loki.

  • DaemonSet: runs one Promtail Pod per node to collect that node's container logs
  • ConfigMap: the Promtail configuration, defining collection rules and the Loki output
  • ServiceAccount & RBAC: grants access to the Kubernetes API for fetching Pod metadata

Create the YAML files

Create a promtail directory, then create configmap.yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail-config
  namespace: logging
data:
  promtail.yaml: |
    server:
      http_listen_port: 3101
      grpc_listen_port: 9096

    positions:
      filename: /tmp/positions.yaml

    clients:
      - url: http://loki.logging.svc.cluster.local:3100/loki/api/v1/push

    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
        pipeline_stages:
          - docker: {}
        relabel_configs:
          # Build the log file path
          - source_labels:
              - __meta_kubernetes_namespace
              - __meta_kubernetes_pod_name
              - __meta_kubernetes_pod_uid
            separator: _
            target_label: __tmp_pod_path
          - source_labels:
              - __tmp_pod_path
              - __meta_kubernetes_pod_container_name
            separator: /
            target_label: __path__
            replacement: /var/log/pods/$1/$2/*.log
          # Extract the namespace label
          - source_labels:
              - __meta_kubernetes_namespace
            target_label: namespace
          # Extract the pod_name label
          - source_labels:
              - __meta_kubernetes_pod_name
            target_label: pod_name
          # Default deployment_name: extract it from the Pod name
          # (format: deployment-name-replicaset-hash)
          - source_labels:
              - __meta_kubernetes_pod_name
            target_label: deployment_name
            regex: '^(.+?)-[0-9a-z]+-[0-9a-z]+$'
            replacement: '${1}'
            action: replace
          # Prefer the Pod's app label when it exists (this rule runs last,
          # so it overrides the Pod-name fallback above)
          - source_labels:
              - __meta_kubernetes_pod_label_app
            target_label: deployment_name
            regex: (.+)
          # Extract the container label
          - source_labels:
              - __meta_kubernetes_pod_container_name
            target_label: container
          # Keep only entries that look like valid Pod logs
          - action: keep
            source_labels:
              - __meta_kubernetes_pod_name
              - __meta_kubernetes_pod_node_name
              - __meta_kubernetes_namespace
          # Drop all labels prefixed with __meta_kubernetes
          - action: labeldrop
            regex: '__meta_kubernetes.*'
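The relabel rule that derives deployment_name from a Pod name strips the trailing ReplicaSet hash and Pod suffix. That transformation can be sanity-checked locally with sed (the Pod name is a made-up example; sed -E has no lazy quantifier, so a greedy group is used, which produces the same result here because the last two segments cannot contain dashes):

```shell
# Strip "-<replicaset-hash>-<pod-suffix>" from a Pod name,
# mirroring the relabel rule's regex and '${1}' replacement.
echo "my-app-7f9c5d4b6-x2x9z" | sed -E 's/^(.+)-[0-9a-z]+-[0-9a-z]+$/\1/'
# -> my-app
```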

Create daemonset.yaml:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: promtail
  namespace: logging
spec:
  selector:
    matchLabels:
      app: promtail
  template:
    metadata:
      labels:
        app: promtail
    spec:
      serviceAccountName: promtail
      tolerations:
        - effect: NoSchedule
          operator: Exists
        - effect: NoExecute
          operator: Exists
      containers:
        - name: promtail
          image: grafana/promtail:2.9.2
          args:
            - -config.file=/etc/promtail/promtail.yaml
          ports:
            - name: http-metrics
              containerPort: 3101
          env:
            - name: HOSTNAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          volumeMounts:
            - name: config
              mountPath: /etc/promtail
            - name: varlog
              mountPath: /var/log
              readOnly: true
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: positions
              mountPath: /tmp
          resources:
            requests:
              memory: "128Mi"
              cpu: "100m"
            limits:
              memory: "256Mi"
              cpu: "200m"
          securityContext:
            runAsUser: 0
            runAsGroup: 0
            runAsNonRoot: false
      volumes:
        - name: config
          configMap:
            name: promtail-config
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
        - name: positions
          emptyDir: {}

Create serviceaccount.yaml:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: promtail
  namespace: logging

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: promtail
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/proxy
      - services
      - endpoints
      - pods
    verbs: ["get", "list", "watch"]
  - apiGroups:
      - ""
    resources:
      - configmaps
    verbs: ["get"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: promtail
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: promtail
subjects:
  - kind: ServiceAccount
    name: promtail
    namespace: logging
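After applying the RBAC objects, the granted permissions can be checked directly with kubectl's built-in access review, without waiting for Promtail to log authorization errors:

```shell
# Should print "yes" once the ClusterRoleBinding is in place
kubectl auth can-i list pods \
  --as=system:serviceaccount:logging:promtail --all-namespaces
```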

Deploy

  1. Create the ConfigMap:
kubectl apply -f configmap.yaml
  2. Create the ServiceAccount and RBAC:
kubectl apply -f serviceaccount.yaml
  3. Deploy the DaemonSet:
kubectl apply -f daemonset.yaml
  4. Verify the deployment:
# Check that Promtail Pods are running on every node
kubectl get pods -n logging -l app=promtail

# View the Promtail logs
kubectl logs -n logging -l app=promtail --tail=50
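Promtail also serves a status page on its HTTP port listing every discovered and tailed target, which is often quicker than reading logs when a Pod's files are not being picked up. A sketch using a port-forward to one Promtail Pod (the Pod name is a placeholder):

```shell
# Forward the HTTP port of one Promtail Pod, then inspect its targets page
kubectl port-forward -n logging pod/<promtail-pod-name> 3101:3101 &
curl -s http://localhost:3101/targets
```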

Deploy grafana

Grafana visualizes the log data in Loki, providing powerful query and dashboard features.

  • Deployment: the main Grafana service
  • Service: in-cluster access (a NodePort also works for direct access)
  • Ingress: external access (via Traefik)
  • ConfigMap: data source configuration (auto-provisions the Loki data source)
  • PVC: persistent data storage (10Gi, holds dashboards and configuration)

Create the YAML files

Create a grafana directory, then create configmap.yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: logging
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
      - name: Loki
        type: loki
        access: proxy
        url: http://loki.logging.svc.cluster.local:3100
        isDefault: false
        editable: true
        jsonData:
          maxLines: 1000
          derivedFields:
            - datasourceUid: loki
              matcherRegex: "traceID=(\\w+)"
              name: TraceID
              url: '$${__value.raw}'

Create deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: logging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      initContainers:
        - name: init-storage
          image: busybox:1.36
          securityContext:
            runAsUser: 0
          command:
            - sh
            - -c
            - |
              mkdir -p /var/lib/grafana/plugins /var/lib/grafana/data /var/lib/grafana/logs
              chown -R root:root /var/lib/grafana
              chmod -R 755 /var/lib/grafana
          volumeMounts:
            - name: storage
              mountPath: /var/lib/grafana
      containers:
        - name: grafana
          image: grafana/grafana:10.2.2
          securityContext:
            runAsUser: 0
            runAsNonRoot: false
            readOnlyRootFilesystem: false
          ports:
            - containerPort: 3000
              name: http
          env:
            - name: GF_SECURITY_ADMIN_USER
              value: admin
            - name: GF_SECURITY_ADMIN_PASSWORD
              value: admin  # manage via a Secret instead
            - name: GF_INSTALL_PLUGINS
              value: ""
            - name: GF_SERVER_ROOT_URL
              value: "%(protocol)s://%(domain)s:%(http_port)s/"
            - name: GF_SERVER_SERVE_FROM_SUB_PATH
              value: "false"
          volumeMounts:
            - name: storage
              mountPath: /var/lib/grafana
            - name: datasources
              mountPath: /etc/grafana/provisioning/datasources
          livenessProbe:
            httpGet:
              path: /api/health
              port: 3000
            initialDelaySeconds: 60
            periodSeconds: 30
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /api/health
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          resources:
            requests:
              memory: "256Mi"
              cpu: "100m"
            limits:
              memory: "512Mi"
              cpu: "500m"
      volumes:
        - name: storage
          persistentVolumeClaim:
            claimName: grafana-data
        - name: datasources
          configMap:
            name: grafana-datasources

Create pv.yaml:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: grafana-data-pv
spec:
  capacity:
    storage: 10Gi # storage size
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: host-grafana
  hostPath:
    path: /data/grafana # storage directory
    type: DirectoryOrCreate
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - worker-1 # change to your own node

Create pvc.yaml:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-data
  namespace: logging
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: host-grafana

Create service.yaml:

apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: logging
spec:
  type: ClusterIP
  selector:
    app: grafana
  ports:
    - port: 3000
      targetPort: 3000
      protocol: TCP
      name: http

Create ingress.yaml:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana-ingress
  namespace: logging
  annotations:
    traefik.ingress.kubernetes.io/router.entrypoints: web,websecure
    traefik.ingress.kubernetes.io/router.tls: "true"
spec:
  tls:
    - hosts:
        - grafana-dev.jobcher.com           # change to your own domain
      secretName: jobcher-com-tls           # change to your own TLS certificate Secret
  rules:
    - host: grafana-dev.jobcher.com         # change to your own domain
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: grafana
                port:
                  number: 3000

Deploy

  1. Create the PV and PVC:
kubectl apply -f pv.yaml
kubectl apply -f pvc.yaml
  2. Create the ConfigMap (data source configuration):
kubectl apply -f configmap.yaml
  3. Deploy the Deployment:
kubectl apply -f deployment.yaml
  4. Create the Service:
kubectl apply -f service.yaml
  5. Create the Ingress (optional, for external access):
kubectl apply -f ingress.yaml
  6. Verify the deployment:
# Check the status of all components
kubectl get pods -n logging

# Check the Services
kubectl get svc -n logging

# View the Loki logs
kubectl logs -n logging -l app=loki --tail=50

# View the Grafana logs
kubectl logs -n logging -l app=grafana --tail=50

Verification and testing

Verify that Loki is working

# Check Loki's health
kubectl exec -n logging -it deployment/loki -- wget -q -O - http://localhost:3100/ready

# Query the log streams stored in Loki
kubectl exec -n logging -it deployment/loki -- wget -q -O - "http://localhost:3100/loki/api/v1/label/namespace/values"
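An end-to-end check is to push a hand-crafted log line through Loki's push API and query it back afterwards. A minimal sketch, assuming a port-forward to the loki Service on 3100 (the label values are arbitrary test markers):

```shell
# The push API expects a nanosecond Unix timestamp as a string
ts="$(date +%s)000000000"

# One stream carrying one log line
payload='{"streams":[{"stream":{"namespace":"smoke-test","pod_name":"manual"},"values":[["'"$ts"'","hello from curl"]]}]}'

curl -s -H "Content-Type: application/json" \
  -X POST http://localhost:3100/loki/api/v1/push -d "$payload"

# The line should now be visible under {namespace="smoke-test"} in Grafana Explore
```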

Viewing logs in Grafana

  1. Access Grafana (via the Ingress or a port-forward):
# Port-forward (when using a ClusterIP Service)
kubectl port-forward -n logging svc/grafana 3000:3000
  2. Log in with the default credentials: admin / admin
  3. Open the Explore page and select the Loki data source
  4. Query logs with LogQL, for example:
    • {namespace="default"} - logs from the default namespace
    • {pod_name="your-pod-name"} - logs from a specific Pod
    • {deployment_name="your-deployment"} - logs from a specific Deployment
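The provisioned data source can also be verified from the command line through Grafana's HTTP API, again over the port-forward, using the default admin credentials from the Deployment above:

```shell
# Basic health check
curl -s http://localhost:3000/api/health

# List provisioned data sources; the Loki entry should appear in the output
curl -s -u admin:admin http://localhost:3000/api/datasources
```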

Notes

Storage configuration

  • Make sure /var/log/pods exists on each node (the standard log path since Kubernetes 1.14)
  • Adjust the PV size and node selection to your actual needs
  • With containerd instead of Docker the log path format is the same, but the configuration may need adjusting

Security configuration

  • Promtail must run as root (uid 0) to read the log files on the node
  • Manage the Grafana admin password through a Secret rather than hard-coding it in the Deployment
  • Enable Loki's authentication in production

Performance tuning

  • Adjust the resource limits (CPU, memory) to the size of the cluster
  • Tune Loki's ingestion_rate_mb and ingestion_burst_size_mb to the log volume
  • Old logs are cleaned up automatically according to the retention_period setting

Label extraction logic

  • How the deployment_name label is derived:
    • The Pod's app label value is used when present
    • Without an app label, it is extracted from the Pod name (format: deployment-name-replicaset-hash)
  • Make sure Pods carry the right labels so Promtail can extract metadata correctly

Troubleshooting

  • If Promtail collects no logs, check that its Pods can access /var/log/pods
  • If Loki receives no logs, check network connectivity and the Service configuration
  • Use kubectl logs on each component to investigate