我们将 prometheus.yml
配置文件以 configmap 的形式存储在 kubernetes 集群中。(configmap.yaml
)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
| apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-cfg
namespace: monitor
data:
prometheus.yml: |
global:
scrape_interval: 30s
scrape_timeout: 30s
evaluation_interval: 1m
scrape_configs:
- job_name: prometheus
honor_timestamps: true
scrape_interval: 30s
scrape_timeout: 30s
metrics_path: /metrics
scheme: http
follow_redirects: true
static_configs:
- targets:
- localhost:9090
- job_name: kubernetes-node-exporter
honor_timestamps: true
scrape_interval: 30s
scrape_timeout: 30s
metrics_path: /metrics
scheme: http
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: false
follow_redirects: true
relabel_configs:
- separator: ;
regex: __meta_kubernetes_node_label_(.+)
replacement: $1
action: labelmap
- source_labels: [__meta_kubernetes_role]
separator: ;
regex: (.*)
target_label: kubernetes_role
replacement: $1
action: replace
- source_labels: [__address__]
separator: ;
regex: (.*):10250
target_label: __address__
replacement: ${1}:9100
action: replace
kubernetes_sd_configs:
- role: node
follow_redirects: true
- job_name: kubernetes-node-kubelet
honor_timestamps: true
scrape_interval: 30s
scrape_timeout: 30s
metrics_path: /metrics
scheme: https
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
follow_redirects: true
relabel_configs:
- separator: ;
regex: __meta_kubernetes_node_label_(.+)
replacement: $1
action: labelmap
kubernetes_sd_configs:
- role: node
follow_redirects: true
- job_name: traefik
honor_timestamps: true
scrape_interval: 30s
scrape_timeout: 30s
metrics_path: /metrics
scheme: http
follow_redirects: true
static_configs:
- targets:
- traefik.kube-system.svc.cluster.local:8080
- job_name: kubernetes-apiservers
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- action: keep
source_labels:
- __meta_kubernetes_namespace
- __meta_kubernetes_service_name
- __meta_kubernetes_endpoint_port_name
regex: default;kubernetes;https
- job_name: kubernetes-service-endpoints
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- action: keep
regex: true
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_scrape
- action: replace
regex: (https?)
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_scheme
target_label: __scheme__
- action: replace
regex: (.+)
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_path
target_label: __metrics_path__
- action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
source_labels:
- __address__
- __meta_kubernetes_service_annotation_prometheus_io_port
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- action: replace
source_labels:
- __meta_kubernetes_namespace
target_label: kubernetes_namespace
- action: replace
source_labels:
- __meta_kubernetes_service_name
target_label: kubernetes_name
# 监控 service
- job_name: kubernetes-services
kubernetes_sd_configs:
- role: service
metrics_path: /probe
params:
module:
- http_2xx
relabel_configs:
- action: keep
regex: true
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_probe
- source_labels:
- __address__
target_label: __param_target
- replacement: blackbox
target_label: __address__
- source_labels:
- __param_target
target_label: instance
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels:
- __meta_kubernetes_namespace
target_label: kubernetes_namespace
- source_labels:
- __meta_kubernetes_service_name
target_label: kubernetes_name
# 监控 pod
- job_name: kubernetes-pods
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
regex: true
source_labels:
- __meta_kubernetes_pod_annotation_prometheus_io_scrape
- action: replace
regex: (.+)
source_labels:
- __meta_kubernetes_pod_annotation_prometheus_io_path
target_label: __metrics_path__
- action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
source_labels:
- __address__
- __meta_kubernetes_pod_annotation_prometheus_io_port
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- action: replace
source_labels:
- __meta_kubernetes_namespace
target_label: kubernetes_namespace
- action: replace
source_labels:
- __meta_kubernetes_pod_name
target_label: kubernetes_pod_name
|
为了将时间序列数据进行持久化,需要提前创建好这个 pvc 对象:(pvc.yaml
)
1
2
3
4
5
6
7
8
9
10
11
12
| apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: prometheus
namespace: monitor
spec:
storageClassName: managed-nfs-storage
accessModes:
- ReadWriteMany
resources:
requests:
storage: 10Gi
|
这里使用了 nfs 提供动态 pvc, 关于 nfs 动态 pvc 的配置参考
使用 nfs-subdir-external-provisioner 扩展 NFS 向 kubernetes 提供动态 pvc 功能
因为需要 prometheus 去访问 Kubernetes 的相关信息,所以我们为 prometheus 授予相应的权限:(rbac.yaml
)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
| apiVersion: v1
kind: Namespace
metadata:
name: monitor
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: prometheus
namespace: monitor
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: prometheus
rules:
- apiGroups:
- ""
resources:
- nodes
- services
- endpoints
- pods
- nodes/proxy
verbs:
- get
- list
- watch
- apiGroups:
- "extensions"
resources:
- ingresses
verbs:
- get
- list
- watch
- apiGroups:
- ""
resources:
- configmaps
- nodes/metrics
verbs:
- get
- nonResourceURLs:
- /metrics
verbs:
- get
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
name: prometheus
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: prometheus
subjects:
- kind: ServiceAccount
name: prometheus
namespace: monitor
|
最创建 prometheus 的资源配置清单(deployment.yaml
)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
| apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: prometheus
name: prometheus
namespace: monitor
spec:
replicas: 1
selector:
matchLabels:
app: prometheus
template:
metadata:
labels:
app: prometheus
spec:
serviceAccountName: prometheus
containers:
- name: prometheus
image: prom/prometheus:v2.26.0
command:
- "/bin/prometheus"
args:
- "--config.file=/etc/prometheus/prometheus.yml"
- "--storage.tsdb.path=/prometheus"
- "--storage.tsdb.retention.time=24h"
- "--web.console.libraries=/usr/share/prometheus/console_libraries"
- "--web.console.templates=/usr/share/prometheus/consoles"
- "--web.enable-lifecycle"
- "--web.enable-admin-api"
ports:
- containerPort: 9090
protocol: TCP
name: http
volumeMounts:
- name: data
mountPath: "/prometheus"
- name: config
mountPath: "/etc/prometheus"
resources:
requests:
cpu: 100m
memory: 512Mi
limits:
cpu: 200m
memory: 512Mi
securityContext:
runAsUser: 0
volumes:
- name: data
persistentVolumeClaim:
claimName: prometheus
- name: config
configMap:
name: prometheus-cfg
defaultMode: 0644
---
apiVersion: v1
kind: Service
metadata:
labels:
app: prometheus
name: prometheus
namespace: monitor
spec:
ports:
- name: "http"
port: 9090
protocol: TCP
targetPort: 9090
selector:
app: prometheus
type: ClusterIP
|
有一个要注意的地方是我们这里必须要添加一个 securityContext
的属性,将其中的 runAsUser
设置为 0
,
这是因为现在的 prometheus
运行过程中使用的用户是 nobody
,否则会出现下面的 permission denied
之类的权限错误
按顺序应用资源配置清单
1
2
3
4
| kubectl apply -f rbac.yaml
kubectl apply -f pvc.yaml
kubectl apply -f configmap.yaml
kubectl apply -f deployment.yaml
|
使用 traefik 暴露 prometheus 服务: ingress-route.yaml
1
2
3
4
5
6
7
8
9
10
11
12
13
14
| apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
name: prometheus
namespace: monitor
spec:
entryPoints:
- web
routes:
- match: Host(`mon.host.com`)
kind: Rule
services:
- name: prometheus
port: 9090
|
应用
1
| kubectl apply -f ingress-route.yaml
|
准备 node-exporter.yaml
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
| apiVersion: apps/v1
kind: DaemonSet
metadata:
labels:
app: node-exporter
name: node-exporter
namespace: monitor
spec:
selector:
matchLabels:
app: node-exporter
template:
metadata:
labels:
app: node-exporter
spec:
hostPID: true
hostIPC: true
hostNetwork: true
containers:
- name: node-exporter
image: prom/node-exporter:v1.1.2
command:
- "/bin/node_exporter"
args:
- "--path.procfs=/host/proc"
- "--path.sysfs=/host/sys"
- "--collector.filesystem.ignored-mount-points='^/(dev|proc|sys|host|etc)($|/)'"
ports:
- containerPort: 9100
protocol: TCP
name: http
securityContext:
privileged: true
volumeMounts:
- name: dev
mountPath: "/host/dev"
- name: proc
mountPath: "/host/proc"
- name: sys
mountPath: "/host/sys"
- name: rootfs
mountPath: "/rootfs"
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 100m
memory: 128Mi
tolerations:
- key: "node-role.kubernetes.io/master"
operator: "Exists"
effect: "NoSchedule"
volumes:
- name: proc
hostPath:
path: /proc
- name: dev
hostPath:
path: /dev
- name: sys
hostPath:
path: /sys
- name: rootfs
hostPath:
path: /
|
由于我们需要监控所有节点包括主节点,所有需要忽略主节点上的污点, 加入 tolerations
配置项,使用 DaemonSet
方式部署
1
| kubectl apply -f node-exporter.yaml
|
查看 Prometheus 监控 Targets 状态
grafana.yaml
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
| ---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: grafana
namespace: monitor
spec:
storageClassName: managed-nfs-storage
accessModes:
- ReadWriteMany
resources:
requests:
storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: grafana
namespace: monitor
labels:
app: grafana
spec:
revisionHistoryLimit: 10
selector:
matchLabels:
app: grafana
template:
metadata:
labels:
app: grafana
spec:
containers:
- name: grafana
image: grafana/grafana:7.5.4
imagePullPolicy: IfNotPresent
ports:
- containerPort: 3000
name: grafana
env:
- name: GF_SECURITY_ADMIN_USER
value: admin
- name: GF_SECURITY_ADMIN_PASSWORD
value: admin321
readinessProbe:
failureThreshold: 10
httpGet:
path: /api/health
port: 3000
scheme: HTTP
initialDelaySeconds: 60
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 30
livenessProbe:
failureThreshold: 3
httpGet:
path: /api/health
port: 3000
scheme: HTTP
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
resources:
limits:
cpu: 100m
memory: 256Mi
requests:
cpu: 100m
memory: 256Mi
volumeMounts:
- mountPath: /var/lib/grafana
subPath: grafana
name: storage
securityContext:
runAsUser: 0
volumes:
- name: storage
persistentVolumeClaim:
claimName: grafana
---
apiVersion: v1
kind: Service
metadata:
labels:
app: grafana
name: grafana
namespace: monitor
spec:
ports:
- name: "http"
port: 3000
protocol: TCP
targetPort: 3000
selector:
app: grafana
type: ClusterIP
---
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
name: grafana
namespace: monitor
spec:
entryPoints:
- web
routes:
- match: Host(`grafana.host.com`)
kind: Rule
services:
- name: grafana
port: 3000
|
应用 grafana.yaml
1
| kubectl apply -f grafana.yml
|
插件安装有两种方式:
- 进入 Container 中,执行
grafana-cli plugins install $plugin_name
- 手动下载插件zip包,访问 https://grafana.com/grafana/plugins/ 下载 zip 包解压到
/var/lib/grafana/plugins
目录中
插件安装完毕后,需要重启 Grafana
的 Pod
进入容器运行如下命令安装插件
1
2
3
4
5
| grafana-cli plugins install grafana-kubernetes-app
grafana-cli plugins install grafana-clock-panel
grafana-cli plugins install grafana-piechart-panel
grafana-cli plugins install briangann-gauge-panel
grafana-cli plugins install natel-discrete-panel
|
配置 prometheus 数据源
浏览器打开 grafana.host.com
使用用户名: admin, 密码: admin321 登录,进入 Configuration -> Data Sources -> Add Data Source
选择 Prometheus
Url 为 http://prometheus:9090
Access: Server(default)
其他均为默认。
配置 kubernetes 插件
进入插件管理页面,先启用 kubernetes 插件
配置 kubernetes 插件