[Kubernetes] Install OTel(OpenTelemetry) Operator Using Helm Chart
See the Helm installation and overview reference. {: .prompt-info }
🔀 Example OTel flow: #
graph TD
  A[Application] --> B[OTel SDK]
  B --> C[OTel Collector]
  C --> D[Logs]
  C --> E[Metrics]
  C --> F[Traces]
  D --> G[Loki]
  E --> H[Prometheus]
  F --> I[Tempo/Jaeger]
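The Collector stage of this flow can be sketched as a pipeline configuration. This is only a minimal sketch: the backend hostnames (`loki`, `prometheus`, `tempo`) and their ports are assumptions for illustration, and the `loki` exporter is only available in Collector distributions that ship it (such as contrib).

```yaml
receivers:
  otlp: # the OTel SDK in the application sends OTLP here
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  loki: # logs -> Loki (hypothetical in-cluster address)
    endpoint: http://loki:3100/loki/api/v1/push
  otlphttp/prometheus: # metrics -> Prometheus OTLP endpoint (hypothetical address)
    metrics_endpoint: http://prometheus:9090/api/v1/otlp/v1/metrics
  otlp/tempo: # traces -> Tempo (hypothetical address)
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    logs:
      receivers: [otlp]
      exporters: [loki]
    metrics:
      receivers: [otlp]
      exporters: [otlphttp/prometheus]
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]
```

Each signal gets its own pipeline, which is what lets one Collector fan out to three different backends.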
Install Cert-manager #
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/latest/download/cert-manager.yaml

Why Cert-manager is needed when installing OTel (OpenTelemetry) #
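- Securing traffic with HTTPS: by default the OTel (OpenTelemetry) Collector receives and sends data over plain HTTP, so enabling HTTPS is recommended. Cert-manager automates certificate issuance and management, which makes it easier to configure the Collector to use HTTPS safely.
- Automatic certificate renewal: HTTPS certificates expire, and once a certificate has expired, clients reject connections to the Collector and telemetry stops flowing. Cert-manager renews certificates before they expire, which prevents that outage.
- Ease of use: with Cert-manager you can deploy and operate the OpenTelemetry Collector securely without issuing and managing certificates by hand.
Cert-manager is not strictly required to run OTel (OpenTelemetry) itself, but it is needed if you want it to manage the certificates used for HTTPS. The operator manifest installed in the next section also expects Cert-manager to provision the TLS certificate for the operator's admission webhook. Cert-manager runs only on Kubernetes. {: .prompt-warning }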
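The issuance and renewal described above can be sketched with two Cert-manager resources. This is a minimal sketch under assumptions: the names `selfsigned-issuer` and `otel-collector-tls`, the `cluster` namespace, and the DNS names are hypothetical, and a self-signed issuer is used only to keep the example short.

```yaml
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: selfsigned-issuer # hypothetical name
  namespace: cluster
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: otel-collector-tls # hypothetical name
  namespace: cluster
spec:
  secretName: otel-collector-tls # Secret that will hold tls.crt / tls.key
  duration: 2160h # certificate lifetime (90 days)
  renewBefore: 360h # renew 15 days before expiry
  dnsNames: # hypothetical Service DNS names of the collector
    - otel-otlp-collector.cluster.svc
    - otel-otlp-collector.cluster.svc.cluster.local
  issuerRef:
    name: selfsigned-issuer
    kind: Issuer
```

The resulting Secret can then be mounted into the Collector pod and referenced from a receiver's `tls` settings; `renewBefore` is what drives the automatic renewal.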
Install OpenTelemetry Operator #
- Install OpenTelemetry Operator
kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml

OpenTelemetry Operator: see the installation reference. {: .prompt-info }
- Install Helm Chart - OpenTelemetry Operator
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
helm install opentelemetry-operator open-telemetry/opentelemetry-operator --set "manager.collectorImage.repository=otel/opentelemetry-collector-k8s"

OpenTelemetry Operator: see the Helm installation reference. {: .prompt-info }
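The same setting can be kept in a values file instead of a `--set` flag, which is easier to review and version. A minimal sketch, assuming a file named `values.yaml` (the file name is arbitrary); the `admissionWebhooks.certManager.enabled` key is shown explicitly only to make the Cert-manager dependency visible.

```yaml
# values.yaml (hypothetical file name)
manager:
  collectorImage:
    # Default Collector image used for OpenTelemetryCollector resources
    repository: otel/opentelemetry-collector-k8s
admissionWebhooks:
  certManager:
    # Let Cert-manager issue the webhook certificate (requires Cert-manager to be installed)
    enabled: true
```

It would be applied with `helm install opentelemetry-operator open-telemetry/opentelemetry-operator -f values.yaml`.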
Two things the OpenTelemetry Operator manages #
- OpenTelemetry Collector
- auto-instrumentation of the workloads using OpenTelemetry instrumentation libraries
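When there are many projects, you do not have to wire up an OpenTelemetry Collector and an auto-instrumentation agent alongside each one by hand. With the Operator you can install a Collector per project, and instead of declaring the agent in every workload you add an annotation and the Operator injects it into the matching Pods (a Collector as a sidecar container, the instrumentation agent through an init container).
Deploying the OpenTelemetry Collector with the OpenTelemetry Operator #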
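As an illustration of the annotation-driven injection, the sketch below defines an `Instrumentation` resource and a workload that opts in to Java auto-instrumentation. Everything named `my-*`, the `my-project` namespace, the image, and the collector endpoint are hypothetical; the endpoint assumes a collector Service reachable over OTLP/HTTP on port 4318.

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default-instrumentation # hypothetical name
  namespace: my-project # hypothetical namespace
spec:
  exporter:
    # Hypothetical collector address; must accept OTLP over HTTP
    endpoint: http://otel-otlp-collector.cluster.svc.cluster.local:4318
  propagators:
    - tracecontext
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "1" # sample every trace
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-java-app # hypothetical workload
  namespace: my-project
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-java-app
  template:
    metadata:
      labels:
        app: my-java-app
      annotations:
        # Asks the Operator to inject the Java agent using the Instrumentation
        # resource found in the same namespace
        instrumentation.opentelemetry.io/inject-java: "true"
    spec:
      containers:
        - name: app
          image: my-registry/my-java-app:latest # hypothetical image
```

The annotation goes on the Pod template, not on the Deployment itself, because the Operator's webhook acts when Pods are created.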
OpenTelemetry Collector #
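- Three Collector distributions are covered here:
- opentelemetry-collector: provides the core functionality
- opentelemetry-collector-contrib: extends opentelemetry-collector with additional components so that it can be used in a wide range of environments
- opentelemetry-collector-k8s: a selection of components from opentelemetry-collector and contrib, built specifically for monitoring a Kubernetes cluster and its components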
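The distribution can also be chosen per collector through the `image` field of the `OpenTelemetryCollector` resource, which overrides the operator-wide default image. A minimal sketch, assuming a hypothetical collector named `otel-example`; the `latest` tag is used only for brevity and would normally be pinned to a specific version.

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-example # hypothetical name
spec:
  # Use the contrib distribution for this collector only
  image: otel/opentelemetry-collector-contrib:latest
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    exporters:
      debug: {}
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [debug]
```

This matters when a pipeline references a component that the default distribution does not include: the Collector fails to start on an unknown receiver, processor, or exporter.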
Node Collector (DaemonSet) #
- File logs
- Host metrics
- Kubelet stats metrics
This collector groups the receivers that the official documentation recommends running as a DaemonSet.
Log | Filelog #
The targets are Kubernetes and application logs written to stdout/stderr, so this effectively replaces Fluent Bit. Beyond scraping and forwarding logs, you should also consider the processors mentioned under Processors.
- Receiver: Filelog Receiver
- Exporter: Loki exporter
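A log-collecting DaemonSet along these lines might look like the sketch below. It rests on several assumptions: the collector name `otel-node-logs`, the Loki address, a contrib image that ships both the `filelog` receiver and the `loki` exporter, a Collector version recent enough to support the `container` operator, and a container runtime that writes pod logs under `/var/log/pods`.

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-node-logs # hypothetical name
  namespace: cluster
spec:
  mode: daemonset
  image: otel/opentelemetry-collector-contrib:latest
  volumes:
    - name: varlogpods
      hostPath:
        path: /var/log/pods
  volumeMounts:
    - name: varlogpods
      mountPath: /var/log/pods
      readOnly: true
  config:
    receivers:
      filelog:
        include:
          - /var/log/pods/*/*/*.log
        start_at: end # only read lines written after startup
        include_file_path: true # the container operator derives pod metadata from the path
        operators:
          - type: container # parses the container runtime log format
    processors:
      batch: {}
    exporters:
      loki:
        # Hypothetical in-cluster Loki address
        endpoint: http://loki.cluster.svc.cluster.local:3100/loki/api/v1/push
    service:
      pipelines:
        logs:
          receivers: [filelog]
          processors: [batch]
          exporters: [loki]
```

The host path mount is what makes a DaemonSet the natural fit here: each collector pod can only read the log files of the node it runs on.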
Metric | Kubelet Stats #
Covers infrastructure resource metrics such as CPU, memory, filesystem, and network I/O and errors for nodes, pods, containers, and volumes, pulled from the API exposed by the kubelet on each node. It effectively replaces cAdvisor.
- Receiver: Kubelet Stats Receiver
- Exporter: OTLP/HTTP Exporter
Metric | Host Metrics #
The targets are node-level metrics (cpu, disk, CPU load, filesystem, memory, network, paging, process, ...), so this effectively replaces the Prometheus Node Exporter. Some of these overlap with the Kubelet Stats Receiver, so running both requires handling the duplicates.
- Receiver: Host Metrics Receiver
- Exporter: OTLP/HTTP Exporter
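One possible way to handle the overlap, shown here only as a sketch, is to leave the `node` group out of the Kubelet Stats Receiver and let the Host Metrics Receiver own node-level data. The fragment below assumes it is merged into a collector `config` that already defines the `memory_limiter` and `batch` processors and the `otlphttp/prometheus` exporter, and that the host filesystem is mounted at `/hostfs`.

```yaml
receivers:
  kubeletstats:
    auth_type: serviceAccount
    endpoint: https://${env:NODE_NAME}:10250
    insecure_skip_verify: true
    metric_groups: # "node" is omitted: node-level metrics come from hostmetrics
      - pod
      - container
      - volume
  hostmetrics:
    collection_interval: 10s
    root_path: /hostfs # assumes the host root is mounted here
    scrapers:
      cpu: {}
      load: {}
      memory: {}
      filesystem: {}
      network: {}

service:
  pipelines:
    metrics:
      receivers: [kubeletstats, hostmetrics]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/prometheus]
```

Whether this split is right depends on which metric names your dashboards rely on; dropping specific metrics in a processor is an alternative if both sources must stay enabled.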
# The otel-node-collector ServiceAccount is created automatically by the operator
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-node-collector
rules:
  - apiGroups: [""]
    resources: ["nodes/stats", "nodes/proxy"]
    verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-node-collector
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-node-collector
subjects:
  - kind: ServiceAccount
    name: otel-node-collector
    namespace: cluster
---
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-node
  namespace: cluster
  labels:
    app: otel-node-collector
spec:
  mode: daemonset
  resources:
    # requests:
    #   cpu: 10m
    #   memory: 10Mi
    limits:
      cpu: 500m
      memory: 1000Mi
  podAnnotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8888"
  env:
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
  # volumes:
  #   - name: hostfs
  #     hostPath:
  #       path: /
  # volumeMounts:
  #   - name: hostfs
  #     mountPath: /hostfs
  #     readOnly: true
  #     mountPropagation: HostToContainer
  config:
    extensions:
      health_check: # for k8s liveness and readiness probes
        endpoint: 0.0.0.0:13133 # default

    processors:
      batch: # buffer up to 10000 spans, metric data points, log records for up to 5 seconds
        send_batch_size: 10000
        timeout: 5s
      memory_limiter:
        check_interval: 1s # recommended by official README
        limit_percentage: 80 # with the 1000Mi limit above, the hard limit is 800Mi
        spike_limit_percentage: 25 # with the 1000Mi limit above, the soft limit is 550Mi (800 - 250)

    service:
      extensions:
        - health_check

      telemetry:
        logs:
          level: INFO
        metrics:
          address: 0.0.0.0:8888

      pipelines:
        metrics:
          receivers:
            - kubeletstats
            # - hostmetrics
          processors:
            - memory_limiter
            - batch
          exporters:
            - otlphttp/prometheus

    receivers:
      kubeletstats:
        auth_type: serviceAccount
        endpoint: https://${env:NODE_NAME}:10250
        collection_interval: 10s
        insecure_skip_verify: true
        extra_metadata_labels:
          - k8s.volume.type
        k8s_api_config:
          auth_type: serviceAccount
        metric_groups:
          - node
          - pod
          - container
          - volume

      # hostmetrics:
      #   collection_interval: 10s
      #   root_path: /hostfs
      #   scrapers:
      #     cpu: # CPU utilization metrics
      #     load: # CPU load metrics
      #     memory: # Memory utilization
      #     disk: # Disk I/O metrics
      #     filesystem: # File System utilization metrics
      #     network: # Network interface I/O metrics & TCP connection metrics
      #     paging: # Paging/Swap space utilization and I/O metrics
      #     processes: # Process count metrics
      #     process: # Per process CPU, Memory, and Disk I/O metrics
      #       # The settings below suppress the following hostmetrics error: 2024-05-12T01:06:30.683Z error scraperhelper/scrapercontroller.go:197 Error scraping metrics {"kind": "receiver", "name": "hostmetrics", "data_type": "metrics", "error": "error reading process executable for pid 1: readlink /hostfs/proc/1/exe: permission denied; error reading username for process \"systemd\" (pid 1): open /etc/passwd: no such file or directory;
      #       # refer: https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/28661
      #       mute_process_name_error: true
      #       mute_process_exe_error: true
      #       mute_process_io_error: true
      #       mute_process_user_error: true
      #       mute_process_cgroup_error: true

    exporters:
      debug:
        verbosity: basic # detailed, basic

      otlphttp/prometheus:
        metrics_endpoint: http://prometheus-server.cluster.svc.cluster.local:80/api/v1/otlp/v1/metrics
        tls:
          insecure: true

Cluster Collector (Single Pod) #
- k8s events(log)
- k8s objects(metrics)
These receivers should run as a single replica: according to the official documentation, running two or more instances can produce duplicate data, because both receivers observe the cluster as a whole. The collector is therefore deployed as a Deployment with one replica.
Log | Kubernetes Objects #
Used mainly to collect Kubernetes events, but it can also collect any other object served by the Kubernetes API server (run kubectl api-resources for the full list).
- Receiver: Kubernetes Objects Receiver
- Exporter: Loki exporter
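The receiver can be narrowed per object type. The fragment below is a sketch of the options I would expect to matter most; the namespaces, the label selector, and the pull interval are hypothetical values, and the collector's ClusterRole must grant `get`, `list`, and `watch` on every object listed (including the `events.k8s.io` API group if it is used).

```yaml
receivers:
  k8sobjects:
    auth_type: serviceAccount
    objects:
      - name: events
        mode: watch # stream changes as they happen
        group: events.k8s.io
        namespaces: [default, cluster] # hypothetical namespaces
      - name: pods
        mode: pull # list the objects periodically
        interval: 15m
        label_selector: app=my-app # hypothetical selector
```

`watch` suits high-churn objects such as events, while `pull` gives periodic snapshots of slower-moving objects.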
Metric | Kubernetes Cluster #
Effectively a replacement for Kube State Metrics: it extracts cluster-level metrics and entity events from the Kubernetes API server.
- Receiver: Kubernetes Cluster Receiver
- Exporter: OTLP/HTTP Exporter
# No ServiceAccount manifest is needed: the operator creates otel-cluster-collector
# in the collector's namespace, and the binding below targets that account.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector-opentelemetry-collector
rules:
  - apiGroups:
      - ''
    resources:
      - events
      - namespaces
      - namespaces/status
      - nodes
      - nodes/spec
      - pods
      - pods/status
      - replicationcontrollers
      - replicationcontrollers/status
      - resourcequotas
      - services
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - apps
    resources:
      - daemonsets
      - deployments
      - replicasets
      - statefulsets
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - extensions
    resources:
      - daemonsets
      - deployments
      - replicasets
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - batch
    resources:
      - jobs
      - cronjobs
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - autoscaling
    resources:
      - horizontalpodautoscalers
    verbs:
      - get
      - list
      - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector-opentelemetry-collector
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector-opentelemetry-collector
subjects:
  - kind: ServiceAccount
    name: otel-cluster-collector # the account the otel-cluster collector pod runs as
    namespace: cluster
---
# The otel-cluster-collector ServiceAccount is created automatically by the operator
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-cluster
  namespace: cluster
  labels:
    app: otel-cluster-collector
spec:
  mode: deployment
  replicas: 1
  podAnnotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8888"
  config:
    extensions:
      health_check: # for k8s liveness and readiness probes
        endpoint: 0.0.0.0:13133 # default

    processors:
      batch: # buffer up to 10000 spans, metric data points, log records for up to 5 seconds
        send_batch_size: 10000
        timeout: 5s
      memory_limiter:
        check_interval: 1s # recommended by official README
        limit_percentage: 80 # with 1000Mi of memory, the hard limit is 800Mi
        spike_limit_percentage: 25 # with 1000Mi of memory, the soft limit is 550Mi (800 - 250)
      attributes:
        actions: # actions is a list, so each entry starts with a dash
          - key: elasticsearch.index.prefix
            value: otel-k8sobject
            action: insert
    service:
      extensions:
        - health_check

      telemetry:
        logs:
          level: DEBUG
        metrics:
          address: 0.0.0.0:8888

      pipelines:
        logs:
          receivers:
            - k8sobjects
          processors:
            - memory_limiter
            - batch
            - attributes
          exporters:
            - debug
            - elasticsearch

        metrics:
          receivers:
            - k8s_cluster
          processors:
            - memory_limiter
            - batch
          exporters:
            - otlphttp/prometheus

    receivers:
      k8sobjects:
        objects:
          - name: pods
            mode: pull
          - name: events
            mode: watch
      k8s_cluster:
        collection_interval: 10s
        node_conditions_to_report:
          - Ready
          - MemoryPressure
        allocatable_types_to_report:
          - cpu
          - memory
          - ephemeral-storage
          - storage

    exporters:
      debug:
        verbosity: detailed # default is basic

      otlphttp/prometheus:
        metrics_endpoint: http://prometheus-server.cluster.svc.cluster.local:80/api/v1/otlp/v1/metrics
        tls:
          insecure: true

      elasticsearch:
        endpoints:
          - http://elasticsearch-es-http.cluster.svc.cluster.local:9200
        logs_index: ""
        logs_dynamic_index:
          enabled: true
        logstash_format:
          enabled: true
        user: anyflow
        password: mycluster
---
# Separate collector that ships Kubernetes events to Loki
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-cluster-k8s-events
  namespace: cluster
  labels:
    app: otel-cluster-collector
spec:
  mode: deployment
  replicas: 1
  config:
    receivers:
      k8s_events:
        auth_type: serviceAccount

    processors:
      batch: {}

    exporters:
      loki:
        # Hosted Loki: https://LOKI_USERNAME:ACCESS_POLICY_TOKEN@LOKI_URL/loki/api/v1/push
        # In-cluster Loki: http://<Loki-svc>.<Loki-Namespace>.svc/loki/api/v1/push
        endpoint: http://<Loki-svc>.<Loki-Namespace>.svc/loki/api/v1/push
    service:
      pipelines:
        logs:
          receivers: [k8s_events]
          processors: [batch]
          exporters: [loki]

Prometheus Collector (StatefulSet) #
- prometheus metrics
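The source gives no manifest for this collector, so the following is only a sketch of what a StatefulSet-mode collector with the `prometheus` receiver could look like. The name `otel-prometheus`, the single self-scrape job, and the reuse of the `prometheus-server` endpoint from the other manifests are assumptions. The operator's target allocator (`spec.targetAllocator`), which spreads scrape targets across replicas, is deliberately left out because it needs additional RBAC.

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-prometheus # hypothetical name
  namespace: cluster
spec:
  mode: statefulset
  replicas: 1
  config:
    receivers:
      prometheus:
        config:
          scrape_configs:
            # Example job: the collector scrapes its own telemetry endpoint
            - job_name: otel-collector
              scrape_interval: 30s
              static_configs:
                - targets: ["0.0.0.0:8888"]
    processors:
      batch: {}
    exporters:
      otlphttp/prometheus:
        metrics_endpoint: http://prometheus-server.cluster.svc.cluster.local:80/api/v1/otlp/v1/metrics
        tls:
          insecure: true
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          processors: [batch]
          exporters: [otlphttp/prometheus]
```

A StatefulSet gives each replica a stable identity, which is what makes a consistent assignment of scrape targets to replicas possible once more than one replica is used.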
OTLP Collector(Deployment) #
- Traces(OTEL)
- Generic OTEL Logs
- Generic OTEL metrics
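This is the general-purpose collector for signals that use the OTLP protocol and have no constraint on the replica count. With no such constraint, a Deployment is the easiest pattern to operate. It covers all three signals: metrics, logs, and traces (MLT).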
Trace | Generic OTEL trace #
Both Jaeger and Grafana Tempo support receiving OTLP natively.
- Receiver: OTLP Receiver
- Exporter: OTLP Exporter (gRPC)
Metric | Generic OTEL metric #
An endpoint for collecting metrics other than those discussed above, such as application-level metrics.
- Receiver: OTLP Receiver
- Exporter: OTLP/HTTP Exporter
Log | Generic OTEL log #
An endpoint for collecting other logs, including Istio's OTel access logs.
- Receiver: OTLP Receiver
- Exporter: Loki exporter
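For the Istio access logs mentioned above, Istio has to be told to send them to the collector over OTLP. The sketch below assumes Istio is configured through an `IstioOperator` resource, that the provider name `otel-access-log` is free to choose, and that the collector's Service is `otel-otlp-collector` in the `cluster` namespace; depending on the Istio version, the `Telemetry` API version may differ.

```yaml
# Mesh configuration: register the collector as an access-log provider
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    extensionProviders:
      - name: otel-access-log # hypothetical provider name
        envoyOtelAls:
          service: otel-otlp-collector.cluster.svc.cluster.local
          port: 4317
---
# Enable that provider for the whole mesh
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system
spec:
  accessLogging:
    - providers:
        - name: otel-access-log
```

Placing the `Telemetry` resource in the Istio root namespace applies it mesh-wide; a copy in an application namespace would limit it to that namespace.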
# The otel-otlp-collector ServiceAccount is created automatically by the operator
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-otlp
  namespace: cluster
  labels:
    app: otel-otlp-collector
spec:
  mode: deployment
  # replicas: 1
  autoscaler:
    minReplicas: 1
    maxReplicas: 2
  resources:
    # requests:
    #   cpu: 10m
    #   memory: 10Mi
    limits:
      cpu: 500m
      memory: 1000Mi
  podAnnotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8888"
  config:
    extensions:
      health_check: # for k8s liveness and readiness probes
        endpoint: 0.0.0.0:13133 # default

    processors:
      batch: # buffer up to 10000 spans, metric data points, log records for up to 5 seconds
        send_batch_size: 10000
        timeout: 5s
      memory_limiter:
        check_interval: 1s # recommended by official README
        limit_percentage: 80 # with the 1000Mi limit above, the hard limit is 800Mi
        spike_limit_percentage: 25 # with the 1000Mi limit above, the soft limit is 550Mi (800 - 250)

    service:
      extensions:
        - health_check

      telemetry:
        logs:
          level: INFO
        metrics:
          address: 0.0.0.0:8888

      pipelines:
        traces:
          receivers:
            - otlp
          processors:
            - memory_limiter
            - batch
          exporters:
            - debug
            - otlp/jaeger

        logs:
          receivers:
            - otlp
          processors:
            - memory_limiter
            - batch
          exporters:
            - debug
            - elasticsearch

        metrics:
          receivers:
            - otlp
          processors:
            - memory_limiter
            - batch
          exporters:
            - debug
            - otlphttp/prometheus

    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

    exporters:
      debug:
        verbosity: basic # detailed, basic

      otlp/jaeger:
        endpoint: jaeger-collector.istio-system.svc.cluster.local:4317
        tls:
          insecure: true

      otlphttp/prometheus:
        metrics_endpoint: http://prometheus-server.cluster.svc.cluster.local:80/api/v1/otlp/v1/metrics
        tls:
          insecure: true

      elasticsearch:
        endpoints:
          - http://elasticsearch-es-http.cluster.svc.cluster.local:9200
        logs_index: "istio-access-log"
        logs_dynamic_index:
          enabled: true
        logstash_format:
          enabled: true
        user: anyflow
        password: mycluster
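
An application that uses an OpenTelemetry SDK can be pointed at this collector with the standard OTLP environment variables. The sketch below assumes the operator exposes the collector as the Service `otel-otlp-collector` in the `cluster` namespace; the workload name, namespace, and image are hypothetical.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app # hypothetical workload
  namespace: my-project # hypothetical namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-registry/my-app:latest # hypothetical image
          env:
            - name: OTEL_SERVICE_NAME
              value: my-app
            # gRPC endpoint of the OTLP receiver (port 4317 in the manifest above)
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: http://otel-otlp-collector.cluster.svc.cluster.local:4317
            - name: OTEL_EXPORTER_OTLP_PROTOCOL
              value: grpc
```

To use OTLP/HTTP instead, the endpoint port would change to 4318 and the protocol to `http/protobuf`.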