Commit 28c7111

Add initial docs on k8s healthchecks
Signed-off-by: Nigel Jones <[email protected]>
1 parent 95fae6c commit 28c7111

Lines changed: 215 additions & 0 deletions
<!-- SPDX-License-Identifier: CC-BY-4.0 -->
<!-- Copyright Contributors to the ODPi Egeria project 2020. -->

There are various API calls that check the status of Egeria.

These are typically used in a Kubernetes environment to check whether Egeria is ready to service requests. This page summarizes what is available.

## Example API calls

In these examples the [httpie](https://httpie.io) tool is used, as it prints both the response code and a pretty-formatted body by default. Other tools such as curl may also be used, but their responses may need more parsing.

The examples here were run against the lab charts, using user 'garygeeke'. A simple security plugin is active which restricts user access to API calls.

### Platform
This checks if the *platform* is available.

#### Platform is not running

If the platform is not running, the call below gets no response and curl gives up once the 5 second timeouts expire. When the platform is reachable, the same call prints the platform origin string:

    ➜ ~ curl -k -X GET --connect-timeout 5 --max-time 5 "https://44623abc-eu-gb.lb.appdomain.cloud:9443/open-metadata/admin-services/users/admin/server-origin"
    Egeria OMAG Server Platform (version 4.1-SNAPSHOT)

#### Platform is running

    ➜ ~ http --verify=no --pretty=format GET "https://44623abc-eu-gb.lb.appdomain.cloud:9443/open-metadata/admin-services/users/admin/server-origin"
    HTTP/1.1 200
    Connection: keep-alive
    Content-Length: 42
    Content-Type: text/plain;charset=UTF-8
    Date: Thu, 18 May 2023 17:10:27 GMT
    Keep-Alive: timeout=60

    Egeria OMAG Server Platform (version 4.0)

### Server

This checks if a specific *server* is active on the platform.

#### Server is not known

The HTTP status is still 200; the failure is reported in the body through `relatedHTTPCode` (404 here).

```
➜ ~ http --verify=no --pretty=format GET "https://44623abc-eu-gb.lb.appdomain.cloud:9443/open-metadata/admin-services/users/admin/servers/cocoMDS99/instance/status"
HTTP/1.1 200
Connection: keep-alive
Content-Type: application/json
Date: Thu, 18 May 2023 17:08:15 GMT
Keep-Alive: timeout=60
Transfer-Encoding: chunked

{
    "actionDescription": "getActiveServerStatus",
    "class": "OMAGServerStatusResponse",
    "exceptionClassName": "org.odpi.openmetadata.frameworks.connectors.ffdc.InvalidParameterException",
    "exceptionErrorMessage": "OMAG-MULTI-TENANT-404-001 The OMAG Server cocoMDS99 is not available to service a request from user admin",
    "exceptionErrorMessageId": "OMAG-MULTI-TENANT-404-001",
    "exceptionErrorMessageParameters": [
        "cocoMDS99",
        "admin"
    ],
    "exceptionProperties": {
        "parameterName": "serverName",
        "serverName": "cocoMDS99"
    },
    "exceptionSystemAction": "The system is unable to process the request because the server is not running on the called platform.",
    "exceptionUserAction": "Verify that the correct server is being called on the correct platform and that this server is running. Retry the request when the server is available.",
    "relatedHTTPCode": 404
}

```

#### No permission for API call

```
➜ ~ http --verify=no --pretty=format GET "https://44623abc-eu-gb.lb.appdomain.cloud:9443/open-metadata/admin-services/users/admin/servers/cocoMDS5/instance/status"
HTTP/1.1 200
Connection: keep-alive
Content-Type: application/json
Date: Thu, 18 May 2023 17:07:28 GMT
Keep-Alive: timeout=60
Transfer-Encoding: chunked

{
    "actionDescription": "validateUserForServer",
    "class": "OMAGServerStatusResponse",
    "exceptionClassName": "org.odpi.openmetadata.commonservices.ffdc.exceptions.UserNotAuthorizedException",
    "exceptionErrorMessage": "OMAG-PLATFORM-SECURITY-403-002 User admin is not authorized to issue a request to server cocoMDS5",
    "exceptionSystemAction": "The system is unable to process a request from the user because they do not have access to the requested OMAG server. The request fails with a UserNotAuthorizedException exception.",
    "exceptionUserAction": "Determine whether the user should have access to the server. If they should have, take steps to add them to the authorized list of users. If this user should not have access, investigate where the request came from to determine if the system is under attack, or it was a mistake, or the user's tool is not configured to connect to the correct server.",
    "relatedHTTPCode": 403
}
```

#### Server is available

```
➜ ~ http --verify=no --pretty=format GET "https://44623abc-eu-gb.lb.appdomain.cloud:9443/open-metadata/admin-services/users/garygeeke/servers/cocoMDS2/instance/status"
HTTP/1.1 200
Connection: keep-alive
Content-Type: application/json
Date: Thu, 18 May 2023 17:06:46 GMT
Keep-Alive: timeout=60
Transfer-Encoding: chunked

{
    "class": "OMAGServerStatusResponse",
    "relatedHTTPCode": 200,
    "serverStatus": {
        "serverActiveStatus": "RUNNING",
        "serverName": "cocoMDS2",
        "serverType": "Metadata Access Store",
        "services": [
            {
                "serviceName": "Subject Area OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Security Officer OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Open Metadata Repository Services (OMRS)",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Data Privacy OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Community Profile OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Asset Consumer OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Asset Lineage OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Open Metadata Store Services",
                "serviceStatus": "STARTING"
            },
            {
                "serviceName": "Asset Catalog OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "IT Infrastructure OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Asset Owner OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Connected Asset Services",
                "serviceStatus": "STARTING"
            },
            {
                "serviceName": "Digital Architecture OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Glossary View OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Governance Program OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Project Management OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Governance Engine OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Open Integration Service",
                "serviceStatus": "STARTING"
            }
        ]
    }
}
```

## Interpreting the API calls

A timeout will occur if the platform is not running.
In all other cases an HTTP 200 is returned, and any failure is reported in the response body through `relatedHTTPCode`.

Additionally, the server status call returns fine-grained information about all the services configured in a server.

In the simplest case it would be reasonable to treat the server as not available until all of its services are running.

## What about connectors?

Each service running on the platform may depend on connectors, such as those for topics (Kafka) or, in the case of integration, technology connectors such as one to a database. Each connector may behave differently, and in many cases will not report any issue for a transient error.

## Defining a Kubernetes health check

See also the [Kubernetes docs](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/).

Kubernetes has 3 types of health checks:
- startup -- to confirm a pod has started
- readiness -- to confirm a pod is ready. Typically this then allows requests to be routed to it
- liveness -- to check the pod is still responding to requests in a timely fashion

Typically a container is restarted if its startup or liveness check does not pass within the specified time period, while a pod that fails its readiness check simply stops receiving requests.

Each of these checks can be of several types:
- tcpSocket -- just checks for an open port
- grpc -- issues a gRPC call (we do not use gRPC in Egeria)
- httpGet -- a simple GET. If the response code is >=200 and <400 it is successful
- exec -- issues a specified command within the container

Looking at the API calls above, all of them return 200 if they return anything at all, so an httpGet check against them will succeed whenever the platform responds.
Therefore a simple httpGet check cannot be used to tell whether a server is ready.

Instead an 'exec'
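
As an illustration only, an `exec` probe could run a small script inside the container and rely on its exit status, since Kubernetes treats exit code 0 as success and any other value as failure. The Python sketch below makes several assumptions that do not come from the Egeria documentation: that a Python interpreter is available in the image, that the platform listens on `https://localhost:9443`, and that the server name and user are passed as arguments (the defaults are simply the values used in the examples above). `probe_exit_code` and `main` are hypothetical names.

```python
import json
import ssl
import urllib.request


def probe_exit_code(body: str) -> int:
    """Return 0 if the status body reports a fully running server, otherwise 1."""
    try:
        doc = json.loads(body)
    except ValueError:
        return 1
    if not isinstance(doc, dict):
        return 1
    status = doc.get("serverStatus") or {}
    services = status.get("services") or []
    ok = (doc.get("relatedHTTPCode") == 200
          and status.get("serverActiveStatus") == "RUNNING"
          and all(s.get("serviceStatus") == "RUNNING" for s in services))
    return 0 if ok else 1


def main(argv):
    # Hypothetical defaults; adjust to the deployment.
    server = argv[1] if len(argv) > 1 else "cocoMDS2"
    user = argv[2] if len(argv) > 2 else "garygeeke"
    url = (f"https://localhost:9443/open-metadata/admin-services/users/{user}"
           f"/servers/{server}/instance/status")
    context = ssl.create_default_context()
    context.check_hostname = False       # mirrors --verify=no; test environments only
    context.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=5, context=context) as response:
            return probe_exit_code(response.read().decode("utf-8"))
    except OSError:
        return 1  # platform unreachable or timed out
```

When saved as a script, the entry point would be `sys.exit(main(sys.argv))` (after `import sys`), so that the probe's result is carried by the process exit code.
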
