Skip to content

Commit d1e3d78

Browse files
authored
MRG: Merge pull request #541 from octue/fix/improve-crash-diagnostics
Fix and improve crash diagnostics
2 parents b026de9 + 2ee1dff commit d1e3d78

39 files changed

+1917
-1031
lines changed

docs/source/inter_service_compatibility.rst

Lines changed: 119 additions & 117 deletions
Large diffs are not rendered by default.

docs/source/testing_services.rst

Lines changed: 8 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -172,7 +172,7 @@ To emulate your children in tests, patch the :mod:`Child <octue.resources.child.
172172
app_src=app_directory_path,
173173
twine=os.path.join(app_directory_path, "twine.json"),
174174
children=children,
175-
service_id="you/your-service:latest",
175+
service_id="your-org/your-service:latest",
176176
)
177177
178178
emulated_children = [
@@ -212,7 +212,8 @@ change correspondingly (or at all). So, it's up to you to define a set of realis
212212
the child's twine - this is because the twine is only available to the real child. This is ok - you're testing your
213213
service, not the child.
214214

215-
You can create test fixtures manually or by recording messages from a real child to a JSON file. To record messages:
215+
You can create test fixtures manually or by using the ``Child.received_messages`` property after questioning a real
216+
child.
216217

217218
.. code-block:: python
218219
@@ -225,20 +226,13 @@ You can create test fixtures manually or by recording messages from a real child
225226
backend={"name": "GCPPubSubBackend", "project_name": "my-project"},
226227
)
227228
228-
result = child.ask(
229-
input_values=[1, 2, 3, 4],
230-
record_messages_to="child_messages.json",
231-
)
232-
233-
with open("child_messages.json") as f:
234-
child_messages = json.load(f)
229+
result = child.ask(input_values=[1, 2, 3, 4])
235230
236-
child_messages
231+
child.received_messages
237232
>>> [
238233
{
239234
'type': 'delivery_acknowledgement',
240235
'delivery_time': '2022-08-16 11:49:57.244263',
241-
'message_number': 0
242236
},
243237
{
244238
'type': 'log_record',
@@ -248,14 +242,11 @@ You can create test fixtures manually or by recording messages from a real child
248242
'levelname': 'INFO',
249243
...
250244
},
251-
'analysis_id': '0ce8386d-564d-47fa-9d11-3b728f557bfe',
252-
'message_number': 1
253245
},
254246
{
255247
'type': 'result',
256248
'output_values': {"some": "results"},
257249
'output_manifest': None,
258-
'message_number': 2
259250
}
260251
]
261252
@@ -266,7 +257,9 @@ You can then feed these into a child emulator to emulate one possible response o
266257
from octue.cloud.emulators import ChildEmulator
267258
268259
269-
child_emulator = ChildEmulator(messages=child_messages)
260+
child_emulator = ChildEmulator(messages=child.received_messages)
270261
271262
child_emulator.ask(input_values=[1, 2, 3, 4])
272263
>>> {"some": "results"}
264+
265+
You can also create test fixtures from :ref:`downloaded service crash diagnostics <test_fixtures_from_crash_diagnostics>`.

docs/source/troubleshooting_services.rst

Lines changed: 88 additions & 23 deletions
Original file line numberDiff line numberDiff line change
@@ -2,40 +2,27 @@
22
Troubleshooting services
33
========================
44

5-
Allowing crash diagnostics
6-
==========================
7-
A parent can give a child permission to save the following data to the cloud in the event the child fails while
8-
processing a question:
5+
Crash diagnostics
6+
=================
7+
Services save the following data to the cloud if they crash while processing a question:
98

109
- Input values
1110
- Input manifest and datasets
1211
- Child configuration values
1312
- Child configuration manifest and datasets
14-
- Messages sent from the child to the parent
13+
- Inputs to and messages received in answer to each question the service asked its children (if it has any). These are
14+
stored in the order the questions were asked.
1515

16-
The parent can give permission on a question-by-question basis by setting ``allow_save_diagnostics_data_on_crash=True``
17-
in :mod:`Child.ask <octue.resources.child.Child.ask>`. For example:
16+
.. important::
1817

19-
.. code-block:: python
20-
21-
child = Child(
22-
id="my-organisation/my-service:latest",
23-
backend={"name": "GCPPubSubBackend", "project_name": "my-project"},
24-
)
25-
26-
answer = child.ask(
27-
input_values={"height": 32, "width": 3},
28-
allow_save_diagnostics_data_on_crash=True,
29-
)
18+
For this feature to be enabled, the child must have the ``crash_diagnostics_cloud_path`` field in its service
19+
configuration (:ref:`octue.yaml <octue_yaml>` file) set to a Google Cloud Storage path.
3020

31-
For crash diagnostics to be saved, the child must have the ``crash_diagnostics_cloud_path`` field in its service
32-
configuration (:ref:`octue.yaml <octue_yaml>` file) set to a Google Cloud Storage path.
3321

3422
Accessing crash diagnostics
3523
===========================
36-
In the event of a child crash, the child will upload the crash diagnostics and send the cloud path to them to the
37-
parent as a log message. A user with credentials to access this path can use the ``octue`` CLI to retrieve the crash
38-
diagnostics data:
24+
In the event of a crash, the service will upload the crash diagnostics and send the upload path to the parent as a log
25+
message. A user with credentials to access this path can use the ``octue`` CLI to retrieve the crash diagnostics data:
3926

4027
.. code-block:: shell
4128
@@ -59,4 +46,82 @@ More information on the command:
5946
--local-path DIRECTORY The path to a directory to store the directory of
6047
diagnostics data in. Defaults to the current working
6148
directory.
49+
--download-datasets If provided, download any datasets from the crash
50+
diagnostics and update their paths in their
51+
manifests to the new local paths.
6252
-h, --help Show this message and exit.
53+
54+
.. _test_fixtures_from_crash_diagnostics:
55+
56+
Creating test fixtures from crash diagnostics
57+
=============================================
58+
You can create test fixtures directly from crash diagnostics, allowing you to recreate the exact conditions that caused
59+
your service to fail.
60+
61+
.. code-block:: python
62+
63+
from unittest.mock import patch
64+
65+
from octue import Runner
66+
from octue.utils.testing import load_test_fixture_from_crash_diagnostics
67+
68+
69+
(
70+
configuration_values,
71+
configuration_manifest,
72+
input_values,
73+
input_manifest,
74+
child_emulators,
75+
) = load_test_fixture_from_crash_diagnostics(path="path/to/downloaded/crash/diagnostics")
76+
77+
# You can explicitly specify your children here as shown or
78+
# read the same information in from your app configuration file.
79+
children = [
80+
{
81+
"key": "my_child",
82+
"id": "octue/my-child-service:latest",
83+
"backend": {
84+
"name": "GCPPubSubBackend",
85+
"project_name": "my-project",
86+
}
87+
},
88+
{
89+
"key": "another_child",
90+
"id": "octue/another-child-service:latest",
91+
"backend": {
92+
"name": "GCPPubSubBackend",
93+
"project_name": "my-project",
94+
}
95+
}
96+
]
97+
98+
runner = Runner(
99+
app_src="path/to/directory_containing_app",
100+
twine=os.path.join(app_directory_path, "twine.json"),
101+
children=children,
102+
configuration_values=configuration_values,
103+
configuration_manifest=configuration_manifest,
104+
service_id="your-org/your-service:latest",
105+
)
106+
107+
with patch("octue.runner.Child", side_effect=child_emulators):
108+
analysis = runner.run(input_values=input_values, input_manifest=input_manifest)
109+
110+
111+
Disabling crash diagnostics
112+
===========================
113+
When asking a question to a child, parents can disable crash diagnostics upload in the child on a question-by-question
114+
basis by setting ``allow_save_diagnostics_data_on_crash`` to ``False`` in :mod:`Child.ask <octue.resources.child.Child.ask>`.
115+
For example:
116+
117+
.. code-block:: python
118+
119+
child = Child(
120+
id="my-organisation/my-service:latest",
121+
backend={"name": "GCPPubSubBackend", "project_name": "my-project"},
122+
)
123+
124+
answer = child.ask(
125+
input_values={"height": 32, "width": 3},
126+
allow_save_diagnostics_data_on_crash=False,
127+
)

octue/app_loading.py

Lines changed: 63 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,63 @@
1+
import importlib
2+
import logging
3+
import os
4+
import sys
5+
6+
7+
logger = logging.getLogger(__name__)
8+
9+
10+
class AppFrom:
11+
"""A context manager that imports the module "app" from a file named "app.py" in the given directory on entry (by
12+
making a temporary addition to the system path) and unloads it (by deleting it from `sys.modules`) on exit. It will
13+
issue a warning if an existing module called "app" is already loaded. Usage example:
14+
15+
```python3
16+
with AppFrom('/path/to/dir') as app:
17+
Runner().run(app)
18+
```
19+
20+
:param str app_path: path to directory containing module named "app.py".
21+
:return None:
22+
"""
23+
24+
def __init__(self, app_path="."):
25+
self.app_path = os.path.abspath(os.path.normpath(app_path))
26+
logger.debug("Initialising AppFrom context at app_path %s", self.app_path)
27+
self.app_module = None
28+
29+
def __enter__(self):
30+
# Warn on an app present on the system path
31+
if "app" in sys.modules.keys():
32+
logger.warning(
33+
"Module 'app' already on system path. Using 'AppFrom' context will yield unexpected results. Avoid "
34+
"using 'app' as a python module, except for your main entrypoint."
35+
)
36+
37+
# Insert the present directory first on the system path.
38+
sys.path.insert(0, self.app_path)
39+
40+
# Import the app from the present directory.
41+
self.app_module = importlib.import_module("app")
42+
43+
# Immediately clean up the entry to the system path (don't use "remove" because if the user has it in their
44+
# path, this'll be an unexpected side effect, and don't do it in cleanup in case the called code inserts a path)
45+
sys.path.pop(0)
46+
logger.debug("Imported app at app_path and cleaned up temporary modification to sys.path %s", self.app_path)
47+
return self
48+
49+
def __exit__(self, exc_type, exc_value, traceback):
50+
"""Unload the imported module.
51+
52+
:return None:
53+
"""
54+
try:
55+
del sys.modules["app"]
56+
logger.debug("Deleted app from sys.modules")
57+
58+
except KeyError:
59+
context_manager_name = type(self).__name__
60+
logger.warning(
61+
f"Module 'app' was already removed from the system path prior to exiting the {context_manager_name} "
62+
f"context manager. Using the {context_manager_name} context may yield unexpected results."
63+
)

octue/cli.py

Lines changed: 40 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -20,7 +20,7 @@
2020
from octue.definitions import MANIFEST_FILENAME, VALUES_FILENAME
2121
from octue.exceptions import ServiceAlreadyExists
2222
from octue.log_handlers import apply_log_handler, get_remote_handler
23-
from octue.resources import service_backends
23+
from octue.resources import Manifest, service_backends
2424
from octue.runner import Runner
2525
from octue.utils.encoders import OctueJSONEncoder
2626
from twined import Twine
@@ -289,25 +289,61 @@ def start(service_config, revision_tag, timeout, no_rm):
289289
@click.option(
290290
"--local-path",
291291
type=click.Path(file_okay=False),
292-
default=None,
292+
default=".",
293293
help="The path to a directory to store the directory of diagnostics data in. Defaults to the current working "
294294
"directory.",
295295
)
296-
def get_crash_diagnostics(cloud_path, local_path):
296+
@click.option(
297+
"--download-datasets",
298+
is_flag=True,
299+
help="If provided, download any datasets from the crash diagnostics and update their paths in the configuration and "
300+
"input manifests to the new local paths.",
301+
)
302+
def get_crash_diagnostics(cloud_path, local_path, download_datasets):
297303
"""Download crash diagnostics for an analysis from the given directory in Google Cloud Storage. The cloud path
298304
should end in the analysis ID.
299305
300306
CLOUD_PATH: The path to the directory in Google Cloud Storage containing the diagnostics data.
301307
"""
302308
analysis_id = storage.path.split(cloud_path)[-1]
303-
local_path = os.path.join((local_path or "."), analysis_id)
309+
local_path = os.path.join(local_path, analysis_id)
310+
311+
if download_datasets:
312+
filter = None
313+
else:
314+
filter = lambda blob: any(
315+
(
316+
blob.name.endswith(f"configuration_{VALUES_FILENAME}"),
317+
blob.name.endswith(f"configuration_{MANIFEST_FILENAME}"),
318+
blob.name.endswith(f"input_{VALUES_FILENAME}"),
319+
blob.name.endswith(f"input_{MANIFEST_FILENAME}"),
320+
blob.name.endswith("questions.json"),
321+
)
322+
)
304323

305324
GoogleCloudStorageClient().download_all_files(
306325
local_path=local_path,
307326
cloud_path=cloud_path,
327+
filter=filter,
308328
recursive=True,
309329
)
310330

331+
# Update the manifests with the local paths of the datasets.
332+
if download_datasets:
333+
for manifest_type in ("configuration_manifest", "input_manifest"):
334+
manifest_path = os.path.join(local_path, manifest_type + ".json")
335+
336+
if not os.path.exists(manifest_path):
337+
continue
338+
339+
manifest = Manifest.from_file(manifest_path)
340+
341+
manifest.update_dataset_paths(
342+
path_generator=lambda dataset: os.path.join(local_path, f"{manifest_type}_datasets", dataset.name)
343+
)
344+
345+
manifest.to_file(manifest_path)
346+
311347
logger.info("Downloaded crash diagnostics from %r to %r.", cloud_path, local_path)
312348

313349

0 commit comments

Comments
 (0)