Exclude airflow runner internals from Operator failure tracebacks #20060

hterik · 2021-12-02T16:32:29Z

hterik
Dec 2, 2021

Description

Given a simple dag like this:

from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.operators.bash import BashOperator

with DAG("instantfail",
         description="instantfail",
         tags=["example"],
         catchup=False,
         schedule_interval=None,
         start_date=days_ago(2),
        ) as dag:

    BashOperator(bash_command="echo hello && false", task_id="b")

The error log from the failing bash command, false, becomes incredibly big with lots of irrelevant information for the writer of the DAG, if that's not enough, the error is printed twice:

[2021-12-02, 17:07:01 UTC] {subprocess.py:85} INFO - Output:
[2021-12-02, 17:07:01 UTC] {subprocess.py:89} INFO - hello
[2021-12-02, 17:07:01 UTC] {subprocess.py:93} INFO - Command exited with return code 1
[2021-12-02, 17:07:01 UTC] {taskinstance.py:1703} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/homedir/.local/share/virtualenvs/airflow-uWHa5xnK/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 1332, in _run_raw_task
    self._execute_task_with_callbacks(context)
  File "/homedir/.local/share/virtualenvs/airflow-uWHa5xnK/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 1458, in _execute_task_with_callbacks
    result = self._execute_task(context, self.task)
  File "/homedir/.local/share/virtualenvs/airflow-uWHa5xnK/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 1514, in _execute_task
    result = execute_callable(context=context)
  File "/homedir/.local/share/virtualenvs/airflow-uWHa5xnK/lib/python3.9/site-packages/airflow/operators/bash.py", line 187, in execute
    raise AirflowException(
airflow.exceptions.AirflowException: Bash command failed. The command returned a non-zero exit code 1.
[2021-12-02, 17:07:01 UTC] {taskinstance.py:1270} INFO - Marking task as FAILED. dag_id=instantfail, task_id=b, execution_date=20211202T160659, start_date=20211202T160701, end_date=20211202T160701
[2021-12-02, 17:07:01 UTC] {standard_task_runner.py:88} ERROR - Failed to execute job 292 for task b
Traceback (most recent call last):
  File "/homedir/.local/share/virtualenvs/airflow-uWHa5xnK/lib/python3.9/site-packages/airflow/task/task_runner/standard_task_runner.py", line 85, in _start_by_fork
    args.func(args, dag=self.dag)
  File "/homedir/.local/share/virtualenvs/airflow-uWHa5xnK/lib/python3.9/site-packages/airflow/cli/cli_parser.py", line 48, in command
    return func(*args, **kwargs)
  File "/homedir/.local/share/virtualenvs/airflow-uWHa5xnK/lib/python3.9/site-packages/airflow/utils/cli.py", line 92, in wrapper
    return f(*args, **kwargs)
  File "/homedir/.local/share/virtualenvs/airflow-uWHa5xnK/lib/python3.9/site-packages/airflow/cli/commands/task_command.py", line 292, in task_run
    _run_task_by_selected_method(args, dag, ti)
  File "/homedir/.local/share/virtualenvs/airflow-uWHa5xnK/lib/python3.9/site-packages/airflow/cli/commands/task_command.py", line 107, in _run_task_by_selected_method
    _run_raw_task(args, ti)
  File "/homedir/.local/share/virtualenvs/airflow-uWHa5xnK/lib/python3.9/site-packages/airflow/cli/commands/task_command.py", line 180, in _run_raw_task
    ti._run_raw_task(
  File "/homedir/.local/share/virtualenvs/airflow-uWHa5xnK/lib/python3.9/site-packages/airflow/utils/session.py", line 70, in wrapper
    return func(*args, session=session, **kwargs)
  File "/homedir/.local/share/virtualenvs/airflow-uWHa5xnK/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 1332, in _run_raw_task
    self._execute_task_with_callbacks(context)
  File "/homedir/.local/share/virtualenvs/airflow-uWHa5xnK/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 1458, in _execute_task_with_callbacks
    result = self._execute_task(context, self.task)
  File "/homedir/.local/share/virtualenvs/airflow-uWHa5xnK/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 1514, in _execute_task
    result = execute_callable(context=context)
  File "/homedir/.local/share/virtualenvs/airflow-uWHa5xnK/lib/python3.9/site-packages/airflow/operators/bash.py", line 187, in execute
    raise AirflowException(
airflow.exceptions.AirflowException: Bash command failed. The command returned a non-zero exit code 1.
[2021-12-02, 17:07:01 UTC] {local_task_job.py:154} INFO - Task exited with return code 1
[2021-12-02, 17:07:01 UTC] {local_task_job.py:264} INFO - 0 downstream tasks scheduled from follow-on schedule check

Often operators can be be more advanced, especially PythonOperator and taskflow which frequently creates equally deep stack traces themselves, which is usually what you are interested in, but together it accumulates into several pages of error.

I'm usually the guy who thinks a stack trace is the best error message and like the context it provides, but here i think it goes a bit too far.

Why is the error printed twice?
The writer of a dag should be able to assume the execution path up to execute_callable() was sane. If i'm debugging a DAG or an Operator i don't want to debug Airflow task runner internals. I don't know how true this is but this should be the goal at least.

Expected log output should only contain

  File "airflow/operators/bash.py", line 187, in execute
    raise AirflowException(
airflow.exceptions.AirflowException: Bash command failed. The command returned a non-zero exit code 1.

Can TaskInstance._execute_task be made to catch the operators exception and rethrow as something else, that can be ignored at the end by standard_task_runner?

Or is this something that simply should be solved by logging configuration?

Use case/motivation

No response

Related issues

No response

Are you willing to submit a PR?

Yes I am willing to submit a PR!

Code of Conduct

I agree to follow this project's Code of Conduct

potiuk · 2021-12-05T23:51:11Z

potiuk
Dec 5, 2021
Collaborator

I think it depends on whose view you take into account, I am for adding as much information as possible in the stack trace, because you never know when more stack trace is useful. If a user reports a "half hearted" stack trace I am usually not able to help as a maintainer.

Surely, If the information is duplicated - and you know how to de-duplicate it reliably - feel free to open PR for that. It does look like in this case it should be possible. But I am personally not loosing my sleep over it. I personally usually err on the side of having more information of errors than needed rather than risking I miss some information - especially in cases like that when you might expect "ANY" error and ANY environment the error might come from.

Hiding anything in this case is IMHO far too easy to loose crucial information that might help maintainers to debug problems. Surely in this case proably most of the stack trace is repetitive most of the time. But also those errors are serving a different purpose - expect unexpected. And thanks to the stack traces you wil sometimes be able to diagnose much more than you think. For example, I cannot count how many times I discovered that someone's process actuallly use a different version of the software than they thought - simply because I noticed that it's impossible to get the stacktrace with this particular line number in platform's stack-trace.

I'd rather personally focus (and I do for quite some time) on making the "known" error messages more actionable and explanatory to the users when we "know" what the issue is.

But I think getting rid of the full stack trace from logs (other than de-duplication) will make it far harder to get help by our users, which for me trumps the conciseness of the logs.

I am not sure if this is an "iissue"/feature. Maybe others woudl be interested in chiming in, so I am turning it into a discussion, but if you have any concrete PR proposals to improve the logging of particular cases (without loosing the "observability" of the issues when they appear - you are most welcome @hterik) .

0 replies

hterik · 2021-12-07T08:20:41Z

hterik
Dec 7, 2021
Author

The way is see it is you have 4 layers of users here:

Airflow developers
Operator and provider developers
DAG authors
This sortof blends into (2) given PythonOperator, BashOperator and @task, let's consider such part of layer 3 here.
End users. - Starting dags via UI or external triggers, looking at results via logs.
These might not even have Python experience at all, and will be very intimidated by a two page long stack trace.

A DAG can fail for reasons caused by any of these.

I want to focus on mainly on (4) here - If a user provides faulty input, the dag might fail. How can a dag-author provide a friendly error message to the user?

Imagine something like this, is this possible to do today without ever seeing _start_by_fork in the log (ignore that this particular example can be solved with jsonschema on input params)

@dag()
def my_dag():
    @task
    def my_task():        
        context = get_current_context()
        if context["params"]["some_param"] < 0:
            raise InvalidUserInput("some_param is not allowed to be negative")
        
        do_stuff(context["params"]["some_param"])

Note though that here i don't want to remove the stack trace entirely. For the dag author (3), who will be first line of support for the user (4), the stack trace within the dag-file still makes perfect sense, especially given errors further down the stack in do_stuff. But not the 10 layers above it.

I think this is also what you mean with improving 'known' error messages. The question is, how can authors, in (2) and (3) above, categorize errors as known. Reducing the log-noise of such would itself be a great improvement. Having some fancy UI for it even better.

1 reply

potiuk Dec 7, 2021
Collaborator

I want to focus on mainly on (4) here - If a user provides faulty input, the dag might fail. How can a dag-author provide a friendly error message to the user?

And yet comment here: we have a great new feature for precisely that case:

https://airflow.apache.org/docs/apache-airflow/stable/concepts/params.html

When (3) exposes parameters for the user (4) they use DAG params. And those params can be validated, by the json.schema. Which is one of the best places where you can provide feedback to the user that the parameters are good /wrong (before the dag has been even triggered - so feedback is instanteneous). This is a new 2.2 feature so you need to upgrade to 2.2 to make use of it.

If you cannot validate it with the json.schema (i.e. error is only apparent when the task started to run), then it's your job as the DAG writer to make sure that the error is nicely printed in the log (of course if you are able to foresee it). But (unless I am wrong - please correct me) - there is not much we can do in Airfllow to guide that. The task that will fail will be "failed", so the thing you do there - you go to he logs of the task and find out what's wrong. It's up to the (2) and (3) to make sure that the message will be "Actionable". There is not much that can be done on "Airflow" level.

Again, maybe I am mistaken, but seems to be that you are "complaining" to (1) "airflow developer" that in some cases (2) and (3) do not do a good job in communicating to (4) even if they have all the tools to do so. Which I am not sure (1) can do more than just give the tools (which we did).

potiuk · 2021-12-07T12:31:59Z

potiuk
Dec 7, 2021
Collaborator

I think this is also what you mean with improving 'known' error messages. The question is, how can authors, in (2) and (3) above, categorize errors as known. Reducing the log-noise of such would itself be a great improvement. Having some fancy UI for it even better

Very good assesment. And I tihnk all that is happening already and needs no "fixing".

I think simply both (2) and (3) should "catch the known errrors" and print the friendly message. This is already happening in a number of places where (2) and (3) thought about it deliberately and produced those messages.

Example from one of google cloud hooks:

        except NotFound:
            self.log.warning('Topic does not exist: %s', topic_path)

This is an absolute standard way of handing a "known" exception in Python,. Everyone does it when they know the unfriendly exception should be turned into a friendly message.

Do you think there is anything to do here?
Do you think the operators, provider developers and DAG authors need some guidance on this kind of practice (which I believe is an absolute standard for any developer and needs no explanation)?
Or do you think something additional needs to be done on "airflow framework" to help to make decision by the "writer" on when a problem is "known" (i.e. turned into a log message) or unknown (i.e. exception raised with stack trace).
How do you imagine this kind of "guidance" or "platform support"?

5 replies

potiuk Dec 7, 2021
Collaborator

And just to be perfectly clear - probably there are a number of places where we could turn "unknown" errors into friendler messages. And if you pin-point any of such places, feel absolutely free to make PRs to fix them. This is pretty much a continuous quest by both contributors and committers to detect and improve such places and you are most welcome to join the effort!

hterik Dec 7, 2021
Author

Even if you log a curated warning-message, in the UI it is presented in the same log-stream as as the exception, right before it, leading back to the original problem - too much errors on one page. Consider a non-developer reading this log, for them even finding the friendly message inside the 40 lines, of what in their eyes might as well have been a hexdump, is not going to be easy.
(btw, the google cloud hook failure also throws an exception right after it logs the warning, and it must, because otherwise downstream tasks will not get stopped if I'm not mistaken)

Now if the operator does log something curated, i could disable logs from standard_task_runner and task_instance via log config to avoid the traceback, but some other logs there are still relevant and this would lead to even bigger risk of relevant logs being lost. This also assumes that all operators in use have curated messages for every possible failure mode, because if they don't it would lead to errors that are really impossible to figure out. A stack trace from execute() to the point of error is never going to miss any of such.

One could also resort to the scream louder in the logs solution, by making all operator error logs go into a big banner, that should increase the chance of user seeing the curated message before they get lost in the stack trace. But this requires control over each operator, a centralized handling of this would be preferable.

[2021-12-02, 17:07:01 UTC] {my_operator.py:123} 
    -------------------------------------------
    -    ERROR:
    -    Topic does not exist: /x/y/z
    -------------------------------------------
[2021-12-02, 17:07:01 UTC] {taskinstance.py:1703} ERROR - Task failed with exception
...rest of 40 line stack trace comes here

So in conclusion i would say that yes, there is something missing in the framework that in a bulletproof and consistent way, without too much overhead on operator-authors, presents an error for the user in a clear manner without getting drowned by other log-entries that also looks like errors. I might be able to upload a proposal of how i envision things can be better, at least shouldn't log same exception twice :).

(part of this is also fault of how verbose Python stack traces are, compared to say Java where you have indentation separating the message from the stack, only one line per item and no absolute paths, maybe i should change my logformatters formatException instead 😸 )

potiuk Dec 7, 2021
Collaborator

(btw, the google cloud hook failure also throws an exception right after it logs the warning, and it must, because otherwise downstream tasks will not get stopped if I'm not mistaken)

Yep I saw it and I waited for that comment. This is precisesly what falls into your (1-4) explanation. The stack trace will be printed only if DAG write wants it - there is a flag controlling it. That's apparenly conscious choice that (2) made to give the freedom to (3) to request it so that (4) can analyse the error..

One could also resort to the scream louder in the logs solution, by making all operator error logs go into a big banner, that should increase the chance of user seeing the curated message before they get lost in the stack trace. But this requires control over each operator, a centralized handling of this would be preferable.

But we already have the way. Marking log as WARNING or ERROR is precisely this. Easy, standard, everyone does it. See: https://docs.python.org/3/howto/logging.html. It's not provided by Airflow. It's provided by Python logging framework.

And I really encourage you to read more about it's capabilities.

Structured not pre-formatted logging is the way to go. Marking messages as ERRORS makes the message important - not the way how it is formatted. This is how "airflow platform" uses logging from python (like pretty much any other python application) and gives you a lot of ways how you can customize logs content if you find the default not "good enough".

Usually enterprise users look at the logs using centralized logging system such as cloud watch etc. The Airflow UI is really a rudimentary view that provides basic functions. Usually people are looking at the logs in a structured form where they can search for SEVERITY (in this case WARNING/ERRORS level), and location, code, message and this is what logging system provides.

More. We have quite comprehensive documentation about it.

Here is the link describing logging: https://airflow.apache.org/docs/apache-airflow/2.2.2/logging-monitoring/logging-tasks.html and you will find link there on how to configure AWS, GCP, Elasticsearch, Azure - https://airflow.apache.org/docs/apache-airflow-providers/core-extensions/logging.html - all those require structured logs. But even if you do not want to use any of those "enterprise" systems and use the UI logs that Airlflow displays - it allows yoy to configre even the "local" logs available in UI in whatever way you need: https://airflow.apache.org/docs/apache-airflow/2.2.2/logging-monitoring/logging-tasks.html#advanced-configuration

Let me repeat it - this is absolutely standard way how you configure Python apps when it comes to logging. You coming from Java world might be not aware of it (though there are similar capabilities there). But we not only blindly tapped into it - we (1) also provided our users a documentation on how they configure it.

Now - it's literally up to the person who deploys Airlfow in an organisation (call it person (5) because you have not mentioned in your assesment) to decide how to format the messages. If the person thinks that it will be better to wrap ERROR messages in such "standing out" messages, it's super easy to do and platform developers (1) already thought about it and provided (above) documentation on that and delegated the task to (5): Airflow admins.

And this is the only way actually. The (5) are the only ones who actually know who are the (4) users of what (1), (2), (3) developed and what are their expectations, what logging system they usually use to find details about errors, and how they integrate with the organisation's standards. None of the (1), (2), (3) know what the (4) will be using and what expectations the (4) users will have, so it is up to the (5) to configure formatting that suits best (4) - and fits the organisation's expectations.

In conclusion, the (1) had done everything they could. They tapped into standard framework provided by Python, they gave (2) and (3) a standard interface they should use to provide structured logging, where severity of the error is marked with the logging level. And provided a rudimentary formatting as a base, and delegated to (5) the way to format it so that (4) could get the logs in the way and in the system they are used to.

This is precisely the "centralized handling" between all those.

We (1) simply cannot make any assumptions on formatting expected. That's why we chose to use a basic, typical Python formatter (but again - you can freely exchange it) as a starting point. And (5) can customize it as they see fit for the organisation.

I hope it is clearer now. And I hope you now know that you can (and should) configure your logging if you find that you need it.

malthe Jan 6, 2022
Collaborator

While it would be possible to implement a custom logging handler that would record log records in a more structured way, Python's logging library is rather text-oriented. The logging.Handler class expects a text formatter etc.

I do think the current logging output is a little unfriendly. It's HTML after all, so it feels a little dated to just render text.

And specifically for tracebacks, it would be nice to have some hide/show functionality where a user could expand an error line and see a traceback but it wouldn't be the first thing you'd see as a user trying to troubleshoot a task failure.

hterik Jan 10, 2022
Author

Sliding a bit OT here but re on text logs being rather limited, generally it would be nice if tasks could provide a more refined custom UI for their non-text-oriented outputs - not just for errors.
Today there is the task external link and the xcom-table and soon also option to add manual notes, but they are all hidden and a bit non-ergonomic with too many clicks to reach and doesn't provide a fully integrated experience. With such in place a task could opt-in to having a dedicated "error-output."
Even with a structural logging system (we do use Grafana Loki, which colors, formats and allows toggling tracebacks and filtering by levels and components), it's still rather awkward to tweak such log-filters and it's a rather steep requirement to expect everybody to run such a stack, both operationally and for users to understand it.

zachliu · 2025-02-26T22:29:53Z

zachliu
Feb 26, 2025

@hterik i feel your pain. what i did was set number_logs_exception = 1 in the EcsRunTaskOperator

...useless traceback for non-programming end users
airflow.exceptions.AirflowException: This task is not in success state - last 1 logs from Cloudwatch:
~~~~~ THERE ARE ERRORS. PLEASE SCROLL UP TO SEE THEM! ~~~~~~

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Exclude airflow runner internals from Operator failure tracebacks #20060

{{title}}

Replies: 4 comments 6 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

Exclude airflow runner internals from Operator failure tracebacks #20060

hterik Dec 2, 2021

Description

Use case/motivation

Related issues

Are you willing to submit a PR?

Code of Conduct

Replies: 4 comments · 6 replies

potiuk Dec 5, 2021 Collaborator

hterik Dec 7, 2021 Author

potiuk Dec 7, 2021 Collaborator

potiuk Dec 7, 2021 Collaborator

potiuk Dec 7, 2021 Collaborator

hterik Dec 7, 2021 Author

potiuk Dec 7, 2021 Collaborator

malthe Jan 6, 2022 Collaborator

hterik Jan 10, 2022 Author

zachliu Feb 26, 2025

hterik
Dec 2, 2021

Replies: 4 comments 6 replies

potiuk
Dec 5, 2021
Collaborator

hterik
Dec 7, 2021
Author

potiuk Dec 7, 2021
Collaborator

potiuk
Dec 7, 2021
Collaborator

potiuk Dec 7, 2021
Collaborator

hterik Dec 7, 2021
Author

potiuk Dec 7, 2021
Collaborator

malthe Jan 6, 2022
Collaborator

hterik Jan 10, 2022
Author

zachliu
Feb 26, 2025