Skip to content

add lineage support for output columns of all select queries availabl… #26241

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

behrooz-stripe
Copy link

@behrooz-stripe behrooz-stripe commented Jul 18, 2025

Description

ReOpening #23322 since it went stale.
Column lineage info is extremely helpful but is only available for Create View, Table etc. not Select queries. This PR expands the support for lineage for all select queries as well.

Additional context and related issues

Adds column lineage to QueryCompletedEvent so it'll be available in EventListener. This PR is a follow up to this one that went stale.

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
(x) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

@behrooz-stripe
Copy link
Author

Hey @Praveen2112 could you please provide feedback thank you!

Copy link
Member

@Praveen2112 Praveen2112 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to check on tests once again

@@ -86,6 +87,7 @@ public class QueryInfo
private final RetryPolicy retryPolicy;
private final boolean pruned;
private final NodeVersion version;
private final List<ColumnLineageInfo> selectColumnLineageInfo;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be Optional ? In case of CREATE VIEW/CTAS kind of operations would it be empty ?

Copy link
Author

@behrooz-stripe behrooz-stripe Aug 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, if you look at the constructor you'll see requireNonNull(selectColumnLineageInfo, "selectColumnLineageInfo is null"); so it's guaranteed to be non-null and will be simply empty for other cases (create view etc.). I also think Optional collections are non-ergonomic and cause a lot unnecssary empty checks while the data-structure itself inherently handles it. It also looks like the rest of collections in this class follow the same pattern so I didn't make it Optional. If you have strong preference on it, happy to change it.

public class ColumnLineageInfo
{
private final String name;
private final int index;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any specific reason for maintaining the index ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes because when we get visibleFields we can NOT make any assumption about the order of fields:

 /**
     * Gets only the visible fields.
     * No assumptions should be made about the order of the fields returned from this method.
     * To obtain the index of a field, call indexOf.
     */
    public Collection<Field> getVisibleFields()
    {
        return visibleFields;
    }

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is particularly helpful when we have unnamed output columns.

@@ -1594,6 +1617,9 @@ protected Scope visitQuery(Query node, Optional<Scope> scope)
.build();

analysis.setScope(node, queryScope);
if (isTopLevel) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any specific reason for computing them in the StatementAnalyzer - Can we compute the same in Analysis based on getRootScope().getRelationType().getVisibleFields() ? StatementAnalyzer should capture the lineage details for all columns, but populating it for a specific set of columns should be done in Analysis - cc @martint

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean, architecturally what you suggest makes more sense StatementAnalyzer should invoke lineage analysis and Analysis class should do the compute logic. Let me take a stab at it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried using getRootScope().getRelationType().getVisibleFields() as you suggested in this commit
but a bunch of tests started failing. It seems like it's because scope was missing. I simply moved the logic to analysis but do pass the scope that is guaranteed to exist from StatementAnalyzer to Analysis. Here is the stack trace of missing scope.
image

@JsonCreator
public ColumnLineageInfo(@JsonProperty String outputExpression, @JsonProperty int index, @JsonProperty Set<ColumnDetail> sourceColumns)
{
this.name = outputExpression;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: rnn - Can we also have a coherent naming for the field and constructor argument ?

{
this.name = outputExpression;
this.index = index;
this.sourceColumns = sourceColumns;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: rnn and ImmutableList#copyOf

@behrooz-stripe
Copy link
Author

@Praveen2112 thank you for reviewing this and your feedback. I believe I addressed your questions PTAL again.

@behrooz-stripe
Copy link
Author

@Praveen2112 could you please review again? thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

3 participants