-
Notifications
You must be signed in to change notification settings - Fork 3.3k
add lineage support for output columns of all select queries availabl… #26241
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…e in QueryCompletedEvent
Hey @Praveen2112 could you please provide feedback thank you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to check on tests once again
@@ -86,6 +87,7 @@ public class QueryInfo | |||
private final RetryPolicy retryPolicy; | |||
private final boolean pruned; | |||
private final NodeVersion version; | |||
private final List<ColumnLineageInfo> selectColumnLineageInfo; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it be Optional ? In case of CREATE VIEW/CTAS kind of operations would it be empty ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so, if you look at the constructor you'll see requireNonNull(selectColumnLineageInfo, "selectColumnLineageInfo is null");
so it's guaranteed to be non-null and will be simply empty for other cases (create view etc.). I also think Optional collections are non-ergonomic and cause a lot unnecssary empty checks while the data-structure itself inherently handles it. It also looks like the rest of collections in this class follow the same pattern so I didn't make it Optional. If you have strong preference on it, happy to change it.
public class ColumnLineageInfo | ||
{ | ||
private final String name; | ||
private final int index; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any specific reason for maintaining the index ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes because when we get visibleFields we can NOT make any assumption about the order of fields:
/**
* Gets only the visible fields.
* No assumptions should be made about the order of the fields returned from this method.
* To obtain the index of a field, call indexOf.
*/
public Collection<Field> getVisibleFields()
{
return visibleFields;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is particularly helpful when we have unnamed output columns.
@@ -1594,6 +1617,9 @@ protected Scope visitQuery(Query node, Optional<Scope> scope) | |||
.build(); | |||
|
|||
analysis.setScope(node, queryScope); | |||
if (isTopLevel) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any specific reason for computing them in the StatementAnalyzer
- Can we compute the same in Analysis
based on getRootScope().getRelationType().getVisibleFields()
? StatementAnalyzer
should capture the lineage details for all columns, but populating it for a specific set of columns should be done in Analysis
- cc @martint
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what you mean, architecturally what you suggest makes more sense StatementAnalyzer
should invoke lineage analysis and Analysis
class should do the compute logic. Let me take a stab at it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried using getRootScope().getRelationType().getVisibleFields()
as you suggested in this commit
but a bunch of tests started failing. It seems like it's because scope was missing. I simply moved the logic to analysis but do pass the scope that is guaranteed to exist from StatementAnalyzer
to Analysis
. Here is the stack trace of missing scope.
@JsonCreator | ||
public ColumnLineageInfo(@JsonProperty String outputExpression, @JsonProperty int index, @JsonProperty Set<ColumnDetail> sourceColumns) | ||
{ | ||
this.name = outputExpression; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: rnn - Can we also have a coherent naming for the field and constructor argument ?
{ | ||
this.name = outputExpression; | ||
this.index = index; | ||
this.sourceColumns = sourceColumns; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: rnn and ImmutableList#copyOf
@Praveen2112 thank you for reviewing this and your feedback. I believe I addressed your questions PTAL again. |
@Praveen2112 could you please review again? thank you. |
Description
ReOpening #23322 since it went stale.
Column lineage info is extremely helpful but is only available for Create View, Table etc. not Select queries. This PR expands the support for lineage for all select queries as well.
Additional context and related issues
Adds column lineage to
QueryCompletedEvent
so it'll be available inEventListener
. This PR is a follow up to this one that went stale.Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
(x) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text: