feat(ingest/mlflow): add dataset lineage #12837

Open
wants to merge 10 commits into
base: master


yoonhyejin
Collaborator

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added the ingestion (PR or Issue related to the ingestion of metadata) label Mar 11, 2025

codecov bot commented Mar 11, 2025

Codecov Report

Attention: Patch coverage is 28.26087% with 33 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
...a-ingestion/src/datahub/ingestion/source/mlflow.py | 28.26% | 33 Missing ⚠️


@yoonhyejin yoonhyejin requested a review from hsheth2 March 14, 2025 08:49
@yoonhyejin yoonhyejin marked this pull request as ready for review March 14, 2025 08:49
@datahub-cyborg datahub-cyborg bot added the needs-review (Label for PRs that need review from a maintainer) label Mar 14, 2025
source_type = dataset_input.dataset.source_type
dataset_tags = {k[1]: v[1] for k, v in dataset_input.tags}
dataset = dataset_input.dataset
platform = self._get_dataset_platform_from_source_type(source_type)
Collaborator

Looks like _get_dataset_platform_from_source_type is being called twice.
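
A minimal sketch of the usual fix, assuming the second call sits in the same code path: resolve the platform once and reuse the local variable. The class, the mapping, and the `process` method are hypothetical stand-ins for illustration; only the method name comes from the diff.

```python
class SourceSketch:
    """Hypothetical stand-in for the ingestion source; only the call pattern matters."""

    # Assumed source-type-to-platform mapping, for illustration only.
    _PLATFORMS = {"snowflake": "snowflake", "local": "mlflow"}

    def __init__(self):
        self.calls = 0  # counts lookups so the single call is observable

    def _get_dataset_platform_from_source_type(self, source_type):
        self.calls += 1
        return self._PLATFORMS.get(source_type)

    def process(self, source_type):
        # Resolve the platform once, then reuse the local variable
        # instead of calling the lookup again further down.
        platform = self._get_dataset_platform_from_source_type(source_type)
        if platform is None:
            return None
        return f"urn:li:dataPlatform:{platform}"
```

With this shape, each call to `process` triggers exactly one platform lookup.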

    dataset_reference_urns.append(str(local_dataset_reference.urn))
# Otherwise, we create a hosted dataset reference and a hosted dataset
else:
    hosted_dataset = Dataset(
Collaborator

IMO we should not be generating dataset entities for non-MLflow platforms.

Here we should just build the URN: hosted_dataset_urn = DatasetUrn.... If that URN exists, lineage will show up by default. If it doesn't exist, users will need to go into the UI and click "show hidden edges" to make the edges show up.
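
A rough sketch of that suggestion, not the PR's actual code: every upstream dataset contributes a URN for lineage, but only MLflow-native datasets are queued for entity emission. `HostedDatasetRef` and `upstream_urns` are hypothetical names, and the URN is formatted by hand here so the example has no dependencies; in the real source this would presumably go through the DatasetUrn class mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostedDatasetRef:
    """Hypothetical reference to a dataset used as an MLflow run input."""

    platform: str  # e.g. "snowflake" or "mlflow"
    name: str  # platform-specific dataset name
    env: str = "PROD"

    def urn(self) -> str:
        # DataHub dataset URN string format, written out by hand.
        return (
            f"urn:li:dataset:(urn:li:dataPlatform:{self.platform},"
            f"{self.name},{self.env})"
        )


def upstream_urns(refs, mlflow_platform="mlflow"):
    """Return (urns_to_link, refs_to_emit).

    All references are linked as lineage by URN, but only datasets that
    live on the MLflow platform are returned for entity emission.
    """
    urns = [ref.urn() for ref in refs]
    to_emit = [ref for ref in refs if ref.platform == mlflow_platform]
    return urns, to_emit
```

For example, a Snowflake input would yield a lineage URN but no new dataset entity, leaving it to the Snowflake ingestion source to own that entity.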

@datahub-cyborg datahub-cyborg bot added pending-submitter-response (Issue/request has been reviewed but requires a response from the submitter) and removed needs-review (Label for PRs that need review from a maintainer) labels Mar 14, 2025
Labels

  • ingestion: PR or Issue related to the ingestion of metadata
  • pending-submitter-response: Issue/request has been reviewed but requires a response from the submitter

2 participants