Issue 253: Add file skipping with delta #254
Codecov Report
@@              Coverage Diff               @@
##           main-1.0.0     #254     +/-   ##
==============================================
- Coverage       91.02%   90.65%   -0.37%
==============================================
  Files              95       98       +3
  Lines            2528     2569      +41
  Branches          323      339      +16
==============================================
+ Hits             2301     2329      +28
- Misses            227      240      +13
|
I will approve this PR, but before merging I am waiting to understand if benchmarking the solution is necessary. |
I gave this PR a shot. I tested it locally with the NYC Yellow Cab trip data (https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page). I fetched the Parquet files for 2022 and 2023 and ingested them as a qbeast table. The schema changes across the files, so I ended up executing the following to ingest the data:
I executed the following test query without the changes in this PR:
and I got bad numbers:
with the changes in this PR, the execution time becomes a lot better:
|
Great test @fpj !!! Thanks for giving it a try 💯
|
I'm not seeing any significant difference when sampling between the execution time without the changes here and with the changes. Here is what I'm seeing:
@osopardo1 Is this behavior expected? |
Yes @fpj, the behaviour is expected, since the sampling filters are applied in the same way as before. Now, if this PR has shown an improvement for WHERE clauses and no regression for sampling, I think it's ready for merge 👍 |
Description
This is a rework of the query implementation. This PR always uses the internal Delta query engine,
except for queries with a sampling clause. For the latter, the Qbeast engine is used.
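The routing rule described above can be illustrated with a minimal sketch. This is not the code in this PR: `Query`, `choose_engine`, and the string labels `"delta"` and `"qbeast"` are hypothetical names used only for illustration, and the handling of a query that has both a WHERE filter and a sampling clause is an assumption based on the wording of the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Query:
    """Hypothetical, simplified view of a read query (not a real qbeast type)."""
    where: Optional[str] = None              # e.g. "fare_amount > 10"
    sample_fraction: Optional[float] = None  # e.g. 0.1 for a 10% sample


def choose_engine(query: Query) -> str:
    """Pick the engine for a query.

    Assumption: any query with a sampling clause goes to the Qbeast
    engine, even if it also has a WHERE filter; all other queries
    go to the Delta engine.
    """
    if query.sample_fraction is not None:
        return "qbeast"
    return "delta"


if __name__ == "__main__":
    print(choose_engine(Query(where="fare_amount > 10")))
    print(choose_engine(Query(sample_fraction=0.1)))
```

Under this reading, plain WHERE queries go through Delta's file skipping, while sampling queries are handled as before, which is consistent with the timings reported in the conversation.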
Type of change
This is an internal change; neither the API nor the data format is affected.
Checklist:
How Has This Been Tested? (Optional)
DefaultFileFormatTest was added to test the new logic in the query engine.