statistics does not work on a ParquetFile subset? #938

Closed
yohplala opened this issue Oct 28, 2024 · 2 comments · Fixed by #940

Comments

@yohplala

Describe the issue:

Minimal Complete Verifiable Example:

import pandas as pd
import fastparquet as fp

fp.write(filename="small_df", data=pd.DataFrame({'a':[1,2,3]}), row_group_offsets=[0,2], file_scheme="hive")
pf = fp.ParquetFile("small_df")

# Works
pf.statistics
# {'min': {'a': [1, 3]},
#  'max': {'a': [2, 3]},
#  'null_count': {'a': [0, 0]},
#  'distinct_count': {'a': [None, None]}}

# Does not work
pf[-1].statistics
# AttributeError: 'ParquetFile' object has no attribute '_statistics'

Selecting row groups still returns a ParquetFile, so shouldn't statistics work on the subset as well?
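
For reference, the result one would presumably expect from the subset can be sketched by slicing the full-file statistics by hand. This is only an illustration: `slice_statistics` is a hypothetical helper, not part of fastparquet, and it assumes each list in the statistics dict holds one entry per row group, in row-group order, as the output above suggests.

```python
# Hypothetical helper, not part of fastparquet: slice the per-row-group lists
# of a statistics dict the same way pf[item] selects row groups.
def slice_statistics(stats, item):
    sliced = {}
    for stat_name, per_column in stats.items():
        sliced[stat_name] = {}
        for column, values in per_column.items():
            picked = values[item]
            # An integer index yields a single value; wrap it so the result
            # keeps the one-entry-per-row-group list shape.
            sliced[stat_name][column] = (
                picked if isinstance(item, slice) else [picked]
            )
    return sliced


# Statistics of the full file, copied from the output above.
full_stats = {
    'min': {'a': [1, 3]},
    'max': {'a': [2, 3]},
    'null_count': {'a': [0, 0]},
    'distinct_count': {'a': [None, None]},
}

# Equivalent of the selection pf[-1]: keep only the last row group.
last_group_stats = slice_statistics(full_stats, -1)
print(last_group_stats)
```

With the two row groups from the example, this keeps only the second entry of each list, i.e. min and max of 3 and a null count of 0 for column `a`.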

Environment:

  • Python version: 3.10
  • Operating System: Ubuntu
  • Install method (conda, pip, source): conda
@martindurant
Member

You're probably in as good a place as anyone to fix this?

@yohplala
Author

Hi @martindurant, yes, I will have a look, probably by the end of this week. When I looked yesterday, the cause did not seem that straightforward to me, but I will investigate further.
