
Question about using xapres with large ApRES datasets #80

Open
jkingslake opened this issue Jan 15, 2025 · 8 comments

@jkingslake
Member

@oraschewski started the following conversation about using xapres with large ApRES datasets. I moved it here so it can be useful to others.

@oraschewski:

I aim to switch my ApRES processing to Python. For this, I think it would be nice to use your xapres package. As the datasets are quite large, I need to compress them into a format that is simpler to use, which I assume the zarr files can help with. However, I am struggling a bit to set this up, and the example notebooks that I found seem to be based on some old code and focus on cloud computing. Therefore, I was wondering: do you maybe have some example code to locally compress .dat files into a zarr file?

@jkingslake:

Great news that you're interested in xapres!
Yes, the notebooks that deal directly with getting the data into a zarr format are probably using an out-of-date implementation of the code. The principles are still valid though.

I have been working on some documentation recently: ldeo-glaciology.github.io/xapres/. Maybe the best way forward is to get the code running with a subset of the data using the up-to-date guides in the documentation linked above, and then I can help with the trickier job of getting it into zarr if you need it.
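For instance, something along these lines should load a small test subset (the directory path and file range here are just placeholders):

import xapres as xa

# Load only the first two .dat files to check the pipeline end to end
ds = xa.load.generate_xarray(directory='path/to/dat_files',
                             file_numbers_to_process=range(0, 2))
print(ds)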

How big is the data?
Do you plan on putting it in a google bucket?
Where are you doing the computing? A hub like CryoCloud is, I think, the best way to deal with big data in the cloud.

I've got a few notes online about the issues around getting big ApRES data into zarr, but I haven't got it all in one place yet. I can send these over soon too.

The main challenges are getting the chunk sizes right and avoiding having to load all the data into memory at once.
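In practice that means opening lazily with dask and choosing chunk sizes explicitly before writing. A minimal sketch (store paths and chunk sizes are placeholders):

import xarray as xr

# Open lazily with dask so nothing is pulled into memory up front
ds = xr.open_dataset('raw.zarr', engine='zarr', chunks={})

# Pick chunk sizes before writing; these numbers are illustrative only
ds = ds.chunk({'time': 350, 'profile_range': 200})
ds.to_zarr('rechunked.zarr', mode='w')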

Let me know if this sounds ok to you.

P.S. Here are some notes on how we dealt with the case when we couldn't fit all the data in memory: https://book.cryointhecloud.com/tutorials/ARCOdata_writingZarrs.html
A key bit of context which isn't really in those notes is that appending to a zarr store didn't seem to work properly, so we were forced to develop the method described in those notes.
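For reference, appending to zarr with xarray normally looks like the toy example below (using 'time' as the append dimension is just illustrative); it is this kind of step that didn't behave reliably for us:

import numpy as np
import xarray as xr

# Two toy batches standing in for successive sets of ApRES bursts
ds_first = xr.Dataset({'profile': ('time', np.zeros(5))},
                      coords={'time': np.arange(5)})
ds_next = xr.Dataset({'profile': ('time', np.ones(5))},
                     coords={'time': np.arange(5, 10)})

ds_first.to_zarr('store.zarr', mode='w')
ds_next.to_zarr('store.zarr', append_dim='time')  # the append step that misbehaved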

I think a page in the documentation about dealing with big data using the latest version of the code would be a useful addition. I'll work on that next.

@oraschewski

Thank you for the quick and detailed reply, this sounds great! The documentation is already really helpful, and so far xapres has been quite convenient for loading the data. As you suggested, I will start working on a subset while looking forward to your documentation on dealing with big data.

Our timeseries data for one site and year have a size of ~11 GB. I am probably going to do the computing on a virtual desktop infrastructure, which mimics a local computer but runs on a server. Therefore, I assume additional cloud computing will not be needed, but I can't really tell before testing it. Because of this, I thought it is probably best to first figure out how to work locally with zarr files.

Finally, I noticed that the latest release of the package was in 2023, and some code examples in the documentation currently do not work when you install xapres via pip. You probably have a new release planned anyway, and I solved this by manually installing the current version, but I just thought to let you know.

@jkingslake
Member Author

Yes, I recently released a new version (0.3.0) that should have all the features described in the docs.

As your data is 'only' 11 GB, you can probably load all the chirps into memory at once, but you might struggle to compute the profiles in one go without using dask. I had a very similar situation with some ApRES data from Thwaites recently. My workflow is described here: https://github.com/ldeo-glaciology/TG_apres/blob/main/notebooks/dats_to_zarr_unattended.ipynb
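Roughly, the dask pattern looks like the toy sketch below. The real range processing is done inside xapres, so the rfft here is only a stand-in, and all the sizes are made up:

import numpy as np
import xarray as xr

# Toy chirps: 1000 bursts x 40001 samples, chunked along time so the
# per-chirp FFT runs one block at a time rather than all at once
chirps = xr.DataArray(np.random.rand(1000, 40001),
                      dims=('time', 'chirp_time')).chunk({'time': 100})

profiles = xr.apply_ufunc(np.fft.rfft, chirps,
                          input_core_dims=[['chirp_time']],
                          output_core_dims=[['profile_range']],
                          dask='parallelized',
                          dask_gufunc_kwargs={'output_sizes': {'profile_range': 20001}},
                          output_dtypes=[np.complex128])

profiles.compute()  # or write straight to zarr instead of holding it all in memory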

Working locally might be a good way of doing it, yes.

@jkingslake
Member Author

Correction: I have had some difficulty getting the release to work properly, so I recommend continuing to use the GitHub version, which is the latest version.

Looking forward to hearing how you get on! 😀

@jkingslake
Member Author

The issue with the release has been solved (#95).

If you need it,
pip install xapres
will get the latest version, which works as described in the docs.

@oraschewski

Sorry for the late reply; this week and next I have to work mainly on another project.

Thank you for sharing the notebook. By strictly following this code, I managed to store the data as a zarr file. My main problem was that I had only tried to store the data after further processing, at which point the to_zarr function is no longer defined. So working with zarr now works for a subset of the data. In February, I will try to expand it to the whole dataset.

Great, thank you for fixing the release!

@jkingslake
Member Author

That's great you got the data saved as a zarr.

Just for the future: if your processing outputs results as xarray datasets or dataarrays, you should be able to save them to zarr in just the same way.
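For example (a sketch with placeholder paths and parameters, assuming the result comes back as an xarray dataset):

import xapres as xa

# Process as usual, then save the result exactly like the raw data
ds = xa.load.generate_xarray(directory='path/to/dat_files',
                             file_numbers_to_process=range(0, 2))
result = ds.profile.displacement_timeseries(bin_size=40, offset=5)
result.to_zarr('result.zarr', mode='w')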

@oraschewski

Hi Jonny,

I finally got back to processing my time series, and I now have everything working: storing the data as zarr subsets and then, after executing the FFT processing of the subsets, combining them into one zarr. I mostly followed your example scripts but adjusted them a bit to operate locally. I attach the code below in case it's helpful to anyone.

And you are right, storing dataarrays to zarr works fine. I just had trouble reloading them, because the built-in processing methods would then no longer work for me. Probably I just did something wrong with the reloading, but I noticed that I don't really need it anyway. With this, I think this issue can be closed.

Thank you again for your help!

# %%
# Initialize script
"""
This script uses the xapres package by Jonny Kingslake to preprocess ApRES time series data.

Created on 11/02/2025

@author: Falk Oraschewski ([email protected])
"""

# Load packages
import sys
import os
#import gcsfs
#import fsspec
#import json
import numpy as np
import xarray as xr
import matplotlib.pyplot as plt

sys.path.insert(0, os.path.join(os.path.dirname(__file__), '../auxiliary_functions'))
import xapres as xa # type: ignore

##############################################################################################
########################################## Settings ##########################################
##############################################################################################

# Define parent data directory
dirData = '../../data'
# Define relative path to timeseries data
dirDataTS = 'data_raw/Site/ApRES_timeseries'
# Define relative path for processed data
dirDataTSProc = 'data_proc/Site'
dirDataTSProcAll  = 'all.zarr'
# Define processing settings
settings = {'pad_factor': 4, 'crop_chirp_start': 0, 'crop_chirp_end': 1, 'max_range': 150, 'stack': True}

# Define chunk sizes
chunkTime = 350
chunkChirp = 10000
chunkRange = 200

##############################################################################################
##################################### Initialize script ######################################
##############################################################################################

# Handle relative paths
dirDataTS = os.path.normpath(os.path.join(os.path.dirname(__file__), dirData, dirDataTS))
dirDataTSProc = os.path.normpath(os.path.join(os.path.dirname(__file__), dirData, dirDataTSProc))
dirDataTSProcAll = os.path.normpath(os.path.join(dirDataTSProc, dirDataTSProcAll))

# %%
# Process raw data.
if not os.path.exists(dirDataTSProcAll):
    DataAll = xa.load.from_dats()

    # List files
    nFiles = len(DataAll.list_files(directory=dirDataTS))

    # Process data in subsets of 50 files
    nSubset = 50
    for n in range(0, nFiles, nSubset):
        print(f"\nProcessing file subset number {n//nSubset} of folder {dirDataTS}")
        DataSubset = xa.load.generate_xarray(directory=dirDataTS,
                                             file_numbers_to_process=range(n, min(n + nSubset, nFiles), 1),
                                             addProfileToDs_kwargs=settings)

        # Stack each burst and remove attenuator setting dimension
        DataSubset = DataSubset.mean(dim='chirp_num').isel(attenuator_setting_pair=0)

        # Save subset
        DataSubset.to_zarr(os.path.join(dirDataTSProc, f'subset/subset_{n//nSubset}.zarr'))
    
    # Combine all subsets
    DataAll = xr.open_mfdataset(os.path.join(dirDataTSProc, 'subset/subset_*'),
                                engine='zarr',
                                consolidated=False,
                                parallel=True)
    
    # Rechunk variables to uniform chunk size.
    DataAll = DataAll.chunk(chunks=dict(time=chunkTime, chirp_time=chunkChirp, profile_range=chunkRange))
    # Update chunk encoding of data variables
    DataAll.burst_number.encoding['chunks'] = chunkTime
    DataAll.profile.encoding['chunks'] = (chunkTime, chunkRange)
    DataAll.chirp.encoding['chunks'] = (chunkTime, chunkChirp)

    # Save the dataset to the Zarr store
    print(f"\nSaving dataset to Zarr store: {dirDataTSProcAll}")
    DataAll.to_zarr(dirDataTSProcAll, mode='w')
else:
    print(f"\nLoading existing Zarr store: {dirDataTSProcAll}")
    # Load the dataset from the Zarr store
    DataAll = xr.open_dataset(dirDataTSProcAll,
                              chunks={},
                              engine='zarr',
                              consolidated=False)

# %%
# Check if it works: Processing and plotting of displacement timeseries.
DataAll.profile.displacement_timeseries(bin_size = 40, offset = 5)\
       .velocity\
       .plot(figsize = (20,5), y='bin_depth', x='time', yincrease = False, vmin=-10, vmax=10, cmap ='RdBu_r', ylim = (0, 150));

@jkingslake
Member Author

That's great news!
So glad you got this working. It's very useful to have the script here for reference, so thanks for sharing it.

When you mention

the built-in processing methods would not work for me anymore

after loading the results into memory from a zarr, do you mean that when you called a method, e.g. .displacement_timeseries(), it just wasn't recognized as a method? This will happen if you don't import xapres into the session, because these methods are added to xarray dataarrays and datasets when the package is loaded.

In other words, there is nothing special about the xarray dataarrays and datasets that the xapres package produces; the package simply adds these methods to them when it is loaded. So if you save the data, start a new session, and then reload the data from the zarr, the methods will not be there (unfortunately, the methods are not saved with the data in the zarrs).
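So in a fresh session the import has to come before the reload. A minimal sketch (the path and parameters are just illustrative):

import xarray as xr
import xapres  # noqa: F401 -- importing xapres re-attaches the methods to xarray objects

ds = xr.open_dataset('all.zarr', engine='zarr', chunks={})
ts = ds.profile.displacement_timeseries(bin_size=40, offset=5)  # recognized again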

If something different is happening and the methods aren't working for a different reason, I would like to work out why.

Thanks again!
