Read from vcz #1
Conversation
Wow!!!
Amazing @Will-Tyler, you are a legend!
I don't grok the C++ or R details, but the overall shape looks just right. A few minor comments.
The big question now is: does this perform? We could try it on our simulated genotype data, comparing the VCF/BCF and VCZ performance. That would be a fantastic demo for the paper if we could do it.
src/VCF.cpp (outdated)
@@ -1,256 +1,257 @@
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
Are the changes in this file just formatting/C++ version stuff, or is there something substantive that needed to be changed in the VCF backend?
I must have committed this by accident. I reformatted a lot of the files while working on this to help me understand the code, but I did not mean to include them. I will fix this.
src/VCZ.cpp (outdated)
{"path", t_vczFileName + "/variant_position"}}},
})
.value();
m_variant_position_array = ChunkCacher<int8_t>(input_spec);
Does this mean variant_positions must be int8_t values? This is too small if so. int32_t is safe.
I chose the element types to match data types in the Zarr arrays as they are stored on my computer. I will see if using larger element types works.
We'd need int32 to work with any reasonable benchmark data
Just want to note that while working on the performance testing, I found that the element type template parameter needs to match the Zarr array's data type. The element type is a type template parameter that TensorStore uses as the data type of the TensorStore store (see here for an example). Essentially, the TensorStore code is parameterized on the element type, which the user specifies at compile time. TensorStore raises an error if I try to use int32_t to read an <i1 Zarr array. The current implementation therefore requires the user to update the code, specifically the type template arguments, to match the data types of the respective Zarr arrays. I currently don't know how to access data through TensorStore without specifying the element type somewhere at compile time.
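A minimal sketch of the constraint, assuming a hypothetical array path; this is not the PR's exact code. TensorStore fixes the element type at compile time through Open's type template parameter, so it must agree with the array's stored dtype:

```cpp
#include "tensorstore/open.h"
#include "tensorstore/spec.h"
#include "tensorstore/tensorstore.h"

// Sketch only: open a Zarr array whose stored dtype is "<i1" (int8).
// The path below is hypothetical.
void OpenVariantPositions() {
  auto spec = tensorstore::Spec::FromJson({
                  {"driver", "zarr"},
                  {"kvstore",
                   {{"driver", "file"},
                    {"path", "example.vcz/variant_position"}}},
              }).value();

  // The element type (int8_t) and rank (1) are compile-time template
  // arguments; opening this "<i1" array as int32_t instead fails with a
  // data-type mismatch error rather than converting.
  auto store = tensorstore::Open<int8_t, 1>(spec).result().value();
  (void)store;
}
```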
Force-pushed ce2eb21 to 1abc5f1.
Thank you for the kind words! 😄 This was a challenging but fun project and a great learning experience! I think I will need some binary phenotype data to go along with the simulated genotype data to produce a null-hypothesis model. The SAIGE paper describes creating simulated phenotypes in its methods section, but I wonder if you have a recommended approach for generating simulated binary phenotypes?
We can generate some quantitative traits on the base simulated data using tstrait. This would be quick and easy, and I'd be happy to set up the infrastructure for doing that. A fun thing to do, to really show that we can cover all the bases of input data formats etc., would be to use the simulated HAPNEST data, which includes very realistic phenotypes. The genetic data is in plink format, but I've already converted this using the (unfinished) plink2zarr program that's in bio2zarr. However, I don't think we need particularly realistic traits in the first instance just to get some rough idea of how this works. Just using a random value from a Gaussian distribution for each sample should be fine as a first pass, to give us basic order-of-magnitude numbers on how well VCZ performs in comparison to VCF and BCF.
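As a hedged illustration of that first pass (not part of the PR; the file name, sample IDs, and column layout are assumptions for illustration), one Gaussian draw per sample could look like this:

```cpp
#include <fstream>
#include <random>

// Sketch: write a phenotype file with one iid Gaussian draw per sample.
int main() {
  const int n_samples = 1000;
  std::mt19937 rng(42);  // fixed seed for reproducibility
  std::normal_distribution<double> normal(0.0, 1.0);

  std::ofstream out("phenotypes.txt");
  out << "IID\tpheno\n";
  for (int i = 0; i < n_samples; ++i) {
    out << "sample_" << i << '\t' << normal(rng) << '\n';
  }
}
```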
I got a chance to play around with the performance. This PR's original implementation was quite slow compared to reading from VCF. As was the case with the TensorStore afdist benchmark, it seems that using TensorStore's array access operator is inefficient. I changed the implementation to access the TensorStore array using pointer arithmetic, which significantly improved the performance. I also realized that the sequential access pattern was making the implementation load and reload the same chunks while iterating through the samples and then the variant sites. I changed the implementation so that it loads the first dimension (variant sites) in chunks but loads the other dimensions (samples, ploidy) in their entirety. (A sketch of this access pattern appears at the end of this comment.)

Results

I ran Step 2 with each backend and compared the timings.

With VCF:
With VCZ:
I compared the output and confirmed that the numerical output is the same.

Discussion

I think the results show some promise. However, I don't think loading only the first dimension in chunks will scale well without changing the chunk layout. If the current approach is not acceptable, I can try loading the first dimension one variant at a time and the other dimensions in chunks to see how that performs. I would imagine that the VCZ performance is machine-dependent, since that is what we observed with the afdist implementations. Another thing to note is that I had to convert a subset of the genotype data to PLINK format for Step 1; this PR only enables users to read from VCZ for Step 2. Let me know if Step 1 needs to use VCZ as well. To summarize, these are my questions:
I think with these questions clarified, and if the implementation still performs well, I can look into finalizing this demonstration.
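A minimal sketch of the access pattern described above, with hypothetical names and shapes; it is not the PR's exact code. One slab of variants is read across all samples and ploidy, and the materialized array is then traversed through its raw data pointer rather than the per-element access operator:

```cpp
#include <cstdint>

#include "tensorstore/index_space/dim_expression.h"
#include "tensorstore/tensorstore.h"

// Sketch: read variants [lo, hi) across all samples and ploidy, then scan
// the in-memory slab with pointer arithmetic. `store` is assumed to be an
// already opened TensorStore over a call_genotype-like array.
void ScanSlab(tensorstore::TensorStore<int8_t, 3> store,
              int64_t lo, int64_t hi) {
  // One Read materializes the whole slab, so each chunk is fetched once
  // instead of being reloaded on every per-element access.
  auto slab = tensorstore::Read(
                  store | tensorstore::Dims(0).HalfOpenInterval(lo, hi))
                  .result()
                  .value();

  const int8_t* data = slab.data();  // contiguous C-order buffer
  const int64_t n = slab.num_elements();
  int64_t missing = 0;
  for (int64_t i = 0; i < n; ++i) {
    if (data[i] < 0) ++missing;  // e.g. count missing calls (-1)
  }
  (void)missing;
}
```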
This is fantastic @Will-Tyler! I guess if we compare against BCF we're looking at roughly the same performance as VCZ? Let's discuss tomorrow at the sgkit-dev meeting, but here are some quick thoughts:
Overview
These changes enable SAIGE to read genotype data from VCF Zarr (VCZ) groups, using TensorStore to read the data.
Details
Build process
Writing R Extensions describes the build process for R extensions. Essentially, the R CMD INSTALL command uses make to build the extension. Package authors can customize the build process using a makefile called "Makevars" or a script called "configure".
SAIGE uses cget to install some C++ dependencies, e.g. savvy, but cget does not work for TensorStore because TensorStore does not provide install targets. Instead, I use CMake to take care of building TensorStore and compiling and linking the source code.
In these changes, CMake builds a dynamic library called "vcz" that contains the procedures for reading variant calling data from VCZ. I then use Makevars to build the CMake project and link the SAIGE library against the vcz library.
Code organization
SAIGE has a pair of source and header files for each file format that it supports, e.g. "VCF.cpp" and "VCF.hpp" for VCF files. I follow this convention and add "VCZ.cpp" and "VCZ.hpp". To avoid including TensorStore header files in VCZ.hpp, I use the pointer-to-implementation (pimpl) technique (sketched below). This keeps the other SAIGE files, which include the VCZ.hpp header, from needing to locate the TensorStore headers.
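A hedged sketch of that header layout, with hypothetical member and method names rather than the PR's exact declarations:

```cpp
// VCZ.hpp (sketch): no TensorStore headers are needed here.
#include <memory>
#include <string>

namespace VCZ {

class VczClass {
public:
  explicit VczClass(const std::string& vczFileName);
  ~VczClass();  // defined in VCZ.cpp, where Impl is a complete type

private:
  class Impl;                    // holds the TensorStore members
  std::unique_ptr<Impl> m_impl;  // only VCZ.cpp includes TensorStore headers
};

}  // namespace VCZ
```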
The VCZ class implementation closely resembles the VCF class implementation.
C++ Standard
In these changes, I use the C++17 standard, which TensorStore requires. I update the r-base package to version 4.3.0 because R 4.3.0 uses C++17 by default. It might be possible to keep SAIGE on the C++11 standard and use C++17 only for the vcz library, but I have not tried this yet.
Testing
I compared the output of the second step of Example 1 of the single-variant test examples using VCF and VCZ. I used bio2zarr to create a VCZ copy of the VCF file. The relevant numerical columns in the output match.
Limitations
Currently, TensorStore does not support reading arrays with an object data type (open issue). This means that TensorStore cannot read the VCF fields that are stored as objects, e.g. variant_alleles and sample_id. I handle this limitation by using fill values for the fields that are stored as objects. These fields are not necessary for the statistical tests.
The statistical models that SAIGE uses may only use a subset of the samples that are available in the VCF/VCZ file. For VCF files, SAIGE selects the samples from the VCF file that are used in the model. Because TensorStore cannot read the sample IDs, these changes do not adopt the same approach for VCZ. Instead, these changes use the first $n$ samples, where $n$ is the number of samples in the statistical model. This could lead to errors if the model does not use the first $n$ samples in the VCZ file. (A sketch of this behavior follows.)
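A minimal sketch of that behavior, with hypothetical names; it is not the PR's exact code. The genotype read is simply restricted to the first n entries of the samples dimension:

```cpp
#include <cstdint>

#include "tensorstore/index_space/dim_expression.h"
#include "tensorstore/tensorstore.h"

// Sketch: restrict the samples dimension (dimension 1 of a
// call_genotype-like array) to the first n_model_samples entries. If the
// model's samples are not the first n in the VCZ file, this reads the
// wrong genotypes, which is the limitation described above.
auto ReadFirstSamples(tensorstore::TensorStore<int8_t, 3> store,
                      int64_t n_model_samples) {
  return tensorstore::Read(
             store |
             tensorstore::Dims(1).HalfOpenInterval(0, n_model_samples))
      .result()
      .value();
}
```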
It appears that SAIGE might have some support for Windows. I do not focus on supporting Windows in these changes.
Tagging @jeromekelleher for review.