Read from vcz #1
Conversation
Wow!!!
Amazing @Will-Tyler, you are a legend!
I don't grok the C++ or R details, but the overall shape looks just right. A few minor comments.
The big question now is: does this perform? We could try it on our simulated genotype data, comparing the VCF/BCF and VCZ performance. That would be a fantastic demo for the paper if we could do it.
src/VCF.cpp (outdated)
@@ -1,256 +1,257 @@
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
Are the changes in this file just formatting/C++ version stuff, or is there something substantive that needed to be changed in the VCF backend?
I must have committed this by accident. I reformatted a lot of the files while working on this to help me understand the code, but I did not mean to include them. I will fix this.
src/VCZ.cpp (outdated)
{"path", t_vczFileName + "/variant_position"}}},
})
.value();
m_variant_position_array = ChunkCacher<int8_t>(input_spec);
Does this mean variant_positions must be int8_t values? This is too small if so. int32_t is safe.
I chose the element types to match data types in the Zarr arrays as they are stored on my computer. I will see if using larger element types works.
We'd need int32 to work with any reasonable benchmark data
Just want to note that while working on the performance testing, I found that the element type template parameter needs to match the Zarr array's data type. The element type is a type template parameter that TensorStore uses as the data type of the TensorStore store (see here for an example). Essentially, the TensorStore code is parameterized on the element type, which the user specifies at compile time. TensorStore raises an error if I try to use int32_t to read an <i1 Zarr array. The current implementation therefore requires the user to update the code, specifically the type template arguments, to match the data types of the respective Zarr arrays. I currently don't know how to access data through TensorStore without specifying the element type somewhere at compile time.
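A minimal sketch of the constraint, assuming a hypothetical array path; this is not the PR's exact code. TensorStore fixes the element type at compile time through Open's type template parameter, so it must agree with the array's stored dtype:

```cpp
#include "tensorstore/open.h"
#include "tensorstore/spec.h"
#include "tensorstore/tensorstore.h"

// Sketch only: open a Zarr array whose stored dtype is "<i1" (int8).
// The path below is hypothetical.
void OpenVariantPositions() {
  auto spec = tensorstore::Spec::FromJson({
                  {"driver", "zarr"},
                  {"kvstore",
                   {{"driver", "file"},
                    {"path", "example.vcz/variant_position"}}},
              }).value();

  // The element type (int8_t) and rank (1) are compile-time template
  // arguments; opening this "<i1" array as int32_t instead fails with a
  // data-type mismatch error rather than converting.
  auto store = tensorstore::Open<int8_t, 1>(spec).result().value();
  (void)store;
}
```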
Force-pushed ce2eb21 to 1abc5f1.
Thank you for the kind words! 😄 This was a challenging but fun project and a great learning experience! I think I will need some binary phenotype data to go along with the simulated genotype data to produce a null-hypothesis model. The SAIGE paper describes creating simulated phenotypes in its methods section, but I wonder if you have a recommended approach for generating simulated binary phenotypes?
We can generate some quantitative traits on the base simulated data using tstrait. This would be quick and easy, and I'd be happy to set up the infrastructure for doing that. A fun thing to do, to really show that we can cover all the bases of input data formats etc., would be to use the simulated HAPNEST data, which includes very realistic phenotypes. The genetic data is in plink format, but I've already converted this using the (unfinished) plink2zarr program that's in bio2zarr. However, I don't think we need particularly realistic traits in the first instance just to get some rough idea of how this works. Just using a random value from a Gaussian distribution for each sample should be fine as a first pass, to give us basic order-of-magnitude numbers on how well VCZ performs in comparison to VCF and BCF.
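As a hedged illustration of that first pass (not part of the PR; the file name, sample IDs, and column layout are assumptions for illustration), one Gaussian draw per sample could look like this:

```cpp
#include <fstream>
#include <random>

// Sketch: write a phenotype file with one iid Gaussian draw per sample.
int main() {
  const int n_samples = 1000;
  std::mt19937 rng(42);  // fixed seed for reproducibility
  std::normal_distribution<double> normal(0.0, 1.0);

  std::ofstream out("phenotypes.txt");
  out << "IID\tpheno\n";
  for (int i = 0; i < n_samples; ++i) {
    out << "sample_" << i << '\t' << normal(rng) << '\n';
  }
}
```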
I got a chance to play around with the performance. This PR's original implementation was quite slow compared to reading from VCF. As was the case with the TensorStore afdist benchmark, it seems that using TensorStore's array access operator is inefficient. I changed the implementation to access the TensorStore array using pointer arithmetic, which significantly improved the performance. I also realized that the sequential access pattern was making the implementation load and reload the same chunks while iterating through the samples and then the variant sites. I changed the implementation so that it loads the first dimension (variant sites) in chunks but loads the other dimensions (samples, ploidy) in their entirety. (A sketch of this access pattern appears at the end of this comment.)

Results

I ran Step 2 with each backend and compared the timings.

With VCF:
With VCZ:
I compared the output and confirmed that the numerical output is the same.

Discussion

I think the results show some promise. However, I don't think loading only the first dimension in chunks will scale well without changing the chunk layout. If the current approach is not acceptable, I can try loading the first dimension one variant at a time and the other dimensions in chunks to see how that performs. I would imagine that the VCZ performance is machine-dependent, since that is what we observed with the afdist implementations. Another thing to note is that I had to convert a subset of the genotype data to PLINK format for Step 1; this PR only enables users to read from VCZ for Step 2. Let me know if Step 1 needs to use VCZ as well. To summarize, these are my questions:
I think with these questions clarified, and if the implementation still performs well, I can look into finalizing this demonstration.
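A minimal sketch of the access pattern described above, with hypothetical names and shapes; it is not the PR's exact code. One slab of variants is read across all samples and ploidy, and the materialized array is then traversed through its raw data pointer rather than the per-element access operator:

```cpp
#include <cstdint>

#include "tensorstore/index_space/dim_expression.h"
#include "tensorstore/tensorstore.h"

// Sketch: read variants [lo, hi) across all samples and ploidy, then scan
// the in-memory slab with pointer arithmetic. `store` is assumed to be an
// already opened TensorStore over a call_genotype-like array.
void ScanSlab(tensorstore::TensorStore<int8_t, 3> store,
              int64_t lo, int64_t hi) {
  // One Read materializes the whole slab, so each chunk is fetched once
  // instead of being reloaded on every per-element access.
  auto slab = tensorstore::Read(
                  store | tensorstore::Dims(0).HalfOpenInterval(lo, hi))
                  .result()
                  .value();

  const int8_t* data = slab.data();  // contiguous C-order buffer
  const int64_t n = slab.num_elements();
  int64_t missing = 0;
  for (int64_t i = 0; i < n; ++i) {
    if (data[i] < 0) ++missing;  // e.g. count missing calls (-1)
  }
  (void)missing;
}
```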
This is fantastic @Will-Tyler! I guess if we compare against BCF we're looking at roughly the same performance as VCZ? Let's discuss tomorrow at the sgkit-dev meeting, but here are some quick thoughts:
Overview
These changes enable SAIGE to read genotype data from VCF Zarr (VCZ) groups, using TensorStore to read the data.
Details
Build process
Writing R Extensions describes the build process for R extensions. Essentially, the R CMD INSTALL command uses make to build the extension. Package authors can customize the build process using a makefile called "Makevars" or a script called "configure".
SAIGE uses cget to install some C++ dependencies, e.g. savvy, but cget does not work for TensorStore because TensorStore does not provide install targets. Instead, I use CMake to take care of building TensorStore and compiling and linking the source code.
In these changes, CMake builds a dynamic library called "vcz" that contains the procedures for reading variant calling data from VCZ. I then use Makevars to build the CMake project and link the SAIGE library against the vcz library.
Code organization
SAIGE has a pair of source and header files for each file format that it supports, e.g. "VCF.cpp" and "VCF.hpp" for VCF files. I follow this convention and add "VCZ.cpp" and "VCZ.hpp". To avoid including TensorStore header files in VCZ.hpp, I use the pointer-to-implementation (pimpl) technique (sketched below). This keeps the other SAIGE files, which include the VCZ.hpp header, from needing to locate the TensorStore headers.
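A hedged sketch of that header layout, with hypothetical member and method names rather than the PR's exact declarations:

```cpp
// VCZ.hpp (sketch): no TensorStore headers are needed here.
#include <memory>
#include <string>

namespace VCZ {

class VczClass {
public:
  explicit VczClass(const std::string& vczFileName);
  ~VczClass();  // defined in VCZ.cpp, where Impl is a complete type

private:
  class Impl;                    // holds the TensorStore members
  std::unique_ptr<Impl> m_impl;  // only VCZ.cpp includes TensorStore headers
};

}  // namespace VCZ
```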
The VCZ class implementation closely resembles the VCF class implementation.
C++ Standard
In these changes, I use the C++17 standard, which TensorStore requires. I update the r-base package to version 4.3.0 because R 4.3.0 uses C++17 by default. It might be possible to keep SAIGE on the C++11 standard and use C++17 only for the vcz library, but I have not tried this yet.
Testing
I compared the output of the second step of Example 1 of the single-variant test examples using VCF and VCZ. I used bio2zarr to create a VCZ copy of the VCF file. The relevant numerical columns in the output match.
Limitations
Currently, TensorStore does not support reading arrays with an object data type (open issue). This means that TensorStore cannot read the VCF fields that are stored as objects, e.g. variant_alleles and sample_id. I handle this limitation by using fill values for the fields that are stored as objects. These fields are not necessary for the statistical tests.
The statistical models that SAIGE uses may only use a subset of the samples that are available in the VCF/VCZ file. For VCF files, SAIGE selects the samples from the VCF file that are used in the model. Because TensorStore cannot read the sample IDs, these changes do not adopt the same approach for VCZ. Instead, these changes use the first $n$ samples, where $n$ is the number of samples in the statistical model. This could lead to errors if the model does not use the first $n$ samples in the VCZ file. (A sketch of this behavior follows.)
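A minimal sketch of that behavior, with hypothetical names; it is not the PR's exact code. The genotype read is simply restricted to the first n entries of the samples dimension:

```cpp
#include <cstdint>

#include "tensorstore/index_space/dim_expression.h"
#include "tensorstore/tensorstore.h"

// Sketch: restrict the samples dimension (dimension 1 of a
// call_genotype-like array) to the first n_model_samples entries. If the
// model's samples are not the first n in the VCZ file, this reads the
// wrong genotypes, which is the limitation described above.
auto ReadFirstSamples(tensorstore::TensorStore<int8_t, 3> store,
                      int64_t n_model_samples) {
  return tensorstore::Read(
             store |
             tensorstore::Dims(1).HalfOpenInterval(0, n_model_samples))
      .result()
      .value();
}
```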
It appears that SAIGE might have some support for Windows. I do not focus on supporting Windows in these changes.
Tagging @jeromekelleher for review.