Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read from vcz #1

Merged
merged 28 commits into from
Oct 1, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
28 commits
Select commit Hold shift + click to select a range
5c30bbd
Create genotype_100markers.vcz
Will-Tyler Sep 9, 2024
2f0daae
Add vczFile option to step2 script
Will-Tyler Sep 9, 2024
159bcad
Add vczFile to SPAGMMATtest
Will-Tyler Sep 9, 2024
88e2005
Add vczFile parameter to setGenoInput and checkGenoInput
Will-Tyler Sep 9, 2024
b6e91b5
Invert if-statement
Will-Tyler Sep 10, 2024
0194feb
Add VCZ case to SAIGE.Marker
Will-Tyler Sep 10, 2024
d34a6df
Add C++ code skeleton
Will-Tyler Sep 10, 2024
bc283db
Setup tensorstore
Will-Tyler Sep 12, 2024
8abf240
TensorStore working!
Will-Tyler Sep 14, 2024
ea2f3ce
Improve Makevars
Will-Tyler Sep 17, 2024
07d40e5
Add VCZ method implementations
Will-Tyler Sep 18, 2024
ac6ab94
Use pimpl idiom to hide tensorstore
Will-Tyler Sep 18, 2024
de35a62
Add debug flags in Makevars
Will-Tyler Sep 18, 2024
b16ec33
Update r-base version
Will-Tyler Sep 18, 2024
ff494cf
Load utils library (helps debugging)
Will-Tyler Sep 18, 2024
6066e57
Make build an order-only prereq
Will-Tyler Sep 19, 2024
0cbb47f
Disable optimization in cmake
Will-Tyler Sep 19, 2024
449345c
Remove cmake verbose
Will-Tyler Sep 20, 2024
161d3c2
Code is reading with TS and terminating successfully
Will-Tyler Sep 20, 2024
38bc16b
Output with VCZ is similar to output with VCF
Will-Tyler Sep 20, 2024
8affc87
Add ChunkCacher class
Will-Tyler Sep 21, 2024
c2d5999
Rechunk example VCZ file
Will-Tyler Sep 22, 2024
0562576
Implement chunk caching
Will-Tyler Sep 22, 2024
e5ac26f
Update genotype_100markers.vcz
Will-Tyler Sep 23, 2024
1abc5f1
Change build mode to release
Will-Tyler Sep 23, 2024
b772fac
Improve CMakeLists.txt
Will-Tyler Sep 27, 2024
21dfcbe
Make VCZ faster with pointer arithmetic
Will-Tyler Sep 28, 2024
498f5c1
Tidy up VCZ.cpp
Will-Tyler Sep 30, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
The table of contents is too big for display.
Diff view
Diff view
  •  
  •  
  •  
2 changes: 2 additions & 0 deletions .Rbuildignore
Original file line number Diff line number Diff line change
@@ -1 +1,3 @@
thirdParty/cget
thirdParty/tensorstore-0.1.65
src/build
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,11 +1,14 @@
**/.DS_Store
/archived
thirdParty/cget
thirdParty/zlib-1.2.8
thirdParty/bgen/.lock-waf_linux2_build
thirdParty/bgen/build
thirdParty/bgen/.hg
src/*.dylib
src/*.o
src/*.so
src/backup/*
src/build
extdata/input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly.raw
*.swp
11 changes: 11 additions & 0 deletions R/Geno.R
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,7 @@ checkGenoInput = function(bgenFile = "",
vcfFile = "",
vcfFileIndex = "",
vcfField = "DS",
vczFile = "",
savFile = "",
savFileIndex = "",
sampleFile = "",
Expand Down Expand Up @@ -60,6 +61,9 @@ checkGenoInput = function(bgenFile = "",
}
dosageFileType = "vcf"

}else if (vczFile != "") {
Check_File_Exist(vczFile)
dosageFileType <- "vcz"
}else if (savFile != "") {
Check_File_Exist(savFile, "savFile")
#Check_File_Exist(savFileIndex, "savFileIndex")
Expand Down Expand Up @@ -124,6 +128,7 @@ setGenoInput = function(bgenFile = "",
vcfFile = "",
vcfFileIndex = "",
vcfField = "DS",
vczFile = "",
savFile = "",
savFileIndex = "",
sampleFile = "",
Expand All @@ -142,6 +147,7 @@ setGenoInput = function(bgenFile = "",
vcfFile = vcfFile,
vcfFileIndex = vcfFileIndex,
vcfField = vcfField,
vczFile = vczFile,
savFile = savFile,
savFileIndex = savFileIndex,
bedFile = bedFile,
Expand Down Expand Up @@ -437,6 +443,11 @@ if(FALSE){
# markerInfo[,POS:=NULL]
# }

if (dosageFileType == "vcz") {
setVCZobjInCPP(vczFile, t_SampleInModel = sampleInModel)
markerInfo = NULL
}

genoList = list(markerInfo = markerInfo, genoType = dosageFileType, anyInclude = anyInclude)
return(genoList)
}
Expand Down
16 changes: 16 additions & 0 deletions R/RcppExports.R
Original file line number Diff line number Diff line change
Expand Up @@ -81,6 +81,10 @@ setVCFobjInCPP <- function(t_vcfFileName, t_vcfFileIndex, t_vcfField, t_SampleIn
invisible(.Call('_SAIGE_setVCFobjInCPP', PACKAGE = 'SAIGE', t_vcfFileName, t_vcfFileIndex, t_vcfField, t_SampleInModel))
}

setVCZobjInCPP <- function(t_vczFileName, t_SampleInModel) {
invisible(.Call('_SAIGE_setVCZobjInCPP', PACKAGE = 'SAIGE', t_vczFileName, t_SampleInModel))
}

setSAIGEobjInCPP <- function(t_XVX, t_XXVX_inv, t_XV, t_XVX_inv_XV, t_Sigma_iXXSigma_iX, t_X, t_S_a, t_res, t_mu2, t_mu, t_varRatio_sparse, t_varRatio_null, t_cateVarRatioMinMACVecExclude, t_cateVarRatioMaxMACVecInclude, t_SPA_Cutoff, t_tauvec, t_traitType, t_y, t_impute_method, t_flagSparseGRM, t_isFastTest, t_pval_cutoff_for_fastTest, t_locationMat, t_valueVec, t_dimNum, t_isCondition, t_condition_genoIndex, t_is_Firth_beta, t_pCutoffforFirth, t_offset, t_resout) {
invisible(.Call('_SAIGE_setSAIGEobjInCPP', PACKAGE = 'SAIGE', t_XVX, t_XXVX_inv, t_XV, t_XVX_inv_XV, t_Sigma_iXXSigma_iX, t_X, t_S_a, t_res, t_mu2, t_mu, t_varRatio_sparse, t_varRatio_null, t_cateVarRatioMinMACVecExclude, t_cateVarRatioMaxMACVecInclude, t_SPA_Cutoff, t_tauvec, t_traitType, t_y, t_impute_method, t_flagSparseGRM, t_isFastTest, t_pval_cutoff_for_fastTest, t_locationMat, t_valueVec, t_dimNum, t_isCondition, t_condition_genoIndex, t_is_Firth_beta, t_pCutoffforFirth, t_offset, t_resout))
}
Expand Down Expand Up @@ -117,6 +121,18 @@ move_forward_iterator_Vcf <- function(i) {
invisible(.Call('_SAIGE_move_forward_iterator_Vcf', PACKAGE = 'SAIGE', i))
}

set_iterator_inVcz <- function(chrom, beg_pd, end_pd) {
invisible(.Call('_SAIGE_set_iterator_inVcz', PACKAGE = 'SAIGE', chrom, beg_pd, end_pd))
}

check_Vcz_end <- function() {
.Call('_SAIGE_check_Vcz_end', PACKAGE = 'SAIGE')
}

move_forward_iterator_Vcz <- function(i) {
invisible(.Call('_SAIGE_move_forward_iterator_Vcz', PACKAGE = 'SAIGE', i))
}

fast_logistf_fit <- function(x, y, weight, offset, firth, col_fit, init, maxit, maxstep, maxhs, lconv, gconv, xconv, isfirthconverge) {
.Call('_SAIGE_fast_logistf_fit', PACKAGE = 'SAIGE', x, y, weight, offset, firth, col_fit, init, maxit, maxstep, maxhs, lconv, gconv, xconv, isfirthconverge)
}
Expand Down
85 changes: 58 additions & 27 deletions R/SAIGE_SPATest_Marker.R
Original file line number Diff line number Diff line change
Expand Up @@ -36,7 +36,54 @@ SAIGE.Marker = function(traitType,
}

## set up an object for genotype
if(genoType != "vcf"){
if(genoType == "vcf"){
if(!isAnyInclude){
if(chrom == ""){
stop("chrom needs to be specified for single-variant assoc tests when using VCF as input\n")
}else{
set_iterator_inVcf("", chrom, 1, 250000000)
}
}
if(outIndex > 1){
move_forward_iterator_Vcf(outIndex*nMarkersEachChunk)
}
isVcfEnd = check_Vcf_end()
if(!isVcfEnd){
#outIndex = 1
genoIndex = rep("0", nMarkersEachChunk)
genoIndex_prev = rep("0", nMarkersEachChunk)
#nChunks = outIndex + 1
is_marker_test = TRUE
i = outIndex
}else{
is_marker_test = FALSE
stop("No markers are left in VCF")
}
} else if (genoType == "vcz") {
if (!isAnyInclude) {
if (chrom == "") {
stop("chrom needs to be specified for single-variant assoc tests when using VCZ as input\n")
} else {
set_iterator_inVcz(chrom, 1, 250000000)
}
}

if (outIndex > 1) {
move_forward_iterator_Vcz(outIndex * nMarkersEachChunk)
}

isVczEnd <- check_Vcz_end()

if (!isVczEnd) {
genoIndex <- rep("0", nMarkersEachChunk)
genoIndex_prev <- rep("0", nMarkersEachChunk)
is_marker_test <- TRUE
i <- outIndex
} else {
is_marker_test <- FALSE
stop("No markers are left in VCZ")
}
}else{
#markerInfo = objGeno$markerInfo
if(LOCO){
genoIndex = genoIndex[which(CHROM == chrom)]
Expand Down Expand Up @@ -71,30 +118,6 @@ SAIGE.Marker = function(traitType,
is_marker_test = TRUE
i = outIndex
}

}else{
if(!isAnyInclude){
if(chrom == ""){
stop("chrom needs to be specified for single-variant assoc tests when using VCF as input\n")
}else{
set_iterator_inVcf("", chrom, 1, 250000000)
}
}
if(outIndex > 1){
move_forward_iterator_Vcf(outIndex*nMarkersEachChunk)
}
isVcfEnd = check_Vcf_end()
if(!isVcfEnd){
#outIndex = 1
genoIndex = rep("0", nMarkersEachChunk)
genoIndex_prev = rep("0", nMarkersEachChunk)
#nChunks = outIndex + 1
is_marker_test = TRUE
i = outIndex
}else{
is_marker_test = FALSE
stop("No markers are left in VCF")
}
}

chrom = "InitialChunk"
Expand All @@ -105,7 +128,7 @@ SAIGE.Marker = function(traitType,
#time_left = system.time({


if(genoType != "vcf"){
if(genoType != "vcf" && genoType != "vcz"){
tempList = genoIndexList[[i]]
genoIndex = as.character(format(tempList$genoIndex, scientific = FALSE))
tempChrom = tempList$chrom
Expand Down Expand Up @@ -133,7 +156,7 @@ SAIGE.Marker = function(traitType,
#print(ptm)
#print("gc()")
#print(gc())
if(genoType != "vcf"){
if(genoType != "vcf" && genoType != "vcz"){
cat(paste0("(",Sys.time(),") ---- Analyzing Chunk ", i, "/", nChunks, ": chrom ", chrom," ---- \n"))
}else{
cat(paste0("(",Sys.time(),") ---- Analyzing Chunk ", i, " : chrom ", chrom," ---- \n"))
Expand All @@ -150,6 +173,8 @@ SAIGE.Marker = function(traitType,

if(genoType == "vcf"){
isEnd_Output = check_Vcf_end()
}else if (genoType == "vcz") {
isEnd_Output = check_Vcz_end()
}else{
isEnd_Output = (i==nChunks)
}
Expand Down Expand Up @@ -189,6 +214,12 @@ SAIGE.Marker = function(traitType,
if(isVcfEnd){
is_marker_test = FALSE
}
}else if(genoType == "vcz") {
isVczEnd = check_Vcz_end()
cat("isVczEnd ", isVczEnd, "\n")
if (isVczEnd) {
is_marker_test = FALSE
}
}else{
if(i > nChunks){
is_marker_test = FALSE
Expand Down
7 changes: 5 additions & 2 deletions R/SAIGE_Test_main.R
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@
#' @param vcfFile character. Path to vcf file
#' @param vcfFileIndex character. Path to vcf index file. Indexed by tabix. Path to index for vcf file by tabix, .csi file using 'tabix --csi -p vcf file.vcf.gz'
#' @param vcfField character. genotype field in vcf file to use. "DS" for dosages or "GT" for genotypes. By default, "DS".
#' @param vczFile character. Path to VCZ group.
#' @param savFile character. Path to sav file
#' @param savFileIndex character. Path to index for sav file .s1r
#' @param bedFile character. Path to bed file (PLINK)
Expand Down Expand Up @@ -64,6 +65,7 @@ SPAGMMATtest = function(bgenFile = "",
vcfFile = "",
vcfFileIndex = "",
vcfField = "DS",
vczFile = "",
savFile = "",
savFileIndex = "",
bedFile="",
Expand Down Expand Up @@ -334,14 +336,15 @@ SPAGMMATtest = function(bgenFile = "",
}

nsample = length(obj.model$y)
cateVarRatioMaxMACVecInclude = c(cateVarRatioMaxMACVecInclude, nsample)
cateVarRatioMaxMACVecInclude = c(cateVarRatioMaxMACVecInclude, nsample)

#in Geno.R
objGeno = setGenoInput(bgenFile = bgenFile,
bgenFileIndex = bgenFileIndex,
vcfFile = vcfFile, #not activate yet
vcfFileIndex = vcfFileIndex,
vcfField = vcfField,
vczFile = vczFile,
savFile = savFile,
savFileIndex = savFileIndex,
sampleFile = sampleFile,
Expand Down
2 changes: 1 addition & 1 deletion conda_env/createCondaEnvSAIGE_steps.txt
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
conda create -n RSAIGE r-essentials r-base=4.1.2 python=3.10
conda create -n RSAIGE r-essentials r-base=4.3.0 python=3.10
conda activate RSAIGE
conda install -c anaconda cmake
conda install -c conda-forge gettext lapack r-matrix
Expand Down
5 changes: 5 additions & 0 deletions extdata/input/genotype_100markers.vcz/.zattrs

Large diffs are not rendered by default.

3 changes: 3 additions & 0 deletions extdata/input/genotype_100markers.vcz/.zgroup
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
{
"zarr_format": 2
}
451 changes: 451 additions & 0 deletions extdata/input/genotype_100markers.vcz/.zmetadata

Large diffs are not rendered by default.

25 changes: 25 additions & 0 deletions extdata/input/genotype_100markers.vcz/call_genotype/.zarray
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
{
"chunks": [
10,
1000,
2
],
"compressor": {
"blocksize": 0,
"clevel": 7,
"cname": "zstd",
"id": "blosc",
"shuffle": 2
},
"dimension_separator": "/",
"dtype": "|i1",
"fill_value": null,
"filters": null,
"order": "C",
"shape": [
100,
10000,
2
],
"zarr_format": 2
}
8 changes: 8 additions & 0 deletions extdata/input/genotype_100markers.vcz/call_genotype/.zattrs
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
{
"_ARRAY_DIMENSIONS": [
"variants",
"samples",
"ploidy"
],
"description": ""
}
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
25 changes: 25 additions & 0 deletions extdata/input/genotype_100markers.vcz/call_genotype_mask/.zarray
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
{
"chunks": [
10,
1000,
2
],
"compressor": {
"blocksize": 0,
"clevel": 7,
"cname": "zstd",
"id": "blosc",
"shuffle": 2
},
"dimension_separator": "/",
"dtype": "|b1",
"fill_value": null,
"filters": null,
"order": "C",
"shape": [
100,
10000,
2
],
"zarr_format": 2
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
{
"_ARRAY_DIMENSIONS": [
"variants",
"samples",
"ploidy"
],
"description": ""
}
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
23 changes: 23 additions & 0 deletions extdata/input/genotype_100markers.vcz/call_genotype_phased/.zarray
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
{
"chunks": [
10,
1000
],
"compressor": {
"blocksize": 0,
"clevel": 7,
"cname": "zstd",
"id": "blosc",
"shuffle": 2
},
"dimension_separator": "/",
"dtype": "|b1",
"fill_value": null,
"filters": null,
"order": "C",
"shape": [
100,
10000
],
"zarr_format": 2
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
{
"_ARRAY_DIMENSIONS": [
"variants",
"samples"
],
"description": ""
}
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Loading