Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve k-mer filter #11

Closed
johnlees opened this issue Jan 9, 2020 · 3 comments
Closed

Improve k-mer filter #11

johnlees opened this issue Jan 9, 2020 · 3 comments
Labels
enhancement New feature or request long-term Less urgent features that require more research

Comments

@johnlees
Copy link
Member

johnlees commented Jan 9, 2020

Improving on #4. This is left for future work, for now.

The current general hash table approach is a reasonable first trade-off between accuracy (it's totally accurate), fast (it's quite slow, but could be worse) and memory use (~1Gb -> will fit on a raspberry pi). As this filtering is the main resource user for read sketching, but changing it will not change sketch results, this is a potential area for future improvement.

Possible improvements include:

  • Having different filter versions which can be changed depending on available resources
  • For speed but more memory, a minimum perfect hash function could be tried (probably needs k limited < 31?, and some loss in accuracy if collision detection isn't work it. Masking the first half of the 64-bit hash would be one option). Discussion on gramtools suggest this may not be the best.
  • Using sparse_map for memory restricted applications
  • Using an approximate filter, e.g. khmer's count-min or squeakr's CQF

Other references:

@johnlees johnlees added bug Something isn't working documentation Improvements or additions to documentation enhancement New feature or request long-term Less urgent features that require more research and removed bug Something isn't working documentation Improvements or additions to documentation labels Jan 9, 2020
@johnlees
Copy link
Member Author

johnlees commented Jan 9, 2020

nthash may simplify this.

It might even be worth trying this instead of rollinghash in the main function, but this choice needs to be finalized now as it will change sketch results

@johnlees
Copy link
Member Author

johnlees commented Jan 9, 2020

nthash (6a4a4a2) looks to be slightly quicker on both tests, and has similar memory use. This might be due to by poor seqio.hpp move code for the next and out, or just that they integrated all of this with c strings anyway.

But whatever the case I will switch to this, especially as:

  • Any speed boost is welcome
  • The code becomes a lot simpler
  • Multiple hashes can be generated trivially and efficiently, which could come in handy with a future filter replacement
        rollinghash
        
        
        Command being timed: "./sketch_test 12754_5#71 12754_5#71.contigs_velvet.fa 12754_5#72 12754_5#72.contigs_velvet.fa rfiles.txt"
        User time (seconds): 73.21
        System time (seconds): 0.69
        Percent of CPU this job got: 363%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:20.35
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 48844
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 34150
        Voluntary context switches: 1307
        Involuntary context switches: 177
        Swaps: 0
        File system inputs: 0
        File system outputs: 39040
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
        
        Command being timed: "./read_test 12673_8_26 12673_8#26.fastq.gz 12754_4_66 12754_4#66_1.fastq.gz 12754_4#66_2.fastq.gz"
        User time (seconds): 316.31
        System time (seconds): 9.38
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 5:26.13
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1522004
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 1573477
        Voluntary context switches: 47
        Involuntary context switches: 334
        Swaps: 0
        File system inputs: 392920
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
        
        nthash
        
        Command being timed: "./sketch_test 12754_5#71 12754_5#71.contigs_velvet.fa 12754_5#72 12754_5#72.contigs_velvet.fa rfiles.txt"
        User time (seconds): 64.84
        System time (seconds): 0.54
        Percent of CPU this job got: 363%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:17.99
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 49308
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 34260
        Voluntary context switches: 1786
        Involuntary context switches: 204
        Swaps: 0
        File system inputs: 0
        File system outputs: 39104
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
        
        Command being timed: "./read_test 12673_8_26 12673_8#26.fastq.gz 12754_4_66 12754_4#66_1.fastq.gz 12754_4#66_2.fastq.gz"
        User time (seconds): 252.22
        System time (seconds): 10.66
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 4:23.58
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1522152
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 1569866
        Voluntary context switches: 63
        Involuntary context switches: 180
        Swaps: 0
        File system inputs: 1606592
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

@johnlees
Copy link
Member Author

johnlees commented Feb 7, 2020

Closed in 7029c0c

@johnlees johnlees closed this as completed Feb 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request long-term Less urgent features that require more research
Projects
None yet
Development

No branches or pull requests

1 participant