Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slop support for phrase queries #1241

Merged
merged 19 commits into from
Mar 7, 2022

Conversation

halvorboe
Copy link
Contributor

Duplicate of #1110 (because of permissions)

Implements slop for phrase queries (#1068).

Currently the phrase query only supports exact matching. The algorithm used does a set intersection between the position of the last term in the phrase and the positions of all the prior terms in the phrase offset by their position in the phrase.

Looked at the Lucene implementation and the description in "Information Retrieval: Implementing and evaluating search engines".

I modified the current algorithm slightly to keep track of:

  • The start of each candidate phrase. Stored in a separate array.
  • The first possible end of the phrase. (searching for a b . in a c b d e it stores the 0 as the start and 1 as the end). There is never any benefit from picking a later instance of the term. This is stored in what was previously called left.

I'll test the efficiency and try to optimize, but the is no change in complexity.

@codecov-commenter
Copy link

codecov-commenter commented Dec 26, 2021

Codecov Report

Merging #1241 (5d79e28) into main (e5e252c) will increase coverage by 0.04%.
The diff coverage is 89.02%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1241      +/-   ##
==========================================
+ Coverage   94.20%   94.24%   +0.04%     
==========================================
  Files         206      206              
  Lines       34962    35193     +231     
==========================================
+ Hits        32936    33169     +233     
+ Misses       2026     2024       -2     
Impacted Files Coverage Δ
src/query/phrase_query/mod.rs 91.89% <70.00%> (-8.11%) ⬇️
src/query/phrase_query/phrase_scorer.rs 95.29% <98.01%> (+2.28%) ⬆️
src/query/phrase_query/phrase_query.rs 94.28% <100.00%> (+0.34%) ⬆️
src/query/phrase_query/phrase_weight.rs 75.51% <100.00%> (+1.59%) ⬆️
src/query/fuzzy_query.rs 93.70% <0.00%> (-2.21%) ⬇️
src/query/union.rs 98.79% <0.00%> (-0.61%) ⬇️
src/query/term_query/term_weight.rs 94.93% <0.00%> (-0.25%) ⬇️
src/snippet/mod.rs 92.79% <0.00%> (-0.24%) ⬇️
src/query/query_parser/query_parser.rs 96.60% <0.00%> (-0.07%) ⬇️
src/postings/stacker/term_hashmap.rs 99.37% <0.00%> (-0.05%) ⬇️
... and 20 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e5e252c...5d79e28. Read the comment docs.

@fulmicoton fulmicoton self-requested a review December 29, 2021 10:36
}
}
for i in 0..count {
left[i] = right[i];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused... Why would we fill left and right , if it is to trash left in the end?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are only interested in right.
Can't we just write right into left, and keep right non-mutable?

while left_index < left_len && right_index < right_len {
let left_val = left[left_index];
let right_val = right[right_index];
match left_val.cmp(&right_val) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The actual three cases are

  • left_val < right - slop
  • right_slop <= left_val <= right
  • left_val > right

match left_val.cmp(&right_val) {
Ordering::Less => {
if right_val - left[left_index] <= max_distance_to_begin {
if is_temporary {
Copy link
Collaborator

@fulmicoton fulmicoton Jan 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once the condition is switch to the ternary choice above, the is_temporary contraption can be avoided with an inner loop.

Copy link
Collaborator

@fulmicoton fulmicoton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments inline

@halvorboe halvorboe requested a review from fulmicoton January 23, 2022 16:44
Comment on lines 167 to 171
loop {
if left_index + 1 >= left_len {
// current one is the best as there are no more values.
break;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
loop {
if left_index + 1 >= left_len {
// current one is the best as there are no more values.
break;
}
while left_index + 1 < left_len {

@codecov-commenter
Copy link

codecov-commenter commented Feb 5, 2022

Codecov Report

Merging #1241 (b118df0) into main (d18ac13) will increase coverage by 0.23%.
The diff coverage is 96.93%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1241      +/-   ##
==========================================
+ Coverage   93.98%   94.22%   +0.23%     
==========================================
  Files         204      209       +5     
  Lines       34539    35454     +915     
==========================================
+ Hits        32461    33405     +944     
+ Misses       2078     2049      -29     
Impacted Files Coverage Δ
src/collector/count_collector.rs 100.00% <ø> (ø)
src/collector/filter_collector_wrapper.rs 84.84% <ø> (ø)
src/collector/mod.rs 48.21% <ø> (ø)
src/directory/error.rs 1.96% <0.00%> (-6.93%) ⬇️
src/directory/tests.rs 99.09% <ø> (-0.03%) ⬇️
src/functional_test.rs 0.00% <0.00%> (ø)
src/query/term_query/term_scorer.rs 73.89% <0.00%> (+0.29%) ⬆️
src/schema/flags.rs 56.25% <ø> (ø)
src/query/phrase_query/phrase_query.rs 94.28% <66.66%> (+0.34%) ⬆️
src/query/phrase_query/mod.rs 91.83% <79.48%> (-8.17%) ⬇️
... and 255 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e5e252c...b118df0. Read the comment docs.

@halvorboe halvorboe requested a review from fulmicoton February 5, 2022 23:46
@halvorboe
Copy link
Contributor Author

@fulmicoton Do you have time to take a look?

@fulmicoton fulmicoton merged commit cedced5 into quickwit-oss:main Mar 7, 2022
@fulmicoton
Copy link
Collaborator

Thank you @halvorboe

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants