Slop support for phrase queries #1241
Codecov Report
@@ Coverage Diff @@
## main #1241 +/- ##
==========================================
+ Coverage 94.20% 94.24% +0.04%
==========================================
Files 206 206
Lines 34962 35193 +231
==========================================
+ Hits 32936 33169 +233
+ Misses 2026 2024 -2
    }
}
for i in 0..count {
    left[i] = right[i];
I'm confused... Why would we fill `left` and `right`, if it is to trash `left` in the end?
I think we are only interested in right.
Can't we just write right into left, and keep right non-mutable?
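One way to read this suggestion, as a minimal sketch only: a two-pointer intersection that takes `right` by shared reference and overwrites the front of `left` with the result. The name `intersect_in_place` and its signature are assumptions made for illustration; they are not the code in this PR.

```rust
use std::cmp::Ordering;

/// Intersects two sorted, deduplicated position lists.
/// The common values are written to the front of `left`; `right` is only read.
/// Returns the number of values kept.
/// NOTE: hypothetical helper for illustration, not the PR's actual function.
fn intersect_in_place(left: &mut [u32], right: &[u32]) -> usize {
    let (mut l, mut r, mut count) = (0, 0, 0);
    while l < left.len() && r < right.len() {
        match left[l].cmp(&right[r]) {
            Ordering::Less => l += 1,
            Ordering::Greater => r += 1,
            Ordering::Equal => {
                // `count <= l` always holds, so no unread value is overwritten.
                left[count] = left[l];
                count += 1;
                l += 1;
                r += 1;
            }
        }
    }
    count
}
```

After the call, `&left[..count]` holds the intersection and `right` is untouched, so no copy-back loop is needed.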
while left_index < left_len && right_index < right_len {
    let left_val = left[left_index];
    let right_val = right[right_index];
    match left_val.cmp(&right_val) {
The actual three cases are:
left_val < right - slop
right - slop <= left_val <= right
left_val > right
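The three cases can be sketched as a single merge loop. This is an illustration under assumptions, not the PR's implementation: the function name `intersect_with_slop`, the `Vec` return type, and the choice to report matching `right` values are all hypothetical.

```rust
/// Returns the values of `right` that have at least one value of `left`
/// inside the window `[right_val - slop, right_val]`. Both inputs are sorted.
/// NOTE: illustrative sketch; the name and return type are assumptions.
fn intersect_with_slop(left: &[u32], right: &[u32], slop: u32) -> Vec<u32> {
    let mut matches = Vec::new();
    let (mut l, mut r) = (0, 0);
    while l < left.len() && r < right.len() {
        let left_val = left[l];
        let right_val = right[r];
        // Saturate so a window near position 0 does not underflow.
        let window_start = right_val.saturating_sub(slop);
        if left_val < window_start {
            // Case 1: left is too far behind this (and every later) right value.
            l += 1;
        } else if left_val <= right_val {
            // Case 2: left falls inside the window, so right_val matches.
            // `l` is not advanced: the same left value may also cover the next right value.
            matches.push(right_val);
            r += 1;
        } else {
            // Case 3: left is already past right_val; move to the next right value.
            r += 1;
        }
    }
    matches
}
```

With `left = [0]`, `right = [1]` and `slop = 1` the single right value matches; with `slop = 0` nothing matches, which is the exact-match behaviour.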
match left_val.cmp(&right_val) {
    Ordering::Less => {
        if right_val - left[left_index] <= max_distance_to_begin {
            if is_temporary {
Once the condition is switched to the three-way choice above, the `is_temporary` contraption can be avoided with an inner loop.
See comments inline
Co-authored-by: Paul Masurel <[email protected]>
…into hfb/phrase_match_slop
loop {
    if left_index + 1 >= left_len {
        // current one is the best as there are no more values.
        break;
    }
Suggested change:
- loop {
-     if left_index + 1 >= left_len {
-         // current one is the best as there are no more values.
-         break;
-     }
+ while left_index + 1 < left_len {
Codecov Report
@@ Coverage Diff @@
## main #1241 +/- ##
==========================================
+ Coverage 93.98% 94.22% +0.23%
==========================================
Files 204 209 +5
Lines 34539 35454 +915
==========================================
+ Hits 32461 33405 +944
+ Misses 2078 2049 -29
@fulmicoton Do you have time to take a look?
Thank you @halvorboe
Duplicate of #1110 (because of permissions)
Implements slop for phrase queries (#1068).
Currently the phrase query only supports exact matching. The algorithm used does a set intersection between the position of the last term in the phrase and the positions of all the prior terms in the phrase offset by their position in the phrase.
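As a rough sketch of the exact-match idea described above: shift each term's positions back by the term's index in the phrase, then intersect the lists. The helper name `exact_phrase_starts` is hypothetical and this is not tantivy's API. For simplicity the sketch aligns everything on the first term, whereas the description aligns on the last; the two are equivalent up to a constant shift.

```rust
/// Returns the positions at which the whole phrase starts, given one sorted
/// position list per phrase term (in phrase order).
/// NOTE: hypothetical helper for illustration; not tantivy's implementation.
fn exact_phrase_starts(term_positions: &[Vec<u32>]) -> Vec<u32> {
    let mut iter = term_positions.iter().enumerate();
    let mut starts: Vec<u32> = match iter.next() {
        Some((_, first)) => first.clone(),
        None => return Vec::new(),
    };
    for (offset, positions) in iter {
        let offset = offset as u32;
        // Shift each position back by the term's index in the phrase;
        // positions smaller than the offset cannot belong to a match.
        let shifted: Vec<u32> = positions
            .iter()
            .filter_map(|&pos| pos.checked_sub(offset))
            .collect();
        // After shifting, an exact match shows up as the same value in every list.
        starts.retain(|start| shifted.binary_search(start).is_ok());
    }
    starts
}
```

For a document `a b c a b`, the term positions are `a: [0, 3]`, `b: [1, 4]`, `c: [2]`. The phrase `a b` starts at positions 0 and 3, and `a b c` starts only at 0.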
Looked at the Lucene implementation and the description in "Information Retrieval: Implementing and evaluating search engines".
I modified the current algorithm slightly to keep track of where a match starts as well as where it ends (e.g. for the phrase `a b` in `a c b d e`, it stores 0 as the start and 1 as the end). There is never any benefit from picking a later instance of the term. This is stored in what was previously called `left`.
I'll test the efficiency and try to optimize, but there is no change in complexity.