You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I will be shortly starting the implementation of a new query type on a fork. I'd like to hear the opinion of the community and the maintainers on it so that it is consistent with the project's needs and standards from the beginning.
Phrase with Gaps
The project I'm working on requires running a large volume of queries that look something like this.
____ is a type of electron microscope.
Optical tweezers can be used to measure _____ between _____ and DNA.
____ batteries are the most efficient energy-storage medium for a temperature of ____.
Generally, efficiently retrieving sentences that match patterns with gaps. Or from a different perspective, the intersection of multiple phrase queries, where the phrases must appear in a given order and have one or more arbitrary terms between them.
Implementation
I plan to follow the existing architecture for PhraseQuery closely with minor extensions.
With relevant modifications to PhraseScorer and PhraseWeight.
As described above, I believe it would be enough to simply reuse the current phrase query algorithm, execute a query for each phrase separated by a gap, and then verify their order and distancing. However, I'm open to suggestions of more efficient algorithms.
Generalizations
These extensions are in principle not in the scope of this PR, since I don't strictly need them for my use case, and I believe that the current specification of the feature is already generic enough to be useful to other users. I will be somewhat flexible though if I see that a more general version is in demand.
Bounded Gap Lengths: Gaps may have min_len and max_len. My plan would be equivalent to min_len: Some(1) and max_len: None. I can see that particularly min_len: None could be useful for optional terms in phrases.
Slop: It is a related concept, with the main difference being that slop: 1 on a two-term phrase allows having either an arbitrary term in the middle or reversing the order of the terms.
Spans: The Span API from Lucene/Elasticsearch is also very related to what I'm proposing, I believe slop is implemented on top of it.
Regex Phrases: We could consider defining a whole regex-style API for specifying patterns of terms. It might look something like this.
I'd like to know if such a feature would be useful to others, if similar requests have been made before, and if such a thing belongs within the scope of tantivy in the first place.
Any suggestions for how to implement this efficiently are most welcome.
What requirements should this fulfill for a pull-request to be accepted? I plan to implement this regardless of it being merged, but it would be nice if it would also help others.
Perhaps I'm missing some existing functionality that makes this redundant. Or it might be more appropriate to implement this as a filter on top of results from tantivy, instead of inside tantivy. Let me know.
The implementation of PhraseQuery is already a good guide, but reiterating which parts of the codebase I would need to extend for this would be helpful.
Again, I'm somewhat flexible to suggestions for expanding the specification to support wider use-cases, as long as it doesn't increase the workload substantially.
The text was updated successfully, but these errors were encountered:
I will be shortly starting the implementation of a new query type on a fork. I'd like to hear the opinion of the community and the maintainers on it so that it is consistent with the project's needs and standards from the beginning.
Phrase with Gaps
The project I'm working on requires running a large volume of queries that look something like this.
Generally, efficiently retrieving sentences that match patterns with gaps. Or from a different perspective, the intersection of multiple phrase queries, where the phrases must appear in a given order and have one or more arbitrary terms between them.
Implementation
I plan to follow the existing architecture for
PhraseQuery
closely with minor extensions.For instance:
With relevant modifications to
PhraseScorer
andPhraseWeight
.As described above, I believe it would be enough to simply reuse the current phrase query algorithm, execute a query for each phrase separated by a gap, and then verify their order and distancing. However, I'm open to suggestions of more efficient algorithms.
Generalizations
These extensions are in principle not in the scope of this PR, since I don't strictly need them for my use case, and I believe that the current specification of the feature is already generic enough to be useful to other users. I will be somewhat flexible though if I see that a more general version is in demand.
min_len
andmax_len
. My plan would be equivalent tomin_len: Some(1)
andmax_len: None
. I can see that particularlymin_len: None
could be useful for optional terms in phrases.slop: 1
on a two-term phrase allows having either an arbitrary term in the middle or reversing the order of the terms.Questions
PhraseQuery
is already a good guide, but reiterating which parts of the codebase I would need to extend for this would be helpful.The text was updated successfully, but these errors were encountered: