Proposal for new query type: phrase with gaps #1273

oersted · 2022-02-15T15:48:57Z

I will be shortly starting the implementation of a new query type on a fork. I'd like to hear the opinion of the community and the maintainers on it so that it is consistent with the project's needs and standards from the beginning.

Phrase with Gaps

The project I'm working on requires running a large volume of queries that look something like this.

____ is a type of electron microscope.
Optical tweezers can be used to measure _____ between _____ and DNA.
____ batteries are the most efficient energy-storage medium for a temperature of ____.

Generally, efficiently retrieving sentences that match patterns with gaps. Or from a different perspective, the intersection of multiple phrase queries, where the phrases must appear in a given order and have one or more arbitrary terms between them.

Implementation

I plan to follow the existing architecture for PhraseQuery closely with minor extensions.

For instance:

enum GapTerm {
    Gap,
    Term(Term),
}

#[derive(Clone, Debug)]
pub struct GapPhraseQuery {
    field: Field,
    phrase_terms: Vec<(usize, GapTerm)>,
}

With relevant modifications to PhraseScorer and PhraseWeight.

As described above, I believe it would be enough to simply reuse the current phrase query algorithm, execute a query for each phrase separated by a gap, and then verify their order and distancing. However, I'm open to suggestions of more efficient algorithms.

Generalizations

These extensions are in principle not in the scope of this PR, since I don't strictly need them for my use case, and I believe that the current specification of the feature is already generic enough to be useful to other users. I will be somewhat flexible though if I see that a more general version is in demand.

Bounded Gap Lengths: Gaps may have min_len and max_len. My plan would be equivalent to min_len: Some(1) and max_len: None. I can see that particularly min_len: None could be useful for optional terms in phrases.
Slop: It is a related concept, with the main difference being that slop: 1 on a two-term phrase allows having either an arbitrary term in the middle or reversing the order of the terms.
Spans: The Span API from Lucene/Elasticsearch is also very related to what I'm proposing, I believe slop is implemented on top of it.
Regex Phrases: We could consider defining a whole regex-style API for specifying patterns of terms. It might look something like this.

enum Quantifier {
    One,
    OneOrNone,  // ?
    NoneOrMore,  // *
    OneOrMore,  // +
    Range(usize, usize),  // {min, max}
}

struct RegexTerm {
    term: Option<Term>,  // None matches any term, a wildcard or "." in a conventional regex.
    quantifier: Quantifier,
}

#[derive(Clone, Debug)]
pub struct RegexPhraseQuery {
    field: Field,
    phrase_terms: Vec<(usize, RegexTerm)>,
}

Questions

I'd like to know if such a feature would be useful to others, if similar requests have been made before, and if such a thing belongs within the scope of tantivy in the first place.
Any suggestions for how to implement this efficiently are most welcome.
What requirements should this fulfill for a pull-request to be accepted? I plan to implement this regardless of it being merged, but it would be nice if it would also help others.
Perhaps I'm missing some existing functionality that makes this redundant. Or it might be more appropriate to implement this as a filter on top of results from tantivy, instead of inside tantivy. Let me know.
The implementation of PhraseQuery is already a good guide, but reiterating which parts of the codebase I would need to extend for this would be helpful.
Again, I'm somewhat flexible to suggestions for expanding the specification to support wider use-cases, as long as it doesn't increase the workload substantially.

The text was updated successfully, but these errors were encountered:

fulmicoton · 2022-03-07T02:11:46Z

Did you have a look at @halvorboe 's work on #1241

oersted · 2022-03-07T10:18:06Z

I haven't no, thanks for the reference, I'll take a look.

PSeitz · 2024-10-24T10:38:23Z

For exact gaps you could (mis)use the recently added RegexPhraseQuery #2516

e.g. "Optical tweezers can be used to measure .* between .* and DNA"

Note that RegexPhraseQuery is currently not optimized for a match all term and it would be terribly inefficient. It should work though.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal for new query type: phrase with gaps #1273

Proposal for new query type: phrase with gaps #1273

oersted commented Feb 15, 2022

fulmicoton commented Mar 7, 2022

oersted commented Mar 7, 2022

PSeitz commented Oct 24, 2024 •

edited

Loading

Proposal for new query type: phrase with gaps #1273

Proposal for new query type: phrase with gaps #1273

Comments

oersted commented Feb 15, 2022

Phrase with Gaps

Implementation

Generalizations

Questions

fulmicoton commented Mar 7, 2022

oersted commented Mar 7, 2022

PSeitz commented Oct 24, 2024 • edited Loading

PSeitz commented Oct 24, 2024 •

edited

Loading