Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal for new query type: phrase with gaps #1273

Open
oersted opened this issue Feb 15, 2022 · 3 comments
Open

Proposal for new query type: phrase with gaps #1273

oersted opened this issue Feb 15, 2022 · 3 comments

Comments

@oersted
Copy link

oersted commented Feb 15, 2022

I will be shortly starting the implementation of a new query type on a fork. I'd like to hear the opinion of the community and the maintainers on it so that it is consistent with the project's needs and standards from the beginning.

Phrase with Gaps

The project I'm working on requires running a large volume of queries that look something like this.

____ is a type of electron microscope.
Optical tweezers can be used to measure _____ between _____ and DNA.
____ batteries are the most efficient energy-storage medium for a temperature of ____.

Generally, efficiently retrieving sentences that match patterns with gaps. Or from a different perspective, the intersection of multiple phrase queries, where the phrases must appear in a given order and have one or more arbitrary terms between them.

Implementation

I plan to follow the existing architecture for PhraseQuery closely with minor extensions.

For instance:

enum GapTerm {
    Gap,
    Term(Term),
}

#[derive(Clone, Debug)]
pub struct GapPhraseQuery {
    field: Field,
    phrase_terms: Vec<(usize, GapTerm)>,
}

With relevant modifications to PhraseScorer and PhraseWeight.

As described above, I believe it would be enough to simply reuse the current phrase query algorithm, execute a query for each phrase separated by a gap, and then verify their order and distancing. However, I'm open to suggestions of more efficient algorithms.

Generalizations

These extensions are in principle not in the scope of this PR, since I don't strictly need them for my use case, and I believe that the current specification of the feature is already generic enough to be useful to other users. I will be somewhat flexible though if I see that a more general version is in demand.

  • Bounded Gap Lengths: Gaps may have min_len and max_len. My plan would be equivalent to min_len: Some(1) and max_len: None. I can see that particularly min_len: None could be useful for optional terms in phrases.
  • Slop: It is a related concept, with the main difference being that slop: 1 on a two-term phrase allows having either an arbitrary term in the middle or reversing the order of the terms.
  • Spans: The Span API from Lucene/Elasticsearch is also very related to what I'm proposing, I believe slop is implemented on top of it.
  • Regex Phrases: We could consider defining a whole regex-style API for specifying patterns of terms. It might look something like this.
enum Quantifier {
    One,
    OneOrNone,  // ?
    NoneOrMore,  // *
    OneOrMore,  // +
    Range(usize, usize),  // {min, max}
}

struct RegexTerm {
    term: Option<Term>,  // None matches any term, a wildcard or "." in a conventional regex.
    quantifier: Quantifier,
}

#[derive(Clone, Debug)]
pub struct RegexPhraseQuery {
    field: Field,
    phrase_terms: Vec<(usize, RegexTerm)>,
}

Questions

  • I'd like to know if such a feature would be useful to others, if similar requests have been made before, and if such a thing belongs within the scope of tantivy in the first place.
  • Any suggestions for how to implement this efficiently are most welcome.
  • What requirements should this fulfill for a pull-request to be accepted? I plan to implement this regardless of it being merged, but it would be nice if it would also help others.
  • Perhaps I'm missing some existing functionality that makes this redundant. Or it might be more appropriate to implement this as a filter on top of results from tantivy, instead of inside tantivy. Let me know.
  • The implementation of PhraseQuery is already a good guide, but reiterating which parts of the codebase I would need to extend for this would be helpful.
  • Again, I'm somewhat flexible to suggestions for expanding the specification to support wider use-cases, as long as it doesn't increase the workload substantially.
@fulmicoton
Copy link
Collaborator

Did you have a look at @halvorboe 's work on #1241

@oersted
Copy link
Author

oersted commented Mar 7, 2022

I haven't no, thanks for the reference, I'll take a look.

@PSeitz
Copy link
Contributor

PSeitz commented Oct 24, 2024

For exact gaps you could (mis)use the recently added RegexPhraseQuery #2516

e.g. "Optical tweezers can be used to measure .* between .* and DNA"

Note that RegexPhraseQuery is currently not optimized for a match all term and it would be terribly inefficient. It should work though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants