-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Optimize](invert index) Optimize multiple terms conjunction query #23871
Conversation
clang-tidy review says "All clean, LGTM! 👍" |
clang-tidy review says "All clean, LGTM! 👍" |
1 similar comment
clang-tidy review says "All clean, LGTM! 👍" |
clang-tidy review says "All clean, LGTM! 👍" |
1 similar comment
clang-tidy review says "All clean, LGTM! 👍" |
@@ -32,7 +32,8 @@ Status compact_column(int32_t index_id, int src_segment_num, int dest_segment_nu | |||
std::vector<uint32_t> dest_segment_num_rows) { | |||
lucene::store::Directory* dir = | |||
DorisCompoundDirectory::getDirectory(fs, index_writer_path.c_str(), false); | |||
auto index_writer = _CLNEW lucene::index::IndexWriter(dir, nullptr, true /* create */, | |||
lucene::analysis::SimpleAnalyzer<char> analyzer; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use simple analyzer for all case?
std::wstring str_tokens; | ||
str_tokens += std::to_wstring(static_cast<int32_t>(query_type)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move it into InvertedIndexQueryCache::CacheKey::encode()
std::wstring str_tokens; | ||
str_tokens += std::to_wstring(static_cast<int32_t>(query_type)); | ||
for (auto& token : analyse_result) { | ||
str_tokens += token; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add whitespace seperator
SCOPED_RAW_TIMER(&stats->inverted_index_searcher_search_timer); | ||
if (query_type == InvertedIndexQueryType::MATCH_ANY_QUERY || | ||
query_type == InvertedIndexQueryType::EQUAL_QUERY) { | ||
index_searcher->_search(query.get(), [&term_match_bitmap](DocRange* docRange) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use doc_range consistently
int32_t big = iterators[iterators.size() - 1].docFreq(); | ||
if (little == 0) { | ||
_useSkip = true; | ||
} else if ((big / little) > 1000) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make it a session variable
clang-tidy review says "All clean, LGTM! 👍" |
1 similar comment
clang-tidy review says "All clean, LGTM! 👍" |
clang-tidy review says "All clean, LGTM! 👍" |
clang-tidy review says "All clean, LGTM! 👍" |
1 similar comment
clang-tidy review says "All clean, LGTM! 👍" |
clang-tidy review says "All clean, LGTM! 👍" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
PR approved by anyone and no changes requested. |
run buildall |
clang-tidy review says "All clean, LGTM! 👍" |
1 similar comment
clang-tidy review says "All clean, LGTM! 👍" |
run buildall |
clang-tidy review says "All clean, LGTM! 👍" |
run p0 |
@@ -394,6 +394,8 @@ public class SessionVariable implements Serializable, Writable { | |||
public static final String ENABLE_MEMTABLE_ON_SINK_NODE = | |||
"enable_memtable_on_sink_node"; | |||
|
|||
public static final String CONJUNCTION_RATIO = "conjunction_ratio"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name is not easy to understand. one suggestion: inverted_index_conjunction_opt_threshold.
BTW, add document for this session var.
@@ -240,6 +240,8 @@ struct TQueryOptions { | |||
|
|||
// A tag used to distinguish fe start epoch. | |||
82: optional i64 fe_process_uuid = 0; | |||
|
|||
83: optional i32 conjunction_ratio = 1000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name
std::wstring str_tokens; | ||
for (auto& token : analyse_result) { | ||
str_tokens += token; | ||
str_tokens += L"_"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'a b_c' and 'a_b c' will be the same key. This kind of problem can be avoid by using whitespace can.
const std::unique_ptr<lucene::search::Query>& query, | ||
const std::shared_ptr<roaring::Roaring>& term_match_bitmap); | ||
|
||
Status match_all_search(OlapReaderStatistics* stats, RuntimeState* runtime_state, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
function name index_search and match_all_search is not symmetrical. suggestion: normal_index_search, match_all_index_search
private: | ||
void search_by_bitmap(roaring::Roaring& roaring); | ||
|
||
void search_by_skiplist(roaring::Roaring& roaring) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's confusing to put search_by_skiplist and next_doc impl in .h
} | ||
} | ||
|
||
int32_t next_doc() { return do_next(_lead1.nextDoc()); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
next_doc is only used once and is simple, is it necessary to create a new function for it?
std::vector<TermIterator> _others; | ||
|
||
std::vector<Term*> _terms; | ||
std::vector<TermDocs*> _termDocs; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_term_docs
IndexReader* _reader = nullptr; | ||
IndexVersion _indexVersion = IndexVersion::kV0; | ||
int32_t _conjunction_ratio = 1000; | ||
bool _useSkip = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_use_skiplist
(From new machine)TeamCity pipeline, clickbench performance test result: |
run buildall |
TeamCity be ut coverage result: |
(From new machine)TeamCity pipeline, clickbench performance test result: |
clang-tidy review says "All clean, LGTM! 👍" |
1 similar comment
clang-tidy review says "All clean, LGTM! 👍" |
run buildall |
clang-tidy review says "All clean, LGTM! 👍" |
(From new machine)TeamCity pipeline, clickbench performance test result: |
run beut |
run buildall |
clang-tidy review says "All clean, LGTM! 👍" |
2 similar comments
clang-tidy review says "All clean, LGTM! 👍" |
clang-tidy review says "All clean, LGTM! 👍" |
TeamCity be ut coverage result: |
TeamCity be ut coverage result: |
(From new machine)TeamCity pipeline, clickbench performance test result: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Proposed changes
Issue Number: close #xxx
Further comments
If this is a relatively large or complex change, kick off the discussion at [email protected] by explaining why you chose the solution you did and what alternatives you considered, etc...