[Optimize](invert index) Optimize multiple terms conjunction query #23871

Merged
merged 1 commit into from
Sep 8, 2023

Conversation

zzzxl1993

Proposed changes

Issue Number: close #xxx

@github-actions

github-actions bot commented Sep 4, 2023

clang-tidy review says "All clean, LGTM! 👍"

@@ -32,7 +32,8 @@ Status compact_column(int32_t index_id, int src_segment_num, int dest_segment_nu
std::vector<uint32_t> dest_segment_num_rows) {
lucene::store::Directory* dir =
DorisCompoundDirectory::getDirectory(fs, index_writer_path.c_str(), false);
auto index_writer = _CLNEW lucene::index::IndexWriter(dir, nullptr, true /* create */,
lucene::analysis::SimpleAnalyzer<char> analyzer;

Use the simple analyzer in all cases?

std::wstring str_tokens;
str_tokens += std::to_wstring(static_cast<int32_t>(query_type));

move it into InvertedIndexQueryCache::CacheKey::encode()

std::wstring str_tokens;
str_tokens += std::to_wstring(static_cast<int32_t>(query_type));
for (auto& token : analyse_result) {
str_tokens += token;

Add a whitespace separator.

SCOPED_RAW_TIMER(&stats->inverted_index_searcher_search_timer);
if (query_type == InvertedIndexQueryType::MATCH_ANY_QUERY ||
query_type == InvertedIndexQueryType::EQUAL_QUERY) {
index_searcher->_search(query.get(), [&term_match_bitmap](DocRange* docRange) {

use doc_range consistently

int32_t big = iterators[iterators.size() - 1].docFreq();
if (little == 0) {
_useSkip = true;
} else if ((big / little) > 1000) {

Make this threshold a session variable.
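
The heuristic under review can be sketched as follows. This is a minimal illustration, not the PR's code: should_use_skiplist is a hypothetical helper, the visible diff cuts off before the body of the else-if branch (it is assumed here to enable skipping as well), and the term iterators are assumed to be ordered by document frequency, which the sketch reproduces by sorting. The ratio parameter stands in for the hard-coded 1000.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: given the document frequency of every query term,
// decide whether the conjunction should be evaluated with skip lists
// instead of per-term bitmaps.
bool should_use_skiplist(std::vector<int32_t> doc_freqs, int32_t ratio) {
    assert(ratio > 0);
    if (doc_freqs.empty()) {
        return false;  // assumption: nothing to intersect, keep the default path
    }
    // Assumption: the real iterators are already ordered by docFreq().
    std::sort(doc_freqs.begin(), doc_freqs.end());
    const int32_t little = doc_freqs.front();  // rarest term
    const int32_t big = doc_freqs.back();      // most frequent term
    if (little == 0) {
        return true;  // mirrors the `little == 0` branch in the diff
    }
    // Integer division, as in the diff: skip only when the lists are very uneven.
    return (big / little) > ratio;
}
```

With a ratio of 1000, frequencies {10, 20000} select the skip-list path and {100, 200} do not.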

@airborne12 airborne12 left a comment

LGTM

@github-actions

github-actions bot commented Sep 6, 2023

PR approved by anyone and no changes requested.

@zzzxl1993
Contributor Author

run buildall

@zzzxl1993
Contributor Author

run buildall

@zzzxl1993
Contributor Author

run p0

@@ -394,6 +394,8 @@ public class SessionVariable implements Serializable, Writable {
public static final String ENABLE_MEMTABLE_ON_SINK_NODE =
"enable_memtable_on_sink_node";

public static final String CONJUNCTION_RATIO = "conjunction_ratio";

The name is hard to understand. One suggestion: inverted_index_conjunction_opt_threshold.
Also, add documentation for this session variable.

@@ -240,6 +240,8 @@ struct TQueryOptions {

// A tag used to distinguish fe start epoch.
82: optional i64 fe_process_uuid = 0;

83: optional i32 conjunction_ratio = 1000;

Rename this field as well.

std::wstring str_tokens;
for (auto& token : analyse_result) {
str_tokens += token;
str_tokens += L"_";

'a b_c' and 'a_b c' will produce the same key. This kind of problem can be avoided by using whitespace as the separator.
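
The collision can be reproduced with a small sketch. join_tokens and make_key are hypothetical helpers, not functions from the PR, and the whitespace variant is only safe under the assumption that analyzed tokens never contain whitespace themselves.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: builds a cache-key fragment by appending `sep`
// after every analyzed token, as the loop in the diff does with L"_".
std::wstring join_tokens(const std::vector<std::wstring>& tokens, wchar_t sep) {
    std::wstring key;
    for (const auto& token : tokens) {
        key += token;
        key += sep;
    }
    return key;
}

// Hypothetical safe variant: uses whitespace as the separator and checks the
// assumption that no token contains whitespace, so token boundaries stay
// unambiguous in the resulting key.
std::wstring make_key(const std::vector<std::wstring>& tokens) {
    for (const auto& token : tokens) {
        assert(token.find(L' ') == std::wstring::npos);
    }
    return join_tokens(tokens, L' ');
}
```

For the queries in the comment, the token lists {"a", "b_c"} and {"a_b", "c"} both join to the same string with an underscore separator, while make_key keeps them distinct.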

const std::unique_ptr<lucene::search::Query>& query,
const std::shared_ptr<roaring::Roaring>& term_match_bitmap);

Status match_all_search(OlapReaderStatistics* stats, RuntimeState* runtime_state,

The function names index_search and match_all_search are not symmetrical. Suggestion: normal_index_search and match_all_index_search.

private:
void search_by_bitmap(roaring::Roaring& roaring);

void search_by_skiplist(roaring::Roaring& roaring) {

It's confusing to put the search_by_skiplist and next_doc implementations in the .h file.

}
}

int32_t next_doc() { return do_next(_lead1.nextDoc()); }

next_doc is only used once and is simple; is it necessary to create a new function for it?
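
For context, the general shape of a skip-based conjunction (a lead iterator plus other iterators that catch up to it) can be sketched as below. This is an illustration under stated assumptions, not the PR's implementation: PostingIterator, kNoMoreDocs, and intersect are hypothetical names, posting lists are plain sorted vectors, and std::lower_bound stands in for the skip-list seek that a real term iterator would perform.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Sentinel returned once an iterator is exhausted (assumes no real doc id
// equals the maximum uint32_t value).
constexpr uint32_t kNoMoreDocs = std::numeric_limits<uint32_t>::max();

// Hypothetical stand-in for a posting-list iterator over sorted doc ids.
class PostingIterator {
public:
    explicit PostingIterator(const std::vector<uint32_t>& docs) : _docs(&docs) {}

    uint32_t doc() const { return _pos < _docs->size() ? (*_docs)[_pos] : kNoMoreDocs; }

    // Moves to the first doc id >= target and returns it. Binary search plays
    // the role that a skip list plays in a real index.
    uint32_t advance(uint32_t target) {
        auto first = _docs->begin() + static_cast<std::ptrdiff_t>(_pos);
        auto it = std::lower_bound(first, _docs->end(), target);
        _pos = static_cast<std::size_t>(it - _docs->begin());
        return doc();
    }

private:
    const std::vector<uint32_t>* _docs;
    std::size_t _pos = 0;
};

// Returns the doc ids present in every list. The shortest list leads, and
// the other iterators only ever seek forward to the current candidate.
std::vector<uint32_t> intersect(const std::vector<std::vector<uint32_t>>& lists) {
    std::vector<uint32_t> result;
    if (lists.empty()) return result;

    std::vector<const std::vector<uint32_t>*> order;
    for (const auto& list : lists) order.push_back(&list);
    std::sort(order.begin(), order.end(),
              [](const std::vector<uint32_t>* a, const std::vector<uint32_t>* b) {
                  return a->size() < b->size();
              });

    std::vector<PostingIterator> its;
    for (const auto* list : order) its.emplace_back(*list);

    PostingIterator& lead = its[0];
    uint32_t candidate = lead.doc();
    while (candidate != kNoMoreDocs) {
        bool all_match = true;
        for (std::size_t i = 1; i < its.size(); ++i) {
            const uint32_t d = its[i].advance(candidate);
            if (d != candidate) {
                // Another list jumped past the candidate: move the lead up to it.
                candidate = (d == kNoMoreDocs) ? kNoMoreDocs : lead.advance(d);
                all_match = false;
                break;
            }
        }
        if (all_match) {
            result.push_back(candidate);
            candidate = lead.advance(candidate + 1);
        }
    }
    return result;
}
```

Leading with the shortest list keeps the number of seeks proportional to the rarest term, which is why this approach pays off when document frequencies are very uneven.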

std::vector<TermIterator> _others;

std::vector<Term*> _terms;
std::vector<TermDocs*> _termDocs;

_term_docs

IndexReader* _reader = nullptr;
IndexVersion _indexVersion = IndexVersion::kV0;
int32_t _conjunction_ratio = 1000;
bool _useSkip = false;

_use_skiplist

@doris-robot

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 48.49 seconds
stream load tsv: 530 seconds loaded 74807831229 Bytes, about 134 MB/s
stream load json: 21 seconds loaded 2358488459 Bytes, about 107 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 28.7 seconds inserted 10000000 Rows, about 348K ops/s
storage size: 17162326794 Bytes

@zzzxl1993
Contributor Author

run buildall

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 36.99% (7891/21335)
Line Coverage: 29.02% (63366/218376)
Region Coverage: 27.92% (32902/117823)
Branch Coverage: 24.50% (16900/68974)
Coverage Report: http://coverage.selectdb-in.cc/coverage/17bebb48b749ee1d48edf573aa6357b3fb43e6f8_17bebb48b749ee1d48edf573aa6357b3fb43e6f8/report/index.html

@doris-robot

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 50.55 seconds
stream load tsv: 527 seconds loaded 74807831229 Bytes, about 135 MB/s
stream load json: 21 seconds loaded 2358488459 Bytes, about 107 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 28.7 seconds inserted 10000000 Rows, about 348K ops/s
storage size: 17162008498 Bytes

@zzzxl1993
Contributor Author

run buildall

@doris-robot

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 48.27 seconds
stream load tsv: 532 seconds loaded 74807831229 Bytes, about 134 MB/s
stream load json: 21 seconds loaded 2358488459 Bytes, about 107 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 30 seconds loaded 861443392 Bytes, about 27 MB/s
insert into select: 28.7 seconds inserted 10000000 Rows, about 348K ops/s
storage size: 17162235322 Bytes

@zzzxl1993
Contributor Author

run beut

@zzzxl1993
Contributor Author

run buildall

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 36.91% (7897/21398)
Line Coverage: 28.92% (63373/219120)
Region Coverage: 27.86% (32902/118113)
Branch Coverage: 24.44% (16895/69140)
Coverage Report: http://coverage.selectdb-in.cc/coverage/afc8ab3567d03c2f9f78dcb1005e1cf43e11ce60_afc8ab3567d03c2f9f78dcb1005e1cf43e11ce60/report/index.html

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 36.90% (7895/21398)
Line Coverage: 28.91% (63353/219120)
Region Coverage: 27.84% (32884/118113)
Branch Coverage: 24.43% (16893/69140)
Coverage Report: http://coverage.selectdb-in.cc/coverage/afc8ab3567d03c2f9f78dcb1005e1cf43e11ce60_afc8ab3567d03c2f9f78dcb1005e1cf43e11ce60/report/index.html

@doris-robot

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 48.16 seconds
stream load tsv: 536 seconds loaded 74807831229 Bytes, about 133 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 30 seconds loaded 861443392 Bytes, about 27 MB/s
insert into select: 29.4 seconds inserted 10000000 Rows, about 340K ops/s
storage size: 17162037598 Bytes

@xiaokang xiaokang left a comment

LGTM

5 participants