[Improve](inverted index) improve match performance without index #24751

Merged
merged 2 commits into from
Sep 22, 2023

airborne12
Member

Proposed changes

Issue Number: close #xxx

Improve performance of `match` queries on columns without an inverted index: on the httplogs dataset, query time drops from 2 min 34.54 sec to 9.16 sec.
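
The description does not spell out what evaluating `match` without an index involves, so the following is only a general illustration, not the code in this PR. It is a minimal sketch assuming a lowercase, whitespace-splitting tokenizer and match-any semantics; `tokenize` and `count_matches` are hypothetical names and are not Doris functions. The query is tokenized once, outside the per-row loop, and every row is then tokenized and probed against that token set.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a real analyzer: lowercases and splits on whitespace.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::istringstream in(text);
    std::string tok;
    while (in >> tok) {
        std::transform(tok.begin(), tok.end(), tok.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
        tokens.push_back(tok);
    }
    return tokens;
}

// Counts rows that contain at least one query token (assumed match-any semantics).
std::size_t count_matches(const std::vector<std::string>& rows, const std::string& query) {
    // Tokenize the query once; it does not change from row to row.
    const std::vector<std::string> q = tokenize(query);
    const std::unordered_set<std::string> query_tokens(q.begin(), q.end());

    std::size_t hits = 0;
    for (const auto& row : rows) {
        // Without an index, every row has to be tokenized and probed at query time.
        for (const auto& tok : tokenize(row)) {
            if (query_tokens.count(tok) > 0) {
                ++hits;
                break;
            }
        }
    }
    return hits;
}
```

Because every row is scanned and tokenized at query time, the cost grows with the table size, which is consistent with the gap between the indexed and unindexed timings reported below.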

mysql> select COUNT() from wc_httplogs_without_index where request match "GET";
+-----------+
| count(*)  |
+-----------+
| 245552656 |
+-----------+
1 row in set (29.29 sec)

mysql> select COUNT() from wc_httplogs where request match "GET";
+-----------+
| count(*)  |
+-----------+
| 245552656 |
+-----------+
1 row in set (0.51 sec)

mysql> select COUNT() from wc_httplogs_without_index where request match "GET";
+-----------+
| count(*)  |
+-----------+
| 245552656 |
+-----------+
1 row in set (2 min 34.54 sec)

mysql> select COUNT() from wc_httplogs_without_index where request match "GET";
+-----------+
| count(*)  |
+-----------+
| 245552656 |
+-----------+
1 row in set (9.16 sec)
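
As a quick check of the numbers above, the speedup on the unindexed table and the remaining gap to the indexed table (0.51 sec) work out roughly as follows. This uses only the timings quoted in this description; the 29.29 sec run is not labeled in the description and is left out.

```latex
\[
\text{speedup} = \frac{2 \times 60 + 34.54}{9.16} = \frac{154.54}{9.16} \approx 16.9,
\qquad
\frac{9.16}{0.51} \approx 18.0
\]
```

So the unindexed query is roughly 17 times faster than before, and still about 18 times slower than the same query on the indexed table.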

@airborne12
Member Author

run buildall

@github-actions
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 36.42% (8024/22031)
Line Coverage: 28.72% (64431/224366)
Region Coverage: 27.62% (33461/121141)
Branch Coverage: 24.27% (17119/70534)
Coverage Report: http://coverage.selectdb-in.cc/coverage/f692b6dde483badf2628225f2cd4779c969e54fa_f692b6dde483badf2628225f2cd4779c969e54fa/report/index.html

@airborne12
Member Author

run buildall

@github-actions
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@airborne12
Member Author

run buildall

@github-actions
Contributor

clang-tidy review says "All clean, LGTM! 👍"

Contributor

@zzzxl1993 zzzxl1993 left a comment

LGTM

@airborne12
Member Author

run buildall

Contributor

@github-actions github-actions bot left a comment

clang-tidy made some suggestions

@@ -282,18 +294,18 @@ Status FullTextIndexReader::query(OlapReaderStatistics* stats, RuntimeState* run
if (query_type == InvertedIndexQueryType::MATCH_PHRASE_QUERY ||
query_type == InvertedIndexQueryType::MATCH_ALL_QUERY ||
query_type == InvertedIndexQueryType::EQUAL_QUERY) {
std::wstring wstr_tokens;
std::string str_tokens;

warning: variable 'str_tokens' is not initialized [cppcoreguidelines-init-variables]

Suggested change
std::string str_tokens;
std::string str_tokens = 0;

auto reader = doris::segment_v2::InvertedIndexReader::create_reader(
&inverted_index_ctx, tokenize_str.to_string());

std::vector<std::string> query_tokens =

warning: variable 'query_tokens' is not initialized [cppcoreguidelines-init-variables]

Suggested change
std::vector<std::string> query_tokens =
std::vector<std::string> query_tokens = 0 =
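
Both warnings above concern class types with default constructors, so neither variable is actually left uninitialized. The following minimal sketch illustrates this; `default_constructed_are_empty` is a hypothetical helper name used only for this example.

```cpp
#include <string>
#include <vector>

// Returns true when default-constructed standard containers are already
// in a valid, empty state without any explicit initializer.
bool default_constructed_are_empty() {
    std::string str_tokens;                 // default constructor runs: empty string
    std::vector<std::string> query_tokens;  // default constructor runs: empty vector
    return str_tokens.empty() && query_tokens.empty();
}
```

The suggested `= 0` edits should not be applied as written. Before C++23, `std::string str_tokens = 0;` selects the `const char*` constructor with a null pointer, which is undefined behavior, and `std::vector<std::string> query_tokens = 0 =` is not valid C++ syntax. In the diff, `query_tokens` is also already followed by an initializer.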

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 36.40% (8023/22043)
Line Coverage: 28.68% (64435/224648)
Region Coverage: 27.60% (33470/121283)
Branch Coverage: 24.26% (17126/70598)
Coverage Report: http://coverage.selectdb-in.cc/coverage/352610bc410aeeea24bee97bc4aa6815030b769f_352610bc410aeeea24bee97bc4aa6815030b769f/report/index.html

@airborne12
Member Author

run buildall

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 36.40% (8023/22044)
Line Coverage: 28.68% (64428/224678)
Region Coverage: 27.58% (33458/121309)
Branch Coverage: 24.24% (17118/70618)
Coverage Report: http://coverage.selectdb-in.cc/coverage/518f046751faf3c763d5c491ffd2349836a420ac_518f046751faf3c763d5c491ffd2349836a420ac/report/index.html

@doris-robot

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 45.49 seconds
stream load tsv: 575 seconds loaded 74807831229 Bytes, about 124 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
insert into select: 28.7 seconds inserted 10000000 Rows, about 348K ops/s
storage size: 17162249393 Bytes

Contributor

@xiaokang xiaokang left a comment

LGTM
