Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[enhancement](fix)change ordinary type null value is \N,complex type null value is null #24207

Merged
merged 5 commits into from
Sep 16, 2023

Conversation

hubgeter
Copy link
Contributor

@hubgeter hubgeter commented Sep 11, 2023

Proposed changes

For null values in ordinary types, we use \N to represent them;
for null values in nested types, we use null to represent them, just like the json format.

example:
If you have three nullable columns
a : int, b : string, c : map<string,int>
data:

\N,hello world,\N
1,\N,{"cmake":2,"null":11}
9,"\N",{"\N":null,null:0}
\N,"null",{null:null}
null,null,null

if you set trim_double_quotes = true
you will get :

NULL,hello world,NULL
1,NULL,{"cmake":2,"null":11}
9,\N,{"\N":NULL,NULL:0}
NULL,null,{NULL:NULL}
NULL,null,NULL

if you set trim_double_quotes = false
you will get :

NULL,hello world,NULL
1,\N,{"cmake":2,"null":11}
9,"\N",{"\N":NULL,NULL:0}
NULL,"null",{NULL:NULL}
NULL,null,NULL

in csv(text) for normal type: we only recognize \N for null , so
for not char family type, like int, if we put null literal ,
it will parse fail, and make result null,not just because it equals \N.
for char family type, like string, if we put null literal, it will parse success,
and "null" literal will be stored in doris.

For strings in the json complex type, we remove double quotes by default.

Because when querying complex types, such as selecting complexColumn from table,
we will add double quotes to the strings in the complex type.

For the map<string,int> column, insert { "abc" : 1, "hello",2 }.
If you do not remove the double quotes, it will display {""abc"":1,""hello"": 2 },
remove the double quotes to display { "abc" : 1, "hello",2 }.

Further comments

If this is a relatively large or complex change, kick off the discussion at [email protected] by explaining why you chose the solution you did and what alternatives you considered, etc...

@hubgeter
Copy link
Contributor Author

run buildall

@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot
Copy link

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 46.55 seconds
stream load tsv: 583 seconds loaded 74807831229 Bytes, about 122 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
insert into select: 28.9 seconds inserted 10000000 Rows, about 346K ops/s
storage size: 17162358405 Bytes

@hubgeter
Copy link
Contributor Author

run buildall

@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot
Copy link

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 48.25 seconds
stream load tsv: 583 seconds loaded 74807831229 Bytes, about 122 MB/s
stream load json: 21 seconds loaded 2358488459 Bytes, about 107 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
insert into select: 28.9 seconds inserted 10000000 Rows, about 346K ops/s
storage size: 17162249491 Bytes

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 36.92% (7917/21443)
Line Coverage: 28.96% (63647/219753)
Region Coverage: 27.88% (33021/118438)
Branch Coverage: 24.46% (16961/69334)
Coverage Report: http://coverage.selectdb-in.cc/coverage/9039fff5c5de53c6f32de6f0ec7f5dc1cd4c1962_9039fff5c5de53c6f32de6f0ec7f5dc1cd4c1962/report/index.html

morningman
morningman previously approved these changes Sep 12, 2023
Copy link
Contributor

@morningman morningman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
But please describe more in your PR comment about how to handle null in text

@github-actions
Copy link
Contributor

PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added approved Indicates a PR has been approved by one committer. reviewed labels Sep 12, 2023
@github-actions
Copy link
Contributor

PR approved by anyone and no changes requested.

auto& null_column = assert_cast<ColumnNullable&>(column);
// TODO(Amory) make null literal configurable

if (!(options.converted_from_string && slice.trim_quote())) {
//for map<string,string> type : {"abc","NULL"} , the NULL is string , instead of null values
if (slice.size == 4 && slice[0] == 'N' && slice[1] == 'U' && slice[2] == 'L' &&
slice[3] == 'L') {
if (nesting_level >= 2 && slice.size == 4 && slice[0] == 'n' && slice[1] == 'u' &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comment to describe the logic here. Better give some example

@github-actions github-actions bot removed the approved Indicates a PR has been approved by one committer. label Sep 14, 2023
@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@hubgeter
Copy link
Contributor Author

run buildall

@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 36.94% (7970/21578)
Line Coverage: 28.90% (63879/221008)
Region Coverage: 27.83% (33152/119111)
Branch Coverage: 24.39% (17000/69706)
Coverage Report: http://coverage.selectdb-in.cc/coverage/8bbd14c6636ff01dca07d4f396006b646d6ea6dd_8bbd14c6636ff01dca07d4f396006b646d6ea6dd/report/index.html

@doris-robot
Copy link

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 46.96 seconds
stream load tsv: 599 seconds loaded 74807831229 Bytes, about 119 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 28.9 seconds inserted 10000000 Rows, about 346K ops/s
storage size: 17162423649 Bytes

* for null values in nested types, we use null to represent them, just like the json format.
*
* example:
* If you have three nullable columns
Copy link
Contributor

@amorynan amorynan Sep 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make a note: null -> int -> NULL | null -> char family -> "null"
in csv(text) for normal type: we only recognize \N for null
so
for not char family type, like int, if we put null literal , it will parse fail, and make result null,not just because it equals \N
for char family type, like string, if we put null literal, it will parse success, and "null" literal will be stored in doris

@hubgeter
Copy link
Contributor Author

run buildall

@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 36.84% (7977/21654)
Line Coverage: 28.86% (64003/221769)
Region Coverage: 27.78% (33196/119502)
Branch Coverage: 24.37% (17031/69888)
Coverage Report: http://coverage.selectdb-in.cc/coverage/c6d7df96fce2b1079cee8d892530c5f45c1f80d8_c6d7df96fce2b1079cee8d892530c5f45c1f80d8/report/index.html

@doris-robot
Copy link

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 45.75 seconds
stream load tsv: 598 seconds loaded 74807831229 Bytes, about 119 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 29.1 seconds inserted 10000000 Rows, about 343K ops/s
storage size: 17162265867 Bytes

@hubgeter
Copy link
Contributor Author

run buildall

@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 36.85% (7979/21654)
Line Coverage: 28.87% (64023/221769)
Region Coverage: 27.79% (33208/119502)
Branch Coverage: 24.37% (17031/69888)
Coverage Report: http://coverage.selectdb-in.cc/coverage/e28841e1421bce02d3bc3ae8cfe3c2b59d0d9891_e28841e1421bce02d3bc3ae8cfe3c2b59d0d9891/report/index.html

@doris-robot
Copy link

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 51.84 seconds
stream load tsv: 599 seconds loaded 74807831229 Bytes, about 119 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 28.7 seconds inserted 10000000 Rows, about 348K ops/s
storage size: 17162200224 Bytes

Copy link
Contributor

@amorynan amorynan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@morningman morningman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Sep 16, 2023
@github-actions
Copy link
Contributor

PR approved by at least one committer and no changes requested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants