[fix](load) The NodeChannel should be canceled when failed to add block #37500

Merged: 1 commit, Jul 9, 2024

liaoxin01
Contributor

Proposed changes

F20240704 15:35:33.724236 2556376 vtablet_writer.cpp:614] Check failed: block.rows() == request->tablet_ids_size() block rows: 12192, tablet_ids_size: 8128
*** Check failure stack trace: ***
@ 0x5612d9ebf696 google::LogMessage::SendToLog()
@ 0x5612d9ebc0e0 google::LogMessage::Flush()
@ 0x5612d9ebfed9 google::LogMessageFatal::~LogMessageFatal()
@ 0x5612d96a1770 doris::vectorized::VNodeChannel::try_send_pending_block()
@ 0x5612d0541e98 doris::ThreadPool::dispatch_thread()
@ 0x5612d0537251 doris::Thread::supervise_thread()
@ 0x7f02d4061ac3 (unknown)
@ 0x7f02d40f3850 (unknown)
@ (nil) (unknown)

The crash is caused by a failure returned from append_to_block_by_selector. The call failed because memory exceeded the limit: the append to the earlier columns succeeded, but the later columns could not allocate memory. The error was returned immediately, so the subsequent _cur_add_block_request update was not executed.
If the NodeChannel is not cancelled at that point, the next add block succeeds, leaving the block with one extra batch of rows (4064) compared with the number of tablet IDs (12192 vs. 8128), which ultimately triggers the check failure.
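
The row arithmetic in the description can be reproduced with a small toy model. This is only a sketch under stated assumptions: `ToyChannel`, `add_block`, `simulate`, and the two-column layout are hypothetical names invented for illustration and are not the Doris `VNodeChannel` code. It also assumes the block reports the row count of its first column.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model only: hypothetical names, not the Doris implementation.
struct ToyChannel {
    std::vector<std::vector<int>> columns;  // pending block, one vector per column
    std::vector<int64_t> tablet_ids;        // stands in for the add-block request
    bool cancelled = false;

    explicit ToyChannel(std::size_t num_columns) : columns(num_columns) {}

    // Assumption: the row count of a block is the size of its first column.
    std::size_t block_rows() const {
        return columns.empty() ? 0 : columns[0].size();
    }

    // Append `rows` rows to every column, then record `rows` tablet ids.
    // If `fail_at_column` is a valid index, simulate a memory-limit failure
    // at that column, after the earlier columns have already grown.
    bool add_block(std::size_t rows, int fail_at_column, bool cancel_on_failure) {
        if (cancelled) return false;  // a cancelled channel accepts nothing
        for (std::size_t c = 0; c < columns.size(); ++c) {
            if (static_cast<int>(c) == fail_at_column) {
                if (cancel_on_failure) cancelled = true;
                return false;  // early return: tablet_ids is never updated
            }
            columns[c].insert(columns[c].end(), rows, 0);
        }
        tablet_ids.insert(tablet_ids.end(), rows, int64_t{0});
        return true;
    }
};

struct Outcome {
    std::size_t block_rows;
    std::size_t tablet_ids;
    bool cancelled;
};

// Three batches of 4064 rows; the second one fails at the second column.
Outcome simulate(bool cancel_on_failure) {
    ToyChannel ch(2);
    const std::size_t batch = 4064;
    ch.add_block(batch, -1, cancel_on_failure);  // batch 1 succeeds
    ch.add_block(batch, 1, cancel_on_failure);   // batch 2 fails part-way
    ch.add_block(batch, -1, cancel_on_failure);  // batch 3: accepted or rejected
    return {ch.block_rows(), ch.tablet_ids.size(), ch.cancelled};
}
```

In this model, `simulate(false)` ends with 12192 block rows against 8128 tablet IDs, the same pair of numbers as in the check failure above. `simulate(true)` marks the channel as cancelled on the failed append, so the third batch is rejected and the inconsistent block is never extended further.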

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@liaoxin01
Contributor Author

run buildall

Contributor

github-actions bot commented Jul 8, 2024

PR approved by at least one committer and no changes requested.

@github-actions bot added the "approved" label (Indicates a PR has been approved by one committer.) Jul 8, 2024
Contributor

github-actions bot commented Jul 8, 2024

PR approved by anyone and no changes requested.

Contributor

github-actions bot commented Jul 8, 2024

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot

TPC-H: Total hot run time: 39573 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 01c01574ae852a22bd900ef8155ad680aa83bdf0, data reload: false

------ Round 1 ----------------------------------
q1	17607	4305	4292	4292
q2	2030	190	195	190
q3	10449	1211	1083	1083
q4	10186	755	805	755
q5	7506	2652	2644	2644
q6	218	138	134	134
q7	960	592	593	592
q8	9245	2078	2067	2067
q9	9024	6513	6446	6446
q10	8956	3668	3701	3668
q11	445	236	232	232
q12	466	226	221	221
q13	17767	2954	2951	2951
q14	273	213	239	213
q15	520	480	502	480
q16	527	377	370	370
q17	958	635	663	635
q18	8052	7475	7465	7465
q19	7886	1422	1440	1422
q20	668	315	331	315
q21	4883	3066	3114	3066
q22	379	340	332	332
Total cold run time: 119005 ms
Total hot run time: 39573 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4386	4250	4236	4236
q2	360	250	268	250
q3	2949	2910	2878	2878
q4	2015	1645	1726	1645
q5	5601	5506	5454	5454
q6	226	129	135	129
q7	2183	1914	1862	1862
q8	3311	3385	3434	3385
q9	8669	8658	8826	8658
q10	4100	3813	3743	3743
q11	595	487	512	487
q12	850	682	649	649
q13	15845	3155	3163	3155
q14	301	277	282	277
q15	545	512	488	488
q16	497	418	438	418
q17	1828	1528	1530	1528
q18	8058	7838	7764	7764
q19	1736	1610	1702	1610
q20	2250	1963	1868	1868
q21	4957	5020	5275	5020
q22	627	549	539	539
Total cold run time: 71889 ms
Total hot run time: 56043 ms

@doris-robot

TPC-DS: Total hot run time: 173557 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 01c01574ae852a22bd900ef8155ad680aa83bdf0, data reload: false

query1	906	370	362	362
query2	6444	2313	2290	2290
query3	6641	206	215	206
query4	27586	17517	17217	17217
query5	3709	473	473	473
query6	279	168	185	168
query7	4600	288	285	285
query8	343	295	303	295
query9	8556	2409	2375	2375
query10	578	286	278	278
query11	10604	10067	9991	9991
query12	117	88	84	84
query13	1659	369	371	369
query14	10232	7229	7787	7229
query15	237	185	190	185
query16	7524	302	322	302
query17	1727	601	539	539
query18	1654	284	281	281
query19	201	153	154	153
query20	89	82	81	81
query21	213	126	130	126
query22	4365	4103	3998	3998
query23	33925	33632	33562	33562
query24	11077	2907	2921	2907
query25	618	402	431	402
query26	1024	159	150	150
query27	2280	268	282	268
query28	6814	2141	2098	2098
query29	933	664	651	651
query30	256	161	155	155
query31	961	768	761	761
query32	100	58	57	57
query33	770	301	317	301
query34	1007	472	489	472
query35	719	617	639	617
query36	1124	994	994	994
query37	142	82	87	82
query38	2956	2805	2826	2805
query39	889	855	850	850
query40	212	124	119	119
query41	59	54	53	53
query42	114	99	106	99
query43	581	552	561	552
query44	1227	721	734	721
query45	200	164	163	163
query46	1076	712	717	712
query47	1865	1787	1805	1787
query48	370	294	297	294
query49	946	412	409	409
query50	769	391	390	390
query51	6882	6855	6695	6695
query52	113	92	88	88
query53	355	282	289	282
query54	891	452	449	449
query55	74	71	73	71
query56	286	265	274	265
query57	1141	1057	1053	1053
query58	273	243	249	243
query59	3508	3109	3153	3109
query60	297	273	280	273
query61	114	92	100	92
query62	808	623	639	623
query63	334	290	291	290
query64	9254	2148	1656	1656
query65	3130	3102	3119	3102
query66	746	325	329	325
query67	15676	14969	14918	14918
query68	8581	535	533	533
query69	763	439	368	368
query70	1354	1148	1146	1146
query71	566	276	299	276
query72	8546	5559	5410	5410
query73	2158	318	314	314
query74	6073	5550	5416	5416
query75	5245	2622	2663	2622
query76	5257	1014	856	856
query77	829	302	299	299
query78	9608	9137	8984	8984
query79	8666	506	518	506
query80	1100	475	510	475
query81	577	217	224	217
query82	713	134	133	133
query83	339	167	165	165
query84	267	84	86	84
query85	1334	323	325	323
query86	400	321	289	289
query87	3353	3087	3043	3043
query88	4710	2343	2356	2343
query89	538	382	394	382
query90	2042	223	188	188
query91	132	103	100	100
query92	58	48	50	48
query93	6527	514	498	498
query94	1244	212	212	212
query95	411	306	315	306
query96	616	269	263	263
query97	3178	3042	3036	3036
query98	212	197	201	197
query99	1534	1273	1295	1273
Total cold run time: 302091 ms
Total hot run time: 173557 ms

@doris-robot

ClickBench: Total hot run time: 30.72 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 01c01574ae852a22bd900ef8155ad680aa83bdf0, data reload: false

query1	0.04	0.04	0.03
query2	0.08	0.04	0.04
query3	0.22	0.04	0.04
query4	1.68	0.06	0.07
query5	0.50	0.50	0.48
query6	1.14	0.73	0.73
query7	0.02	0.01	0.02
query8	0.05	0.04	0.04
query9	0.56	0.51	0.50
query10	0.52	0.53	0.54
query11	0.15	0.11	0.11
query12	0.15	0.12	0.12
query13	0.59	0.58	0.58
query14	0.76	0.78	0.78
query15	0.85	0.82	0.81
query16	0.36	0.35	0.36
query17	1.03	0.98	1.04
query18	0.24	0.21	0.22
query19	1.85	1.74	1.80
query20	0.01	0.01	0.00
query21	15.41	0.75	0.64
query22	4.28	6.81	2.08
query23	18.25	1.39	1.27
query24	1.93	0.26	0.22
query25	0.17	0.09	0.08
query26	0.30	0.21	0.21
query27	0.46	0.24	0.24
query28	13.33	1.02	0.99
query29	12.65	3.29	3.24
query30	0.26	0.06	0.07
query31	2.88	0.40	0.39
query32	3.25	0.47	0.48
query33	2.86	2.92	2.89
query34	17.01	4.37	4.44
query35	4.42	4.40	4.39
query36	0.66	0.47	0.46
query37	0.19	0.15	0.15
query38	0.15	0.14	0.14
query39	0.04	0.03	0.04
query40	0.15	0.12	0.12
query41	0.09	0.05	0.05
query42	0.06	0.05	0.05
query43	0.04	0.04	0.04
Total cold run time: 109.64 s
Total hot run time: 30.72 s

@dataroaring dataroaring merged commit 1772f78 into apache:master Jul 9, 2024
26 of 30 checks passed
liaoxin01 added a commit to liaoxin01/doris that referenced this pull request Jul 9, 2024
dataroaring pushed a commit that referenced this pull request Jul 17, 2024
Labels
approved Indicates a PR has been approved by one committer. dev/2.1.5-merged dev/3.0.1-merged reviewed
5 participants