[fix](cloud) fix routine load job stuck if commit transaction failed #40539

Merged
merged 1 commit into from
Sep 11, 2024

Conversation

sollhui
Contributor

@sollhui sollhui commented Sep 9, 2024

In the before-commit stage, a write lock is acquired. If committing the transaction fails, the thread returns directly and the write lock is never released, which causes the job to get stuck.
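
The failure mode described above can be illustrated with a minimal Java sketch. This is not the Doris code or the actual diff of this PR: the class `CommitLockSketch`, the methods `commitLeaky` and `commitSafe`, and the `commitTransaction` stand-in are all hypothetical names chosen for illustration. It only shows the general pattern of a lock leaking on an early return, and one common way (a `finally` block) to make sure it is released on every path.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class CommitLockSketch {
    // Hypothetical stand-in for the lock taken in the before-commit stage.
    static final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Hypothetical commit step; throws when asked to, simulating a failed commit.
    static void commitTransaction(boolean fail) {
        if (fail) {
            throw new IllegalStateException("commit failed");
        }
    }

    // Leaky shape: the early return on failure skips the unlock call.
    static boolean commitLeaky(boolean fail) {
        lock.writeLock().lock();
        try {
            commitTransaction(fail);
        } catch (IllegalStateException e) {
            return false; // the write lock is still held here
        }
        lock.writeLock().unlock();
        return true;
    }

    // Safe shape: finally releases the lock on both success and failure.
    static boolean commitSafe(boolean fail) {
        lock.writeLock().lock();
        try {
            commitTransaction(fail);
            return true;
        } catch (IllegalStateException e) {
            return false;
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        commitSafe(true);
        System.out.println("after commitSafe(true), write locked: " + lock.isWriteLocked());
        commitLeaky(true);
        System.out.println("after commitLeaky(true), write locked: " + lock.isWriteLocked());
        lock.writeLock().unlock(); // clean up the leaked lock from the demo
    }
}
```

In the leaky variant, any later caller that needs the same lock would block indefinitely, which matches the "job stuck" symptom in the description.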

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@sollhui
Contributor Author

sollhui commented Sep 9, 2024

run buildall

@doris-robot

TPC-H: Total hot run time: 38505 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 212b9389d2da755f59b3f9a41f23ddd0769579c9, data reload: false

------ Round 1 ----------------------------------
q1	17831	4579	4412	4412
q2	2695	190	212	190
q3	12119	1142	1215	1142
q4	11245	724	833	724
q5	8478	2842	2824	2824
q6	221	136	138	136
q7	975	604	598	598
q8	9327	2067	2043	2043
q9	7236	6560	6557	6557
q10	7008	2233	2235	2233
q11	490	243	249	243
q12	401	229	222	222
q13	18118	3115	3066	3066
q14	282	235	237	235
q15	536	488	482	482
q16	515	453	435	435
q17	983	654	695	654
q18	7449	6867	6807	6807
q19	1375	1044	1021	1021
q20	682	334	326	326
q21	3980	3126	3120	3120
q22	1149	1035	1037	1035
Total cold run time: 113095 ms
Total hot run time: 38505 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4349	4300	4284	4284
q2	391	276	275	275
q3	2875	2683	2661	2661
q4	1977	1710	1677	1677
q5	5401	5417	5436	5417
q6	245	140	134	134
q7	2116	1703	1743	1703
q8	3190	3348	3332	3332
q9	8476	8398	8467	8398
q10	3502	3244	3248	3244
q11	617	518	496	496
q12	800	613	602	602
q13	8490	3079	3100	3079
q14	310	277	275	275
q15	532	484	483	483
q16	533	485	484	484
q17	1819	1509	1500	1500
q18	7780	7399	7532	7399
q19	1681	1373	1563	1373
q20	2086	1795	1863	1795
q21	5495	5433	5286	5286
q22	1139	1091	1054	1054
Total cold run time: 63804 ms
Total hot run time: 54951 ms

@doris-robot

TPC-DS: Total hot run time: 191479 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 212b9389d2da755f59b3f9a41f23ddd0769579c9, data reload: false

query1	914	375	380	375
query2	6495	1888	1879	1879
query3	6659	215	227	215
query4	33889	23171	23124	23124
query5	4163	508	492	492
query6	254	190	163	163
query7	4589	299	308	299
query8	275	215	219	215
query9	8444	2485	2495	2485
query10	443	267	266	266
query11	17743	15096	15215	15096
query12	151	98	97	97
query13	1618	396	357	357
query14	8656	6400	6987	6400
query15	223	175	167	167
query16	7853	472	487	472
query17	1576	559	563	559
query18	2039	307	283	283
query19	199	158	142	142
query20	120	109	109	109
query21	201	110	103	103
query22	4516	4278	4132	4132
query23	34274	33756	33544	33544
query24	11132	2858	2870	2858
query25	646	379	390	379
query26	1445	161	153	153
query27	3015	277	280	277
query28	7845	2037	2038	2037
query29	921	415	420	415
query30	329	156	152	152
query31	977	770	757	757
query32	96	54	57	54
query33	764	289	287	287
query34	1010	474	480	474
query35	871	750	712	712
query36	1073	920	940	920
query37	163	94	85	85
query38	3981	3906	3957	3906
query39	1462	1379	1409	1379
query40	268	115	122	115
query41	48	49	46	46
query42	121	95	93	93
query43	493	469	468	468
query44	1188	768	749	749
query45	195	167	167	167
query46	1105	764	745	745
query47	1905	1808	1829	1808
query48	373	290	296	290
query49	1125	466	452	452
query50	830	405	418	405
query51	6964	6948	6939	6939
query52	99	90	92	90
query53	261	198	184	184
query54	1049	460	464	460
query55	80	77	78	77
query56	297	277	310	277
query57	1209	1100	1068	1068
query58	234	222	225	222
query59	2929	2841	2731	2731
query60	290	269	260	260
query61	100	99	101	99
query62	833	658	690	658
query63	228	180	184	180
query64	5278	701	664	664
query65	3229	3147	3159	3147
query66	1431	342	344	342
query67	15821	15479	15413	15413
query68	3103	854	844	844
query69	429	317	323	317
query70	1199	1190	1174	1174
query71	348	337	338	337
query72	6158	3520	2599	2599
query73	603	591	581	581
query74	9039	9037	8998	8998
query75	3155	2950	2906	2906
query76	1845	854	846	846
query77	477	404	400	400
query78	9350	9352	9175	9175
query79	893	912	883	883
query80	892	814	805	805
query81	443	263	262	262
query82	271	266	266	266
query83	195	194	194	194
query84	243	111	108	108
query85	645	394	382	382
query86	311	311	310	310
query87	4403	4369	4381	4369
query88	4363	4124	4107	4107
query89	375	374	373	373
query90	1484	326	319	319
query91	124	134	126	126
query92	76	75	73	73
query93	911	951	926	926
query94	570	364	388	364
query95	424	410	460	410
query96	473	475	473	473
query97	3107	3117	3096	3096
query98	229	222	222	222
query99	1384	1266	1281	1266
Total cold run time: 286529 ms
Total hot run time: 191479 ms

@doris-robot

ClickBench: Total hot run time: 31.89 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 212b9389d2da755f59b3f9a41f23ddd0769579c9, data reload: false

query1	0.04	0.04	0.04
query2	0.09	0.04	0.04
query3	0.22	0.05	0.06
query4	1.86	0.09	0.09
query5	0.52	0.53	0.50
query6	1.30	0.74	0.74
query7	0.02	0.02	0.01
query8	0.05	0.05	0.05
query9	0.55	0.50	0.49
query10	0.55	0.57	0.54
query11	0.16	0.12	0.12
query12	0.15	0.12	0.12
query13	0.61	0.59	0.60
query14	1.39	1.45	1.44
query15	0.85	0.82	0.86
query16	0.37	0.37	0.39
query17	1.05	1.04	1.06
query18	0.22	0.19	0.20
query19	1.82	1.83	1.81
query20	0.01	0.01	0.01
query21	15.40	0.67	0.67
query22	4.55	7.34	2.08
query23	18.33	1.29	1.25
query24	2.18	0.21	0.22
query25	0.16	0.09	0.07
query26	0.27	0.19	0.18
query27	0.08	0.09	0.08
query28	13.25	1.03	1.01
query29	12.58	3.37	3.34
query30	0.24	0.05	0.05
query31	2.88	0.40	0.39
query32	3.24	0.48	0.48
query33	2.96	2.99	3.01
query34	17.13	4.46	4.55
query35	4.54	4.50	4.54
query36	0.67	0.47	0.49
query37	0.18	0.15	0.16
query38	0.15	0.15	0.15
query39	0.05	0.04	0.04
query40	0.16	0.13	0.14
query41	0.10	0.05	0.05
query42	0.06	0.05	0.04
query43	0.05	0.04	0.04
Total cold run time: 111.04 s
Total hot run time: 31.89 s

@sollhui
Contributor Author

sollhui commented Sep 10, 2024

run buildall

@sollhui
Contributor Author

sollhui commented Sep 10, 2024

run buildall

@sollhui
Contributor Author

sollhui commented Sep 10, 2024

run buildall

@sollhui
Contributor Author

sollhui commented Sep 10, 2024

run buildall

Contributor

@liaoxin01 liaoxin01 left a comment


LGTM

Contributor

PR approved by at least one committer and no changes requested.

@github-actions bot added the approved and reviewed labels Sep 11, 2024
Contributor

PR approved by anyone and no changes requested.

@doris-robot

TPC-H: Total hot run time: 38365 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 2eaef5a015b9ead8b39ce48bad65913ae90f2b73, data reload: false

------ Round 1 ----------------------------------
q1	17645	4345	4357	4345
q2	2018	190	183	183
q3	11790	990	1119	990
q4	10521	739	720	720
q5	7765	2853	2807	2807
q6	222	142	141	141
q7	946	622	627	622
q8	9514	2084	2062	2062
q9	7490	6534	6533	6533
q10	6996	2201	2293	2201
q11	449	256	249	249
q12	416	227	241	227
q13	17775	3086	3099	3086
q14	283	239	241	239
q15	533	493	486	486
q16	531	439	437	437
q17	990	673	704	673
q18	7350	6930	6933	6930
q19	1392	1052	1090	1052
q20	672	330	338	330
q21	4041	3125	3060	3060
q22	1125	992	1017	992
Total cold run time: 110464 ms
Total hot run time: 38365 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4361	4263	4341	4263
q2	392	285	275	275
q3	2886	2636	2627	2627
q4	1950	1672	1689	1672
q5	5648	5672	5745	5672
q6	237	140	137	137
q7	2270	1825	1879	1825
q8	3313	3443	3484	3443
q9	8879	8896	8906	8896
q10	3603	3422	3361	3361
q11	628	546	517	517
q12	854	655	639	639
q13	12362	3351	3249	3249
q14	330	293	291	291
q15	531	494	512	494
q16	549	542	492	492
q17	1854	1541	1528	1528
q18	8076	7754	7924	7754
q19	1742	1617	1517	1517
q20	2141	1938	1926	1926
q21	5794	5545	5506	5506
q22	1169	1082	1035	1035
Total cold run time: 69569 ms
Total hot run time: 57119 ms

@doris-robot

TPC-DS: Total hot run time: 198040 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 2eaef5a015b9ead8b39ce48bad65913ae90f2b73, data reload: false

query1	1264	879	926	879
query2	6313	1959	1956	1956
query3	10653	4058	4016	4016
query4	60186	27652	23235	23235
query5	5035	500	498	498
query6	395	176	159	159
query7	5624	306	291	291
query8	306	213	208	208
query9	7799	2501	2507	2501
query10	406	294	264	264
query11	17386	15179	15522	15179
query12	156	102	119	102
query13	1445	404	399	399
query14	10545	7565	7640	7565
query15	209	185	176	176
query16	6453	398	471	398
query17	1091	557	562	557
query18	1204	301	297	297
query19	193	150	155	150
query20	116	119	116	116
query21	209	104	107	104
query22	4718	4422	4364	4364
query23	34785	33639	33358	33358
query24	6039	2903	2895	2895
query25	493	387	403	387
query26	621	154	151	151
query27	1621	273	283	273
query28	3952	2037	2012	2012
query29	634	423	397	397
query30	242	154	158	154
query31	934	769	775	769
query32	72	53	57	53
query33	450	280	284	280
query34	888	483	470	470
query35	850	733	708	708
query36	1062	923	957	923
query37	142	86	83	83
query38	4010	3930	3928	3928
query39	1444	1532	1413	1413
query40	203	119	115	115
query41	48	48	46	46
query42	122	96	96	96
query43	516	482	493	482
query44	1130	763	748	748
query45	201	165	169	165
query46	1130	763	740	740
query47	1869	1780	1812	1780
query48	374	304	305	304
query49	766	459	441	441
query50	813	427	423	423
query51	7049	6818	6908	6818
query52	103	85	91	85
query53	249	182	184	182
query54	571	463	450	450
query55	76	75	72	72
query56	274	257	253	253
query57	1196	1092	1060	1060
query58	215	234	225	225
query59	2988	2875	2960	2875
query60	291	264	260	260
query61	133	100	96	96
query62	748	655	657	655
query63	225	190	187	187
query64	1356	662	664	662
query65	3264	3164	3216	3164
query66	680	334	364	334
query67	15747	15408	15348	15348
query68	1544	861	840	840
query69	419	330	331	330
query70	1204	1151	1122	1122
query71	341	344	342	342
query72	4605	3519	3600	3519
query73	593	579	586	579
query74	9112	8870	9051	8870
query75	3075	2937	3012	2937
query76	952	874	867	867
query77	414	419	406	406
query78	9398	9344	9234	9234
query79	910	894	851	851
query80	796	809	804	804
query81	460	261	262	261
query82	266	265	261	261
query83	194	190	190	190
query84	194	109	106	106
query85	582	414	445	414
query86	314	325	312	312
query87	4384	4398	4360	4360
query88	4241	4131	4099	4099
query89	373	367	370	367
query90	787	313	314	313
query91	126	122	134	122
query92	78	78	77	77
query93	913	917	916	916
query94	400	348	362	348
query95	437	421	417	417
query96	471	474	472	472
query97	3120	3119	3067	3067
query98	243	238	243	238
query99	1302	1265	1283	1265
Total cold run time: 294530 ms
Total hot run time: 198040 ms

@doris-robot

ClickBench: Total hot run time: 31.71 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 2eaef5a015b9ead8b39ce48bad65913ae90f2b73, data reload: false

query1	0.05	0.04	0.04
query2	0.08	0.04	0.04
query3	0.23	0.05	0.05
query4	1.67	0.08	0.08
query5	0.51	0.50	0.49
query6	1.13	0.74	0.73
query7	0.02	0.01	0.01
query8	0.04	0.04	0.05
query9	0.54	0.50	0.49
query10	0.54	0.56	0.55
query11	0.16	0.12	0.11
query12	0.15	0.13	0.13
query13	0.60	0.60	0.59
query14	1.39	1.41	1.43
query15	0.86	0.84	0.82
query16	0.38	0.38	0.38
query17	1.06	1.04	1.07
query18	0.22	0.22	0.21
query19	1.98	1.75	1.75
query20	0.01	0.01	0.01
query21	15.39	0.66	0.65
query22	4.24	7.11	2.07
query23	18.24	1.38	1.30
query24	2.15	0.22	0.21
query25	0.15	0.09	0.08
query26	0.27	0.17	0.18
query27	0.08	0.08	0.08
query28	13.21	1.03	1.01
query29	12.62	3.41	3.36
query30	0.24	0.06	0.06
query31	2.86	0.41	0.40
query32	3.23	0.48	0.49
query33	3.00	3.00	3.01
query34	17.06	4.41	4.45
query35	4.48	4.54	4.43
query36	0.67	0.47	0.47
query37	0.19	0.16	0.15
query38	0.15	0.14	0.14
query39	0.05	0.03	0.04
query40	0.16	0.12	0.12
query41	0.08	0.04	0.05
query42	0.06	0.06	0.04
query43	0.04	0.04	0.04
Total cold run time: 110.24 s
Total hot run time: 31.71 s

Contributor

@dataroaring dataroaring left a comment


LGTM

@dataroaring dataroaring merged commit 8e37fb4 into apache:master Sep 11, 2024
25 of 28 checks passed
dataroaring pushed a commit that referenced this pull request Sep 12, 2024
…40539)

In the before-commit stage, a write lock is acquired. If committing the
transaction fails, the thread returns directly and the write lock is
never released, which causes the job to get stuck.
Labels
approved, dev/3.0.2-merged, reviewed

4 participants