[fix](cloud) fix routine load job progress cache incorrect in cloud mode #39313

Merged: 1 commit, Aug 14, 2024

@sollhui (Contributor) commented Aug 13, 2024

In cloud mode, the routine load job's progress cache can become incorrect in the following scenario:

  1. The schedule thread starts updating the cloud progress and reads an old transaction value from FDB.
  2. A routine load task commits its transaction and updates the progress cache.
  3. The update-cloud-progress RPC returns and sets the progress back to the old value, which is incorrect.

This PR fixes the problem, which may occur in storage-compute separation mode, by not allowing a smaller value to overwrite a larger one.
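
The rule of never letting a smaller value overwrite a larger one can be illustrated with a short Java sketch. This is not the code changed by the PR: the class `ProgressCache`, its methods, and the partition-to-offset map are hypothetical names chosen only to show why a stale RPC result becomes harmless once updates are monotonic.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration only; these names do not come from the Doris code base.
class ProgressCache {
    // Assumed layout: Kafka partition id -> consumed offset.
    private final Map<Integer, Long> offsets = new ConcurrentHashMap<>();

    // Keep the larger of the cached and incoming offsets, so a stale
    // (smaller) value can never move the progress backwards.
    void update(int partition, long offset) {
        offsets.merge(partition, offset, Long::max);
    }

    // Returns the cached offset, or 0 if the partition is unknown.
    long get(int partition) {
        return offsets.getOrDefault(partition, 0L);
    }
}
```

Under this sketch, if a task commits offset 200 for a partition and the delayed RPC then reports 100, the cached value stays at 200.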

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@sollhui (Contributor, Author) commented Aug 13, 2024

run buildall

@github-actions bot added the doing label on Aug 13, 2024
@doris-robot

TPC-H: Total hot run time: 40003 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 75adeeb5aa8af74b8adc7f2cbb389890b5295c50, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	17688	4466	4287	4287
q2	2017	179	174	174
q3	10550	1182	1146	1146
q4	10179	628	754	628
q5	7749	2827	2788	2788
q6	228	139	137	137
q7	956	597	586	586
q8	9341	2086	2065	2065
q9	8743	6613	6554	6554
q10	7067	2186	2251	2186
q11	471	242	248	242
q12	401	219	216	216
q13	17766	2984	2992	2984
q14	278	232	230	230
q15	522	484	488	484
q16	518	382	388	382
q17	974	643	675	643
q18	8297	7566	7574	7566
q19	4959	1061	1085	1061
q20	695	323	346	323
q21	5964	4582	4291	4291
q22	1104	1044	1030	1030
Total cold run time: 116467 ms
Total hot run time: 40003 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	4457	4292	4295	4292
q2	385	275	259	259
q3	2841	2703	2637	2637
q4	1977	1735	1698	1698
q5	5607	5677	5636	5636
q6	251	131	131	131
q7	2139	1763	1758	1758
q8	3247	3433	3393	3393
q9	8887	8862	8730	8730
q10	3464	3179	3280	3179
q11	611	520	492	492
q12	819	624	622	622
q13	16399	3169	3177	3169
q14	331	279	296	279
q15	536	479	488	479
q16	506	451	444	444
q17	1829	1528	1533	1528
q18	7912	7768	7339	7339
q19	1693	1612	1629	1612
q20	2067	1845	1797	1797
q21	5359	5151	5027	5027
q22	1139	1026	1002	1002
Total cold run time: 72456 ms
Total hot run time: 55503 ms

@doris-robot

TPC-DS: Total hot run time: 186090 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 75adeeb5aa8af74b8adc7f2cbb389890b5295c50, data reload: false

query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
query1	913	383	369	369
query2	6440	1904	1906	1904
query3	6661	209	216	209
query4	27818	23359	23189	23189
query5	4319	500	473	473
query6	272	155	157	155
query7	4596	298	291	291
query8	248	198	194	194
query9	8828	2445	2418	2418
query10	427	282	271	271
query11	17315	14848	15144	14848
query12	151	104	104	104
query13	1653	379	375	375
query14	10173	7710	7629	7629
query15	230	168	167	167
query16	7661	516	482	482
query17	1553	580	559	559
query18	1977	300	292	292
query19	189	143	149	143
query20	111	105	106	105
query21	212	101	101	101
query22	4566	4243	4170	4170
query23	33789	33074	33381	33074
query24	12066	2887	2857	2857
query25	688	400	391	391
query26	1816	161	160	160
query27	2821	276	275	275
query28	7665	2049	2063	2049
query29	1110	426	418	418
query30	303	154	145	145
query31	954	728	767	728
query32	98	55	57	55
query33	752	306	286	286
query34	908	466	483	466
query35	850	738	721	721
query36	1094	932	945	932
query37	227	80	81	80
query38	4005	3771	3831	3771
query39	1462	1407	1410	1407
query40	279	117	123	117
query41	47	46	51	46
query42	112	100	98	98
query43	496	464	443	443
query44	1262	746	726	726
query45	193	166	163	163
query46	1096	730	767	730
query47	1896	1797	1801	1797
query48	364	285	295	285
query49	1189	416	413	413
query50	808	415	399	399
query51	6800	6832	6718	6718
query52	101	89	93	89
query53	260	193	180	180
query54	835	449	442	442
query55	75	73	74	73
query56	277	245	254	245
query57	1177	1059	1062	1059
query58	235	229	259	229
query59	2929	2759	2760	2759
query60	306	274	277	274
query61	127	99	95	95
query62	849	652	635	635
query63	215	187	181	181
query64	10589	2280	1711	1711
query65	3206	3149	3147	3147
query66	1395	334	323	323
query67	15442	15197	15039	15039
query68	4925	544	545	544
query69	689	364	300	300
query70	1080	1139	1092	1092
query71	522	268	276	268
query72	7845	2271	2073	2073
query73	779	330	323	323
query74	9143	8643	8731	8643
query75	3893	2686	2702	2686
query76	3704	982	951	951
query77	672	310	306	306
query78	9699	8985	9140	8985
query79	3534	537	518	518
query80	2116	486	543	486
query81	586	224	221	221
query82	1186	133	139	133
query83	285	150	148	148
query84	278	83	77	77
query85	1630	278	273	273
query86	457	308	303	303
query87	4397	4200	4289	4200
query88	4443	2433	2425	2425
query89	429	295	287	287
query90	2017	198	196	196
query91	125	115	102	102
query92	65	51	50	50
query93	4958	542	540	540
query94	1065	296	259	259
query95	362	270	271	270
query96	669	271	268	268
query97	3227	3029	3017	3017
query98	224	201	199	199
query99	1543	1306	1268	1268
Total cold run time: 306028 ms
Total hot run time: 186090 ms

@doris-robot

ClickBench: Total hot run time: 31.16 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 75adeeb5aa8af74b8adc7f2cbb389890b5295c50, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.05	0.04	0.04
query2	0.08	0.04	0.05
query3	0.23	0.06	0.05
query4	1.66	0.08	0.07
query5	0.50	0.49	0.47
query6	1.14	0.74	0.74
query7	0.02	0.01	0.02
query8	0.04	0.04	0.04
query9	0.54	0.48	0.49
query10	0.54	0.54	0.54
query11	0.16	0.12	0.11
query12	0.15	0.12	0.12
query13	0.60	0.62	0.59
query14	0.77	0.79	0.77
query15	0.84	0.82	0.82
query16	0.36	0.38	0.37
query17	0.97	0.99	1.03
query18	0.23	0.21	0.22
query19	1.76	1.69	1.68
query20	0.01	0.02	0.01
query21	15.39	0.75	0.66
query22	4.48	6.81	2.39
query23	18.31	1.41	1.26
query24	2.13	0.22	0.22
query25	0.14	0.07	0.08
query26	0.30	0.21	0.22
query27	0.45	0.22	0.21
query28	13.28	1.02	1.00
query29	12.61	3.30	3.27
query30	0.23	0.04	0.05
query31	2.90	0.41	0.40
query32	3.26	0.49	0.49
query33	2.99	3.02	2.95
query34	17.26	4.38	4.37
query35	4.51	4.40	4.52
query36	0.66	0.48	0.48
query37	0.18	0.16	0.15
query38	0.16	0.15	0.15
query39	0.04	0.03	0.04
query40	0.17	0.12	0.14
query41	0.09	0.05	0.05
query42	0.07	0.05	0.05
query43	0.04	0.04	0.04
Total cold run time: 110.3 s
Total hot run time: 31.16 s

@liaoxin01 (Contributor) left a comment

LGTM

PR approved by at least one committer and no changes requested.

@github-actions bot added the approved (indicates a PR has been approved by one committer) and reviewed labels on Aug 13, 2024

PR approved by anyone and no changes requested.

@dataroaring (Contributor) left a comment

LGTM

@dataroaring merged commit d9c8ff0 into apache:master on Aug 14, 2024
30 of 31 checks passed
wyxxxcat pushed a commit to wyxxxcat/doris that referenced this pull request Aug 14, 2024
…ode (apache#39313)

dataroaring pushed a commit that referenced this pull request Aug 17, 2024
…ode (#39313)

dataroaring pushed a commit that referenced this pull request Jan 7, 2025
…46149)


In cloud mode, routine load loses data when the FE master node restarts.

PR #39313 was introduced so that, when progress is updated, smaller
values cannot overwrite larger ones. However, routine load replays
progress metadata by first obtaining the configured default offset and
then pulling metadata from the meta service to update the local value.
Because of that PR, if the metadata pulled from the meta service is not
larger than the configured default offset, the correct value cannot be
assigned in memory.

To solve this problem, pull the metadata from the meta service on
restart, and decide based on the pulled value whether to obtain the
default offset from Kafka.
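
The restart ordering described in this commit message can be sketched in Java as follows. This is an illustrative sketch under stated assumptions, not the code from the follow-up change: `RestartOffsetResolver` and `resolve` are hypothetical names, the `OptionalLong` stands in for whatever the meta service returns, and the `LongSupplier` stands in for the Kafka default-offset lookup. The sketch also assumes that "based on the pulled value" means "fall back to Kafka only when nothing was pulled", which the message does not spell out.

```java
import java.util.OptionalLong;
import java.util.function.LongSupplier;

// Hypothetical illustration only; names, types, and the fallback rule are assumptions.
class RestartOffsetResolver {
    // Consult the meta service result first. Only when it holds no progress
    // is the Kafka default offset fetched, so the default can no longer
    // shadow a smaller but correct persisted value.
    static long resolve(OptionalLong fromMetaService, LongSupplier kafkaDefaultOffset) {
        if (fromMetaService.isPresent()) {
            return fromMetaService.getAsLong();
        }
        return kafkaDefaultOffset.getAsLong();
    }
}
```

With this ordering, a persisted offset of 50 is restored as 50 even if the default offset would have been 1000, which is the case the monotonic rule alone would have rejected.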
github-actions bot pushed a commit that referenced this pull request Jan 7, 2025
…46149)


Labels: approved (indicates a PR has been approved by one committer), dev/3.0.2-merged, doing, reviewed
4 participants