Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix](cloud) dup FDCache reset before FileCache dtor causing crash #48915

Merged
merged 1 commit into from
Mar 13, 2025

Conversation

freemandealer
Copy link
Contributor

There is another FDCache reset near line 788, and this previous dup one deconstruct its mutex before BlockFileCache destory in FileCacheFactory thus causing BE crash on following stack:

SIGSEGV address not mapped to object (@0x68) received by PID 222738 (TID 222917 OR 0x7505dc800640) from PID 104; stack trace: *** 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /mnt/disk3/pipeline/repo/selectdb-core_selectdb-second-branch/be/src/common/signal_handler.h:421 1# PosixSignals::chained_handler(int, siginfo*, void*) [clone .part.0] in /home/qatest/jdk/java-17-openjdk-amd64/lib/server/libjvm.so 2# JVM_handle_linux_signal in /home/qatest/jdk/java-17-openjdk-amd64/lib/server/libjvm.so 3# 0x0000750666842520 in /lib/x86_64-linux-gnu/libc.so.6 4# __pthread_rwlock_wrlock at ./nptl/pthread_rwlock_wrlock.c:26 5# doris::io::FDCache::remove_file_reader(std::pair<doris::io::UInt128Wrapper, unsigned long> const&) at /mnt/disk3/pipeline/repo/selectdb-core_selectdb-second-branch/be/src/io/cache/fs_file_cache_storag e.cpp:86
6# doris::io::FSFileCacheStorage::remove(doris::io::FileCacheKey const&) at /mnt/disk3/pipeline/repo/selectdb-core_selectdb-second-branch/be/src/io/cache/fs_file_cache_storage.cpp:203

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

There is another FDCache reset near line 788, and this previous dup one
deconstruct its mutex before BlockFileCache destory in FileCacheFactory
thus causing BE crash on following stack:

SIGSEGV address not mapped to object (@0x68) received by PID 222738 (TID 222917 OR 0x7505dc800640) from PID 104; stack trace: ***
0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /mnt/disk3/pipeline/repo/selectdb-core_selectdb-second-branch/be/src/common/signal_handler.h:421
1# PosixSignals::chained_handler(int, siginfo*, void*) [clone .part.0] in /home/qatest/jdk/java-17-openjdk-amd64/lib/server/libjvm.so
2# JVM_handle_linux_signal in /home/qatest/jdk/java-17-openjdk-amd64/lib/server/libjvm.so
3# 0x0000750666842520 in /lib/x86_64-linux-gnu/libc.so.6
4# __pthread_rwlock_wrlock at ./nptl/pthread_rwlock_wrlock.c:26
5# doris::io::FDCache::remove_file_reader(std::pair<doris::io::UInt128Wrapper, unsigned long> const&) at /mnt/disk3/pipeline/repo/selectdb-core_selectdb-second-branch/be/src/io/cache/fs_file_cache_storag
e.cpp:86
6# doris::io::FSFileCacheStorage::remove(doris::io::FileCacheKey const&) at /mnt/disk3/pipeline/repo/selectdb-core_selectdb-second-branch/be/src/io/cache/fs_file_cache_storage.cpp:203

Signed-off-by: zhengyu <[email protected]>
@hello-stephen
Copy link
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@freemandealer
Copy link
Contributor Author

run buildall

@doris-robot
Copy link

TPC-H: Total hot run time: 32753 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 2cc53f2726591519f65d5c802a42fada2d198849, data reload: false

------ Round 1 ----------------------------------
q1	17617	5255	5093	5093
q2	2053	284	174	174
q3	10449	1246	725	725
q4	10210	1005	524	524
q5	7514	2386	2347	2347
q6	194	163	132	132
q7	903	758	615	615
q8	9307	1280	1105	1105
q9	4921	4923	4912	4912
q10	6839	2318	1904	1904
q11	473	277	255	255
q12	362	366	219	219
q13	17782	3681	3103	3103
q14	244	234	211	211
q15	536	481	482	481
q16	637	632	586	586
q17	578	879	361	361
q18	7111	6396	6315	6315
q19	2060	951	568	568
q20	317	321	192	192
q21	2738	2157	1972	1972
q22	1040	1009	959	959
Total cold run time: 103885 ms
Total hot run time: 32753 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5256	5134	5168	5134
q2	234	327	232	232
q3	2152	2686	2327	2327
q4	1482	1812	1393	1393
q5	4244	4148	4196	4148
q6	215	166	124	124
q7	1924	1956	1785	1785
q8	2640	2633	2590	2590
q9	7252	7320	7092	7092
q10	3053	3222	2792	2792
q11	564	503	485	485
q12	724	728	612	612
q13	3482	3970	3316	3316
q14	286	305	274	274
q15	525	474	461	461
q16	652	691	652	652
q17	1157	1628	1352	1352
q18	7740	7597	7395	7395
q19	831	810	964	810
q20	1958	2055	1878	1878
q21	5642	5043	4812	4812
q22	1055	1072	1007	1007
Total cold run time: 53068 ms
Total hot run time: 50671 ms

Copy link
Contributor

@dataroaring dataroaring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Mar 11, 2025
Copy link
Contributor

PR approved by anyone and no changes requested.

@doris-robot
Copy link

TPC-DS: Total hot run time: 192678 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 2cc53f2726591519f65d5c802a42fada2d198849, data reload: false

query1	1456	1005	981	981
query2	6244	1984	1951	1951
query3	11101	4644	4575	4575
query4	54334	25448	23463	23463
query5	5290	535	492	492
query6	420	203	189	189
query7	5413	498	294	294
query8	328	268	253	253
query9	7482	2651	2638	2638
query10	432	305	260	260
query11	15481	15062	14919	14919
query12	154	109	106	106
query13	1373	543	409	409
query14	10478	6516	6472	6472
query15	208	185	193	185
query16	7044	656	496	496
query17	1130	748	583	583
query18	1539	426	336	336
query19	205	197	162	162
query20	126	122	126	122
query21	216	138	109	109
query22	4558	4698	4523	4523
query23	33909	33313	33594	33313
query24	5807	2440	2418	2418
query25	473	455	400	400
query26	743	283	155	155
query27	1856	489	334	334
query28	2797	2456	2500	2456
query29	570	553	412	412
query30	272	229	186	186
query31	902	881	815	815
query32	74	63	59	59
query33	464	361	298	298
query34	752	866	512	512
query35	831	856	789	789
query36	970	1022	920	920
query37	117	101	77	77
query38	4193	4191	4389	4191
query39	1502	1528	1467	1467
query40	203	113	103	103
query41	51	49	51	49
query42	124	103	100	100
query43	509	525	519	519
query44	1308	796	795	795
query45	177	172	162	162
query46	842	1031	649	649
query47	1887	1892	1811	1811
query48	393	415	320	320
query49	689	524	420	420
query50	721	759	421	421
query51	4254	4298	4217	4217
query52	105	102	98	98
query53	244	264	194	194
query54	501	503	436	436
query55	77	81	80	80
query56	304	268	267	267
query57	1150	1197	1099	1099
query58	255	231	236	231
query59	2925	3031	2862	2862
query60	300	277	271	271
query61	157	122	119	119
query62	738	742	729	729
query63	229	193	197	193
query64	1948	1026	702	702
query65	4518	4308	4331	4308
query66	735	397	291	291
query67	15714	15602	15231	15231
query68	7117	871	505	505
query69	542	297	263	263
query70	1201	1161	1071	1071
query71	497	285	265	265
query72	5851	3675	3830	3675
query73	1113	751	341	341
query74	9110	9178	8899	8899
query75	3704	3210	2703	2703
query76	4374	1198	748	748
query77	623	366	275	275
query78	10090	10072	9354	9354
query79	2589	834	596	596
query80	671	530	438	438
query81	482	252	219	219
query82	452	126	95	95
query83	290	164	150	150
query84	309	93	69	69
query85	790	470	308	308
query86	377	304	290	290
query87	4453	4467	4376	4376
query88	3358	2279	2243	2243
query89	407	323	285	285
query90	1965	207	210	207
query91	140	136	110	110
query92	79	59	56	56
query93	1238	1068	588	588
query94	689	426	302	302
query95	347	269	253	253
query96	491	570	274	274
query97	3336	3402	3275	3275
query98	223	236	207	207
query99	1447	1406	1241	1241
Total cold run time: 300657 ms
Total hot run time: 192678 ms

@doris-robot
Copy link

ClickBench: Total hot run time: 31.53 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 2cc53f2726591519f65d5c802a42fada2d198849, data reload: false

query1	0.03	0.03	0.02
query2	0.08	0.03	0.04
query3	0.23	0.06	0.06
query4	1.63	0.11	0.11
query5	0.55	0.53	0.53
query6	1.20	0.72	0.72
query7	0.02	0.01	0.01
query8	0.04	0.03	0.03
query9	0.59	0.52	0.52
query10	0.57	0.62	0.57
query11	0.15	0.11	0.10
query12	0.15	0.11	0.11
query13	0.62	0.60	0.60
query14	2.73	2.84	2.85
query15	0.92	0.85	0.85
query16	0.38	0.38	0.39
query17	1.03	1.05	1.07
query18	0.22	0.20	0.20
query19	1.85	1.82	1.98
query20	0.01	0.01	0.02
query21	15.37	0.88	0.55
query22	0.77	1.29	0.96
query23	14.69	1.41	0.59
query24	6.92	1.62	1.04
query25	0.52	0.25	0.12
query26	0.48	0.16	0.13
query27	0.04	0.05	0.05
query28	10.78	0.88	0.43
query29	12.59	4.05	3.34
query30	0.25	0.08	0.06
query31	2.83	0.57	0.38
query32	3.22	0.55	0.46
query33	2.99	3.01	3.03
query34	15.66	5.15	4.48
query35	4.61	4.56	4.58
query36	0.67	0.50	0.50
query37	0.09	0.06	0.06
query38	0.05	0.04	0.03
query39	0.02	0.02	0.02
query40	0.17	0.13	0.13
query41	0.09	0.02	0.02
query42	0.04	0.02	0.02
query43	0.03	0.03	0.03
Total cold run time: 105.88 s
Total hot run time: 31.53 s

@doris-robot
Copy link

BE UT Coverage Report

Increment line coverage 0.00% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 46.89% (12539/26744)
Line Coverage 36.60% (107181/292824)
Region Coverage 35.65% (54790/153682)
Branch Coverage 31.06% (27610/88894)

@gavinchou gavinchou merged commit 1062a76 into apache:master Mar 13, 2025
27 of 29 checks passed
github-actions bot pushed a commit that referenced this pull request Mar 13, 2025
…48915)

There is another FDCache reset near line 788, and this previous dup one
deconstruct its mutex before BlockFileCache destory in FileCacheFactory
thus causing BE crash on following stack:

SIGSEGV address not mapped to object (@0x68) received by PID 222738 (TID
222917 OR 0x7505dc800640) from PID 104; stack trace: *** 0#
doris::signal::(anonymous namespace)::FailureSignalHandler(int,
siginfo_t*, void*) at
/mnt/disk3/pipeline/repo/selectdb-core_selectdb-second-branch/be/src/common/signal_handler.h:421
1# PosixSignals::chained_handler(int, siginfo*, void*) [clone .part.0]
in /home/qatest/jdk/java-17-openjdk-amd64/lib/server/libjvm.so 2#
JVM_handle_linux_signal in
/home/qatest/jdk/java-17-openjdk-amd64/lib/server/libjvm.so 3#
0x0000750666842520 in /lib/x86_64-linux-gnu/libc.so.6 4#
__pthread_rwlock_wrlock at ./nptl/pthread_rwlock_wrlock.c:26 5#
doris::io::FDCache::remove_file_reader(std::pair<doris::io::UInt128Wrapper,
unsigned long> const&) at
/mnt/disk3/pipeline/repo/selectdb-core_selectdb-second-branch/be/src/io/cache/fs_file_cache_storag
e.cpp:86
6# doris::io::FSFileCacheStorage::remove(doris::io::FileCacheKey const&)
at
/mnt/disk3/pipeline/repo/selectdb-core_selectdb-second-branch/be/src/io/cache/fs_file_cache_storage.cpp:203
dataroaring pushed a commit that referenced this pull request Mar 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by one committer. dev/3.0.5-merged p0_c reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants