DAOS-14845 object: retry migration for retriable failure #13590

wangdi1 · 2024-01-11T00:12:50Z

To avoid a rebuild retry and reclaim cycle, retry the migration on retriable failures until the pool map changes again. If the pool map does change, the current rebuild should fail, and the subsequent rebuild will resolve the failure.

Required-githooks: true
Features: rebuild

Before requesting gatekeeper:

Two review approvals and any prior change requests have been resolved.
Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Commit messages follows the guidelines outlined here.
Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

github-actions · 2024-01-11T00:13:11Z

Bug-tracker data:
Ticket title is 'timeout for mdtest after killing one rank FTEST_erasurecode.EcodOnlineRebuildMdtest.1-./erasurecode/online_rebuild_mdtest.py'
Status is 'In Review'
https://daosio.atlassian.net/browse/DAOS-14845

daosbuild1 · 2024-01-11T07:50:19Z

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13590/1/execution/node/1180/log

To avoid a rebuild retry and reclaim cycle, retry the migration on retriable failures until the pool map changes again. If the pool map does change, the current rebuild should fail, and the subsequent rebuild will resolve the failure.

Various fixes for rebuild when the PS leader keeps changing during rebuild.

Move migrate max ULT control to migrate_obj_iter_cb() to make sure the max ULT count will not exceed the setting.

Change the yield freq from 128 to 16 to make sure the object

Required-githooks: true
Features: ec rebuild
Signed-off-by: Di Wang <[email protected]>

github-actions · 2024-02-21T02:04:30Z

Bug-tracker data:
Ticket title is 'timeout for mdtest after killing one rank FTEST_erasurecode.EcodOnlineRebuildMdtest.1-./erasurecode/online_rebuild_mdtest.py'
Status is 'Resolved'
Labels: 'ci_impact,pr_test,release/2.4'
https://daosio.atlassian.net/browse/DAOS-14845

Add max ULT control for all targets on the xstream, so the objects being migrated cannot exceed MIGRATE_MAX_ULT.

Add per-target max ULT control, so the migrate ULTs of each target cannot exceed MIGRATE_MAX_ULT/dss_tgt_nr.

Add migrate_cont_open to avoid calling dsc_cont_open and dsc_pool_open for each object and dkey migration.

Some minor fixes for migration.

Required-githooks: true
Signed-off-by: Di Wang <[email protected]>
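
The per-target budget described in this commit message can be sketched roughly as follows. This is only an illustration under assumptions: the `sketch_` names, the value 512 for the global limit and the value 8 for the target count are hypothetical, and the check-then-increment step assumes cooperative scheduling on one xstream rather than preemptive threads.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical values for illustration only; the real MIGRATE_MAX_ULT
 * and dss_tgt_nr come from the DAOS engine, not from this sketch. */
#define SKETCH_MIGRATE_MAX_ULT	512u
#define SKETCH_TGT_NR		8u

struct sketch_tls {
	atomic_uint obj_ult_cnts[SKETCH_TGT_NR];	/* object ULTs per target */
	atomic_uint dkey_ult_cnts[SKETCH_TGT_NR];	/* dkey ULTs per target */
};

/* Each target gets an equal share of the global ULT budget. */
static unsigned int
sketch_per_tgt_limit(void)
{
	return SKETCH_MIGRATE_MAX_ULT / SKETCH_TGT_NR;
}

/* In-flight ULTs of one target: object ULTs plus dkey ULTs. */
static unsigned int
sketch_tgt_inflight(struct sketch_tls *tls, unsigned int tgt_idx)
{
	return atomic_load(&tls->obj_ult_cnts[tgt_idx]) +
	       atomic_load(&tls->dkey_ult_cnts[tgt_idx]);
}

/* Take one object-ULT slot if the target is under its budget.
 * The check and the increment are two separate steps, so this is only
 * safe when callers for one target do not run concurrently. */
static bool
sketch_try_start_obj_ult(struct sketch_tls *tls, unsigned int tgt_idx)
{
	if (sketch_tgt_inflight(tls, tgt_idx) >= sketch_per_tgt_limit())
		return false;
	atomic_fetch_add(&tls->obj_ult_cnts[tgt_idx], 1);
	return true;
}

/* Give the slot back when the object ULT finishes. */
static void
sketch_finish_obj_ult(struct sketch_tls *tls, unsigned int tgt_idx)
{
	assert(atomic_load(&tls->obj_ult_cnts[tgt_idx]) > 0);
	atomic_fetch_sub(&tls->obj_ult_cnts[tgt_idx], 1);
}
```

With these assumed numbers each target may run at most 64 migrate ULTs; a caller that gets `false` would wait and try again once another ULT of the same target finishes.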

github-actions · 2024-02-22T05:29:10Z

Bug-tracker data:
Ticket title is 'timeout for mdtest after killing one rank FTEST_erasurecode.EcodOnlineRebuildMdtest.1-./erasurecode/online_rebuild_mdtest.py'
Status is 'Resolved'
Labels: 'ci_impact,pr_test,release/2.4'
https://daosio.atlassian.net/browse/DAOS-14845

daosbuild1 · 2024-02-22T21:35:31Z

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13590/3/execution/node/1452/log

Update migrate max ULT control

Required-githooks: true
Signed-off-by: Di Wang <[email protected]>

github-actions · 2024-02-23T04:07:39Z

Ticket title is 'timeout for mdtest after killing one rank FTEST_erasurecode.EcodOnlineRebuildMdtest.1-./erasurecode/online_rebuild_mdtest.py'
Status is 'Resolved'
Labels: 'ci_impact,pr_test,release/2.4'
https://daosio.atlassian.net/browse/DAOS-14845

fix typo

Features: rebuild
Required-githooks: true
Signed-off-by: Di Wang <[email protected]>

daosbuild1 · 2024-02-23T13:09:41Z

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13590/5/execution/node/1199/log

Required-githooks: true

fix the segfault.

Required-githooks: true
Signed-off-by: Di Wang <[email protected]>

daosbuild1 · 2024-02-24T00:03:05Z

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13590/6/execution/node/1173/log

daosbuild1 · 2024-02-25T15:55:30Z

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-13590/7/display/redirect

daosbuild1 · 2024-02-26T01:52:47Z

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13590/7/execution/node/1531/log

NiuYawei · 2024-03-06T02:58:00Z

src/object/srv_obj_migrate.c

+
+		tgt_cnt = atomic_load(&tls->mpt_obj_ult_cnts[tgt_idx]) +
+			  atomic_load(&tls->mpt_dkey_ult_cnts[tgt_idx]);
+	}


This kind of "while (check_condition) { wait }" loop is quite inefficient when there is a large number of waiters.

Do you know how many waiters there could be? If there are at most tens of waiters, I think we can live with it. If there could be hundreds or thousands, the unnecessary "wakeup -> recheck -> go back to wait" cycle for most waiters will hurt performance badly; to make it worse, these atomic operations will generate a lot of contention.
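
To put rough numbers on that concern, here is a small counting model. It is a sketch under assumptions, not a measurement of the PR: it assumes every release frees exactly one slot, and it compares waking all waiters on each release against waking a single waiter. The function names are hypothetical.

```c
#include <stdint.h>

/* Wake-all model: on every release all remaining waiters wake up and
 * recheck the condition, one of them proceeds, the rest wait again. */
static uint64_t
sketch_wakeups_wake_all(uint64_t waiters)
{
	uint64_t total = 0;

	while (waiters > 0) {
		total += waiters;	/* everyone still waiting rechecks */
		waiters--;		/* exactly one waiter gets the slot */
	}
	return total;
}

/* Wake-one model: each release wakes exactly the waiter that proceeds. */
static uint64_t
sketch_wakeups_wake_one(uint64_t waiters)
{
	return waiters;
}
```

Under this model 10 waiters cost 55 wakeups instead of 10, while 1000 waiters cost 500500 instead of 1000, which is why the number of waiters matters for whether the simple loop is acceptable.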

liuxuezhao · 2024-03-06T09:34:08Z

src/object/srv_obj_migrate.c

+
+		tgt_cnt = atomic_load(&tls->mpt_obj_ult_cnts[tgt_idx]) +
+			  atomic_load(&tls->mpt_dkey_ult_cnts[tgt_idx]);
+	}


It seems there is at most one waiter per tgt per pool, @wangdi1?

liuxuezhao · 2024-03-06T09:35:47Z

src/object/srv_obj_migrate.c

+	rc = dsc_obj_fetch(oh, eph, &mrone->mo_dkey, iod_num, iods, sgls,
+			   NULL, flags, NULL, csum_iov_fetch);
+	if (rc == -DER_TIMEDOUT &&
+	    tls->mpt_version + 1 >= tls->mpt_pool->spc_map_version) {


Right, it seems impossible for "tls->mpt_version + 1 > tls->mpt_pool->spc_map_version" to hold, but maybe I am wrong.
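
For reference, the condition in the hunk above can be read as a small predicate. This is a sketch under assumptions: the helper name and the unsigned 32-bit version type are hypothetical, `SKETCH_DER_TIMEDOUT` is a stand-in for the real DER_TIMEDOUT from the DAOS headers, and the hunk does not show what the branch body does (the PR description suggests it retries the fetch).

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for illustration; real code should use DER_TIMEDOUT from
 * the DAOS headers. */
#define SKETCH_DER_TIMEDOUT	1011

/* True when the fetch timed out and the pool map version has not moved
 * more than one step past the version this migration was started for. */
static bool
sketch_should_retry_fetch(int rc, uint32_t mpt_version,
			  uint32_t pool_map_version)
{
	return rc == -SKETCH_DER_TIMEDOUT &&
	       mpt_version + 1 >= pool_map_version;
}
```

For a migration started at version 5, a timeout is retried while the pool map is at version 5 or 6 and is no longer retried at version 7; any other error code is not retried by this check.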

daosbuild1 · 2024-03-06T20:14:40Z

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-13590/13/display/redirect

wangdi1 · 2024-03-06T22:38:07Z

The job failed due to an environment issue and needs to be re-triggered.

Merge branch 'master' into wangdi/rebuild_timeout

daosbuild1 · 2024-03-07T07:40:46Z

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13590/14/execution/node/1175/log

daosbuild1 · 2024-03-08T01:39:05Z

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13590/15/execution/node/1198/log

Features: rebuild Merge branch 'master' into wangdi/rebuild_timeout

cdavis28 · 2024-03-11T17:31:30Z

@wangdi1 I know time is limited. Will we get this soon?

daosbuild1 · 2024-03-11T21:45:39Z

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13590/16/execution/node/1405/log

jolivier23 · 2024-03-11T22:25:41Z

> @wangdi1 I know time is limited. Will we get this soon?

@cdavis28 looks like his latest was run with Allow-unstable-test: true. That should at least help to get full results. I haven't been able to get my simple patch through CI successfully in the last week.

Merge branch 'master' into wangdi/rebuild_timeout

daosbuild1 · 2024-03-14T18:42:49Z

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13590/17/execution/node/1405/log

daosbuild1 · 2024-03-15T04:17:27Z

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13590/17/execution/node/1549/log

wangdi1 · 2024-03-15T06:06:37Z

failure due to DAOS-15124 and DAOS-15127.

daosbuild1 · 2024-03-15T09:17:11Z

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13590/17/execution/node/1528/log

kccain · 2024-03-15T11:16:56Z

> failure due to DAOS-15124 and DAOS-15127.

also, the rebuild/container_create_race.py test failure is known, https://daosio.atlassian.net/browse/DAOS-15002

jolivier23 · 2024-03-15T14:16:18Z

DAOS-15127 is marked as resolved

To avoid a rebuild retry and reclaim cycle, retry the migration on retriable failures until the pool map changes again. If the pool map does change, the current rebuild should fail, and the subsequent rebuild will resolve the failure.

Various fixes for rebuild when the PS leader keeps changing during rebuild.

Move migrate max ULT control to migrate_obj_iter_cb() to make sure the max ULT count will not exceed the setting.

Change the yield freq from 128 to 16 to make sure the object

Optimize migrate memory usage
- Add max ULT control for all targets on the xstream, so the objects being migrated cannot exceed MIGRATE_MAX_ULT.
- Add per-target max ULT control, so the migrate ULTs of each target cannot exceed MIGRATE_MAX_ULT/dss_tgt_nr.
- Add migrate_cont_open to avoid calling dsc_cont_open and dsc_pool_open for each object and dkey migration.

Features: rebuild
Required-githooks: true
Signed-off-by: Di Wang <[email protected]>

To avoid a rebuild retry and reclaim cycle, retry the migration on retriable failures until the pool map changes again. If the pool map does change, the current rebuild should fail, and the subsequent rebuild will resolve the failure.

Various fixes for rebuild when the PS leader keeps changing during rebuild.

Move migrate max ULT control to migrate_obj_iter_cb() to make sure the max ULT count will not exceed the setting.

Change the yield freq from 128 to 16 to make sure the object

Optimize migrate memory usage
- Add max ULT control for all targets on the xstream, so the objects being migrated cannot exceed MIGRATE_MAX_ULT.
- Add per-target max ULT control, so the migrate ULTs of each target cannot exceed MIGRATE_MAX_ULT/dss_tgt_nr.
- Add migrate_cont_open to avoid calling dsc_cont_open and dsc_pool_open for each object and dkey migration.

Test-tag: ec_offline_rebuild
Test-repeat: 3
Required-githooks: true
Signed-off-by: Di Wang <[email protected]>

…3993)

To avoid a rebuild retry and reclaim cycle, retry the migration on retriable failures until the pool map changes again. If the pool map does change, the current rebuild should fail, and the subsequent rebuild will resolve the failure.

Various fixes for rebuild when the PS leader keeps changing during rebuild.

Move migrate max ULT control to migrate_obj_iter_cb() to make sure the max ULT count will not exceed the setting.

Change the yield freq from 128 to 16 to make sure the object

Optimize migrate memory usage
- Add max ULT control for all targets on the xstream, so the objects being migrated cannot exceed MIGRATE_MAX_ULT.
- Add per-target max ULT control, so the migrate ULTs of each target cannot exceed MIGRATE_MAX_ULT/dss_tgt_nr.
- Add migrate_cont_open to avoid calling dsc_cont_open and dsc_pool_open for each object and dkey migration.
- DAOS-14845 object: fix a bug with mpt_inflight_size. The size of a single dkey migration can exceed mpt_inflight_max_size; in that case the original code could leave the dkey migrate ULT looping forever, so the rebuild could not complete. Example log: "migrate_one_ult() mrone 0x7f3c91fe1ec0 wait start 0/33554432". In that case the ULT waits again after every wakeup until shutdown.

Signed-off-by: Di Wang <[email protected]>
Signed-off-by: Xuezhao Liu <[email protected]>
Co-authored-by: Xuezhao Liu <[email protected]>
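
The mpt_inflight_size problem in the last item can be illustrated with a pair of admission checks. This is a sketch under assumptions, not the change that landed: the function names are hypothetical, and letting a request through whenever nothing is in flight is just one plausible guard against the stall shown in the example log.

```c
#include <stdbool.h>
#include <stdint.h>

/* Naive check: a request larger than max can never be admitted, even
 * when nothing is in flight, so its ULT would wait forever. */
static bool
sketch_can_start_naive(uint64_t inflight, uint64_t size, uint64_t max)
{
	return inflight <= max && size <= max - inflight;
}

/* Guarded check (assumed fix): always admit a request when nothing is
 * in flight, so an oversized dkey migration can still make progress. */
static bool
sketch_can_start_guarded(uint64_t inflight, uint64_t size, uint64_t max)
{
	if (inflight == 0)
		return true;
	return sketch_can_start_naive(inflight, size, max);
}
```

With max set to 33554432 bytes (32 MiB, the value in the example log) and nothing in flight, a 40 MiB request is rejected by the naive check on every wakeup but admitted by the guarded one; once other data is in flight, both checks make it wait.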

* DAOS-14845 object: retry migration for retriable failure

To avoid a rebuild retry and reclaim cycle, retry the migration on retriable failures until the pool map changes again. If the pool map does change, the current rebuild should fail, and the subsequent rebuild will resolve the failure.

Various fixes for rebuild when the PS leader keeps changing during rebuild.

Move migrate max ULT control to migrate_obj_iter_cb() to make sure the max ULT count will not exceed the setting.

Change the yield freq from 128 to 16 to make sure the object

Optimize migrate memory usage
- Add max ULT control for all targets on the xstream, so the objects being migrated cannot exceed MIGRATE_MAX_ULT.
- Add per-target max ULT control, so the migrate ULTs of each target cannot exceed MIGRATE_MAX_ULT/dss_tgt_nr.
- Add migrate_cont_open to avoid calling dsc_cont_open and dsc_pool_open for each object and dkey migration.

Change-Id: I3b426542f6a5b196fc0e7cabb680d4ff9b1db65c
Signed-off-by: Di Wang <[email protected]>

wangdi1 requested a review from liuxuezhao January 11, 2024 00:12

wangdi1 requested review from jolivier23 and gnailzenh January 11, 2024 00:18

liuxuezhao previously approved these changes Jan 15, 2024

jolivier23 previously approved these changes Jan 19, 2024

wangdi1 dismissed stale reviews from jolivier23 and liuxuezhao via 39d2656 February 21, 2024 02:04

wangdi1 force-pushed the wangdi/rebuild_timeout branch from 6dc07c4 to 39d2656 Compare February 21, 2024 02:04

wangdi1 requested review from a team as code owners February 21, 2024 02:04

jolivier23 previously approved these changes Feb 21, 2024

wangdi1 dismissed jolivier23’s stale review via ee449fa February 22, 2024 05:28

DAOS-14845 object: update migrate max ult size control

2164a67

Update migrate max ULT control

Required-githooks: true
Signed-off-by: Di Wang <[email protected]>

DAOS-14845 object: fix typo

5608b58

fix typo

Features: rebuild
Required-githooks: true
Signed-off-by: Di Wang <[email protected]>

wangdi1 requested review from jolivier23 and liuxuezhao February 23, 2024 04:38

wangdi1 added 2 commits February 23, 2024 19:05

Merge branch 'master' into wangdi/rebuild_timeout

799fb82

Required-githooks: true

DAOS-14845 rebuild: fix the segfault

ebb0f24

fix the segfault.

Required-githooks: true
Signed-off-by: Di Wang <[email protected]>

NiuYawei reviewed Mar 6, 2024

wangdi1 requested a review from liuxuezhao March 6, 2024 03:49

liuxuezhao approved these changes Mar 6, 2024

Features: rebuild

761ce52

Merge branch 'master' into wangdi/rebuild_timeout

Allow-unstable-test: true

6bad46e

Features: rebuild Merge branch 'master' into wangdi/rebuild_timeout

gnailzenh approved these changes Mar 12, 2024

Features: rebuild

e97101f

Merge branch 'master' into wangdi/rebuild_timeout

gnailzenh merged commit 5656098 into master Mar 15, 2024
44 of 49 checks passed

gnailzenh deleted the wangdi/rebuild_timeout branch March 15, 2024 14:38
