Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-14845 object: retry migration for retriable failure #13590

Merged
merged 14 commits into from
Mar 15, 2024

Conversation

wangdi1
Copy link
Contributor

@wangdi1 wangdi1 commented Jan 11, 2024

To avoid retry rebuild and reclaim, let's retry rebuild until further pool map changes, in that case, it should fail the current rebuild, and further rebuild will resolve the failure.

Required-githooks: true
Features: rebuild

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follows the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate watchers.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

@wangdi1 wangdi1 requested a review from liuxuezhao January 11, 2024 00:12
Copy link

Bug-tracker data:
Ticket title is 'timeout for mdtest after killing one rank FTEST_erasurecode.EcodOnlineRebuildMdtest.1-./erasurecode/online_rebuild_mdtest.py'
Status is 'In Review'
https://daosio.atlassian.net/browse/DAOS-14845

@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13590/1/execution/node/1180/log

liuxuezhao
liuxuezhao previously approved these changes Jan 15, 2024
jolivier23
jolivier23 previously approved these changes Jan 19, 2024
To avoid retry rebuild and reclaim, let's retry rebuild
until further pool map changes, in that case, it should
fail the current rebuild, and further rebuild will resolve
the failure.

various fixs about rebuild if PS leader keeps changing
during rebuild.

Move migrate max ULT control to migrate_obj_iter_cb() to make
sure max ULT count will not exceed the setting.

Change the yield freq from 128 to 16 to make sure the object

Required-githooks: true
Features: ec rebuild
Signed-off-by: Di Wang <[email protected]>
@wangdi1 wangdi1 dismissed stale reviews from jolivier23 and liuxuezhao via 39d2656 February 21, 2024 02:04
@wangdi1 wangdi1 force-pushed the wangdi/rebuild_timeout branch from 6dc07c4 to 39d2656 Compare February 21, 2024 02:04
@wangdi1 wangdi1 requested review from a team as code owners February 21, 2024 02:04
Copy link

Bug-tracker data:
Ticket title is 'timeout for mdtest after killing one rank FTEST_erasurecode.EcodOnlineRebuildMdtest.1-./erasurecode/online_rebuild_mdtest.py'
Status is 'Resolved'
Labels: 'ci_impact,pr_test,release/2.4'
https://daosio.atlassian.net/browse/DAOS-14845

jolivier23
jolivier23 previously approved these changes Feb 21, 2024
Add max ULT control for all targets on xstream, so
the object being migrated can not exceed MIGRATE_MAX_ULT.

Add each target max ULT control, so each target migrate
ULT can not exceed MIGRATE_MAX_ULT/dss_tgt_nr.

Add migrate_cont_open to avoid dsc_cont_open and dsc_pool_open
for each object and dkey migration.

Some minor fixes for migration.

Required-githooks: true

Signed-off-by: Di Wang <[email protected]>
Copy link

Bug-tracker data:
Ticket title is 'timeout for mdtest after killing one rank FTEST_erasurecode.EcodOnlineRebuildMdtest.1-./erasurecode/online_rebuild_mdtest.py'
Status is 'Resolved'
Labels: 'ci_impact,pr_test,release/2.4'
https://daosio.atlassian.net/browse/DAOS-14845

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13590/3/execution/node/1452/log

Update migrate max ULT control

Required-githooks: true

Signed-off-by: Di Wang <[email protected]>
Copy link

Ticket title is 'timeout for mdtest after killing one rank FTEST_erasurecode.EcodOnlineRebuildMdtest.1-./erasurecode/online_rebuild_mdtest.py'
Status is 'Resolved'
Labels: 'ci_impact,pr_test,release/2.4'
https://daosio.atlassian.net/browse/DAOS-14845

fix typo

Features: rebuild
Required-githooks: true
Signed-off-by: Di Wang <[email protected]>
@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13590/5/execution/node/1199/log

fix the segfault.

Required-githooks: true

Signed-off-by: Di Wang <[email protected]>
@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13590/6/execution/node/1173/log

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-13590/7/display/redirect

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13590/7/execution/node/1531/log


tgt_cnt = atomic_load(&tls->mpt_obj_ult_cnts[tgt_idx]) +
atomic_load(&tls->mpt_dkey_ult_cnts[tgt_idx]);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This kind of "while (check_condition) { wait }" is quite inefficient where there are large amount of waiters.

Do you know how many waiters there could be? If there are at most tens of waiters, I think we can live with it; If there could be hundreds or thousands of waiters, the overhead of the unnecessary cycle of "wakeup -> recheck -> go back to wait" for most waiters will kill the performance badly, to make it worse, lots of contention will be generated by these atomic operations.

Copy link
Contributor

@liuxuezhao liuxuezhao Mar 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems at most one waiter per tgt per pool @wangdi1 ?

@wangdi1 wangdi1 requested a review from liuxuezhao March 6, 2024 03:49

tgt_cnt = atomic_load(&tls->mpt_obj_ult_cnts[tgt_idx]) +
atomic_load(&tls->mpt_dkey_ult_cnts[tgt_idx]);
}
Copy link
Contributor

@liuxuezhao liuxuezhao Mar 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems at most one waiter per tgt per pool @wangdi1 ?

rc = dsc_obj_fetch(oh, eph, &mrone->mo_dkey, iod_num, iods, sgls,
NULL, flags, NULL, csum_iov_fetch);
if (rc == -DER_TIMEDOUT &&
tls->mpt_version + 1 >= tls->mpt_pool->spc_map_version) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, seems impossible to be "tls->mpt_version + 1 > tls->mpt_pool->spc_map_version", maybe I am wrong.

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-13590/13/display/redirect

@wangdi1
Copy link
Contributor Author

wangdi1 commented Mar 6, 2024

job failed due to env issue. needs re-trigger.

Merge branch 'master' into wangdi/rebuild_timeout
@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13590/14/execution/node/1175/log

@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13590/15/execution/node/1198/log

Features: rebuild
Merge branch 'master' into wangdi/rebuild_timeout
@cdavis28
Copy link
Contributor

@wangdi1 I know time is limited. Will we get this soon?

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13590/16/execution/node/1405/log

@jolivier23
Copy link
Contributor

@wangdi1 I know time is limited. Will we get this soon?

@cdavis28 looks like his latest was run with Allow-unstable-test: true. That should at least help to get full results. I haven't been able to get my simple patch through CI successfully in the last week.

Merge branch 'master' into wangdi/rebuild_timeout
@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13590/17/execution/node/1405/log

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13590/17/execution/node/1549/log

@wangdi1
Copy link
Contributor Author

wangdi1 commented Mar 15, 2024

failure due to DAOS-15124 and DAOS-15127.

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13590/17/execution/node/1528/log

@kccain
Copy link
Contributor

kccain commented Mar 15, 2024

failure due to DAOS-15124 and DAOS-15127.

also, the rebuild/container_create_race.py test failure is known, https://daosio.atlassian.net/browse/DAOS-15002

@jolivier23
Copy link
Contributor

DAOS-15127 is marked as resolved

@gnailzenh gnailzenh merged commit 5656098 into master Mar 15, 2024
44 of 49 checks passed
@gnailzenh gnailzenh deleted the wangdi/rebuild_timeout branch March 15, 2024 14:38
wangdi1 added a commit that referenced this pull request Mar 15, 2024
To avoid retry rebuild and reclaim, let's retry rebuild
until further pool map changes, in that case, it should
fail the current rebuild, and further rebuild will resolve
the failure.

various fixs about rebuild if PS leader keeps changing
during rebuild.

Move migrate max ULT control to migrate_obj_iter_cb() to make
sure max ULT count will not exceed the setting.

Change the yield freq from 128 to 16 to make sure the object

Optimize migrate memory usage
- Add max ULT control for all targets on xstream, so
  the object being migrated can not exceed MIGRATE_MAX_ULT.

- Add each target max ULT control, so each target migrate
   ULT can not exceed MIGRATE_MAX_ULT/dss_tgt_nr.

-  Add migrate_cont_open to avoid dsc_cont_open and dsc_pool_open
   for each object and dkey migration.

Features: rebuild
Required-githooks: true

Signed-off-by: Di Wang <[email protected]>
liuxuezhao pushed a commit that referenced this pull request Mar 19, 2024
To avoid retry rebuild and reclaim, let's retry rebuild
until further pool map changes, in that case, it should
fail the current rebuild, and further rebuild will resolve
the failure.

various fixs about rebuild if PS leader keeps changing
during rebuild.

Move migrate max ULT control to migrate_obj_iter_cb() to make
sure max ULT count will not exceed the setting.

Change the yield freq from 128 to 16 to make sure the object

Optimize migrate memory usage
- Add max ULT control for all targets on xstream, so
  the object being migrated can not exceed MIGRATE_MAX_ULT.

- Add each target max ULT control, so each target migrate
   ULT can not exceed MIGRATE_MAX_ULT/dss_tgt_nr.

-  Add migrate_cont_open to avoid dsc_cont_open and dsc_pool_open
   for each object and dkey migration.

Features: rebuild
Required-githooks: true

Signed-off-by: Di Wang <[email protected]>
liuxuezhao pushed a commit that referenced this pull request Mar 19, 2024
To avoid retry rebuild and reclaim, let's retry rebuild
until further pool map changes, in that case, it should
fail the current rebuild, and further rebuild will resolve
the failure.

various fixs about rebuild if PS leader keeps changing
during rebuild.

Move migrate max ULT control to migrate_obj_iter_cb() to make
sure max ULT count will not exceed the setting.

Change the yield freq from 128 to 16 to make sure the object

Optimize migrate memory usage
- Add max ULT control for all targets on xstream, so
  the object being migrated can not exceed MIGRATE_MAX_ULT.

- Add each target max ULT control, so each target migrate
   ULT can not exceed MIGRATE_MAX_ULT/dss_tgt_nr.

-  Add migrate_cont_open to avoid dsc_cont_open and dsc_pool_open
   for each object and dkey migration.

Features: rebuild
Required-githooks: true

Signed-off-by: Di Wang <[email protected]>
liuxuezhao pushed a commit that referenced this pull request Mar 20, 2024
To avoid retry rebuild and reclaim, let's retry rebuild
until further pool map changes, in that case, it should
fail the current rebuild, and further rebuild will resolve
the failure.

various fixs about rebuild if PS leader keeps changing
during rebuild.

Move migrate max ULT control to migrate_obj_iter_cb() to make
sure max ULT count will not exceed the setting.

Change the yield freq from 128 to 16 to make sure the object

Optimize migrate memory usage
- Add max ULT control for all targets on xstream, so
  the object being migrated can not exceed MIGRATE_MAX_ULT.

- Add each target max ULT control, so each target migrate
   ULT can not exceed MIGRATE_MAX_ULT/dss_tgt_nr.

-  Add migrate_cont_open to avoid dsc_cont_open and dsc_pool_open
   for each object and dkey migration.

Features: rebuild
Required-githooks: true

Signed-off-by: Di Wang <[email protected]>
liuxuezhao pushed a commit that referenced this pull request Mar 22, 2024
To avoid retry rebuild and reclaim, let's retry rebuild
until further pool map changes, in that case, it should
fail the current rebuild, and further rebuild will resolve
the failure.

various fixs about rebuild if PS leader keeps changing
during rebuild.

Move migrate max ULT control to migrate_obj_iter_cb() to make
sure max ULT count will not exceed the setting.

Change the yield freq from 128 to 16 to make sure the object

Optimize migrate memory usage
- Add max ULT control for all targets on xstream, so
  the object being migrated can not exceed MIGRATE_MAX_ULT.

- Add each target max ULT control, so each target migrate
   ULT can not exceed MIGRATE_MAX_ULT/dss_tgt_nr.

-  Add migrate_cont_open to avoid dsc_cont_open and dsc_pool_open
   for each object and dkey migration.

Features: rebuild
Required-githooks: true

Signed-off-by: Di Wang <[email protected]>
liuxuezhao pushed a commit that referenced this pull request Mar 22, 2024
To avoid retry rebuild and reclaim, let's retry rebuild
until further pool map changes, in that case, it should
fail the current rebuild, and further rebuild will resolve
the failure.

various fixs about rebuild if PS leader keeps changing
during rebuild.

Move migrate max ULT control to migrate_obj_iter_cb() to make
sure max ULT count will not exceed the setting.

Change the yield freq from 128 to 16 to make sure the object

Optimize migrate memory usage
- Add max ULT control for all targets on xstream, so
  the object being migrated can not exceed MIGRATE_MAX_ULT.

- Add each target max ULT control, so each target migrate
   ULT can not exceed MIGRATE_MAX_ULT/dss_tgt_nr.

-  Add migrate_cont_open to avoid dsc_cont_open and dsc_pool_open
   for each object and dkey migration.

Test-tag: ec_offline_rebuild
Test-repeat: 3
Required-githooks: true

Signed-off-by: Di Wang <[email protected]>
NiuYawei pushed a commit that referenced this pull request Mar 26, 2024
…3993)

To avoid retry rebuild and reclaim, let's retry rebuild
until further pool map changes, in that case, it should
fail the current rebuild, and further rebuild will resolve
the failure.

various fixs about rebuild if PS leader keeps changing
during rebuild.

Move migrate max ULT control to migrate_obj_iter_cb() to make
sure max ULT count will not exceed the setting.

Change the yield freq from 128 to 16 to make sure the object

Optimize migrate memory usage
- Add max ULT control for all targets on xstream, so
  the object being migrated can not exceed MIGRATE_MAX_ULT.

- Add each target max ULT control, so each target migrate
   ULT can not exceed MIGRATE_MAX_ULT/dss_tgt_nr.

-  Add migrate_cont_open to avoid dsc_cont_open and dsc_pool_open
   for each object and dkey migration.

- DAOS-14845 object: fix a bug with  mpt_inflight_size

one dkey migrate possible exceed the mpt_inflight_max_size, in this
case original code possibly cause the dkey migrate ULT dead loop and
then rebuild cannot complete.
Example log - "migrate_one_ult() mrone 0x7f3c91fe1ec0 wait start 0/33554432",
that case will cause the ULT wait again after wakeup until shutdown.

Signed-off-by: Di Wang <[email protected]>
Signed-off-by: Xuezhao Liu <[email protected]>
Co-authored-by: Xuezhao Liu <[email protected]>
jolivier23 pushed a commit that referenced this pull request Apr 10, 2024
* DAOS-14845 object: retry migration for retriable failure

To avoid retry rebuild and reclaim, let's retry rebuild
until further pool map changes, in that case, it should
fail the current rebuild, and further rebuild will resolve
the failure.

various fixs about rebuild if PS leader keeps changing
during rebuild.

Move migrate max ULT control to migrate_obj_iter_cb() to make
sure max ULT count will not exceed the setting.

Change the yield freq from 128 to 16 to make sure the object

Optimize migrate memory usage
- Add max ULT control for all targets on xstream, so
  the object being migrated can not exceed MIGRATE_MAX_ULT.

- Add each target max ULT control, so each target migrate
   ULT can not exceed MIGRATE_MAX_ULT/dss_tgt_nr.

-  Add migrate_cont_open to avoid dsc_cont_open and dsc_pool_open
   for each object and dkey migration.

Change-Id: I3b426542f6a5b196fc0e7cabb680d4ff9b1db65c
Signed-off-by: Di Wang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

8 participants