DAOS-14845 object: retry migration for retriable failure #13590
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13590/1/execution/node/1180/log
To avoid rerunning rebuild and reclaim, keep retrying the rebuild until the pool map changes again; in that case, the current rebuild should fail, and a further rebuild will resolve the failure. Various fixes for rebuild when the PS leader keeps changing during rebuild. Move the migrate max ULT control to migrate_obj_iter_cb() to make sure the max ULT count will not exceed the setting. Change the yield freq from 128 to 16 to make sure the object Required-githooks: true Features: ec rebuild Signed-off-by: Di Wang <[email protected]>
6dc07c4 to 39d2656
Add a max ULT control for all targets on the xstream, so the objects being migrated cannot exceed MIGRATE_MAX_ULT. Add a per-target max ULT control, so the migrate ULTs of each target cannot exceed MIGRATE_MAX_ULT/dss_tgt_nr. Add migrate_cont_open to avoid calling dsc_cont_open and dsc_pool_open for each object and dkey migration. Some minor fixes for migration. Required-githooks: true Signed-off-by: Di Wang <[email protected]>
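The two caps described in this commit message can be illustrated with a small, self-contained sketch. It is not the PR's code: the struct, the function names, and the values of SKETCH_MAX_ULT and SKETCH_TGT_NR are hypothetical stand-ins (the real limits are MIGRATE_MAX_ULT and dss_tgt_nr, whose values are not given here). It assumes a new migrate ULT is allowed only when both the total count and the per-target count are below their limits.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical values; stand-ins for MIGRATE_MAX_ULT and dss_tgt_nr. */
#define SKETCH_MAX_ULT 64
#define SKETCH_TGT_NR  4

/* Hypothetical per-xstream state: one object and one dkey ULT counter per target. */
struct sketch_tls {
	atomic_uint obj_ult_cnts[SKETCH_TGT_NR];
	atomic_uint dkey_ult_cnts[SKETCH_TGT_NR];
};

/* ULTs currently running for one target (object ULTs plus dkey ULTs). */
static unsigned int tgt_ult_count(struct sketch_tls *tls, unsigned int tgt_idx)
{
	return atomic_load(&tls->obj_ult_cnts[tgt_idx]) +
	       atomic_load(&tls->dkey_ult_cnts[tgt_idx]);
}

/* ULTs currently running across all targets. */
static unsigned int total_ult_count(struct sketch_tls *tls)
{
	unsigned int total = 0;

	for (unsigned int i = 0; i < SKETCH_TGT_NR; i++)
		total += tgt_ult_count(tls, i);
	return total;
}

/* A new ULT may start only if both the global and the per-target cap allow it. */
static bool can_spawn_ult(struct sketch_tls *tls, unsigned int tgt_idx)
{
	return total_ult_count(tls) < SKETCH_MAX_ULT &&
	       tgt_ult_count(tls, tgt_idx) < SKETCH_MAX_ULT / SKETCH_TGT_NR;
}
```

With these illustrative numbers the per-target cap is 64 / 4 = 16, so a target that already runs 16 ULTs has to wait even though the global budget is far from exhausted, while other targets can still spawn.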
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13590/3/execution/node/1452/log
Update migrate max ULT control Required-githooks: true Signed-off-by: Di Wang <[email protected]>
Ticket title is 'timeout for mdtest after killing one rank FTEST_erasurecode.EcodOnlineRebuildMdtest.1-./erasurecode/online_rebuild_mdtest.py'
fix typo Features: rebuild Required-githooks: true Signed-off-by: Di Wang <[email protected]>
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13590/5/execution/node/1199/log
Required-githooks: true
fix the segfault. Required-githooks: true Signed-off-by: Di Wang <[email protected]>
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13590/6/execution/node/1173/log
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-13590/7/display/redirect
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13590/7/execution/node/1531/log
        tgt_cnt = atomic_load(&tls->mpt_obj_ult_cnts[tgt_idx]) +
                  atomic_load(&tls->mpt_dkey_ult_cnts[tgt_idx]);
    }
This kind of "while (check_condition) { wait }" loop is quite inefficient when there is a large number of waiters.
Do you know how many waiters there could be? If there are at most tens of waiters, I think we can live with it. If there could be hundreds or thousands of waiters, the overhead of the unnecessary "wakeup -> recheck -> go back to wait" cycle for most waiters will hurt performance badly. To make it worse, these atomic operations will generate a lot of contention.
It seems there is at most one waiter per tgt per pool, @wangdi1?
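To make the concern in this thread concrete, here is a minimal sketch of a bounded-slot limiter in which a release wakes only one waiter, so a freed slot does not send every waiter through the "wakeup -> recheck -> go back to wait" cycle. It uses POSIX threads purely as a stand-in; the migration code runs in ULTs, not pthreads, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical limiter: at most `max` units of work in flight. */
struct limiter {
	pthread_mutex_t lock;
	pthread_cond_t  cond;
	int             inflight;
	int             max;
};

static void limiter_init(struct limiter *l, int max)
{
	pthread_mutex_init(&l->lock, NULL);
	pthread_cond_init(&l->cond, NULL);
	l->inflight = 0;
	l->max = max;
}

/* Block until a slot is free, then take it. */
static void limiter_acquire(struct limiter *l)
{
	pthread_mutex_lock(&l->lock);
	/* The recheck is still needed to guard against spurious wakeups. */
	while (l->inflight >= l->max)
		pthread_cond_wait(&l->cond, &l->lock);
	l->inflight++;
	pthread_mutex_unlock(&l->lock);
}

/* Free one slot and wake a single waiter rather than broadcasting to all. */
static void limiter_release(struct limiter *l)
{
	pthread_mutex_lock(&l->lock);
	l->inflight--;
	pthread_cond_signal(&l->cond);
	pthread_mutex_unlock(&l->lock);
}
```

If there really is at most one waiter per target per pool, as suggested above, the difference between signalling one waiter and waking all of them is negligible; it only starts to matter with hundreds or thousands of waiters.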
    rc = dsc_obj_fetch(oh, eph, &mrone->mo_dkey, iod_num, iods, sgls,
                       NULL, flags, NULL, csum_iov_fetch);
    if (rc == -DER_TIMEDOUT &&
        tls->mpt_version + 1 >= tls->mpt_pool->spc_map_version) {
Right, it seems impossible for "tls->mpt_version + 1 > tls->mpt_pool->spc_map_version" to hold, but maybe I am wrong.
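The condition under review can be read as a small predicate. The sketch below is an interpretation, not the PR's code: it assumes the branch means "retry the fetch", the function name is hypothetical, and SKETCH_DER_TIMEDOUT is a local stand-in whose numeric value is illustrative only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the DAOS error code; the numeric value is illustrative only. */
#define SKETCH_DER_TIMEDOUT 1011

/*
 * Retry a timed-out fetch only while the pool map has not moved more than
 * one version past the version this rebuild was started with.
 */
static bool fetch_should_retry(int rc, uint32_t rebuild_ver, uint32_t pool_map_ver)
{
	if (rc != -SKETCH_DER_TIMEDOUT)
		return false;
	return rebuild_ver + 1 >= pool_map_ver;
}
```

Under this reading, a timeout with versions (5, 6) is retried, while (5, 7) is not, because a further pool map change is expected to fail the current rebuild and let a later one resolve the failure.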
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-13590/13/display/redirect
The job failed due to an environment issue and needs to be re-triggered.
Merge branch 'master' into wangdi/rebuild_timeout
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13590/14/execution/node/1175/log
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13590/15/execution/node/1198/log
Features: rebuild Merge branch 'master' into wangdi/rebuild_timeout
@wangdi1 I know time is limited. Will we get this soon?
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13590/16/execution/node/1405/log
Merge branch 'master' into wangdi/rebuild_timeout
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13590/17/execution/node/1405/log
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13590/17/execution/node/1549/log
The failures are due to DAOS-15124 and DAOS-15127.
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13590/17/execution/node/1528/log
Also, the rebuild/container_create_race.py test failure is a known issue: https://daosio.atlassian.net/browse/DAOS-15002
DAOS-15127 is marked as resolved.
To avoid rerunning rebuild and reclaim, keep retrying the rebuild until the pool map changes again; in that case, the current rebuild should fail, and a further rebuild will resolve the failure. Various fixes for rebuild when the PS leader keeps changing during rebuild. Move the migrate max ULT control to migrate_obj_iter_cb() to make sure the max ULT count will not exceed the setting. Change the yield freq from 128 to 16 to make sure the object Optimize migrate memory usage - Add a max ULT control for all targets on the xstream, so the objects being migrated cannot exceed MIGRATE_MAX_ULT. - Add a per-target max ULT control, so the migrate ULTs of each target cannot exceed MIGRATE_MAX_ULT/dss_tgt_nr. - Add migrate_cont_open to avoid calling dsc_cont_open and dsc_pool_open for each object and dkey migration. Features: rebuild Required-githooks: true Signed-off-by: Di Wang <[email protected]>
To avoid rerunning rebuild and reclaim, keep retrying the rebuild until the pool map changes again; in that case, the current rebuild should fail, and a further rebuild will resolve the failure. Various fixes for rebuild when the PS leader keeps changing during rebuild. Move the migrate max ULT control to migrate_obj_iter_cb() to make sure the max ULT count will not exceed the setting. Change the yield freq from 128 to 16 to make sure the object Optimize migrate memory usage - Add a max ULT control for all targets on the xstream, so the objects being migrated cannot exceed MIGRATE_MAX_ULT. - Add a per-target max ULT control, so the migrate ULTs of each target cannot exceed MIGRATE_MAX_ULT/dss_tgt_nr. - Add migrate_cont_open to avoid calling dsc_cont_open and dsc_pool_open for each object and dkey migration. Test-tag: ec_offline_rebuild Test-repeat: 3 Required-githooks: true Signed-off-by: Di Wang <[email protected]>
…3993) To avoid rerunning rebuild and reclaim, keep retrying the rebuild until the pool map changes again; in that case, the current rebuild should fail, and a further rebuild will resolve the failure. Various fixes for rebuild when the PS leader keeps changing during rebuild. Move the migrate max ULT control to migrate_obj_iter_cb() to make sure the max ULT count will not exceed the setting. Change the yield freq from 128 to 16 to make sure the object Optimize migrate memory usage - Add a max ULT control for all targets on the xstream, so the objects being migrated cannot exceed MIGRATE_MAX_ULT. - Add a per-target max ULT control, so the migrate ULTs of each target cannot exceed MIGRATE_MAX_ULT/dss_tgt_nr. - Add migrate_cont_open to avoid calling dsc_cont_open and dsc_pool_open for each object and dkey migration. - DAOS-14845 object: fix a bug with mpt_inflight_size. A single dkey migration may exceed mpt_inflight_max_size; in that case the original code could leave the dkey migrate ULT in a dead loop, so the rebuild could not complete. Example log: "migrate_one_ult() mrone 0x7f3c91fe1ec0 wait start 0/33554432"; in that case the ULT goes back to waiting after every wakeup until shutdown. Signed-off-by: Di Wang <[email protected]> Signed-off-by: Xuezhao Liu <[email protected]> Co-authored-by: Xuezhao Liu <[email protected]>
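The mpt_inflight_size problem described in this commit message can be reproduced with a two-line predicate. The sketch below is illustrative only: the function names are hypothetical, and the guarded variant is one plausible way to break the loop, not necessarily the fix that was merged. It assumes the waiter checks whether adding its own size would push the in-flight total over the maximum.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Naive check: wait whenever the new request would exceed the budget.
 * A request larger than `max` waits forever, even with nothing in flight.
 */
static bool must_wait_naive(uint64_t inflight, uint64_t size, uint64_t max)
{
	return inflight + size > max;
}

/*
 * Guarded check: an oversized request is let through once nothing else is
 * in flight, so it can no longer wait on a condition that never clears.
 */
static bool must_wait_guarded(uint64_t inflight, uint64_t size, uint64_t max)
{
	return inflight > 0 && inflight + size > max;
}
```

With the values from the example log (0 bytes in flight, a maximum of 33554432), a dkey larger than the maximum makes the naive check true on every wakeup, whereas the guarded check lets it proceed.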
* DAOS-14845 object: retry migration for retriable failure To avoid rerunning rebuild and reclaim, keep retrying the rebuild until the pool map changes again; in that case, the current rebuild should fail, and a further rebuild will resolve the failure. Various fixes for rebuild when the PS leader keeps changing during rebuild. Move the migrate max ULT control to migrate_obj_iter_cb() to make sure the max ULT count will not exceed the setting. Change the yield freq from 128 to 16 to make sure the object Optimize migrate memory usage - Add a max ULT control for all targets on the xstream, so the objects being migrated cannot exceed MIGRATE_MAX_ULT. - Add a per-target max ULT control, so the migrate ULTs of each target cannot exceed MIGRATE_MAX_ULT/dss_tgt_nr. - Add migrate_cont_open to avoid calling dsc_cont_open and dsc_pool_open for each object and dkey migration. Change-Id: I3b426542f6a5b196fc0e7cabb680d4ff9b1db65c Signed-off-by: Di Wang <[email protected]>
To avoid rerunning rebuild and reclaim, keep retrying the rebuild until the pool map changes again; in that case, the current rebuild should fail, and a further rebuild will resolve the failure.
Required-githooks: true
Features: rebuild
Before requesting gatekeeper:
Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: