DAOS-14598 object: correct epoch for parity migration #13453

Merged
5 commits merged into master from wangdi/daos_14598 on Jan 10, 2024

Conversation

@wangdi1 (Contributor) commented on Dec 6, 2023

Use the stable epoch for partial parity updates to make sure these partial updates are not below the stable-epoch boundary; otherwise both EC and VOS aggregation might operate on the same recxs at the same time, which can corrupt the data during rebuild.
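As a rough illustration of the epoch selection (a minimal sketch with hypothetical names, not the code in this patch):

```c
/*
 * Minimal sketch, not the DAOS implementation: "update_eph" and
 * "stable_eph" are hypothetical names. A partial parity update written
 * during migration is clamped to the stable epoch so it can never land
 * below the boundary that VOS aggregation may already have passed,
 * which would let EC and VOS aggregation race on the same recxs.
 */
#include <stdint.h>

static inline uint64_t
parity_update_epoch(uint64_t update_eph, uint64_t stable_eph)
{
	/* Never write a partial parity update below the stable epoch. */
	return update_eph < stable_eph ? stable_eph : update_eph;
}
```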

Required-githooks: true

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate watchers.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is the master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

github-actions bot commented on Dec 6, 2023

Bug-tracker data:
Ticket title is 'daos_test/suite.py:DaosCoreTest.test_daos_rebuild_ec - At least one multi-variant server was not found in its expected state'
Status is 'In Review'
Labels: 'ci_impact,md_on_ssd,pr_test,triaged'
https://daosio.atlassian.net/browse/DAOS-14598

@daosbuild1 (Collaborator) left a comment:

LGTM. No errors found by checkpatch.

@daosbuild1 (Collaborator) left a comment:

LGTM. No errors found by checkpatch.

@wangdi1 (Contributor, Author) commented on Dec 12, 2023

ping

@wangdi1 requested a review from liuxuezhao on December 13, 2023 at 19:53
@daosbuild1 (Collaborator) left a comment:

LGTM. No errors found by checkpatch.

@liuxuezhao previously approved these changes on Dec 15, 2023

@liuxuezhao (Contributor) left a comment:

You may test with "Test-tag: aggregation".
Another point: ds_cont_child_reset_ec_agg_eph_all can reset to sc_ec_agg_eph_boundary rather than 0.
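Roughly, the second point could look like this (a minimal sketch with simplified, assumed types; the real ds_cont_child_reset_ec_agg_eph_all signature and container structures in DAOS differ):

```c
/*
 * Minimal sketch of the suggestion above, NOT the actual DAOS code:
 * the container structure and reset helper are simplified stand-ins.
 * The point is to reset the per-container EC aggregation epoch to the
 * already-known stable boundary instead of 0, so the boundary never
 * appears to move backwards after a reset.
 */
#include <stdint.h>

struct cont_child_sketch {
	uint64_t sc_ec_agg_eph;          /* current EC aggregation epoch */
	uint64_t sc_ec_agg_eph_boundary; /* globally stable EC agg boundary */
};

static void
reset_ec_agg_eph(struct cont_child_sketch *cont)
{
	/* Reset to the stable boundary rather than 0. */
	cont->sc_ec_agg_eph = cont->sc_ec_agg_eph_boundary;
}
```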

@wangdi1 (Contributor, Author) commented on Dec 16, 2023

> may test with "Test-tag: aggregation" another point is ds_cont_child_reset_ec_agg_eph_all can reset to sc_ec_agg_eph_boundary rather 0

ok

Use the stable epoch for partial parity updates to make sure these
partial updates are not below the stable-epoch boundary; otherwise
both EC and VOS aggregation might operate on the same recxs at the
same time, which can corrupt the data during rebuild.

During EC aggregation, the un-aggregated epoch on non-leader parity
shards should be considered as well. Otherwise, if the leader parity
fails, it is excluded from the global EC stable-epoch calculation
immediately; before the failed leader parity is rebuilt, the global
stable epoch might pass the un-aggregated epoch on the failed target,
and the partial updates on the data shards might then be aggregated
by VOS aggregation before EC aggregation, which can cause data
corruption.

It should also choose the shard with the lower fseq among all parity
shards as the aggregation leader, in case the last parity shard
cannot be rebuilt in time.

Required-githooks: true
Test-tag:aggregation pr
Signed-off-by: Di Wang <[email protected]>
Required-githooks: true

Signed-off-by: Di Wang <[email protected]>
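The two follow-up changes in the commit message above can be summarized in a small sketch (hypothetical structures and names, not the DAOS implementation):

```c
/*
 * Hypothetical sketch, not the DAOS code: parity shards are reduced to
 * a fail-sequence number (fseq) and their lowest un-aggregated epoch.
 * Two ideas from the commit message:
 *  - the global EC boundary must account for the minimum un-aggregated
 *    epoch of every parity shard, not only the leader's;
 *  - the parity shard with the smaller fseq is preferred as the
 *    aggregation leader, in case the last parity shard cannot be
 *    rebuilt in time.
 */
#include <stdint.h>

struct parity_shard_sketch {
	uint32_t fseq;       /* fail sequence of the shard's target */
	uint64_t unagg_eph;  /* lowest epoch not yet EC-aggregated */
};

static int
pick_agg_leader(struct parity_shard_sketch *shards, int nr,
		uint64_t *min_unagg_eph)
{
	uint64_t min_eph = UINT64_MAX;
	int      leader  = -1;
	int      i;

	for (i = 0; i < nr; i++) {
		/* Track the minimum un-aggregated epoch across all
		 * parity shards so the global boundary cannot pass it. */
		if (shards[i].unagg_eph < min_eph)
			min_eph = shards[i].unagg_eph;

		/* Prefer the shard with the smaller fseq as leader. */
		if (leader == -1 || shards[i].fseq < shards[leader].fseq)
			leader = i;
	}

	*min_unagg_eph = min_eph;
	return leader; /* -1 if no parity shard is available */
}
```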
@wangdi1 requested a review from liuxuezhao on January 3, 2024 at 04:44
@liuxuezhao previously approved these changes on Jan 3, 2024
@daosbuild1 (Collaborator) commented:

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13453/5/execution/node/1318/log

@jolivier23 (Contributor) left a comment:

Looks good, other than the spelling errors that need to be corrected.

@@ -123,13 +125,13 @@ struct ec_agg_param {
struct ec_agg_entry ap_agg_entry; /* entry used for each OID */
daos_epoch_range_t ap_epr; /* hi/lo extent threshold */
daos_epoch_t ap_filter_eph; /* Aggregatable filter epoch */
daos_epoch_t ap_min_unagg_eph; /* minum unaggregate epoch */
@jolivier23 (Contributor) commented:

need to fix this, otherwise, it will show on every PR

}
}

/* No parity shard is avaible */
@jolivier23 (Contributor) commented:

same here

correct word spelling

Test-tag: pr aggregation
Required-githooks: true

Signed-off-by: Di Wang <[email protected]>
@liuxuezhao previously approved these changes on Jan 8, 2024
@daosbuild1 (Collaborator) commented:

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13453/6/execution/node/1364/log

@daosbuild1 (Collaborator) commented:

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13453/6/execution/node/1504/log

Increase rebuild ec timeout and fix test number.

Required-githooks: true

Signed-off-by: Di Wang <[email protected]>
@daltonbohning (Contributor) commented:

FYI, Test-tag and Features are not carried forward automatically

@daosbuild1 (Collaborator) commented:

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13453/7/execution/node/1318/log

@daosbuild1 (Collaborator) commented:

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13453/7/execution/node/1410/log

@wangdi1 (Contributor, Author) commented on Jan 9, 2024

failure due to DAOS-14845

@wangdi1 requested a review from liuxuezhao on January 9, 2024 at 01:53
@wangdi1 requested a review from a team on January 9, 2024 at 19:16
@jolivier23 merged commit fc09fbe into master on Jan 10, 2024
@jolivier23 deleted the wangdi/daos_14598 branch on January 10, 2024 at 20:02
wangdi1 added a commit that referenced this pull request Feb 7, 2024
jolivier23 pushed a commit that referenced this pull request Feb 28, 2024
jolivier23 pushed a commit that referenced this pull request Mar 12, 2024
jolivier23 pushed a commit that referenced this pull request Apr 10, 2024