DAOS-14598 object: correct epoch for parity migration #13453

Merged
5 commits merged into master from wangdi/daos_14598 on Jan 10, 2024

Conversation

@wangdi1 (Contributor) commented on Dec 6, 2023

Use the stable epoch for partial parity updates to make sure these partial updates are not below the stable-epoch boundary; otherwise both EC and VOS aggregation might operate on the same recxs at the same time, which can corrupt the data during rebuild.
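As a rough illustration of the epoch selection (a minimal sketch with hypothetical names, not the code in this patch):

```c
/*
 * Minimal sketch, not the DAOS implementation: "update_eph" and
 * "stable_eph" are hypothetical names. A partial parity update written
 * during migration is clamped to the stable epoch so it can never land
 * below the boundary that VOS aggregation may already have passed,
 * which would let EC and VOS aggregation race on the same recxs.
 */
#include <stdint.h>

static inline uint64_t
parity_update_epoch(uint64_t update_eph, uint64_t stable_eph)
{
	/* Never write a partial parity update below the stable epoch. */
	return update_eph < stable_eph ? stable_eph : update_eph;
}
```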

Required-githooks: true

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate watchers.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is the master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

github-actions bot commented on Dec 6, 2023

Bug-tracker data:
Ticket title is 'daos_test/suite.py:DaosCoreTest.test_daos_rebuild_ec - At least one multi-variant server was not found in its expected state'
Status is 'In Review'
Labels: 'ci_impact,md_on_ssd,pr_test,triaged'
https://daosio.atlassian.net/browse/DAOS-14598

@daosbuild1 (Collaborator) left a comment:

LGTM. No errors found by checkpatch.

@daosbuild1 (Collaborator) left a comment:

LGTM. No errors found by checkpatch.

@wangdi1 (Contributor, Author) commented on Dec 12, 2023

ping

@wangdi1 requested a review from liuxuezhao on December 13, 2023 at 19:53
@daosbuild1 (Collaborator) left a comment:

LGTM. No errors found by checkpatch.

@liuxuezhao previously approved these changes on Dec 15, 2023

@liuxuezhao (Contributor) left a comment:

You may test with "Test-tag: aggregation".
Another point: ds_cont_child_reset_ec_agg_eph_all can reset to sc_ec_agg_eph_boundary rather than 0.
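Roughly, the second point could look like this (a minimal sketch with simplified, assumed types; the real ds_cont_child_reset_ec_agg_eph_all signature and container structures in DAOS differ):

```c
/*
 * Minimal sketch of the suggestion above, NOT the actual DAOS code:
 * the container structure and reset helper are simplified stand-ins.
 * The point is to reset the per-container EC aggregation epoch to the
 * already-known stable boundary instead of 0, so the boundary never
 * appears to move backwards after a reset.
 */
#include <stdint.h>

struct cont_child_sketch {
	uint64_t sc_ec_agg_eph;          /* current EC aggregation epoch */
	uint64_t sc_ec_agg_eph_boundary; /* globally stable EC agg boundary */
};

static void
reset_ec_agg_eph(struct cont_child_sketch *cont)
{
	/* Reset to the stable boundary rather than 0. */
	cont->sc_ec_agg_eph = cont->sc_ec_agg_eph_boundary;
}
```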

@wangdi1 (Contributor, Author) commented on Dec 16, 2023

> may test with "Test-tag: aggregation" another point is ds_cont_child_reset_ec_agg_eph_all can reset to sc_ec_agg_eph_boundary rather 0

ok

Use the stable epoch for partial parity updates to make sure these
partial updates are not below the stable-epoch boundary; otherwise
both EC and VOS aggregation might operate on the same recxs at the
same time, which can corrupt the data during rebuild.

During EC aggregation, the un-aggregated epoch on non-leader parity
shards should be considered as well. Otherwise, if the leader parity
fails, it is excluded from the global EC stable-epoch calculation
immediately; before the failed leader parity is rebuilt, the global
stable epoch might pass the un-aggregated epoch on the failed target,
and the partial updates on the data shards might then be aggregated
by VOS aggregation before EC aggregation, which can cause data
corruption.

It should also choose the shard with the lower fseq among all parity
shards as the aggregation leader, in case the last parity shard
cannot be rebuilt in time.

Required-githooks: true
Test-tag:aggregation pr
Signed-off-by: Di Wang <[email protected]>
Required-githooks: true

Signed-off-by: Di Wang <[email protected]>
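The two follow-up changes in the commit message above can be summarized in a small sketch (hypothetical structures and names, not the DAOS implementation):

```c
/*
 * Hypothetical sketch, not the DAOS code: parity shards are reduced to
 * a fail-sequence number (fseq) and their lowest un-aggregated epoch.
 * Two ideas from the commit message:
 *  - the global EC boundary must account for the minimum un-aggregated
 *    epoch of every parity shard, not only the leader's;
 *  - the parity shard with the smaller fseq is preferred as the
 *    aggregation leader, in case the last parity shard cannot be
 *    rebuilt in time.
 */
#include <stdint.h>

struct parity_shard_sketch {
	uint32_t fseq;       /* fail sequence of the shard's target */
	uint64_t unagg_eph;  /* lowest epoch not yet EC-aggregated */
};

static int
pick_agg_leader(struct parity_shard_sketch *shards, int nr,
		uint64_t *min_unagg_eph)
{
	uint64_t min_eph = UINT64_MAX;
	int      leader  = -1;
	int      i;

	for (i = 0; i < nr; i++) {
		/* Track the minimum un-aggregated epoch across all
		 * parity shards so the global boundary cannot pass it. */
		if (shards[i].unagg_eph < min_eph)
			min_eph = shards[i].unagg_eph;

		/* Prefer the shard with the smaller fseq as leader. */
		if (leader == -1 || shards[i].fseq < shards[leader].fseq)
			leader = i;
	}

	*min_unagg_eph = min_eph;
	return leader; /* -1 if no parity shard is available */
}
```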
@wangdi1 requested a review from liuxuezhao on January 3, 2024 at 04:44
@liuxuezhao previously approved these changes on Jan 3, 2024
@daosbuild1 (Collaborator) commented:

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13453/5/execution/node/1318/log

@jolivier23 (Contributor) left a comment:

Looks good, other than the spelling errors that need to be corrected.

@@ -123,13 +125,13 @@ struct ec_agg_param {
struct ec_agg_entry ap_agg_entry; /* entry used for each OID */
daos_epoch_range_t ap_epr; /* hi/lo extent threshold */
daos_epoch_t ap_filter_eph; /* Aggregatable filter epoch */
daos_epoch_t ap_min_unagg_eph; /* minum unaggregate epoch */
@jolivier23 (Contributor) commented:

need to fix this, otherwise, it will show on every PR

}
}

/* No parity shard is avaible */
@jolivier23 (Contributor) commented:

same here

correct word spelling

Test-tag: pr aggregation
Required-githooks: true

Signed-off-by: Di Wang <[email protected]>
@liuxuezhao previously approved these changes on Jan 8, 2024
@daosbuild1 (Collaborator) commented:

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13453/6/execution/node/1364/log

@daosbuild1 (Collaborator) commented:

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13453/6/execution/node/1504/log

Increase rebuild ec timeout and fix test number.

Required-githooks: true

Signed-off-by: Di Wang <[email protected]>
@daltonbohning (Contributor) commented:

FYI, Test-tag and Features are not carried forward automatically

@daosbuild1 (Collaborator) commented:

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13453/7/execution/node/1318/log

@daosbuild1 (Collaborator) commented:

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13453/7/execution/node/1410/log

@wangdi1 (Contributor, Author) commented on Jan 9, 2024

failure due to DAOS-14845

@wangdi1 requested a review from liuxuezhao on January 9, 2024 at 01:53
@wangdi1 requested a review from a team on January 9, 2024 at 19:16
@jolivier23 merged commit fc09fbe into master on Jan 10, 2024
@jolivier23 deleted the wangdi/daos_14598 branch on January 10, 2024 at 20:02
wangdi1 added a commit that referenced this pull request Feb 7, 2024
jolivier23 pushed a commit that referenced this pull request Feb 28, 2024
jolivier23 pushed a commit that referenced this pull request Mar 12, 2024
jolivier23 pushed a commit that referenced this pull request Apr 10, 2024