Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement RDF Patch serializer #2877

Merged

Conversation

recalcitrantsupplant
Copy link
Contributor

@recalcitrantsupplant recalcitrantsupplant commented Aug 6, 2024

Supports serialization from Dataset instances only; triples and quads within a Dataset are supported.

Summary of changes

Three methods to create RDF Patches from RDFLib Datasets:

  1. Serialize a Dataset as an addition patch
  2. Serialize a Dataset as a delete patch
  3. Create a patch representing the difference between a Dataset instance and a target Dataset instance

Basic usage:

  1. ds.serialize(format="patch", operation="add")
  2. ds.serialize(format="patch", operation="remove")
  3. ds1.serialize(format="patch", target=ds2)

Complete examples are provided in an example script.

Checklist

  • Checked that there aren't other open pull requests for
    the same change.
  • Checked that all tests and type checking passes.
  • If the change adds new features or changes the RDFLib public API:
    • Created an issue to discuss the change and get in-principle agreement. Discussed w/ maintainers
    • Considered adding an example in ./examples.
  • If the change has a potential impact on users of this project:
    • Updated relevant documentation to avoid inaccuracies.
    • Considered adding additional documentation.
  • Considered granting push permissions to the PR branch,
    so maintainers can fix minor issues and keep your PR up to date.

Copy link
Contributor

@edmondchuc edmondchuc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed in our chat, this version of the serializer could function as a utility function rather than a serializer, performing one of two operations:

  • Add or Remove Operation: Takes a dataset with an add or remove operation parameter and generates an RDF patch.

  • Difference Operation: Takes two datasets and performs a difference operation, with the second dataset as the "target".

This approach allows us to reserve the patch serializer for RDF patches on an RDFLib store with change tracking enabled. The backend would track changes applied to the store through a series of RDF patches, and the serializer would output these patches sequentially.

I'm not suggesting we need to implement this, but just toying with what the potential RDFLib public interfaces may look like with a complete RDF patch support.

@ashleysommer
Copy link
Contributor

The mypy issues in main should be resolved now. I've resynced this PR, hopefully it parses now.

@ashleysommer
Copy link
Contributor

@recalcitrantsupplant
Still 4 mypy errors related to this PR:

poetry run python -m mypy --show-error-context --show-error-codes --junit-xml=test_reports/3.9-macos-latest-mypy-junit.xml
  rdflib/plugins/serializers/patch.py: note: In member "serialize" of class "PatchSerializer":
  rdflib/plugins/serializers/patch.py:30: error: Signature of "serialize" incompatible with supertype "Serializer"  [override]
  rdflib/plugins/serializers/patch.py:30: note:      Superclass:
  rdflib/plugins/serializers/patch.py:30: note:          def serialize(self, stream: IO[bytes], base: Optional[str] = ..., encoding: Optional[str] = ..., **args: Any) -> None
  rdflib/plugins/serializers/patch.py:30: note:      Subclass:
  rdflib/plugins/serializers/patch.py:30: note:          def serialize(self, stream: IO[bytes], base: Optional[str] = ..., encoding: Optional[str] = ..., operation: Optional[str] = ..., target: Optional[Graph] = ..., header_id: Optional[str] = ..., header_prev: Optional[str] = ...) -> Any
  rdflib/plugins/serializers/patch.py: note: In function "serialize":
  rdflib/plugins/serializers/patch.py:60: error: "Graph" has no attribute "get_context"  [attr-defined]
  rdflib/plugins/serializers/patch.py: note: In member "serialize" of class "PatchSerializer":
  rdflib/plugins/serializers/patch.py:74: error: "Graph" has no attribute "contexts"  [attr-defined]
  rdflib/plugins/serializers/patch.py: note: In member "_patch_row" of class "PatchSerializer":
  rdflib/plugins/serializers/patch.py:88: error: "Graph" has no attribute "default_context"  [attr-defined]
  Found 4 errors in 1 file (checked 401 source files)

@coveralls
Copy link

coveralls commented Aug 11, 2024

Coverage Status

coverage: 90.611% (+0.02%) from 90.595%
when pulling e58efff on recalcitrantsupplant:david/rdf-patch-serialiser
into 0c11deb on RDFLib:main.

@ashleysommer ashleysommer merged commit 404be3b into RDFLib:main Aug 26, 2024
22 checks passed
@ashleysommer
Copy link
Contributor

ashleysommer commented Aug 26, 2024

@recalcitrantsupplant I thought this was complete and passing tests and ready to merge, however after merging it there are now some test failures appearing in main, and I've verified locally it has something to do with the roundtrip tests.
As this .patch parser is a registered RDFLib parser plugin, the test suite includes it in the roundtrip tests. There are a bunch of .nt->.patch conversions and .n3->.patch roundtrip tests that are failing with odd errors.
I don't have time to troubleshoot it now.

As @edmondchuc mentioned above, I don't know what use this is as a regular RDFLib serializer, because you really need to associate it with actions such as Add or Remove or target, which are not defined by default when doing operations like roundtrip. This really makes sense as a helper function, that allows you to script the creation of a .patch file.

To allow main to pass tests again, I'm going to revert this merge. We can revisit it again soon.

@ashleysommer
Copy link
Contributor

ashleysommer commented Aug 26, 2024

@recalcitrantsupplant Another update, I haven't revered this yet.
I found that if I add a fallback to find cases where operation is not passed, but target is also not passed, then default to "add" operation. With simple change, now all roundtrip tests are passing. Does that sound like a sensible change to you? I would assume anyone tring to naively serialize a graph or Dataset to .patch format without an operation is going to by default want the "add" patch operation.

This prevents unexpectedly getting a .patch file that contains only the header block and no content (thats why the test cases were failing.)

@ashleysommer
Copy link
Contributor

The issue described in the previous two comments is now fixed by #2898 so tests in main are now passing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants