add chunk serializer & tests #1968

Merged
merged 19 commits from chunk_serializer into master on Aug 12, 2022

Conversation

nicholascar
Member

@nicholascar nicholascar commented May 22, 2022

Summary of changes

This file provides a single function, serialize_in_chunks(), which can serialize a
Graph into a number of N-Triples (NT) files, each capped at a maximum number of triples or a maximum file size.

There is an option to preserve any prefixes declared for the original graph in the first
file, which will be a Turtle file.
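A minimal usage sketch (the import path, file names, and output directory here are illustrative assumptions; only the parameters exercised in the tests are shown):

from pathlib import Path

from rdflib import Graph
from rdflib.tools.chunk_serializer import serialize_in_chunks  # import path assumed

g = Graph()
g.parse("large_dataset.ttl")  # hypothetical input file

# write the graph out as N-Triples files of at most 10,000 triples each
serialize_in_chunks(
    g,
    max_triples=10_000,
    file_name_stem="chunk",
    output_dir=Path("/tmp/chunks"),  # hypothetical, pre-existing directory
)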

Checklist

  • Checked that there aren't other open pull requests for
    the same change.
  • Added tests for any changes that have a runtime impact.
  • Checked that all tests and type checking pass.
  • For changes that have a potential impact on users of this project:
    • Updated relevant documentation to avoid inaccuracies.
    • Considered adding additional documentation.
    • Considered adding an example in ./examples for new features.
    • Considered updating our changelog (CHANGELOG.md).
  • Considered granting push permissions to the PR branch,
    so maintainers can fix minor issues and keep your PR up to date.

@aucampia
Member

aucampia commented May 22, 2022

I will have a look at the Windows test failures on Monday (CET).

@nicholascar
Member Author

@gjhiggins @ashleysommer @aucampia what do you think of the approach here?

The need to chunk serialize files is a small one - a project I'm working on needs it - and I thought it interesting enough to make an RDFLib tool for, rather than just keeping the code within the project.

I've tried to be efficient in terms of memory usage - no duplicate graph objects etc. - and to faithfully serialize the graph but there may be smarter approaches.

@aucampia
Member

pre-commit.ci autofix

@aucampia
Member

I've tried to be efficient in terms of memory usage - no duplicate graph objects etc. - and to faithfully serialize the graph but there may be smarter approaches.

I think it is a fairly reasonable approach. It would have been nice if we had some way to do stream serialization to some text sink, but that is quite a big change and should probably be done with an abundance of caution and this seems like a reasonable approach in the interim.

I will maybe add some more tests on your branch if that is okay with you. Also happy to address comments I made, just let me know on the comment if you agree or disagree.

@ghost

ghost commented May 25, 2022

@gjhiggins @ashleysommer @aucampia what do you think of the approach here?

Minimally, it should indicate clearly that it is restricted to Graph serialization because, as the test below shows, context information is not preserved:

import tempfile
from pathlib import Path

import pytest

from rdflib import ConjunctiveGraph
from rdflib.tools.chunk_serializer import serialize_in_chunks  # import path assumed


@pytest.mark.xfail(reason="Context information not preserved")
def test_chunking_of_conjunctivegraph():
    nquads = """\
<http://example.org/alice> <http://purl.org/dc/terms/publisher> "Alice" .
<http://example.org/bob> <http://purl.org/dc/terms/publisher> "Bob" .
_:harry <http://purl.org/dc/terms/publisher> "Harry" .
_:harry <http://xmlns.com/foaf/0.1/name> "Harry" _:harry .
_:harry <http://xmlns.com/foaf/0.1/mbox> <mailto:[email protected]> _:harry .
_:alice <http://xmlns.com/foaf/0.1/name> "Alice" <http://example.org/alice> .
_:alice <http://xmlns.com/foaf/0.1/mbox> <mailto:[email protected]> <http://example.org/alice> .
_:bob <http://xmlns.com/foaf/0.1/name> "Bob" <http://example.org/bob> .
_:bob <http://xmlns.com/foaf/0.1/mbox> <mailto:[email protected]> <http://example.org/bob> .
_:bob <http://xmlns.com/foaf/0.1/knows> _:alice <http://example.org/bob> ."""
    g = ConjunctiveGraph()
    g.parse(data=nquads, format="nquads")

    # make a temp dir to work with
    temp_dir_path = Path(tempfile.TemporaryDirectory().name)
    Path(temp_dir_path).mkdir()

    # serialize into chunks file with 100 triples each
    serialize_in_chunks(
        g, max_triples=100, file_name_stem="chunk_100", output_dir=temp_dir_path
    )

    # check, when a graph is made from the chunk files, it's isomorphic with original
    g2 = ConjunctiveGraph()
    for f in Path(temp_dir_path).glob("*.nt"):
        g2.parse(f, format="nt")

    assert len(list(g.contexts())) == len(list(g2.contexts()))
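If per-context output mattered, one workaround under the current triples-only behaviour (a sketch, not something this PR implements) would be to chunk-serialize each context as its own graph, continuing from the test above; the context identifiers are still absent from the output files, but each named graph's triples stay separate:

for i, ctx in enumerate(g.contexts()):
    # each context of the ConjunctiveGraph is itself a Graph, so it can be
    # passed to serialize_in_chunks directly; the stems are illustrative only
    serialize_in_chunks(
        ctx,
        max_triples=100,
        file_name_stem=f"chunk_ctx{i}",
        output_dir=temp_dir_path,
    )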

The need to chunk serialize files is a small one - a project I'm working on needs it - and I thought it interesting enough to make an RDFLib tool for, rather than just keeping the code within the project.

RDFLib has traditionally been ambivalent about what's perceived as core vs non-core. Additional functionality appears to inevitably accrete, up to a point where it gets migrated out en masse into a separate package, the contents of which gradually become obsolete as they either fall out of use or are subsequently integrated into core library functionality.

Additional non-core functionality does have a regrettable tendency to languish in an untended and unkempt state. For instance, there's tools/graphisomorphism.py which is

  1. currently broken (and has been since 2018)
  2. long-obsolete, referring as it does to RDFa as a supported format and based on Sean B. Palmer's 2004 rdfdiff.py implementation
  3. subject to the same triples-only limitation
  4. obsoleted in functionality by both rdflib.compare.isomorphic and the weaker Graph.isomorphic()

Is it even worth bothering with a relatively trivial fix/update ...

diff --git a/rdflib/tools/graphisomorphism.py b/rdflib/tools/graphisomorphism.py
index 004b567b..75462eb9 100644
--- a/rdflib/tools/graphisomorphism.py
+++ b/rdflib/tools/graphisomorphism.py
@@ -27,6 +27,10 @@ class IsomorphicTestableGraph(Graph):
         """
         return hash(tuple(sorted(self.hashtriples())))
 
+    def __hash__(self): 
+        # return hash(tuple(sorted(self.hashtriples())))
+        return self.internal_hash()
+
     def hashtriples(self):
         for triple in self:
             g = ((isinstance(t, BNode) and self.vhash(t)) or t for t in triple)
@@ -49,19 +53,19 @@ class IsomorphicTestableGraph(Graph):
             else:
                 yield self.vhash(triple[p], done=True)
 
-    def __eq__(self, G):
+    def __eq__(self, g):
         """Graph isomorphism testing."""
-        if not isinstance(G, IsomorphicTestableGraph):
+        if not isinstance(g, IsomorphicTestableGraph):
             return False
-        elif len(self) != len(G):
+        elif len(self) != len(g):
             return False
-        elif list.__eq__(list(self), list(G)):
+        elif list.__eq__(list(self), list(g)):
             return True  # @@
-        return self.internal_hash() == G.internal_hash()
+        return self.internal_hash() == g.internal_hash()
 
-    def __ne__(self, G):
+    def __ne__(self, g):
         """Negative graph isomorphism testing."""
-        return not self.__eq__(G)
+        return not self.__eq__(g)
 
 
 def main():
@@ -82,10 +86,10 @@ def main():
         default="xml",
         dest="inputFormat",
         metavar="RDF_FORMAT",
-        choices=["xml", "trix", "n3", "nt", "rdfa"],
+        choices=["xml", "n3", "nt", "turtle", "trix", "trig", "nquads", "json-ld", "hext"],
         help="The format of the RDF document(s) to compare"
-        + "One of 'xml','n3','trix', 'nt', "
-        + "or 'rdfa'.  The default is %default",
+        + "One of 'xml', 'turtle', 'n3', 'nt', 'trix', 'trig', 'nquads', 'json-ld'"
+        + "or 'hext'.  The default is %default",
     )
 
     (options, args) = op.parse_args()

when its appearance in tools is unlikely to persist for much longer?

There are a few non-core contributions in the closed PRs which I'm recruiting for preservation in the cookbook. I'm guessing that a command-line version of graphisomorphism will ultimately end up there.

@aucampia
Member

NOTE: I still have not had time to look at the Windows issue, will try tomorrow.

@aucampia
Member

aucampia commented May 26, 2022

@nicholascar made some changes to your branch to get the tests to pass on Windows:

  • Use bytes written as size instead of using os.path.getsize().
    The latter is very dependent on OS behaviour and what is
    on disk.

  • Add all open files to an exit stack so they are closed by the time the
    function returns (sketched below).

  • Set encoding explicitly to utf-8 on opened files.
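Roughly, the size-tracking and exit-stack parts look like the sketch below (illustrative only, not the exact code in the PR; write_example() and its argument are hypothetical):

from contextlib import ExitStack

def write_example(paths):
    chars_written = 0
    with ExitStack() as stack:
        # every chunk file is registered on the stack, so all handles are
        # closed when the with-block exits, even if serialization raises
        files = [
            stack.enter_context(open(p, mode="w", encoding="utf-8")) for p in paths
        ]
        for fh in files:
            # count what was written rather than calling os.path.getsize(),
            # which depends on OS behaviour and what has been flushed to disk
            chars_written += fh.write("<urn:s> <urn:p> <urn:o> .\n")
    return chars_written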

@coveralls

coveralls commented May 26, 2022

Coverage Status

Coverage increased (+0.01%) to 90.458% when pulling 8647eb0 on chunk_serializer into 131d9e6 on master.

@aucampia
Member

Another commit:

  • Verify that writing a triple won't exceed the max file size before writing
    instead of after writing (see the sketch below).

    This also necessitates using binary mode for file IO so that an
    accurate byte count can be obtained.
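A sketch of that check (illustrative only; try_write() is a hypothetical helper, not the PR's actual function):

def try_write(fh, line: str, current_size: int, max_size: int):
    """Return the new chunk size if the line fits, or None if a new chunk file is needed."""
    # fh is expected to be opened in binary mode ("wb")
    data = line.encode("utf-8")  # encoding first gives an exact byte count to compare
    if current_size + len(data) > max_size:
        return None  # caller rolls over to the next chunk file before writing
    fh.write(data)
    return current_size + len(data)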

@nicholascar nicholascar requested a review from aucampia August 7, 2022 11:36
@aucampia
Member

aucampia commented Aug 7, 2022

@nicholascar will finish this up in W32

@aucampia
Member

aucampia commented Aug 9, 2022

pre-commit.ci autofix

pre-commit-ci bot and others added 5 commits August 9, 2022 19:25
- Add type hints to rdflib.plugins.serializers.nt
- Use functions from rdflib.plugins.serializers.nt instead of copying
  them.
- Don't use `Path.cwd()` in default argument as this is set at import
  time and will not change if the user does chdir.
- Fix docstring.
- Add some parameterized testing.
Incorrectly interpreted something as a bug in typeshed.
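As an aside on the Path.cwd() commit above, the pitfall it avoids (an illustrative sketch, not code from this PR):

from pathlib import Path
from typing import Optional

def serialize_bad(output_dir: Path = Path.cwd()):
    # the default is evaluated once, when the function is defined (i.e. at
    # import time), so a later os.chdir() by the caller has no effect here
    ...

def serialize_good(output_dir: Optional[Path] = None):
    if output_dir is None:
        output_dir = Path.cwd()  # evaluated on every call, after any chdir()
    ...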
Member

@aucampia aucampia left a comment


I closed all remaining open comments, added some more tests and changed to using functions from rdflib.plugins.serializers.nt.

I think this is good to merge now.

@aucampia aucampia requested a review from a team August 10, 2022 00:40
@aucampia aucampia added the "review wanted" label ("This indicates that the PR is ready for review") on Aug 10, 2022
@aucampia
Member

I think this is good to merge now.

I will merge this later this week if there is no further feedback.

@aucampia aucampia merged commit a4b9305 into master Aug 12, 2022
@nicholascar nicholascar deleted the chunk_serializer branch March 16, 2023 23:08