add chunk serializer & tests #1968

Merged
merged 19 commits from chunk_serializer into master on Aug 12, 2022

Conversation

nicholascar
Member

@nicholascar nicholascar commented May 22, 2022

Summary of changes

This file provides a single function, serialize_in_chunks(), which can serialize a
Graph into a number of N-Triples (NT) files, each capped at a maximum number of triples or a maximum file size.

There is an option to preserve any prefixes declared for the original graph in the first
file, which will be a Turtle file.
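A minimal usage sketch (the import path, file names, and output directory here are illustrative assumptions; only the parameters exercised in the tests are shown):

from pathlib import Path

from rdflib import Graph
from rdflib.tools.chunk_serializer import serialize_in_chunks  # import path assumed

g = Graph()
g.parse("large_dataset.ttl")  # hypothetical input file

# write the graph out as N-Triples files of at most 10,000 triples each
serialize_in_chunks(
    g,
    max_triples=10_000,
    file_name_stem="chunk",
    output_dir=Path("/tmp/chunks"),  # hypothetical, pre-existing directory
)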

Checklist

  • Checked that there aren't other open pull requests for
    the same change.
  • Added tests for any changes that have a runtime impact.
  • Checked that all tests and type checking pass.
  • For changes that have a potential impact on users of this project:
    • Updated relevant documentation to avoid inaccuracies.
    • Considered adding additional documentation.
    • Considered adding an example in ./examples for new features.
    • Considered updating our changelog (CHANGELOG.md).
  • Considered granting push permissions to the PR branch,
    so maintainers can fix minor issues and keep your PR up to date.

@aucampia
Member

aucampia commented May 22, 2022

I will have a look at the Windows test failures on Monday (CET).

@nicholascar
Member Author

@gjhiggins @ashleysommer @aucampia what do you think of the approach here?

The need to chunk serialize files is a small one - a project I'm working on needs it - and I thought it interesting enough to make an RDFLib tool for, rather than just keeping the code within the project.

I've tried to be efficient in terms of memory usage - no duplicate graph objects etc. - and to faithfully serialize the graph but there may be smarter approaches.

@aucampia
Member

pre-commit.ci autofix

@aucampia
Member

I've tried to be efficient in terms of memory usage - no duplicate graph objects etc. - and to faithfully serialize the graph but there may be smarter approaches.

I think it is a fairly reasonable approach. It would have been nice if we had some way to do stream serialization to some text sink, but that is quite a big change and should probably be done with an abundance of caution and this seems like a reasonable approach in the interim.

I will maybe add some more tests on your branch if that is okay with you. Also happy to address comments I made, just let me know on the comment if you agree or disagree.

@ghost

ghost commented May 25, 2022

@gjhiggins @ashleysommer @aucampia what do you think of the approach here?

Minimally, it should indicate clearly that it is restricted to Graph serialization because, as the test below shows, context information is not preserved:

import tempfile
from pathlib import Path

import pytest

from rdflib import ConjunctiveGraph
from rdflib.tools.chunk_serializer import serialize_in_chunks  # import path assumed


@pytest.mark.xfail(reason="Context information not preserved")
def test_chunking_of_conjunctivegraph():
    nquads = """\
<http://example.org/alice> <http://purl.org/dc/terms/publisher> "Alice" .
<http://example.org/bob> <http://purl.org/dc/terms/publisher> "Bob" .
_:harry <http://purl.org/dc/terms/publisher> "Harry" .
_:harry <http://xmlns.com/foaf/0.1/name> "Harry" _:harry .
_:harry <http://xmlns.com/foaf/0.1/mbox> <mailto:[email protected]> _:harry .
_:alice <http://xmlns.com/foaf/0.1/name> "Alice" <http://example.org/alice> .
_:alice <http://xmlns.com/foaf/0.1/mbox> <mailto:[email protected]> <http://example.org/alice> .
_:bob <http://xmlns.com/foaf/0.1/name> "Bob" <http://example.org/bob> .
_:bob <http://xmlns.com/foaf/0.1/mbox> <mailto:[email protected]> <http://example.org/bob> .
_:bob <http://xmlns.com/foaf/0.1/knows> _:alice <http://example.org/bob> ."""
    g = ConjunctiveGraph()
    g.parse(data=nquads, format="nquads")

    # make a temp dir to work with
    temp_dir_path = Path(tempfile.TemporaryDirectory().name)
    Path(temp_dir_path).mkdir()

    # serialize into chunks file with 100 triples each
    serialize_in_chunks(
        g, max_triples=100, file_name_stem="chunk_100", output_dir=temp_dir_path
    )

    # check, when a graph is made from the chunk files, it's isomorphic with original
    g2 = ConjunctiveGraph()
    for f in Path(temp_dir_path).glob("*.nt"):
        g2.parse(f, format="nt")

    assert len(list(g.contexts())) == len(list(g2.contexts()))
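If per-context output mattered, one workaround under the current triples-only behaviour (a sketch, not something this PR implements) would be to chunk-serialize each context as its own graph, continuing from the test above; the context identifiers are still absent from the output files, but each named graph's triples stay separate:

for i, ctx in enumerate(g.contexts()):
    # each context of the ConjunctiveGraph is itself a Graph, so it can be
    # passed to serialize_in_chunks directly; the stems are illustrative only
    serialize_in_chunks(
        ctx,
        max_triples=100,
        file_name_stem=f"chunk_ctx{i}",
        output_dir=temp_dir_path,
    )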

The need to chunk serialize files is a small one - a project I'm working on needs it - and I thought it interesting enough to make an RDFLib tool for, rather than just keeping the code within the project.

RDFLib has traditionally been ambivalent about what's perceived as core vs non-core. Additional functionality appears to inevitably accrete, up to a point where it gets migrated out en masse into a separate package, the contents of which gradually become obsolete as they either fall out of use or are subsequently integrated into core library functionality.

Additional non-core functionality does have a regrettable tendency to languish in an untended and unkempt state. For instance, there's tools/graphisomorphism.py which is

  1. currently broken (and has been since 2018)
  2. long-obsolete, referring as it does to RDFa as a supported format and based on Sean B. Palmer's 2004 rdfdiff.py implementation
  3. subject to the same triples-only limitation
  4. obsoleted in functionality by both rdflib.compare.isomorphic and the weaker Graph.isomorphic()

Is it even worth bothering with a relatively trivial fix/update ...

diff --git a/rdflib/tools/graphisomorphism.py b/rdflib/tools/graphisomorphism.py
index 004b567b..75462eb9 100644
--- a/rdflib/tools/graphisomorphism.py
+++ b/rdflib/tools/graphisomorphism.py
@@ -27,6 +27,10 @@ class IsomorphicTestableGraph(Graph):
         """
         return hash(tuple(sorted(self.hashtriples())))
 
+    def __hash__(self): 
+        # return hash(tuple(sorted(self.hashtriples())))
+        return self.internal_hash()
+
     def hashtriples(self):
         for triple in self:
             g = ((isinstance(t, BNode) and self.vhash(t)) or t for t in triple)
@@ -49,19 +53,19 @@ class IsomorphicTestableGraph(Graph):
             else:
                 yield self.vhash(triple[p], done=True)
 
-    def __eq__(self, G):
+    def __eq__(self, g):
         """Graph isomorphism testing."""
-        if not isinstance(G, IsomorphicTestableGraph):
+        if not isinstance(g, IsomorphicTestableGraph):
             return False
-        elif len(self) != len(G):
+        elif len(self) != len(g):
             return False
-        elif list.__eq__(list(self), list(G)):
+        elif list.__eq__(list(self), list(g)):
             return True  # @@
-        return self.internal_hash() == G.internal_hash()
+        return self.internal_hash() == g.internal_hash()
 
-    def __ne__(self, G):
+    def __ne__(self, g):
         """Negative graph isomorphism testing."""
-        return not self.__eq__(G)
+        return not self.__eq__(g)
 
 
 def main():
@@ -82,10 +86,10 @@ def main():
         default="xml",
         dest="inputFormat",
         metavar="RDF_FORMAT",
-        choices=["xml", "trix", "n3", "nt", "rdfa"],
+        choices=["xml", "n3", "nt", "turtle", "trix", "trig", "nquads", "json-ld", "hext"],
         help="The format of the RDF document(s) to compare"
-        + "One of 'xml','n3','trix', 'nt', "
-        + "or 'rdfa'.  The default is %default",
+        + "One of 'xml', 'turtle', 'n3', 'nt', 'trix', 'trig', 'nquads', 'json-ld'"
+        + "or 'hext'.  The default is %default",
     )
 
     (options, args) = op.parse_args()

when its appearance in tools is unlikely to persist for much longer?

There are a few non-core contributions in the closed PRs which I'm recruiting for preservation in the cookbook. I'm guessing that a command-line version of graphisomorphism will ultimately end up there.

@aucampia
Member

NOTE: I still have not had time to look at the Windows issue, will try tomorrow.

@aucampia
Member

aucampia commented May 26, 2022

@nicholascar made some changes to your branch to get the tests to pass on Windows:

  • Use bytes written as size instead of using os.path.getsize().
    The latter is very dependent on OS behaviour and what is
    on disk.

  • Add all open files to an exit stack so they are closed by the time the
    function returns (sketched below).

  • Set encoding explicitly to utf-8 on opened files.
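Roughly, the size-tracking and exit-stack parts look like the sketch below (illustrative only, not the exact code in the PR; write_example() and its argument are hypothetical):

from contextlib import ExitStack

def write_example(paths):
    chars_written = 0
    with ExitStack() as stack:
        # every chunk file is registered on the stack, so all handles are
        # closed when the with-block exits, even if serialization raises
        files = [
            stack.enter_context(open(p, mode="w", encoding="utf-8")) for p in paths
        ]
        for fh in files:
            # count what was written rather than calling os.path.getsize(),
            # which depends on OS behaviour and what has been flushed to disk
            chars_written += fh.write("<urn:s> <urn:p> <urn:o> .\n")
    return chars_written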

@coveralls

coveralls commented May 26, 2022

Coverage Status

Coverage increased (+0.01%) to 90.458% when pulling 8647eb0 on chunk_serializer into 131d9e6 on master.

@aucampia
Member

Another commit:

  • Verify that writing a triple won't exceed the max file size before writing
    instead of after writing (see the sketch below).

    This also necessitates using binary mode for file IO so that an
    accurate byte count can be obtained.
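A sketch of that check (illustrative only; try_write() is a hypothetical helper, not the PR's actual function):

def try_write(fh, line: str, current_size: int, max_size: int):
    """Return the new chunk size if the line fits, or None if a new chunk file is needed."""
    # fh is expected to be opened in binary mode ("wb")
    data = line.encode("utf-8")  # encoding first gives an exact byte count to compare
    if current_size + len(data) > max_size:
        return None  # caller rolls over to the next chunk file before writing
    fh.write(data)
    return current_size + len(data)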

@nicholascar nicholascar requested a review from aucampia August 7, 2022 11:36
@aucampia
Member

aucampia commented Aug 7, 2022

@nicholascar will finish this up in W32

@aucampia
Member

aucampia commented Aug 9, 2022

pre-commit.ci autofix

pre-commit-ci bot and others added 5 commits August 9, 2022 19:25
- Add type hints to rdflib.plugins.serializers.nt
- Use functions from rdflib.plugins.serializers.nt instead of copying
  them.
- Don't use `Path.cwd()` in default argument as this is set at import
  time and will not change if the user does chdir.
- Fix docstring.
- Add some parameterized testing.
Incorrectly interpreted something as a bug in typeshed.
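As an aside on the Path.cwd() commit above, the pitfall it avoids (an illustrative sketch, not code from this PR):

from pathlib import Path
from typing import Optional

def serialize_bad(output_dir: Path = Path.cwd()):
    # the default is evaluated once, when the function is defined (i.e. at
    # import time), so a later os.chdir() by the caller has no effect here
    ...

def serialize_good(output_dir: Optional[Path] = None):
    if output_dir is None:
        output_dir = Path.cwd()  # evaluated on every call, after any chdir()
    ...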
Member

@aucampia aucampia left a comment


I closed all remaining open comments, added some more tests and changed to using functions from rdflib.plugins.serializers.nt.

I think this is good to merge now.

@aucampia aucampia requested a review from a team August 10, 2022 00:40
@aucampia aucampia added the "review wanted" label ("This indicates that the PR is ready for review") on Aug 10, 2022
@aucampia
Member

I think this is good to merge now.

I will merge this later this week if there is no further feedback.

@aucampia aucampia merged commit a4b9305 into master Aug 12, 2022
@nicholascar nicholascar deleted the chunk_serializer branch March 16, 2023 23:08