-
Notifications
You must be signed in to change notification settings - Fork 571
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add chunk serializer & tests #1968
Conversation
I will have a look at the windows test failures on Monday (CET). |
@gjhiggins @ashleysommer @aucampia what do you think of the approach here? The need to chunk serialize files is a small one - a project I'm working on needs it - and I thought it interesting enough to make an RDFLib tool for, rather than just keeping the code within the project. I've tried to be efficient in terms of memory usage - no duplicate graph objects etc. - and to faithfully serialize the graph but there may be smarter approaches. |
pre-commit.ci autofix |
I think it is a fairly reasonable approach. It would have been nice if we had some way to do stream serialization to some text sink, but that is quite a big change and should probably be done with an abundance of caution and this seems like a reasonable approach in the interim. I will maybe add some more tests on your branch if that is okay with you. Also happy to address comments I made, just let me know on the comment if you agree or disagree. |
Minimally, it should indicate clearly that it is restricted to @pytest.mark.xfail(reason="Context information not preserved")
def test_chunking_of_conjunctivegraph():
nquads = """\
<http://example.org/alice> <http://purl.org/dc/terms/publisher> "Alice" .
<http://example.org/bob> <http://purl.org/dc/terms/publisher> "Bob" .
_:harry <http://purl.org/dc/terms/publisher> "Harry" .
_:harry <http://xmlns.com/foaf/0.1/name> "Harry" _:harry .
_:harry <http://xmlns.com/foaf/0.1/mbox> <mailto:[email protected]> _:harry .
_:alice <http://xmlns.com/foaf/0.1/name> "Alice" <http://example.org/alice> .
_:alice <http://xmlns.com/foaf/0.1/mbox> <mailto:[email protected]> <http://example.org/alice> .
_:bob <http://xmlns.com/foaf/0.1/name> "Bob" <http://example.org/bob> .
_:bob <http://xmlns.com/foaf/0.1/mbox> <mailto:[email protected]> <http://example.org/bob> .
_:bob <http://xmlns.com/foaf/0.1/knows> _:alice <http://example.org/bob> ."""
g = ConjunctiveGraph()
g.parse(data=nquads, format="nquads")
# make a temp dir to work with
temp_dir_path = Path(tempfile.TemporaryDirectory().name)
Path(temp_dir_path).mkdir()
# serialize into chunks file with 100 triples each
serialize_in_chunks(
g, max_triples=100, file_name_stem="chunk_100", output_dir=temp_dir_path
)
# check, when a graph is made from the chunk files, it's isomorphic with original
g2 = ConjunctiveGraph()
for f in Path(temp_dir_path).glob("*.nt"):
g2.parse(f, format="nt")
assert len(list(g.contexts())) == len(list(g2.contexts()))
RDFLib has traditionally been ambivalent about what's perceived as core vs non-core. Additional functionality appears to inevitably accrete, up to a point where it gets migrated out en masse into a separate package, the contents of which gradually become obsolete as they either fall out of use or are subsequently integrated into core library functionality. Additional non-core functionality does have a regrettable tendency to languish in an untended and unkempt state. For instance, there's
Is it even worth bothering with a relatively trivial fix/update ... diff --git a/rdflib/tools/graphisomorphism.py b/rdflib/tools/graphisomorphism.py
index 004b567b..75462eb9 100644
--- a/rdflib/tools/graphisomorphism.py
+++ b/rdflib/tools/graphisomorphism.py
@@ -27,6 +27,10 @@ class IsomorphicTestableGraph(Graph):
"""
return hash(tuple(sorted(self.hashtriples())))
+ def __hash__(self):
+ # return hash(tuple(sorted(self.hashtriples())))
+ return self.internal_hash()
+
def hashtriples(self):
for triple in self:
g = ((isinstance(t, BNode) and self.vhash(t)) or t for t in triple)
@@ -49,19 +53,19 @@ class IsomorphicTestableGraph(Graph):
else:
yield self.vhash(triple[p], done=True)
- def __eq__(self, G):
+ def __eq__(self, g):
"""Graph isomorphism testing."""
- if not isinstance(G, IsomorphicTestableGraph):
+ if not isinstance(g, IsomorphicTestableGraph):
return False
- elif len(self) != len(G):
+ elif len(self) != len(g):
return False
- elif list.__eq__(list(self), list(G)):
+ elif list.__eq__(list(self), list(g)):
return True # @@
- return self.internal_hash() == G.internal_hash()
+ return self.internal_hash() == g.internal_hash()
- def __ne__(self, G):
+ def __ne__(self, g):
"""Negative graph isomorphism testing."""
- return not self.__eq__(G)
+ return not self.__eq__(g)
def main():
@@ -82,10 +86,10 @@ def main():
default="xml",
dest="inputFormat",
metavar="RDF_FORMAT",
- choices=["xml", "trix", "n3", "nt", "rdfa"],
+ choices=["xml", "n3", "nt", "turtle", "trix", "trig", "nquads", "json-ld", "hext"],
help="The format of the RDF document(s) to compare"
- + "One of 'xml','n3','trix', 'nt', "
- + "or 'rdfa'. The default is %default",
+ + "One of 'xml', 'turtle', 'n3', 'nt', 'trix', 'trig', 'nquads', 'json-ld'"
+ + "or 'hext'. The default is %default",
)
(options, args) = op.parse_args() when its appearance in There are a few non-core contributions in the closed PRs which I'm recruiting for preservaition in the cookbook. I'm guessing that a command-line version of graphisomorphism will ultimately end up there. |
NOTE: I still have not had time to look at windows issue, will try tomorrow. |
- Use bytes written as size instead of using `os.path.size()`. The second option here is very dependent on OS behaviour and what is on disk. - Add all open files to an exit stack so they are closed by the time the function returns. - Set encoding explicitly to utf-8 on opened files.
@nicholascar made some changes to your branch to get the tests to pass on windows:
|
- Very that writing triple won't exceed max file size before writing instead of after writing. This also necesitates using binary mode for file IO so that an accurate byte count can be obtained.
Another commit:
|
@nicholascar will finish this up in W32 |
pre-commit.ci autofix |
for more information, see https://pre-commit.ci
- Add type hints to rdflib.plugins.serializers.nt - Use functions from rdflib.plugins.serializers.nt instead of copying them. - Don't use `Path.cwd()` in default argument as this is set at import time and will not change if the user does chdir. - Fix docstring. - Add some paramaterized testing.
Incorrectly interpreted something as a bug in typeshed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I closed all remaining open comments, added some more tests and changed to using functions from rdflib.plugins.serializers.nt
.
I think this is good to merge now.
I will merge this later this week if there is no further feedback. |
Summary of changes
This file provides a single function
serialize_in_chunks()
which can serialize aGraph into a number of NT files with a maximum number of triples or maximum file size.
There is an option to preserve any prefixes declared for the original graph in the first
file, which will be a Turtle file.
Checklist
the same change.
./examples
for new features.CHANGELOG.md
).so maintainers can fix minor issues and keep your PR up to date.