Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add type annotations to constructor parameters in Literal #1498

Merged
merged 2 commits into from
Dec 19, 2021
Merged

Add type annotations to constructor parameters in Literal #1498

merged 2 commits into from
Dec 19, 2021

Conversation

veyndan
Copy link
Contributor

@veyndan veyndan commented Dec 11, 2021

@gjhiggins From your comment here #1494 (comment), I opted to go for type annotations instead of explicit exceptions. I guess the difference between cleaner code by just having type annotations with no explicit runtime checks and having more code with explicit runtime checks is a subjective one, so I'd be interested to hear what you think. Again, just pulled this out into a separate PR as I feel it's out of the scope of the other PR.

Also, the base branch for this should be #1494, but I'm not sure how to do that in GitHub when both the branches are upstream.

@veyndan veyndan changed the title Literal lang type annotation Add type annotation to lang in Literal Dec 11, 2021
@ghost
Copy link

ghost commented Dec 11, 2021

@gjhiggins From your comment here #1494 (comment), I opted to go for type annotations instead of explicit exceptions. I guess the difference between cleaner code by just having type annotations with no explicit runtime checks and having more code with explicit runtime checks is a subjective one, so I'd be interested to hear what you think.

With the caveat that ... for the last few years, I've been working in C++ and I have only recently returned to Python which means that I haven't had any experience of using type annotations but the rationale of "opening up Python code to easier static analysis and refactoring" as described in PEP484 seems completely appropriate for a library and for this change ... your decision looks sound to me, fwiw.

Again, just pulled this out into a separate PR as I feel it's out of the scope of the other PR.

Your call, you're in driving seat.

Also, the base branch for this should be #1494, but I'm not sure how to do that in GitHub when both the branches are upstream.

Ew. Sometimes it's more work negotiating the contrib process than it is to make the fix.

Cheers

Graham

Copy link
Member

@aucampia aucampia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your change looks good but more is needed to make it pass type validation, I made a PR against your branch to fix it.

@veyndan
Copy link
Contributor Author

veyndan commented Dec 11, 2021

@gjhiggins From your comment here #1494 (comment), I opted to go for type annotations instead of explicit exceptions. I guess the difference between cleaner code by just having type annotations with no explicit runtime checks and having more code with explicit runtime checks is a subjective one, so I'd be interested to hear what you think.

With the caveat that ... for the last few years, I've been working in C++ and I have only recently returned to Python which means that I haven't had any experience of using type annotations but the rationale of "opening up Python code to easier static analysis and refactoring" as described in PEP484 seems completely appropriate for a library and for this change ... your decision looks sound to me, fwiw.

Same here, though I've been working with Kotlin instead of Python for the last few years. I haven't had any experience with type annotations either, but was excited when I heard about it.

Again, just pulled this out into a separate PR as I feel it's out of the scope of the other PR.

Your call, you're in driving seat.

Cool, yeah it seems like going with the type annotations is the way to go then.

Also, the base branch for this should be #1494, but I'm not sure how to do that in GitHub when both the branches are upstream.

Ew. Sometimes it's more work negotiating the contrib process than it is to make the fix.

Definitely.

@veyndan veyndan changed the title Add type annotation to lang in Literal Add type annotations to constructor parameters in Literal Dec 11, 2021
rdflib/term.py Outdated
@@ -548,7 +556,7 @@ def __new__(cls, lexical_or_value, lang=None, datatype=None, normalize=None):
"per http://www.w3.org/TR/rdf-concepts/#section-Graph-Literal"
)

if lang and not _is_valid_langtag(lang):
if lang is not None and not _is_valid_langtag(lang):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense, though would like @gjhiggins to also sign off on it as this is what was in #1494.

The reason this makes sense to me is that _is_valid_langtag should be true for any lang tag, and if it is not true for a lang tag then the literal should not be constructed.

With the way it was if lang and not _is_valid_langtag(lang): there is range of values for lang for which _is_valid_langtag would be false, but for which Literal objects would be constructed - and those literal objects would then have invalid language tags.

But I also think the typing here is very useful to prevent any confusion.

@aucampia
Copy link
Member

This should also ideally only be merged with #1497 - and we could do with a couple of more tests for this, some ideas:

  • test lang=""
  • test lang={}
  • test lang="e n"

I would be happy to add them myself afterwards though.

@veyndan
Copy link
Contributor Author

veyndan commented Dec 11, 2021

This should also ideally only be merged with #1497 - and we could do with a couple of more tests for this, some ideas:

  • test lang=""
  • test lang={}
  • test lang="e n"

I would be happy to add them myself afterwards though.

Regarding test lang={}, my assumption was that as we now have a type annotation of Optional[str] for lang, we don't need to test for things that aren't of the type str or None. Otherwise, everywhere where we have type annotations, we'd also need to make tests with types that differ from the type annotations, which'd cause us to have to write (and maintain) a large number of tests.

@aucampia
Copy link
Member

aucampia commented Dec 11, 2021

Otherwise, everywhere where we have type annotations, we'd also need to make tests with types that differ from the type annotations, which'd cause us to have to write (and maintain) a large number of tests.

lang={} should (and will fail) because _is_valid_langtag won't return true for lang={} - I am guessing it will be a type error due to regex not accepting values of dict type. On the other hand, if it does not fail, then _is_valid_langtag should be fixed.

Basically it should be (and with your changes is I think) impossible to construct a Literal that does not have a valid language tag.

@veyndan
Copy link
Contributor Author

veyndan commented Dec 11, 2021

Otherwise, everywhere where we have type annotations, we'd also need to make tests with types that differ from the type annotations, which'd cause us to have to write (and maintain) a large number of tests.

lang={} should (and will fail) because _is_valid_langtag won't return true for lang={} - I am guessing it will be a type error due to regex not accepting values of dict type. On the other hand, if it does not fail, then _is_valid_langtag should be fixed.

Basically it should be (and with your changes is I think) impossible to construct a Literal that does not have a valid language tag.

Yep, I agree that it will fail with the current implementation of the code, but I guess my more general question is whether we should be including the actual unit test for this at all? My objection to including the unit test is (as I mentioned) the (potential) large increase in unit tests and argument verification code if we were to apply this same logic and write unit tests passing in random objects to functions that have defined the types they support in their type annotations.

Also, having this test means that if we refactor this function, we can't assume that lang is always an Optional[str] and we have to do regular runtime type checking again using something like isinstance(lang, Optional[str]) which isn't the prettiest.

@aucampia
Copy link
Member

aucampia commented Dec 11, 2021

Yep, I agree that it will fail with the current implementation of the code, but I guess my more general question is whether we should be including the actual unit test for this at all?

I would say yes we should, because semantically there will never be a case where a Literal should construct if it is supplied with an empty dict as lang tag, and that the appropriate exception for this would always be ValueError. Refactoring may change how this exception occurs, but it should always occur when the language tag is not valid, and it will never be valid if it is not a string that is at least 1 character long and complies with the other requirements detailed in the rdf spec.

My objection to including the unit test is (as I mentioned) the (potential) large increase in unit tests and argument verification code if we were to apply this same logic and write unit tests passing in random objects to functions that have defined the types they support in their type annotations.

The right way to test this would be with parameterized tests, I will submit another PR against your branch with #1497 cherry picked in and with parameterized tests added to show you what I mean.

Also I think covering some important edge cases would be best, to ensure that we don't get regressions and especially that the same regression this is fixing does not prop up again.

Also, having this test means that if we refactor this function, we can't assume that lang is always an Optional[str] and we have to do regular runtime type checking again using something like isinstance(lang, Optional[str]) which isn't the prettiest.

I don't think there is any reason to do runtime type checks, merely by the consequence of checking _is_valid_langtag will already exclude all values that are not strings. Maybe the exact error message could use some tuning there, but already throwing a ValueError (which will be the case) will be a massive improvement over the current behaviour.

@aucampia
Copy link
Member

aucampia commented Dec 12, 2021

@veyndan I made another PR against your branch: https://github.com/veyndan/rdflib/pull/2

This cherry picks the changes from #1497 into this branch and then adds additional tests using pytest parametrize. In general it is nice to have PRs as small as possible but #1497 and this PR is very linked and I think both are fairly safe and uncontroversial.

PS: Thanks for your contribution and patience.

@veyndan
Copy link
Contributor Author

veyndan commented Dec 16, 2021

I would say yes we should, because semantically there will never be a case where a Literal should construct if it is supplied with an empty dict as lang tag, and that the appropriate exception for this would always be ValueError. Refactoring may change how this exception occurs, but it should always occur when the language tag is not valid, and it will never be valid if it is not a string that is at least 1 character long and complies with the other requirements detailed in the rdf spec.

I understand that lang being anything other than a string is invalid. My point is that we have "restricted" the type of lang to a string with the type annotations. Is it really necessary to do a runtime check when we already have the type annotations. This is a general question — it's not specific to lang.

The right way to test this would be with parameterized tests, I will submit another PR against your branch with #1497 cherry picked in and with parameterized tests added to show you what I mean.

Parameterized tests don't really change the number of tests that are run though, nor do they eliminate the extra code required to write these tests. Granted the tests would be shorter, but they're still something that need to be maintained, and the presence of them in any format should be considered.

Just to make sure I fully understand your position, here's what both our positions are (I believe — and please correct me if I'm wrong):

  • You think that even if we have a type annotation on some parameter, we still need to have a runtime check verifying the type of the parameter is correct.
  • I think that if we have a type annotation on some parameter, we don't need a runtime check verifying the type of parameter is correct.

@veyndan I made another PR against your branch: veyndan#2

This cherry picks the changes from #1497 into this branch and then adds additional tests using pytest parametrize. In general it is nice to have PRs as small as possible but #1497 and this PR is very linked and I think both are fairly safe and uncontroversial.

PS: Thanks for your contribution and patience.

Thanks for the PR, but I'll just hold off on reviewing it until we've resolved our discussion. Also, #1497 has been merged to master now.

@aucampia
Copy link
Member

I understand that lang being anything other than a string is invalid. My point is that we have "restricted" the type of lang to a string with the type annotations. Is it really necessary to do a runtime check when we already have the type annotations. This is a general question — it's not specific to lang.

The answer to this question is not really very straight forward. In brief, I think having something like this would be wrong:

if not isinstance(lang, str):
    raise TypeError(f"Type of lang {lang!r} is invalid")

And actually if you did this, mypy would error out, because warn_unreachable is enabled for rdflib, and the above would be unreachable as far as mypy is concerned, and within this is probably the best answer I can give you to this question, if mypy determines that the type checking is unreachable, then it probably should not be there.

However in this case (state of https://github.com/veyndan/rdflib/pull/2) mypy will not (and did not) determine that anything is unreachable, because nothing is unreachable.

Now as to the tests, there is a well defined value space for the language tag: it must be a string that conforms to the RDF spec for lang tag, and this value space is something that has to be checked outside of the type system as it is not expressible in the type system, and if this check does not pass an error should be raised, and we know this check should never pass for any value in this array, because they do not lie within the value space: [ "", {}, [], 1, False, True ] - so no check that adequately verifies that a value lies within the valid value space of language tag should succeed for these, and the way in which it should fail should follow the guidelines python sets for errors, which is that ValueError should occur for invalid values, and TypeError for invalid types.

So to me, it is reasonable to test that a value of {} for lang raises TypeError - because I know that there must be runtime validation of the value space, and I know that this should follow python's guidance on error checks, which is that Values outside of the value range but with right type should raise ValueError, and values outside of the value space with wrong type should raise TypeError. And again, I emphasize, what the tests I added in https://github.com/veyndan/rdflib/pull/2 check for is that the value space validation for lang is sufficiently strict.

@veyndan
Copy link
Contributor Author

veyndan commented Dec 17, 2021

The answer to this question is not really very straight forward. In brief, I think having something like this would be wrong:

if not isinstance(lang, str):
    raise TypeError(f"Type of lang {lang!r} is invalid")

And actually if you did this, mypy would error out, because warn_unreachable is enabled for rdflib, and the above would be unreachable as far as mypy is concerned, and within this is probably the best answer I can give you to this question, if mypy determines that the type checking is unreachable, then it probably should not be there.

I don't believe that's true based on the documentation where it says the warn_unreachable warning will be silenced when "the unreachable statement is a raise statement".

spec for lang tag, and this value space is something that has to be checked outside of the type system as it is not expressible in the type system, and if this check does not pass an error should be raised, and we know this check should never pass for any value in this array, because they do not lie within the value space: [ "", {}, [], 1, False, True ] - so no check that adequately verifies that a value lies within the valid value space of language tag should succeed for these, and the way in which it should fail should follow the guidelines python sets for errors, which is that ValueError should occur for invalid values, and TypeError for invalid types.

So to me, it is reasonable to test that a value of {} for lang raises TypeError - because I know that there must be runtime validation of the value space, and I know that this should follow python's guidance on error checks, which is that Values outside of the value range but with right type should raise ValueError, and values outside of the value space with wrong type should raise TypeError. And again, I emphasize, what the tests I added in veyndan#2 check for is that the value space validation for lang is sufficiently strict.

I definitely agree that we should raise a ValueError when the value is of the right type but the value is wrong — no qualms there.

With the raising of TypeError when the type is wrong, I think if there's some way that mypy could disallow unnecessary checks (from the sounds of it, what you were proposing that warn_unreachable should do) I think that'd be awesome, as I'd agree that it's good to just rely on the tools to do the type checking for us, so if that means passing in values that aren't of the type Optional[str], so be it. Personally, I just haven't seen that capability yet (though I'm not well versed in mypy which is probably most of the issue here).

Are you aware of such a feature of mypy that would cause the following to cause an error in mypy?

if not isinstance(lang, str):
raise TypeError(f"Type of lang {lang!r} is invalid")

@aucampia
Copy link
Member

I think the best way to resolve this is to start from where we agree:

I definitely agree that we should raise a ValueError when the value is of the right type but the value is wrong — no qualms there.

If x = Literal("foo",lang="999") should raise a ValueError, then what should happen when someone runs x = Literal("foo",lang={}) ? Should it construct a Literal (i.e. should constructor complete with no exception and return an object which is essentially corrupted), and if so why? And if it should not, then what is the most reasonable expectation when running x = Literal("foo",lang={})?

To me, I do expect x = Literal("foo",lang={}) to raise an exception if Literal's constructor is going to do value space validation for lang, maybe expecting it to raise a TypeError is overly specific, maybe it would be okay if it raised either a TypeError or a ValueError, but it is unclear what other type of exception would be more appropriate, definitely RuntimeError would be wrong, Exception would be overly broad, and similarly all other exceptions are significantly less appropriate than TypeError or ValueError.

I think there is maybe some argument for providing a way to disable all value space validation for cases where someone wants best effort processing of broken RDF, but given value space validation is done, and it is always done, I think it is very reasonable to always expect an error when the provided value falls outside of the value space, and {} does fall outside of the value space, and thus I do expect it to generate an error, and the most reasonable errors for it to generate are TypeError or ValueError, with TypeError being more reasonable than ValueError - but definitely, given value space validation is expected, it would be wrong for x = Literal("foo",lang={}) to succeed without raising any exception.

Consider this code:

from typing import Any, Optional, Type

import pytest


class EvenNumber:
    value: int

    def __init__(self, value: int) -> None:
        if value % 2 != 0:
            raise ValueError(f"{value} is not divisible by 2")


@pytest.mark.parametrize(
    "value, error",
    [
        (2, None),
        (1, ValueError),
        ({}, TypeError),
        ([], TypeError),
        ("not a number", TypeError),
    ],
)
def test_value_space(value: Any, error: Optional[Type[Exception]]) -> None:
    if error is not None:
        with pytest.raises(error):
            EvenNumber(value)
    else:
        EvenNumber(value)

This generates the following output:

$ poetry run pytest tests/test_valspace.py 
============================================================================ test session starts ============================================================================
platform linux -- Python 3.9.9, pytest-6.2.5, py-1.10.0, pluggy-0.13.1
rootdir: /home/iwana/sw/d/gitlab.com/aucampia/pvt/scratchpad/tech/py3, configfile: pyproject.toml
plugins: subtests-0.5.0, cov-2.12.1
collected 5 items                                                                                                                                                           

tests/test_valspace.py::test_value_space[2-None] PASSED                                                                                                               [ 20%]
tests/test_valspace.py::test_value_space[1-ValueError] PASSED                                                                                                         [ 40%]
tests/test_valspace.py::test_value_space[value2-TypeError] PASSED                                                                                                     [ 60%]
tests/test_valspace.py::test_value_space[value3-TypeError] PASSED                                                                                                     [ 80%]
tests/test_valspace.py::test_value_space[not a number-TypeError] PASSED                                                                                               [100%]

============================================================================= 5 passed in 0.02s =============================================================================

I did no explicit type checks here, but I still get TypeError here, and what is more, if I expect that a class constructor is going to do runtime value space validation, I do at the very least expect an exception when I give it a value outside of that value space, regardless of the type, to me what would be completely unexpected and wrong is if a constructor fails for something which is of the right type but outside of the value space, but succeeds for the wrong type.


And actually if you did this, mypy would error out, because warn_unreachable is enabled for rdflib, and the above would be unreachable as far as mypy is concerned, and within this is probably the best answer I can give you to this question, if mypy determines that the type checking is unreachable, then it probably should not be there.

I don't believe that's true based on the documentation where it says the warn_unreachable warning will be silenced when "the unreachable statement is a raise statement".

Today I learn, I guess this behaviour makes some sense, regardless though, I still do not think that the following is needed:

if not isinstance(lang, str):
    raise TypeError(f"Type of lang {lang!r} is invalid")

And again I mainly don't think it is needed because a type error will already be raised if lang is not a string with your code as is.

With the raising of TypeError when the type is wrong, I think if there's some way that mypy could disallow unnecessary checks (from the sounds of it, what you were proposing that warn_unreachable should do) I think that'd be awesome

I think maybe pyright can do it, but I am not sure that really is not that big a problem, anyway not at the moment. And I think it is also good to consider that a lot of rdflib users don't use mypy, and this may remain the case for a long time, because rdflib is used a lot in academic and scientific settings.

@veyndan
Copy link
Contributor Author

veyndan commented Dec 17, 2021

Thanks for the write-up! I think you raised some really good points, but I think I'll still need to mull it over a bit more. In order to not hold up these PRs any more than they already have, I'll add the proposed tests (thank you for all the PRs you've made against my fork btw!)

veyndan and others added 2 commits December 18, 2021 12:16
_value needs an explicit type and also added types for:

- `lexical_or_value`
- `datatype`
- `normalize`
@aucampia
Copy link
Member

aucampia commented Dec 18, 2021

I think this maybe need a rebasing from master to get cleaner diff, not sure why it shows tests being added in diff (master...veyndan:literal-lang-TypeAnnotation) from master if they are already there:

class TestNewPT:
# NOTE: TestNewPT is written for pytest so that pytest features like
# parametrize can be used.
# New tests should be added here instead of in TestNew.
@pytest.mark.parametrize(
"lang, exception_type",
[
({}, TypeError),
([], TypeError),
(1, TypeError),
(b"en", TypeError),
("999", ValueError),
("-", ValueError),
],
)
def test_cant_pass_invalid_lang(
self,
lang: Any,
exception_type: Type[Exception],
):
"""
Construction of Literal fails if the language tag is invalid.
"""
with pytest.raises(exception_type):
Literal("foo", lang=lang)

@veyndan
Copy link
Contributor Author

veyndan commented Dec 18, 2021

Rebased onto master

Copy link
Member

@aucampia aucampia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks

@nicholascar nicholascar self-requested a review December 19, 2021 22:18
@nicholascar nicholascar merged commit f43ad3f into RDFLib:master Dec 19, 2021
@veyndan veyndan deleted the literal-lang-TypeAnnotation branch December 19, 2021 22:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants