Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python] The Pants Rust inference parser exhibits some dependency misinterpretation issues when python-infer.string_imports is enabled. #20324

Open
4 tasks done
liudonggalaxy opened this issue Dec 21, 2023 · 8 comments
Assignees
Labels
backend: Python Python backend-related issues bug
Milestone

Comments

@liudonggalaxy
Copy link

liudonggalaxy commented Dec 21, 2023

Maintainer summary

  • find tree sitter incantantion required to handle concatenated strings with [python-infer].string_imports. (issue 1 in the orginal report)
  • strings that is not a valid python identifier: a.b. (issue 2 in original report)
  • Other strings that are not valid python identifiers retrying... (issue 3 in the orginal report)
  • Other strings that are not valid python identifiers (., .., etc) (issue reported in comments)

Original report

Describe the bug
The Pants Rust inference parser exhibits some dependency misinterpretation issues when python-infer.string_imports is enabled. This is evident in the following examples:

  1. The Rust parser in Pants could not infer dependencies across multiple lines correctly. In the following example, the Rust parser identifies a.b as a dependency, whereas the classic parser correctly discerns a.b.c.d:
@mock.patch(
    'a.b.' 
    'c.d', 
    new=mock.PropertyMock(return_value=0.7))
  1. For the string BASE_PATH = 'a.b.', the Rust parser mistakenly treats it as a dependency, unlike the classic parser.
  2. In print('retrying...'), the Rust parser incorrectly interprets it as a dependency, while the classic parser does not.

To compare the Rust and classic parsers, the following commands were executed on both MacOS and Linux:

mkdir -p pants_inference
pants --python-infer-no-use-rust-parser peek :: > pants_inference/classic_parser.json
pants --python-infer-use-rust-parser peek :: > pants_inference/rust_parser.json
diff pants_inference/classic_parser.json pants_inference/rust_parser.json

Here is the pants configuration in pants.toml.

[python-infer]
unowned_dependency_behavior = "error"
use_rust_parser = true
string_imports = true  # Infer a target's dependencies based on strings that look like dynamic dependencies

Pants version
2.17.0

OS
Both MacOS and Linux.

Additional info

@huonw huonw added the backend: Python Python backend-related issues label Dec 21, 2023
@thejcannon
Copy link
Member

1 seems like a bug. We'll likely have to find the right tree-sitter incantation for that.

2 and 3 are more of a "quirk" than a bug. However, since they aren't valid dotted module names (as far as I know) I think it's safe to also "fix" this quirk.

@cognifloyd
Copy link
Member

cognifloyd commented Jan 31, 2024

Similar to point 2: the rust parser also causes a warning for every file that has either of these strings, neither of which is a valid python module:

  • .
  • ..

I get that using pants 2.18.2 with this in pants.toml (from the StackStorm project):

[python-infer]
string_imports = true
string_imports_min_dots = 1  # tools/config_gen.py has import strings with only one dot.
unowned_dependency_behavior = "ignore"
ambiguity_resolution = "by_source_root"
use_rust_parser = true

The old parser (in 2.18.x) explicitly required alphanumeric characters before the dot:

self._string_import_regex = re.compile(
r"^([a-z_][a-z_\d]*\.){"
+ os.environ["STRING_IMPORTS_MIN_DOTS"]
+ r",}[a-zA-Z_]\w*$",
re.UNICODE,
)

The new rust parser, however, seems to only exclude strings with a whitespace or a \ char (assuming I'm reading this rust code correctly):

if !text.contains(|c: char| c.is_ascii_whitespace() || c == '\\') {
self.string_candidates
.insert(text.to_string(), (range.start_point.row + 1) as u64);
}

Can we make the rust parser be more selective on which strings might be dependencies? In particular, a string that only consists of . chars, or maybe any string that ends in . should be excluded (as that would also catch points 2 and 3: 'a.b.' and 'retrying...').

Maybe with something like this:

        if !text.ends_with(".") && !text.contains(|c: char| c.is_ascii_whitespace() || c == '\\') {
            self.string_candidates
                .insert(text.to_string(), (range.start_point.row + 1) as u64);
        }

Oh. That wouldn't work because the detected strings are used for both imports and assets.
So, the additional logic to exclude irrelevant strings would need to go in the python side of things:

if python_infer_subsystem.string_imports or python_infer_subsystem.assets:
for string, line in native_result.string_candidates.items():
slash_count = string.count("/")
if (
python_infer_subsystem.string_imports
and not slash_count
and string.count(".") >= python_infer_subsystem.string_imports_min_dots
):
imports.setdefault(string, (line, True))
if (
python_infer_subsystem.assets
and slash_count >= python_infer_subsystem.assets_min_slashes
):
assets.add(string)

So, maybe:

    if python_infer_subsystem.string_imports or python_infer_subsystem.assets:
        for string, line in native_result.string_candidates.items():
            slash_count = string.count("/")
            if (
                python_infer_subsystem.string_imports
                and not slash_count
                and not string.endswith(".")
                and string.count(".") >= python_infer_subsystem.string_imports_min_dots
            ):
                imports.setdefault(string, (line, True))
            if (
                python_infer_subsystem.assets
                and slash_count >= python_infer_subsystem.assets_min_slashes
            ):
                assets.add(string)

@cognifloyd

This comment was marked as resolved.

@cognifloyd
Copy link
Member

@liudonggalaxy Do you get any warnings or errors for your last two points:

  1. For the string BASE_PATH = 'a.b.', the Rust parser mistakenly treats it as a dependency, unlike the classic parser.
  2. In print('retrying...'), the Rust parser incorrectly interprets it as a dependency, while the classic parser does not.

@liudonggalaxy
Copy link
Author

liudonggalaxy commented Jan 31, 2024

mkdir -p pants_inference
pants --python-infer-no-use-rust-parser peek :: > pants_inference/classic_parser.json
pants --python-infer-use-rust-parser peek :: > pants_inference/rust_parser.json
diff pants_inference/classic_parser.json pants_inference/rust_parser.json

Hello,
I didn't observe any warnings or errors, but the following CircleCI test failed in our environment:

mkdir -p pants_inference
pants --python-infer-no-use-rust-parser peek :: > pants_inference/classic_parser.json
pants --python-infer-use-rust-parser peek :: > pants_inference/rust_parser.json
diff pants_inference/classic_parser.json pants_inference/rust_parser.json

To resolve this issue and ensure the CircleCI job passes, we modified the following source code.

BASE_PATH = 'a.b.'
print('retrying...')

->

BASE_PATH ='.'.join(['a', 'b']) + '.'
print('retrying' + '...')

cognifloyd added a commit that referenced this issue Feb 1, 2024
Ignores strings that are not valid python modules such as strings that
end in "." (such as "." and "..").

Related: #20324
WorkerPants pushed a commit that referenced this issue Feb 2, 2024
Ignores strings that are not valid python modules such as strings that
end in "." (such as "." and "..").

Related: #20324
cognifloyd added a commit that referenced this issue Feb 2, 2024
Ignores any strings that have the ignore pragma comment.

This used to work with the python-based parser, but apparently there
weren't tests so it was not carried into the rust-based parser.

Related: #20324, #20472
WorkerPants pushed a commit that referenced this issue Feb 2, 2024
Ignores any strings that have the ignore pragma comment.

This used to work with the python-based parser, but apparently there
weren't tests so it was not carried into the rust-based parser.

Related: #20324, #20472
cognifloyd added a commit that referenced this issue Feb 6, 2024
…20483)

Ignores strings that are not valid python modules such as strings that
end in "." (such as "." and "..").

Related: #20324 (fixes only points 2 and 3 where the strings end in
".").

Co-authored-by: Jacob Floyd <[email protected]>
cognifloyd added a commit that referenced this issue Feb 6, 2024
#20484)

Ignores any strings that have the ignore pragma comment.

This used to work with the python-based parser, but apparently there
weren't tests so it was not carried into the rust-based parser.

Related: #20324, #20472

Co-authored-by: Jacob Floyd <[email protected]>
@cognifloyd
Copy link
Member

With #20472, only dotted strings that are made up of valid python identifiers will be be considered as possible imports. So, points 2 and 3 should be resolved in pants 2.20 or the 2.19.1 (once released).

To fix the first issue, someone needs to figure out how to deal with concatenated strings with the tree sitter in the dependency parser.

@huonw huonw added this to the 2.18.x milestone Feb 19, 2024
cognifloyd added a commit that referenced this issue Feb 24, 2024
Ignores strings that are not valid python modules such as strings that
end in "." (such as "." and "..").

Related: #20324
WorkerPants pushed a commit that referenced this issue Feb 24, 2024
Ignores any strings that have the ignore pragma comment.

This used to work with the python-based parser, but apparently there
weren't tests so it was not carried into the rust-based parser.

Related: #20324, #20472
cognifloyd added a commit that referenced this issue Feb 24, 2024
#20601)

Ignores any strings that have the ignore pragma comment.

This used to work with the python-based parser, but apparently there
weren't tests so it was not carried into the rust-based parser.

Related: #20324, #20472

Co-authored-by: Jacob Floyd <[email protected]>
cognifloyd added a commit that referenced this issue Feb 24, 2024
…20599)

Ignores strings that are not valid python modules such as strings that
end in "." (such as "." and "..").

Related: #20324 (fixes only points 2 and 3 where the strings end in
".").
@huonw huonw modified the milestones: 2.18.x, 2.19.x Mar 4, 2024
@huonw huonw modified the milestones: 2.19.x, 2.20.x Jun 4, 2024
@huonw huonw modified the milestones: 2.20.x, 2.21.x Aug 1, 2024
@huonw huonw modified the milestones: 2.21.x, 2.22.x Nov 2, 2024
@huonw huonw modified the milestones: 2.22.x, 2.23.x Dec 17, 2024
@huonw huonw modified the milestones: 2.23.x, 2.24.x Feb 9, 2025
@tdyas
Copy link
Contributor

tdyas commented Mar 6, 2025

Tree-sitter uses KindId::CONCATENATED_STRING type to represent concatenated strings which is a separate node type from KindId::STRING. KindId::CONCATENATED_STRING has children nodes which are KindId::STRING nodes.

The tree-sitter playground is useful for figuring these out.

For example, __import__("foo" ".bar") is parsed as:

[module](https://tree-sitter.github.io/tree-sitter/7-playground.html#) [0, 0] - [1, 0]
  [expression_statement](https://tree-sitter.github.io/tree-sitter/7-playground.html#) [0, 0] - [0, 24]
    [call](https://tree-sitter.github.io/tree-sitter/7-playground.html#) [0, 0] - [0, 24]
      function: [identifier](https://tree-sitter.github.io/tree-sitter/7-playground.html#) [0, 0] - [0, 10]
      arguments: [argument_list](https://tree-sitter.github.io/tree-sitter/7-playground.html#) [0, 10] - [0, 24]
        [concatenated_string](https://tree-sitter.github.io/tree-sitter/7-playground.html#) [0, 11] - [0, 23]
          [string](https://tree-sitter.github.io/tree-sitter/7-playground.html#) [0, 11] - [0, 16]
            [string_start](https://tree-sitter.github.io/tree-sitter/7-playground.html#) [0, 11] - [0, 12]
            [string_content](https://tree-sitter.github.io/tree-sitter/7-playground.html#) [0, 12] - [0, 15]
            [string_end](https://tree-sitter.github.io/tree-sitter/7-playground.html#) [0, 15] - [0, 16]
          [string](https://tree-sitter.github.io/tree-sitter/7-playground.html#) [0, 17] - [0, 23]
            [string_start](https://tree-sitter.github.io/tree-sitter/7-playground.html#) [0, 17] - [0, 18]
            [string_content](https://tree-sitter.github.io/tree-sitter/7-playground.html#) [0, 18] - [0, 22]
            [string_end](https://tree-sitter.github.io/tree-sitter/7-playground.html#) [0, 22] - [0, 23]

@tdyas
Copy link
Contributor

tdyas commented Mar 6, 2025

Rough draft WIP fix at #22050 for handling concatenated string literals. Needs tests (beyond the minimal example added already) and some evaluation whether a refactor of some of the parser is worthwhile as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend: Python Python backend-related issues bug
Projects
None yet
Development

No branches or pull requests

5 participants