Commit 94b0b52

Add support for parsing f-string as per PEP 701 (#7041)
This PR adds support for PEP 701 in the parser, using the new tokens emitted by the lexer to construct the f-string node. Without an official grammar, f-strings used to be parsed manually. Now that we have the specification, it is encoded in the LALRPOP grammar to parse f-strings.

This file includes the logic for parsing string literals and joining implicit string concatenations. Since f-strings no longer need to be parsed manually, a lot of the code involved in doing so is removed.

Earlier, there were 2 entry points to this module:

* `parse_string`: Used to parse a single string literal
* `parse_strings`: Used to parse strings which were implicitly concatenated

Now, there are 3 entry points:

* `parse_string_literal`: Renamed from `parse_string`
* `parse_fstring_middle`: Used to parse a `FStringMiddle` token, which is basically a string literal without the quotes
* `concatenate_strings`: Renamed from `parse_strings`, but it now takes the parsed nodes, so we just need to concatenate them into a single node

> A short primer on the `FStringMiddle` token: it covers the portion of text inside the f-string that is not part of an expression and is not an opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the `foo `, `.3f`, and ` bar` parts are `FStringMiddle` token content.

Discussion in the official implementation: python/cpython#102855 (comment)

The AST changes when unicode strings (prefixed with `u`) and f-strings are used in an implicitly concatenated string value. For example:

```python
u"foo" f"{bar}" "baz" " some"
```

Before Python 3.12, the `kind` field would be assigned only if the prefix was on the first string. So, taking the above example, both `"foo"` and `"baz some"` (the implicit concatenation) would be given the `u` kind:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
    value=Name(id='bar', ctx=Load()),
    conversion=-1),
Constant(value='baz some', kind='u')
```

</p>
</details>

But from Python 3.12, only the string that carries the `u` prefix is assigned the kind:

<details><summary>3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
    value=Name(id='bar', ctx=Load()),
    conversion=-1),
Constant(value='baz some')
```

</p>
</details>

Here are some more examples of the change:

1. `"foo" f"{bar}" u"baz" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
    value=Name(id='bar', ctx=Load()),
    conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
    value=Name(id='bar', ctx=Load()),
    conversion=-1),
Constant(value='bazno', kind='u')
```

</p>
</details>

2. `"foo" f"{bar}" "baz" u"no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
    value=Name(id='bar', ctx=Load()),
    conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
    value=Name(id='bar', ctx=Load()),
    conversion=-1),
Constant(value='bazno')
```

</p>
</details>

3. `u"foo" f"bar {baz} realy" u"bar" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
    value=Name(id='baz', ctx=Load()),
    conversion=-1),
Constant(value=' realybarno', kind='u')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
    value=Name(id='baz', ctx=Load()),
    conversion=-1),
Constant(value=' realybarno')
```

</p>
</details>

With the hand-written parser, we were able to provide better error messages for cases such as the following, but those messages are all removed now; in those cases, LALRPOP throws a generic "unexpected token" error:

* A closing delimiter was not opened properly
* An opening delimiter was not closed properly
* Empty expression not allowed

The "Too many nested expressions in an f-string" error was removed; we can create a lint rule for that instead. And "The f-string expression cannot include the given character" was removed because f-strings now support those characters, which are mainly the same quotes as the outer ones, escape sequences, comments, etc.

Test plan:

1. Refactor existing test cases to use `parse_suite` instead of `parse_fstrings` (which doesn't exist anymore)
2. Additional test cases are added as required

Updated the snapshots. The change from `parse_fstrings` to `parse_suite` means that the snapshot produces the module node instead of just a list of f-string parts. I've manually verified that the parts are still the same, along with the node ranges.

#7263 (comment)

fixes: #7043
fixes: #6835
1 parent: 3839819

31 files changed: +24100, -16245 lines
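The change repeated across the file diffs below is mechanical: `parse_tokens` and its relatives now take the original source text alongside the token stream. Here is a minimal sketch of the new shape, using only the public API visible in this commit (`lexer::lex` and `parse_tokens`); the comment about why the source is needed is an assumption, not something this diff states:

```rust
use ruff_python_parser::{lexer::lex, parse_tokens, Mode};

fn main() {
    // A PEP 701 f-string: reusing the outer quote character inside a
    // replacement field was rejected by the old hand-written parser.
    let source = r#"f"outer {"inner"} done""#;
    let tokens = lex(source, Mode::Module);
    // The source text now travels with the tokens, presumably so the grammar
    // actions can recover literal text such as `FStringMiddle` contents.
    let module = parse_tokens(tokens, source, Mode::Module, "<embedded>")
        .expect("should be a valid Python program");
    println!("{module:#?}");
}
```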

crates/ruff_benchmark/benches/formatter.rs (+1 -1)

```diff
@@ -65,7 +65,7 @@ fn benchmark_formatter(criterion: &mut Criterion) {
         let comment_ranges = comment_ranges.finish();
 
         // Parse the AST.
-        let module = parse_tokens(tokens, Mode::Module, "<filename>")
+        let module = parse_tokens(tokens, source, Mode::Module, "<filename>")
             .expect("Input to be a valid python program");
 
         b.iter(|| {
```

crates/ruff_linter/src/linter.rs (+1)

```diff
@@ -143,6 +143,7 @@ pub fn check_path(
     if use_ast || use_imports || use_doc_lines {
         match ruff_python_parser::parse_program_tokens(
             tokens,
+            source_kind.source_code(),
             &path.to_string_lossy(),
             source_type.is_ipynb(),
         ) {
```

crates/ruff_python_ast/src/nodes.rs (+8)

```diff
@@ -2600,6 +2600,14 @@ impl Constant {
             _ => false,
         }
     }
+
+    /// Returns `true` if the constant is a string constant that is a unicode string (i.e., `u"..."`).
+    pub fn is_unicode_string(&self) -> bool {
+        match self {
+            Constant::Str(value) => value.unicode,
+            _ => false,
+        }
+    }
 }
 
 #[derive(Clone, Debug, PartialEq, Eq)]
```
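A hedged sketch of how the new helper lines up with the 3.12 semantics described in the commit message; `parse_suite` is the same entry point the new tests below use, and the debug dump is just a quick way to inspect which `Constant` keeps the `u` kind:

```rust
use ruff_python_parser::parse_suite;

fn main() {
    // Under the 3.12 semantics, only the literal that actually carries the
    // `u` prefix should report `Constant::is_unicode_string() == true`; the
    // implicitly concatenated `"baz some"` no longer inherits the kind.
    let suite = parse_suite(r#"u"foo" f"{bar}" "baz" " some""#, "<test>").unwrap();
    println!("{suite:#?}");
}
```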

crates/ruff_python_ast/tests/preorder.rs (+1 -1)

```diff
@@ -130,7 +130,7 @@ fn function_type_parameters() {
 
 fn trace_preorder_visitation(source: &str) -> String {
     let tokens = lex(source, Mode::Module);
-    let parsed = parse_tokens(tokens, Mode::Module, "test.py").unwrap();
+    let parsed = parse_tokens(tokens, source, Mode::Module, "test.py").unwrap();
 
     let mut visitor = RecordVisitor::default();
     visitor.visit_mod(&parsed);
```

crates/ruff_python_ast/tests/visitor.rs (+1 -1)

```diff
@@ -131,7 +131,7 @@ fn function_type_parameters() {
 
 fn trace_visitation(source: &str) -> String {
     let tokens = lex(source, Mode::Module);
-    let parsed = parse_tokens(tokens, Mode::Module, "test.py").unwrap();
+    let parsed = parse_tokens(tokens, source, Mode::Module, "test.py").unwrap();
 
     let mut visitor = RecordVisitor::default();
     walk_module(&mut visitor, &parsed);
```

crates/ruff_python_formatter/src/cli.rs (+1 -1)

```diff
@@ -44,7 +44,7 @@ pub fn format_and_debug_print(source: &str, cli: &Cli, source_type: &Path) -> Re
 
     // Parse the AST.
     let module =
-        parse_ok_tokens(tokens, Mode::Module, "<filename>").context("Syntax error in input")?;
+        parse_ok_tokens(tokens, source, Mode::Module, "<filename>").context("Syntax error in input")?;
 
     let options = PyFormatOptions::from_extension(source_type);
 
```
crates/ruff_python_formatter/src/comments/mod.rs (+1 -1)

```diff
@@ -567,7 +567,7 @@ mod tests {
         let source_code = SourceCode::new(source);
         let (tokens, comment_ranges) =
             tokens_and_ranges(source).expect("Expect source to be valid Python");
-        let parsed = parse_ok_tokens(tokens, Mode::Module, "test.py")
+        let parsed = parse_ok_tokens(tokens, source, Mode::Module, "test.py")
            .expect("Expect source to be valid Python");
 
         CommentsTestCase {
```

crates/ruff_python_formatter/src/lib.rs (+2 -2)

```diff
@@ -127,7 +127,7 @@ pub fn format_module_source(
     options: PyFormatOptions,
 ) -> Result<Printed, FormatModuleError> {
     let (tokens, comment_ranges) = tokens_and_ranges(source)?;
-    let module = parse_ok_tokens(tokens, Mode::Module, "<filename>")?;
+    let module = parse_ok_tokens(tokens, source, Mode::Module, "<filename>")?;
     let formatted = format_module_ast(&module, &comment_ranges, source, options)?;
     Ok(formatted.print()?)
 }
@@ -213,7 +213,7 @@ def main() -> None:
 
     // Parse the AST.
     let source_path = "code_inline.py";
-    let module = parse_ok_tokens(tokens, Mode::Module, source_path).unwrap();
+    let module = parse_ok_tokens(tokens, source, Mode::Module, source_path).unwrap();
     let options = PyFormatOptions::from_extension(Path::new(source_path));
     let formatted = format_module_ast(&module, &comment_ranges, source, options).unwrap();
 
```

crates/ruff_python_parser/src/lib.rs (+2 -1)

```diff
@@ -146,6 +146,7 @@ pub fn tokenize(contents: &str, mode: Mode) -> Vec<LexResult> {
 /// Parse a full Python program from its tokens.
 pub fn parse_program_tokens(
     lxr: Vec<LexResult>,
+    source: &str,
     source_path: &str,
     is_jupyter_notebook: bool,
 ) -> anyhow::Result<Suite, ParseError> {
@@ -154,7 +155,7 @@ pub fn parse_program_tokens(
     } else {
         Mode::Module
     };
-    match parse_tokens(lxr, mode, source_path)? {
+    match parse_tokens(lxr, source, mode, source_path)? {
         Mod::Module(m) => Ok(m.body),
         Mod::Expression(_) => unreachable!("Mode::Module doesn't return other variant"),
     }
```
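For callers that go through `parse_program_tokens`, the update mirrors the linter change above. A minimal sketch, assuming the `<embedded>` path and the non-notebook flag are placeholders:

```rust
use ruff_python_ast::Suite;
use ruff_python_parser::{parse_program_tokens, tokenize, Mode, ParseError};

// Tokenize and parse a module, threading the source text through as the new
// signature requires. `false` means "not a Jupyter notebook", so
// `Mode::Module` is selected internally.
fn parse_source(source: &str) -> Result<Suite, ParseError> {
    let tokens = tokenize(source, Mode::Module);
    parse_program_tokens(tokens, source, "<embedded>", false)
}
```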

crates/ruff_python_parser/src/parser.rs (+56 -6)

```diff
@@ -50,7 +50,7 @@ use ruff_python_ast::{Mod, ModModule, Suite};
 /// ```
 pub fn parse_program(source: &str, source_path: &str) -> Result<ModModule, ParseError> {
     let lexer = lex(source, Mode::Module);
-    match parse_tokens(lexer, Mode::Module, source_path)? {
+    match parse_tokens(lexer, source, Mode::Module, source_path)? {
         Mod::Module(m) => Ok(m),
         Mod::Expression(_) => unreachable!("Mode::Module doesn't return other variant"),
     }
@@ -78,7 +78,7 @@ pub fn parse_suite(source: &str, source_path: &str) -> Result<Suite, ParseError>
 /// ```
 pub fn parse_expression(source: &str, source_path: &str) -> Result<ast::Expr, ParseError> {
     let lexer = lex(source, Mode::Expression);
-    match parse_tokens(lexer, Mode::Expression, source_path)? {
+    match parse_tokens(lexer, source, Mode::Expression, source_path)? {
         Mod::Expression(expression) => Ok(*expression.body),
         Mod::Module(_m) => unreachable!("Mode::Expression doesn't return other variant"),
     }
@@ -107,7 +107,7 @@ pub fn parse_expression_starts_at(
     offset: TextSize,
 ) -> Result<ast::Expr, ParseError> {
     let lexer = lex_starts_at(source, Mode::Module, offset);
-    match parse_tokens(lexer, Mode::Expression, source_path)? {
+    match parse_tokens(lexer, source, Mode::Expression, source_path)? {
         Mod::Expression(expression) => Ok(*expression.body),
         Mod::Module(_m) => unreachable!("Mode::Expression doesn't return other variant"),
     }
@@ -193,7 +193,7 @@ pub fn parse_starts_at(
     offset: TextSize,
 ) -> Result<Mod, ParseError> {
     let lxr = lexer::lex_starts_at(source, mode, offset);
-    parse_tokens(lxr, mode, source_path)
+    parse_tokens(lxr, source, mode, source_path)
 }
 
 /// Parse an iterator of [`LexResult`]s using the specified [`Mode`].
@@ -208,18 +208,21 @@ pub fn parse_starts_at(
 /// ```
 /// use ruff_python_parser::{lexer::lex, Mode, parse_tokens};
 ///
-/// let expr = parse_tokens(lex("1 + 2", Mode::Expression), Mode::Expression, "<embedded>");
+/// let source = "1 + 2";
+/// let expr = parse_tokens(lex(source, Mode::Expression), source, Mode::Expression, "<embedded>");
 /// assert!(expr.is_ok());
 /// ```
 pub fn parse_tokens(
     lxr: impl IntoIterator<Item = LexResult>,
+    source: &str,
     mode: Mode,
     source_path: &str,
 ) -> Result<Mod, ParseError> {
     let lxr = lxr.into_iter();
 
     parse_filtered_tokens(
         lxr.filter_ok(|(tok, _)| !matches!(tok, Tok::Comment { .. } | Tok::NonLogicalNewline)),
+        source,
         mode,
         source_path,
     )
@@ -228,6 +231,7 @@ pub fn parse_tokens(
 /// Parse tokens into an AST like [`parse_tokens`], but we already know all tokens are valid.
 pub fn parse_ok_tokens(
     lxr: impl IntoIterator<Item = Spanned>,
+    source: &str,
     mode: Mode,
     source_path: &str,
 ) -> Result<Mod, ParseError> {
@@ -245,13 +249,15 @@ pub fn parse_ok_tokens(
 
 fn parse_filtered_tokens(
     lxr: impl IntoIterator<Item = LexResult>,
+    source: &str,
     mode: Mode,
     source_path: &str,
 ) -> Result<Mod, ParseError> {
     let marker_token = (Tok::start_marker(mode), TextRange::default());
     let lexer = iter::once(Ok(marker_token)).chain(lxr);
     python::TopParser::new()
         .parse(
+            source,
             mode,
             lexer.map_ok(|(t, range)| (range.start(), t, range.end())),
         )
@@ -1253,11 +1259,55 @@ a = 1
 "#
         .trim();
         let lxr = lexer::lex_starts_at(source, Mode::Ipython, TextSize::default());
-        let parse_err = parse_tokens(lxr, Mode::Module, "<test>").unwrap_err();
+        let parse_err = parse_tokens(lxr, source, Mode::Module, "<test>").unwrap_err();
         assert_eq!(
             parse_err.to_string(),
             "IPython escape commands are only allowed in `Mode::Ipython` at byte offset 6"
                 .to_string()
         );
     }
+
+    #[test]
+    fn test_fstrings() {
+        let parse_ast = parse_suite(
+            r#"
+f"{" f"}"
+f"{foo!s}"
+f"{3,}"
+f"{3!=4:}"
+f'{3:{"}"}>10}'
+f'{3:{"{"}>10}'
+f"{ foo = }"
+f"{ foo = :.3f }"
+f"{ foo = !s }"
+f"{ 1, 2 = }"
+f'{f"{3.1415=:.1f}":*^20}'
+
+{"foo " f"bar {x + y} " "baz": 10}
+match foo:
+    case "foo " f"bar {x + y} " "baz":
+        pass
+"#
+            .trim(),
+            "<test>",
+        )
+        .unwrap();
+        insta::assert_debug_snapshot!(parse_ast);
+    }
+
+    #[test]
+    fn test_fstrings_with_unicode() {
+        let parse_ast = parse_suite(
+            r#"
+u"foo" f"{bar}" "baz" " some"
+"foo" f"{bar}" u"baz" " some"
+"foo" f"{bar}" "baz" u" some"
+u"foo" f"bar {baz} really" u"bar" "no"
+"#
+            .trim(),
+            "<test>",
+        )
+        .unwrap();
+        insta::assert_debug_snapshot!(parse_ast);
+    }
 }
```
