
feat(yaml): overhauling YAML lexer #6481


Merged

Conversation


@vohoanglong0107 vohoanglong0107 commented Jun 22, 2025

Summary

The current indentation handling model employed by the YAML lexer is not sophisticated enough to handle YAML compact forms, which allow a new collection to start without explicit indentation.

For more context, normally YAML requires a new collection, whether a sequence or a mapping, to start with an indent:

a:
  b:
    - x
  d: e

In contrast, compact forms allow a collection to start on the same line as its parent collection entry:

- a: b
  c: d

This breaks the current assumption held by the YAML lexer, which expects all nested collections to begin and end with indent and dedent tokens.

For example, given this nested compact sequence, the lexer currently detects only a single indent token, preceding the `-` token on the second line, which results in the following token stream:

- - - - -
    -
-
# Whitespace tokens omitted for brevity
DASH:1
DASH:1
DASH:1
DASH:1
DASH:1
INDENT:5
DASH:1
DEDENT:1
DASH:1

Given the above stream, the YAML parser doesn't have enough information to correctly identify that the second-line `-` token belongs to the third-level nested sequence, and that the final `-` belongs to the top-level collection, purely from indentation.

Similarly, when dealing with compact mappings, one can't confidently tell from a single indent token that the following nested mapping is wrongly indented (unless we count the number of characters):

- a: b
    c: d

Additionally, a side effect of exposing indentation information to the parser is that the lexer must also emit a newline token whenever a newline occurs without a change in indent level. These tokens are tedious to work with in the parser, even though newlines only matter in certain contexts.

To deal with the problems described above, this PR introduces a new scope concept to the lexer. In short, any token that appears to the right of a scope's boundary belongs to that scope. For example:

a: [1, 2
 ]

After the first `:` token, the lexer enters a block mapping scope with its boundary at column 2. Any token appearing beyond column 2 belongs to this scope.
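The scope idea can be sketched roughly as follows. This is an illustrative model only: `Scope` and `innermost_scope_for` are hypothetical names, not Biome's actual types.

```rust
// Hypothetical sketch of the scope idea: scopes form a stack, and a
// token belongs to a scope when its column lies strictly to the right
// of that scope's boundary column.
#[derive(Debug, Clone, Copy)]
struct Scope {
    /// Column of the scope's left boundary.
    boundary: u32,
}

impl Scope {
    /// A token starting at `column` belongs to this scope when it
    /// appears beyond the boundary.
    fn contains(&self, column: u32) -> bool {
        column > self.boundary
    }
}

/// Returns the index of the innermost scope on the stack that still
/// contains a token starting at `column`, if any.
fn innermost_scope_for(scopes: &[Scope], column: u32) -> Option<usize> {
    scopes.iter().rposition(|scope| scope.contains(column))
}
```

With a stack of scopes, the innermost scope that still contains the token's column wins; a token left of every boundary closes all scopes.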

Now, instead of emitting indent and dedent tokens, the YAML lexer emits MAPPING_START, MAPPING_END, SEQUENCE_START, SEQUENCE_END, FLOW_START, and FLOW_END tokens to signal to the parser when to start and end a collection. All disambiguation logic, such as distinguishing a flow collection ([1, 2, 3]) from the start of a new mapping (a plain token followed by `:`, as in [1, 2, 3]:), is handled in the lexer.

Sorry that this took a while to get out. It doesn't help that the YAML spec has so many edge cases.

After this, the next major blocker is dealing with YAML properties, which have quite challenging indentation rules.

Test Plan

This PR adds more tests for the YAML lexer. The most notable ones:

  • Correctly identifies a mis-indented sequence as a single plain token. For example:
    - a
     - b
     - c
     - d
  • Limits the impact of missing closing indicators for flow collections to only the closest scope. For example, in the document below, the lexer can identify that the [1, 2, 3] collection is missing a closing bracket and correctly inserts a FLOW_END token before emitting the next mapping key:
    a:
      b: [1, 2, 3
    c:
  • Correctly handles a sequence nested under a mapping without indentation. For example, the following document is equivalent to {"a": ["b"]}:
    a:
    - b
  • Correctly handles nested compact forms:
    - a: b
      c: d


changeset-bot bot commented Jun 22, 2025

⚠️ No Changeset found

Latest commit: a2cec48

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets


@github-actions github-actions bot added A-Parser Area: parser A-Tooling Area: internal tools labels Jun 22, 2025
@vohoanglong0107 vohoanglong0107 changed the title feat(yaml): YAML lexer overhaul feat(yaml): overhauling YAML lexer Jun 22, 2025
@vohoanglong0107 vohoanglong0107 force-pushed the moving-state-to-yaml-lexer branch from 045676c to 7cb344c Compare June 22, 2025 15:26

codspeed-hq bot commented Jun 22, 2025

CodSpeed Performance Report

Merging #6481 will not alter performance

Comparing vohoanglong0107:moving-state-to-yaml-lexer (a2cec48) with main (9590493)

Summary

✅ 115 untouched benchmarks

@vohoanglong0107 vohoanglong0107 force-pushed the moving-state-to-yaml-lexer branch from 7cb344c to 0af865f Compare June 28, 2025 13:49
@vohoanglong0107 vohoanglong0107 requested review from a team June 28, 2025 13:50
@vohoanglong0107 vohoanglong0107 force-pushed the moving-state-to-yaml-lexer branch from 0af865f to 010da6b Compare June 28, 2025 14:06
@vohoanglong0107 vohoanglong0107 force-pushed the moving-state-to-yaml-lexer branch from 010da6b to 6b2c15b Compare June 28, 2025 14:07
@vohoanglong0107 vohoanglong0107 marked this pull request as ready for review June 28, 2025 14:09
@dyc3 dyc3 self-requested a review June 28, 2025 15:14
Contributor

@dyc3 dyc3 left a comment

Some questions -- I'd like to understand your changes a little better


pub(crate) struct YamlTokenSource<'source> {
-    lexer: BufferedLexer<YamlSyntaxKind, YamlLexer<'source>>,
+    lexer: YamlLexer<'source>,
Contributor

Why did you remove the BufferedLexer here? That's what lets the parser peek ahead of the current token.

Contributor Author

Right, the main issue is that the YamlLexer is quite incompatible with BufferedLexer.
BufferedLexer's lookahead function relies on checkpoints and rewinding, which involves advancing the lexer by n tokens and then rewinding back to the latest checkpoint.
To disambiguate between different YAML constructs, the YamlLexer needs to emit a bunch of tokens at once, rather than just one like the other lexers. For example, to check whether the first inline collection [a, b, c] is a mapping key, it has to lex the entire collection, then check for the presence of :.

- [a, b, c]: abc
  [x, y, z]: xyz

Once it's certain that the collection is a mapping key, it emits a MAPPING_START token before emitting the rest.
For this reason, it's not trivial to implement checkpointing and rewinding for this lexer. The fact that this lexer manages some internal state for the disambiguation logic only adds to the complexity.
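The batching behaviour described here can be sketched like this. `BatchingLexer` and its methods are hypothetical illustrations, not the actual YamlLexer API:

```rust
use std::collections::VecDeque;

// Hypothetical sketch (not Biome's actual code): to decide whether a
// flow collection like `[a, b, c]` is a mapping key, the lexer must
// lex the whole collection first, and only then can it place a
// MAPPING_START in front of the buffered tokens before handing any
// of them to the parser.
#[derive(Debug, PartialEq)]
enum Tok {
    MappingStart,
    FlowStart,
    Plain(&'static str),
    FlowEnd,
    Colon,
}

struct BatchingLexer {
    pending: VecDeque<Tok>,
}

impl BatchingLexer {
    fn lex_flow_collection(&mut self, items: &[&'static str], followed_by_colon: bool) {
        let mut batch = vec![Tok::FlowStart];
        batch.extend(items.iter().copied().map(Tok::Plain));
        batch.push(Tok::FlowEnd);
        if followed_by_colon {
            // Only after lexing the whole collection do we know it was
            // a mapping key, so MAPPING_START goes in front of the batch.
            self.pending.push_back(Tok::MappingStart);
        }
        self.pending.extend(batch);
        if followed_by_colon {
            self.pending.push_back(Tok::Colon);
        }
    }

    /// The parser drains tokens one at a time from the buffered batch.
    fn next_token(&mut self) -> Option<Tok> {
        self.pending.pop_front()
    }
}
```

Because a single lexing step can enqueue many tokens and retroactively prepend one, a checkpoint would have to capture the whole pending queue plus the disambiguation state, which is why rewinding is not trivial here.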

source
}

-    fn next_non_trivia_token(&mut self, context: YamlLexContext, first_token: bool) {
+    fn next_non_trivia_token(&mut self, first_token: bool) {
Contributor

Also, why did you remove the lex context?

Contributor Author

So previously the lexer was stateless: it relied on the parser's instructions on how to lex the next token, which is what the context was used for. Now the lexer manages its own state and knows whether to lex a mapping token, a sequence token, and so on. The parser no longer needs to pass that information down.
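The contrast can be sketched as follows; `LexState` and `StatefulLexer` are invented names for illustration, not the real API:

```rust
// Hypothetical contrast: instead of the parser passing a context
// (like the removed YamlLexContext) into every call, the lexer keeps
// track of what it is currently lexing itself.
#[derive(Debug, Clone, Copy, PartialEq)]
enum LexState {
    Document,
    BlockMapping,
    BlockSequence,
}

struct StatefulLexer {
    state: LexState,
}

impl StatefulLexer {
    /// The lexer updates its own state when it sees a structural
    /// indicator, so callers no longer pass a context down.
    fn observe(&mut self, indicator: char) {
        self.state = match indicator {
            ':' => LexState::BlockMapping,
            '-' => LexState::BlockSequence,
            _ => self.state,
        };
    }
}
```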

/// If the source starts with a Unicode BOM, this is the number of bytes for that token.
unicode_bom_length: usize,
/// Where the lexer is in the source
current_coordinate: TextCoordinate,
Contributor

Are you keeping track of the document's indentation size anywhere?

Contributor Author

Yes we are, through this current_coordinate, which stores the current column, and BlockScope. As stated above, since indentation (a newline followed by spaces) alone is no longer enough to correctly determine whether the lexer is at the start of a new collection, the lexer now tracks the column of each token and compares it with the leftmost column of the current scope.
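A rough sketch of this column-tracking idea, where `Coordinate` and `starts_new_collection` are illustrative stand-ins for the real TextCoordinate and BlockScope machinery:

```rust
// Hypothetical sketch: the lexer advances a (line, column) coordinate
// over the source and compares a token's column against the leftmost
// column of the current scope.
#[derive(Debug, Default, Clone, Copy)]
struct Coordinate {
    line: u32,
    column: u32,
}

impl Coordinate {
    /// Advance past one character, resetting the column on newlines.
    fn advance(&mut self, ch: char) {
        if ch == '\n' {
            self.line += 1;
            self.column = 0;
        } else {
            self.column += 1;
        }
    }
}

/// A token can only start a new nested collection when it sits
/// further right than the current scope's leftmost column.
fn starts_new_collection(token_column: u32, scope_leftmost_column: u32) -> bool {
    token_column > scope_leftmost_column
}
```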

@vohoanglong0107 vohoanglong0107 requested a review from dyc3 June 29, 2025 13:26
Member

@ematipico ematipico left a comment

Thank you @dyc3 for the insightful questions, and @vohoanglong0107 for the great effort you put into this and the answers you gave. It's pretty clear now.

I think the PR doesn't require many changes as it is, and since we now have more tests, they form a good harness.

Something you might want to consider at a later stage is making this "stateful lexer" more generic, so that future parsers could reuse it.

Comment on lines +33 to +36
diagnostics: Vec::new(),
scopes: Default::default(),
current_coordinate: Default::default(),
tokens: LexToken::default().into(),
Member

You still want to track and lex the BOM Unicode character.

Contributor Author

You're right, I completely forgot about it. Since one YAML file can have multiple BOMs, one per document, let me figure out how to proceed in another PR.

Comment on lines +513 to +514
fn rewind(&mut self, _: LexerCheckpoint<Self::Kind>) {
    unimplemented!()
Member

If you don't plan to implement rewind, we might want to delete it.

Contributor Author

Oh, it's a required fn of the Lexer trait, so I can't delete it.

let m = p.start();
p.expect(FLOW_START);
parse_flow_yaml_node(p);
// debug_assert!(p.at(FLOW_END));
Member

Remove.

Contributor

@dyc3 dyc3 left a comment

LGTM!

@vohoanglong0107 vohoanglong0107 merged commit 09ba0ab into biomejs:main Jun 30, 2025
29 checks passed
@vohoanglong0107 vohoanglong0107 deleted the moving-state-to-yaml-lexer branch June 30, 2025 13:12
Labels
A-Parser Area: parser A-Tooling Area: internal tools