
feat(yaml): overhauling YAML lexer #6481


Merged

Conversation


@vohoanglong0107 vohoanglong0107 commented Jun 22, 2025

Summary

The current indentation handling model employed by the YAML lexer is not sophisticated enough to handle YAML compact forms, which allow a new collection to start without explicit indentation.

For more context, normally YAML requires a new collection, whether a sequence or a mapping, to start with an indent:

a:
  b:
    - x
  d: e

In contrast, compact forms allow a collection to start on the same line as its parent collection entry:

- a: b
  c: d

This breaks the current assumption held by the YAML lexer, which expects all nested collections to begin and end with indent and dedent tokens.

For example, given this nested compact sequence, the lexer currently detects only a single indent token, preceding the `-` token on the second line, which results in the following token stream:

- - - - -
    -
-
# Whitespace tokens omitted for brevity
DASH:1
DASH:1
DASH:1
DASH:1
DASH:1
INDENT:5
DASH:1
DEDENT:1
DASH:1

Given the above stream, the YAML parser doesn't have enough information to correctly identify that the second-line `-` token belongs to the third-level nested sequence, and that the final `-` belongs to the top-level collection, purely from indentation.

Similarly, when dealing with compact mappings, one can't confidently tell from a single indent token that the following nested mapping is wrongly indented (unless we count the number of characters):

- a: b
    c: d

Additionally, a side effect of exposing indentation information to the parser is that the lexer must also emit a newline token whenever a newline occurs without a change in indent level. These tokens are tedious to work with in the parser, even though newlines only matter in certain contexts.

To deal with the problems described above, this PR introduces a new scope concept to the lexer. In short, any token that appears to the right of a scope's boundary belongs to that scope. For example:

a: [1, 2
 ]

After the first `:` token, the lexer enters a block mapping scope with its boundary at column 2. Any token appearing beyond column 2 belongs to this scope.
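The scope idea can be sketched roughly as follows. This is an illustrative model only: `Scope` and `innermost_scope_for` are hypothetical names, not Biome's actual types.

```rust
// Hypothetical sketch of the scope idea: scopes form a stack, and a
// token belongs to a scope when its column lies strictly to the right
// of that scope's boundary column.
#[derive(Debug, Clone, Copy)]
struct Scope {
    /// Column of the scope's left boundary.
    boundary: u32,
}

impl Scope {
    /// A token starting at `column` belongs to this scope when it
    /// appears beyond the boundary.
    fn contains(&self, column: u32) -> bool {
        column > self.boundary
    }
}

/// Returns the index of the innermost scope on the stack that still
/// contains a token starting at `column`, if any.
fn innermost_scope_for(scopes: &[Scope], column: u32) -> Option<usize> {
    scopes.iter().rposition(|scope| scope.contains(column))
}
```

With a stack of scopes, the innermost scope that still contains the token's column wins; a token left of every boundary closes all scopes.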

Now, instead of emitting indent and dedent tokens, the YAML lexer emits MAPPING_START, MAPPING_END, SEQUENCE_START, SEQUENCE_END, FLOW_START, and FLOW_END tokens to signal to the parser when to start and end a collection. All disambiguation logic, such as distinguishing a flow collection ([1, 2, 3]) from the start of a new mapping (a plain token followed by `:`, as in [1, 2, 3]:), is handled in the lexer.

Sorry that this took a while to get out. It doesn't help that the YAML spec has so many edge cases.

After this, the next major blocker is dealing with YAML properties, which have quite challenging indentation rules.

Test Plan

This PR adds more tests for the YAML lexer. The most notable ones:

  • Correctly identifies a mis-indented sequence as a single plain token. For example:
    - a
     - b
     - c
     - d
  • Limits the impact of missing closing indicators for flow collections to only the closest scope. For example, in the document below, the lexer can identify that the [1, 2, 3] collection is missing a closing bracket and correctly inserts a FLOW_END token before emitting the next mapping key:
    a:
      b: [1, 2, 3
    c:
  • Correctly handles a sequence nested under a mapping without indentation. For example, the following document is equivalent to {"a": ["b"]}:
    a:
    - b
  • Correctly handles nested compact forms:
    - a: b
      c: d


changeset-bot bot commented Jun 22, 2025

⚠️ No Changeset found

Latest commit: a2cec48

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets


@github-actions github-actions bot added A-Parser Area: parser A-Tooling Area: internal tools labels Jun 22, 2025
@vohoanglong0107 vohoanglong0107 changed the title feat(yaml): YAML lexer overhaul feat(yaml): overhauling YAML lexer Jun 22, 2025
@vohoanglong0107 vohoanglong0107 force-pushed the moving-state-to-yaml-lexer branch from 045676c to 7cb344c Compare June 22, 2025 15:26

codspeed-hq bot commented Jun 22, 2025

CodSpeed Performance Report

Merging #6481 will not alter performance

Comparing vohoanglong0107:moving-state-to-yaml-lexer (a2cec48) with main (9590493)

Summary

✅ 115 untouched benchmarks

@vohoanglong0107 vohoanglong0107 force-pushed the moving-state-to-yaml-lexer branch from 7cb344c to 0af865f Compare June 28, 2025 13:49
@vohoanglong0107 vohoanglong0107 requested review from a team June 28, 2025 13:50
@vohoanglong0107 vohoanglong0107 force-pushed the moving-state-to-yaml-lexer branch from 0af865f to 010da6b Compare June 28, 2025 14:06
@vohoanglong0107 vohoanglong0107 force-pushed the moving-state-to-yaml-lexer branch from 010da6b to 6b2c15b Compare June 28, 2025 14:07
@vohoanglong0107 vohoanglong0107 marked this pull request as ready for review June 28, 2025 14:09
@dyc3 dyc3 self-requested a review June 28, 2025 15:14
Contributor

@dyc3 dyc3 left a comment

Some questions -- I'd like to understand your changes a little better


pub(crate) struct YamlTokenSource<'source> {
-    lexer: BufferedLexer<YamlSyntaxKind, YamlLexer<'source>>,
+    lexer: YamlLexer<'source>,
Contributor

Why did you remove the BufferedLexer here? That's what lets the parser peek ahead of the current token.

Contributor Author

Right, the main issue is that the YamlLexer is quite incompatible with BufferedLexer.
BufferedLexer's lookahead function relies on checkpoints and rewinding, which involves advancing the lexer by n tokens and then rewinding back to the latest checkpoint.
To disambiguate between different YAML constructs, the YamlLexer needs to emit a bunch of tokens at once, rather than just one like the other lexers. For example, to check whether the first inline collection [a, b, c] is a mapping key, it has to lex the entire collection, then check for the presence of :.

- [a, b, c]: abc
  [x, y, z]: xyz

Once it's certain that the collection is a mapping key, it emits a MAPPING_START token before emitting the rest.
For this reason, it's not trivial to implement checkpointing and rewinding for this lexer. The fact that this lexer manages some internal state for the disambiguation logic only adds to the complexity.
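The batching behaviour described here can be sketched like this. `BatchingLexer` and its methods are hypothetical illustrations, not the actual YamlLexer API:

```rust
use std::collections::VecDeque;

// Hypothetical sketch (not Biome's actual code): to decide whether a
// flow collection like `[a, b, c]` is a mapping key, the lexer must
// lex the whole collection first, and only then can it place a
// MAPPING_START in front of the buffered tokens before handing any
// of them to the parser.
#[derive(Debug, PartialEq)]
enum Tok {
    MappingStart,
    FlowStart,
    Plain(&'static str),
    FlowEnd,
    Colon,
}

struct BatchingLexer {
    pending: VecDeque<Tok>,
}

impl BatchingLexer {
    fn lex_flow_collection(&mut self, items: &[&'static str], followed_by_colon: bool) {
        let mut batch = vec![Tok::FlowStart];
        batch.extend(items.iter().copied().map(Tok::Plain));
        batch.push(Tok::FlowEnd);
        if followed_by_colon {
            // Only after lexing the whole collection do we know it was
            // a mapping key, so MAPPING_START goes in front of the batch.
            self.pending.push_back(Tok::MappingStart);
        }
        self.pending.extend(batch);
        if followed_by_colon {
            self.pending.push_back(Tok::Colon);
        }
    }

    /// The parser drains tokens one at a time from the buffered batch.
    fn next_token(&mut self) -> Option<Tok> {
        self.pending.pop_front()
    }
}
```

Because a single lexing step can enqueue many tokens and retroactively prepend one, a checkpoint would have to capture the whole pending queue plus the disambiguation state, which is why rewinding is not trivial here.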

source
}

-    fn next_non_trivia_token(&mut self, context: YamlLexContext, first_token: bool) {
+    fn next_non_trivia_token(&mut self, first_token: bool) {
Contributor

Also, why did you remove the lex context?

Contributor Author

So previously the lexer was stateless: it relied on the parser's instructions on how to lex the next token, which is what the context was used for. Now the lexer manages its own state and knows whether to lex a mapping token, a sequence token, and so on. The parser no longer needs to pass that information down.
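The contrast can be sketched as follows; `LexState` and `StatefulLexer` are invented names for illustration, not the real API:

```rust
// Hypothetical contrast: instead of the parser passing a context
// (like the removed YamlLexContext) into every call, the lexer keeps
// track of what it is currently lexing itself.
#[derive(Debug, Clone, Copy, PartialEq)]
enum LexState {
    Document,
    BlockMapping,
    BlockSequence,
}

struct StatefulLexer {
    state: LexState,
}

impl StatefulLexer {
    /// The lexer updates its own state when it sees a structural
    /// indicator, so callers no longer pass a context down.
    fn observe(&mut self, indicator: char) {
        self.state = match indicator {
            ':' => LexState::BlockMapping,
            '-' => LexState::BlockSequence,
            _ => self.state,
        };
    }
}
```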

/// If the source starts with a Unicode BOM, this is the number of bytes for that token.
unicode_bom_length: usize,
/// Where the lexer is in the source
current_coordinate: TextCoordinate,
Contributor

Are you keeping track of the document's indentation size anywhere?

Contributor Author

Yes we are, through this current_coordinate, which stores the current column, and BlockScope. As stated above, since indentation (a newline followed by spaces) alone is no longer enough to correctly determine whether the lexer is at the start of a new collection, the lexer now tracks the column of each token and compares it with the leftmost column of the current scope.
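A rough sketch of this column-tracking idea, where `Coordinate` and `starts_new_collection` are illustrative stand-ins for the real TextCoordinate and BlockScope machinery:

```rust
// Hypothetical sketch: the lexer advances a (line, column) coordinate
// over the source and compares a token's column against the leftmost
// column of the current scope.
#[derive(Debug, Default, Clone, Copy)]
struct Coordinate {
    line: u32,
    column: u32,
}

impl Coordinate {
    /// Advance past one character, resetting the column on newlines.
    fn advance(&mut self, ch: char) {
        if ch == '\n' {
            self.line += 1;
            self.column = 0;
        } else {
            self.column += 1;
        }
    }
}

/// A token can only start a new nested collection when it sits
/// further right than the current scope's leftmost column.
fn starts_new_collection(token_column: u32, scope_leftmost_column: u32) -> bool {
    token_column > scope_leftmost_column
}
```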

@vohoanglong0107 vohoanglong0107 requested a review from dyc3 June 29, 2025 13:26
Member

@ematipico ematipico left a comment

Thank you @dyc3 for the insightful questions, and @vohoanglong0107 for the great effort you put into this and the answers you gave. It's pretty clear now.

I think the PR doesn't require many changes as it is, and since we now have more tests, they form a good harness.

Something you might want to consider at a later stage is making this "stateful lexer" more generic, so that future parsers could reuse it.

Comment on lines +33 to +36
diagnostics: Vec::new(),
scopes: Default::default(),
current_coordinate: Default::default(),
tokens: LexToken::default().into(),
Member

You still want to track and lex the BOM Unicode character.

Contributor Author

You're right, I completely forgot about it. Since one YAML file can have multiple BOMs, one per document, let me figure out how to proceed in another PR.

Comment on lines +513 to +514
fn rewind(&mut self, _: LexerCheckpoint<Self::Kind>) {
    unimplemented!()
Member

If you don't plan to implement rewind, we might want to delete it.

Contributor Author

Oh, it's a required fn of the Lexer trait, so I can't delete it.

let m = p.start();
p.expect(FLOW_START);
parse_flow_yaml_node(p);
// debug_assert!(p.at(FLOW_END));
Member

Remove.

Contributor

@dyc3 dyc3 left a comment

LGTM!

@vohoanglong0107 vohoanglong0107 merged commit 09ba0ab into biomejs:main Jun 30, 2025
29 checks passed
@vohoanglong0107 vohoanglong0107 deleted the moving-state-to-yaml-lexer branch June 30, 2025 13:12
Labels
A-Parser Area: parser A-Tooling Area: internal tools