feat(yaml): overhauling YAML lexer #6481
Conversation
CodSpeed Performance Report: merging #6481 will not alter performance.
Some questions -- I'd like to understand your changes a little better
pub(crate) struct YamlTokenSource<'source> {
-    lexer: BufferedLexer<YamlSyntaxKind, YamlLexer<'source>>,
+    lexer: YamlLexer<'source>,
Why did you remove the BufferedLexer here? That's what lets the parser peek ahead of the current token.
Right, the main issue is that the YamlLexer is quite incompatible with BufferedLexer. BufferedLexer's lookahead function relies on checkpoints and rewinding, which involves advancing the lexer by n tokens and then rewinding back to the latest checkpoint.
To disambiguate between different YAML constructs, the YamlLexer needs to emit a bunch of tokens at once, rather than just one like the other lexers. For example, to check whether the first inline collection `[a, b, c]` is a mapping key, it has to lex the entire collection, then check for the presence of `:`.
- [a, b, c]: abc
  [x, y, z]: xyz
Once it's certain that the collection is a mapping key, it emits a MAPPING_START token before emitting the rest.
For this reason, it's not trivial to implement checkpointing and rewinding for this lexer. The fact that this lexer manages some internal state for the disambiguation logic only adds to the complexity.
-    fn next_non_trivia_token(&mut self, context: YamlLexContext, first_token: bool) {
+    fn next_non_trivia_token(&mut self, first_token: bool) {
Also, why did you remove the lex context?
So previously the lexer was stateless: it relied on the parser's instructions on how to lex the next token, which is what the context was used for. Now the lexer manages its own state and knows whether to lex a mapping token, a sequence token, and so on. The parser no longer needs to pass that information down.
-    /// If the source starts with a Unicode BOM, this is the number of bytes for that token.
-    unicode_bom_length: usize,
+    /// Where the lexer is in the source
+    current_coordinate: TextCoordinate,
Are you keeping track of the document's indentation size anywhere?
Yes we are, through this `current_coordinate`, which stores the current column, and BlockScope. As stated above, since indentation (a newline followed by spaces) alone is no longer enough to correctly determine whether the lexer is at the start of a new collection, it now tracks the column of each token and compares it with the leftmost column of the current scope.
Thank you @dyc3 for the insightful questions, and @vohoanglong0107 for the great effort you put into this and the answers you gave. It's pretty much clear now.
I think the PR as it is doesn't require too many changes, and since we have more tests now, they make a good harness.
Something you might want to consider at a later stage is making this "stateful lexer" more generic, so that future parsers could reuse it.
diagnostics: Vec::new(),
scopes: Default::default(),
current_coordinate: Default::default(),
tokens: LexToken::default().into(),
You still want to track and lex the BOM Unicode character.
You're right, I completely forgot about it. Since one YAML file can have multiple BOMs, one per document, let me see how to proceed in another PR.
fn rewind(&mut self, _: LexerCheckpoint<Self::Kind>) {
    unimplemented!()
If you don't plan to implement rewind, we might want to delete it.
Oh, it's a required fn of the Lexer trait, so I can't delete it.
let m = p.start();
p.expect(FLOW_START);
parse_flow_yaml_node(p);
// debug_assert!(p.at(FLOW_END));
Remove.
LGTM!
Summary
The current indentation handling model employed by the YAML lexer is not sophisticated enough to handle YAML compact forms, which allow a new collection to start without explicit indentation.
For more context, normally YAML requires a new collection, whether a sequence or a mapping, to start with an indent:
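The original snippet here wasn't preserved in this extraction; a minimal assumed illustration of the explicitly indented form:

```yaml
key:
  - item1
  - item2
```

The sequence under `key` starts on a new line, preceded by an indent.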
In contrast, compact forms allow a collection to start on the same line as its parent collection entry:
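Again the original snippet was lost; an assumed illustration of a compact form, where a mapping starts on the same line as its parent sequence entry:

```yaml
- key1: value1
  key2: value2
```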
This breaks the current assumption held by the YAML lexer, which expects all nested collections to begin and end with `indent` and `dedent` tokens. For example, given a nested compact sequence, the lexer currently only detects a single indent token preceding the second line's `-` token, which results in the following token stream: `- - - - - - -`
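The kind of nested compact sequence being described might look like this (an assumed stand-in; the PR's original example isn't preserved here):

```yaml
- - - a
    - b
- c
```

Three sequences start on the first line, yet only one indent token precedes line 2 and only one dedent precedes line 3.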
Given the above stream, the YAML parser doesn't have enough information, purely based on indentation, to correctly identify that the second line's `-` token belongs to the third-degree nested sequence and that the final `-` belongs to the top-level collection. Similarly, when dealing with compact mappings, one can't confidently tell that the following nested mapping is wrongly indented given only one `indent` token (unless we count the number of characters):
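An assumed illustration of the wrongly indented compact mapping case (`b` sits one column too far right, but the lexer would still emit just a single indent token):

```yaml
- a: 1
   b: 2
```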
token (unless we count the number of characters):Additionally, a side effect of exposing indentation information to the parser is that the lexer must also emit newline token for when newline happens without changing indent level. It's extremely annoying to work with in the parser, even though newlines only matters in some context.
To deal with the problems described above, this PR introduces a new scope concept in the lexer. In short, any token that appears further to the right of a scope's boundary belongs to that scope. For example:
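A minimal assumed example of the scope idea (keys of the nested mapping are indented by two spaces, so that scope's boundary sits at column 2):

```yaml
a:
  b: 1
  c: 2
```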
After the first `:` token, the lexer enters a block mapping scope with its boundary at column 2. Any token appearing beyond column 2 belongs to this scope.
Now, instead of emitting `indent` and `dedent` tokens, the YAML lexer emits `MAPPING_START`, `MAPPING_END`, `SEQUENCE_START`, `SEQUENCE_END`, `FLOW_START`, and `FLOW_END` tokens to signal to the parser when to start a new collection. All disambiguation logic, such as distinguishing between a flow collection (`[1, 2, 3]`) and the start of a new mapping (a plain token followed by `:`, e.g. `[1, 2, 3]:`), is handled in the lexer.
Sorry that this took a while to get out. It doesn't help that the YAML spec has so many edge cases.
After this, the next major blocker will be dealing with YAML properties, as they have quite a challenging indentation rule.
Test Plan
This PR adds more tests for the YAML lexer. The most notable ones cover:
- a `[1, 2, 3]` collection that misses its closing bracket, where the lexer must correctly insert a `FLOW_END` token before emitting the next mapping key
- nested flow collections such as `{"a": ["b"]}`