MySQL 8.0 Lexer State Transitions for Identifier Parsing
Identifier Tokenization Logic in MySQL's SQL Lexer
Source file: router/src/routing/src/sql_lexer.cc (MySQL 8.0.37)
The MY_LEX_IDENT state handles identifier recognition, including support for multi-byte character sets and special prefix forms like _charsetname. Its behavior adapts dynamically based on character set properties and SQL mode settings.
Multi-Byte Character Handling
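Before scanning identifier characters, the lexer checks whether the byte it just read starts a multi-byte character, using my_mbcharlen() and my_ismbchar(). A reported length of 1 means a single-byte character and needs no adjustment; a length of 0 also needs none when my_mbmaxlenlen(cs) is below 2. In every other case my_ismbchar() validates the sequence that begins at that byte: if it is not a valid multi-byte character, the lexer switches to MY_LEX_CHAR; if it is, the lexer skips the remaining bytes of the sequence: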
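// Length implied by the byte that was just read.
int mb_len = my_mbcharlen(cs, lip->yyGetLast());
switch (mb_len) {
  case 1:
    // Single-byte character: nothing to adjust.
    break;
  case 0:
    if (my_mbmaxlenlen(cs) < 2) break;
    [[fallthrough]];
  default:
    // Validate the sequence that starts at the byte just read.
    int seq_len = my_ismbchar(cs, lip->get_ptr() - 1, lip->get_end_of_query());
    if (seq_len == 0) {
      // Not a valid multi-byte character: hand over to the MY_LEX_CHAR state.
      state = MY_LEX_CHAR;
      continue;
    }
    // Skip the remaining bytes of the multi-byte character.
    lip->skip_binary(seq_len - 1);
}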
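To make the length and validation checks concrete, here is a minimal sketch for UTF-8 only. The helpers utf8_lead_len() and utf8_mb_seq_len() are hypothetical stand-ins for my_mbcharlen() and my_ismbchar(); they are not MySQL functions, and the validation is deliberately simplified (it checks continuation bytes but not overlong encodings or surrogates).

```cpp
#include <cstddef>

// Hypothetical stand-in for my_mbcharlen(): the character length implied by
// a UTF-8 lead byte, or 0 if the byte cannot start a character.
inline int utf8_lead_len(unsigned char b) {
  if (b < 0x80) return 1;               // ASCII
  if (b >= 0xC2 && b <= 0xDF) return 2; // two-byte lead
  if (b >= 0xE0 && b <= 0xEF) return 3; // three-byte lead
  if (b >= 0xF0 && b <= 0xF4) return 4; // four-byte lead
  return 0;                             // continuation or invalid lead byte
}

// Hypothetical stand-in for my_ismbchar(): returns the sequence length when
// [p, end) begins with a complete multi-byte sequence, otherwise 0.
// Single-byte characters also yield 0, mirroring the "not multi-byte" result.
inline int utf8_mb_seq_len(const unsigned char* p, const unsigned char* end) {
  if (p >= end) return 0;
  const int len = utf8_lead_len(*p);
  if (len < 2 || end - p < len) return 0;  // single-byte, invalid, or truncated
  for (int i = 1; i < len; ++i) {
    if ((p[i] & 0xC0) != 0x80) return 0;   // not a continuation byte
  }
  return len;
}
```

In terms of the switch above, a return value of 0 corresponds to the fallback to MY_LEX_CHAR, and a non-zero value n corresponds to skipping n - 1 further bytes.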
Identifier Scanning Loop
Depending on whether the active character set supports multi-byte encoding (use_mb(cs)), the lexer applies distinct scanning strategies:
- Multi-byte path: Iterates while ident_map[c] holds true, re-evaluating each character’s multi-byte context after every yyGet() call.
- Single-byte path: Uses a compact bitwise accumulation loop to detect the presence of bytes with the high bit set (non-ASCII); the result selects between IDENT and IDENT_QUOTED:
if (!use_mb(cs)) {
  int accumulated = c;
  while (ident_map[c = lip->yyGet()]) {
    accumulated |= c;
  }
  result_state = (accumulated & 0x80) ? IDENT_QUOTED : IDENT;
} else {
  // Multi-byte-aware loop with repeated my_ismbchar validation
  result_state = IDENT_QUOTED;
  // ... [multi-byte loop body as above]
}
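The single-byte accumulation trick can be tried in isolation. The following is a minimal sketch, not the lexer's code: IdentKind, ScanResult, is_ident_byte() and scan_single_byte_ident() are hypothetical names, and is_ident_byte() is a simplified assumption standing in for ident_map (it accepts ASCII letters, digits, _, $ and every byte with the high bit set).

```cpp
#include <cstddef>
#include <string_view>

enum class IdentKind { Ident, IdentQuoted };  // hypothetical token kinds

struct ScanResult {
  std::size_t length;  // number of identifier bytes consumed
  IdentKind kind;      // IdentQuoted if any byte had the high bit set
};

// Simplified, assumed replacement for the lexer's ident_map lookup.
inline bool is_ident_byte(unsigned char c) {
  return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
         (c >= '0' && c <= '9') || c == '_' || c == '$' || c >= 0x80;
}

// OR every identifier byte into one accumulator; bit 0x80 of the result is
// set exactly when at least one consumed byte was outside the ASCII range.
inline ScanResult scan_single_byte_ident(std::string_view text) {
  unsigned accumulated = 0;
  std::size_t i = 0;
  while (i < text.size() &&
         is_ident_byte(static_cast<unsigned char>(text[i]))) {
    accumulated |= static_cast<unsigned char>(text[i]);
    ++i;
  }
  return {i, (accumulated & 0x80) ? IdentKind::IdentQuoted : IdentKind::Ident};
}
```

The accumulator avoids a per-byte branch: the loop only tests identifier membership, and the non-ASCII decision is made once after the loop ends.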
Whitespace Skipping Under IGNORE_SPACE Mode
When the IGNORE_SPACE SQL mode is enabled, the lexer skips whitespace that follows the identifier, as long as the read pointer still sits at the recorded position start. Line breaks inside the skipped run increment the line counter. This allows constructs like SELECT COUNT (1) to parse correctly:
if (lip->ignore_space && start == lip->get_ptr()) {
  while (state_map[c] == MY_LEX_SKIP) {
    if (c == '\n') lip->yylineno++;
    c = lip->yyGet();
  }
}
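As an illustration of the skipping step, here is a small self-contained sketch. Cursor, is_skippable() and skip_whitespace() are hypothetical names; is_skippable() is an assumed approximation of the characters that state_map classifies as MY_LEX_SKIP.

```cpp
// Hypothetical read cursor over a query string.
struct Cursor {
  const char* ptr;  // current read position
  const char* end;  // one past the last byte of the query
  int line;         // current line number
};

// Assumed approximation of the MY_LEX_SKIP character class.
inline bool is_skippable(char c) {
  return c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\v' ||
         c == '\f';
}

// Advance past whitespace, counting newlines along the way.
inline void skip_whitespace(Cursor& cur) {
  while (cur.ptr < cur.end && is_skippable(*cur.ptr)) {
    if (*cur.ptr == '\n') ++cur.line;
    ++cur.ptr;
  }
}
```

After the call, a caller can test whether the next character is ( to decide if the preceding identifier should be looked up as a function name.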
Dot-Separated Identifier Detection
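If the read pointer has not moved since start was recorded (no whitespace was skipped), the current character is a ., and the next character belongs to an identifier (per ident_map), the lexer sets next_state to MY_LEX_IDENT_SEP and defers further parsing to that state. Otherwise it pushes the character back and looks the token up with find_keyword(), passing whether the current character is ( so that function names can be matched: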
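if (start == lip->get_ptr() && c == '.' && ident_map[lip->yyPeek()]) {
  // "ident.ident": let the separator state handle the dot.
  lip->next_state = MY_LEX_IDENT_SEP;
} else {
  // Push back the character that ended the identifier.
  lip->yyUnget();
  // The third argument tells the lookup whether '(' follows the token.
  int keyword_id = find_keyword(lip, length, c == '(');
  if (keyword_id) {
    lip->next_state = MY_LEX_START;
    return keyword_id;
  }
  // Not a keyword: consume the character again and continue as an identifier.
  lip->yySkip();
}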
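The decision made after the identifier has been scanned can be sketched as a pure function. Everything here is hypothetical and simplified: NextStep, is_ident_char(), lookup_keyword() and classify_after_ident() are illustrative names, the two tiny word lists stand in for the real keyword lookup behind find_keyword(), and the assumption that function names such as COUNT only match when ( follows is modeled by the followed_by_paren flag.

```cpp
#include <string>
#include <string_view>

enum class NextStep { IdentSeparator, Keyword, PlainIdent };

// Simplified, assumed identifier-character test (ASCII only).
inline bool is_ident_char(char c) {
  return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
         (c >= '0' && c <= '9') || c == '_' || c == '$';
}

inline std::string to_upper_ascii(std::string_view s) {
  std::string out(s);
  for (char& ch : out) {
    if (ch >= 'a' && ch <= 'z') ch = static_cast<char>(ch - 'a' + 'A');
  }
  return out;
}

// Hypothetical keyword lookup: reserved words always match, function names
// match only when the token is directly followed by '('.
inline bool lookup_keyword(std::string_view token, bool followed_by_paren) {
  static const char* const kReserved[] = {"SELECT", "FROM", "WHERE"};
  static const char* const kFunctions[] = {"COUNT", "SUM"};
  const std::string upper = to_upper_ascii(token);
  for (const char* word : kReserved) {
    if (upper == word) return true;
  }
  if (followed_by_paren) {
    for (const char* word : kFunctions) {
      if (upper == word) return true;
    }
  }
  return false;
}

// `rest` is the query text immediately after the identifier.
inline NextStep classify_after_ident(std::string_view ident,
                                     std::string_view rest) {
  if (rest.size() >= 2 && rest[0] == '.' && is_ident_char(rest[1])) {
    return NextStep::IdentSeparator;
  }
  const bool followed_by_paren = !rest.empty() && rest[0] == '(';
  return lookup_keyword(ident, followed_by_paren) ? NextStep::Keyword
                                                  : NextStep::PlainIdent;
}
```

For example, classify_after_ident("t1", ".col") selects the separator path, while classify_after_ident("count", " x") falls through to a plain identifier because no ( follows.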
Underscore-Prefixed Charset Declaration
An identifier starting with _ may be a character set introducer (e.g., _utf8mb4 'hello'). The lexer tries to resolve the text after the underscore via get_charset_by_csname(); because MYF(0) is passed, a failed lookup raises no error and the token is then handled as an ordinary identifier. When the resolved collation is utf8mb4_0900_ai_ci, it is replaced with the session’s default_collation_for_utf8mb4 value:
if (yylval->lex_str.str[0] == '_') {
  // Charset name is everything after the leading underscore.
  const char* cs_name = yylval->lex_str.str + 1;
  const CHARSET_INFO* cs_info = get_charset_by_csname(cs_name, MY_CS_PRIMARY, MYF(0));
  if (cs_info) {
    lip->warn_on_deprecated_charset(cs_info, cs_name);
    // Honor the session's configured default collation for utf8mb4.
    if (cs_info == &my_charset_utf8mb4_0900_ai_ci) {
      cs_info = thd->variables.default_collation_for_utf8mb4;
    }
    yylval->charset = cs_info;
    lip->m_underscore_cs = cs_info;
    lip->body_utf8_append(lip->m_cpp_text_start, lip->get_cpp_tok_start() + length);
    return UNDERSCORE_CHARSET;
  }
  // Unknown charset name: fall through and treat the token as an identifier.
}
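The lookup-and-substitute logic can be sketched without any server types. This is an illustrative model only: resolve_introducer() and its small table are hypothetical, collations are represented by their names rather than CHARSET_INFO pointers, and the name match is exact here for brevity.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Returns the collation name an introducer token would select, or nullopt if
// the token is not a recognized introducer. `session_utf8mb4_default` models
// the session's default_collation_for_utf8mb4 setting.
inline std::optional<std::string> resolve_introducer(
    std::string_view token, std::string_view session_utf8mb4_default) {
  if (token.empty() || token[0] != '_') return std::nullopt;
  const std::string_view name = token.substr(1);

  // Hypothetical table: charset name -> primary (default) collation.
  struct Entry {
    std::string_view charset;
    std::string_view primary_collation;
  };
  static constexpr Entry kTable[] = {
      {"utf8mb4", "utf8mb4_0900_ai_ci"},
      {"latin1", "latin1_swedish_ci"},
      {"binary", "binary"},
  };

  for (const Entry& entry : kTable) {
    if (entry.charset != name) continue;
    // Substitute the session's preferred default for utf8mb4.
    if (entry.primary_collation == "utf8mb4_0900_ai_ci") {
      return std::string(session_utf8mb4_default);
    }
    return std::string(entry.primary_collation);
  }
  return std::nullopt;  // unknown name: caller treats token as an identifier
}
```

With a session default of utf8mb4_general_ci, the token _utf8mb4 resolves to utf8mb4_general_ci, while _latin1 is unaffected by that setting.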
UTF-8 Normalization and Final Dispatch
After extracting the raw token string via get_token(), the lexer appends both the original preprocessor text and the normalized literal representation (applying charset conversion where needed) into its UTF-8 output buffer:
yylval->lex_str = get_token(lip, 0, length);
lip->body_utf8_append(lip->m_cpp_text_start);
lip->body_utf8_append_literal(thd, &yylval->lex_str, cs, lip->m_cpp_text_end);
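To show what "applying charset conversion where needed" can mean in practice, the sketch below converts a single-byte token to UTF-8. It is an assumption-laden illustration: latin1_to_utf8() is a hypothetical helper, it uses the plain ISO-8859-1 mapping (each byte equals its Unicode code point), and it does not reproduce how body_utf8_append_literal() performs the conversion.

```cpp
#include <string>
#include <string_view>

// Convert ISO-8859-1 bytes to UTF-8. Bytes below 0x80 are copied unchanged;
// bytes 0x80..0xFF become a two-byte UTF-8 sequence.
inline std::string latin1_to_utf8(std::string_view in) {
  std::string out;
  out.reserve(in.size() * 2);
  for (char ch : in) {
    const unsigned char b = static_cast<unsigned char>(ch);
    if (b < 0x80) {
      out.push_back(static_cast<char>(b));
    } else {
      out.push_back(static_cast<char>(0xC0 | (b >> 6)));    // lead byte
      out.push_back(static_cast<char>(0x80 | (b & 0x3F)));  // continuation
    }
  }
  return out;
}
```

An ASCII-only token passes through unchanged, which matches the idea that conversion is only needed when the token contains non-ASCII bytes.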
Return Values and State Transitions
| Current State | Condition | Action |
|---|---|---|
| MY_LEX_IDENT | Next character is . and is followed by an identifier character | Set next_state = MY_LEX_IDENT_SEP |
| MY_LEX_IDENT | Token matches a keyword or function | Return keyword ID; set next_state = MY_LEX_START |
| MY_LEX_IDENT | Token starts with _ and resolves to a valid charset | Return UNDERSCORE_CHARSET (852) |
| MY_LEX_IDENT | Otherwise | Return IDENT (482) or IDENT_QUOTED (484) |