
MySQL 8.0 Lexer State Transitions for Identifier Parsing

Tech Apr 24 17

Identifier Tokenization Logic in MySQL's SQL Lexer

Source file: router/src/routing/src/sql_lexer.cc (MySQL 8.0.37)

The MY_LEX_IDENT state handles identifier recognition, including support for multi-byte character sets and special prefix forms like _charsetname. Its behavior adapts dynamically based on character set properties and SQL mode settings.

Multi-Byte Character Handling

For multi-byte character sets, before scanning the rest of the identifier, the lexer inspects the byte it has just consumed (yyGetLast()) with my_mbcharlen() and, when that byte may begin a multi-byte character, validates the whole sequence with my_ismbchar(). If the sequence is valid, its remaining bytes are skipped; if it is not, the lexer switches to MY_LEX_CHAR:

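// Classify the byte that was just consumed.
int mb_len = my_mbcharlen(cs, lip->yyGetLast());
switch (mb_len) {
  case 1:
    break;  // single-byte character, nothing more to skip
  case 0:
    if (my_mbmaxlenlen(cs) < 2) break;
    [[fallthrough]];
  default:
    // Validate the full sequence starting at the consumed byte.
    int seq_len = my_ismbchar(cs, lip->get_ptr() - 1, lip->get_end_of_query());
    if (seq_len == 0) {
      state = MY_LEX_CHAR;  // not a valid multi-byte character
      continue;
    }
    lip->skip_binary(seq_len - 1);  // skip the remaining bytes of the character
}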
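
The two-step check above (length from the lead byte, then validation of the whole sequence) can be illustrated outside the server. The following is a minimal sketch that assumes UTF-8 as the character set; lead_byte_len() and mb_seq_len() are hypothetical stand-ins for my_mbcharlen() and my_ismbchar(), not MySQL functions, and the validation is deliberately simplified.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for my_mbcharlen(): the character length implied by
// a UTF-8 lead byte, or 0 if the byte cannot start a character.
int lead_byte_len(unsigned char b) {
  if (b < 0x80) return 1;
  if (b >= 0xC2 && b <= 0xDF) return 2;
  if (b >= 0xE0 && b <= 0xEF) return 3;
  if (b >= 0xF0 && b <= 0xF4) return 4;
  return 0;
}

// Hypothetical stand-in for my_ismbchar(): length of the complete multi-byte
// character starting at pos, or 0 if it is single-byte, truncated, or
// malformed. Simplified: overlong and surrogate encodings are not rejected.
int mb_seq_len(std::string_view s, std::size_t pos) {
  if (pos >= s.size()) return 0;
  const int len = lead_byte_len(static_cast<unsigned char>(s[pos]));
  if (len < 2 || pos + static_cast<std::size_t>(len) > s.size()) return 0;
  for (int i = 1; i < len; ++i) {
    const auto b = static_cast<unsigned char>(s[pos + i]);
    if ((b & 0xC0) != 0x80) return 0;  // not a continuation byte
  }
  return len;
}
```

A caller that has already consumed the lead byte would skip mb_seq_len(...) - 1 further bytes, which corresponds to the skip_binary(seq_len - 1) call in the excerpt.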

Identifier Scanning Loop

Depending on whether the active character set supports multi-byte encoding (use_mb(cs)), the lexer applies distinct scanning strategies:

Multi-byte path: Iterates while ident_map[c] holds true, re-evaluating each character’s multi-byte context after every yyGet() call.
Single-byte path: Uses a compact loop that ORs every byte of the token into an accumulator. If any byte has its high bit set (a non-ASCII byte), the token is returned as IDENT_QUOTED instead of IDENT:

if (!use_mb(cs)) {
  int accumulated = c;
  while (ident_map[c = lip->yyGet()]) {
    accumulated |= c;
  }
  result_state = (accumulated & 0x80) ? IDENT_QUOTED : IDENT;
} else {
  // Multi-byte-aware loop with repeated my_ismbchar validation
  result_state = IDENT_QUOTED;
  // ... [multi-byte loop body as above]
}
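
As a standalone illustration of the single-byte path, here is a minimal sketch of the OR-accumulation idea. The is_ident_byte() predicate is an assumed approximation of ident_map (letters, digits, _, $, and bytes with the high bit set), and IdentKind and ScanResult are hypothetical types introduced only for this example.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// Hypothetical mirror of the IDENT / IDENT_QUOTED distinction.
enum class IdentKind { Ident, IdentQuoted };

struct ScanResult {
  std::size_t length;  // number of identifier bytes consumed
  IdentKind kind;      // IdentQuoted if any byte was non-ASCII
};

// Assumed approximation of ident_map for a single-byte character set.
bool is_ident_byte(unsigned char b) {
  return std::isalnum(b) || b == '_' || b == '$' || b >= 0x80;
}

ScanResult scan_ident(std::string_view s) {
  unsigned accumulated = 0;
  std::size_t i = 0;
  while (i < s.size() && is_ident_byte(static_cast<unsigned char>(s[i]))) {
    accumulated |= static_cast<unsigned char>(s[i]);  // remember every bit seen
    ++i;
  }
  // One test after the loop replaces a per-byte branch inside it.
  return {i, (accumulated & 0x80) ? IdentKind::IdentQuoted : IdentKind::Ident};
}
```

The accumulator trades a comparison per byte for a single bit test at the end, which is presumably why the original loop is written this way.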

Whitespace Skipping Under IGNORE_SPACE Mode

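When the IGNORE_SPACE SQL mode is enabled, the lexer skips any whitespace that follows the identifier before it inspects the next character, so that constructs like SELECT COUNT (1) are still recognized as function calls. The read position is recorded in start before the skip; if whitespace was consumed, get_ptr() no longer equals start, and later checks use that to tell the two cases apart: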

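length = lip->yyLength();
start = lip->get_ptr();
if (lip->ignore_space) {
  // Skipping whitespace moves get_ptr() past start; the separator
  // check that follows relies on that difference.
  while (state_map[c] == MY_LEX_SKIP) {
    if (c == '\n') lip->yylineno++;
    c = lip->yyGet();
  }
}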
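
The effect of this skip on function-call detection can be sketched in isolation. The code below is an illustration under stated assumptions, not the server's logic: skip_space_after_ident() and looks_like_call() are hypothetical helpers, whitespace is limited to space, tab, CR and LF, and ident_end is the index just past the identifier.

```cpp
#include <cstddef>
#include <string_view>

struct AfterIdent {
  std::size_t pos;  // index of the first character examined after the identifier
  int newlines;     // line breaks skipped on the way (mirrors yylineno++)
  bool moved;       // true if any whitespace was consumed
};

// Hypothetical helper: skips whitespace after the identifier only when
// ignore_space is set.
AfterIdent skip_space_after_ident(std::string_view s, std::size_t ident_end,
                                  bool ignore_space) {
  AfterIdent r{ident_end, 0, false};
  if (!ignore_space) return r;
  while (r.pos < s.size() && (s[r.pos] == ' ' || s[r.pos] == '\t' ||
                              s[r.pos] == '\r' || s[r.pos] == '\n')) {
    if (s[r.pos] == '\n') ++r.newlines;
    ++r.pos;
    r.moved = true;
  }
  return r;
}

// Hypothetical helper: does the token look like the name of a function call?
bool looks_like_call(std::string_view s, std::size_t ident_end,
                     bool ignore_space) {
  const AfterIdent r = skip_space_after_ident(s, ident_end, ignore_space);
  return r.pos < s.size() && s[r.pos] == '(';
}
```

With the input COUNT (1) and ident_end = 5, looks_like_call() is false without ignore_space and true with it, which matches the behavior the section describes.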

Dot-Separated Identifier Detection

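If the character directly after the identifier is a . (no whitespace was skipped, so start still equals get_ptr()) and the character after the dot can be part of an identifier (per ident_map), the lexer sets next_state to MY_LEX_IDENT_SEP and leaves the separator to that state. Otherwise it pushes the character back and checks whether the token is a keyword or function name: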

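if (start == lip->get_ptr() && c == '.' && ident_map[lip->yyPeek()]) {
  // Qualified name such as db.tbl: let MY_LEX_IDENT_SEP consume the dot.
  lip->next_state = MY_LEX_IDENT_SEP;
} else {
  lip->yyUnget();
  // The third argument is true when '(' follows the token.
  int keyword_id = find_keyword(lip, length, c == '(');
  if (keyword_id) {
    lip->next_state = MY_LEX_START;
    return keyword_id;  // token is a keyword or function name
  }
  lip->yySkip();
}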
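
The separator condition can be expressed as a small predicate. This is a sketch only: can_be_in_ident() is an assumed approximation of ident_map, and the whitespace_skipped flag is a hypothetical replacement for the start == get_ptr() comparison.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// Assumed approximation of ident_map.
bool can_be_in_ident(unsigned char b) {
  return std::isalnum(b) || b == '_' || b == '$' || b >= 0x80;
}

// True when the character at after_ident is '.', it directly follows the
// identifier, and the character after the dot can belong to an identifier.
bool starts_qualified_name(std::string_view s, std::size_t after_ident,
                           bool whitespace_skipped) {
  if (whitespace_skipped) return false;  // mirrors start != get_ptr()
  if (after_ident + 1 >= s.size()) return false;
  return s[after_ident] == '.' &&
         can_be_in_ident(static_cast<unsigned char>(s[after_ident + 1]));
}
```

For db.tbl with after_ident = 2 the predicate holds; for db. tbl or a trailing db. it does not, so the token would go through the keyword lookup instead.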

Underscore-Prefixed Charset Declaration

An identifier starting with _ may be a character set introducer (e.g., _utf8mb4 'hello'). The lexer looks up the part after the underscore with get_charset_by_csname(), passing MYF(0) so that an unknown name raises no error and the token simply falls through as an ordinary identifier. When the lookup returns utf8mb4_0900_ai_ci, the lexer substitutes the session’s default_collation_for_utf8mb4 value:

if (yylval->lex_str.str[0] == '_') {
  const char* cs_name = yylval->lex_str.str + 1;
  const CHARSET_INFO* cs_info = get_charset_by_csname(cs_name, MY_CS_PRIMARY, MYF(0));
  if (cs_info) {
    lip->warn_on_deprecated_charset(cs_info, cs_name);
    if (cs_info == &my_charset_utf8mb4_0900_ai_ci) {
      cs_info = thd->variables.default_collation_for_utf8mb4;
    }
    yylval->charset = cs_info;
    lip->m_underscore_cs = cs_info;
    lip->body_utf8_append(lip->m_cpp_text_start, lip->get_cpp_tok_start() + length);
    return UNDERSCORE_CHARSET;
  }
}
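
The lookup-with-fallback pattern can be sketched without the server's charset registry. In the example below, default_collation() is a hypothetical stand-in for get_charset_by_csname() backed by a three-entry table, the lookup is case-sensitive for brevity, and Token and TokenKind are types invented for the illustration.

```cpp
#include <optional>
#include <string>
#include <string_view>
#include <unordered_map>

enum class TokenKind { Ident, UnderscoreCharset };

struct Token {
  TokenKind kind;
  std::string collation;  // empty unless kind == UnderscoreCharset
};

// Hypothetical stand-in for get_charset_by_csname(): maps a character set
// name to its default collation, or returns nothing without reporting an error.
std::optional<std::string> default_collation(std::string_view name) {
  static const std::unordered_map<std::string, std::string> table = {
      {"utf8mb4", "utf8mb4_0900_ai_ci"},
      {"latin1", "latin1_swedish_ci"},
      {"binary", "binary"},
  };
  const auto it = table.find(std::string(name));
  if (it == table.end()) return std::nullopt;
  return it->second;
}

Token classify(std::string_view text,
               const std::string& default_collation_for_utf8mb4) {
  if (!text.empty() && text[0] == '_') {
    if (auto coll = default_collation(text.substr(1))) {
      // Swap in the session-level default for utf8mb4, as the excerpt does.
      if (*coll == "utf8mb4_0900_ai_ci") coll = default_collation_for_utf8mb4;
      return {TokenKind::UnderscoreCharset, *coll};
    }
  }
  return {TokenKind::Ident, ""};  // unknown name: ordinary identifier
}
```

Under these assumptions, _utf8mb4 resolves to whatever the session default is, _latin1 keeps its own default collation, and _bla is returned as a plain identifier.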

UTF-8 Normalization and Final Dispatch

After extracting the token with get_token(), and before returning result_state (IDENT or IDENT_QUOTED), the lexer updates its UTF-8 copy of the statement body: body_utf8_append() copies the preprocessed text that precedes the token, and body_utf8_append_literal() appends the identifier itself, converting it from cs where needed:

yylval->lex_str = get_token(lip, 0, length);
lip->body_utf8_append(lip->m_cpp_text_start);
lip->body_utf8_append_literal(thd, &yylval->lex_str, cs, lip->m_cpp_text_end);
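
To make the "copy the prefix, then append the converted token" idea concrete, here is a minimal sketch that assumes the identifier is encoded in Latin-1 and the preceding text is plain ASCII. Both append_latin1_as_utf8() and build_body() are hypothetical helpers; the server's conversion goes through its own charset machinery.

```cpp
#include <string>
#include <string_view>

// Appends Latin-1 bytes to out, re-encoding each one as UTF-8.
void append_latin1_as_utf8(std::string& out, std::string_view latin1) {
  for (char ch : latin1) {
    const auto b = static_cast<unsigned char>(ch);
    if (b < 0x80) {
      out.push_back(static_cast<char>(b));  // ASCII is unchanged
    } else {
      // Code points 0x80..0xFF take two bytes in UTF-8.
      out.push_back(static_cast<char>(0xC0 | (b >> 6)));
      out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
    }
  }
}

// Hypothetical body builder: prefix is the statement text before the token
// (assumed to be ASCII), ident is the Latin-1 identifier.
std::string build_body(std::string_view prefix, std::string_view ident) {
  std::string body(prefix);
  append_latin1_as_utf8(body, ident);
  return body;
}
```

For example, a Latin-1 identifier containing the byte 0xE9 ends up in the body as the two bytes 0xC3 0xA9.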

Return Values and State Transitions

Current State	Condition	Action
`MY_LEX_IDENT`	Next character is `.` and is followed by an identifier character	Set `next_state = MY_LEX_IDENT_SEP`
	Token matches a keyword or function	Return keyword ID; set `next_state = MY_LEX_START`
	Token starts with `_` and resolves to valid charset	Return `UNDERSCORE_CHARSET` (852)
	Otherwise	Return `IDENT` (482) or `IDENT_QUOTED` (484)

Tags: MySQL lexer parsing

Copyright © fadingcoder.top
