Fading Coder


MySQL 8.0 Lexer State Transitions for Identifier Parsing


Identifier Tokenization Logic in MySQL's SQL Lexer

Source file: router/src/routing/src/sql_lexer.cc (MySQL 8.0.37)

The MY_LEX_IDENT state handles identifier recognition, including support for multi-byte character sets and special prefix forms like _charsetname. Its behavior adapts dynamically based on character set properties and SQL mode settings.

Multi-Byte Character Handling

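On the multi-byte path, before scanning the rest of the identifier, the lexer checks whether the byte it has just read (lip->yyGetLast()) starts a valid multi-byte sequence, using my_mbcharlen() and my_ismbchar(). A length of 1 means a single-byte character and needs no adjustment. A length of 0 is also left alone unless my_mbmaxlenlen(cs) is at least 2, in which case it falls through to the full check. That check calls my_ismbchar() on the whole sequence: if the sequence is incomplete or invalid, the lexer falls back to MY_LEX_CHAR; if it is valid, the read pointer skips the remaining seq_len - 1 bytes: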

int mb_len = my_mbcharlen(cs, lip->yyGetLast());
switch (mb_len) {
  case 1:
    break;
  case 0:
    if (my_mbmaxlenlen(cs) < 2) break;
    [[fallthrough]];
  default:
    int seq_len = my_ismbchar(cs, lip->get_ptr() - 1, lip->get_end_of_query());
    if (seq_len == 0) {
      state = MY_LEX_CHAR;
      continue;
    }
    lip->skip_binary(seq_len - 1);
}
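
The same decision can be sketched outside the server. The block below is a minimal illustration that assumes UTF-8 and uses simplified validation (it does not reject overlong encodings). lead_byte_len, complete_seq_len and consume_first_char are hypothetical helpers standing in for my_mbcharlen(), my_ismbchar() and the switch above; they are not MySQL APIs.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical stand-in for my_mbcharlen(): the length implied by a UTF-8
// lead byte, or 0 when the byte cannot start a character.
inline std::size_t lead_byte_len(unsigned char b) {
  if (b < 0x80) return 1;
  if (b >= 0xC2 && b <= 0xDF) return 2;
  if (b >= 0xE0 && b <= 0xEF) return 3;
  if (b >= 0xF0 && b <= 0xF4) return 4;
  return 0;
}

// Hypothetical stand-in for my_ismbchar(): length of a complete multi-byte
// sequence starting at pos, or 0 if it is truncated or malformed.
inline std::size_t complete_seq_len(std::string_view s, std::size_t pos) {
  if (pos >= s.size()) return 0;
  const std::size_t len = lead_byte_len(static_cast<unsigned char>(s[pos]));
  if (len < 2 || pos + len > s.size()) return 0;
  for (std::size_t i = 1; i < len; ++i) {
    // Every trailing byte must look like 10xxxxxx.
    if ((static_cast<unsigned char>(s[pos + i]) & 0xC0) != 0x80) return 0;
  }
  return len;
}

// Returns the position just past the first identifier character, or nullopt
// where the real lexer would switch to MY_LEX_CHAR.
inline std::optional<std::size_t> consume_first_char(std::string_view s,
                                                     std::size_t pos) {
  if (pos >= s.size()) return std::nullopt;
  if (lead_byte_len(static_cast<unsigned char>(s[pos])) == 1) return pos + 1;
  const std::size_t len = complete_seq_len(s, pos);
  if (len == 0) return std::nullopt;  // incomplete or invalid sequence
  return pos + len;                   // skip the whole character
}
```

With this sketch, a truncated sequence such as a lone 0xC3 byte yields no position, mirroring the fallback to MY_LEX_CHAR.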

Identifier Scanning Loop

Depending on whether the active character set supports multi-byte encoding (use_mb(cs)), the lexer applies distinct scanning strategies:

  • Multi-byte path: Iterates while ident_map[c] holds true, re-checking each character’s multi-byte length after every yyGet() call. The result is always IDENT_QUOTED.
  • Single-byte path: ORs every scanned byte into an accumulator. If the high bit (0x80) ends up set, at least one non-ASCII byte was seen and the token is returned as IDENT_QUOTED, otherwise as IDENT:

if (!use_mb(cs)) {
  int accumulated = c;
  while (ident_map[c = lip->yyGet()]) {
    accumulated |= c;
  }
  result_state = (accumulated & 0x80) ? IDENT_QUOTED : IDENT;
} else {
  // Multi-byte-aware loop with repeated my_ismbchar validation
  result_state = IDENT_QUOTED;
  // ... [multi-byte loop body as above]
}
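
The single-byte trick is easy to try in isolation. The following is a minimal sketch, not the lexer's code: is_ident_byte is a hypothetical stand-in for ident_map that assumes ASCII letters, digits, '_', '$' and every high-bit byte count as identifier characters, and IdentKind and ScanResult are illustrative types.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

enum class IdentKind { Ident, IdentQuoted };

struct ScanResult {
  std::size_t length;  // number of identifier bytes consumed
  IdentKind kind;      // IdentQuoted if any byte had the high bit set
};

// Hypothetical stand-in for ident_map (an assumption, see the lead-in).
inline bool is_ident_byte(unsigned char b) {
  return b >= 0x80 || std::isalnum(b) || b == '_' || b == '$';
}

inline ScanResult scan_single_byte_ident(std::string_view s) {
  unsigned accumulated = 0;
  std::size_t i = 0;
  while (i < s.size() && is_ident_byte(static_cast<unsigned char>(s[i]))) {
    // OR-ing keeps the high bit set once any non-ASCII byte has been seen.
    accumulated |= static_cast<unsigned char>(s[i]);
    ++i;
  }
  return {i, (accumulated & 0x80) ? IdentKind::IdentQuoted : IdentKind::Ident};
}
```

The accumulator avoids a branch per byte: a single test after the loop answers whether any non-ASCII byte occurred.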

Whitespace Skipping Under IGNORE_SPACE Mode

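When the IGNORE_SPACE SQL mode is enabled, the lexer skips whitespace that follows the identifier it has just scanned. start marks the position immediately after the identifier, so the start == lip->get_ptr() test holds as long as nothing beyond the identifier has been consumed. This lets a call such as SELECT COUNT (1), with a space before the parenthesis, still be recognized as a function call: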

if (lip->ignore_space && start == lip->get_ptr()) {
  while (state_map[c] == MY_LEX_SKIP) {
    if (c == '\n') lip->yylineno++;
    c = lip->yyGet();
  }
}
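
As a rough illustration of this step, the sketch below skips whitespace after an identifier and records whether the read position moved. That is the information the later start == lip->get_ptr() comparison relies on. SkipResult and skip_space_after_ident are hypothetical names, and the whitespace set is an assumption standing in for state_map[c] == MY_LEX_SKIP.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

struct SkipResult {
  std::size_t pos;  // read position after skipping
  int newlines;     // how many '\n' were crossed (yylineno bookkeeping)
  bool moved;       // true if any whitespace was skipped
};

// `start` is the offset right after the identifier.
inline SkipResult skip_space_after_ident(std::string_view sql,
                                         std::size_t start,
                                         bool ignore_space) {
  assert(start <= sql.size());
  SkipResult r{start, 0, false};
  if (!ignore_space) return r;  // without IGNORE_SPACE nothing is skipped
  while (r.pos < sql.size() &&
         (sql[r.pos] == ' ' || sql[r.pos] == '\t' || sql[r.pos] == '\n' ||
          sql[r.pos] == '\r')) {
    if (sql[r.pos] == '\n') ++r.newlines;
    ++r.pos;
  }
  r.moved = (r.pos != start);
  return r;
}
```

For "COUNT (1)" with start at offset 5, the position advances to the parenthesis only when ignore_space is true.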

Dot-Separated Identifier Detection

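If the character right after the identifier is a . (with no whitespace skipped in between, which is what start == lip->get_ptr() verifies) and the character after the dot can be part of an identifier (per ident_map), the lexer sets next_state to MY_LEX_IDENT_SEP and leaves the dot to that state. Otherwise it ungets the character and looks the token up with find_keyword(), passing a flag that tells whether a ( follows so that function names can be matched; on a hit it returns the keyword ID: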

if (start == lip->get_ptr() && c == '.' && ident_map[lip->yyPeek()]) {
  lip->next_state = MY_LEX_IDENT_SEP;
} else {
  lip->yyUnget();
  int keyword_id = find_keyword(lip, length, c == '(');
  if (keyword_id) {
    lip->next_state = MY_LEX_START;
    return keyword_id;
  }
  lip->yySkip();
}
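
The branch can be modeled as a small classifier. This is a sketch under stated assumptions: the identifier is taken to start at offset 0, the keyword and function tables are tiny made-up sets rather than MySQL's symbol tables, and classify_after_ident and Next are hypothetical names.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <string_view>

enum class Next { IdentSep, Keyword, PlainIdent };

inline bool is_ident_char(char c) {
  const unsigned char b = static_cast<unsigned char>(c);
  return std::isalnum(b) || b == '_' || b == '$';
}

// ident_end: offset just past the identifier (which starts at offset 0).
// pos: read position after optional whitespace skipping.
inline Next classify_after_ident(std::string_view sql, std::size_t ident_end,
                                 std::size_t pos) {
  assert(ident_end <= pos && pos <= sql.size());
  static const std::set<std::string> keywords = {"select", "from"};  // toy
  static const std::set<std::string> functions = {"count"};          // toy

  const char c = pos < sql.size() ? sql[pos] : '\0';
  const char peek = pos + 1 < sql.size() ? sql[pos + 1] : '\0';
  // A dot counts only if it directly follows the identifier.
  if (pos == ident_end && c == '.' && is_ident_char(peek)) return Next::IdentSep;

  std::string word(sql.substr(0, ident_end));
  for (char &ch : word) {
    ch = static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
  }
  // Function names match only when '(' follows.
  if (keywords.count(word) || (c == '(' && functions.count(word))) {
    return Next::Keyword;
  }
  return Next::PlainIdent;
}
```

In this model "t1.col" yields IdentSep, while "t1 .col" (with the space already skipped) does not, because the read position no longer equals the end of the identifier.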

Underscore-Prefixed Charset Declaration

An identifier starting with _ may be a character set introducer (e.g., _utf8mb4 'hello'). The lexer looks up the text after the underscore via get_charset_by_csname() with MYF(0), so an unknown name raises no error and the token simply remains an ordinary identifier. When the resolved collation is utf8mb4_0900_ai_ci, it is replaced with the session’s default_collation_for_utf8mb4 value:

if (yylval->lex_str.str[0] == '_') {
  const char* cs_name = yylval->lex_str.str + 1;
  const CHARSET_INFO* cs_info = get_charset_by_csname(cs_name, MY_CS_PRIMARY, MYF(0));
  if (cs_info) {
    lip->warn_on_deprecated_charset(cs_info, cs_name);
    if (cs_info == &my_charset_utf8mb4_0900_ai_ci) {
      cs_info = thd->variables.default_collation_for_utf8mb4;
    }
    yylval->charset = cs_info;
    lip->m_underscore_cs = cs_info;
    lip->body_utf8_append(lip->m_cpp_text_start, lip->get_cpp_tok_start() + length);
    return UNDERSCORE_CHARSET;
  }
}
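
The lookup-or-fall-through behavior can be sketched with a toy table. Everything here is illustrative: Session and resolve_introducer are hypothetical, the map holds only three character sets with their default collations, and unlike the real lookup this sketch is case-sensitive.

```cpp
#include <optional>
#include <string>
#include <string_view>
#include <unordered_map>

// Hypothetical session state; the field mirrors the system variable name.
struct Session {
  std::string default_collation_for_utf8mb4 = "utf8mb4_0900_ai_ci";
};

// Returns the collation to use if `token` is a known introducer, or nullopt
// when the token should stay an ordinary identifier.
inline std::optional<std::string> resolve_introducer(std::string_view token,
                                                     const Session &session) {
  // Toy table: character set name -> its primary (default) collation.
  static const std::unordered_map<std::string, std::string> primary = {
      {"utf8mb4", "utf8mb4_0900_ai_ci"},
      {"latin1", "latin1_swedish_ci"},
      {"binary", "binary"},
  };
  if (token.empty() || token[0] != '_') return std::nullopt;
  const auto it = primary.find(std::string(token.substr(1)));
  if (it == primary.end()) return std::nullopt;  // unknown name: no error
  if (it->second == "utf8mb4_0900_ai_ci") {
    // Honor the session-level override for utf8mb4's default collation.
    return session.default_collation_for_utf8mb4;
  }
  return it->second;
}
```

An unknown name such as _bla produces no result, so a statement like SELECT _bla AS 'alias' would keep treating _bla as a plain identifier.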

UTF-8 Normalization and Final Dispatch

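The token string itself is extracted via get_token() before the introducer check, which is why that check can inspect yylval->lex_str. If no introducer is matched, the lexer appends the preceding preprocessed text and then the identifier, converted to UTF-8 where needed, to its UTF-8 body buffer before returning IDENT or IDENT_QUOTED: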

yylval->lex_str = get_token(lip, 0, length);
lip->body_utf8_append(lip->m_cpp_text_start);
lip->body_utf8_append_literal(thd, &yylval->lex_str, cs, lip->m_cpp_text_end);

Return Values and State Transitions

| Current State | Condition | Action |
| MY_LEX_IDENT | Next character is . and is followed by an identifier character | Set next_state = MY_LEX_IDENT_SEP |
| MY_LEX_IDENT | Token matches a keyword or function | Return keyword ID; set next_state = MY_LEX_START |
| MY_LEX_IDENT | Token starts with _ and resolves to a valid charset | Return UNDERSCORE_CHARSET (852) |
| MY_LEX_IDENT | Otherwise | Return IDENT (482) or IDENT_QUOTED (484) |
