Home > Tech > Content

Core Database Concepts and Optimization Techniques

Tech May 2 24

Transactions

Transaction Definition and ACID Properties

Definition: A sequence of database operations that must either all succeed or all fail, serving as an indivisible unit of work.
ACID Properties:
- Atomicity: All operations within a transaction are completed successfully; otherwise, none are applied.
- Consistency: The transaction ensures the database transitions from one consistent state to another. Constraints (e.g., primary key, foreign key, check constraints) remain intact before and after the transaction. For example, in a bank transfer, the total balance across accounts must remain unchanged.
- Isolation: Concurrent transactions do not interfere with eachother, preventing anomalies like dirty reads, non-repeatable reads, and phantom reads.
- Durability: Once committed, changes are permanently stored, even in the event of system failure.

Data Inconsistencies Caused by Concurrent Operations

Lost Update: T1 modifies data; T2 later overwrites it, causing T1's changes to be lost.
Dirty Read: T1 modifies data; T2 reads it before T1 commits. If T1 rolls back, T2 has read invalid data.
Non-Repeatable Read: T1 reads data; T2 modifies it; T1 reads again and sees different values.
Phantom Read: T1 reads a range of rows; T2 inserts or deletes rows within that range; T1 re-reads and sees a different number of rows.

Difference Between Non-Repeatable Read and Phantom Read

Non-Repeatable Read: Same condition, same data value changes on re-read.
Phantom Read: Same condition, number of records differs on re-read.

Transaction Isolation Levels and Their Effects

Isolation Level	Avoids	Notes
Read Uncommitted	None	High risk of anomalies; visible to uncommitted changes.
Read Committed	Dirty Read	Still susceptible to non-repeatable reads and phantom reads.
Repeatable Read	Dirty Read, Non-Repeatable Read	Vulnerable to phantom reads.
Serializable	All	Enforces strict serial execution via locking.

MySQL’s default isolation level is Repeatable Read.

Viewing and Setting MySQL Isolation Level

-- View global isolation level
SELECT @@global.tx_isolation;

-- View session isolation level
SELECT @@transaction_isolation;

-- Set global isolation level
SET GLOBAL TRANSACTION ISOLATION LEVEL REPEATABLE READ;

-- Set session isolation level
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;

Choosing the Right Isolation Level

Favor higher consistency at the database layer, allowing upstream applications to adapt.
Spring-managed transactions override MySQL settings. If the specified isolation level is unsupported, Spring transaction won’t activate.
Read Committed and Repeatable Read are most commonly used.

Current Read vs Snapshot Read in MySQL

Current Read: SELECT ... FOR UPDATE — reads the latest version, applies locks (write-blocking), uses pessimistic locking.
Snapshot Read: SELECT — non-blocking, concurrent read using MVCC. In serializable mode, it behaves like current read.

Snapshot read is a key feature of MVCC, enabling high concurrency without blocking.

How MySQL Prevents Phantom Reads

For Snapshot Read: Uses MVCC. In repeatable read, the transaction sees a consistent snapshot throughout its lifetime. Even if new rows are inserted, they’re invisible to the transaction.
For Current Read: Uses Next-Key Locks (record lock + gap lock). When SELECT ... FOR UPDATE is executed, it acquires next-key locks. Any insert attempt within the locked range is blocked, preventing phantom reads.

InnoDB Row Lock Types

Record Lock: Locks a specific index entry.
Gap Lock: Locks gaps between index entries (or before first/after last). Only exists in Repeatable Read to prevent phantom reads.
Next-Key Lock: Combines record and gap locks, protecting both data and gaps.

Multi-Version Concurrency Control (MVCC)

MVCC resolves read-write conflicts without locks, enabling non-blocking reads and writes. It supports high concurrency while avoiding dirty reads, non-repeatable reads, and phantom reads—but cannot solve lost updates.

InnoDB implements MVCC through four components:

Hidden fields
Undo logs
Read view
Visibility algorithm

Hidden Fields

DB_ROW_ID: 6-byte hidden primary key. Used to create clustered indexes when no explicit PK is defined.
DB_TRX_ID: 6-byte ID of the transaction that created or last modified the row.
DB_ROLL_PTR: 7-byte pointer to the previous version of the row (in undo log).

Undo Logs

Log the inverse of current operations. An INSERT generates a corresponding DELETE, and an UPDATE generates a reversed UPDATE.
Multiple versions form a version chain per row.

Read View

Created during a snapshot read, capturing the system’s active transaction IDs at that moment.
Key elements:
- creator_trx_id: Transaction ID that created the read view.
- m_ids: Active transaction IDs at creation time.
- min_trx_id: Minimum transaction ID in m_ids.
- max_trx_id: Next expected transaction ID.

Visibility Algorithm

Determine which version of a row is visible to the current transaction:

If DB_TRX_ID = creator_trx_id: Visible (current transaction owns it).
If DB_TRX_ID < min_trx_id: Visible (committed before view creation).
If DB_TRX_ID >= max_trx_id: Not visible (created after view creation).
If min_trx_id ≤ DB_TRX_ID < max_trx_id:
- If in m_ids: Not visible (still active).
- Else: Visible (already committed).

Read Committed: Re-generates read view on every query. Repeatable Read: Uses the initial read view for the entire transaction.

MySQL Locking Mechanisms

By Granularity

Row-Level Lock: Fine-grained, lowest conflict probability, highest concurrency. Slow to acquire, may cause deadlocks.
Table-Level Lock: Coarse-grained, high conflict likelihood, low concurrency. Fast to acquire, no deadlocks.
```
LOCK TABLES table_name READ;
LOCK TABLES table_name WRITE;
```
Global Lock: Locks entire database. Minimal overhead, no deadlocks.
```
FLUSH TABLES WITH READ LOCK;
```

InnoDB defaults to row-level locking; MyISAM uses table-level.

By Type

Exclusive Lock (X): Write lock. Only the owning transaction can read/write. Others cannot access until released.
```
SELECT ... FOR UPDATE;
```
Shared Lock (S): Read lock. Allows multiple readers but blocks writers.
```
SELECT ... FOR SHARE;
```

By Strategy

Pessimistic Locking: Assumes conflicts are likely. Acquires locks early and holds them throughout. Ensures exclusivity.
Optimistic Locking: Assumes conflicts are rare. No locks during read. Checks for conflicts only at commit time. Uses version numbers or timestamps.

InnoDB Row Lock Implementation

Intention Locks (internal table locks):
- Intention Shared Lock (IS): Indicates intent to set shared locks on rows.
- Intention Exclusive Lock (IX): Indicates intent to set exclusive locks on rows.
Lock Compatibility Matrix:

Row locks are implemented on indexed columns. Without an index, InnoDB uses a hidden clustered index.

Indexes

What Are Indexes?

Sorted structures on one or more columns, improving query speed but consuming space and slowing DML.
Common types: B+ Tree and Hash indexes.

Pros and Cons

Pros: Converts random I/O into sequential I/O; faster retrieval.
Cons: Space overhead; maintenance cost during inserts/updates.

Index Types

Clustered Index: Determines physical storage order. Leaf nodes contain full data.
Non-Clustered Index: Separates index from data. Leaf nodes store pointers to data.
Unique Index: Prevents duplicate values. NULL allowed.
Primary Key Index: Clustered, unique, non-NULL.
Secondary Index (Auxiliary Index): Stores index key + primary key. May avoid "back-to-table" lookup if covered.
Covering Index: Query data is entirely available in the index, eliminating back-to-table access.

Index Field Selection Guidelines

Use frequently queried columns (in WHERE clauses).
Consider join columns.
Prefer smaller fields (reduces I/O).
Choose high-cardinality fields (use COUNT(DISTINCT) to assess).
Avoid NULLs; use default values instead.
Fixed-length types (e.g., CHAR) over variable (TEXT).
Use composite indexes for multi-column queries.
Avoid indexing tables with heavy write load.
Small tables (<300 rows): Full scan may be faster.
Large tables: Consider partitioning.

B-Trees

Each node holds multiple key-value pairs, sorted left-to-right.
M-order B-tree: Root ≥ 2 children; internal nodes ≥ M/2 children; all leaves at same level.
Low height enables fast lookups. One disk I/O loads a full node (pre-fetching).
3-level B-tree can hold ~1 billion records with <3 disk reads.

B+ Trees vs Hash Indexes

B+ Tree

Variant of B-tree.
Non-leaf nodes store only keys; leaf nodes form sorted linked list.
Advantages:
- Smaller height → faster queries.
- Stable performance (all paths same length).
- Native sorting → efficient range scans.

Hash Index

Based on hash tables.
O(1) lookup for exact matches.
Disadvantages:
- No support for sorting, grouping, or range queries.
- Cannot handle partial matches (e.g., LIKE %abc).
- Collision issues degrade performance.

Prefix Index

Index only the first N characters of a string.
Useful when prefix has high cardinality.

ALTER TABLE table_name ADD KEY(column_name(prefix_length));

To determine optimal prefix length:

-- Full column cardinality
SELECT COUNT(DISTINCT column_name) / COUNT(*) FROM table_name;

-- With prefix
SELECT COUNT(DISTINCT LEFT(column_name, prefix_length)) / COUNT(*) FROM table_name;

Leftmost Matching Principle

For composite indexes (a,b,c), matching must start from the left.
Stops at first range condition (<, >, BETWEEN, LIKE).

Examples:

✅ WHERE a=1 AND b=2 AND c=3 → uses index.
❌ WHERE b=2 AND c=3 → skips index.
⚠️ WHERE a=1 AND c=3 → only a uses index (b skipped).
⚠️ WHERE a > 1 AND b = 2 → only a uses index (range stops match).
✅ WHERE a BETWEEN 2 AND 8 AND b = 2 → both a and b use index.
✅ WHERE name LIKE 'j%' AND age = 22 → both use index due to ordered prefix.

When Indexes Fail

Violation of leftmost principle.
Function usage on indexed columns: WHERE ABS(a) = 1.
Expression on indexed column: WHERE a + 1 = 2.
Implicit type conversion: WHERE phone = 1300000001 if phone is VARCHAR.
Using OR with non-indexed columns.
Leading or trailing LIKE % patterns.
Using !=, NOT IN, <>, IS NULL, IS NOT NULL.

InnoDB vs MyISAM Index Differences

Feature	InnoDB	MyISAM
Primary Key	Clustered index (stores full data)	Non-clustered (stores row addresses)
Secondary Index	Stores primary key	Stores row addresses
Storage	B+ tree	B+ tree

Index Comparison

Creating and Dropping Indexes

-- Create index
CREATE INDEX idx_name ON table_name(column_list);

-- During table creation
CREATE TABLE user (
    id INT PRIMARY KEY,
    information TEXT,
    FULLTEXT(information)
);

-- Alter table
ALTER TABLE table_name ADD INDEX idx_name(column_list);

-- Drop index
ALTER TABLE table_name DROP INDEX idx_name;
ALTER TABLE table_name DROP PRIMARY KEY;

Database Systems

Relational vs NoSQL

Relational

Pros:
- Easy to understand (table-based).
- Standard SQL.
- Strong ACID guarantees reduce redundancy and inconsistency.
Cons:
- Poor scalability under high write load.
- Difficult horizontal scaling due to JOINs.

NoSQL

Pros:
- High performance (key-value storage).
- Easily scalable horizontally.
Cons:
- Weak transaction support.
- Some lack durability.

When to Use Which

Relational: Structured data (user accounts, addresses), complex queries, strong consistency.
NoSQL: Unstructured data (articles, comments), massive scale, simple key-based access.

Challenges of Relational Databases at Scale

Transactions: Two-phase commit protocol is slow and fragile.
JOINs: Often avoided in favor of denormalization to improve performance.
Performance: B+ trees less efficient than LSM-trees for write-heavy workloads.

SQL Execution Order

Write Order:

SELECT DISTINCT
    select_list
FROM
    left_table
LEFT JOIN
    right_table ON join_condition
WHERE
    where_condition
GROUP BY
    group_by_list
HAVING
    having_condition
ORDER BY
    order_by_condition

Execution Order:

FROM: Cartesian product of tables → virtual table V1.
ON: Filter V1 → V2.
JOIN: Add filtered left table → V3.
WHERE: Filter V3 → V4.
GROUP BY: Group V4 → V5.
HAVING: Filter V5 → V6.
SELECT: Project V6 → V8 (DISTINCT optional if GROUP BY used).
ORDER BY: Sort V8.

SQL Execution Flow

Pooling Design and Connection Pools

Pooling: Pre-allocate resources (e.g., connections) to avoid repeated setup costs.
Database Connection Pool: Reuses existing socket connections instead of creating new ones.
Why Needed? Creating connections is expensive. Pooling reduces latency and resource usage.

Normal Forms (1NF, 2NF, 3NF)

1NF: Atomic attributes (no nested sets).
2NF: Eliminate partial dependency on composite keys.
3NF: Eliminate transitive dependency.

Example: Splitting student data to remove dependencies on department via class.

JOIN Types

Join Type	Description
INNER JOIN	Returns matching rows only.
LEFT JOIN	Left table fully shown; right table shows common rows (nulls for unmatched).
RIGHT JOIN	Right table fully shown; left table shows common rows (nulls for unmatched).
FULL JOIN	Returns all rows from both tables (deduplicated).

SQL Injection

Attack: Inject malicious SQL via input parameters.
Example: Login with password ' OR '1'='1 → always returns true.

Prevention

Frontend: Validate input (regex, length limits).
Backend: Use prepared statements (e.g., PreparedStatement in Java). Parameters are bound separately from SQL logic.

Prepared Statements: Parameterize SQL—parse once, execute many times. Prevents injection.

MyBatis: `#{}` vs `${}`

Syntax	Behavior	Security
`#{}`	Parameterized placeholder; auto-quoting; prevents injection.	✅ Safe
`${}`	Direct string substitution; vulnerable to injection.	❌ Unsafe

<select id="findByName" parameterType="String" resultMap="studentResultMap">
    SELECT * FROM user WHERE username='${value}'
</select>

Use #{} unless you need dynamic column names (e.g., ORDER BY ${column}).

MySQL Query Execution Process

Client request.
Connection Manager: Authenticate, assign permissions.
Query Cache: Return cached result if present; else proceed.
Parser: Lexical and syntactic analysis.
Optimizer: Choose best execution plan (index selection).
Executor: Check privileges, call storage engine.
Storage Engine: Retrieve data; cache result if enabled.

MySQL Architecture

Split into two layers:

Server Layer: Handles connection, caching, parsing, optimization, execution, binlog, built-in functions.
Storage Engine Layer: Manages data storage. Pluggable (InnoDB, MyISAM). InnoDB has redo log and undo log.

MySQL Architecture

Log Types: Binlog, Redo Log, Undo Log

Log	Purpose	Location	Crash-Safe?
Binlog	Logical log of all changes (DDL/DML); used for replication and recovery.	Server layer	❌ Alone
Redo Log	Physical log of data modifications; ensures crash recovery.	InnoDB	✅
Undo Log	Records old values for rollback; supports MVCC.	InnoDB	✅

Key Insight: Both binlog and redo log are needed for crash-safe recovery. Binlog alone cannot guarantee persistence.

MySQL Update Statement Workflow

UPDATE tb_student SET age = '19' WHERE name = '张三';

InnoDB reads row, updates in memory, logs to redo log.
Redo log enters PREPARE state.
Executor logs to binlog.
Redo log commits.

Two-Phase Commit prevents data loss during crashes.

Why Rollback Mechanism?

Undo logs enable rollback on error or ROLLBACK. Must be written before data to ensure recovery.

InnoDB vs MyISAM

Feature	InnoDB	MyISAM
Transactions	Yes	No
MVCC	Yes	No
Lock Granularity	Row & Table	Table only
Index Structure	Clustered (PK)	Non-clustered
Primary Key	Required	Optional
Foreign Keys	Supported	Not supported

Character Sets and Collations

Character Set: Mapping from binary to symbols.
Collation: Rules for comparing characters within a charset.
Inheritance: Database → Table → Column.

CHAR vs VARCHAR

Feature	CHAR	VARCHAR
Length	Fixed	Variable
Storage	1 byte per char (EN), 2 bytes (CN)	2 bytes per char (EN/CN)
Max Size	255 bytes	65,535 bytes
Use Case	Fixed-length (gender, birthday)	Variable-length (name, description)

Optimal `VARCHAR(n)` Size

Set n to maximum expected length.
Avoid overly large values to save space and improve performance.
Use CHAR for fixed-length strings.

DELETE, TRUNCATE, DROP

Command	Behavior	Rollback?	Restart ID
DELETE	Row-by-row removal	✅	Continues
TRUNCATE	Removes all rows	❌	Starts at 1
DROP	Deletes table	❌	N/A

Use DELETE for partial deletion, TRUNCATE to clear data, DROP to remove table.

UNION vs UNION ALL

Operation	Behavior	Performance
UNION	Removes duplicates, sorts results	Slower
UNION ALL	Returns all rows, no deduplication	Faster

Use UNION ALL unless deduplication is required.

Temporary Tables

Created during query execution for intermediate results.
Visible only to current session.
Types: Memory (MEMORY), Disk (MyISAM/InnoDB).

When Created:

Subqueries in FROM clause.
DISTINCT with ORDER BY.
ORDER BY and GROUP BY differ.
UNION operations.

Slow Query Logging

Parameters:
- slow_query_log: Enable/disable.
- long_query_time: Threshold (default 10s).
- log_queries_not_using_indexes: Log unused index queries.
- log_output: FILE or TABLE.

Optimization Tips:

Analyze execution plans (EXPLAIN).

Normalize schema or add intermediate tables.

Optimize LIMIT pagination.

Backup Methods and Strategies

Backup Types

Hot Backup: Database remains online.
Warm Backup: Reads allowed, writes blocked.
Cold Backup: Database offline.

Strategies

File Copy: Simple for small databases.
mysqldump + Binlog: Complete backup + incremental via binlog.
- mysqldump exports structure and data as SQL.
- Binlog stores change history.

Why InnoDB Uses Auto-Increment Primary Keys

Random non-auto-increment keys (e.g., UUID) cause frequent page splits and fragmentation.
Auto-increment ensures sequential insertion → minimal page movement and better performance.

Why `NOT NULL`?

NULL affects COUNT() (excluded from count).
B+ trees don’t store NULL → indexing fails.
NOT IN queries return empty result if any value is NULL.
Comparisons involving NULL require special handling.

Slow SQL Causes

Intermittent slowness: Lock contention or redo log flush.
Consistent slowness: Missing or ineffective index.

MySQL Replication

Purpose: Scale read-heavy workloads.

Benefits:

Read Scaling: Offload reads to slaves.
High Availability: Slave can become master on failure.
Backup: Data replicated to slaves.

Replication Flow:

Master writes to binlog.
Master sends binlog to slave via Log Dump thread.
Slave writes to relay log via IO Thread.
Slave replays relay log via SQL Thread.

Replication Lag: Slave data lags behind master. Core operations (e.g., payments) should go to master.

Database Optimization

Schema Optimization

Break 3NF rules to reduce JOINs.
Use intermediate tables.
Vertical Partitioning: Split by columns (hot/cold data, large fields).
Horizontal Partitioning: Split by rows (sharding).
Prefer auto-increment primary keys.

SQL Optimization

Avoid SELECT *; specify columns.
Use LIMIT 1 for single-row queries.
Prevent index misuse (EXPLAIN helps).

Large Table Optimization

Limit data scope.
Implement read/write splitting.
Apply vertical/horizontal partitioning.

Sharding

Vertical: Split by columns (common, rare, large fields).
Horizontal: Split by rows (same schema, different shards).

Challenges: Cross-shard transactions, complex joins, hard to expand.

Handling IDs After Sharding

Use distributed ID generators (e.g., Snowflake).

Textbook Questions

Locking

Locking: Request before modifying; others blocked until released.
Types: X (exclusive), S (shared).

Livelock

Cause: Priority skew (e.g., T3/T4 get priority over T2).
Solution: First-Come-First-Served (FCFS).

Deadlock

Cause: Circular wait (T1 waits for T2, T2 waits for T1).
Prevention:
1. One-Step Locking: Acquire all locks at once.
2. Order Locking: Define a global lock order.

Deadlock Detection and Resolution

Detection: Timeout method (may cause false positives).
Resolution: Abort the least costly transaction.

Keys and Attributes

Super Key: Uniquely identifies tuples.
Candidate Key: Minimal super key.
Primary Key: Chosen candidate key.
Prime Attribute: Part of any candidate key.
Non-Prime Attribute: Not part of any candidate key.

Why Primary Key?

Enables precise identification of rows for updates/deletes.

Auto-Increment vs UUID

Auto-Increment:
- ✅ Smaller size, ordered, efficient indexing.
- ❌ Exposes business volume; hard to merge.
UUID:
- ✅ Global uniqueness, no exposure, generated at app layer.
- ❌ Larger size, random I/O, slower comparisons.

Recommendation: Use auto-increment unless sharding requires UUID.

Integrity Constraints

Entity Integrity: PK must be unique and non-null.
Referential Integrity: FK must be null or match referenced PK.

Why Can FK Be Null?

Indicates pending assignment.
Only allowed if FK is not part of PK.

Ensuring Consistency

Database Level: ACID properties.
Application Level: Code-level validation before commit.

Tags: Database MySQL transaction

Back to List

Prev: Guidelines for Initializing Parent Class Instances in Java Subclasses

Next: Interval Dynamic Programming Strategy for Maximum Sequence Value

Understanding Strong and Weak References in Java

Strong References Strong reference are the most prevalent type of object referencing in Java. When an object has a strong reference pointing to it, the garbage collector will not reclaim its memory. F...

SBUS Signal Analysis and Communication Implementation Using STM32 with Fus Remote Controller

Overview In a recent project, I utilized the SBUS protocol with the Fus remote controller to control a vehicle's basic operations, including movement, lights, and mode switching. This article is aimed...

Comprehensive Guide to Hive SQL Syntax and Operations

This article provides a detailed walkthrough of Hive SQL, categorizing its features and syntax for practical use. Hive SQL is segmented into the following categories: DDL Statements: Operations on...

Fading Coder

Core Database Concepts and Optimization Techniques

Transactions

Transaction Definition and ACID Properties

Data Inconsistencies Caused by Concurrent Operations

Difference Between Non-Repeatable Read and Phantom Read

Transaction Isolation Levels and Their Effects

Viewing and Setting MySQL Isolation Level

Choosing the Right Isolation Level

Current Read vs Snapshot Read in MySQL

How MySQL Prevents Phantom Reads

InnoDB Row Lock Types

Multi-Version Concurrency Control (MVCC)

Hidden Fields

Undo Logs

Read View

Visibility Algorithm

MySQL Locking Mechanisms

By Granularity

By Type

By Strategy

InnoDB Row Lock Implementation

Indexes

What Are Indexes?

Pros and Cons

Index Types

Index Field Selection Guidelines

B-Trees

B+ Trees vs Hash Indexes

B+ Tree

Hash Index

Prefix Index

Leftmost Matching Principle

When Indexes Fail

InnoDB vs MyISAM Index Differences

Creating and Dropping Indexes

Database Systems

Relational vs NoSQL

Relational

NoSQL

When to Use Which

Challenges of Relational Databases at Scale

SQL Execution Order

Pooling Design and Connection Pools

Normal Forms (1NF, 2NF, 3NF)

JOIN Types

SQL Injection

Prevention

MyBatis: #{} vs ${}

MySQL Query Execution Process

MySQL Architecture

Log Types: Binlog, Redo Log, Undo Log

MySQL Update Statement Workflow

Why Rollback Mechanism?

InnoDB vs MyISAM

Character Sets and Collations

CHAR vs VARCHAR

Optimal VARCHAR(n) Size

DELETE, TRUNCATE, DROP

UNION vs UNION ALL

Temporary Tables

Slow Query Logging

Backup Methods and Strategies

Backup Types

Strategies

Why InnoDB Uses Auto-Increment Primary Keys

Why NOT NULL?

Slow SQL Causes

MySQL Replication

Database Optimization

Schema Optimization

SQL Optimization

Large Table Optimization

Sharding

Handling IDs After Sharding

Textbook Questions

Locking

Livelock

Deadlock

Deadlock Detection and Resolution

MyBatis: `#{}` vs `${}`

Optimal `VARCHAR(n)` Size

Why `NOT NULL`?

Leave a Comment