Core Principles of Database Systems: A Technical Overview
Foundational Concepts
- Data: Symbolic records describing entities. Types include text, graphics, images, audio, and video. A key characteristic is the inseparability of data from its semantics, and data inherently possesses structure.
- Database (DB): A long-term, organized, and shared collection of data stored within a computer system.
- Database Management System (DBMS): Software for storing and managing data within a database. Examples include Oracle, MySQL, and SQL Server. Core functionalities encompass:
- Data Definition via Data Definition Language (DDL).
- Data organization, storage, and management.
- Data Manipulation via Data Manipulation Language (DML).
- Transaction management and runtime operations.
- Database System (DBS): The complete system comprising the database, DBMS, application software, supporting hardware/software enviroment, database administrators, and end-users.
Evolution of Data Management
- Manual Stage: Calculations were possible, but results could not be persistently stored due to the absence of disk storage and operating systems.
- File System Stage: The advent of operating systems with file management allowed for the storage of computational results.
- Database Stage: Characterized by large capacity, scale, and data sharing capabilities.
Data Independence
- Logical Independence: Changes to the information content's logical structure (e.g., adding/deleting attributes) do not affect application programs.
- Physical Independence: Changes to the physical storage location or structure (e.g., file organization) do not affect application programs.
Database System Architecture
Internal Architecture (Three-Schema Architecture)
- External Schema (Subschema/User View): Describes the logical structure of the local data visible to and usable by specific database users. Multiple external schemas can exist, each being a subset of the conceptual schema.
- Conceptual Schema (Logical Schema): Describes the global logical structure and characteristics of all data in the database. A database has exactly one conceptual schema, which is independent of physical storage details and specific applications.
- Internal Schema (Storage Schema): Describes the physical structure and storage methods of data—how data is internally represented. A database has exactly one internal schema.
Data Abstraction and Mapping
Mappings between these levels ensure data independence.
- External/Conceptual Mapping: Enables logical independence. If the conceptual schema changes, this mapping can be modified to keep the external schemas unchanged, shielding applications.
- Conceptual/Internal Mapping: Enables physical independence. If the storage structure changes, this mapping can be modified to keep the conceptual schema unchanged.
Data Modeling
Conceptual Data Models
Models data from a semantic perspective, based on the user's viewpoint. It is independent of any specific DBMS. A common model is the Entity-Relationship (E-R) Model.
- Entity: A distinguishable object (e.g., a Student, a Course). Represented by a rectangle.
- Attribute: A characteristic of an entity (e.g., StudentID, Name). Represented by an oval.
- Relationship: An association among entities. Represented by a diamond. Types include: One-to-One (1:1), One-to-Many (1:N), and Many-to-Many (M:N).
Logical Data Models
Describes data from an organizational perspective. The predominant model is the Relational Model.
- Relation: A table with these properties: each column contains atomic values, and the order of rows/columns is insignificant.
- Tuple: A row in a table (a record).
- Attribute: A column in a table.
- Domain: The set of permissible values for an attribute.
- Candidate Key: A minimal set of attributes that uniquely identifies a tuple.
- Primary Key: A selected candidate key.
- Foreign Key: An attribute (or set) in one relation that serves as the primary key of another relation, establishing a link.
Data Integrity Constraints
Rules to insure stored data is meaningful and correct.
- Entity Integrity: Every tuple must be uniquely identifiable (Primary Key cannot be NULL).
- Referential Integrity: A foreign key value must either be NULL or match an existing primary key value in the referenced relation.
- User-Defined Integrity: Custom rules defined for specific applications (e.g., salary > 0).
Structured Query Language (SQL)
SQL is the standard language for relational databases.
| SQL Function Category | Key Verbs |
|---|---|
| Data Definition (DDL) | CREATE, DROP, ALTER |
| Data Query (DQL) | SELECT |
| Data Manipulation (DML) | INSERT, UPDATE, DELETE |
| Data Control (DCL) | GRANT, REVOKE |
Core SQL Operations
Data Definition:
-- Create a table with constraints
CREATE TABLE Employee (
EmpID CHAR(7) PRIMARY KEY,
Name VARCHAR(50) NOT NULL,
DeptID INT,
Salary DECIMAL(10,2) DEFAULT 5000 CHECK (Salary >= 0),
CONSTRAINT FK_Dept FOREIGN KEY (DeptID) REFERENCES Department(DeptID)
);
Data Query:
-- Basic SELECT with WHERE, GROUP BY, HAVING
SELECT DeptID, COUNT(*) AS StaffCount, AVG(Salary) AS AvgSalary
FROM Employee
WHERE HireDate > '2020-01-01'
GROUP BY DeptID
HAVING AVG(Salary) > 6000
ORDER BY AvgSalary DESC;
-- JOIN example
SELECT e.Name, d.DeptName, p.ProjectName
FROM Employee e
INNER JOIN Department d ON e.DeptID = d.DeptID
LEFT JOIN ProjectAssignment pa ON e.EmpID = pa.EmpID
LEFT JOIN Project p ON pa.ProjectID = p.ProjectID;
-- Subquery example
SELECT Name FROM Employee
WHERE Salary > (SELECT AVG(Salary) FROM Employee);
Data Manipulation:
INSERT INTO Employee (EmpID, Name, DeptID) VALUES ('E001', 'Alice', 10);
UPDATE Employee SET Salary = Salary * 1.05 WHERE DeptID = 10;
DELETE FROM Employee WHERE EmpID = 'E999';
Views
A view is a virtual table derived from one or more base tables.
-- Create a view
CREATE VIEW HighEarners AS
SELECT EmpID, Name, Salary FROM Employee WHERE Salary > 8000;
-- Use the view
SELECT * FROM HighEarners ORDER BY Salary DESC;
Purposes: Simplify complex queries, provide customized data perspectives, enhance security, and improve logical data independence.
Database Normalization
A process to eliminate undesirable data dependencies, reducing redundancy and anomalies (Update, Insertion, Deletion).
Functional Dependency
A constraint where the value of one attribute set determines the value of another (X -> Y).
- Full Functional Dependency: Y is functionally dependent on X but not on any proper subset of X.
- Partial Functional Dependency: Y is dependent on a proper subset of X.
- Transitive Dependency: X -> Y, Y -> Z, and Y does not determine X.
Normal Forms
- First Normal Form (1NF): All attribute values are atomic (indivisible).
- Second Normal Form (2NF): In 1NF, and every non-primary-key attribute is fully functionally dependent on the primary key (eliminates partial dependencies).
- Third Normal Form (3NF): In 2NF, and no non-primary-key attribute is transitively dependent on the primary key (eliminates transitive dependencies).
- Boyce-Codd Normal Form (BCNF): A stronger form where every determinant is a cendidate key.
Transaction Management
A Transaction is a logical unit of work comprising a sequence of operations that must execute entirely or not at all.
ACID Properties
- Atomicity: All operations in a transaction are completed; if any part fails, the entire transaction is rolled back.
- Consistency: A transaction transitions the database from one valid state to another.
- Isolation: Concurrent transactions execute without interfering with each other.
- Durability: Once committed, a transaction's changes are permanent.
SQL Transaction Control
-- Explicit transaction in T-SQL style
BEGIN TRANSACTION;
UPDATE Account SET Balance = Balance - 100 WHERE AccountID = 1;
UPDATE Account SET Balance = Balance + 100 WHERE AccountID = 2;
-- If both updates succeed
COMMIT TRANSACTION;
-- If an error occurs, rollback
-- ROLLBACK TRANSACTION;
Concurrency Control
Manages simultaneous transaction execution to prevent data inconsistencies.
Problems from Uncontrolled Concurrency
- Lost Update: Two transactions read and update the same data, one overwriting the other's change.
- Dirty Read: A transaction reads uncomitted data from another transaction, which is later rolled back.
- Non-Repeatable Read: A transaction reads the same row twice and gets different values because another transaction modified it in between.
- Phantom Read: A transaction re-executes a query returning a set of rows and finds new rows inserted by another committed transaction.
Lock-Based Protocols
- Shared Lock (S Lock): For reading; allows other transactions to acquire S locks but not X locks.
- Exclusive Lock (X Lock): For writing; prevents any other locks on the data item.
- Two-Phase Locking (2PL) Protocol: Guarantees serializability. It has a growing phase (acquire locks) and a shrinking phase (release locks).
Deadlock Handling
- Prevention: Methods like requiring all locks at transaction start.
- Detection & Resolution: Using wait-for graphs; resolving by aborting a chosen transaction.
Database Backup and Recovery
Failure Types & Recovery
- Transaction Failure: Logical errors within a transaction. Recovery uses the log to UNDO the transaction's effects.
- System Failure (Soft Crash): Power loss, OS crash. Recovery involves UNDOing uncommitted transactions and REDOing committed ones whose changes may not have been written to disk.
- Media Failure (Hard Crash): Disk head crash. Recovery requires restoring from a backup and REDOing transactions from the log.
Backup Strategies
- Data Dumping: Full (complete copy) vs. Incremental (only changed data since last backup).
- Log File: A chronological record of all updates. Essential for recovery. The Write-Ahead Logging (WAL) rule states: log records must be written to stable storage before the corresponding data page is updated.
Database Design Process
A structured approach typically involving these phases:
- Requirements Analysis: Gather and specify data and functional needs, often using Data Flow Diagrams (DFDs) and a Data Dictionary.
- Conceptual Design: Create an abstract, DBMS-independent model (e.g., an E-R Diagram) representing entities, attributes, and relationships.
- Logical Design: Transform the conceptual model into a specific data model (e.g., relational schema).
- Physical Design: Define physical storage structures (file organization, indexing) for the logical schema on a given DBMS.
- Implementation & Tuning: Create the database, load data, and optimize performance.
E-R to Relational Mapping Rules
- Entity: Becomes a relation (table).
- Relationship:
- M:N: Becomes a separate relation with the primary keys of the participating entities as its composite primary key.
- 1:N: Add the primary key of the '1' side as a foreign key in the relation for the 'N' side.
- 1:1: Add the primary key of one side as a foreign key in the other relation.
Physical Design Considerations
- Indexing: Speeds up queries but slows down updates. Create indexes on frequently searched columns, used in JOINs, WHERE, ORDER BY, and GROUP BY clauses.
- Clustering: Storing tuples from one or more relations together on disk basedd on a common attribute (cluster key) to optimize range queries and joins.
- Hashing: A direct access method effective for equality searches.