Database System Internals (PostgreSQL)

0 %

Course content

Resources
Chapter 1: The Anatomy of PostgreSQL: Architecture and Process Model
- 1.0: Overview & Readings
- 1.1: The Client/Server Model and Postmaster 10 xp
  - Quiz
- 1.2: Shared Memory Architecture 10 xp
  - Quiz
- 1.3: Core Background Workers
- 1.4: Utility Processes
- 1.5: The Backend Process
Chapter 2: The Journey of a Query: Lexing, Parsing, and the Traffic Cop
Chapter 3: The PostgreSQL Rule System and Query Rewriting
- 3.0 Overview & Readings
- 3.1: Rules vs Triggers
- 3.2: The Query Tree Structure
- 3.3: Views Under the Hood
- 3.4: The Rewriter in Action
Chapter 4: The Query Planner Part I: Statistics and Cost Estimation
- 4.0: Overview & Readings
- 4.1: The System Catalogs
- 4.2: The Role of ANALYZE
- 4.3: Understanding pg_statistic
- 4.4: Cost Constants
- 4.5: Calculating Cost
Chapter 5: The Query Planner Part II: Path Generation and GEQO
Chapter 6: The Executor: Processing the Plan Tree
- 6.0: Overview & Readings
- 6.1: The Volcano Execution Model
- 6.2: Executor Phases
- 6.3: Scan Nodes
- 6.4: Join and Materialize Nodes
- 6.5: Aggregate and Sort Nodes
Chapter 7: Advanced Indexing Under the Hood
- 7.0: Overview & Readings
- 7.1: B-Tree Internals
- 7.2: GiST Indexes
- 7.3: GIN Indexes
- 7.4: BRIN Indexes
- 7.5: Operator Classes
Chapter 8: Multiversion Concurrency Control (MVCC) and Vacuuming
Chapter 9: Memory Management and Caching Strategies
- 9.0: Overview & Readings
- 9.1: The Dual Caching Model
- 9.2: The Buffer Manager
- 9.3: Eviction Policies
- 9.4: Local Memory (work_mem)
- 9.5: Maintenance Memory
- 9.6: Assignment Reminder
Chapter 10: The Write-Ahead Log (WAL) and Crash Recovery
- 10.0: Overview & Readings
- 10.1: The Purpose of WAL
- 10.2: WAL Physical Structure
- 10.3: Physiological Logging
- 10.4: Checkpoints
- 10.5: Crash Recovery Mechanics
Chapter 11: Replication: Physical and Logical
Chapter 12: Distributed PostgreSQL and Sharding
Chapter 13: Extending the Engine
- 13.0: Overview & Readings
- 13.1: Extensibility Architecture
- 13.2: Custom Data Types
- 13.3: Background Worker Processes
- 13.4: Extension Hooks
- 13.5: Packaging Extensions
Project Based Assignments

0.0: Syllabus

Prerequisites

This is an advanced systems course. Students are expected to have successfully completed a foundational Database Management Systems course (or equivalent industry experience).

Required Knowledge: Proficiency in writing complex standard SQL (CTEs, Window Functions, Subqueries), a strong grasp of relational algebra, database normalization (up to 3NF), and basic data structure concepts (trees, hash tables).
Helpful but not required: Familiarity with C programming (for the engine-extension chapter) and basic Linux command-line navigation.

Target Audience

Upper-level computer science undergraduates or graduate students.
Aspiring Database Administrators (DBAs), Data Engineers, and Backend Systems Engineers who need to optimize high-throughput data applications.

Course Structure & Time Commitment

This course is structured for high retention, utilizing a micro-learning format paired with intensive, hands-on engineering projects.

Micro-Lessons: 13 core chapters divided into 15-minute focused lessons. (Approx. 65 total lessons / 16 hours of core lecture material).
Assessments: Each lessons will contain a "closed book" five question multiple-choice quiz and one short-form written response that can be completed with pen and paper. The course will have an open book multiple choice final covering all of the contents that can be completed in about one hour.
Project Work: Three major lab assignments requiring deep forensic analysis and tuning of a live PostgreSQL instance.
Estimated Total Commitment: 8–10 hours per week for a standard 14-week semester (roughly 120-140 total hours).

Core Learning Outcomes

By the end of this course, students will be able to:

Deconstruct the Execution Pipeline: Trace the exact path of a SQL string from lexical analysis through the cost-based optimizer and into the Volcano execution engine.
Optimize at the Physical Level: Read and interpret complex EXPLAIN (ANALYZE, BUFFERS) outputs to diagnose hardware-level bottlenecks, choosing the mathematically optimal access paths, join strategies, and index types (B-Tree, GiST, GIN, BRIN).
Manage Concurrency and Bloat: Explain the exact C-level implementation of Multiversion Concurrency Control (MVCC) and snapshot isolation, and tune the autovacuum daemon to prevent transaction ID wraparound and storage bloat.
Tune System Memory: Manipulate PostgreSQL’s dual-caching architecture by configuring shared_buffers, work_mem, and the clock-sweep eviction algorithm for specific OLTP or OLAP workloads.
Architect for Scale and Disaster Recovery: Configure physical Write-Ahead Log (WAL) streaming replication, logical decoding, and declarative partitioning using Foreign Data Wrappers (FDW) to build distributed, highly available data systems.

Outline

Chapter 1: The Anatomy of PostgreSQL: Architecture and Process Model

This chapter provides a structural overview of the PostgreSQL database engine. Students will explore the client/server model, the postmaster process, backend processes, and the shared memory architecture. We will cover the division of labor between background workers (e.g., checkpointer, background writer, WAL writer) and the memory structures they interact with.

Chapter 2: The Journey of a Query: Lexing, Parsing, and the Traffic Cop

We begin tracing the lifecycle of a query from the moment it hits the server. This chapter deconstructs the lexical analysis and parsing phases, where SQL text is transformed into an internal parse tree. Students will investigate how the "Traffic Cop" routes queries and handles utility versus optimizable statements.

Chapter 3: The PostgreSQL Rule System and Query Rewriting

Before a query is planned, it passes through the rewriter. This chapter explores how PostgreSQL implements views and rules by transforming parse trees into query trees. We will analyze the differences between the rule system and triggers, and how the rewriter flattens views for optimization.

Chapter 4: The Query Planner Part I: Statistics and Cost Estimation

An introduction to the cost-based optimizer. This chapter details how PostgreSQL predicts the computational cost of a query before running it. Students will evaluate the system catalogs (pg_class, pg_statistic), analyze how ANALYZE gathers data distributions (histograms, MCVs), and calculate specific cost parameters (e.g., seq_page_cost, random_page_cost) explicitly dictate planner decisions.

Chapter 5: The Query Planner Part II: Path Generation and GEQO

Building on cost estimation, this chapter covers how the planner generates and evaluates different execution paths. We will dissect the algorithms behind sequential scans vs. index scans, and analyze join strategies (Nested Loop, Hash Join, Merge Join). The chapter concludes with a look at Genetic Query Optimization (GEQO) for handling complex queries with many joins.

Chapter 6: The Executor: Processing the Plan Tree

Once the cheapest path is chosen, the plan tree is handed to the Executor. Students will study PostgreSQL's demand-driven, pipeline execution model (the Volcano model). We will examine the mechanics of specific execution nodes (e.g., Agg, Sort, Limit) and how tuples are fetched, processed, and passed up the tree to the client.

Chapter 7: Advanced Indexing Under the Hood

Moving beyond standard B-trees, this chapter explores the physical structure and use cases of PostgreSQL's specialized index types. We will dissect GiST (Generalized Search Tree), GIN (Generalized Inverted Index), BRIN (Block Range Index), and SP-GiST. Students will learn how to write operator classes and how the engine determines index applicability.

Chapter 8: Multiversion Concurrency Control (MVCC) and Vacuuming

A intensive exploration of transactional integrity without locking overhead. This chapter covers the hidden system columns (xmin, xmax), snapshot isolation, and how PostgreSQL resolves row visibility. We will examine the physical bloat caused by MVCC and the inner workings of the VACUUM and autovacuum processes to reclaim space and prevent transaction ID wraparound.

Chapter 9: Memory Management and Caching Strategies

This chapter covers how PostgreSQL interacts with RAM. We will contrast PostgreSQL's shared_buffers with the operating system page cache (double buffering). Students will learn how work_mem dictates the performance of sorts and hash tables, and how the clock-sweep algorithm handles buffer eviction..

Chapter 10: The Write-Ahead Log (WAL) and Crash Recovery

An exploration of durability. Students will learn the physical layout of the WAL, how physiological logging works, and the mechanics of the checkpointer process. We will walk through the exact sequence of events during crash recovery and how PostgreSQL replays the WAL to guarantee the ACID property of Durability.

Chapter 11: Replication: Physical and Logical

This chapter shifts focus to data distribution and high availability. We will compare physical streaming replication (byte-for-byte copy) with logical replication (publish/subscribe at the row level). Students will explore replication slots, the walreceiver/walsender processes, and logical decoding plugins.

Chapter 12: Distributed PostgreSQL and Sharding

Scaling PostgreSQL beyond a single node. This chapter covers native declarative partitioning, Foreign Data Wrappers (FDW), and parallel query execution. We will also examine architectures like Citus that transform PostgreSQL into a distributed database, discussing distributed query planning and two-phase commit protocols.

Chapter 13: Extending the Engine

The course concludes by exploring PostgreSQL's extensibility. Students will learn the architecture behind creating custom data types, writing background workers in C, and leveraging extension hooks to modify the parser, planner, or executor behavior without altering the core source code.