PostgreSQL Basics and Practice

《后来的我们 - 五月天》

Preface

Recently, I’ve been thinking about organizing and summarizing the technical points and tools commonly used in my work. On one hand, this helps me review these knowledge points and deepen my memory of their usage. On the other hand, it can serve as a reference for future use.

Currently, I’m planning to cover several core areas including databases, CI/CD (GitHub Actions + GitLab CI), containers (Docker + k8s), operations (Ansible, etc.), as well as some summaries of language features, practical Git techniques, and Shell scripting tips. Since I’ve only encountered many of these topics at work and done some extended learning on my own, they may not fully align with specific enterprise practices (mostly based on my own experiences and understanding). I hope this can be helpful.

This article is the PostgreSQL part of the database series. I’ve previously summarized MySQL, which you can refer to in “MySQL Basic Knowledge and Related Operations”.

Overview of Data and Databases

Data

Fundamentally, data is a type of fact or observed result, a logical summary of objective matters, and a form and carrier of information. People have been managing data since ancient times (even before the concept existed), initially through manual management, then gradually through file systems (like libraries, managing different information by categories), and finally, with the development of computer technology, forming the more convenient and efficient mode of database management.

Database

A database is a repository that organizes, stores, and manages data according to a certain data structure. Its main characteristics are:

Structured
Shareable
Low redundancy
High independence
Easy to expand

It’s easy to understand that data organized according to different relationships/structures have different characteristics and are suitable for different application scenarios. Currently, there are mainly three types: hierarchical databases, network databases, and relational databases. PostgreSQL, which we will focus on, is a relational database.

Database Management System (DBMS)

A Database Management System (DBMS) is a system that performs various operations on databases. It has core functions such as establishing and maintaining databases, organizing and managing data storage, controlling databases, defining data, manipulating data, and managing communication between data. Different database management systems handle databases and data differently, and the way data is presented also varies. It often requires choosing an appropriate database management system based on factors such as data scale and business requirements. For example, in situations with massive data and high-concurrency data read/write operations, the performance of relational databases can deteriorate significantly.

Relational Database Management System (RDBMS)

Main Characteristics

Relational databases primarily present data in the form of data tables, with each row representing a record and each column corresponding to the data field of the record name. Many rows and columns form a single table, and several single tables form a database. Users/systems query the database through SQL (Structured Query Language).

Some relational database operations have transactional properties, namely the ACID rules:

Atomicity
Consistency
Isolation
Durability

Atomicity means that a series of transaction operations must either all complete or all fail; there is no situation where only part is completed. For example, in a bank transfer scenario, after the transfer occurs, the sender’s balance decreases, but if a database operation error occurs and the receiver’s balance does not increase, it would cause serious problems.

Consistency means that after a transaction is completed, the data in the entire database is consistent; there should be no situation where the same data is out of sync within the database.

Isolation means that different transactions should run independently and without interference. Of course, this sacrifices some efficiency but provides better guarantees for data accuracy.

Durability means that when a transaction is completed, its changes to the database and its effects on the system are permanent.

Data Integrity

Data integrity is a very important requirement and property of databases, referring to the consistency and reliability of data stored in the database. It is mainly divided into four types:

Entity integrity
Domain integrity
Referential integrity
User-defined integrity

Entity integrity requires that each data table has a unique identifier, and the primary key field in each table cannot be empty or duplicate, mainly meaning that the data in the table can be uniquely distinguished.

Domain integrity is to make some additional restrictions on the columns in the table, such as restricting data types, check constraints, setting default values, whether to allow null values, and value range.

--- Constraining field uniqueness when creating a table
CREATE TABLE person (
    id INT NOT NULL auto_increment PRIMARY KEY,
    name VARCHAR(30),
    id_number VARCHAR(18) UNIQUE
);

Referential integrity means that the database does not allow referencing non-existent entities. Tables in the database often have some associations with other tables, and referential integrity can be ensured through foreign key constraints.

User-defined integrity is to make some semantic restrictions on data according to specific application scenarios and involved data, such as balance cannot be negative, etc. It is generally constrained and limited by setting rules, stored procedures, and triggers.

Mainstream RDBMS

Currently, the mainstream relational database management systems include:

SQL Server
Sybase
DB2
Oracle
MySQL
PostgreSQL

Oracle, MySQL, and PostgreSQL are more commonly used by enterprises and individuals. Next, we will use PostgreSQL as an example for detailed operation explanations.

PostgreSQL

Installation and Configuration

PostgreSQL is a modern open-source object-relational database management system.

For individual users learning to use it, you can directly download the software installation package to set up a local environment. You can choose different versions according to different systems, and it also has a convenient graphical interface for starting, stopping, restarting services, and making related configurations. This article uses PostgreSQL 14 on macOS as an example. After installing and performing basic settings from the official website, you can manage the local PostgreSQL service. The version may vary slightly, but the core functions are not much different.

Graphical Interface

Open the PostgreSQL.app application, and you will see the following interface:

mac_postgres_interface

In this management interface, you can conveniently start and stop the PostgreSQL service. Clicking on the corresponding database also allows you to enter the command-line operation interface.

Command-line Interface

First, we add the psql path to the environment variables for subsequent use. I use zsh, so I add the following content to the ~/.zshrc file:

# postgres
export PATH=${PATH}:/Applications/Postgres.app/Contents/Versions/14/bin

After that, enter psql in the terminal to access the PostgreSQL command-line interface. You can use the following command to view the psql command list:

psql --help

Connecting to PostgreSQL

We can connect to the database using the following command:

# Connect to the database
psql -h <host> -p <port> -U <username> <database-name>

Of course, we can also use some third-party tools to connect to the database more conveniently. I currently use TablePlus, which supports PostgreSQL databases, and I recommend it.

Command-line Interaction

PostgreSQL provides powerful command-line interaction functionality. We can use \ + keyword to perform operations. We can view command details and help information by consulting the documentation or using the \? and help commands. Other commonly used commands are as follows:

# View help
help

# View psql command details
\?

# View databases (all)
\l

# View databases (specific)
\l <database-name>

# Connect to a database
\c <database-name>

# View methods
\df

# View tables (all)
\d

# View tables (tables only)
\dt

# View tables (specific)
\d <table-name>

# Execute SQL commands
\i <filepath>/<filename>

# Open expanded view
\x

# Export to CSV
\copy (SELECT * FROM person LEFT JOIN car ON person.car_id = car.id) TO 'path/to/output.csv' DELIMITER ',' CSV HEADER;

# Exit
\q

Core Syntax

After configuring and connecting to the local PostgreSQL, we can perform some operations on the database. SQL language is mainly divided into the following four categories:

DDL (Data Definition Language)
DML (Data Manipulation Language)
DQL (Data Query Language)
DCL (Data Control Language)

DDL Operations

--- Create database
CREATE DATABASE <database-name>;

--- Delete database
DROP DATABASE <database-name>;

# Enter a specific database
\c <database-name>;

--- Create table (add constraints)
CREATE TABLE person (
    id BIGSERIAL NOT NULL PRIMARY KEY,
    first_name VARCHAR(50) NOT NULL,
    last_name VARCHAR(50) NOT NULL,
    gender VARCHAR(10) NOT NULL,
    date_of_birth DATE NOT NULL,
    country_of_birth VARCHAR(50),
    email VARCHAR(150)
);

--- Modify table
ALTER TABLE person ADD PRIMARY KEY(id);

--- Delete column
ALTER TABLE person DROP column email;

--- Delete entire table
DROP TABLE person;

DML Operations

--- Insert data
INSERT INTO person (
    first_name,
    last_name,
    gender,
    date_of_birth
) VALUES ('Yu', 'ZHANG', 'MALE', DATE '1997-06-06');

--- Modify data content
UPDATE person SET email = '[email protected]' WHERE id = 20;

--- Delete data content
DELETE FROM person WHERE id = 1;

You can use the ON CONFLICT keyword to handle conflicts:

--- Do nothing when a conflict occurs
INSERT INTO person (
    first_name,
    last_name,
    gender,
    date_of_birth
) VALUES ('Yu', 'ZHANG', 'MALE', DATE '1997-06-06') ON CONFLICT (id) DO NOTHING;

--- Update specified fields when a conflict occurs
INSERT INTO person (
    first_name,
    last_name,
    gender,
    date_of_birth
) VALUES ('Yu', 'ZHANG', 'MALE', DATE '1997-06-06') ON CONFLICT (id) DO UPDATE SET email = EXCLUDED.email;

DQL Operations

You can query the table using the SELECT command. The most commonly used command to view the entire table is:

--- View all data in the table
SELECT * FROM person;

--- Query data (specific fields)
SELECT first_name, last_name FROM person;

You can use the WHERE keyword for conditional queries and combination queries with multiple conditions:

--- Query data (condition filtering, WHERE | AND | OR | comparison > | >= | < | <= | = | <>)
SELECT * FROM person WHERE gender = 'MALE' AND (country_of_birth = 'Poland' OR country_of_birth = 'China');

IN, BETWEEN, LIKE, and ILIKE are also some keywords that can be used flexibly for queries.

IN can help us filter multiple values of a certain field.

--- Query data (using IN keyword)
SELECT * FROM person WHERE country_of_birth IN ('China', 'Brazil', 'France');

BETWEEN can help us filter a range of a certain field.

--- Query data (using BETWEEN keyword)
SELECT * FROM person WHERE date_of_birth BETWEEN DATE '2021-10-10' AND '2022-08-31';

LIKE can help us perform some fuzzy searches for inclusion relationships. % can match any character, _ can match a single character. ILIKE is case-insensitive LIKE.

--- Query data (using LIKE/ILIKE keywords, _ | %)
SELECT * FROM person WHERE email LIKE '%@bloomberg.%';
SELECT * FROM person WHERE email LIKE '________@google.%';
SELECT * FROM person WHERE country_of_birth ILIKE 'p%';

In practical applications, the data volume of data tables is often very large, and data needs to be grouped according to corresponding conditions. This requires the use of the GROUP BY keyword, and HAVING is used for further filtering conditions. GROUP BY needs to be used in conjunction with aggregate functions.

--- Query data (using GROUP BY keyword for grouped queries, using HAVING keyword to add conditions, using AS for result aliases)
SELECT country_of_birth, COUNT(*) AS Amount FROM person GROUP BY country_of_birth HAVING Amount > 5 ORDER BY country_of_birth;

Sometimes we only need to return unique values and need to remove duplicate data. In this case, we can use the DISTINCT keyword.

--- Query data (remove duplicates)
SELECT DISTINCT country_of_birth FROM person;

In practical applications, it is also very likely that we need to rank certain product transaction volumes, arrange some values, or display blog posts in chronological order, etc. This requires the use of the ORDER BY keyword, which defaults to ASC ascending order, and can be manually set to DESC for descending order.

--- Query data (sort ASC | DESC)
SELECT * FROM person ORDER BY id, country_of_birth;

At the same time, some databases have very large amounts of data, and returning all data at once is resource-consuming. Therefore, we can use the LIMIT keyword to constrain the number of returned records, and use OFFSET to specify the offset.

--- Query data (specify quantity and offset)
SELECT * FROM person OFFSET 5 LIMIT 10;
SELECT * FROM person OFFSET 5 FETCH FIRST 5 ROW ONLY;

Core Concepts

PRIMARY KEY

The primary key is the unique identity record in the data table, created and modified with the following commands:

--- Add primary key
CREATE TABLE person (
    id BIGSERIAL NOT NULL PRIMARY KEY
);

--- Modify primary key
ALTER TABLE person ADD PRIMARY KEY(id);

The primary key usually uses SERIAL/BIGSERIAL incremental INT values, or UUID can be used as the primary key.

CREATE TABLE person (
    id UUID NOT NULL PRIMARY KEY
);

FOREIGN KEY

A foreign key is a special type of primary key that is the primary key of another table, created and modified with the following commands:

--- Add foreign key
CREATE TABLE person (
    id BIGSERIAL NOT NULL PRIMARY KEY,
    car_id BIGINT REFERENCES car (id),
    UNIQUE(car_id)
);

--- Modify foreign key
CREATE TABLE car (
    id BIGSERIAL NOT NULL PRIMARY KEY
)

JOIN

A join query refers to connecting data from multiple tables during a query to retrieve more information. In SQL, we can use the JOIN keyword to implement join queries, use the LEFT JOIN keyword to implement left join queries, and use the RIGHT JOIN keyword to implement right join queries.

--- JOIN query
SELECT * FROM person
JOIN car ON person.car_id = car.id

--- LEFT JOIN query
SELECT * FROM person
LEFT JOIN car ON person.car_id = car.id

You can use the USING keyword to simplify the use of the ON keyword.

SELECT * FROM person
LEFT JOIN car USING (car_id);

Constraints

CONSTRAINT is used to restrict the data in the data table. We can add constraints using the following command:

ALTER TABLE person ADD CONSTRAINT gender_constraint CHECK (gender = 'Female' OR gender = 'Male');

For example, add UNIQUE to indicate uniqueness:

CREATE TABLE person (
    id BIGSERIAL NOT NULL PRIMARY KEY,
    email VARCHAR(150) UNIQUE
);

ALTER TABLE person ADD CONTRAINT unique_email_address UNIQUE (email);

Useful Techniques

Aggregate Functions

Many aggregate functions are built-in, such as COUNT, SUM, AVG, MIN, MAX, etc., used for aggregate calculations on data.

COALESCE

When querying data, we can use COALESCE to fill in default values:

--- Use COALESCE to fill in default values
SELECT COALESCE(email, 'Email Not Provided') FROM person;

NULLIF

Using the NULLIF keyword, when the second parameter is the same as the first, it returns NULL, otherwise it returns the first parameter. This is used to prevent errors such as division by zero.

SELECT COALESCE(10 / NULLIF(0, 0), 0);

Time

The display format for time is as follows:

SELECT NOW();
SELECT NOW()::DATE;
SELECT NOW()::TIME;

We can perform calculations on time:

SELECT NOW() - INTERVAL '1 YEAR';
SELECT NOW() + INTERVAL '10 MONTHS';
SELECT (NOW() - INTERVAL '3 DAYS')::DATE;

We can use EXTRACT to get a certain part of the time:

SELECT EXTRACT(YEAR FROM NOW());
SELECT EXTRACT(MONTH FROM NOW());
SELECT EXTRACT(DAY FROM NOW());
SELECT EXTRACT(DOW FROM NOW());
SELECT EXTRACT(CENTURY FROM NOW());

We can use the AGE keyword to calculate age differences:

SELECT first_name, last_name, AGE(NOW(), date_of_birth) AS age FROM person;

Extended Support

PostgreSQL provides many extensions to implement richer functionality.

Installing Extensions

CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

View Extension Methods

df

Using Extension Methods

SELECT uuid_generate_v4();

Conclusion

The above is my explanation of the basic knowledge and practical operations of PostgreSQL. I hope it’s helpful to you.

References

PostgreSQL Official Website
Postgres.app Official Website
TablePlus Official Website

2024-08-05

Weekly Review #67 - Reshaping My Information Input System with Follow

2024-07-30

Weekly Review #66 - 10x Engineers, Technical Passion and Personal Toolkit

2022-09-07

Docker Fundamentals and Practices

2021-03-29

MySQL Fundamentals and Practice

Author

pseudoyu

Backend & Smart Contract Developer, MSc Graduate in ECIC(Electronic Commerce and Internet Computing) @ The University of Hong Kong (HKU). Love to learn and build things. Follow me on GitHub

Comments