Preface#
Recently, I thought about organizing and summarizing the technical points and tools commonly used in my work. On one hand, it helps to clarify these knowledge points and deepen memory of usage, and it can also serve as a reference for future use.
Currently, I mainly plan to introduce core topics related to databases, CI/CD (GitHub Actions + GitLab CI), containers (Docker + k8s), and operations (Ansible, etc.), as well as some summaries of language features, practical Git tips, Shell scripts, and other techniques. Since I have only touched on many of these topics at work and have done some extended learning on my own, they may not fully align with specific corporate practices (mostly based on my own experience and understanding). I hope this can be helpful.
This article is part of the database series focusing on PostgreSQL. I have already summarized MySQL, which can be referenced here — 『MySQL Basics and Practice』.
Overview of Data and Databases#
Data#
First of all, data is essentially a fact or an observed result. It is a logical summary of objective events and a form and carrier of information. People have been managing data for a long time (even before the concept existed). Initially, it was managed manually, and later, file systems emerged (similar to libraries, categorizing different information). With the development of computer technology, a more convenient and efficient model of managing data using databases was formed.
Database#
A database is a repository that organizes, stores, and manages data according to a certain data structure. Its main characteristics are:
- Structured
- Shareable
- Low redundancy
- High independence
- Easy to expand
It is easy to understand that data organized according to different relationships/structures has different characteristics and is suitable for different application scenarios. Currently, databases are mainly divided into hierarchical databases, network databases, and relational databases, with PostgreSQL being the relational database we will focus on.
Database Management System (DBMS)#
A Database Management System (DBMS) is a system for performing various operations on databases. It has core functions such as establishing and maintaining databases, organizing and managing data storage, controlling databases, defining data, manipulating data, and managing communication between data. Different DBMSs handle databases and data differently, and the way data is presented also varies. It is often necessary to choose an appropriate DBMS based on data scale, business needs, and other scenarios. For example, in cases of massive data and high concurrent data reads and writes, the performance of relational databases can decline significantly.
Relational Database (RDBMS)#
Main Features#
Relational databases mainly present data in the form of tables, where each row is a record and each column corresponds to a data field (Field) of the record. Many rows and columns make up a single table, and several single tables constitute a database. Users/systems query the database using SQL (Structured Query Language).
Some operations in relational databases are transactional, adhering to the ACID rules:
- Atomicity
- Consistency
- Isolation
- Durability
Atomicity means that a series of transaction operations must either all complete or all fail; there cannot be a situation where only part of the operations are completed. For example, in a bank transfer scenario, after the transfer occurs, the sender's balance decreases, and if there is an operational error in the database and the recipient's balance does not increase, it can cause serious problems.
Consistency means that after a transaction is completed, the data in the entire database should be consistent, and there should not be cases of the same data being out of sync within the database.
Isolation means that different transactions should run independently and not interfere with each other. Of course, this sacrifices some efficiency but provides better guarantees for data accuracy.
Durability means that once a transaction is completed, the changes it made to the database and its impact on the system are permanent.
Data Integrity#
Data integrity is a crucial requirement and attribute of databases, meaning that the data stored in the database should maintain consistency and reliability. It is mainly divided into the following four types:
- Entity integrity
- Domain integrity
- Referential integrity
- User-defined integrity
Entity integrity requires that each data table has a unique identifier, and the primary key fields in each table cannot be null or duplicated. This mainly means that the data in the table can be uniquely distinguished.
Domain integrity involves imposing additional restrictions on the columns in the table, such as limiting data types, checking constraints, setting default values, allowing null values, and defining value ranges.
--- Enforcing uniqueness on fields when creating a table
CREATE TABLE person (
id INT NOT NULL auto_increment PRIMARY KEY,
name VARCHAR(30),
id_number VARCHAR(18) UNIQUE
);
Referential integrity means that the database does not allow references to non-existent entities. There are often associations between tables in the database, which can be ensured through foreign key constraints.
User-defined integrity refers to applying semantic restrictions to the data based on specific application scenarios and the data involved, such as ensuring that balances cannot be negative. This is generally enforced through rules, stored procedures, and triggers.
Mainstream RDBMS#
Currently, the mainstream relational databases include the following:
- SQL Server
- Sybase
- DB2
- Oracle
- MySQL
- PostgreSQL
The most commonly used ones by enterprises and individuals are Oracle, MySQL, and PostgreSQL. The following sections will provide detailed operational explanations using PostgreSQL as an example.
PostgreSQL#
Installation and Configuration#
PostgreSQL is a modern open-source object-relational database management system.
As an individual user, you can directly download the software installation package to set up a local environment. You can choose different versions based on different systems, and it also provides a convenient graphical interface for starting, stopping, restarting services, and making related configurations. This article takes PostgreSQL 14
on macOS as an example. After installing and performing basic settings from the official website, you can manage the PostgreSQL service on your machine. The version may vary slightly, but the core functionalities are not significantly different.
Graphical Interface#
Opening the PostgreSQL.app application, you can see the following interface:
In this management interface, you can easily start and stop the PostgreSQL service. Clicking on the corresponding database also allows you to enter the command line operation interface.
Command Line Interface#
First, we need to add the path of psql
to the environment variable for future use. I use zsh
, so I added the following content to the ~/.zshrc
file:
# postgres
export PATH=${PATH}:/Applications/Postgres.app/Contents/Versions/14/bin
After that, entering psql
in the terminal will allow access to the PostgreSQL command line interface. You can use the following command to view the list of psql commands:
psql --help
Connecting to PostgreSQL#
We can connect to the database using the following command:
# Connect to the database
psql -h <host> -p <port> -U <username> <database-name>
Of course, we can also use some third-party tools to connect to the database more conveniently. The tool I currently use, TablePlus, supports PostgreSQL databases and is recommended.
Command Line Interaction#
PostgreSQL provides powerful command line interaction capabilities. We can use \
+ keywords to perform operations. We can refer to the documentation or use \?
and the help
command to view command details and help information. Other commonly used commands are as follows:
# View help
help
# View psql command details
\?
# View databases (all)
\l
# View database (specific)
\l <database-name>
# Connect to database
\c <database-name>
# View functions
\df
# View tables (all)
\d
# View tables (only tables)
\dt
# View table (specific)
\d <table-name>
# Execute SQL command
\i <filepath>/<filename>
# Open extended view
\x
# Export to CSV
\copy (SELECT * FROM person LEFT JOIN car ON person.car_id = car.id) TO 'path/to/output.csv' DELIMITER ',' CSV HEADER;
# Exit
\q
Core Syntax#
After configuring and connecting to the local PostgreSQL, we can perform some operations on the database. SQL language is mainly divided into the following four categories:
- DDL Data Definition Language
- DML Data Manipulation Language
- DQL Data Query Language
- DCL Data Control Language
DDL Operations#
--- Create database
CREATE DATABASE <database-name>;
--- Drop database
DROP DATABASE <database-name>;
# Enter a specific database
\c <database-name>;
--- Create table (add constraints)
CREATE TABLE person (
id BIGSERIAL NOT NULL PRIMARY KEY,
first_name VARCHAR(50) NOT NULL,
last_name VARCHAR(50) NOT NULL,
gender VARCHAR(10) NOT NULL,
date_of_birth DATE NOT NULL,
country_of_birth VARCHAR(50),
email VARCHAR(150)
);
--- Modify table
ALTER TABLE person ADD PRIMARY KEY(id);
--- Drop column
ALTER TABLE person DROP column email;
--- Drop entire table
DROP TABLE person;
DML Operations#
--- Insert data
INSERT INTO person (
first_name,
last_name,
gender,
date_of_birth
) VALUES ('Yu', 'ZHANG', 'MALE', DATE '1997-06-06');
--- Update data
UPDATE person SET email = '[email protected]' WHERE id = 20;
--- Delete data
DELETE FROM person WHERE id = 1;
You can use the ON CONFLICT
keyword to handle conflicts:
--- Do nothing on conflict
INSERT INTO person (
first_name,
last_name,
gender,
date_of_birth
) VALUES ('Yu', 'ZHANG', 'MALE', DATE '1997-06-06') ON CONFLICT (id) DO NOTHING;
--- Update specified fields on conflict
INSERT INTO person (
first_name,
last_name,
gender,
date_of_birth
) VALUES ('Yu', 'ZHANG', 'MALE', DATE '1997-06-06') ON CONFLICT (id) DO UPDATE SET email = EXCLUDED.email;
DQL Operations#
You can query tables using the SELECT
command. The most commonly used command to view all data in a table is:
--- View all data in the table
SELECT * FROM person;
--- Query data (specific fields)
SELECT first_name, last_name FROM person;
You can use the WHERE
keyword for conditional queries and combine multiple conditions:
--- Query data (conditional filtering, WHERE | AND | OR | comparison > | >= | < | <= | = | <>)
SELECT * FROM person WHERE gender = 'MALE' AND (country_of_birth = 'Poland' OR country_of_birth = 'China');
IN
, BETWEEN
, LIKE
, and ILIKE
are also flexible keywords for querying.
IN
helps filter multiple values for a specific field.
--- Query data (using IN keyword)
SELECT * FROM person WHERE country_of_birth IN ('China', 'Brazil', 'France');
BETWEEN
helps filter a range for a specific field.
--- Query data (using BETWEEN keyword)
SELECT * FROM person WHERE date_of_birth BETWEEN DATE '2021-10-10' AND '2022-08-31';
LIKE
helps perform fuzzy searches based on inclusion, where %
can match any character and _
can match a single character. ILIKE
is a case-insensitive version of LIKE
.
--- Query data (using LIKE/ILIKE keywords, _ | %)
SELECT * FROM person WHERE email LIKE '%@bloomberg.%';
SELECT * FROM person WHERE email LIKE '________@google.%';
SELECT * FROM person WHERE country_of_birth ILIKE 'p%';
In practical applications, data tables often contain a large volume of data, which may require grouping data based on relevant conditions. This is where the GROUP BY
keyword comes into play, along with HAVING
for further filtering conditions. GROUP BY
needs to be used with aggregate functions.
--- Query data (using GROUP BY keyword for grouping, using HAVING keyword to add conditions, using AS for aliasing results)
SELECT country_of_birth, COUNT(*) AS Amount FROM person GROUP BY country_of_birth HAVING Amount > 5 ORDER BY country_of_birth;
Sometimes, we only need to return unique values and eliminate duplicate data, which can be done using the DISTINCT
keyword.
--- Query data (remove duplicates)
SELECT DISTINCT country_of_birth FROM person;
In practical applications, when the data volume is very large, returning all data at once can be resource-intensive. Therefore, the LIMIT
keyword can be used to constrain the number of returned records, and OFFSET
can specify the offset.
--- Query data (specifying quantity and offset)
SELECT * FROM person OFFSET 5 LIMIT 10;
SELECT * FROM person OFFSET 5 FETCH FIRST 5 ROW ONLY;
Core Concepts#
PRIMARY KEY#
The primary key uniquely identifies records in a data table, created and modified using the following commands:
--- Add primary key
CREATE TABLE person (
id BIGSERIAL NOT NULL PRIMARY KEY
);
--- Modify primary key
ALTER TABLE person ADD PRIMARY KEY(id);
The primary key typically uses SERIAL/BIGSERIAL
to increment INT
values, but UUID
can also be used as a primary key.
CREATE TABLE person (
id UUID NOT NULL PRIMARY KEY
);
FOREIGN KEY#
A foreign key is a special type of primary key that is the primary key of another table, created and modified using the following commands:
--- Add foreign key
CREATE TABLE person (
id BIGSERIAL NOT NULL PRIMARY KEY,
car_id BIGINT REFERENCES car (id),
UNIQUE(car_id)
);
--- Modify foreign key
CREATE TABLE car (
id BIGSERIAL NOT NULL PRIMARY KEY
)
JOIN#
Join queries refer to connecting data from multiple tables during a query to retrieve more information. In SQL, we can use the JOIN
keyword to perform join queries and the LEFT JOIN
keyword for left joins, and the RIGHT JOIN
keyword for right joins.
--- JOIN query
SELECT * FROM person
JOIN car ON person.car_id = car.id
--- LEFT JOIN query
SELECT * FROM person
LEFT JOIN car ON person.car_id = car.id
You can simplify the use of ON
with the USING
keyword.
SELECT * FROM person
LEFT JOIN car USING (car_id);
Constraints#
CONSTRAINTs are used to limit the data in tables. We can add constraints using the following command:
ALTER TABLE person ADD CONSTRAINT gender_constraint CHECK (gender = 'Female' OR gender = 'Male');
For example, adding UNIQUE
to enforce uniqueness:
CREATE TABLE person (
id BIGSERIAL NOT NULL PRIMARY KEY,
email VARCHAR(150) UNIQUE
);
ALTER TABLE person ADD CONSTRAINT unique_email_address UNIQUE (email);
Common Techniques#
Aggregate Functions#
Many built-in aggregate functions, such as COUNT
, SUM
, AVG
, MIN
, MAX
, are available for aggregating data.
COALESCE#
When querying data, we can use COALESCE
to fill in default values:
--- Using COALESCE to fill in default values
SELECT COALESCE(email, 'Email Not Provided') FROM person;
NULLIF#
Using the NULLIF
keyword, when the second parameter is equal to the first, it returns NULL
; otherwise, it returns the first parameter, which helps prevent errors when the divisor is 0
.
SELECT COALESCE(10 / NULLIF(0, 0), 0);
Time#
The display format for time is as follows:
SELECT NOW();
SELECT NOW()::DATE;
SELECT NOW()::TIME;
We can perform calculations on time:
SELECT NOW() - INTERVAL '1 YEAR';
SELECT NOW() + INTERVAL '10 MONTHS';
SELECT (NOW() - INTERVAL '3 DAYS')::DATE;
You can use EXTRACT
to get a specific part of the time:
SELECT EXTRACT(YEAR FROM NOW());
SELECT EXTRACT(MONTH FROM NOW());
SELECT EXTRACT(DAY FROM NOW());
SELECT EXTRACT(DOW FROM NOW());
SELECT EXTRACT(CENTURY FROM NOW());
You can calculate age differences using the AGE
keyword:
SELECT first_name, last_name, AGE(NOW(), date_of_birth) AS age FROM person;
Extension Support#
PostgreSQL provides many extensions to achieve richer functionality.
Install Extensions#
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
View Extension Methods#
df
Use Extension Methods#
SELECT uuid_generate_v4();
Conclusion#
This concludes my explanation of the basic knowledge and practical operations of PostgreSQL. I hope it is helpful to you.