Search Engine
PostgreSQL provides the full text search capability. Through full text search, you can build a text content search engine. This section describes how to quickly set up a search engine for website pages in PostgreSQL.
Create a web page content table to store the content data of web pages.
CREATE TABLE IF NOT EXISTS site_page (
id serial PRIMARY KEY,
url text,
image text,
title text,
content text
);
The site_page
table contains the following columns:
Name | Description |
---|---|
id |
Identifier of the web page. |
url |
URL of the web page. |
image |
URL of a web page image. |
title |
Title of the web page. |
content |
The body of the web page. |
Create a web page content index table to store the index data of the web page content:
CREATE TABLE IF NOT EXISTS site_page_tsi (
id integer,
title tsvector,
content tsvector
);
CREATE INDEX site_page_title_idx ON site_page_tsi USING GIN (title);
CREATE INDEX site_page_content_idx ON site_page_tsi USING GIN (content);
The site_page_tsi
table contains the following columns:
Name | Description |
---|---|
id |
Identifier of the web page. |
title |
Lexemes list of the page title. |
content |
Lexemes list of the page body. |
Of course, you can also combine the tables
site_page
andsite_page_tsi
into one table. Here, we recommend that you store webpage content data and content index data separately, which can improve the query performance of webpage content data and facilitate the maintenance and cleaning of index data separately.By storing the content index data into a single table, we can view the storage space usage of this part of the data. In addition, we can also optimize and adjust the content-related lexemes list when adding web page data, and remove those words that we are not interested in. This can not only save storage space, but also improve the efficiency of full text search.
You can add a page data record to the site_page
table and add a lexemes index record to the site_page_tsi
table.
BEGIN;
INSERT INTO site_page (url, image, title, content) VALUES
('https://doc.rockdata.net/',
'https://doc.rockdata.net/brand.svg',
'Redrock Postgres Documentation',
'Redrock Postgres is an object-relational database management system based on PostgreSQL,
It is intentionally designed as an enterprise grade and cloud native database.')
RETURNING id;
id
----
1
INSERT INTO site_page_tsi (id, title, content) VALUES
(1,
to_tsvector('english', 'Redrock Postgres Documentation'),
to_tsvector('english',
'Redrock Postgres is an object-relational database management system based on PostgreSQL,
It is intentionally designed as an enterprise grade and cloud native database.'));
COMMIT;
In the above example, we add a piece of webpage content data to the table site_page
, get the identifier of the webpage content, and then generate the lexemes list of the webpage content using the function to_tsvector
, specify the corresponding webpage identifier, and insert it into the content Index table site_page_tsi
.
Of course, we can search for matching web content based on search keywords. For example, we can execute the following query to search for web page content whose title matches the keyword “Redrock Documentation”:
SELECT sp.id, sp.url, sp.title
FROM site_page AS sp
LEFT JOIN site_page_tsi AS tsi ON sp.id = tsi.id
WHERE tsi.title @@ plainto_tsquery('english', 'Redrock Documentation');
Result:
id | url | title
----+---------------------------+--------------------------------
1 | https://doc.rockdata.net/ | Redrock Postgres Documentation
The above example is very simple. Generally, when you specify keywords to search web page content, many matching web page data may be returned. In this case, you may consider specifying LIMIT … OFFSET … to display the returned result set in pagination.