Search Engine

PostgreSQL provides the full text search capability. Through full text search, you can build a text content search engine. This section describes how to quickly set up a search engine for website pages in PostgreSQL.

Create Content Table

Create a web page content table to store the content data of web pages.

CREATE TABLE IF NOT EXISTS site_page (
  id        serial PRIMARY KEY,
  url       text,
  image     text,
  title     text,
  content   text
);

The site_page table contains the following columns:

Name	Description
`id`	Identifier of the web page.
`url`	URL of the web page.
`image`	URL of a web page image.
`title`	Title of the web page.
`content`	The body of the web page.

Create Content Index Table

Create a web page content index table to store the index data of the web page content:

CREATE TABLE IF NOT EXISTS site_page_tsi (
  id        integer,
  title     tsvector,
  content   tsvector
);

CREATE INDEX site_page_title_idx ON site_page_tsi USING GIN (title);
CREATE INDEX site_page_content_idx ON site_page_tsi USING GIN (content);

The site_page_tsi table contains the following columns:

Name	Description
`id`	Identifier of the web page.
`title`	Lexemes list of the page title.
`content`	Lexemes list of the page body.

Of course, you can also combine the tables site_page and site_page_tsi into one table. Here, we recommend that you store webpage content data and content index data separately, which can improve the query performance of webpage content data and facilitate the maintenance and cleaning of index data separately.

By storing the content index data into a single table, we can view the storage space usage of this part of the data. In addition, we can also optimize and adjust the content-related lexemes list when adding web page data, and remove those words that we are not interested in. This can not only save storage space, but also improve the efficiency of full text search.

Store Web Page Data

You can add a page data record to the site_page table and add a lexemes index record to the site_page_tsi table.

BEGIN;

INSERT INTO site_page (url, image, title, content) VALUES
  ('https://doc.rockdata.net/',
   'https://doc.rockdata.net/brand.svg',
   'Redrock Postgres Documentation',
   'Redrock Postgres is an object-relational database management system based on PostgreSQL,
It is intentionally designed as an enterprise grade and cloud native database.')
  RETURNING id;
 id
----
  1

INSERT INTO site_page_tsi (id, title, content) VALUES
  (1,
   to_tsvector('english', 'Redrock Postgres Documentation'),
   to_tsvector('english',
   'Redrock Postgres is an object-relational database management system based on PostgreSQL,
It is intentionally designed as an enterprise grade and cloud native database.'));

COMMIT;

In the above example, we add a piece of webpage content data to the table site_page, get the identifier of the webpage content, and then generate the lexemes list of the webpage content using the function to_tsvector, specify the corresponding webpage identifier, and insert it into the content Index table site_page_tsi.

Search Web Content

Of course, we can search for matching web content based on search keywords. For example, we can execute the following query to search for web page content whose title matches the keyword “Redrock Documentation”:

SELECT sp.id, sp.url, sp.title
  FROM site_page AS sp
    LEFT JOIN site_page_tsi AS tsi ON sp.id = tsi.id
  WHERE tsi.title @@ plainto_tsquery('english', 'Redrock Documentation');

Result:

 id |            url            |             title
----+---------------------------+--------------------------------
  1 | https://doc.rockdata.net/ | Redrock Postgres Documentation

The above example is very simple. Generally, when you specify keywords to search web page content, many matching web page data may be returned. In this case, you may consider specifying LIMIT … OFFSET … to display the returned result set in pagination.