Redrock Postgres Documentation
Home Toggle Dark/Light/Auto mode Toggle Dark/Light/Auto mode Toggle Dark/Light/Auto mode Back to homepage

Query Plan

PostgreSQL devises a query plan for each query it receives. The structure of a query plan is a tree of plan nodes. Nodes at the bottom level of the tree are scan nodes: they return raw rows from a table. There are different types of scan nodes for different table access methods: sequential scans, index scans, and bitmap index scans. There are also non-table row sources, such as VALUES clauses and set-returning functions in FROM, which have their own scan node types. If the query requires joining, aggregation, sorting, or other operations on the raw rows, then there will be additional nodes above the scan nodes to perform these operations. Again, there is usually more than one possible way to do these operations, so different node types can appear here too.

Scan Nodes

This section lists the different scan nodes in PostgreSQL. Note that some nodes are general and several alternatives may be available in a particular query, with the actual node determined by the planning algorithm outlined above (e.g., an Index Scan and a Sequential Scan), and some are only suitable to specific situations and have no alternatives (e.g., a Foreign Scan).

For scan nodes that access table data, it’s useful to understand roughly how data is organized on disk: each table consists of one or more pages, and each page contains a small header and then chunks of data corresponding to the table rows. Because of the multi-version concurrency control mechanism in Redrock Postgres, a page may actually have multiple versions in the shared buffer area, with only one version “visible” to each transaction.

Indexes are separate structures that point to a specific page and then a value within that page. Different types of indexes support different index scan features: hash indexes can only look up specific values, but other index types like btree maintain values in a specific order, and support scanning the data in that order (or even backwards!). This gives index scans some interesting properties: for inequality predicates (e.g., WHERE value > 10 or WHERE value < 100), they can start or end the scan at the right spot, ignoring any non-matching values. They can also be used to satisfy aggregates like max and min by looking at the beginning or end of the index (and ignoring any NULL values), and can produce data in index order (sometimes useful for ORDER BY or Merge nodes).

Sequential Scan

This is the simplest way of fetching data from a table: it scans through every page of data sequentially. Like most other scans, this can apply a filter while reading data, but it needs to read the data first and then discard it. A sequential scan has no way to zero in on just the data you want: it always reads everything in the table. This is generally inefficient unless you need a large proportion of the table to answer your query, but is always available and sometimes may be the only option.

Related Properties:

Index Scan

An index scan uses an index to find either a specific row, or all rows matching a predicate. An index scan will either look up a single row at a time (for a query like WHERE id = 1234, or as the inner table in a nested loop, looking up the row matching the current outer row), or scan through a section of the table in order. An index scan must first look up each row in the index, and then check the actual table data for that index entry. The table data must be checked to ensure that the row it found is actually visible to the current transaction, and also to fetch any columns included in the query that are not present in the index. Because of this, an index scan actually has higher per-row overhead than a sequential scan: its real advantage is that it allows you to read only some of the rows in a table. If your query predicate is not very selective (that is, if few rows are filtered out), a sequential scan may still be more efficient than an index scan.

If your query predicate matches the index exactly, the scan will retrieve just the matching rows. If you have an additional predicate in your query, the index scan can filter rows as it’s reading them, just like a sequential scan.

Related Properties:

Index Only Scan

This is very similar to an Index Scan, but the data comes directly from the index and the visibility check is handled specially, so it can avoid looking at the table data entirely. An index-only scan is faster, but it’s not always available as an alternative to a regular index scan. It has two restrictions: the index type must support Index-Only Scans (the common btree index type always does) and (somewhat obviously) the query must only project columns included in the index. If you have a SELECT * query but don’t actually need all columns, you may be able to use an index-only scan just by changing the column list.

Related Properties:

Bitmap Heap Scan

A bitmap heap scan takes a row location bitmap generated by a Bitmap Index Scan (either directly, or through a series of bitmap set operations via BitmapAnd and BitmapOr nodes) and looks up the relevant data. Each chunk of a bitmap can either be exact (pointing directly to rows) or lossy (pointing to a page containing at least one row matching the predicate).

PostgreSQL prefers using exact blocks, but if limited work_mem is an issue, it will start using lossy blocks as well. The blocks are actually produced as lossy or exact by children of the bitmap heap scan, but that status is more relevant when the blocks are processed to fetch rows, so it is reflected in the Bitmap Heap Scan. If a bitmap block is lossy, the node will need to fetch the entire page, and re-check the specified index condition (since it doesn’t know which rows on the page are needed).

Related Properties:

Bitmap Index Scan

You can think of a bitmap index scan as a middle ground between a sequential scan and an index scan. Like an index scan, it scans an index to determine exactly what data it needs to fetch, but like a sequential scan, it takes advantage of data being easier to read in bulk. A plain index scan fetches one row location at a time from the index, and immediately visits that row in the table. A bitmap scan fetches all the row locations from the index in one go, sorts them using an in-memory “bitmap” data structure, and then visits the table rows in physical row location order.

The bitmap index scan actually operates in tandem with a Bitmap Heap Scan: it does not fetch the data itself. Instead of producing the rows directly, the bitmap index scan constructs a bitmap of potential row locations. It feeds this data to a parent Bitmap Heap Scan, which can decode the bitmap to fetch the underlying data, grabbing data page by page.

A Bitmap Heap Scan is the most common parent node of a Bitmap Index Scan, but a plan may also combine several different Bitmap Index Scans with BitmapAnd or BitmapOr nodes before actually fetching the underlying data. This allows PostgreSQL to use two different indexes at once to execute a query.

Related Properties:

CTE Scan

Scan the result of a common table expression.

Related Properties:

Custom Scan

Scan using a custom scan implementation, which can be added as a separate module and plug into standard PostgreSQL query planning and execution.

Foreign Scan

Scan on a Foreign Table.

Function Scan

Scans the result of a set-returning function (like unnest or regexp_split_to_table).

Related Properties:

Subquery Scan

A Subquery Scan is for scanning the output of a sub-query in the range table. We often need an extra plan node above the sub-query’s plan to perform expression evaluations (which we can’t push into the sub-query without risking changing its semantics).

Related Properties:

Table Sample Scan

Scan a table when the TABLESAMPLE feature is used. Note that this clause does change the semantics of your query, but if you’re looking to gather some statistics about data in a large table, it can be a lot more efficient than a full sequential scan.

Tid Scan

Similar to an Index Scan, but one that can only look up rows using the internal and unstable ctid identifier. You are unlikely to use this type of scan in your queries.

Values Scan

Scan the literal VALUES clause.

Work Table Scan

Scans the work table used in evaluating a recursive common table expression.

Join Nodes

This section describes the three types of join mechanisms in PostgreSQL.

Joins are performed on two tables at a time; if more tables are joined together, the output of one join is treated as input to a subsequent join. When joining a large number of tables, the genetic query optimizer settings may affect what combinations of joins are considered.

Hash Join

Build a hash table from the inner table, keyed by the join key. Then scan the outer table, checking if a corresponding value is present. If the hash table would exceed work_mem, this process needs to happen in several batches writing temporary files to disk, which becomes dramatically slower.

Related Properties:

Merge Join

Joins two children already sorted by their shared join key. This only needs to scan each relation once, but both inputs need to be sorted by the join key first (or scanned in a way that produces already-sorted output, like an index scan matching the required sort order).

Related Properties:

Nested Loop Join

For each row in the outer table, iterate through all the rows in the inner table and see if they match the join condition. If the inner relation can be scanned with an index, that can improve the performance of a Nested Loop Join. This is generally an inefficient way to process joins but is always available and sometimes may be the only option.

Related Properties:

Materialize Nodes

This section lists the different materialize nodes in PostgreSQL. These nodes materialize the result of its child node in memory (to avoid re-computing the values).

Aggregate

An Agg node implements plain or grouped aggregation. For grouped aggregation, PostgreSQL can work with presorted input or unsorted input; the latter strategy uses an internal hashtable.

Related Properties:

Bitmap And

Generate a bitmap of the intersection of two physical row location bitmaps (that is, only locations that occur in both bitmaps). The bitmaps can come from Bitmap Index Scans or other BitmapOr or BitmapAnd child nodes.

Note that, due to internal implementation limitations, BitmapAnd nodes do not track the number of rows they produce. Their row count will always be listed as “Unknown” and they will not be flagged as mis-estimates.

Bitmap Or

Generate a bitmap of the union of two physical row location bitmaps (that is, locations that occur in either bitmap). The bitmaps can come from Bitmap Index Scans or other BitmapOr or BitmapAnd child nodes.

Note that, due to internal implementation limitations, BitmapOr nodes do not track the number of rows they produce. Their row count will always be listed as “Unknown” and they will not be flagged as mis-estimates.

Group

Used for queries with GROUP BY (but no aggregates) specified. The input must be presorted according to the grouping columns.

Related Properties:

  • Group Key

Hash

Reads data into a hash table, where it can easily be looked up by the hash key. This is used for hash joins and hash aggregates.

Related Properties:

Materialize

Materialize the result of its child node in memory (to avoid re-computing the values).

Result

If no outer plan, evaluate a variable-free targetlist. If outer plan, return tuples from outer plan (after a level of projection as shown by targetlist). If with constant qualifications, it represents a one-time qualification test (i.e., one that doesn’t depend on any variables from the outer plan, so needs to be evaluated only once).

SetOp

Combines two datasets for set operations like UNION, INTERSECT, and EXCEPT. Note that the query structure for SetOps is different than you may expect: rather than being the direct parent of the sets on which it operates, a SetOp node only has a single Append child, which has a Subquery Scan for each node to combine.

Related Properties:

Sort

Takes data from a child node and produces sorted output, using either memory if available (depending on the work_mem setting) or “spilling” to disk. This is obviously necessary if the output needs to be sorted, though sometimes the input can be scanned in an already-sorted manner instead—e.g., if scanning a btree index compatible with the desired sort order.

Related Properties:

Unique

Like the UNIX command uniq, takes sorted input and eliminates adjacent duplicates. Useful for DISTINCT clauses if the input is already sorted (e.g., the query also includes an ORDER BY).

Window Aggregate

Implements aggregation in window functions.

Control Nodes

This section lists the different control nodes in PostgreSQL. These nodes control the behavior of its child nodes.

Append

Generate the concatenation of the results of sub-plans. For example, this can be used for a UNION ALL query.

Gather Merge

Merge the results of pre-sorted worker outputs—similar to the Merge Append node.

Related Properties:

Gather

Gather data from multiple workers—similar to the Append node.

Related Properties:

Limit

Takes data from a child node and produces sorted output, using either memory if available (depending on the work_mem setting) or “spilling” to disk. This is obviously necessary if the output needs to be sorted, though sometimes the input can be scanned in an already-sorted manner instead—e.g., if scanning a btree index compatible with the desired sort order.

Related Properties:

Lock Rows

Executes the locking behavior of a FOR UPDATE, FOR NO KEY UPDATE, FOR SHARE or FOR KEY SHARE clause.

Merge Append

Merge the results of pre-sorted sub-plans to preserve the ordering.

Related Properties:

Modify Table

Apply rows produced by a subplan to a result table by inserting, updating, or deleting rows corresponding to the input.

Related Properties:

  • Conflict Arbiter Indexes
  • Conflict Filter
  • Conflicting Tuples
  • Conflict Resolution
  • Operation
  • Rows Removed by Conflict Filter
  • Tuples Inserted

Project Set

Apply a projection that includes set-returning functions to the output tuples of the outer plan.

Recursive Union

Generate a recursive union of two sub-plans. This is used in evaluating recursive common table expressions.

Node Properties

This section lists properties in plan nodes.

Filter

When present, this is a filter used to remove rows.

The important thing to note is that this is a filter in the traditional sense: these rows are read in (either from a data source or another operation in the plan), inspected, and then approved or removed based on the filter.

Although similar in purpose to the “Index Cond” that you see on Index Scans or Index Only Scans, the implementation is completely different. In an “Index Cond”, the index is used to select rows based on their indexed values, without examining the rows themselves at all. In fact, in some cases, you might see an “Index Cond” and a “Filter” on the same operation.

If you notice a lot of rows being filtered late in a query plan, this might be due to an operation (like a GROUP BY or SORT) using more columns than are necessary, requiring joins to take place before rows can be discarded. It is also sometimes a result of unnecessary or ill-advised uses of DISTINCT.

Index Cond

The condition used to find the locations of rows from the index.

Postgres uses the structured nature of the index to quickly jump to the rows it’s looking for. Different index types use different strategies.

Although similar in purpose to “Filter”, the implementation is completely different. In a “Filter”, rows are retrieved and then discarded based on their values. As such, you can find an “Index Cond” and a “Filter” on the same operation.

Rows Removed by Filter

The rows removed by the Filter. This is a per-loop average.

If a high proportion of rows are being removed, you may want to investigate whether a (more) selective index could help.

Rows Removed by Index Recheck

The number of rows returned by the index scan that do not meet the condition and are subsequently removed.

This will either be the result of a lossy bitmap scan, or an index type that doesn’t guarantee that a row matches the condition.

Scan Direction

Postgres is able to do a forward or backward scan of a (b-tree) index to get data in sorted order. In the default (text) format, the direction is forward unless stated otherwise.

You may also see scan direction of “NoMovement” in scans of other index types.

Heap Fetches

The number of rows Postgres had to look up in the table, rather than the index, during an index-only scan.

Exact Heap Blocks

The number of non-lossy blocks a bitmap heap scan visited.

If a row bitmap is too big to fit into working memory (work_mem), some parts of it are made “lossy” – i.e. they refer to whole pages rather than specific rows.

Lossy Heap Blocks

The number of lossy blocks a bitmap heap scan visited.

If a row bitmap is too big to fit into working memory (work_mem), some parts of it are made “lossy” – i.e. they refer to whole pages rather than specific rows.

Recheck Cond

The Recheck Cond is the condition a bitmap scan might use to filter rows out after fetching them.

It is only needed if the bitmap scan is lossy, or if it has used any lossy index types that do not guarantee that rows match the condition (BRIN indexes are one example).

For more info on whether a recheck was actually needed, look out for non-zero Lossy Heap Blocks and non-zero Rows Removed by Index Recheck.

Hash Cond

The condition used to match rows from the outer plan to the inner plan in a hash join.

Inner Unique

This is true if no more than one inner row could match any given outer row. This allows the executor to skip searching for additional matches.

Join Filter

A filter that can happen before the join, allowing Postgres to avoid looking up the rows and filtering them out afterwards.

Join Type

Which type of join was performed: Inner, Full, Left, Right, Semi, or Anti.

In text format plans, you can assume it is an inner join unless stated otherwise.

Rows Removed by Join Filter

The number of rows removed by the join filter.

If a high proportion of rows are being removed, you may want to investigate whether a (more) selective index could help.

Merge Cond

The condition used to match rows from the outer plan to the inner plan in a Merge Join.

Partial Mode

If this is Simple, the operation happens in one step.

Otherwise, Partial operations do chunks of the operation in parallel, and a separate Finalize operation joins the different parts together.

From the docs: “Aggregate functions which support Partial Mode are eligible to participate in various optimizations, such as parallel aggregation.” [source]

Strategy

Which strategy of aggregation was performed: Plain, Hashed, Sorted, or Mixed.

In non-text format query plans, this lets you see whether Postgres is doing a HashAggregate (Hashed), a GroupAggregate (Sorted), or a MixedAggregate (Mixed).

Hash Batches

If the hash is done in memory, there will only be a single batch.

A difference with Original Hash Batches is due to Postgres choosing to increase the number of batches in order to reduce memory consumption.

Hash Buckets

Hashed data is assigned to hash buckets.

Buckets are doubled until there are enough, so they are always a power of 2.

Sometimes Postgres will choose to increase the original number of buckets in order to reduce the number of tuples per bucket.

Original Hash Batches

If the hash is done in memory, there will only be a single batch.

A difference with Hash Batches is due to Postgres choosing to increase the number of batches in order to reduce memory consumption.

Original Hash Buckets

Hashed data is assigned to hash buckets.

Buckets are doubled until there are enough, so they are always a power of 2.

A difference with Hash Buckets is due to Postgres choosing to increase the original number of buckets in order to reduce the number of tuples per bucket.

Peak Memory Usage

The memory, in kB, used in the largest batch.

Command

The set operation: Intersect or Except.

UNION operations are handled by Append.

Sort Method

If the memory it requires is lower than your work_mem setting, Postgres can use a quicksort, or (if there is a LIMIT) a top-N heapsort.

Otherwise, it will do an external sort on-disk, which is slow but may be necessary for a large sort. If the sort is by a single column, or multiple columns from the same table, you may be able to avoid it entirely by adding an index with the desired order.

An external sort has a lower footprint than a quicksort, so it is possible to see external sorts with sort space used lower than work_mem.

Sort Key

The order by condition.

If the sort is by a single column, or multiple columns from the same table, you may be able to avoid the sort entirely by adding an index with the desired order.

Sort Space Type

Whether the sort was done in memory or on disk.

On-disk sorts have a lower footprint than in-memory sorts, so it is possible to see on-disk sorts with sort space used lower than work_mem.

Sort Space Used

The amount of memory or disk space, in kB, used for the sort.

An external sort has a lower footprint than a quicksort, so it is possible to see external sorts with sort space used lower than work_mem.

Workers Launched

The number of additional worker processes. This does not include the main process.

How Postgres determines this number, and any discrepancy with Workers Planned, are well documented in the official parallel query docs.

Workers Planned

The number of additional worker processes requested by the planner. This does not include the main process.

How Postgres determines this number, and any discrepancy with Workers Launched, are well documented in the official parallel query docs.

Single Copy

This is true if Postgres will not execute a plan more than once.

As such, if you see this alongside “Workers Planned: 1” then the main process would (or did) not do anything in parallel.

If you do not see it in a text format query plan, you can assume it is false.