PostgreSQL HyperLogLog (HLL) & Cardinality Estimation

By squashlabs, Last Updated: Oct. 30, 2023

PostgreSQL HLL is an extension that adds HyperLogLog (HLL), a probabilistic data structure for approximate distinct counting (cardinality estimation), to PostgreSQL. It estimates the number of distinct elements in a large dataset using far less memory and computation than an exact count requires.

Exact counting methods, such as COUNT(DISTINCT ...), must sort or hash every value, which becomes slow and memory-intensive on large tables. PostgreSQL HLL instead summarizes the values in a compact sketch and returns an approximate distinct count. The error is small and tunable: for standard HyperLogLog the relative standard error is roughly 1.04 / sqrt(m), where m is the number of registers in the sketch.

Example 1: Using PostgreSQL HLL for Cardinality Estimation

To illustrate the usage of PostgreSQL HLL for cardinality estimation, let's consider a scenario where we have a table named "users" with a column named "email" containing email addresses of users. We want to estimate the number of distinct email addresses in the table using PostgreSQL HLL.

First, we need to enable the PostgreSQL HLL extension. Assuming we are using PostgreSQL version 12 or higher and the extension's files are already installed on the database server, we can enable it in the current database with the following command:

CREATE EXTENSION hll;
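If CREATE EXTENSION fails, it can help to check whether the extension is available on the server at all. The following is a minimal sketch using the standard pg_available_extensions catalog view; it assumes the extension is registered under the name hll, as in the command above.

```sql
-- List the hll extension if its files are present on this server.
-- installed_version is NULL when the extension is available
-- but not yet enabled in the current database.
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name = 'hll';

-- Idempotent variant of the command above: does nothing
-- if the extension is already enabled in this database.
CREATE EXTENSION IF NOT EXISTS hll;
```

An empty result from the first query suggests the extension package still needs to be installed on the server before it can be enabled.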

Once the extension is enabled, we can build an HLL sketch for the "email" column. Each value is first hashed with hll_hash_text(), and the hll_add_agg() aggregate then folds all hashed values into one sketch. (The related hll_add() function is not an aggregate: it adds a single hashed value to an existing sketch.)

SELECT hll_add_agg(hll_hash_text(email)) AS hll_sketch
FROM users;

The above query hashes every email address in the "users" table and aggregates the hashes into a single HLL sketch, from which the distinct count of email addresses can be estimated.
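For contrast with the aggregate form, a sketch can also be built one value at a time. The following is a minimal illustration, assuming the extension's hll_empty() and two-argument hll_add(hll, hll_hashval) functions; the two email literals are made-up sample data.

```sql
-- Start from an empty sketch and add two hashed values one at a time.
-- The email literals are hypothetical sample data.
SELECT hll_cardinality(
         hll_add(
           hll_add(hll_empty(), hll_hash_text('alice@example.com')),
           hll_hash_text('bob@example.com')
         )
       ) AS estimated_distinct_count;
```

This form suits incremental updates to a stored sketch, whereas the aggregate is the natural fit for scanning a whole column.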

To retrieve the estimated distinct count from the HLL sketch, we can use the hll_cardinality() function. This function takes the HLL sketch as an argument and returns the approximate count of distinct elements.

SELECT hll_cardinality(hll_sketch) AS estimated_distinct_count
FROM (
  SELECT hll_add_agg(hll_hash_text(email)) AS hll_sketch
  FROM users
) AS subquery;

The above query will calculate the estimated distinct count of email addresses in the "users" table using PostgreSQL HLL.
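To get a feel for the accuracy on your own data, one option is to compute the exact and the estimated count side by side. The following is a sketch under the same assumptions as the example above (a "users" table with an "email" column); exact_distinct_count is a column alias introduced here for illustration.

```sql
-- Compare the exact distinct count with the HLL estimate in one pass.
-- COUNT(DISTINCT ...) is the expensive exact baseline;
-- hll_add_agg() builds the sketch and hll_cardinality() reads the estimate.
SELECT
  COUNT(DISTINCT email)                              AS exact_distinct_count,
  hll_cardinality(hll_add_agg(hll_hash_text(email))) AS estimated_distinct_count
FROM users;
```

On a large table the exact count is the slow part, so a check like this is best run once on a sample or a test copy rather than in production queries.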

Example 2: Using PostgreSQL HLL for Multiple Columns

PostgreSQL HLL also supports estimating the distinct count of multiple columns. This can be useful when dealing with composite keys or when estimating the cardinality of a combination of columns.

Let's consider a scenario where we have a table named "orders" with two columns: "user_id" and "product_id". We want to estimate the number of distinct combinations of "user_id" and "product_id" using PostgreSQL HLL.

To achieve this, we can hash each combination of "user_id" and "product_id" with the hll_hash_any() function and aggregate the hashes with hll_add_agg(). hll_hash_any() accepts a value of any type; here we pass an array holding the two column values, which requires the two columns to have compatible data types.

SELECT hll_add_agg(hll_hash_any(ARRAY[user_id, product_id])) AS hll_sketch
FROM orders;

The above query builds a single HLL sketch covering every combination of "user_id" and "product_id" in the "orders" table.

To retrieve the estimated distinct count from the HLL sketch, we can use the hll_cardinality() function, as shown in the previous example.

SELECT hll_cardinality(hll_sketch) AS estimated_distinct_count
FROM (
  SELECT hll_add_agg(hll_hash_any(ARRAY[user_id, product_id])) AS hll_sketch
  FROM orders
) AS subquery;

The above query will calculate the estimated distinct count of combinations of "user_id" and "product_id" in the "orders" table using PostgreSQL HLL.
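When the columns have different types and cannot share an array, one alternative is to combine them into a single text value and hash that. The following is a sketch of that approach, not something prescribed by the extension; the ':' delimiter is an assumption chosen because it cannot appear in the text form of an integer id.

```sql
-- Hash each (user_id, product_id) pair as delimited text.
-- The delimiter keeps pairs such as (1, 23) and (12, 3) distinct.
SELECT hll_cardinality(
         hll_add_agg(
           hll_hash_text(user_id::text || ':' || product_id::text)
         )
       ) AS estimated_distinct_count
FROM orders;
```

If the columns can contain the delimiter character, pick a different separator or another encoding so that distinct pairs never produce the same string.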
