Monitoring time-based lag in PostgreSQL pglogical replication

Vijay Reddy G
Jun 11, 2021

Pglogical, a logical replication extension for PostgreSQL, allows individual tables or sets of tables to be replicated to target servers in AWS cloud environments. However, it lacks a built-in feature for measuring time-based replication lag, which makes it difficult to monitor the status of replication. The same gap exists in on-premises PostgreSQL pglogical replication setups.

Due to the absence of a built-in time-based monitoring tool for RDS/Aurora PostgreSQL, I developed a custom monitoring solution to check the status of replication.

The idea is simple. Create a heartbeat table with two columns, 'dbname' to identify the publisher and 'last_updated', on both the publisher and the subscriber. The table holds at most two rows, one for each direction of replication. A Lambda function runs periodically and updates the 'last_updated' column on the source side for each direction. If the 'last_updated' timestamps on the publisher and subscriber match, replication is considered healthy. If there is a lag, the time difference is calculated, and an alert is sent when it exceeds the configured lag threshold. This solution handles both unidirectional and bidirectional replication and can be scheduled as per the customer's requirement.
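
For illustration, here is a minimal sketch of the heartbeat update and lag check, assuming psycopg2 and placeholder schema, database, and connection names (the exact code lives in the repository linked below):

import psycopg2

LAG_THRESHOLD_SECONDS = 300  # example threshold; keep in step with the schedule

def touch_heartbeat(pub_conn, schema, publisher_dbname):
    # Bump last_updated on the publisher; pglogical replicates the change.
    with pub_conn.cursor() as cur:
        cur.execute(
            f"update {schema}.pglogical_heartbeat_monitor "
            "set last_updated = now() where dbname = %s",
            (publisher_dbname,),
        )
    pub_conn.commit()

def current_lag_seconds(pub_conn, sub_conn, schema, publisher_dbname):
    # Compare the heartbeat timestamp on both sides; the difference is the lag.
    query = (
        f"select last_updated from {schema}.pglogical_heartbeat_monitor "
        "where dbname = %s"
    )
    with pub_conn.cursor() as cur:
        cur.execute(query, (publisher_dbname,))
        pub_ts = cur.fetchone()[0]
    with sub_conn.cursor() as cur:
        cur.execute(query, (publisher_dbname,))
        sub_ts = cur.fetchone()[0]
    return (pub_ts - sub_ts).total_seconds()

# pub_conn = psycopg2.connect(...publisher...); sub_conn = psycopg2.connect(...subscriber...)
# if current_lag_seconds(pub_conn, sub_conn, 'public', 'pubdb') >= LAG_THRESHOLD_SECONDS:
#     send an SNS alert (see the SNS section below)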

The solution handles the following scenarios:

  • Can also be adapted for on-premises setups with minor modifications.
  • Monitors both unidirectional and bidirectional replication in a single function.
  • Alerts whenever the lag is greater than or equal to the lag threshold.
  • Alerts again when the lag is resolved, either manually or automatically.
  • No duplicate alerts on the same day if the lag is not resolved.
  • The script can easily be extended to suppress duplicate alerts for a specified period of time, as described below:

To suppress duplicate alerts for a certain period of time, say x hours, the script can be modified to store the date and time of the last alert in Parameter Store. Currently, I store only the date (without the timestamp) as the 'last alerted date' in Parameter Store. By including the timestamp in this value, we can track the frequency of alerts more precisely and suppress duplicates within the specified window (x hours); see the sketch after this list.
- Set the initial 'last alerted date' for the publisher to a default date.
- If the 'last alerted date' is still the default date, update it with the current date and timestamp.
- Compare the difference between the 'last alerted date' and the current date and time. If the difference is greater than or equal to x hours, send an alert; otherwise, ignore it. Additional edge cases can be handled accordingly.

  • Sends alerts on any exceptions or errors, such as API exceptions or database connection failures.
  • In fact, the same approach can be used for any database where replication is configured.
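
The following is a rough sketch of that suppression logic, assuming the last alert time is kept in SSM Parameter Store under a hypothetical parameter name (not the parameter layout used by the current script):

import boto3
from datetime import datetime, timedelta

ssm = boto3.client("ssm")
PARAM_NAME = "/pglogical/last_alerted_at"  # hypothetical parameter name
SUPPRESS_HOURS = 4                         # the "x hours" window

def should_alert(now=None):
    # Returns True (and records the alert time) only if the last alert
    # is older than the suppression window.
    now = now or datetime.utcnow()
    last = ssm.get_parameter(Name=PARAM_NAME)["Parameter"]["Value"]
    last_dt = datetime.fromisoformat(last)
    if now - last_dt >= timedelta(hours=SUPPRESS_HOURS):
        ssm.put_parameter(Name=PARAM_NAME, Value=now.isoformat(),
                          Type="String", Overwrite=True)
        return True
    return False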

Code samples are available in the Git repository below:

https://github.com/VijayReddyG81/pglogical_replication_lag

AWS pglogical monitoring architecture:

Create heartbeat table:

Create the heartbeat table in both the publisher and subscriber databases as shown below. Remember to use the correct schema, and update the Python code in the Lambda function with the correct queries and schema names.

Publisher database:

create table {schema_name}.pglogical_heartbeat_monitor(dbname varchar(50) primary key, last_updated timestamp);
insert into {schema_name}.pglogical_heartbeat_monitor values('{provide publisher db name here}', now());

Subscriber database:

create table {schema_name}.pglogical_heartbeat_monitor(dbname varchar(50) primary key, last_updated timestamp);

In case of bidirectional replication, also run the insert command below on the subscriber database:

insert into {schema_name}.pglogical_heartbeat_monitor values('{provide subscriber db name here}', now());

Create SNS topic:

Create a new SNS topic for alerts, or reuse the SNS topic that was created when the RDS/Aurora instances were built. The IAM policy created later must reference the same SNS topic used here.
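
For reference, a minimal example of publishing an alert to the topic from Python (the topic ARN below is a placeholder):

import boto3

sns = boto3.client("sns")
sns.publish(
    TopicArn="arn:aws:sns:us-east-1:123456789012:pglogical-rep-monitor-alerts",  # placeholder ARN
    Subject="pglogical replication lag detected",
    Message="Lag between publisher and subscriber exceeded the configured threshold.",
)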

Create secrets:

To connect to the databases, the Lambda function retrieves the passwords from Secrets Manager. You can either create a new secret or reuse the secret ARN that was created during the build process for each publisher and subscriber database.
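
A sketch of fetching a database password from Secrets Manager, assuming a JSON secret with a 'password' key (the secret ARN and key name are placeholders for your own secret layout):

import json
import boto3

secrets = boto3.client("secretsmanager")
resp = secrets.get_secret_value(
    SecretId="arn:aws:secretsmanager:us-east-1:123456789012:secret:publisher-db-secret"  # placeholder
)
password = json.loads(resp["SecretString"])["password"]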

Create a Policy:

Create a custom policy named "pglogical-rep-monitor-access-policy" that grants access to Parameter Store, the SNS topic, CloudWatch Logs, Secrets Manager, and Lambda invocation.

While creating the policy, copy and paste the pg_logical_lag_policy.json snippet from the Git repository and replace any resource names as needed.

Create a Role:

Create a role named pglogical-rep-monitor-access-role, attach the policies below, and tag it appropriately:

At the AWS console: IAM -> Roles -> Create Role with the name "pglogical-rep-monitor-access-role" -> AWS Service -> choose Lambda as the use case -> click Next -> search for the policies below and attach them -> set a permissions boundary if required -> tag the role appropriately.

In addition to the custom policy above, attach the standard Lambda execution and VPC access policies.

Create Layers:

The script uses the psycopg2 Python package, which should be provided to the Lambda function as a layer. In addition, an RDS connection certificate is required to connect to the RDS database and should also be supplied as a layer. Alternatively, you can bundle these files with your Python Lambda function code and upload the resulting zip to the Lambda function.

Go to AWS Lambda -> Layers -> Create Layer:
Name: lambda-pglogical-rep-monitor-psycopg2-layer
Description: A layer of extra code for "psycopg2"
Upload the zip "lambda-pglogical-rep-monitor-psycopg2-layer.zip" from the Git repository.
Runtime: Python 3.8, then create the layer.

Create another layer for the RDS DB connection certificate. Go to AWS Lambda -> Layers -> Create Layer:
Name: lambda-pglogical-rep-monitor-rdsdbconnectcerti-layer
Description: A certificate to connect to the RDS database from the Lambda function
Upload the zip "lambda-pglogical-rep-monitor-rdsdbconnectcerti-layer.zip" from the Git repository.

Create a lambda function:

At the AWS console: Lambda -> Functions -> Create Function -> Author from scratch -> Function Name: lambda-pglogical-rep-monitor -> Runtime: Python 3.8 -> Role: use the existing role pglogical-rep-monitor-access-role -> Network: select the appropriate VPC, subnets, and security groups -> click Create Function. Tag the function appropriately.

Add layers to the lambda function:

Add the layers to the Lambda function using the layer ARNs created in the "Create Layers" section by clicking Add Layer -> Specify an ARN.

Add python code to the lambda function:

In the code section of the function, copy and paste the "pglogical_lag_monitor.py" code from the Git repository. Before pasting the code, adjust the variable values.
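
The exact variable names are defined in pglogical_lag_monitor.py; the values you typically need to fill in look something like the following (every value below is a placeholder):

# Illustrative placeholders only -- use the variable names from the repository script.
PUBLISHER_ENDPOINT = "publisher-db.xxxxxxxx.us-east-1.rds.amazonaws.com"
SUBSCRIBER_ENDPOINT = "subscriber-db.xxxxxxxx.us-east-1.rds.amazonaws.com"
DB_PORT = 5432
SCHEMA_NAME = "public"
PUBLISHER_SECRET_ARN = "arn:aws:secretsmanager:us-east-1:123456789012:secret:publisher-db-secret"
SUBSCRIBER_SECRET_ARN = "arn:aws:secretsmanager:us-east-1:123456789012:secret:subscriber-db-secret"
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:pglogical-rep-monitor-alerts"
LAG_THRESHOLD_SECONDS = 300  # keep in step with the CloudWatch schedule below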

Check VPC:

In the Configuration section, check that the VPC details are properly configured.

Test the Function:

In Test -> Create a dummy event and save it.

Edit the timeout:

Increase the function's default execution timeout from 3 seconds to 30 seconds:
Configuration -> General configuration -> Edit
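
If you prefer scripting this change, the equivalent boto3 call is roughly (function name as created above):

import boto3

boto3.client("lambda").update_function_configuration(
    FunctionName="lambda-pglogical-rep-monitor",
    Timeout=30,  # seconds
)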

Execute the function:

Click Code -> Deploy -> Test, and review the result.

Schedule the Function:

Use CloudWatch Events to trigger the Lambda function periodically or at a specific time. Ensure that the schedule in CloudWatch and the lag threshold in the code are consistent:
1. Log in to the console and navigate to CloudWatch.
2. Under Events, select Rules and click "Create Rule".
3. Select the event source as "Schedule" with a fixed rate of 5 minutes.
4. Select the target as a Lambda function and choose the function "lambda-pglogical-rep-monitor".
5. Select the default version and configure the input as "Matched event".
6. Click Configure Details and provide a name.
7. Add a description, set the state to Enabled, and click Create Rule.
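
If you would rather script the schedule than click through the console, roughly equivalent boto3 calls look like this (account ID, region, and ARNs are placeholders):

import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:lambda-pglogical-rep-monitor"

# Fire every 5 minutes; keep this in sync with the lag threshold in the code.
rule = events.put_rule(
    Name="pglogical-rep-monitor-schedule",
    ScheduleExpression="rate(5 minutes)",
    State="ENABLED",
)

# Allow CloudWatch Events to invoke the function.
lambda_client.add_permission(
    FunctionName="lambda-pglogical-rep-monitor",
    StatementId="pglogical-rep-monitor-schedule",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)

# Point the rule at the Lambda function.
events.put_targets(
    Rule="pglogical-rep-monitor-schedule",
    Targets=[{"Id": "pglogical-rep-monitor", "Arn": FUNCTION_ARN}],
)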

Log Output:

The log output of this function can be found at:
CloudWatch -> Logs -> Log groups -> /aws/lambda/lambda-pglogical-rep-monitor
Note: You can reduce the log retention period by clicking Actions -> Edit retention setting.


Vijay Reddy G

Solutions Architect, interested in cloud, databases and ML