Rally has used PostgreSQL as a backing database for some of our applications for several years. We started, as many organizations do, rolling our own database instances on virtual servers. That gave us great configurability at a cost of equally great complexity. Eventually we realized that Amazon’s RDS hosted Postgres option solved all of our needs while reducing the number of moving parts that our database admins were responsible for. We switched and have been really happy with the stability and performance we’ve seen.
One of the features that didn’t easily make the transition to a hosted database was our metrics collection agent. We use Datadog to store and visualize metrics across our infrastructure. When the database was on a server that we controlled, installation of an agent to pull metrics was simple. Authentication was implied by the same-server connection, and our collector could comfortably expect superuser privileges as it ran. In a hosted environment, there’s not really a “database server” in the same sense. Changing our mental models around how we manage databases proved to be at least as hard as changing the technology.
Luckily, we have grown as an organization in other ways at the same time. Namely, we deploy our applications as collections of Docker containers with good success. Could we leverage that capability to make metric collection easier?
Turns out the answer is yes. Datadog provides a pre-containerized copy of its agent that covers the basic use case well. With Docker containers, it’s pretty easy to modify an upstream image and store your deltas locally. What’s even easier is to override the command that the Docker container runs when it starts to set up its configuration on demand. That keeps us from having to deal with versioning and storing a fork of the upstream container. That helped move this project forward faster at the cost of having to maintain knowledge about what command the container should run outside of the Dockerfile itself.
Getting the database ready begins with creating a least-privileged user that has access to metrics. First, create the user.
CREATE ROLE datadog WITH CREATEDB LOGIN PASSWORD 'somesecretpassword';
Datadog wants general, non-superuser access to the statistics database. That’s enough to get both general as well as “function” metrics.
GRANT SELECT ON pg_stat_database to datadog;
Our developers were really interested in a metric around how many active queries are running at any given moment. In order to see what queries are being run on what databases by other users, we need to expand the privileges that the datadog user has. By default, the datadog user would only be able to see its own queries, which isn't helpful in this circumstance.
Do this as the master user that AWS sets up as an administrator for you.
CREATE OR REPLACE FUNCTION get_sudo_statactivity() RETURNS SETOF pg_stat_activity AS
$$ SELECT * FROM pg_catalog.pg_stat_activity; $$
LANGUAGE sql VOLATILE SECURITY DEFINER;
The beginning of the function simply sets up a SELECT query on the pg_stat_activity view. The more interesting part is the SECURITY DEFINER at the end. That clause indicates that the function should run with the permissions of the user who created the function instead of the permissions of the user who called it. The equivalent idea in Unix is the “setuid” bit on executable files.
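To see the difference, compare what the datadog user gets from the raw view versus the function. (This is a hypothetical sanity check, not part of the setup; the “&lt;insufficient privilege&gt;” placeholder is what Postgres shows non-superusers for other sessions’ query text.)

```sql
-- Run while connected as the datadog user.
-- Raw view: rows for other users come back with
-- query = '<insufficient privilege>'.
SELECT usename, query FROM pg_stat_activity;

-- SECURITY DEFINER function: the same rows with the query text filled in,
-- because it runs with the creator's (the master user's) permissions.
SELECT usename, query FROM get_sudo_statactivity();
```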
Now that we have a function, we can create a view off of it that makes it easier for other users to read.
CREATE OR REPLACE VIEW pg_stat_activity_allusers AS SELECT * FROM get_sudo_statactivity();
In particular, we want to grant the datadog user access to the view.
GRANT SELECT ON pg_stat_activity_allusers TO datadog;
With that, the database is ready for the datadog user to pull those sweet metrics out.
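Before wiring up the agent, it’s worth connecting as the datadog user and confirming the grants took. A quick hypothetical check, not part of the agent’s config:

```sql
-- Connected as datadog: the view should show sessions from every user,
-- not just datadog's own.
SELECT datname, usename, count(*)
FROM pg_stat_activity_allusers
GROUP BY datname, usename;
```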
Now that the database is ready, we need to tell the container where to look for those metrics. Datadog’s agent is configured with YAML files in a particular directory, where the names of the files correspond to the names of the plugins that read them. Importantly for this project, the YAML file below had to show up at /etc/dd-agent/conf.d/postgres.yaml in order to be processed by the postgres plugin. The content of the YAML file is described in Datadog’s documentation on their PostgreSQL agent.
init_config:

instances:
  - host: postgres-1.abcdefg.us-east-1.rds.amazonaws.com
    port: 5432
    username: datadog
    password: somesecretpassword
    tags:
      - dbinstanceidentifier:postgres-1
    custom_metrics:
      # Active queries
      - relation: false
        metrics:
          sum(CASE WHEN state='active' THEN 1 ELSE 0 END): [postgresql.active_queries, GAUGE]
          sum(CASE WHEN state='active' THEN 0 ELSE 1 END): [postgresql.inactive_queries, GAUGE]
        descriptors:
          - [ datname, db ]
        query: "select datname, %s from pg_stat_activity_allusers group by datname;"
    collect_function_metrics: True
The dbinstanceidentifier entry in the tags attribute is pretty consistent across Datadog’s various SQL integrations, and lets us correlate metrics that we pull from the instance itself with data we get from AWS about system-level utilization, like the number of IOPS. The format of the query was a bit of a surprise, with the %s as a placeholder for the column coming out of the metrics attribute. Datadog not only interpolates the column name in there, but a lot of the assumptions that the agent makes require that the interpolation happen. In this config, Datadog will run the query with each sum(...) statement in as the %s, then name the output either postgresql.active_queries or postgresql.inactive_queries depending on which statement it ran.
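The interpolation can be illustrated with a few lines of Python. This is a sketch of the mechanics only; the names mirror the config above, not the agent’s internals.

```python
# Sketch (an assumption about the mechanics, not Datadog's actual code) of
# how the agent expands this custom_metrics config: each key of the
# `metrics` map is substituted for the %s in `query`, and the result of
# each generated query is reported under the configured metric name.
metrics = {
    "sum(CASE WHEN state='active' THEN 1 ELSE 0 END)": ["postgresql.active_queries", "GAUGE"],
    "sum(CASE WHEN state='active' THEN 0 ELSE 1 END)": ["postgresql.inactive_queries", "GAUGE"],
}
query_template = "select datname, %s from pg_stat_activity_allusers group by datname;"

# One concrete SQL statement per configured metric name.
generated = {
    name: query_template % column_expr
    for column_expr, (name, _type) in metrics.items()
}

for name, sql in generated.items():
    print(name, "<-", sql)
```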
The count of active queries is a custom metric for Datadog. In the steady state, that number will be very close to zero, but if it grows, it can indicate that the database is getting backed up and the performance of the app is suffering. Since requesting the number of active queries is itself a query, there will always be at least one on the postgresql database every time the agent samples. If that’s bothersome to you, you can exclude it in the query.
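One way to do that exclusion, assuming the agent connects as the datadog user (so its own sampling query is the one we want to hide), is to filter that user out of the custom query:

```sql
-- Same shape as the custom metric query, minus the agent's own session.
-- The usename filter is an assumption: adjust it if your agent logs in
-- under a different name.
select datname, sum(CASE WHEN state='active' THEN 1 ELSE 0 END)
from pg_stat_activity_allusers
where usename <> 'datadog'
group by datname;
```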
Keeping with the goal of using the container without storing additional layers, we then had to tell the container where to get its configuration, including the secret password and the API key for Datadog. We store those in a separate secret management system. Downloading the configuration is as easy as overriding the normal command that the agent container runs. After hunting through the Dockerfile for the container, it turns out that it expects the /entrypoint.sh script to be called first, which then runs a local copy of supervisord to manage the agent itself.
Putting it all together, this is the command we use to download the configuration and start the Datadog agent. It’s injected directly into the running context for the Docker container.
apt-get update && apt-get install -y awscli
get-secret datadog-agent-configuration | base64 -d > /etc/dd-agent/conf.d/postgres.yaml
export API_KEY=$(get-secret datadog-api-key)
/entrypoint.sh /opt/datadog-agent/embedded/bin/python /opt/datadog-agent/bin/supervisord -n -c /etc/dd-agent/supervisor.conf
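For context, here is roughly how that override looks when starting the container. This is a sketch only: the upstream image name and the availability of our get-secret helper inside the container are assumptions, so adjust both to match your environment.

```shell
# Hypothetical docker run wrapper around Datadog's stock agent image;
# the image name and secret tooling are assumptions.
docker run -d --name dd-agent datadog/docker-dd-agent \
  bash -c "apt-get update && apt-get install -y awscli; \
    get-secret datadog-agent-configuration | base64 -d > /etc/dd-agent/conf.d/postgres.yaml; \
    export API_KEY=\$(get-secret datadog-api-key); \
    /entrypoint.sh /opt/datadog-agent/embedded/bin/python /opt/datadog-agent/bin/supervisord -n -c /etc/dd-agent/supervisor.conf"
```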
We found a few easily digestible nuggets of wisdom in creating this.
We have a mechanism for getting metrics out of hosted stuff like RDS. Despite having databases that run in one section of the cloud and code that runs in another, it’s still just a question of being able to point one at the other.
Postgres has a “sudo” function. Losing the immediate power of being a superuser everywhere is annoying at first, but being thoughtful about what kinds of access a user really needs is a fun exercise.
Secret management is still really hard, and we’re not great at it. Keeping our secrets in a separate yet consolidated store made this project go pretty smoothly, but there are some obvious ways that it’s going to go wrong in the future.