Update to collector main branch #1

Open
wants to merge 78 commits into base: main

Conversation

jwieringa
Member

No description provided.

msakrejda and others added 30 commits September 23, 2024 15:13
Collect system info and metrics for Azure Flexible Server and Cosmos DB for PostgreSQL.
To start using this, the customer needs to supply a new config variable, AZURE_SUBSCRIPTION_ID, to the collector, as well as set up the managed identity. When AZURE_SUBSCRIPTION_ID is not supplied, the collector doesn't error out and simply won't collect the system info.
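
A minimal Go sketch of that opt-in behavior (the names here are illustrative, not the collector's actual API):

```go
package azure

import "os"

// Hypothetical sketch: without AZURE_SUBSCRIPTION_ID, Azure system info
// collection is skipped rather than treated as an error.
func maybeCollectAzureSystemInfo() error {
	subscriptionID := os.Getenv("AZURE_SUBSCRIPTION_ID")
	if subscriptionID == "" {
		// Not configured: not an error, simply no Azure system info.
		return nil
	}
	// ... query Azure Monitor for system info using the managed identity ...
	return nil
}
```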
…er with text format (#586)

* Add auto_explain parameters support for JSON format
* Better support both newlines and Query Parameters with text-format auto_explain
* Make sure not to include query params in explain JSON with the normalized filter
In #494, I missed updating the --analyze-logfile flag. The flag is
intended to analyze a logfile "offline", which does not work with the
new mechanism, since that relies on being able to read the server's
log_timezone and log_line_prefix.

This adds additional flags to pass that information to the
--analyze-logfile path. The timezone defaults to UTC; the prefix is
required.
In passing, this also renames max_dead_tuples and num_dead_tuples
to max_dead_item_ids/num_dead_item_ids, to track the upstream
clarification on the naming.
Previously the collector would emit system catalogs that only exist
on EPAS, causing unnecessary information to show in the pganalyze dashboard.

Additionally, EPAS allows function signatures to be re-used between functions
and procedures (presumably this was allowed before upstream Postgres added
CREATE PROCEDURE), which causes a server-side error with pganalyze. Since this
is an edge case and there is no intent to support this upstream, ignore
duplicate functions/procedures that have the same signature.
In some cases, it's possible for the user, database, or other free-form parts of
the log line prefix to match too much of the input string, causing the LogLevel
to unintentionally match part of the content.

E.g., with CloudSQL the vacuum analysis logs are output as

> 2024-10-04 06:38:28.808 UTC [2569017]: [1-1] db=,user= LOG: automatic vacuum of table "some_database.some_schema.some_table": index scans: ...

This is matched as user " LOG: automatic vacuum of table "some_database.some_schema.some_table": index"
and log level "scans". The bogus log level is then mapped to 0, and the line is not
handled correctly.

To avoid this, change our regexp to only look for the log levels (and additional
context strings that appear in place of log level on detail lines) we recognize.
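
As a rough illustration (not the collector's exact pattern), a narrowed regexp might look like this:

```go
package logs

import "regexp"

// Illustration only: restrict the match to the log levels we recognize, plus
// the context strings that appear in their place on detail lines, instead of
// accepting any word in that position.
var logLevelRe = regexp.MustCompile(
	`(DEBUG[1-5]?|INFO|NOTICE|WARNING|ERROR|LOG|FATAL|PANIC|` +
		`DETAIL|HINT|CONTEXT|STATEMENT|QUERY):\s+`)

// matchLogLevel returns the recognized level in line, or "" if none matches.
func matchLogLevel(line string) string {
	m := logLevelRe.FindStringSubmatch(line)
	if m == nil {
		return ""
	}
	return m[1]
}
```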
RPM base packages have changed and no longer contain an init system by
default. Since we mainly care about having a process running that we can
then issue exec commands against, we can use sleep here instead to avoid
the failure.
This asks for confirmation which we don't want in the automated flow.
* Bump protoc version to 28.2
* Update CI/release protoc version to 28.2

Also update the setup-protoc version to 3. Note that we still use the
forked version to reduce a security risk:
#308
…616)

* Track cluster identifier (cluster ID) as part of system information

This is intended to help identify servers that belong together, to allow
the pganalyze app to group them in an overview, offer data in combined
views, or allow Index Advisor to avoid considering indexes unused if
they are used on a replica.

To start with, this identifies the cluster ID for the following providers:
- AWS: DBClusterIdentifier (user-assigned cluster name)
- Azure: Source Server ID (replication parent server ID + resource group)
- Crunchy Bridge: Cluster Parent ID (replication parent server ID)
- Self hosted: pg_controldata system identifier (unique ID shared by all physical replicas)

There are providers (e.g. GCP) and edge cases (e.g. when the helper doesn't
run for self-hosted servers) where this doesn't work. Handling those cases, and
allowing manual overrides of the cluster ID, is left for a later follow-up.
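
A hypothetical Go sketch of the per-provider fallback order described above (the type and field names are illustrative, not the collector's actual types):

```go
package state

// clusterInfo holds the provider-specific identifiers, if any were detected.
type clusterInfo struct {
	AwsDbClusterIdentifier string // user-assigned cluster name
	AzureSourceServerID    string // replication parent server ID + resource group
	CrunchyClusterParentID string // replication parent server ID
	SystemIdentifier       string // pg_controldata, shared by all physical replicas
}

// clusterIdentifier picks the first provider-specific ID that is available.
func clusterIdentifier(info clusterInfo) string {
	switch {
	case info.AwsDbClusterIdentifier != "":
		return info.AwsDbClusterIdentifier
	case info.AzureSourceServerID != "":
		return info.AzureSourceServerID
	case info.CrunchyClusterParentID != "":
		return info.CrunchyClusterParentID
	default:
		return info.SystemIdentifier // may be empty, e.g. on GCP
	}
}
```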
seanlinsley and others added 30 commits December 2, 2024 12:44
* Make getFirstMetricValue return the first value

This function was previously returning the last value instead of the
first. This fix returns the value as soon as it is read.

Co-authored-by: Maciek Sakrejda <[email protected]>
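
An illustrative sketch of the fix, using plain float pointers rather than the Azure Monitor SDK types the real function operates on:

```go
package azure

// getFirstMetricValue returns the first non-nil value as soon as it is read,
// instead of looping to the end and returning the last one.
func getFirstMetricValue(values []*float64) float64 {
	for _, v := range values {
		if v != nil {
			return *v
		}
	}
	return 0
}
```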
Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.25.0 to 0.31.0.
- [Commits](golang/crypto@v0.25.0...v0.31.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
This adds the equivalent of the ini-style setting
"max_buffer_cache_monitoring_gb" for the environment-based settings,
which was missed when the setting was first added.
In passing, revise error text to include manual package install instructions
Fixes CVE-2024-45338, which is a DoS-style vulnerability that's not a
practical problem, but may be flagged by dependency scanners.
Some time ago (at least 9+ years) we introduced a maximum connection
lifetime of 30 seconds via the SetConnMaxLifetime database/sql function.

This was likely intended as an extra safety measure in case the collector
didn't clean up its DB handle correctly. We've had a few minor issues
with incorrect cleanups over the years (though usually not with the DB
connection), but at this point our connection cleanup should be reliable.

The connection lifetime is a problem because it causes Go to
reconnect to the database after 30 seconds have elapsed. This
effectively removes any connection-local settings, in particular
a statement_timeout.

In practice this meant that queries may unintentionally not have been
subject to the statement_timeout if the same connection had previously taken
30 seconds to do its work. To resolve this, remove the max lifetime.
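
A sketch of the problem, assuming a pool limited to a single monitoring connection (not the collector's actual code):

```go
package dbconn

import (
	"database/sql"

	_ "github.com/lib/pq" // Postgres driver for database/sql
)

// openMonitoringConnection illustrates why the 30 second max lifetime was a
// problem: database/sql silently replaces an expired connection, and any
// connection-local SET, such as statement_timeout, is lost on the new one.
func openMonitoringConnection(dsn string) (*sql.DB, error) {
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		return nil, err
	}
	db.SetMaxOpenConns(1)
	// Previously: db.SetConnMaxLifetime(30 * time.Second)
	// With the lifetime limit removed, the setting below stays in effect for
	// as long as the connection is held.
	if _, err := db.Exec("SET statement_timeout = '30s'"); err != nil {
		db.Close()
		return nil, err
	}
	return db, nil
}
```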
This executes on-demand query runs for the Query Tuning feature via
a helper function, which avoids both reading the table data directly and
granting the collector user unnecessary permissions.

This is now the only way that query runs can execute EXPLAIN ANALYZE;
direct execution without the helper is intentionally not supported.

In passing, expand the protobufs to pass down parameters, parameter
types and optional planner settings, and refactor the query run code to
handle work requiring a DB connection in a separate Go function.

Further, this adds tests that utilize a new TEST_DATABASE_URL environment
setting, which is used to run unit tests that require a database
connection (previously only the integration tests used the database).
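
A hypothetical sketch of how a query run might invoke the helper; the helper's real signature comes from the generated SQL, so the argument shape below is illustrative only:

```go
package querysample

import (
	"context"
	"database/sql"

	"github.com/lib/pq"
)

// runExplainAnalyze calls the pganalyze.explain_analyze helper instead of
// executing the statement directly. The argument list is an assumption for
// illustration; consult the generated helper SQL for the actual signature.
func runExplainAnalyze(ctx context.Context, db *sql.DB, query string, params, paramTypes []string) (string, error) {
	var plan string
	err := db.QueryRowContext(ctx,
		"SELECT pganalyze.explain_analyze($1, $2::text[], $3::text[])",
		query, pq.Array(params), pq.Array(paramTypes),
	).Scan(&plan)
	return plan, err
}
```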
Like "--generate-stats-helper-sql" this allows generating a SQL script
that installs the pganalyze.explain_analyze helper into all databases on
a server, as specified by the server section name passed as an argument.

Additionally, the "--generate-helper-explain-analyze-role" option allows
setting the role that the helper function should be owned by as a value,
defaulting to "pganalyze_explain".

Note that the role creation, as well as any GRANT statements for the role
must be taken care of separately.
#660)

This acts as an additional defense in case the user does not revoke
dblink permissions from non-superusers as we recommend.
In some edge cases, for example unknown CONTEXT patterns, we may not
detect the contents of a secondary line. If filtering of unidentified
content is enabled, such lines need to be redacted.
This can occur in certain cases, for example auto_explain output
on some Postgres versions. We previously would not have detected
this CONTEXT line, but are now correctly detecting it as containing
statement_parameter log secrets.

In passing, add a test case for the new "Query Parameters" field in
auto_explain, which gets redacted based on the statement_text filter that
filters out the whole plan text (we normalize the individual fields in the
resulting query samples, but the log text for auto_explain is redacted
altogether).
Because the buffercache logic initializes its own database handle (so it
can run in parallel to other collection tasks), it also needs to ensure
that handle is closed so the connection is released again.

In passing, avoid the double assignment of "db" in the later parts of the
function.
This makes it clear that we must clean up the database connection that
was established, since we leak connections otherwise.

To support this, refactor the log-based EXPLAIN code so defer can be
used easily.
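
A minimal sketch of the defer-based cleanup pattern (names are illustrative):

```go
package runner

import "database/sql"

// collectWithOwnConnection shows the pattern: a task that opens its own DB
// handle, like the buffercache collection, must also close it, and defer
// ensures that happens on every return path.
func collectWithOwnConnection(connect func() (*sql.DB, error)) error {
	db, err := connect()
	if err != nil {
		return err
	}
	defer db.Close() // release the connection again, even on early returns

	_, err = db.Exec("SELECT 1") // placeholder for the actual collection queries
	return err
}
```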
In almost all cases we have an existing database connection, so use
that to retrieve the value of Postgres settings.
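
A minimal sketch of reading a setting over the existing connection (the function name is illustrative):

```go
package postgres

import (
	"context"
	"database/sql"
)

// getPostgresSetting reads a setting over the already-established monitoring
// connection instead of opening a new one.
func getPostgresSetting(ctx context.Context, db *sql.DB, name string) (string, error) {
	var value string
	err := db.QueryRowContext(ctx, "SELECT current_setting($1)", name).Scan(&value)
	return value, err
}
```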
* Start collecting and sending pg_stat_statements_info stats
* Send diff dealloc number instead of absolute number
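
A sketch of the diffing logic, with illustrative type and field names:

```go
package diffs

// statementsInfo holds a snapshot of pg_stat_statements_info counters.
type statementsInfo struct {
	Dealloc int64
}

// deallocDiff reports the dealloc counter as a diff between snapshots rather
// than as an absolute value.
func deallocDiff(prev, current statementsInfo) int64 {
	diff := current.Dealloc - prev.Dealloc
	if diff < 0 {
		// Counter was reset (e.g. pg_stat_statements_reset); start over.
		return current.Dealloc
	}
	return diff
}
```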
Due to security enhancements in newer macOS versions, unsigned Go
binaries may hang when built and executed locally. To prevent this,
add code signing to `make build`.
Currently, the collector log processing code assumes that the log
parser has always been initialized by the time we receive log messages
to process.  This is generally true, but it can be false if the
collector failed to connect to the database (to fetch log_line_prefix
to initialize the parser). That can lead to confusing errors (that
are not relevant to the root cause of the connection problem).

Skip attempting to process log lines when the log parser is not yet
initialized.
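
A minimal sketch of the guard (names are illustrative):

```go
package logs

// logParser stands in for the collector's log parser interface.
type logParser interface {
	ParseLine(line string) bool
}

// processLogLines skips processing when the parser was never initialized, for
// example because fetching log_line_prefix failed.
func processLogLines(parser logParser, lines []string) {
	if parser == nil {
		// The connection problem that prevented initialization is reported
		// elsewhere; don't pile confusing parse errors on top of it.
		return
	}
	for _, line := range lines {
		parser.ParseLine(line)
	}
}
```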
#669)

* Dockerfile: Update alpine base image from 3.18 to 3.21
* Bump go version to 1.23.5
* Explicitly make a copy of logLine

With the Go version bump to 1.22+, this code triggered the warning "loop variable
logLine now per-iteration, heap-allocated". To potentially avoid the
heap allocation, explicitly make a copy of logLine.
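
A sketch of the explicit-copy idiom with illustrative types; whether the copy actually avoids the heap allocation depends on how it is used afterwards:

```go
package logs

// logLineValue stands in for the collector's log line type.
type logLineValue struct {
	Content string
}

// linePointers makes an explicit per-iteration copy before taking its address,
// mirroring the pre-1.22 idiom so the loop variable itself no longer escapes.
func linePointers(logLines []logLineValue) []*logLineValue {
	out := make([]*logLineValue, 0, len(logLines))
	for _, logLine := range logLines {
		logLine := logLine // explicit copy before taking the address
		out = append(out, &logLine)
	}
	return out
}
```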
This adds a new configuration setting, `gcp_region` / `GCP_REGION`.
If `db_use_iam_auth` / `DB_USE_IAM_AUTH` is enabled and `gcp_region` / `GCP_REGION` is specified, the collector uses a service account associated with the collector VM as the monitoring user, instead of a traditional password.
In order to use this setting, a service account needs to be created with the proper roles, the database needs to be configured for IAM database authentication, and the service account needs to be added to the database, so that it can access the database to collect statistics.
We currently do not resolve symlinks when looking up $PGDATA from the
environment or the `data_directory` GUC setting. This can misattribute
storage costs to the wrong partition.

Always follow symlinks in the helper that resolves the data directory.
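
A minimal sketch of the symlink resolution (the function name is illustrative), using filepath.EvalSymlinks:

```go
package util

import "path/filepath"

// resolveDataDirectory always resolves symlinks when determining the data
// directory, so storage is attributed to the partition the data actually
// lives on.
func resolveDataDirectory(dataDirectory string) (string, error) {
	resolved, err := filepath.EvalSymlinks(dataDirectory)
	if err != nil {
		return dataDirectory, err
	}
	return resolved, nil
}
```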
7 participants