RFC: New vtgate Metrics #17585

systay · 2025-01-21T08:58:36Z

1. Summary

Vitess currently classifies query plan types in a way that is neither intuitive nor helpful for performance analysis. In particular, QueriesProcessed and QueriesRouted rely on plan-type designations that are inconsistent across different operators (e.g., IN, Concatenate, DDL, Reference, FkCascade, and InsertSelect). This proposal introduces two new metrics and deprecates the older, less-informative ones.

2. Motivation

Inconsistent Plan-Type Metric
- The plan type was derived from the root operator’s name, leading to wide variability in reported plan types.
- Some operators (like Route) reported the route type, while others forwarded whatever their child operator returned.
Limited Usefulness
- Metrics such as QueriesProcessed and QueriesRouted provide only a coarse breakdown.
- Deeper insight (e.g., how complex a query is or how many shards it touched) is missing or buried in logs/other metrics.
Need for Clarity
- We want an easily understandable categorization of query complexity to better diagnose query performance in vtgate.
- We also want to track how many shards each query touches, to inform sharding logic decisions and query optimization.

3. Proposed Changes

3.1 Deprecation

Metrics to Deprecate:
- QueriesProcessed
- QueriesRouted
These will be marked as deprecated but still exist for at least one release to allow for a smooth transition.

3.2 New Metric: `QueriesProcessedByQueryType`

We propose categorizing queries into eight distinct buckets to capture both common and potentially problematic execution patterns:

Passthrough
- The query is forwarded directly to a single shard without modification.
- Minimal overhead in vtgate; the fastest possible path.
MultiShard
- The query is routed to multiple shards, but not all shards in the cluster.
- vtgate may need to concatenate results but does not perform extensive operator logic beyond routing.
Scatter
- The query is sent to all shards.
- Indicates potential performance overhead, as every shard must be involved in the request.
Lookup
- The query requires at least two calls: typically a vindex lookup first, followed by the main query.
- A common pattern for partial fan-out or locating specific shard(s) through a lookup vindex.
Join
- The plan includes at least one join operator at vtgate level (e.g., join of two routes).
- Useful for quickly spotting queries that might be combining data across multiple shards.
Complex
- Catches any plan more involved than the above categories (e.g., subqueries, nested operators, multi-stage pipelines).
- Indicates a need for further investigation or optimization if it appears frequently.
OnlineDDL
- Vitess-managed DDL statements performed online (e.g., schema migrations orchestrated by vtgate).
- Tracked separately to measure usage and performance impact of online operations.
DirectDDL
- DDL statements that are directly passed to the underlying MySQL instances.
- Does not go through Vitess’s online migration flow.

Example Metric Name

QueriesProcessedByQueryType{queryType="Passthrough"}
QueriesProcessedByQueryType{queryType="Complex"}, etc.
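To make the buckets concrete, here is a minimal sketch of how a plan could be mapped to one of the eight labels. Everything in it is an assumption made for illustration: `PlanSummary` and its fields are hypothetical and do not correspond to vtgate's actual plan structures, and the precedence between overlapping categories (for example, a join that also uses a lookup) is a guess, since the RFC does not define it.

```go
package main

import "fmt"

// QueryType is the label value proposed for QueriesProcessedByQueryType.
type QueryType string

const (
	Passthrough QueryType = "Passthrough"
	MultiShard  QueryType = "MultiShard"
	Scatter     QueryType = "Scatter"
	Lookup      QueryType = "Lookup"
	Join        QueryType = "Join"
	Complex     QueryType = "Complex"
	OnlineDDL   QueryType = "OnlineDDL"
	DirectDDL   QueryType = "DirectDDL"
)

// PlanSummary is a hypothetical, flattened view of a query plan.
// It is NOT a real vtgate type; it only carries what the buckets need.
type PlanSummary struct {
	IsDDL             bool
	Online            bool // DDL goes through the online migration flow
	HasOtherOperators bool // subqueries, nested operators, multi-stage pipelines
	HasJoin           bool // at least one join operator at vtgate level
	UsesLookupVindex  bool // a vindex lookup precedes the main query
	ShardsTargeted    int
	TotalShards       int
}

// Classify picks one bucket per plan. The order of the cases is an
// assumed precedence, not something the RFC specifies.
func Classify(p PlanSummary) QueryType {
	switch {
	case p.IsDDL && p.Online:
		return OnlineDDL
	case p.IsDDL:
		return DirectDDL
	case p.HasOtherOperators:
		return Complex
	case p.HasJoin:
		return Join
	case p.UsesLookupVindex:
		return Lookup
	case p.ShardsTargeted <= 1:
		return Passthrough
	case p.ShardsTargeted >= p.TotalShards:
		return Scatter
	default:
		return MultiShard
	}
}

func main() {
	p := PlanSummary{ShardsTargeted: 2, TotalShards: 8}
	fmt.Printf("QueriesProcessedByQueryType{queryType=%q}\n", Classify(p))
}
```

Whatever precedence is chosen, each query should land in exactly one bucket so that the per-label counts sum to the total number of queries processed.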

3.3 New Metric: `QueriesProcessedByStatementType`

Purpose: Categorize queries by the high-level SQL statement type.

Possible Categories (not exhaustive):

SELECT
INSERT
UPDATE
DELETE
SET
DDL (could be further subdivided if desired)
Others as needed (ALTER, CREATE, DROP, etc.)
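As a rough illustration of the statement-type label, the sketch below derives it from the first keyword of the SQL text. This is an assumption-heavy simplification: the `StatementType` function and the `OTHER` fallback label are hypothetical, leading comments and forms such as `SELECT(1)` are not handled, and a real implementation would presumably take the type from the already-parsed statement rather than from string matching.

```go
package main

import (
	"fmt"
	"strings"
)

// StatementType returns a coarse label for a SQL string based on its
// first whitespace-separated keyword. Hypothetical helper for illustration.
func StatementType(sql string) string {
	fields := strings.Fields(sql)
	if len(fields) == 0 {
		return "OTHER" // assumed fallback label, not part of the RFC
	}
	switch kw := strings.ToUpper(fields[0]); kw {
	case "SELECT", "INSERT", "UPDATE", "DELETE", "SET":
		return kw
	case "ALTER", "CREATE", "DROP":
		// Grouped under DDL here; the RFC leaves open whether to subdivide.
		return "DDL"
	default:
		return "OTHER"
	}
}

func main() {
	for _, q := range []string{"select * from t", "ALTER TABLE t ADD c INT", "show tables"} {
		fmt.Printf("QueriesProcessedByStatementType{statementType=%q}\n", StatementType(q))
	}
}
```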

3.4 New Histogram: “Shards Accessed per Query”

Purpose: Track how many shards a query invocation talks to.
Buckets: 0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512
Metric Name: ShardsAccessedHistogram (or similar).

This histogram shows the distribution of queries across shard counts, helping identify where queries may be fanning out more than expected.
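The bucketing can be sketched as follows. This is a standalone illustration, not vtgate's stats package: `ShardHistogram` is a hypothetical type, bucket values are assumed to be inclusive upper bounds, and an extra overflow slot is assumed for queries touching more than 512 shards. The sketch is not safe for concurrent use.

```go
package main

import (
	"fmt"
	"sort"
)

// shardBuckets are the upper bounds proposed in the RFC.
var shardBuckets = []int{0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512}

// ShardHistogram counts observations per bucket, with one extra
// overflow slot at the end. Hypothetical type for illustration only.
type ShardHistogram struct {
	counts []int64
}

func NewShardHistogram() *ShardHistogram {
	return &ShardHistogram{counts: make([]int64, len(shardBuckets)+1)}
}

// Observe records one query that talked to the given number of shards.
// sort.SearchInts returns the first index whose bound is >= shards,
// or len(shardBuckets) when shards exceeds every bound (overflow).
func (h *ShardHistogram) Observe(shards int) {
	h.counts[sort.SearchInts(shardBuckets, shards)]++
}

func main() {
	h := NewShardHistogram()
	for _, n := range []int{1, 1, 3, 16, 700} {
		h.Observe(n)
	}
	for i, c := range h.counts {
		if i < len(shardBuckets) {
			fmt.Printf("<=%d shards: %d\n", shardBuckets[i], c)
		} else {
			fmt.Printf(">%d shards: %d\n", shardBuckets[len(shardBuckets)-1], c)
		}
	}
}
```

With power-of-two bounds, a query touching 3 shards is counted in the 4 bucket, so the histogram shows the order of magnitude of the fan-out rather than exact shard counts.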

4. Backward Compatibility

We will keep the old metrics running in parallel, marked as “DEPRECATED,” for at least one release cycle.
Developers and operators should be encouraged to migrate to the new metrics.

5. Open Questions

Granularity: Do we need more nuanced categories within “Complex”?
Bucket Sizes: Are the proposed histogram buckets (0, 1, 2, 4, …) adequate for most production workloads?
Eventual Removal: What is the timeline for fully removing QueriesProcessed and QueriesRouted?

systay · 2025-01-21T10:40:06Z

I believe that implementing this RFC would solve the following issue: #16391

systay · 2025-01-22T10:15:33Z

I thought it would be useful to list the personas who might influence or benefit from these metrics, so that we can consider them in the analysis. Here are the most common ones in the context of a system like Vitess:

SRE / DevOps Engineer

Key Interests: Reliability, uptime, and scaling. They need quick insights into system health and clear signals for when to scale or investigate issues.
Metric Needs:
1. High-level “are we stable or at risk?” metrics (throughput, errors, latency).
2. Immediate detectability of anomalies (e.g., sudden spike in multi-shard queries).
3. Trends for capacity planning and alert thresholds.

Database Administrator (DBA)

Key Interests: Query performance, indexing strategies, schema design, and resource optimization.
Metric Needs:
1. Detailed breakdown of query behaviors (scatter vs. single-shard vs. multi-operator).
2. Histograms that reveal read/write patterns and fan-out.
3. Ability to diagnose root causes of slow or blocking queries (though DBAs often have more system access than other roles).

Application Developer

Key Interests: Ensuring queries are correct, efficient, and aligned with application logic. They may need feedback if their changes degrade or improve performance.
Metric Needs:
1. Simple categories (e.g., which queries are “simple vs. complex”) to spot if an app refactor introduced heavier queries.
2. Statement-type metrics (e.g., INSERT/SELECT spikes) to confirm expected usage patterns.
3. Real-time visibility into the effect of changes without having to parse logs or run deeper queries.

Customer Engineering (CE)

Key Interests: Diagnosing customer issues without direct access to logs or ability to run VEXPLAIN. Confirming that recommended fixes have been implemented.
Metric Needs:
1. Snapshot “fingerprint” metrics showing plan/operator usage.
2. Time-series data to validate changes over time (did the problematic queries or plans go away?).
3. Enough granularity to detect newly introduced “bad” patterns or complex plans.

systay · 2025-01-22T11:31:56Z

Updated the categories for QueriesProcessedByQueryType, adding two new ones: Join and MultiShard.

systay added the "Type: RFC Request For Comment" and "Component: Query Serving" labels on Jan 21, 2025