Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: New vtgate Metrics #17585

Open
systay opened this issue Jan 21, 2025 · 3 comments
Open

RFC: New vtgate Metrics #17585

systay opened this issue Jan 21, 2025 · 3 comments
Labels

Comments

@systay
Copy link
Collaborator

systay commented Jan 21, 2025

1. Summary

Vitess currently classifies query plan types in a way that is neither intuitive nor helpful for performance analysis. In particular, QueriesProcessed and QueriesRouted rely on plan-type designations that are inconsistent across different operators (e.g., IN, Concatenate, DDL, Reference, FkCascade, and InsertSelect). This proposal introduces two new metrics and deprecates the older, less-informative ones.

2. Motivation

  1. Inconsistent Plan-Type Metric

    • The plan type was derived from the root operator’s name, leading to wide variability in reported plan types.
    • Some operators (like Route) reported the route type, while others forwarded whatever their child operator returned.
  2. Limited Usefulness

    • Metrics such as QueriesProcessed and QueriesRouted provide only a coarse breakdown.
    • Deeper insight (e.g., how complex or how many shards were involved) is missing or buried in logs/other metrics.
  3. Need for Clarity

    • We want an easily understandable categorization of query complexity to better diagnose query performance in vtgate.
    • We also want to track how many shards each query touches, to inform sharding logic decisions and query optimization.

3. Proposed Changes

3.1 Deprecation

  • Metrics to Deprecate:
    • QueriesProcessed
    • QueriesRouted
  • These will be marked as deprecated but still exist for at least one release to allow for a smooth transition.

3.2 New Metric: QueriesProcessedByQueryType

We propose categorizing queries into eight distinct buckets to capture both common and potentially problematic execution patterns:

  1. Passthrough

    • The query is forwarded directly to a single shard without modification.
    • Minimal overhead in vtgate; the fastest possible path.
  2. MultiShard

    • The query is routed to multiple shards, but not all shards in the cluster.
    • vtgate may need to concatenate results but does not perform extensive operator logic beyond routing.
  3. Scatter

    • The query is sent to all shards.
    • Indicates potential performance overhead, as every shard must be involved in the request.
  4. Lookup

    • The query requires at least two calls: typically a vindex lookup first, followed by the main query.
    • A common pattern for partial fan-out or locating specific shard(s) through a lookup vindex.
  5. Join

    • The plan includes at least one join operator at vtgate level (e.g., join of two routes).
    • Useful for quickly spotting queries that might be combining data across multiple shards.
  6. Complex

    • Catches any plan more involved than the above categories (e.g., subqueries, nested operators, multi-stage pipelines).
    • Indicates a need for further investigation or optimization if it appears frequently.
  7. OnlineDDL

    • Vitess-managed DDL statements performed online (e.g., schema migrations orchestrated by vtgate).
    • Tracked separately to measure usage and performance impact of online operations.
  8. DirectDDL

    • DDL statements that are directly passed to the underlying MySQL instances.
    • Does not go through Vitess’s online migration flow.

Example Metric Name

  • QueriesProcessedByQueryType{queryType="Passthrough"}
  • QueriesProcessedByQueryType{queryType="Complex"}, etc.

3.3 New Metric: QueriesProcessedByStatementType

  • Purpose: Categorize queries by the high-level SQL statement type.

Possible Categories (not exhaustive):

  • SELECT
  • INSERT
  • UPDATE
  • DELETE
  • SET
  • DDL (could be further subdivided if desired)
  • Others as needed (ALTER, CREATE, DROP, etc.)

3.4 New Histogram: “Shards Accessed per Query”

  • Purpose: Track how many shards a query invocation talks to.
  • Buckets: 0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512
  • Metric Name: ShardsAccessedHistogram (or similar).

This histogram shows the distribution of queries across shard counts, helping identify where queries may be fanning out more than expected.

4. Backward Compatibility

  • We will keep the old metrics running in parallel, marked as “DEPRECATED,” for at least one release cycle.
  • Developers and operators should be encouraged to migrate to the new metrics.

5. Open Questions

  1. Granularity: Do we need more nuanced categories within “Complex”?
  2. Bucket Sizes: Are the proposed histogram buckets (0, 1, 2, 4, …) adequate for most production workloads?
  3. Eventual Removal: What is the timeline for fully removing QueriesProcessed and QueriesRouted?
@systay
Copy link
Collaborator Author

systay commented Jan 21, 2025

I believe that implementing this RFC would solve the following issue: #16391

@systay
Copy link
Collaborator Author

systay commented Jan 22, 2025

I thought it would be useful to list the personas that might influence or benefit from these metrics and would be useful to consider in the analysis. Here are the most common ones in the context of a system like Vitess:

  1. SRE / DevOps Engineer
  • Key Interests: Reliability, uptime, and scaling. They need quick insights into system health and clear signals for when to scale or investigate issues.
  • Metric Needs:
    1. High-level “are we stable or at risk?” metrics (throughput, errors, latency).
    2. Immediate detectability of anomalies (e.g., sudden spike in multi-shard queries).
    3. Trends for capacity planning and alert thresholds.
  1. Database Administrator (DBA)
  • Key Interests: Query performance, indexing strategies, schema design, and resource optimization.
  • Metric Needs:
    1. Detailed breakdown of query behaviors (scatter vs. single-shard vs. multi-operator).
    2. Histograms that reveal read/write patterns and fan-out.
    3. Ability to diagnose root causes of slow or blocking queries (though DBAs often have more system access than other roles).
  1. Application Developer
  • Key Interests: Ensuring queries are correct, efficient, and aligned with application logic. They may need feedback if their changes degrade or improve performance.
  • Metric Needs:
    1. Simple categories (e.g., which queries are “simple vs. complex”) to spot if an app refactor introduced heavier queries.
    2. Statement-type metrics (e.g., INSERT/SELECT spikes) to confirm expected usage patterns.
    3. Real-time visibility into the effect of changes without having to parse logs or run deeper queries.
  1. Customer Engineering (CE)
  • Key Interests: Diagnosing customer issues without direct access to logs or ability to run VEXPLAIN. Confirming that recommended fixes have been implemented.
  • Metric Needs:
    1. Snapshot “fingerprint” metrics showing plan/operator usage.
    2. Time-series data to validate changes over time (did queries or problematic plans?).
    3. Enough granularity to detect newly introduced “bad” patterns or complex plans.

@systay
Copy link
Collaborator Author

systay commented Jan 22, 2025

Updated the categories for QueriesProcessedByQueryType , adding two new ones: Join and MultiShard

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant