
Speed up SpanReader::findTraceIDs() for Elasticsearch #1475

Closed
bobrik wants to merge 1 commit into jaegertracing:master from ivan/faster-es-search

Conversation

@bobrik (Contributor) commented Apr 16, 2019

We have a service that generates two spans per trace, around 4K spans per second. It's very slow to query for the latest traces in this service.

This is the default query that is currently generated:

{
  "aggregations": {
    "traceIDs": {
      "aggregations": {
        "startTime": {
          "max": {
            "field": "startTime"
          }
        }
      },
      "terms": {
        "field": "traceID",
        "order": [
          {
            "startTime": "desc"
          }
        ],
        "size": 20
      }
    }
  },
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "startTime": {
              "from": 1555437540000000,
              "include_lower": true,
              "include_upper": true,
              "to": 1555437600000000
            }
          }
        },
        {
          "match": {
            "process.serviceName": {
              "query": "nginx-ssl"
            }
          }
        }
      ]
    }
  },
  "size": 0
}

This takes pretty much forever if there are many spans to aggregate:

{
  "took": 38372,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 331761,
    "max_score": 0,
    "hits": []
  },
  "other": "stuff"
}

Presumably this is because ES aggregates over the full time window and then sorts as many buckets as there are traces.

Instead, we can pull the last X spans, and find unique traceIDs there:

{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "startTime": {
              "from": 1555437540000000,
              "include_lower": true,
              "include_upper": true,
              "to": 1555437600000000
            }
          }
        },
        {
          "match": {
            "process.serviceName": {
              "query": "nginx-ssl"
            }
          }
        }
      ]
    }
  },
  "size": 100,
  "sort": [
    {
      "startTime": "desc"
    }
  ]
}

This is a whole lot faster:

{
  "took": 26,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 331761,
    "max_score": null,
    "hits": [
      "100 hits here"
    ]
  },
  "other": "stuff"
}

So this is 38372 / 26 = 1475x faster, which is pretty nice. Generally the speedup is under 1000x, though.

@bobrik (Contributor, Author) commented Apr 16, 2019

Ha, this is issue #1475, just like the speedup is 1475x 😦

@vprithvi (Contributor) commented:

Thanks for looking into this! It could be pretty sweet!

I'm curious about what would happen if there are only a few traces with a large number of spans per service.

For example, let's assume that there are 10 unique traces with 10k spans, all for the same service: if the new query is executed with a size of 10, would it retrieve all 10 traceIDs?

@bobrik force-pushed the ivan/faster-es-search branch from b7f8a44 to 264dc52 on April 16, 2019 21:07
@bobrik (Contributor, Author) commented Apr 16, 2019

It will only return up to 10 unique traceIDs, taken from the latest 1000 spans (the value of es.max-num-spans) that match the query, which might turn out to be just one traceID rather than all 10. I'd personally take this tradeoff over waiting tens of seconds for the query to finish.
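
For illustration, here is a minimal sketch of what such a span-based lookup might look like with es.max-num-spans set to 1000, reusing the nginx-ssl example from the description. The _source filter and index name are my own additions to show that only the traceID field is needed; they are not necessarily what this PR actually sends:

```
# Sketch only: index name and host are placeholders, and the "_source" filter
# is an illustration rather than the exact request generated by the PR.
curl -X POST -H "Content-Type: application/json" \
  "http://localhost:9200/jaeger-span-2019-04-16/_search" -d '
{
  "size": 1000,
  "_source": ["traceID"],
  "sort": [ { "startTime": "desc" } ],
  "query": {
    "bool": {
      "must": [
        { "range": { "startTime": { "gte": 1555437540000000, "lte": 1555437600000000 } } },
        { "match": { "process.serviceName": { "query": "nginx-ssl" } } }
      ]
    }
  }
}'
```

The client then only has to de-duplicate the traceID values in the returned hits, keeping the first N unique ones.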

I'm looking into test failures.

@bobrik force-pushed the ivan/faster-es-search branch 4 times, most recently from 17d5a7e to a6feb77 on April 17, 2019 00:29
@vprithvi (Contributor) commented:

I understand your use case, but I'm not a fan of trading completeness for speed.

I'm wondering whether we can trigger the strategy of searching by spans and looking at traceIDs only for certain search patterns. For instance, only applying it if the search interval is less than a minute.

Or even better, are there strategies for search that are both complete and quick?

@bobrik force-pushed the ivan/faster-es-search branch 2 times, most recently from 4621433 to 0812ed9 on April 17, 2019 03:14
codecov bot commented Apr 17, 2019

Codecov Report

Merging #1475 into master will increase coverage by 1.51%.
The diff coverage is 78.94%.


@@            Coverage Diff             @@
##           master    #1475      +/-   ##
==========================================
+ Coverage   98.21%   99.73%   +1.51%     
==========================================
  Files         195      179      -16     
  Lines        9602     8555    -1047     
==========================================
- Hits         9431     8532     -899     
+ Misses        134       12     -122     
+ Partials       37       11      -26
Impacted Files Coverage Δ
plugin/storage/es/spanstore/reader.go 96.67% <78.94%> (-3.33%) ⬇️
plugin/storage/badger/spanstore/writer.go 97.22% <0%> (-0.31%) ⬇️
plugin/storage/badger/spanstore/reader.go 96.58% <0%> (-0.14%) ⬇️
cmd/agent/app/reporter/tchannel/builder.go 100% <0%> (ø) ⬆️
plugin/storage/cassandra/spanstore/reader.go 100% <0%> (ø) ⬆️
model/adjuster/clockskew.go 100% <0%> (ø) ⬆️
...lugin/storage/cassandra/spanstore/dbmodel/model.go 100% <0%> (ø) ⬆️
cmd/agent/app/testutils/fixture.go 100% <0%> (ø) ⬆️
.../app/testutils/mock_baggage_restriction_manager.go 100% <0%> (ø) ⬆️
... and 151 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7d339ef...60bf9f3.

@bobrik (Contributor, Author) commented Apr 17, 2019

> I understand your use case, but I'm not a fan of trading completeness for speed.

The tradeoff is "working UI" vs "not working UI" for me, unfortunately.

> I'm wondering whether we can trigger the strategy of searching by spans and looking at traceIDs only for certain search patterns. For instance, only applying it if the search interval is less than a minute.

The larger the window, the cheaper it is to search by spans. Aggregation looks at all matched docs, while search only looks at the latest ones.

> Or even better, are there strategies for search that are both complete and quick?

I created a topic on the Elasticsearch forum to see if there's a proper way to do this.

@ledor473 (Member) commented May 3, 2019

@bobrik I tried to replicate that slow query in our environment but it ran pretty fast:

{
    "took": 177,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 1087815,
        "max_score": 0,
        "hits": []
    },
    "aggregations": {
        "traceIDs": {
            "doc_count_error_upper_bound": -1,
            "sum_other_doc_count": 1087569,
            "buckets": [.....]
        }
    }
}

I tried to pick something that would match the hits.total value in your issue (331,761) but ended up finding something that is 3x what was mentioned (1,087,815).

Obviously, this is not a perfect comparison, but I still wanted to try to replicate it and report back 👍

@xdaijin commented Sep 11, 2019

> @bobrik I tried to replicate that slow query in our environment but it ran pretty fast [...] Obviously, this is not a perfect comparison, but I still wanted to try to replicate it and report back 👍

May I know more details about the configuration of your environment?

@ledor473 (Member) commented:

@NijiadeIX I've shared numbers in a talk. You can get the slides from that page: https://kccnceu19.sched.com/event/MPbm/deploy-scale-and-extend-jaeger-louis-etienne-dorval-ticketmaster

Look at slides 14 and 37.

@bobrik (Contributor, Author) commented Sep 18, 2019

@NijiadeIX this is how much data we have on a daily basis (33B spans / 2TB of data per replica):

$ curl 'https://foo.bar/_cat/indices/jaeger-span-*?v&s=index&h=index,pri,docs.count,store.size'
index                  pri  docs.count store.size
jaeger-span-2019-09-11  38 26600483894      3.2tb
jaeger-span-2019-09-12  38 33379018750        4tb
jaeger-span-2019-09-13  38 33201947406        4tb
jaeger-span-2019-09-14  38 31263193630      3.8tb
jaeger-span-2019-09-15  38 31037438456      3.7tb
jaeger-span-2019-09-16  38 32592635484      3.9tb
jaeger-span-2019-09-17  38 32561140996      3.9tb
jaeger-span-2019-09-18  38 28266696032      3.5tb

@bobrik force-pushed the ivan/faster-es-search branch from 0812ed9 to 60bf9f3 on September 22, 2019 19:51
@wqtty commented Dec 17, 2019

We have also experienced very slow query performance. We altered the index setting and added "eager_global_ordinals": true to the traceID field; it looks like below:

"traceID": { "type": "keyword", "eager_global_ordinals": true, "ignore_above": 256 }

@pavolloffay (Member) commented:

@wqtty how much data do you ingest? What is your data retention?

With that setting do you experience increased disk, CPU overhead at ingestion time?

Note: this will, however, be shifting the cost of building global ordinals from search-time to refresh-time. For many use cases, this is a price that’s acceptable to pay for the improvements at search time.
https://medium.com/driven-by-code/elasticsearch-global-ordinals-31df2806391f

@wqtty commented Dec 19, 2019

> @wqtty how much data do you ingest? What is your data retention?

We ingest about 14k spans per second and keep them in ES for only 2 days, which means a jaeger-span index of about 2.3TB every day.

> With that setting do you experience increased disk, CPU overhead at ingestion time?

Sorry, I didn't catch your meaning.

@pavolloffay (Member) commented:

I meant: did you experience increased CPU load or increased disk space allocation after you applied that setting?

@wqtty commented Dec 25, 2019

> Did you experience increased CPU load or increased disk space allocation after you applied that setting?

Nothing very noticeable.

@Stono commented May 7, 2020

Our elasticsearch/jaeger setup grinds to a halt as the day goes on! Being able to control the max number of spans in the query result as a config option feels like something many folks (including me) would opt into.

@xdaijin commented Jul 2, 2020

Querying today's data can be very slow:

{
  "took" : 6864,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 95482,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "traceIDs" : {
      "doc_count_error_upper_bound" : -1,
      "sum_other_doc_count" : 95342,
      "buckets" : [...]
    }
  }
}

But querying yesterday's data is very fast:

{
  "took" : 40,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 98014,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "traceIDs" : {
      "doc_count_error_upper_bound" : -1,
      "sum_other_doc_count" : 97875,
      "buckets" : [...]
    }
  }
}

@pavolloffay (Member) commented:

Could somebody submit a new PR or rebase this one?

@Stono commented Jul 7, 2020

FYI re my previous comment: we've pretty much solved our performance problems by adding "eager_global_ordinals": true to the traceID field. We currently ingest about 3.3 billion spans a day and find performance pretty good, so long as you're not querying across multiple days.

@RashmiRam commented:

I suppose we rely on aggregations to get unique traceIDs. Is there any way to distinguish root spans alone, so that we could filter on a certain field to get all root spans and their traceIDs? I saw that references will be an empty array for root spans, but again, it is a nested field. We could rely on the parentSpanId field, but I saw that it is deprecated. Is there any possibility that we could add a new field or leverage an existing field to find root spans alone, so that we can avoid this aggregation entirely?

@bobrik (Contributor, Author) commented Jul 8, 2020

@pavolloffay, I can rebase, but it will still fail a test for reading past 10k spans and I don't really have much time on my hands to work around that.

Internally we've moved on from this a bit further, which complicates things.

Our index is sorted, so that queries can read data in order and terminate early. This required us to get rid of indexing for nested docs, which means no searching for tags in logs.
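
For context, Elasticsearch index sorting is configured with the index.sort.* settings at index-creation time. A rough sketch of what that could look like for a span index sorted by startTime follows; the index name, host, and the choice of startTime as the sort field are assumptions, since the exact settings aren't shown in this thread:

```
# Sketch only: index sorting must be set when the index is created.
curl -X PUT -H "Content-Type: application/json" \
  "http://localhost:9200/jaeger-span-2020-05-20" -d '
{
  "settings": {
    "index": {
      "sort.field": "startTime",
      "sort.order": "desc"
    }
  }
}'
```

With a sorted index, a query that sorts on the same field can stop collecting hits early instead of scanning every matching document.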

Overall there's a massive improvement in terms of docs (we used to run out of docs per segment because of nested docs!), and memory footprint:

index                  pri  docs.count store.size segments.memory segments.fixed_bitset_memory fielddata.memory_size memory.total
# Before (cluster one)
jaeger-span-2020-05-20  38 81604373583      8.9tb          17.1gb                         38gb                    0b       17.3gb
# After (cluster two)
jaeger-span-2020-05-20  38  8406585575      4.8tb         192.1mb                           0b                    0b      192.1mb

Plus, we're looking into ClickHouse as the backend, see #1438.

@bobrik (Contributor, Author) commented Jul 8, 2020

@RashmiRam, the aggregation is looking for traceIDs that match the tags. You can't look only at root spans, since they don't have the tags of their children.

@RashmiRam commented:

Aah. I somehow missed it. Thanks @bobrik

@sivatarunp commented:

@Stono Can I ask what your refresh interval and JVM size are? I have tried using eager_global_ordinals but found no improvement after that, and our ingest rate is much lower than yours.

@mehta-ankit (Member) commented Nov 2, 2020

@pavolloffay @wqtty @bobrik
I wanted to share the big improvements we saw after adding "eager_global_ordinals": true to the traceID field, as mentioned in this comment:
#1475 (comment)
Along with this change we also modified refresh_interval for the indices to 60s (the default is 1s for Elasticsearch).
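
For anyone wanting to replicate this: refresh_interval is a dynamic index setting, so it can be raised on a live index with the update-settings API (a minimal sketch with a placeholder index name and host), while the eager_global_ordinals change goes in the traceID mapping as shown earlier in the thread:

```
# Sketch only: raise the refresh interval on one example daily index.
curl -X PUT -H "Content-Type: application/json" \
  "http://localhost:9200/jaeger-span-2020-11-02/_settings" -d '
{
  "index": {
    "refresh_interval": "60s"
  }
}'
```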

The improvements we saw were the following:

  1. For non-search endpoints (all endpoints except /api/traces, using the jaeger_rpc_request_latency_bucket metric from jaeger-query) we saw a 65-75% improvement in 90th-percentile latency.
  2. For the search endpoint (/api/traces), the same metric showed an 85% improvement in average latency (since the highest bucket for this metric is 10s, we previously could not even plot the 90th percentile, as it would flat-line at 10s; hence we used averages 😄).
  • Since setting eager_global_ordinals to true moves the cost of building global ordinals from search time to the refresh phase, the longer refresh interval helped with overall stability, I believe.
  • We saw heap usage increase by 7-9% after this change.
  • Our daily index's primary store size is 200-300GB (with the default settings of 5 primary shards and 1 replica). We retain data for 7 days and on average keep 1.5-2TB of data in Elasticsearch. Our span receive rate is between ~5-8K spans/sec.
  • We give our Elasticsearch data nodes 14G of memory, with a heap size of 7G. Heap usage normally stays around 40-60% (depending on ingest rates and search requests).

Let me know if you all have any questions.

@sivatarunp commented Nov 2, 2020

> I wanted to share the big improvements we saw after adding "eager_global_ordinals": true to the traceID field [...] Let me know if you all have any questions.

@mehta-ankit can we know the cardinality of your traceID field, and what latency values you have observed? For us, setting eager_global_ordinals to true didn't help. Our refresh interval is also around 60s. I wanted to know if there is anything I missed.

@mehta-ankit (Member) commented Nov 2, 2020

@sivatarunp I don't have the value for the cardinality of the traceID field. I am assuming I'll have to use a cardinality aggregation to figure this out.

Looks like I figured out the cardinality. It seems to be 105 million for a single day's index.

"aggregations" : {
    "traceid_count" : {
      "value" : 105056553
    }
  }

I used the following query:

curl -H "Content-Type: application/json" -X GET "http://localhost:9200/<index_name>/_search?size=0&pretty" -d'{     
   "aggs": {
         "traceid_count": {
             "cardinality": {
                 "field": "traceID"
             }
         }
     }
}'

Average latencies for the /api/traces endpoint have been around 500ms-1s in the last hour (it's still early for us to hit high enough traffic on search right now). Of course, it depends on the params passed in from jaeger-query (results limit, lookback value, etc.).

Question:

  1. How did you set this value? Did you set it on all your daily indices?
     We use a cronjob to set this value on each daily index when it's created. Otherwise values like refresh_interval and this one can be overwritten (see "Modify elasticsearch index template for spans", jaeger-operator#1077 (comment)).

@sivatarunp commented:

@mehta-ankit you can also get a rough estimate of the cardinality from a traces/sec metric. For us, with a lookback of 1hr and results_limit set to 20, it takes a huge amount of time, around 25-30s; hence the worry. This is on an index which has around 40 million queryable docs in the time range.
As for your question: we use ES aliases and have the setting in the jaeger-span template, so each rollover index is created with this setting enabled.
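
As a rough sketch of that approach (this is not the actual jaeger-span template; the template name, index patterns, and Elasticsearch version are assumptions), a legacy index template can carry both the refresh interval and the traceID mapping so that every rollover index picks them up:

```
# Sketch only: a minimal legacy template; Jaeger ships its own full template,
# so in practice you would extend that rather than replace it.
curl -X PUT -H "Content-Type: application/json" \
  "http://localhost:9200/_template/jaeger-span" -d '
{
  "index_patterns": ["*jaeger-span-*"],
  "settings": {
    "index": {
      "refresh_interval": "60s"
    }
  },
  "mappings": {
    "properties": {
      "traceID": {
        "type": "keyword",
        "eager_global_ordinals": true,
        "ignore_above": 256
      }
    }
  }
}'
```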

@sivatarunp commented Nov 3, 2020

> Looks like I figured out the cardinality. It seems to be 105 million for a single day's index. [...]

Thanks for the info @mehta-ankit 👍 🙏
Our cardinality is around 130 million per day. We have observed the high response times (25-30s) in cases where the queried service has a high document count (40M per shard). This was a test setup, where we ingested more docs for a single service to observe the pattern. We observed better response times as the number of docs per shard was reduced. Would you mind sharing a few details, like the AWS instance types in use for your current setup?

@jpkrohling (Contributor) commented:

Closing due to inactivity.

@jpkrohling closed this Aug 23, 2021
@Stono commented Aug 3, 2023

Hey, for other people coming to this thread: I've summarised all the things I changed to make Jaeger -> Elasticsearch super performant for our use case. Hope it might be of use to others. https://karlstoney.com/speeding-up-jaeger-on-elasticsearch

@yurishkuro (Member) commented:

@Stono great post! You didn't always mention the downsides of enabling various settings, so it's a bit hard to judge, but wouldn't some of these changes make sense permanently in the default index template created by Jaeger? Or do you feel that it's hardly ever a good idea to rely on the default template?

@Stono commented Aug 3, 2023

That's a really good point, to be fair; I'll update it with the negatives tomorrow (which, to be honest, are negligible for us: slightly more CPU during ingest).

I think it's hard for Jaeger to offer a best-for-all template out of the box. The default one is pretty good, to be honest, but I was pushing the scale. When you start doing that sort of volume, I think you'll end up bespoke-tweaking it to your use case and hardware.

I personally would, however, add eager global ordinals to traceID and the references (spanID).
