
Support regex tags search for Elasticsearch backend #2049

Merged (8 commits, Mar 4, 2020)

Conversation

@annanay25 (Member) commented Jan 30, 2020

Signed-off-by: Annanay [email protected]

Which problem is this PR solving?

When using the ES backend, this PR adds the ability to:

  • use regex search for tag values, so a query like foo=bar.* matches tags such as {foo:bar}, {foo:barbaz}, and {foo:bar1}

Related to #684

Short description of the changes

  • Swapped the NewMatchQuery call in the ES reader for NewRegexpQuery (a minimal sketch of the change follows)
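A minimal sketch of the swap, assuming the olivere/elastic client used by the ES span reader; the package, field, and helper names here are illustrative, not the exact code in this PR:

    package es

    import "github.com/olivere/elastic"

    // Before: the tag value had to match exactly.
    func tagMatchQuery(field, value string) elastic.Query {
        return elastic.NewMatchQuery(field, value)
    }

    // After: the tag value is interpreted as a regular expression,
    // so "bar.*" matches "bar", "barbaz", "bar1", and so on.
    func tagRegexpQuery(field, value string) elastic.Query {
        return elastic.NewRegexpQuery(field, value)
    }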

/cc @pavolloffay

@annanay25 requested a review from a team as a code owner on January 30, 2020 06:41
@codecov (bot) commented Jan 30, 2020

Codecov Report

Merging #2049 into master will decrease coverage by 1.09%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master    #2049      +/-   ##
==========================================
- Coverage   97.39%   96.29%   -1.10%     
==========================================
  Files         207      214       +7     
  Lines       10286    10535     +249     
==========================================
+ Hits        10018    10145     +127     
- Misses        223      332     +109     
- Partials       45       58      +13     
Impacted Files Coverage Δ
cmd/agent/app/agent.go 87.50% <0.00%> (-12.50%) ⬇️
cmd/query/app/static_handler.go 86.84% <0.00%> (-1.76%) ⬇️
...lugin/sampling/strategystore/adaptive/processor.go 99.20% <0.00%> (-0.80%) ⬇️
cmd/env/command.go 100.00% <0.00%> (ø)
cmd/query/app/flags.go 100.00% <0.00%> (ø)
cmd/agent/app/builder.go 100.00% <0.00%> (ø)
model/adjuster/clockskew.go 100.00% <0.00%> (ø)
cmd/collector/app/metrics.go 100.00% <0.00%> (ø)
cmd/collector/app/options.go 100.00% <0.00%> (ø)
pkg/config/tlscfg/options.go 100.00% <0.00%> (ø)
... and 43 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0e974ce...0d2fb5e.

@annanay25 changed the title from "Add regex tag filter for ES backend" to "Improve query functionality for ES backend" on Jan 30, 2020
@annanay25 (Member, Author)

I've added the "search for presence of a tag" functionality to the Cassandra reader as well, but I can submit that in a different PR.

@pavolloffay (Member)

search for presence of a tag

Any ideas on how we could support this in the UI?

Could you please add integration tests for the regex case and also for the escaped version, e.g. finding exactly hello*?
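To make the "escaped version" concrete: the * has to be quoted so that hello* is matched literally rather than as a regex. A small Go illustration (regexp.QuoteMeta is in the standard library; the value is just the example from this comment):

    package main

    import (
        "fmt"
        "regexp"
    )

    func main() {
        // As a raw regex, "hello*" matches "hell", "hello", "helloo", ...
        // Quoting the metacharacter leaves only the literal string "hello*".
        fmt.Println(regexp.QuoteMeta("hello*")) // prints: hello\*
    }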

@pavolloffay (Member)

It would be better to split the regex and missing-tags changes into two separate PRs.

@annanay25 (Member, Author)

@pavolloffay - Please review.

@yurishkuro (Member)

What are the performance implications of always forcing a regex search?

@annanay25 (Member, Author)

@yurishkuro From what I understand, Elasticsearch performance degrades as the literal prefix before the wildcard operator in the regex query gets shorter. For example, http.url=*.example.com will perform worse than http.url=example.c*.

I could not find performance benchmarks comparing a Match query to a Regexp query when no operator (*, ?, etc.) is used.
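To illustrate the two shapes (a hedged sketch using the olivere/elastic client; the field name is made up): a leading wildcard gives Elasticsearch no literal prefix to seek on in the term dictionary, so every term must be tested, while a trailing wildcard bounds the scan to terms starting with the prefix.

    package es

    import "github.com/olivere/elastic"

    // Slow: leading wildcard, no literal prefix to seek on.
    var slow = elastic.NewRegexpQuery("tag.http@url", `.*\.example\.com`)

    // Faster: the literal prefix "example.c" bounds the term scan.
    var fast = elastic.NewRegexpQuery("tag.http@url", `example\.c.*`)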

@pavolloffay (Member)

I am also interested in performance results, as tag queries are common. Could you please run some perf tests to measure the query time?

I have used https://github.com/pavolloffay/jaeger-perf-tests before, and @jkandasa should also have a perf framework, although I am not sure if that can be used outside our infrastructure.

@pavolloffay (Member)

If there is perf degradation, maybe we could use the regex query type only when there is a regex in the query (see the sketch below).
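A sketch of that suggestion (this is not what the PR implements; the metacharacter set and the helper names are assumptions):

    package es

    import (
        "strings"

        "github.com/olivere/elastic"
    )

    // looksLikeRegex is a crude heuristic: treat the value as a regex
    // only if it contains a common regex metacharacter.
    func looksLikeRegex(value string) bool {
        return strings.ContainsAny(value, `.*+?()[]{}|\`)
    }

    // tagValueQuery falls back to the cheaper match query unless the
    // value actually needs regex semantics.
    func tagValueQuery(field, value string) elastic.Query {
        if looksLikeRegex(value) {
            return elastic.NewRegexpQuery(field, value)
        }
        return elastic.NewMatchQuery(field, value)
    }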

@jkandasa (Member) commented Feb 4, 2020

@annanay25 we have a tool to measure query timings: https://github.com/jkandasa/jaegerperf

@annanay25 (Member, Author)

@pavolloffay - Thanks. Is there a guide on which metrics need to be monitored after running the perf-tests?

@jkandasa - Could you please provide some more information on the tool? Does it perf-test the collector/query component, etc? The README doesn't say much.

@annanay25 (Member, Author)

@pavolloffay - Also, in the integration test, the "Tag Name + Operation Name" query seems to match the tags_regex_trace.json as well, even though they search on a different ServiceName and tag value.

I'm not sure I understand why that is happening, could you please take a look?

"ExpectedFixtures": ["tags_regex_trace"]
},
{
"Caption": "Tag regex + Operation name + max Duration - Multiple traces",
@pavolloffay (Member):

@pavolloffay - Also, in the integration test, the "Tag Name + Operation Name" query seems to match the tags_regex_trace.json as well, even though they search on a different ServiceName and tag value.

@annanay25 do you mean this one?

@annanay25 (Member, Author):

No, the one above: "Tag regex + Operation name + max Duration".

It is supposed to match just tags_regex_trace but matches both tags_regex_trace and tags_opname_trace.

@pavolloffay (Member):

This one also looks wrong, as it uses a different service name: https://github.com/jaegertracing/jaeger/blob/master/plugin/storage/integration/fixtures/traces/tags_opname_trace.json

Could you please debug it and/or open an issue?

@pavolloffay (Member):

How did you fix the issue?

@annanay25 (Member, Author):

I had to mangle the traceID further -
0955efa#diff-1824d3ce6ce534210f34269d2370e966

@jkandasa (Member) commented Feb 6, 2020

@jkandasa - Could you please provide some more information on the tool? Does it perf-test the collector/query component, etc? The README doesn't say much.

@annanay25 This tool can send spans to the collector component and can be used to measure jaeger-query API execution time; both actions are independent.
I have updated the README file. Let me know if it is still not clear.

@annanay25 (Member, Author) commented Feb 12, 2020

@jkandasa - I was able to set up your tool to do the performance tests. I loved the simplicity, and the concept of creating jobs. I have a few suggestions from my experience of using it:

  • It might be helpful to mention somewhere in the README that the QueryRunner will generate a JSON with results once it completes. It was not obvious to me at first.
  • It would also be useful to have some guidelines on how two generated JSONs can be compared to get an idea of performance.
  • In my latest run I get a constant contentLength of 56 in all summary results, which seems fishy; is there an explanation?
  • In the Jobs UI page, clicking the "delete job" icon shows "Status: Completed" even for a running job.
  • I'm also attaching the Kubernetes YAML I wrote to set this up; please feel free to include it in the repo, as it might be useful to others.

@annanay25 (Member, Author)

Perf results:

From the master build:

{
    "summary": [
     {
      "contentLength": 242,
      "elapsed": 4388,
      "errorPercentage": 100,
      "errorsCount": 10,
      "name": "3.last 2 days limit 1000",
      "samples": 10
     },
     {
      "contentLength": 304,
      "elapsed": 5092,
      "errorPercentage": 100,
      "errorsCount": 10,
      "name": "4.last 7 days limit 2000",
      "samples": 10
     },
     {
      "contentLength": 282153,
      "elapsed": 203237,
      "errorPercentage": 0,
      "errorsCount": 0,
      "name": "1.last 12 hours limit 100",
      "samples": 10
     },
     {
      "contentLength": 2923946,
      "elapsed": 2138975,
      "errorPercentage": 0,
      "errorsCount": 0,
      "name": "1.last 12 hours limit 1000",
      "samples": 10
     },
     {
      "contentLength": 2970425,
      "elapsed": 21746050,
      "errorPercentage": 50,
      "errorsCount": 5,
      "name": "1.last 12 hours limit 2000",
      "samples": 10
     },
     {
      "contentLength": 229,
      "elapsed": 93426,
      "errorPercentage": 100,
      "errorsCount": 10,
      "name": "2.last 24 hours limit 100",
      "samples": 10
     },
     {
      "contentLength": 229,
      "elapsed": 81108,
      "errorPercentage": 100,
      "errorsCount": 10,
      "name": "2.last 24 hours limit 1000",
      "samples": 10
     },
     {
      "contentLength": 229,
      "elapsed": 7399,
      "errorPercentage": 100,
      "errorsCount": 10,
      "name": "2.last 24 hours limit 2000",
      "samples": 10
     },
     {
      "contentLength": 242,
      "elapsed": 4355,
      "errorPercentage": 100,
      "errorsCount": 10,
      "name": "3.last 2 days limit 100",
      "samples": 10
     },
     {
      "contentLength": 242,
      "elapsed": 4051,
      "errorPercentage": 100,
      "errorsCount": 10,
      "name": "3.last 2 days limit 2000",
      "samples": 10
     },
     {
      "contentLength": 304,
      "elapsed": 4632,
      "errorPercentage": 100,
      "errorsCount": 10,
      "name": "4.last 7 days limit 100",
      "samples": 10
     },
     {
      "contentLength": 304,
      "elapsed": 4352,
      "errorPercentage": 100,
      "errorsCount": 10,
      "name": "4.last 7 days limit 1000",
      "samples": 10
     }
    ]
   }

With this PR:

{
    "summary": [
     {
      "contentLength": 3546187,
      "elapsed": 2750042,
      "errorPercentage": 0,
      "errorsCount": 0,
      "name": "3.last 2 days limit 2000",
      "samples": 10
     },
     {
      "contentLength": 232174,
      "elapsed": 997305,
      "errorPercentage": 0,
      "errorsCount": 0,
      "name": "1.last 12 hours limit 100",
      "samples": 10
     },
     {
      "contentLength": 2609034,
      "elapsed": 10571863,
      "errorPercentage": 0,
      "errorsCount": 0,
      "name": "1.last 12 hours limit 1000",
      "samples": 10
     },
     {
      "contentLength": 3546187,
      "elapsed": 12699653,
      "errorPercentage": 0,
      "errorsCount": 0,
      "name": "1.last 12 hours limit 2000",
      "samples": 10
     },
     {
      "contentLength": 2609034,
      "elapsed": 2888810,
      "errorPercentage": 0,
      "errorsCount": 0,
      "name": "2.last 24 hours limit 1000",
      "samples": 10
     },
     {
      "contentLength": 3546187,
      "elapsed": 4955535,
      "errorPercentage": 0,
      "errorsCount": 0,
      "name": "2.last 24 hours limit 2000",
      "samples": 10
     },
     {
      "contentLength": 230222,
      "elapsed": 571004,
      "errorPercentage": 0,
      "errorsCount": 0,
      "name": "3.last 2 days limit 100",
      "samples": 10
     },
     {
      "contentLength": 2609034,
      "elapsed": 1898116,
      "errorPercentage": 0,
      "errorsCount": 0,
      "name": "3.last 2 days limit 1000",
      "samples": 10
     },
     {
      "contentLength": 230222,
      "elapsed": 379544,
      "errorPercentage": 0,
      "errorsCount": 0,
      "name": "4.last 7 days limit 100",
      "samples": 10
     },
     {
      "contentLength": 2609034,
      "elapsed": 1753636,
      "errorPercentage": 0,
      "errorsCount": 0,
      "name": "4.last 7 days limit 1000",
      "samples": 10
     },
     {
      "contentLength": 3546187,
      "elapsed": 2979869,
      "errorPercentage": 0,
      "errorsCount": 0,
      "name": "4.last 7 days limit 2000",
      "samples": 10
     },
     {
      "contentLength": 230222,
      "elapsed": 1225345,
      "errorPercentage": 0,
      "errorsCount": 0,
      "name": "2.last 24 hours limit 100",
      "samples": 10
     }
    ]
   }

A bunch of the test results from the master build give errorPercentage: 100.

@jkandasa (Member)

@annanay25 Thank you for the comments and feedback on the tool. I have added a metrics page to compare the results.

  • It might be helpful to mention somewhere in the README that the QueryRunner will generate a JSON with results once it completes. It was not obvious to me, at first

Updated.

  • It would also be useful to have some guidelines on how two generated-jsons can be compared to get an idea of performance

Metrics page included.

  • In my latest run I get a constant contentLength of 56 in all summary results, seems fishy, is there an explanation?

In this case, you may see errors or no data for the specified time range. Please refer to the JSON result: metrics >> raw.
Example:

{
 "contentLength": 194,
 "elapsed": 15166842,
 "errorMessage": "",
 "others": {
  "errors": [
   {
    "code": 500,
    "msg": "Get https://elasticsearch:9200/_msearch: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
   }
  • In the Jobs UI page, clicking on the "delete job" icon shows "Status: Completed" even for a running job.

Fixed

  • I'm also attaching the kubernetes yaml I wrote to set this up - please feel free to include it in the repo if you wish as it might be useful to others.

Thank you. I have modified it and added it to the source repo.

@annanay25 (Member, Author)

It seems like there is some performance degradation; how do you suggest this can be improved, @pavolloffay?

@pavolloffay (Member)

@annanay25 could you please create a summary of the test results? For instance, how much time each type of query takes, in % against master/HEAD, as in #1969 (comment).
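As a hedged example of such a summary, computed from the two result dumps above (taking elapsed as reported by the tool, and noting that master rows with errorPercentage 100 are not really comparable):

    package main

    import "fmt"

    func main() {
        // "1.last 12 hours limit 100": master vs. this PR, elapsed from the dumps above.
        master, pr := 203237.0, 997305.0
        fmt.Printf("%+.0f%% vs master\n", (pr-master)/master*100) // +391%, i.e. ~4.9x slower
    }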

@annanay25 (Member, Author)

@pavolloffay - https://docs.google.com/spreadsheets/d/1mzJNQYv1xF0fiMAcpXk5TlNVE3GIqZOlZv2iDZqkraI/edit?usp=sharing

On my system, after one run the perf tests start giving 100% errors for most queries on the master build (1.16); can someone else test this?

@pavolloffay (Member)

@annanay25 the results do not look good, the performance degradation is quite significant.

I would suggest detecting the regex and changing the query type on the fly only when it is necessary.

@annanay25 (Member, Author)

@pavolloffay - Tests on my machine are quite inconsistent (PTAL at round 2; the numbers are in complete contrast with round 1).

Can someone else run a benchmark on their machine? I can publish my image on Docker Hub for this.

@pavolloffay (Member)

@annanay25 could you please build the query and collector image and publish it? @jkandasa would you be able to run the same tests as in #1969 ?

@jkandasa (Member)

@pavolloffay OK, I will do that.
@annanay25 can you publish the collector and query images?

@annanay25 (Member, Author)

Thanks @jkandasa.
For the collector, the jaegertracing/jaeger-collector image can be used, as there are no changes to the collector in this PR.
For the query service, I have pushed the image to https://hub.docker.com/r/annanay25/jaeger-query

@smpavlenko

Hello,
is there an ETA for when this will be released?

@annanay25 (Member, Author)

@jkandasa - Did you get a chance to take a look at this?

@jkandasa (Member) commented Mar 2, 2020

@annanay25 here is the result:

[image: query-time comparison table]

Detailed file: https://docs.google.com/spreadsheets/d/1fIOYd4LJHHcmOIlEYuR9TQlAv4SXWdddy22zPmymvvw/edit?usp=sharing

Image details:
  • 1.17.0
    Collector: jaegertracing/jaeger-collector:1.17.0
    Query: jaegertracing/jaeger-query:1.17.0
  • This PR
    Collector: jaegertracing/jaeger-collector:1.17.0
    Query: annanay25/jaeger-query:latest (docker.io/annanay25/jaeger-query@sha256:3d700db8fb31b43cced6ecbfb7c17452165fa3def0525b9e7b366b90de9f1f6d)
  • OCP version
    Server Version: 4.3.1
    Kubernetes Version: v1.16.2

Test data:

target: "collector"
endpoint: http://jaegerqe-collector:14268/api/traces
serviceName: jaegerperf_generator
mode: history
numberOfDays: 10
spansPerDay: 60000
spansPerSecond: 300 # maximum spans limit/sec
childDepth: 2
tags: 
  spans_generator: "jaegerperf"
  days: 10
startTime: 
Query test raw data:

@annanay25 (Member, Author)

Thank you for the results @jkandasa.

@pavolloffay - Could you please review these?

@pavolloffay (Member) left a review:

It looks good, just minor comments in tests.

@jkandasa thanks for running the tests!

plugin/storage/integration/elasticsearch_test.go (outdated, resolved)
plugin/storage/integration/fixtures/queries_es.json (outdated, resolved)
plugin/storage/integration/fixtures/queries_es.json (outdated, resolved)
plugin/storage/integration/fixtures/queries_es.json (outdated, resolved)
plugin/storage/integration/fixtures/queries_es.json (outdated, resolved)
@pavolloffay changed the title from "Improve query functionality for ES backend" to "Support regex tags search for Elasticsearch backend" on Mar 4, 2020
@pavolloffay (Member) commented Mar 4, 2020

@annanay25 could you please provide more description in the first comment? E.g., which fields can be searched with regex: only the values, or also the keys?

Does it work with our UI?

@annanay25 (Member, Author)

Thanks for the review, @pavolloffay. I've addressed all comments. PTAL.

@pavolloffay (Member)

should match all tags like

what about logs?

@naeimehmhm

Is there any update on having regex in tag search?
