
Conversation

@finnegancarroll (Contributor) commented Aug 19, 2025

Description

These changes add support for gRPC transport operations in OSB:

  • Introduces protobuf runners responsible for converting workload params into proto requests and parsing proto responses (a sketch follows this list).
  • Introduces a unified client that mirrors REST client functionality, with additional providers for gRPC stubs.
  • Introduces instrumentation for gRPC to measure request start/end at the network level (before any client serialization is executed).
  • Introduces a flag to specify the host address for the gRPC endpoint.
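
In outline, a protobuf runner follows OSB's async runner contract. A minimal sketch follows (not the exact code in this PR; build_search_request and the grpc_search_stub attribute are illustrative placeholders):

def build_search_request(params):
    # Placeholder for the proto helpers: the real code builds an
    # opensearch-protobufs request message from workload params.
    body = params.get("body", {})
    return {"index": params.get("index"), "query": body.get("query")}

class GrpcSearchRunner:
    async def __call__(self, opensearch, params):
        request = build_search_request(params)
        # grpc_search_stub stands in for the stub provider the unified
        # client exposes alongside the usual REST transport.
        await opensearch.grpc_search_stub.Search(request)
        return {"weight": 1, "unit": "ops", "success": True}

    def __repr__(self):
        return "grpc-search"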

Note that these changes depend on several upstream changes:

  1. Schema definitions are taken from opensearch-protobufs 0.10.0. To work out of the box, OpenSearch core needs to update its server-side definitions to this version. See the pending PR for this change: [GRPC] Adapt transport-grpc to use opensearch-protobufs 0.13.0  OpenSearch#19007
  2. The Python library for opensearch-protobufs is currently built and installed manually. Please see this pending PR to publish these libraries to PyPI: Publish python wheel to PyPi opensearch-protobufs#183
  3. No workload currently utilizes gRPC. Please see the pending PR in the workloads repository that adds a gRPC alternative to the big5 workload: Add gRPC big5/vector search test procedures opensearch-benchmark-workloads#689

Running gRPC benchmarks

To run OSB against the gRPC endpoint, please see the transport-grpc settings for enabling this endpoint:
https://github.com/opensearch-project/OpenSearch/tree/main/modules/transport-grpc

Clone the OSB workloads feature branch and pass it via --workload-path:
opensearch-project/opensearch-benchmark-workloads#689

opensearch-benchmark run \
    --pipeline=benchmark-only \
    --workload-path="<repo-path>/opensearch-benchmark-workloads/big5" \
    --target-host=http://localhost:9200 \
    --grpc-target-hosts=http://localhost:9400 \
    --workload-params '{"number_of_shards":"1","number_of_replicas":"0", "ingest_percentage":"1.0"}' \
    --distribution-version="3.3.0" \
    --kill-running-processes

With example output:

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------

|                                                         Metric |              Task |       Value |   Unit |
|---------------------------------------------------------------:|------------------:|------------:|-------:|
|                     Cumulative indexing time of primary shards |                   |     6.97163 |    min |
|             Min cumulative indexing time across primary shards |                   |     6.97163 |    min |
|          Median cumulative indexing time across primary shards |                   |     6.97163 |    min |
|             Max cumulative indexing time across primary shards |                   |     6.97163 |    min |
|            Cumulative indexing throttle time of primary shards |                   |           0 |    min |
|    Min cumulative indexing throttle time across primary shards |                   |           0 |    min |
| Median cumulative indexing throttle time across primary shards |                   |           0 |    min |
|    Max cumulative indexing throttle time across primary shards |                   |           0 |    min |
|                        Cumulative merge time of primary shards |                   |     0.41035 |    min |
|                       Cumulative merge count of primary shards |                   |          16 |        |
|                Min cumulative merge time across primary shards |                   |     0.41035 |    min |
|             Median cumulative merge time across primary shards |                   |     0.41035 |    min |
|                Max cumulative merge time across primary shards |                   |     0.41035 |    min |
|               Cumulative merge throttle time of primary shards |                   |    0.111683 |    min |
|       Min cumulative merge throttle time across primary shards |                   |    0.111683 |    min |
|    Median cumulative merge throttle time across primary shards |                   |    0.111683 |    min |
|       Max cumulative merge throttle time across primary shards |                   |    0.111683 |    min |
|                      Cumulative refresh time of primary shards |                   |   0.0752167 |    min |
|                     Cumulative refresh count of primary shards |                   |          22 |        |
|              Min cumulative refresh time across primary shards |                   |   0.0752167 |    min |
|           Median cumulative refresh time across primary shards |                   |   0.0752167 |    min |
|              Max cumulative refresh time across primary shards |                   |   0.0752167 |    min |
|                        Cumulative flush time of primary shards |                   |   0.0570333 |    min |
|                       Cumulative flush count of primary shards |                   |           3 |        |
|                Min cumulative flush time across primary shards |                   |   0.0570333 |    min |
|             Median cumulative flush time across primary shards |                   |   0.0570333 |    min |
|                Max cumulative flush time across primary shards |                   |   0.0570333 |    min |
|                                        Total Young Gen GC time |                   |        4.29 |      s |
|                                       Total Young Gen GC count |                   |        1991 |        |
|                                          Total Old Gen GC time |                   |           0 |      s |
|                                         Total Old Gen GC count |                   |           0 |        |
|                                                     Store size |                   |    0.249005 |     GB |
|                                                  Translog size |                   | 5.12227e-08 |     GB |
|                                         Heap used for segments |                   |           0 |     MB |
|                                       Heap used for doc values |                   |           0 |     MB |
|                                            Heap used for terms |                   |           0 |     MB |
|                                            Heap used for norms |                   |           0 |     MB |
|                                           Heap used for points |                   |           0 |     MB |
|                                    Heap used for stored fields |                   |           0 |     MB |
|                                                  Segment count |                   |           8 |        |
|                                                 Min Throughput | grpc-index-append |     20314.5 | docs/s |
|                                                Mean Throughput | grpc-index-append |       23556 | docs/s |
|                                              Median Throughput | grpc-index-append |     22851.4 | docs/s |
|                                                 Max Throughput | grpc-index-append |     40423.3 | docs/s |
|                                        50th percentile latency | grpc-index-append |     183.964 |     ms |
|                                        90th percentile latency | grpc-index-append |     250.861 |     ms |
|                                        99th percentile latency | grpc-index-append |     409.044 |     ms |
|                                      99.9th percentile latency | grpc-index-append |      801.04 |     ms |
|                                       100th percentile latency | grpc-index-append |     848.214 |     ms |
|                                   50th percentile service time | grpc-index-append |     183.964 |     ms |
|                                   90th percentile service time | grpc-index-append |     250.861 |     ms |
|                                   99th percentile service time | grpc-index-append |     409.044 |     ms |
|                                 99.9th percentile service time | grpc-index-append |      801.04 |     ms |
|                                  100th percentile service time | grpc-index-append |     848.214 |     ms |
|                                                     error rate | grpc-index-append |           0 |      % |
|                                                 Min Throughput |    grpc-match-all |        2.01 |  ops/s |
|                                                Mean Throughput |    grpc-match-all |        2.01 |  ops/s |
|                                              Median Throughput |    grpc-match-all |        2.01 |  ops/s |
|                                                 Max Throughput |    grpc-match-all |        2.01 |  ops/s |
|                                        50th percentile latency |    grpc-match-all |     7.13467 |     ms |
|                                        90th percentile latency |    grpc-match-all |     10.6535 |     ms |
|                                        99th percentile latency |    grpc-match-all |      11.535 |     ms |
|                                       100th percentile latency |    grpc-match-all |     17.7223 |     ms |
|                                   50th percentile service time |    grpc-match-all |     5.52638 |     ms |
|                                   90th percentile service time |    grpc-match-all |     8.48873 |     ms |
|                                   99th percentile service time |    grpc-match-all |     9.38595 |     ms |
|                                  100th percentile service time |    grpc-match-all |     15.5186 |     ms |
|                                                     error rate |    grpc-match-all |           0 |      % |
|                                                 Min Throughput |         grpc-term |        2.01 |  ops/s |
|                                                Mean Throughput |         grpc-term |        2.01 |  ops/s |
|                                              Median Throughput |         grpc-term |        2.01 |  ops/s |
|                                                 Max Throughput |         grpc-term |        2.01 |  ops/s |
|                                        50th percentile latency |         grpc-term |     6.22912 |     ms |
|                                        90th percentile latency |         grpc-term |     9.92944 |     ms |
|                                        99th percentile latency |         grpc-term |     14.6624 |     ms |
|                                       100th percentile latency |         grpc-term |     14.7964 |     ms |
|                                   50th percentile service time |         grpc-term |     4.49331 |     ms |
|                                   90th percentile service time |         grpc-term |      7.2344 |     ms |
|                                   99th percentile service time |         grpc-term |     12.2952 |     ms |
|                                  100th percentile service time |         grpc-term |     12.3938 |     ms |
|                                                     error rate |         grpc-term |           0 |      % |


-----------------------------------
[INFO] ✅ SUCCESS (took 418 seconds)
-----------------------------------

Issues Resolved

N/A

Testing

- [ ] New functionality includes testing

Tested manually.


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@finnegancarroll (Contributor Author) commented:

Initial bulk ingestion benchmarks for the big5 workload.
REST API in green, gRPC/protobuf in blue.

Throughput:
image

Latency:
image

The difference in latency here is dramatic, but flame graphs suggest serialization makes up only a small percentage of the index-append workload. Further investigation is required.

@finnegancarroll (Contributor Author) commented:

Big5 term query: a minor decrease in latency, from ~3.5 ms to ~3.4 ms with gRPC/protobuf.

image

@finnegancarroll (Contributor Author) commented:

The KNN query needs further investigation.
Throughput for gRPC/protobuf is significantly better while latency is worse across the board.
This suggests some error in measurement on the OSB side.

REST API KNN Query

|                                                  Segment count |              |          20 |        |
|                                                 Min Throughput | prod-queries |      104.55 |  ops/s |
|                                                Mean Throughput | prod-queries |      111.16 |  ops/s |
|                                              Median Throughput | prod-queries |      107.38 |  ops/s |
|                                                 Max Throughput | prod-queries |      140.06 |  ops/s |
|                                        50th percentile latency | prod-queries |     8.31753 |     ms |
|                                        90th percentile latency | prod-queries |     9.40799 |     ms |
|                                        99th percentile latency | prod-queries |     10.2258 |     ms |
|                                      99.9th percentile latency | prod-queries |     14.7956 |     ms |
|                                     99.99th percentile latency | prod-queries |     20.3045 |     ms |
|                                       100th percentile latency | prod-queries |     23.1399 |     ms |

gRPC protobuf KNN Query

|                                                  Segment count |              |          20 |        |
|                                                 Min Throughput | prod-queries |       117.1 |  ops/s |
|                                                Mean Throughput | prod-queries |      130.57 |  ops/s |
|                                              Median Throughput | prod-queries |      122.54 |  ops/s |
|                                                 Max Throughput | prod-queries |      178.14 |  ops/s |
|                                        50th percentile latency | prod-queries |     8.37658 |     ms |
|                                        90th percentile latency | prod-queries |     9.48505 |     ms |
|                                        99th percentile latency | prod-queries |     10.3761 |     ms |
|                                      99.9th percentile latency | prod-queries |     19.2337 |     ms |
|                                     99.99th percentile latency | prod-queries |     29.8528 |     ms |
|                                       100th percentile latency | prod-queries |     32.8035 |     ms |

Flame graphs suggest the gRPC/protobuf operation is not using derived fields and is instead fetching the KNN document source. This should not be the case; it accounts for ~10% of CPU time. I will update the above numbers after enabling derived source for gRPC/protobuf.
rest_knn.html
proto_knn.html

@finnegancarroll changed the title from "Add proto runners" to "Add support for gRPC transport" on Aug 28, 2025
@finnegancarroll (Contributor Author) commented:

Note: The previous latency benchmarks are not accurate. We were measuring serialization time for the gRPC client but not for the REST client. The gRPC request start/end timers have been moved into an interceptor so that we measure time closer to the network (see the sketch below). Throughput calculations remain unchanged.
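
The pattern is roughly the following (a minimal sketch using grpc.aio's client interceptor API, not the exact OSB code; attribute names are illustrative):

import time

import grpc.aio

class TimingInterceptor(grpc.aio.UnaryUnaryClientInterceptor):
    # Stamps request start/end around the RPC itself, so work done in the
    # runner (e.g. building the protobuf message) is excluded.
    def __init__(self):
        self.request_start = None
        self.request_end = None

    async def intercept_unary_unary(self, continuation, client_call_details, request):
        self.request_start = time.perf_counter()
        call = await continuation(client_call_details, request)
        await call  # wait for the response before stopping the clock
        self.request_end = time.perf_counter()
        return call

# Installed once per channel:
# channel = grpc.aio.insecure_channel("localhost:9400", interceptors=[TimingInterceptor()])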

@finnegancarroll (Contributor Author) commented:

Profiling gRPC vs REST benchmarks I'm noticing an additional issue in this implementation. Benchmark results attached here:
09c4197e-febf-41f0-be92-3669a7268237-REST.txt
47298f02-ed9b-4a8c-898d-271ae93f060a-gRPC.txt

In these tests throughput for gRPC index append is worse than REST even though we previously found the opposite here:
#938 (comment)

The difference seems to come down to connection pooling. Looking at the connections OSB opens on the relevant ports, we observe REST opening 8 connections:
image

With gRPC opening only 4 connections:
image

gRPC connections additionally vary wildly in throughput. The snapshot above shows 4 connections at roughly 20/20/40/40 Mb; at another point this jumps to 60/60/60/60 Mb:
image

I think the difference is connection pooling in the opensearchpy client, which is not present in this gRPC configuration. The client machine (r5.xlarge) has only 4 CPUs, so only 4 OSB workers are created. The Python client, however, still creates 8 separate connections/channels in the background. It is unclear to me why this is necessary, or even desirable, on a machine with only 4 CPUs.

I believe the main issue with the gRPC implementation here is sharing a single gRPC channel. client.py initializes the gRPC stubs once with a single channel, and this object is shared across all runners. This means that while 4 connections are present, objects sent over those connections are fighting over a single queue responsible for request/response serde operations.

To better mirror REST client behavior, each worker should have at least its own distinct channel instance, roughly as sketched below.
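
A minimal sketch of that direction (grpc.use_local_subchannel_pool is the gRPC channel argument for opting out of the shared pool; the per-worker wiring is illustrative):

import grpc.aio

def create_worker_channel(target):
    # One channel per OSB worker, each with its own subchannel pool, so
    # workers do not funnel requests through a shared channel and its
    # single serde queue.
    options = [
        # Opt out of gRPC's global subchannel pool so channels do not
        # silently share underlying connections.
        ("grpc.use_local_subchannel_pool", 1),
    ]
    return grpc.aio.insecure_channel(target, options=options)

# Hypothetical per-worker wiring; the stub class name is illustrative:
# channel = create_worker_channel("localhost:9400")
# stub = SearchServiceStub(channel)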

@finnegancarroll force-pushed the add-proto-runners branch 2 times, most recently from 783e238 to 68f523b on September 10, 2025 at 18:25
@finnegancarroll (Contributor Author) commented Sep 17, 2025

Updated latency benchmarks for bulk, with the noted performance regression fixed:
#938 (comment)
via the upstream thread pool change here:
opensearch-project/OpenSearch#19278

Confirming network throughput is now equal between REST and gRPC:
(REST still creates 8 client connections despite having only 4 workers; the high-level REST client seems to default to 8 channels.)

REST:
image

gRPC:
image

Latency is additionally as we expect based on flame graphs (it should be nearly equal for the index-append workload).

REST:

|                                   50th percentile service time |             index-append |     135.526 |     ms |
|                                   90th percentile service time |             index-append |     196.326 |     ms |
|                                   99th percentile service time |             index-append |     294.155 |     ms |
|                                 99.9th percentile service time |             index-append |     1183.95 |     ms |
|                                99.99th percentile service time |             index-append |      1502.3 |     ms |
|                                  100th percentile service time |             index-append |     1609.85 |     ms |

gRPC:

|                                   50th percentile service time |        grpc-index-append |     131.422 |     ms |
|                                   90th percentile service time |        grpc-index-append |     193.971 |     ms |
|                                   99th percentile service time |        grpc-index-append |     288.847 |     ms |
|                                 99.9th percentile service time |        grpc-index-append |     1258.43 |     ms |
|                                99.99th percentile service time |        grpc-index-append |     1560.71 |     ms |
|                                  100th percentile service time |        grpc-index-append |      1739.5 |     ms |

@finnegancarroll (Contributor Author) commented Sep 18, 2025

Full REST index append:

|                                                 Min Throughput |             index-append |     25506.8 | docs/s |
|                                                Mean Throughput |             index-append |     26103.5 | docs/s |
|                                              Median Throughput |             index-append |     26139.3 | docs/s |
|                                                 Max Throughput |             index-append |     26442.5 | docs/s |
|                                        50th percentile latency |             index-append |     135.526 |     ms |
|                                        90th percentile latency |             index-append |     196.326 |     ms |
|                                        99th percentile latency |             index-append |     294.155 |     ms |
|                                      99.9th percentile latency |             index-append |     1183.95 |     ms |
|                                     99.99th percentile latency |             index-append |      1502.3 |     ms |
|                                       100th percentile latency |             index-append |     1609.85 |     ms |
|                                   50th percentile service time |             index-append |     135.526 |     ms |
|                                   90th percentile service time |             index-append |     196.326 |     ms |
|                                   99th percentile service time |             index-append |     294.155 |     ms |
|                                 99.9th percentile service time |             index-append |     1183.95 |     ms |
|                                99.99th percentile service time |             index-append |      1502.3 |     ms |
|                                  100th percentile service time |             index-append |     1609.85 |     ms |
|                                                     error rate |             index-append |           0 |      % |
|                                       100th percentile latency | wait-until-merges-finish |     60000.5 |     ms |
|                                  100th percentile service time | wait-until-merges-finish |     60000.5 |     ms |
|                                                     error rate | wait-until-merges-finish |         100 |      % |

gRPC index append:

|                                                  Segment count |                          |          29 |        |
|                                                 Min Throughput |        grpc-index-append |     22961.2 | docs/s |
|                                                Mean Throughput |        grpc-index-append |     25194.4 | docs/s |
|                                              Median Throughput |        grpc-index-append |     25612.4 | docs/s |
|                                                 Max Throughput |        grpc-index-append |     25882.2 | docs/s |
|                                        50th percentile latency |        grpc-index-append |     131.422 |     ms |
|                                        90th percentile latency |        grpc-index-append |     193.971 |     ms |
|                                        99th percentile latency |        grpc-index-append |     288.847 |     ms |
|                                      99.9th percentile latency |        grpc-index-append |     1258.43 |     ms |
|                                     99.99th percentile latency |        grpc-index-append |     1560.71 |     ms |
|                                       100th percentile latency |        grpc-index-append |      1739.5 |     ms |
|                                   50th percentile service time |        grpc-index-append |     131.422 |     ms |
|                                   90th percentile service time |        grpc-index-append |     193.971 |     ms |
|                                   99th percentile service time |        grpc-index-append |     288.847 |     ms |
|                                 99.9th percentile service time |        grpc-index-append |     1258.43 |     ms |
|                                99.99th percentile service time |        grpc-index-append |     1560.71 |     ms |
|                                  100th percentile service time |        grpc-index-append |      1739.5 |     ms |
|                                                     error rate |        grpc-index-append |           0 |      % |
|                                       100th percentile latency | wait-until-merges-finish |     60000.8 |     ms |
|                                  100th percentile service time | wait-until-merges-finish |     60000.8 |     ms |
|                                                     error rate | wait-until-merges-finish |         100 |      % |

@gkamat (Collaborator) left a comment:

Please add appropriate unit tests and integ tests for this feature.

Some additional logging lines may be helpful for debugging.

Please convert to a regular PR when ready to be merged.

- Extends the opensearchpy client with an additional async gRPC stub supplier.
- Adds config options for specifying the gRPC server host endpoint.
- Adds new operation types for gRPC/protobuf operations.
- Adds runner implementations for the above operations, executing requests over gRPC.
- Introduces proto helpers to implement conversion logic from params to protobuf.

@finnegancarroll (Contributor Author) commented:

Hi @gkamat, it looks like the integration tests for OSB run against a few select versions and source OpenSearch from the published release tarball. I expect the gRPC APIs in this PR will largely be compatible with OpenSearch versions >3.3.

I'm wondering if I need to wait for 3.3 to be published before introducing ITs for these changes.

@finnegancarroll finnegancarroll marked this pull request as ready for review October 14, 2025 21:00
@finnegancarroll (Contributor Author) commented Oct 17, 2025

Sharing benchmark results for the KNN query over gRPC.
Here I'm using the faiss-cohere-768 one-million-document dataset with a single server hosted on an r5.xlarge.
The full workload params for gRPC can be found here.

(REST) prod-queries

flame_graph.html

Service time:

|                                   50th percentile service time | prod-queries |     4.71934 |     ms |
|                                   90th percentile service time | prod-queries |     5.30686 |     ms |
|                                   99th percentile service time | prod-queries |     5.68065 |     ms |
|                                 99.9th percentile service time | prod-queries |     9.22407 |     ms |
|                                99.99th percentile service time | prod-queries |     13.9234 |     ms |
|                                  100th percentile service time | prod-queries |     14.2881 |     ms |

Throughput:

|                                                 Min Throughput | prod-queries |      155.76 |  ops/s |
|                                                Mean Throughput | prod-queries |       161.5 |  ops/s |
|                                              Median Throughput | prod-queries |      162.25 |  ops/s |
|                                                 Max Throughput | prod-queries |      162.75 |  ops/s |

(gRPC) grpc-prod-queries

flame_graph.html

Service time:

|                                   50th percentile service time | grpc-prod-queries |     4.63652 |     ms |
|                                   90th percentile service time | grpc-prod-queries |     5.21496 |     ms |
|                                   99th percentile service time | grpc-prod-queries |     5.62033 |     ms |
|                                 99.9th percentile service time | grpc-prod-queries |     9.73362 |     ms |
|                                99.99th percentile service time | grpc-prod-queries |      43.662 |     ms |
|                                  100th percentile service time | grpc-prod-queries |     73.7093 |     ms |

Throughput:

|                                                 Min Throughput | grpc-prod-queries |      191.38 |  ops/s |
|                                                Mean Throughput | grpc-prod-queries |      197.48 |  ops/s |
|                                              Median Throughput | grpc-prod-queries |      197.97 |  ops/s |
|                                                 Max Throughput | grpc-prod-queries |      198.93 |  ops/s |

Next steps

While we observe little difference in service time between REST and gRPC, there is a dramatic improvement in throughput. This could reflect client-side benefits, since throughput measures operations per second of wall-clock time, or some other network benefit of HTTP/2.

While the server's serialization performance improves when using gRPC, flame graphs show this optimization is not fully realized on this workload, as the majority of time is spent within the query phase. About 5% of CPU time goes to processing request/response pairs for the client/server connection.

Next steps here include root-cause investigation into the throughput gains we see with gRPC, as well as varying the dataset to determine which types of workloads benefit most from improved performance of the client/server transport layer.

@gkamat (Collaborator) left a comment:

None of the suggestions need to go in right away; perhaps in a future PR.

Comment on lines +289 to +290
Sub channels manage the underlying connection with the server. When the global sub channel pool is used gRPC will
re-use sub channels and their underlying connections which does not appropriately reflect a multi client scenario.
@gkamat (Collaborator):

Probably could be clearer: using local subchannels permits additional connections to be created each with their own pools, which can improve performance.

Comment on lines +31 to +32
for _, term_value in value.items():
    terms[key].append(term_value)
@gkamat (Collaborator):

terms[key].extend(value.values())

might be simpler here.
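
For reference, the two forms produce the same list (a quick self-contained illustration):

value = {"first": "/var/log/a", "second": "/var/log/b"}

loop_result = []
for _, term_value in value.items():  # original loop
    loop_result.append(term_value)

extend_result = []
extend_result.extend(value.values())  # suggested one-liner

assert loop_result == extend_result  # ['/var/log/a', '/var/log/b']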

raise Exception("Error parsing query - Term query contains multiple terms: " + str(query))

# Term query body gives field/value as lists
term_field = next(iter(term.keys()))
@gkamat (Collaborator):

nit: next(iter(term)) should suffice.

@staticmethod
def build_proto_request(params):
    body = params.get("body")
    size = body.get("size") if "size" in body else None
@gkamat (Collaborator):

size = body.get("size")

should suffice.

self.assertEqual(result.request_body.query.term.field, "log.file.path")
self.assertEqual(result.request_body.query.term.value.string, "/var/log/messages/birdknight")

def test_build_proto_request_term_query_multi_field_fails(self):
@gkamat (Collaborator):

Perhaps another test for the error case where there are multiple keys?
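
A sketch of such a test (the build_proto_request entry point mirrors the quoted code; the helper class name, import, and expected message are assumptions based on the raise shown above):

import unittest

# from osbenchmark.client import TermQueryRequestHelper  # hypothetical import

class TermQueryErrorTest(unittest.TestCase):
    def test_build_proto_request_term_query_multiple_keys_fails(self):
        # Two fields in a single term query should trigger the
        # "Term query contains multiple terms" error path.
        params = {
            "body": {
                "query": {
                    "term": {
                        "log.file.path": {"value": "/var/log/messages/birdknight"},
                        "process.name": {"value": "kernel"},
                    }
                }
            }
        }
        with self.assertRaisesRegex(Exception, "multiple terms"):
            TermQueryRequestHelper.build_proto_request(params)  # hypothetical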

@gkamat merged commit dc94244 into opensearch-project:main on Oct 20, 2025 (10 checks passed).