
Introduce bulk routing optimization. #62020

Open
howardhuanghua wants to merge 3 commits into elastic:main from TencentCloudES:bulk_routing

Conversation

@howardhuanghua (Contributor)

Issue

In a large-scale cluster, to keep individual shards from growing too big, users need to create many shards in a single index.
For example, one of our users' production clusters has 100 data nodes, and each downlink index has around 150 shards to keep single-shard size in the 30-50 GB range:

health status index              uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   downlink-20200904.3  gT_4J_-dQHuOq_G8cE00kg 153   0 8098910441            0      4.4tb          4.4tb
green  open   downlink-20200904    _ruilUSwQMq7aEYfXwBykQ 150   0 7078752021            0      3.8tb          3.8tb
green  open   downlink-20200904.9  EzVpJ4PTS1ae5E3SBYAXiA 159   0 8201595897            0      4.5tb          4.5tb
green  open   downlink-20200903.4  mRGqgjXXRN2rQQ6uj8Womw 154   0 1892298596            0        1tb            1tb
green  open   downlink-20200905.6  Z4TsjPjvQhO-APWo5aoPWw 156   0 6674856189            0      3.6tb          3.6tb
green  open   downlink-20200904.8  TGn8-xCcRzikNb2ZejkfxQ 158   0 7345715637            0        4tb            4tb
green  open   downlink-20200903.3  lmB-kjw9Rt6g2sdEEzjnuA 153   0 7452545046            0      4.1tb          4.1tb
green  open   downlink-20200905.8  GmECfHqESi6n3VxADWP8eg 158   0 4064526603            0      2.5tb          2.5tb

This cluster handles 1 million+ docs TPS in total, and we observed a bulk reject rate above 0.1% per minute.
The main reason is that each bulk request (around 2 MB) is split into roughly 100 sub-requests, one per data node. If one or several data nodes hit old-generation GC, an unstable network, or hardware failures, the shards on them become what we call long-tail sick shards (the affected shards are random and temporary). They delay the whole bulk request and eventually trigger bulk reject exceptions. In scenarios with a large number of shards in a single index, this bulk reject issue can be significant.
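To make the fan-out concrete, below is a minimal Java sketch (illustrative only, with hypothetical names, not Elasticsearch's actual TransportBulkAction code) of how a coordinating node partitions one bulk request into per-shard sub-requests; the bulk only completes when the slowest shard responds:

import java.util.*;

class BulkFanOutSketch {
    // Hypothetical bulk item: target index plus document id.
    record Item(String index, String id) {}

    static Map<Integer, List<Item>> groupByShard(List<Item> items, int numShards) {
        Map<Integer, List<Item>> byShard = new HashMap<>();
        for (Item item : items) {
            // Simplified shard resolution; ES really hashes the routing value
            // (the id, by default) with Murmur3.
            int shard = Math.floorMod(item.id().hashCode(), numShards);
            byShard.computeIfAbsent(shard, s -> new ArrayList<>()).add(item);
        }
        // One shard-level request per map entry: with ~150 shards across 100
        // nodes, a single 2 MB bulk touches almost every node, and its latency
        // is the maximum over all those shard responses.
        return byShard;
    }
}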

Solution

Users can set a routing value so that a bulk request only goes to a single shard, avoiding the effect of long-tail sick shards. However, this requires extra development work for each business index. In particular, in logging-platform scenarios, users don't care about routing; high bulk throughput is what matters.

This PR introduces a bulk routing optimization to speed up bulk indexing. An index-level setting, index.bulk_routing.enabled (false by default), controls it. When the setting is enabled and a document has neither a user-defined _id nor a user-defined _routing, all sub-requests for the same index in a bulk request are automatically assigned the same routing, so that index's writes go to just one of its shards. Here is the setting:

// set it on index creation
PUT /my-index
{
    "settings" : {
        "index" : {
            "bulk_routing.enabled" : true
        }
    }
}

// update it dynamically
PUT /my-index/_settings
{
  "index.bulk_routing.enabled": true
}

// use template to control some of the indices
PUT /_template/bulk_routing_template
{
  "index_patterns": ["indices-prefix*"],
  "settings": {
    "bulk_routing.enabled": true
  }
}

Then, if a user has a large cluster with lots of shards per index, they can simply enable this optimization on the index without any extra development.
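As an illustration of the rule just described, here is a rough Java sketch (assumed method and variable names, not the PR's exact diff) of the coordinating-node logic: one random routing is drawn per index per bulk, and it is applied only when an item carries neither a user _id nor a user _routing:

import java.util.Map;

import org.elasticsearch.common.UUIDs; // assumes the elasticsearch core jar on the classpath

class BulkRoutingSketch {
    // perIndexRouting caches one shared routing per index for the current bulk request.
    static String resolveRouting(String index, String userId, String userRouting,
                                 boolean bulkRoutingEnabled,
                                 Map<String, String> perIndexRouting) {
        if (!bulkRoutingEnabled || userId != null || userRouting != null) {
            return userRouting; // optimization does not apply; keep user-supplied values
        }
        // Same routing for every such item of this index in this bulk, truncated
        // to 10 characters to limit the storage overhead (as in the PR).
        return perIndexRouting.computeIfAbsent(index,
                i -> UUIDs.randomBase64UUID().substring(12));
    }
}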

Performance improvement

In the customer production cluster described above, after enabling this bulk routing optimization, the original rejection rate drops to 0, CPU usage drops 25%, and write speed increases 10%:

[screenshot: rejection rate, CPU usage, and write-speed metrics]

@dnhatn (Member) left a comment

Interesting optimization! But it breaks GET. Also, assigning a routing value will increase indexing time and storage. I wonder if we could instead append a few letters to the auto-generated id, or repeatedly generate new ids until all sub-requests are partitioned to a single shard.
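A rough sketch of this id-based alternative, under assumed helper names (not a real Elasticsearch API): keep drawing auto-generated ids until the id hashes onto the one shard chosen for the bulk, so GET by id keeps working with no stored routing:

import org.elasticsearch.common.UUIDs; // assumes the elasticsearch core jar on the classpath

class IdRetrySketch {
    static String idForTargetShard(int targetShard, int numShards) {
        while (true) {
            String id = UUIDs.base64UUID(); // time-based auto id, as in normal indexing
            // Simplified shard resolution; ES really applies Murmur3 to the id.
            if (Math.floorMod(id.hashCode(), numShards) == targetShard) {
                return id;
            }
        }
    }
    // Caveat: each doc needs ~numShards draws on average (about 150 here), and
    // the retried ids lose the mostly-sequential prefix pattern that compresses
    // well, which is the FST concern raised later in this thread.
}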

@dnhatn added the :Distributed/CRUD label Sep 6, 2020
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (:Distributed/CRUD)

@elasticmachine added the Team:Distributed label Sep 6, 2020
@dnhatn added team-discuss and removed Team:Distributed labels Sep 6, 2020
@howardhuanghua (Contributor, Author)

@dnhatn Thanks for checking. Yes, GET by id would be broken, but this is the same as the user-defined routing case: when users define their own routing on write requests, GET breaks in the same way, and they have to search for the doc to obtain its routing first, then pass that routing to the GET operation. The extra routing also increases storage, which is why I limit it to 10 bytes. As you suggested, it would be better to make the auto-generated ids route all sub-requests to a single shard; I will continue to investigate that.

@dnhatn added the Team:Distributed label Sep 7, 2020
@dnhatn (Member) commented Sep 7, 2020

GET by id would be broken, but this is the same as the user-defined routing case: when users define their own routing on write requests, GET breaks in the same way.

If _source is disabled, then users can't retrieve the auto-generated routing.

@howardhuanghua (Contributor, Author)

Isn't _routing a meta field? I can retrieve the _routing field with _source disabled in version 6.4.3:

"hits": [
      {
        "_index": "live_stream_stat_data-60@1597075200000_7",
        "_type": "doc",
        "_id": "qf24oKYRwwzLA8g06P3QgI0A-gAAF_QD9o",
        "_score": 1,
        "_routing": "a0a611c3"
      }
]

Or has the latest version moved _routing into _source?

@dnhatn (Member) commented Sep 7, 2020

Ah, you're right.

@itiyamas commented Sep 7, 2020

Here, the performance gain is not just from avoiding long-tail sick shards. This change also increases the bulk size per shard, which has a huge impact on performance, and it reduces the transport-layer coordination overhead.
It also reduces the bulk-size experimentation one has to redo whenever the shard count is scaled.

It would be good to keep an id format similar to the current one, as it is optimized for FST storage and some guarantees are maintained there too.

@howardhuanghua (Contributor, Author)

It would be good to keep an id format similar to the current one, as it is optimized for FST storage and some guarantees are maintained there too.

Do you mean the currently generated random bulk routing?
tmpRouting = UUIDs.randomBase64UUID().substring(12);
The main reasons I use randomBase64UUID and limit the result to 10 bytes:

  1. It's much faster than base64UUID, since base64UUID is synchronized internally.
  2. It reduces the storage impact of the routing value (see the sketch after this list for how the routing maps to a shard).
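For context, a simplified sketch of how a routing string picks a shard (Elasticsearch's OperationRouting additionally accounts for routing_num_shards and the routing factor, which this sketch skips): every document in the bulk that shares tmpRouting therefore lands on the same shard.

import org.elasticsearch.cluster.routing.Murmur3HashFunction; // assumes the elasticsearch core jar on the classpath

class RoutingToShardSketch {
    static int shardFor(String routing, int numberOfShards) {
        // Murmur3 over the routing string, modulo the shard count (simplified).
        return Math.floorMod(Murmur3HashFunction.hash(routing), numberOfShards);
    }
}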

@itiyamas commented Sep 7, 2020

Check this.

@henningandersen (Contributor)

We discussed this as a team and have the following feedback:

  1. It seems very targeted at one specific situation.
  2. We wonder whether "Increase default write queue size" (#59559) has already helped the reject rate.
  3. The 10 percent performance increase might be achievable in other ways that we want to explore, for instance smarter shard-level batching.
  4. Future coordinator-level batching may also help this scenario and be more widely applicable.

@howardhuanghua (Contributor, Author)

Thanks @henningandersen. More and more customers run large-scale clusters these days; 100+ nodes and 50-100+ shards per index are a normal case, and Tencent has lots of these huge clusters in logging scenarios. We introduced this bulk routing optimization mainly to solve the fan-out issue in such large-scale clusters, where network issues and garbage collection cause long-tail shard operation latency. Even if we increase the bulk queue size, the pending operations still consume a lot of resources. Could you please share more material about the coordinator-level batching feature?

@elasticsearchmachine changed the base branch from master to main July 22, 2022 23:13