
Additional stats fields for Elasticsearch#41652

Closed
3kt wants to merge 18 commits into elastic:main from 3kt:additional_fields_index_stats

Conversation

@3kt (Contributor) commented Nov 15, 2024

Proposed commit message

Adds creation_date and tier_preference fields to the elasticsearch.index dataset.
This will be necessary for further development through elastic/integrations#11656.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files — N/A
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc

Regarding the documentation, the example document is copied from the data.json file, which is updated accordingly in this PR.

Another modification will be required in the integrations repo (for this file).

Disruptive User Impact

This "shouldn't" have an impact on end users: it doesn't alter existing behavior, it only adds 2 new fields that will be exposed in the gathered Elasticsearch monitoring stats.

Author's Checklist

How to test this PR locally

You can run the integration against any cluster (with xpack enabled or otherwise) and check that the generated index stats documents contain the two new fields:

  • creation_date
  • tier_preference

Screenshots

[screenshot]

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Nov 15, 2024
@mergify mergify bot assigned 3kt Nov 15, 2024
@mergify (bot) commented Nov 15, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @3kt? 🙏
To do so, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d — the label to automatically backport to the 8./d branch, where /d is the digit

@mergify (bot) commented Nov 15, 2024

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it, please add the backport-skip label and remove the backport-8.x label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Nov 15, 2024
@3kt 3kt added the Team:Monitoring Stack Monitoring team label Nov 17, 2024
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Nov 17, 2024
@3kt 3kt requested a review from consulthys November 21, 2024 20:26
@3kt (Contributor, Author) commented Nov 28, 2024

Ran a quick smoke test by running the new Elasticsearch module (with xpack.enabled: true) against a large internal cloud cluster:

  • 8.7 TB RAM
  • 151 nodes
  • 47k shards

The module collected the additional fields as expected, but we will need an additional modification in the Elasticsearch repo for the index template, as I believe this is hardcoded there.

[screenshot]

The diagrams below present various metrics for the targeted cluster. The first 3 hours (before the annotation) don't use the new Metricbeat module; the 3 hours after do. The cluster is located in us-east-1, and the target time frame was between 12:00 and 18:00 local time (I'm in EMEA, so the screenshots below show different times).

All these charts are filtered on attributes.data: "hot", since collection takes place over the API (and therefore gets routed solely to hot nodes, as no coordinating nodes are used).

Also note that I used the scope: cluster collection mechanism in the Metricbeat module, but since my code change only impacts the index dataset, I don't expect this to influence cluster load.

Management thread pool count:

[screenshot]

Barely perceptible increase following the deployment of the additional Metricbeat. I don't think we can attribute this to the new code (and therefore metadata collection), but rather to the fact that I added a Metricbeat rather than replacing the existing one.

CPU usage:

[screenshot]

There doesn't seem to be any major difference; the last spike appears to be caused by cluster activity rather than monitoring collection.

Garbage collections:

[screenshot]

Similar story here: the GC spike correlates with the CPU increase.

Heap usage:

[screenshot]

No notable difference here either. The decrease could be related to the US east coast winding down, as this is where the target cluster is located.


I will keep collection active for longer, but so far I don't see any concrete evidence that this addition could negatively impact the health or stability of the cluster.

@consulthys (Contributor) commented:
Thanks for running these tests, they look promising!

The module collected the additional fields as expected, but we will need an additional modification in Elasticsearch repo for the index template, as I believe this is hardcoded there.

Do you mean this index template? That's easily modifiable.

@henningandersen commented:
Did you also collect the response sizes and response times? Also, it would be interesting to know the frequency at which this is called.

@3kt (Contributor, Author) commented Dec 3, 2024

@henningandersen About the collection rate: I didn't change the default of one collection point every 10 seconds:

[screenshot]
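For reference, the settings discussed here (the default 10-second period and the scope: cluster mechanism) would look roughly like this in a Metricbeat module configuration; the hosts value is illustrative:

```yaml
- module: elasticsearch
  xpack.enabled: true
  period: 10s
  scope: cluster
  hosts: ["http://localhost:9200"]
```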

Response size is an interesting one: since I use "external" collection, adding my Metricbeat translates into additional outbound traffic, which we can quantify with the billing API. Looking at the last 10 days, we have the following:

[screenshot]

This seems to represent around 300 MB, or 16%, of added traffic per hour for this deployment. In a "real" scenario, collection would be internal, though.

Also note that this increase could be caused by the cluster upgrade from 8.15 to 8.16, which happened over the displayed time range.

Beyond that, internode traffic didn't drastically change after the addition of this collection:

[screenshot]

Given the relative scales of internode and outbound traffic, the new collection's data volume would probably get "drowned" in the TBs of internode traffic.

@3kt (Contributor, Author) commented Dec 3, 2024

Added a debug call in GetClusterState (requires the "fmt", "strings", and "time" imports):

	// Measure how long the API call takes
	start := time.Now()

	queryString := strings.Join(queryParams, "&")

	content, err := fetchPath(http, resetURI, clusterStateURI, queryString)
	if err != nil {
		return nil, err
	}

	elapsed := time.Since(start)

	// Display response size and elapsed time in ms
	fmt.Printf("Cluster state response size: %d - took %d ms\n", len(content), elapsed.Milliseconds())

For the "old" code base:

data points: 54
average response time: 673.9 ms
std dev response time: 370.0 ms
response size: 14540971 b

For the "new" code base, hitting the same large SRE cluster, I get:

data points: 100
average response time: 7727.0 ms
std dev response time: 1497.0 ms
response size: 15679947 b

In short:

| Metric                | Old        | New        | Difference |
|-----------------------|------------|------------|------------|
| Data points           | 54         | 100        |            |
| Response size (bytes) | 14,540,971 | 15,679,947 | +7.8%      |
| Response time (ms)    | 673.9      | 7727.0     | +1046%     |

cc @consulthys @henningandersen

@3kt (Contributor, Author) commented Dec 8, 2024

Discarding in favor of #41944.


Labels

  • backport-8.x — Automated backport to the 8.x branch with mergify
  • enhancement
  • Team:Monitoring — Stack Monitoring team
