Intake protocol v2 #1260
Conversation
I started the APM Server and sent one metadata and one transaction request.
After the transaction showed up in ES, the CPU usage of the APM Server went up to 200% although idle. I haven't looked into the reasons for this.
EDIT: I printed the size of the publisher's queue p.pendingRequests, and the queue keeps getting filled up with entries. They are all non-transformable, though, so nothing can really get processed, and CPU load increases.
beater/route_config.go
sourcemapRouteType = routeType{
	sourcemapHandler,
	backendMetadataDecoder,
	rumTransformConfig,
Why would the sourcemapRouteType need the sourcemap config? This is for uploading, afaik.
It just needs config.SmapMapper to be set; rumTransformConfig does that.
beater/route_config.go
	authHandler(beaterConfig.SecretToken, h)))
}

func backendMetadataDecoder(beaterConfig *Config, d decoder.ReqDecoder) decoder.ReqDecoder {
How about calling these systemMetadataDecoder and userMetaDataDecoder instead of backend and rum?
type v2Route struct {
	routeType
}
I think the type definition and the methods on it should be in the same file.
Update: @graphaelli and I looked into the issues, and it seems that the server is not handling the read_timeout properly. The requeuing of non-existent requests starts when the read timeout is hit.
var err error
var rawModel map[string]interface{}

eventables := []transform.Transformable{}
If you use var eventables []transform.Transformable instead of creating an empty array, then the previously observed CPU load and queuing of non-existent events disappear. I still think there are more issues here, though, as EOF seems not to be read properly.
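For context, here is a minimal standalone sketch of the nil-slice vs. empty-slice distinction behind this suggestion (a generic illustration, not the handler's actual code):

```go
package main

import "fmt"

func main() {
	var a []int  // nil slice: declared but never allocated
	b := []int{} // empty but non-nil slice literal

	fmt.Println(a == nil, len(a)) // true 0
	fmt.Println(b == nil, len(b)) // false 0

	// Code that gates publishing on `s != nil` treats the two
	// differently: the empty literal is still enqueued as a
	// zero-event batch, while the nil slice is skipped.
	// Checking len(s) > 0 handles both cases.
	for _, s := range [][]int{a, b} {
		if len(s) > 0 {
			fmt.Println("publish batch of", len(s), "events")
		}
	}
}
```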
Thanks to @graphaelli for pointing out that the CPU usage occurs in this method.
Thanks @roncohen for putting up this WIP PR to get some early feedback!
I appreciate the cleanness of the PR, and I really like having the route_config separated and how few changes are necessary in the models!
As this is WIP, I assume you're going to add processor/package_tests for v2.
Are you planning on introducing JSON schemas for v2? E.g. in v1, spans were nested in transactions, which would not be allowed any longer in v2. With the current schemas this would still be possible.
Afaik the plan is to open a new HTTP connection per bulk request. In any case, one bulk request potentially holds a lot of data, which is currently read, decoded and validated one item after the other by a single thread. The reading and decoding part has a high impact on throughput. Have you thought about something like using one goroutine to read from the stream but multiple goroutines to do the decoding and validation, passing the results back to the one blocking goroutine? (Of course this does not need to be included in this PR, but I would like to hear your thoughts about it.)
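A rough, self-contained sketch of that shape, assuming newline-delimited JSON input (all names here are made up; a real version would also need per-line error reporting and, if required, result ordering):

```go
package main

import (
	"bufio"
	"encoding/json"
	"runtime"
	"strings"
	"sync"
)

// decodeParallel reads lines on a single goroutine, fans them out to
// NumCPU decoder workers, and collects results on the caller.
func decodeParallel(sc *bufio.Scanner) []map[string]interface{} {
	lines := make(chan []byte)
	decoded := make(chan map[string]interface{})

	var wg sync.WaitGroup
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for raw := range lines {
				var m map[string]interface{}
				if json.Unmarshal(raw, &m) == nil { // validation would also happen here
					decoded <- m
				}
			}
		}()
	}

	go func() { // single reader goroutine feeding the workers
		for sc.Scan() {
			buf := make([]byte, len(sc.Bytes()))
			copy(buf, sc.Bytes()) // Scanner reuses its buffer
			lines <- buf
		}
		close(lines)
		wg.Wait()
		close(decoded)
	}()

	var out []map[string]interface{}
	for m := range decoded {
		out = append(out, m)
	}
	return out
}

func main() {
	sc := bufio.NewScanner(strings.NewReader("{\"a\":1}\n{\"b\":2}\n"))
	_ = decodeParallel(sc)
}
```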
beater/v2_handler.go
for k, v := range reqMeta {
	utility.MergeAdd(rawMetadata, k, v.(map[string]interface{}))
}
Why would this be necessary here? I understood that the requestDecoder takes care of bringing data into a map[string]interface{} format and augmenting it if necessary. In the next step we pass those data on to validation and Go struct decoding.
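To make that division of labor concrete, here is a hypothetical standalone sketch of a decorating request decoder (the ReqDecoder type mirrors the shape implied above; withSystemMetadata and the augmented keys are invented for illustration):

```go
package main

import (
	"fmt"
	"net/http"
)

// ReqDecoder turns an incoming request into a raw event map
// (a standalone stand-in for the decoder package's type).
type ReqDecoder func(req *http.Request) (map[string]interface{}, error)

// withSystemMetadata augments the decoded map with request-derived
// metadata before validation and struct decoding run.
func withSystemMetadata(d ReqDecoder) ReqDecoder {
	return func(req *http.Request) (map[string]interface{}, error) {
		m, err := d(req)
		if err != nil {
			return nil, err
		}
		m["system"] = map[string]interface{}{"ip": req.RemoteAddr}
		return m, nil
	}
}

func main() {
	base := func(*http.Request) (map[string]interface{}, error) {
		return map[string]interface{}{}, nil
	}
	req, _ := http.NewRequest("POST", "/intake/v2/events", nil)
	m, _ := withSystemMetadata(base)(req)
	fmt.Println(m)
}
```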
We discussed offline and decided to defer this to subsequent PRs.
Great feedback, thanks @simitt! I think parallelizing decoding makes sense, but I might leave it for a follow-up PR. When we do it, we will need to benchmark it to make sure there are significant speedups. I'll open an issue if it doesn't go into this PR.
Created a follow-up task for the v2 JSON spec changes: #1276.
model/span/span.go
@@ -56,7 +65,7 @@ type Span struct {
	TransactionId string
}

-func DecodeSpan(input interface{}, err error) (*Span, error) {
+func DecodeSpan(input interface{}, err error) (transform.Transformable, error) {
You need to add setting the Timestamp to the DecodeSpan method, as for v2 the spans are not sent within transactions any more.
There was an agreement that if the Timestamp is not set, it should be set to the time that the HTTP request from the agent was first received.
Instead of setting the request time during decoding, I'm now doing it during transformation. I added a request time to the transformation context, which is used to set the timestamp if it's absent.
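Schematically, the fallback could look like this (Context, RequestTime, and the Transform signature are assumptions for illustration, not the PR's actual definitions):

```go
package main

import (
	"fmt"
	"time"
)

// Context carries request-scoped data through transformation.
type Context struct {
	RequestTime time.Time // when the agent's HTTP request was first received
}

type Span struct {
	Timestamp time.Time
}

// Transform falls back to the request time when the agent did not
// send a timestamp of its own.
func (s *Span) Transform(ctx *Context) {
	if s.Timestamp.IsZero() {
		s.Timestamp = ctx.RequestTime
	}
}

func main() {
	s := &Span{} // no timestamp sent
	s.Transform(&Context{RequestTime: time.Now()})
	fmt.Println(s.Timestamp)
}
```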
beater/v2_handler.go
for {
	transformables, done := v.readBatch(batchSize, ndReader, resp)
	if transformables != nil {
This would still report empty arrays (if no data were read or a read error occurred, an empty array can be returned), which doesn't cause errors but is unnecessary; please see https://github.com/elastic/apm-server/pull/1260/files/fb77d985e9ee8623290bc23839542e9b318472c8#r208983595
{"metadata.json", "model/metadata/generated/schema/metadata.go", "ModelSchema"}, | ||
{"errors/error.json", "model/error/generated/schema/error.go", "ModelSchema"}, | ||
{"transactions/transaction.json", "model/transaction/generated/schema/transaction.go", "ModelSchema"}, | ||
{"transactions/span.json", "model/span/generated/schema/transaction.go", "ModelSchema"}, |
This should be span: the generated schema path points at transaction.go.
I suggest creating a separate v2 branch upstream and pointing this PR against it. There are a couple of open tasks that need to be solved before v2 should be used. While I think this PR introduces a lot of good changes, merging it into master does not feel right. This is such a big undertaking that having a separate feature branch that we keep up to date with changes in master seems like the right way forward.
docs/spec/span/span.json
@@ -1,5 +1,5 @@
{
Can you please rename this to docs/spec/spans/span.json, as the other folders are also pluralized.
beater/stream_response_test.go
sr.add(QueueFullErr, 23)

jsonOut, err := sr.marshal()
assert.NoError(t, err)
Could you use require.xxx instead of assert.xxx wherever the test should stop if the condition is not fulfilled?
(see https://godoc.org/github.com/stretchr/testify/require)
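For illustration, a small self-contained test showing the difference (the test name and payload are made up):

```go
package beater

import (
	"encoding/json"
	"testing"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

func TestRequireVsAssert(t *testing.T) {
	out, err := json.Marshal(map[string]int{"accepted": 1})
	// require aborts the test on failure, so the code below never
	// runs against invalid output.
	require.NoError(t, err)

	var back map[string]int
	// assert records a failure but keeps going, which can cascade
	// into confusing follow-on failures.
	assert.NoError(t, json.Unmarshal(out, &back))
	assert.Equal(t, 1, back["accepted"])
}
```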
beater/stream_response_test.go
	}
}
}`
expectedJSON = strings.Replace(strings.Replace(expectedJSON, "\n", "", -1), "\t", "", -1)
I don't think we should defer this, and you already introduce the tests here. Changing this to use the approvals would require something like:

	...
	var jsonOut map[string]interface{}
	outByte, err := sr.marshal()
	require.NoError(t, err)
	err = json.Unmarshal(outByte, &jsonOut)
	require.NoError(t, err)
	verifyErr := tests.ApproveJson(jsonOut, "testStreamResponseSimple", nil)
	if verifyErr != nil {
		assert.Fail(t, fmt.Sprintf("Test %s failed with error: %s", "testStreamResponseSimple", verifyErr.Error()))
	}

As this is not a huge effort, I'd appreciate having it changed.
beater/route_config.go
func v2backendHandler(beaterConfig *Config, h http.Handler) http.Handler {
	return logHandler(
		requestTimeHandler(
			authHandler(beaterConfig.SecretToken, h)))
The concurrencyLimitHandler ensured that the server could not get overwhelmed with requests. While I see that the handling needs to change for v2, I am not sure it can simply be removed.
What if an agent is configured to create a separate HTTP request at very short intervals and sends huge payloads per event (e.g. large stack traces)? This could still cause severe issues on the server, especially when running on a small box.
I see what you mean. However, the current default of 5 makes it impossible to support more than 5 agents in v2 out of the box without changing the default value, which would affect v1 negatively. I think the solution is to create a limited number (runtime.NumCPU) of separate goroutines that read from the streams. That would also parallelize decoding and validation. I'll take a look.
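For reference, a generic standalone sketch of such a concurrency cap (a buffered channel used as a counting semaphore; the names and the reject-with-429 behavior are illustrative, not apm-server's actual middleware):

```go
package main

import (
	"net/http"
	"runtime"
)

// limitConcurrency caps how many requests are handled at once;
// excess requests are rejected immediately.
func limitConcurrency(limit int, h http.Handler) http.Handler {
	sem := make(chan struct{}, limit)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem <- struct{}{}: // acquire a slot
			defer func() { <-sem }() // release it when done
			h.ServeHTTP(w, r)
		default:
			http.Error(w, "too many requests", http.StatusTooManyRequests)
		}
	})
}

func main() {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8200", limitConcurrency(runtime.NumCPU(), ok))
}
```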
I've now changed the base branch to a v2 branch. With that in mind, I would like to defer this to a separate PR. v2 is already much safer memory-wise than v1, even in the "edge case" you mention here. I don't think this particular issue should hold up the many other issues that are waiting for this PR.
Merging this to a v2 branch instead of master relaxes this, and I agree we can defer it. Could you please create and link the issue so we don't forget?
Added it to this one: #1285
Let me know if you think there's more we need to do here, @simitt.
Could you link all unchecked notes to the corresponding GitHub tasks, please?
@@ -0,0 +1,328 @@
// Licensed to Elasticsearch B.V. under one or more contributor
errors and metrics are missing completely from these tests. There is an open issue for handling errors, but none for metrics. Please add an issue or some tests.
Missed this one. We have an issue for adding tests for all the different event types: #1288. I don't know if that's what you had in mind, though.
Errors and metrics are tested here by the way: https://github.com/elastic/apm-server/pull/1260/files/0d3c4cadfc672abafbf469ca5e3b960babfca8ad#diff-44977a4ead5422850c78d1922785dc2bR168
Is that what you had in mind?
This introduces two new endpoints, one for backend systems and one for RUM, which use the new intake v2 format.
TODO:
- Finalize processor/package_tests style tests for v2 (deferred: Intake v2: Finalize processor/package_tests style tests #1288)
- Investigate "full queue" behavior (reporting timeout is 1s atm., which is probably too high) (deferred: v2: Investigate "full queue" behavior #1298)
- Investigate which limits/configs are useful for intake v2 (deferred: v2: Investigate which limits/configs are useful for intake v2 #1299)
- Consider unsetting/increasing read_timeout when using v2 to avoid timeouts (also handled by #1299)

Closes #1238
Config options of note:
On the server there are some config options that are interesting for v2:
- read_timeout: if the agent doesn't complete its request within the read_timeout, atm. APM Server will force close the connection with an internal error
- max_unzipped_size: if the unzipped size surpasses max_unzipped_size, the rest will be rejected
- the concurrent requests limit: for v2, this will be the maximum number of agents that we can receive data from. We should increase it significantly or disable it entirely for v2, as we no longer need to keep it low to conserve memory