JavaScript heap out of memory during import #259

Closed
kierangirvan opened this issue Feb 2, 2024 · 24 comments

@kierangirvan

Describe the bug
Whilst attempting to upload a large JTL file (1.1 GB), the upload itself seems to work, but the processing step (yellow icon in the test report view) never completes, and the backend eventually throws an exception indicating that it has run out of memory.

To Reproduce
Attempt to upload a 1.1 GB JTL file.

Expected behavior
The test results should eventually become visible in the jtlreporter frontend.

Screenshots
We are running this in AWS ECS on a Fargate task. You can see that from 17:12 onwards the KPI file is being processed:

February 01, 2024 at 17:12 (UTC) {"level":"info","message":"Starting KPI file streaming and saving to db, item_id: 76dc39fc-a417-48d9-8d78-8f8a47a1df3a"}

Almost 90 minutes later, the following exception is thrown by the backend container:

February 01, 2024 at 18:46 (UTC) FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
February 01, 2024 at 18:46 (UTC) <--- JS stacktrace --->
February 01, 2024 at 18:46 (UTC) [17:0x7fc901654300] 72456860 ms: Scavenge (reduce) 2045.4 (2080.5) -> 2044.5 (2080.5) MB, 1.9 / 0.0 ms (average mu = 0.086, current mu = 0.001) allocation failure;
February 01, 2024 at 18:46 (UTC) [17:0x7fc901654300] 72456856 ms: Scavenge (reduce) 2045.4 (2080.5) -> 2044.5 (2080.5) MB, 1.8 / 0.0 ms (average mu = 0.086, current mu = 0.001) allocation failure;
February 01, 2024 at 18:46 (UTC) [17:0x7fc901654300] 72456853 ms: Scavenge (reduce) 2045.4 (2080.5) -> 2044.5 (2080.5) MB, 2.0 / 0.0 ms (average mu = 0.086, current mu = 0.001) allocation failure;
February 01, 2024 at 18:46 (UTC) <--- Last few GCs --->

The container is then marked unhealthy and is replaced by a new container. From what I can see we are not running hot on either CPU/memory on the task itself:
image

So I assume we need to give the runtime in question a bigger slice of the memory. Do you know whether this can be set, and how?
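
A note on reading the GC lines above: each Scavenge entry appears to report the heap in use (with the total heap size in parentheses) before and after the collection. On that reading, the process sits at roughly 2045 MB of a 2080 MB heap and reclaims under 1 MB per cycle, which could explain why the task-level graphs look calm while Node itself is at its limit. A minimal TypeScript sketch for pulling those numbers out of such a line, assuming the format shown in the log above (`parseScavengeLine` is a hypothetical helper, not part of jtl-reporter):

```typescript
interface GcSample {
  usedBeforeMb: number; // heap in use before the collection
  usedAfterMb: number;  // heap in use after the collection
  heapTotalMb: number;  // total heap size reported after the collection
  reclaimedMb: number;  // difference, rounded to one decimal place
}

// Parse "<used> (<total>) -> <used> (<total>) MB" out of a V8 GC trace line.
// Returns undefined when the line does not contain that pattern.
function parseScavengeLine(line: string): GcSample | undefined {
  const num = "(\\d+(?:\\.\\d+)?)";
  const pattern = new RegExp(`${num} \\(${num}\\) -> ${num} \\(${num}\\) MB`);
  const m = pattern.exec(line);
  if (!m) return undefined;
  const usedBeforeMb = Number(m[1]);
  const usedAfterMb = Number(m[3]);
  const heapTotalMb = Number(m[4]);
  const reclaimedMb = Math.round((usedBeforeMb - usedAfterMb) * 10) / 10;
  return { usedBeforeMb, usedAfterMb, heapTotalMb, reclaimedMb };
}

const sample = parseScavengeLine(
  "[17:0x7fc901654300] 72456860 ms: Scavenge (reduce) 2045.4 (2080.5) -> 2044.5 (2080.5) MB, 1.9 / 0.0 ms"
);
console.log(sample);
```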

@kierangirvan
Author

After some research, it seems there is a Node.js environment variable that can be set, i.e.

export NODE_OPTIONS=--max_old_space_size=4096

See here for details > https://www.npmjs.com/package/increase-memory-limit

Do you think this is worth a try?
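
One way to confirm that the setting actually reached the Node process is to print the heap limit V8 reports at startup. The following is a minimal TypeScript sketch using Node's built-in `v8` module; `heapLimitMb` is a hypothetical helper name, and where to call it in the backend is left open:

```typescript
import { getHeapStatistics } from "node:v8";

// Convert V8's reported heap size limit from bytes to whole megabytes.
function heapLimitMb(): number {
  return Math.round(getHeapStatistics().heap_size_limit / (1024 * 1024));
}

// With NODE_OPTIONS=--max_old_space_size=4096 this is expected to print a value
// somewhat above 4096, since the limit also covers the young generation.
console.log(`V8 heap limit: ${heapLimitMb()} MB`);
```

If the printed value does not change after setting the variable, the variable is probably not being passed into the container environment.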

@ludeknovy
Owner

Hi @kierangirvan!
It definitely looks like a memory issue.
Yes, it's worth giving export NODE_OPTIONS=--max_old_space_size=4096 a try, I guess.

Was there any other log message after "Starting KPI file streaming and saving to db"?

There must be a memory leak somewhere, I suppose, although the overall design is to process the file in chunks. So I would like to know whether it failed during parsing/saving the data into the DB or during processing.
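
To illustrate what processing in chunks means for memory use, here is a minimal TypeScript sketch. It is not the project's implementation: `inBatches` and `importRows` are hypothetical names, and the database write is replaced by a counter. The point is that only one batch is referenced at a time, so memory stays bounded as long as nothing else retains the rows.

```typescript
// Group any iterable into fixed-size batches, yielding one batch at a time.
function* inBatches<T>(rows: Iterable<T>, batchSize: number): Generator<T[]> {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  let batch: T[] = [];
  for (const row of rows) {
    batch.push(row);
    if (batch.length === batchSize) {
      yield batch;
      batch = []; // drop the reference so the previous batch can be collected
    }
  }
  if (batch.length > 0) yield batch; // flush the final partial batch
}

// Hypothetical stand-in for the import step: counts rows instead of inserting them.
function importRows(rows: Iterable<string>, batchSize: number): { batches: number; rows: number } {
  let batches = 0;
  let total = 0;
  for (const batch of inBatches(rows, batchSize)) {
    batches += 1;
    total += batch.length; // a real importer would write the batch to the DB here
  }
  return { batches, rows: total };
}

console.log(importRows(["a", "b", "c", "d", "e"], 2));
```

A leak in such a design typically comes from something outside the loop holding on to every batch, for example an array that accumulates all rows for later processing.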

Also, did you consider streaming the data into the app while your test is running? That would reduce the amount of time spent on parsing the data significantly.
https://jtlreporter.site/docs/integrations/samples-streaming
https://jtlreporter.site/docs/integrations/jmeter#2-continuous-results-uploading

@kierangirvan
Author

Thanks for your quick response.

It does suggest it is attempting to save to the DB. This step usually takes ages, but we've come to live with that. The last log entry before it runs out of memory is:

February 01, 2024 at 17:12 (UTC) {"level":"info","message":"Starting KPI file streaming and saving to db, item_id: 76dc39fc-a417-48d9-8d78-8f8a47a1df3a"}

We are using Taurus entirely for our test design, i.e. we do not dip into JMX; it's entirely YAML-based. I believe it is not possible to enable the backend listener with Taurus via YAML: you have to convert the whole scenario to JMX to achieve this, which we really don't want to do.

@ludeknovy
Owner

ludeknovy commented Feb 2, 2024

  1. If there's no other log, then yes, there must be an issue somewhere here: https://github.com/ludeknovy/jtl-reporter-be/blob/master/src/server/controllers/item/create-item-controller.ts#L184.

     If it is possible to anonymize your .jtl file and share it with me, I could have a look and check whether I can spot the issue.

  2. Oh, I see. I haven't checked Taurus lately, but since it has BlazeMeter support, which I believe sends the data during test execution, there must be a way to achieve the same for any other tool.

@ludeknovy
Owner

One more note on 2): actually, it seems this may be possible if you use a custom JMeter installation or copy the plugin into the plugins folder. See https://gettaurus.org/docs/JMeter/#JMeter-Location-Auto-Installation

@kierangirvan
Author

Thanks @ludeknovy

I think the issue boils down to our ability to call the backend listener within the YAML itself. We have purposely built everything in YAML (and not JMX), and I do not believe there is a way to call the backend listener within the Taurus YAML configuration.

Regarding the out-of-memory issue itself: we have included the following Node heap configuration and have now successfully uploaded a 1.1 GB KPI file. We will run a few more uploads to be certain, but that seems to have done the trick.

NODE_OPTIONS=--max_old_space_size=4096

I will close this issue once we have successfully uploaded a few more tests in the next few days.

@ludeknovy
Owner

  1. Thanks for checking it.
  2. If the executor used is jmeter, then it should work, I believe. But once I have a minute, I will test it myself.

@ludeknovy
Owner

  1. You were right. It seems that it's not possible when no JMX is used. Unfortunately, that's due to the Taurus design: I did not find an easy way to extend it, and forking it does not look like a good idea.

@milanpanik

milanpanik commented Feb 9, 2024

I had a similar problem. I increased the memory, but now I'm encountering problems with a slow DB, I guess :) I'm attaching both the config and the error log of the failed BE service. It seems that the DB is busy, so the BE failed, and it needs a manual restart to work again.

Screenshot 2024-02-09 at 9 28 08 Screenshot 2024-02-09 at 9 28 25 Screenshot 2024-02-09 at 9 28 56

@ludeknovy
Owner

Hi @milan-panik !
From the provided information, it looks like a networking issue: the server could not connect to the database (EAI_AGAIN indicates a problem with DNS resolution), and that resulted in the backend failure. I will try to handle it so that it does not crash the whole application.
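
As an illustration of handling such a failure without crashing, here is a minimal TypeScript sketch that retries an operation when the error carries a transient code such as EAI_AGAIN. This is an assumption about one possible approach, not the fix that was applied in the backend; `isTransient`, `withRetry`, and the list of codes are hypothetical.

```typescript
// Error codes treated as transient in this sketch. EAI_AGAIN is a temporary
// DNS lookup failure; the other two are common network-level errors.
const TRANSIENT_CODES = new Set(["EAI_AGAIN", "ECONNRESET", "ETIMEDOUT"]);

// True when the thrown value has a string `code` listed above.
function isTransient(err: unknown): boolean {
  const code = (err as { code?: unknown } | null | undefined)?.code;
  return typeof code === "string" && TRANSIENT_CODES.has(code);
}

// Run `op`, retrying transient failures with exponential backoff.
// Non-transient errors, and the last failed attempt, are rethrown to the caller.
async function withRetry<T>(op: () => Promise<T>, attempts = 3, baseDelayMs = 200): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      if (attempt >= attempts || !isTransient(err)) throw err;
      const delayMs = baseDelayMs * 2 ** (attempt - 1);
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Hypothetical usage, where `db.query` stands for whatever client call is made:
//   const rows = await withRetry(() => db.query("SELECT 1"));
```

Retrying only helps with short outages. If the database is overloaded for long periods, as described later in this thread, the caller still needs to catch the final error rather than let it terminate the process.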

@milanpanik

It happens only during peak hours, i.e. when a batch of tests ends at the same time and a lot of reports are being uploaded.

@ludeknovy
Owner

@milan-panik
OK, would it be possible to provide logs from the jtl-reporter-db service from the time when this problem occurred?
Maybe it will help to understand what is going on.

@milanpanik

The BE died at 00:45 on 2024-02-07. The DB logs are:
Screenshot 2024-02-09 at 11 45 23

@ludeknovy
Owner

ludeknovy commented Feb 9, 2024

@milan-panik Thanks! I see an increased value for max_wal_size in your config. Does the issue occur with it as well, or was it set afterwards? It looks like the load on the database is too high: you mentioned you have many test reports processed at the same time.

@ludeknovy
Owner

Have you enabled the option to delete samples after a report is generated?

@ludeknovy
Owner

I've removed the vacuum query that ran after the samples purge, as it was far too heavy an operation. By default, it's handled by autovacuum anyway. So if you had "Delete sample data after processing" enabled, the changes in the latest Docker image should help.

@milanpanik

Thank you, Ludek. I'm a bit lost though: has it already been released? I've checked the releases and the related changelog, and I cannot find it.

@ludeknovy
Owner

@milan-panik It has not been released yet, but it's available in the latest image: novyl/jtl-reporter-be:latest

@ludeknovy
Owner

@kierangirvan I've pushed a possible fix, and I would appreciate it if you could test it and let me know.

@kierangirvan
Author

Thanks @ludeknovy

I'll get the latest build pushed out in the coming week and confirm if this has helped.

stale bot commented Mar 30, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@ludeknovy
Owner

I've found a memory leak and prepared a fix for it that will release the memory. But I need to change the whole solution so that the high memory usage does not occur in the first place. However, that won't be possible without changing the DB Docker image, as it needs to include the Timescale toolkit. It will take some time to prepare the image, as the HA version does not support ARM.

@stale stale bot closed this as completed Apr 23, 2024
@ludeknovy ludeknovy reopened this Apr 23, 2024
@stale stale bot removed the wontfix This will not be worked on label Apr 23, 2024
@ludeknovy
Owner

I've prepared new Docker images for the project: https://hub.docker.com/r/novyl/jtl-reporter-db
I think I have the proper fix ready. It will be released in v5, and it will require some manual steps to upgrade from v4 (backup and restore the DB).

stale bot commented May 26, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label May 26, 2024
@stale stale bot closed this as completed Jun 2, 2024