This has been going on for a while and is intermittent, roughly every two weeks. I fed my logs into ChatGPT and it came up with the explanations and suggestions below.
The question is: am I able to adjust the NGINX/PHP-FPM settings? Or is there something at play with either the Docker container or FreeScout itself?
I am running FreeScout on an AWS EC2 instance with a 30 GB volume.
1. Buffering Issues: Starting from around 7:46 AM, NGINX is logging repeated warnings about buffering responses from the upstream server to temporary files (/var/cache/nginx/proxy_temp/...). This indicates that NGINX could not process responses from your upstream server fast enough, possibly because the responses were too large or slow, leading to higher memory and disk usage.
2. Connection Timeouts: Around 9:18 AM, you begin seeing upstream timed out errors, indicating that your upstream server (IP 13.43.11.38) was taking too long to respond. This could be due to the upstream server being overloaded or unresponsive. These timeouts are critical and often lead to increased load on NGINX as it keeps retrying requests.
3. Connection Refused Errors: Starting from 9:59 AM, NGINX is logging connect() failed (111: Connection refused) errors when trying to communicate with the upstream server. This suggests that by this time, the upstream server was entirely unresponsive or had crashed. Since these errors coincide with the time your site went down, this could be the direct cause of the 100% CPU usage as NGINX continues to attempt connecting to the dead upstream server.
Possible Causes:
• Upstream server failure: The upstream server (...:8252) was likely overloaded, crashed, or was unresponsive, which caused cascading issues for NGINX.
• Large files or slow responses: The buffering issues related to the large images and JavaScript files could point to large payloads causing excessive disk and memory usage, pushing the CPU to 100%.
• NGINX caching configuration: The buffering to disk in /var/cache/nginx/proxy_temp might indicate a configuration issue or that your upstream server wasn’t able to handle the load efficiently.
Suggested Next Steps:
Check upstream server (...:8252): Investigate logs and resource usage (CPU, memory) on this server to see if it was overloaded or crashed.
NGINX tuning: You may need to adjust NGINX’s proxy buffering settings to limit the size of buffered files or reduce memory usage (see the sketch after this list).
Upstream performance: Consider optimizing the upstream server for better performance or increasing resources if it’s underpowered.
Investigate large payloads: Review the size of the resources (images, scripts) being delivered to clients and optimize them if necessary.
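As a concrete starting point for the NGINX tuning step above, these are the directives that control when proxied responses spill into /var/cache/nginx/proxy_temp. The values are only a sketch for a small instance and would go in the server or location block that proxies to the upstream; tune them against your real response sizes:

```nginx
# Keep more of each upstream response in memory before spilling to proxy_temp
proxy_buffer_size        16k;    # headers plus the start of the response
proxy_buffers            16 16k;
proxy_busy_buffers_size  32k;
# Cap how much NGINX will buffer to disk per request
# (0 disables temp files entirely, at the cost of tying up the upstream
# connection while slow clients drain the response)
proxy_max_temp_file_size 16m;
```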
Key Insights:
Upstream Buffering and Temporary Files:
• NGINX is buffering responses to temporary files (fastcgi_temp) frequently when reading from upstream. This can occur when the upstream response (in this case, from PHP-FPM) is large or slow, forcing NGINX to store part of the response in temporary files rather than in memory.
• This excessive buffering can lead to increased disk I/O, contributing to system load and potentially exhausting CPU resources.
Upstream Sending More Data than Content-Length:
• Several logs mention that the upstream (PHP-FPM) is sending more data than specified in the Content-Length header. This discrepancy can be problematic for the NGINX fastcgi module, leading to inefficient resource handling.
• This issue often points to a misconfiguration or bug in the application, or it could be caused by large file uploads/downloads not being correctly managed.
File Not Found (404) Errors:
• There are several open() failed errors indicating that specific files (e.g., CSS and JS files) were not found. These missing static files might cause additional load on your server, as NGINX tries to locate these resources repeatedly.
• These errors also lead to the browser retrying the requests, potentially adding unnecessary load to the server.
PHP-FPM Timeout:
• Multiple upstream timed out errors appear, which indicate that PHP-FPM processes are timing out while handling certain requests (e.g., /polycast/receive). This could point to PHP-FPM being overloaded, not having enough workers, or experiencing high resource consumption from specific requests.
• This likely exacerbated the situation and contributed to the 100% CPU usage, as NGINX keeps trying to connect to a slow or unresponsive PHP-FPM backend.
Possible Causes of High CPU:
Disk I/O Overload: The buffering to temporary files can significantly increase disk usage and slow down the entire system.
Misconfigured FastCGI Settings: The discrepancies between Content-Length headers and actual data sent can cause inefficient resource management in NGINX and PHP-FPM.
PHP-FPM Worker Overload: If PHP-FPM is underprovisioned, it might not be able to handle the incoming requests, causing the timeouts you see and pushing CPU usage higher as the server keeps trying to process the backlog.
Suggested Fixes:
1. Optimize Buffering Settings in NGINX:
You can try limiting the buffer size for FastCGI responses to avoid writing large files to disk.
Ensure that your proxy_buffering is configured appropriately depending on the types of responses (static vs. dynamic).
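A minimal sketch of those settings on the FastCGI side, assuming FreeScout's PHP requests go through a standard fastcgi_pass location; the sizes are illustrative and should be balanced against the memory available per connection:

```nginx
location ~ \.php$ {
    # ... existing fastcgi_pass / fastcgi_param lines ...

    # Hold more of the PHP-FPM response in memory so fewer responses
    # get written out to fastcgi_temp
    fastcgi_buffer_size        32k;
    fastcgi_buffers            16 32k;
    fastcgi_busy_buffers_size  64k;
    # Limit how large a disk-buffered response may grow
    fastcgi_max_temp_file_size 32m;
}
```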
2. Fix Content-Length Mismatches:
Investigate the FreeScout application and PHP-FPM configurations to address why more data is being sent than declared in the Content-Length header. This could require adjusting how the app handles file uploads/downloads or tweaking PHP-FPM output settings.
3. Increase PHP-FPM Worker Limits:
You may need to increase the number of PHP-FPM workers to handle the request load better. Check your php-fpm.conf file to make sure the pm.max_children, pm.start_servers, and other worker-related configurations are appropriate for your server load.
4. Address Missing Files:
Resolve the missing static file errors by checking the deployment pipeline and ensuring that all required static resources are properly deployed.
5. Consider Caching for Static Files:
If feasible, implement caching for static files using NGINX or a CDN to reduce the load on your PHP-FPM processes and avoid repeatedly hitting the upstream server for static resources.
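For example, a location block along these lines lets NGINX answer static asset requests directly with long-lived client caching instead of passing them upstream; the extension list and expiry are assumptions to adapt to your deployment:

```nginx
# Serve static assets straight from disk with client-side caching
location ~* \.(css|js|png|jpg|jpeg|gif|ico|svg|woff2?)$ {
    expires 7d;
    add_header Cache-Control "public";
    access_log off;
    try_files $uri =404;
}
```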
The PHP-FPM error log provides further insights into the issues your site is experiencing. Here’s a breakdown of key points:
Key Issues:
1. PHP-FPM Pool Busy Warnings:
Starting around 10:10 AM, PHP-FPM begins logging repeated warnings that the pool is busy and that it needs to spawn more child processes. This continues until 10:57 AM, when PHP-FPM is terminated and restarted. These warnings indicate that PHP-FPM was overwhelmed with requests and didn’t have enough workers to handle the load efficiently.
PHP-FPM keeps trying to spawn additional children, and the total number of children continues to increase, reaching 31 by 10:16 AM and eventually 36. Despite these attempts, there are still no idle workers, meaning all the children are busy processing requests, which leads to delays and timeouts.
2. Child Process Exits:
The logs also show numerous child processes exiting with code 0, meaning they completed successfully, but with very long execution times (e.g., over 23,000 seconds). This indicates that some processes were taking an extremely long time to complete, likely contributing to the high CPU usage and the eventual pool overload.
3. Frequent Spawning of New Children:
The system is constantly spawning new PHP-FPM child processes to keep up with demand, which puts additional strain on the server, especially if there are insufficient resources (CPU, memory) available. The more workers that are spawned, the more resources are required to manage them.
4. PHP-FPM Restart:
At 10:57 AM, PHP-FPM terminates and restarts. This could have been a manual restart or triggered by an automated system trying to recover from overload.
Possible Causes:
Insufficient PHP-FPM Worker Settings: Your current settings for pm.start_servers, pm.min_spare_servers, pm.max_spare_servers, or pm.max_children are likely too low to handle the traffic load.
Long-Running Requests: The extremely long-running requests (some over 23000 seconds) suggest that certain PHP processes are taking a very long time to complete. This could be due to inefficient code, database issues, or external services being slow to respond.
Recommendations:
1. Increase PHP-FPM Workers:
Review and increase your PHP-FPM pool settings. You can adjust these parameters in your php-fpm.conf or pool configuration file (e.g., /etc/php/7.x/fpm/pool.d/www.conf):
pm.max_children: The maximum number of child processes that PHP-FPM will spawn. If your server has enough CPU and memory, consider increasing this value.
pm.start_servers: The number of child processes created when PHP-FPM starts.
pm.min_spare_servers and pm.max_spare_servers: The minimum and maximum number of idle child processes. Ensure there are enough spare servers to handle incoming requests.
Example configuration:
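The numbers below are purely illustrative and assume a small EC2 instance; derive pm.max_children from the RAM you can spare for PHP divided by the average size of one PHP-FPM worker (check with top or ps), then adjust the other values to match:

```ini
; Rough sizing: pm.max_children ~ RAM available for PHP / average worker size
; e.g. 1.5 GB / 60 MB ~ 25 (both figures are assumptions - measure yours)
pm = dynamic
pm.max_children = 25
pm.start_servers = 8
pm.min_spare_servers = 4
pm.max_spare_servers = 12
; Recycle workers periodically to contain slow memory leaks
pm.max_requests = 500
```

After changing these, reload PHP-FPM and watch memory usage; if the server starts swapping, the max_children value is too high for the instance.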
2. Optimize Long-Running Requests:
Investigate the processes taking a long time (shown in the log as exits after several thousand seconds) and identify whether they can be optimized. These could be database queries, file operations, or API requests causing the delays.
Set appropriate timeout values for PHP scripts, database connections, and other external services to avoid hanging processes.
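A sketch of where those timeouts typically live; the 60/120-second values are placeholders, not recommendations, and must stay longer than your slowest legitimate request:

```nginx
# NGINX, in the PHP location block: how long to wait for PHP-FPM
fastcgi_read_timeout 120s;
```

```ini
; php.ini: cap on script execution time
max_execution_time = 60
; PHP-FPM pool (www.conf): hard-kill any single request that runs past this
request_terminate_timeout = 120s
```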
3. Monitoring and Scaling:
Set up monitoring for PHP-FPM and your server resources (CPU, memory) to track how many children are spawned and whether your system is nearing its limits.
Consider scaling horizontally (e.g., adding more application servers) if your traffic load exceeds what your current server can handle.
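One low-effort way to get that visibility, assuming you can edit the pool file and the NGINX vhost, is PHP-FPM's built-in status page; the /fpm-status path and the fastcgi_pass target below are placeholders that must match your actual FPM listen address:

```ini
; PHP-FPM pool (www.conf): expose the status page
pm.status_path = /fpm-status
```

```nginx
# NGINX: allow only local access to the status page
location = /fpm-status {
    allow 127.0.0.1;
    deny all;
    include fastcgi_params;
    fastcgi_pass 127.0.0.1:9000;  # placeholder - point at your FPM socket/port
}
```

The page reports active and idle processes, the listen queue, and "max children reached", which tells you directly whether pm.max_children is the bottleneck.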
4. Consider Persistent Connections:
If applicable, you could enable persistent database connections (via pdo_mysql or similar) to reduce the overhead of opening and closing database connections for each request.