Sometimes a Linux server running an Nginx + PHP-CGI (php-fpm) web service experiences a sudden increase in system load, and the top command shows many php-cgi processes each using close to 100% CPU. If this happens and the PHP scripts use file_get_contents, that function may be the cause.
Web applications often make many HTTP-based API requests. Many PHP developers like to use file_get_contents("http://example.com/") to fetch the API response because it is simple to use. The problem is that file_get_contents blocks when the remote server responds slowly, and it does not time out quickly: unless a timeout is set explicitly, it falls back to default_socket_timeout, which is 60 seconds by default.
In php.ini, the max_execution_time setting limits the maximum execution time of a PHP script. However, this setting does not reliably stop a stuck script under PHP-CGI (php-fpm); the PHP manual notes that on non-Windows systems time spent in stream operations is not counted toward it. The setting that can actually terminate a script is request_terminate_timeout in php-fpm.conf:
The timeout (in seconds) for serving a single request after which the worker process will be terminated
Should be used when 'max_execution_time' ini option does not stop script execution for some reason
'0s' means 'off'
<value name="request_terminate_timeout">0s</value>
The default value is 0s, which means php-fpm never terminates a running PHP script. In that case, once all php-cgi processes are stuck in file_get_contents, the web server cannot handle any further PHP requests and Nginx returns a 502 Bad Gateway error.
Hence the value should be changed to something reasonable such as 30s, so that a request stuck in file_get_contents is terminated after at most 30 seconds. Under heavy request volume, however, this still cannot prevent the Bad Gateway error, because each stuck worker is tied up for the full 30 seconds. To resolve the problem thoroughly, change the way file_get_contents is called: it accepts a stream context parameter that can be used to control the timeout of the call.
$ctx = stream_context_create(array(
    'http' => array(
        'timeout' => 1 // read timeout in seconds
    )
));
// The second argument is use_include_path (a boolean); the context is the third.
$response = file_get_contents("http://example.com/", false, $ctx);
According to the API documentation, the context parameter has the following meaning:
context
A valid context resource created with stream_context_create(). If you don't need to use a custom context, you can skip this parameter by NULL.
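When the timeout is reached, file_get_contents emits a warning and returns false, so the caller should check the result rather than assume a response arrived. The following is a minimal sketch of that check, assuming PHP 7.1 or later; the helper name fetch_with_timeout and its default of one second are hypothetical and only for illustration.

```php
<?php
// Hypothetical helper (not part of PHP): fetch a URL with an explicit
// timeout and return null on failure instead of a warning plus false.
function fetch_with_timeout(string $url, float $timeoutSeconds = 1.0): ?string
{
    $ctx = stream_context_create(array(
        'http' => array(
            'timeout' => $timeoutSeconds, // read timeout in seconds
        ),
    ));

    // @ suppresses the warning raised on failure; the return value is
    // checked explicitly below instead.
    $body = @file_get_contents($url, false, $ctx);

    return $body === false ? null : $body;
}
```

A caller would then write something like $response = fetch_with_timeout("http://example.com/"); and fall back or log an error when $response is null. Note that the timeout is a read timeout, so a server that trickles data slowly may still keep the call busy for longer than the configured value in total.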
Now that the fix is known, how can one tell that the issue is really caused by file_get_contents? There are many other reasons why CPU usage might reach 100%. Start with the top command:
top - 10:34:18 up 724 days, 21:01, 3 users, load average: 17.86, 11.16, 7.69
Tasks: 561 total, 15 running, 546 sleeping, 0 stopped, 0 zombie
Cpu(s): 5.9%us, 4.2%sy, 0.0%ni, 89.4%id, 0.2%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 8100996k total, 4320108k used, 3780888k free, 772572k buffers
Swap: 8193108k total, 50776k used, 8142332k free, 412088k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10747 www 18 0 360m 22m 12m R 100.6 0.3 0:02.60 php-cgi
10709 www 16 0 359m 28m 17m R 96.8 0.4 0:11.34 php-cgi
10745 www 18 0 360m 24m 14m R 94.8 0.3 0:39.51 php-cgi
10707 www 18 0 360m 25m 14m S 77.4 0.3 0:33.48 php-cgi
10782 www 20 0 360m 26m 15m R 75.5 0.3 0:10.93 php-cgi
10708 www 25 0 360m 22m 12m R 69.7 0.3 0:45.16 php-cgi
10683 www 25 0 362m 28m 15m R 54.2 0.4 0:32.65 php-cgi
10711 www 25 0 360m 25m 15m R 52.2 0.3 0:44.25 php-cgi
10688 www 25 0 359m 25m 15m R 38.7 0.3 0:10.44 php-cgi
10719 www 25 0 360m 26m 16m R 7.7 0.3 0:40.59 php-cgi
Pick the PID of one process with high CPU usage and trace it with strace -p 10747. Check whether output like the following is displayed:
select(7, [6], [6], [], {15, 0}) = 1 (out [6], left {15, 0})
poll([{fd=6, events=POLLIN}], 1, 0) = 0 (Timeout)
select(7, [6], [6], [], {15, 0}) = 1 (out [6], left {15, 0})
poll([{fd=6, events=POLLIN}], 1, 0) = 0 (Timeout)
select(7, [6], [6], [], {15, 0}) = 1 (out [6], left {15, 0})
poll([{fd=6, events=POLLIN}], 1, 0) = 0 (Timeout)
select(7, [6], [6], [], {15, 0}) = 1 (out [6], left {15, 0})
poll([{fd=6, events=POLLIN}], 1, 0) = 0 (Timeout)
select(7, [6], [6], [], {15, 0}) = 1 (out [6], left {15, 0})
poll([{fd=6, events=POLLIN}], 1, 0) = 0 (Timeout)
select(7, [6], [6], [], {15, 0}) = 1 (out [6], left {15, 0})
poll([{fd=6, events=POLLIN}], 1, 0) = 0 (Timeout)
select(7, [6], [6], [], {15, 0}) = 1 (out [6], left {15, 0})
poll([{fd=6, events=POLLIN}], 1, 0) = 0 (Timeout)
select(7, [6], [6], [], {15, 0}) = 1 (out [6], left {15, 0})
poll([{fd=6, events=POLLIN}], 1, 0) = 0 (Timeout)
select(7, [6], [6], [], {15, 0}) = 1 (out [6], left {15, 0})
poll([{fd=6, events=POLLIN}], 1, 0) = 0 (Timeout)
select(7, [6], [6], [], {15, 0}) = 1 (out [6], left {15, 0})
poll([{fd=6, events=POLLIN}], 1, 0) = 0 (Timeout)
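The process keeps polling the socket to the remote server, and every poll times out without any data. This tight loop is what consumes the CPU, so a slow remote response behind file_get_contents is the likely cause.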
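The same suspicion can be checked from the application side by timing the remote calls and logging the slow ones. This is only a sketch under stated assumptions: the helper timed_call, its label argument, and the one-second threshold are hypothetical and not part of the original diagnosis, which relies on top and strace.

```php
<?php
// Hypothetical helper: run a callable, measure its wall-clock duration,
// and write a log entry when it takes at least $slowAfterSeconds.
function timed_call(callable $fn, string $label, float $slowAfterSeconds = 1.0): array
{
    $start = microtime(true);
    $result = $fn();
    $elapsed = microtime(true) - $start;

    if ($elapsed >= $slowAfterSeconds) {
        error_log(sprintf('slow call "%s" took %.3f s', $label, $elapsed));
    }

    // Return both the callable's result and the measured duration.
    return array('result' => $result, 'elapsed' => $elapsed);
}
```

Wrapping each file_get_contents call in a closure passed to timed_call would show in the PHP error log whether the remote API is the slow part, without needing shell access to the server.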
Reference: https://www.toutiao.com/i6621105587566412296/