Oct 28, 2012

Zabbix poller processes more than 75% busy and queue delay (II)

After presenting the performance issues that turned up in my current Zabbix installation (Zabbix poller processes more than 75% busy and queue delay), I am going to explain how I solved them.

First of all, I tried increasing the number of pre-forked poller instances for the Zabbix server, that is, I changed the default value from 5 to 256 (remember that in this case, you have to set the maximum number of connections in MySQL - max_connections - higher than 256, since every single poller opens a dedicated connection to the database).

root@zabbix-server:~# cat /etc/zabbix/zabbix_server.conf
...
# StartPollers=5
StartPollers=256

root@zabbix-server:~# cat /etc/mysql/my.cnf
...
max_connections = 512
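
After restarting MySQL, you can double-check that the new limit is in effect from the MySQL console (a quick sanity check; the value shown is simply the one configured above):

root@zabbix-server:~# mysql -u root -p
mysql> SHOW VARIABLES LIKE 'max_connections';
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| max_connections | 512   |
+-----------------+-------+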

Below you can see the outcome after applying this change (Zabbix server performance).




And the Zabbix data gathering process.




In the first figure, you can observe that the Zabbix queue has gone from 48 to 30 delayed items (approximately), and in the second one, the Zabbix busy poller processes went from 100% to 24%. So it is clear that if you have a server with enough resources, there is no problem in starting many pollers. These processes are responsible for requesting the data defined in the items, so the more pollers you have available, the less busy each of them is.
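
As a side note, graphs like the ones above are built from Zabbix internal items. If you want to collect the same measurements yourself, item keys along the following lines should work (a sketch based on the Zabbix 2.0 internal item syntax; double-check the keys against your own template):

# Average percentage of time the poller processes have been busy
zabbix[process,poller,avg,busy]

# Number of monitored items delayed in the queue (here, for more than 10 minutes)
zabbix[queue,10m]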

Another Zabbix server parameter that you ought to take into account is the Timeout (it specifies how long pollers wait for agent responses). Try not to assign a very high value; otherwise, the system might get overloaded. A short illustrative example is shown below.
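
For reference, this parameter lives in zabbix_server.conf and, as far as I know, accepts values between 1 and 30 seconds (the value below is just an illustrative choice, not a recommendation for every setup):

root@zabbix-server:~# cat /etc/zabbix/zabbix_server.conf
...
# Timeout=3
Timeout=5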

Next week, I will wrap up this series of articles by covering the client side.


Oct 21, 2012

sysstat vs top vs ps (III)

Let's finish the series of articles about the differences in the measurements reported by sysstat, top and ps.

What about the system CPU time (%sy) in the first top output from the previous article? It is 39.7%. That is right as well. You have to take into account that, because this server has a couple of cores, that figure represents the average of both cores. You can verify this by running top in interactive mode and pressing 1; then you will obtain the consumption of each core separately.

root@ubuntu-server:~# top
top - 20:04:37 up 47 min,  1 user,  load average: 0.40, 0.54, 0.60
Tasks:  87 total,   1 running,  86 sleeping,   0 stopped,   0 zombie
Cpu0  :  2.9%us, 37.7%sy,  0.0%ni, 59.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  1.6%us, 40.8%sy,  0.0%ni, 57.1%id,  0.0%wa,  0.0%hi,  0.5%si,  0.0%st
Mem:   1024800k total,   207680k used,   817120k free,    23532k buffers
Swap:  1048572k total,        0k used,  1048572k free,   100320k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1561 root      20   0 29280 4180 2416 S  128  0.4  56:57.23 script.py

At any rate, the sum of both percentages does not exactly match the %CPU used by the script. This might be due to the different sampling frequencies discussed in the preceding article.

Finally, I would like to remark that in order to monitor threads, I like to use pidstat with the -t parameter.

root@ubuntu-server:~# pidstat -u -t -p 1561 1 1
Linux 3.2.0-30-generic-pae (ubuntu-server)     09/30/2012     _i686_    (2 CPU)

08:14:47 PM      TGID       TID    %usr %system  %guest    %CPU   CPU  Command
08:14:48 PM      1561         -    9.00  119.00    0.00  128.00     1  script.py
08:14:48 PM         -      1561    0.00    0.00    0.00    0.00     1  |__script.py
08:14:48 PM         -      1562    4.00   60.00    0.00   64.00     0  |__script.py
08:14:48 PM         -      1563    4.00   59.00    0.00   63.00     1  |__script.py

This tool is great because you can see the core where the thread (or process) is being executed.

Another interesting option is to use the ps command directly with the suitable parameters (-L shows threads and sgi_p the processor on which the process is currently running).

root@ubuntu-server:~# ps -Leo pid,pcpu,sgi_p,command | grep '^ 1561'
 1561  0.0 * /usr/bin/python ./script.py
 1561 65.6 0 /usr/bin/python ./script.py
 1561 65.2 0 /usr/bin/python ./script.py

Also note in the above output that ps, invoked this way, does not print the TID (Thread ID), unlike top and pidstat (remember that a thread does not have its own PID).
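
That said, if you do need a thread identifier from ps, I believe the lwp column can be added to the output format; a minimal sketch along the lines of the previous command would be:

root@ubuntu-server:~# ps -Leo pid,lwp,pcpu,sgi_p,command | grep '^ 1561'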


Oct 15, 2012

Zabbix poller processes more than 75% busy and queue delay (I)

In my previous job, I had to set up a Zabbix infrastructure in order to monitor more than 400 devices, both switches and servers. The main feature of that architecture was that there were a lot of machines, but the update interval was large (around 30 seconds) and the number of items small.

For that purpose, I wrote a couple of articles related to this issue:


But in my current position, I am starting to introduce Zabbix (2.0.3 on Ubuntu Server 12.04) with the aim of monitoring a few devices that require a large number of items and a short monitoring interval. This situation leads to an overload of the Zabbix server: on the one hand, the number of monitored items delayed in the queue increases, and on the other, the poller processes remain busy most of the time.

In addition, I have been able to observe that, from time to time, the agent goes down unexpectedly. If you take a look at the log file on the client (in debug mode), the following error lines are dumped.

root@zabbix-client:~# tail -f /var/log/zabbix/zabbix_agentd.log
...
zabbix_agentd [17271]: [file:'cpustat.c',line:155] lock failed: [22] Invalid argument
 17270:20121015:092010.216 One child process died (PID:17271,exitcode/signal:255). Exiting ...
 17270:20121015:092010.216 zbx_on_exit() called
 17272:20121015:092010.216 Got signal [signal:15(SIGTERM),sender_pid:17270,sender_uid:0,reason:0]. Exiting ...
 17273:20121015:092010.216 Got signal [signal:15(SIGTERM),sender_pid:17270,sender_uid:0,reason:0]. Exiting ...
 17274:20121015:092010.216 Got signal [signal:15(SIGTERM),sender_pid:17270,sender_uid:0,reason:0]. Exiting ...
 17270:20121015:092012.216 Zabbix Agent stopped. Zabbix 2.0.3 (revision 30485).

Below you can observe a figure which shows the Zabbix server performance (queue) for the aforementioned case.




And the other one reflects the Zabbix data gathering process (pay attention to the Zabbix busy poller processes figure, in %).




In the first case, the Zabbix queue has averaged more than 50 delayed monitored items, and in the second one, the poller processes are busy about 100% of the time. This situation can cause Zabbix to sometimes draw sporadic dots rather than continuous lines in the graphs. Another effect of this condition is that, if you set a short update interval for an item, you may find gaps in the data when you check the gathered values later.




I should also say that I followed the tuning guide mentioned before, but as you can see, the Zabbix server was still acting up.


Oct 6, 2012

sysstat vs top vs ps (II)

Following up on the previous article, sysstat vs top vs ps (I), a curious case that I would like to talk about is what happens when more than one core is used. Let's create a simple Python script which runs a couple of slightly CPU-hungry threads.

root@ubuntu-server:~# nproc 
2

root@ubuntu-server:~# cat script.py 
#!/usr/bin/python

import threading, time

def sleep():
    # Busy loop: every call to time.sleep() with such a tiny interval ends up
    # in a system call, so the threads accumulate mostly system CPU time (%sy).
    while True:
        time.sleep(0.000001)

# Two threads, so that both cores of the virtual machine get used.
t1 = threading.Thread(target=sleep)
t2 = threading.Thread(target=sleep)

t1.start()
t2.start()

t1.join()
t2.join()

root@ubuntu-server:~# ./script.py &
[1] 1561

If we now take a look at the status of this process by means of top, we see the following.

root@ubuntu-server:~# top -b -n 1 -p 1561
top - 19:41:19 up 24 min,  1 user,  load average: 0.74, 0.66, 0.48
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
Cpu(s):  5.3%us, 39.7%sy,  0.0%ni, 55.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1024800k total,   205596k used,   819204k free,    23192k buffers
Swap:  1048572k total,        0k used,  1048572k free,    98712k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1561 root      20   0 29280 4180 2416 S  129  0.4  26:23.26 script.py

So what is the first odd thing that you can observe on the previous screen? The script is consuming 129% of the CPU. This is correct because you have to remember that this virtual machine has two cores and the script, or rather its two threads, is using both of them, so that figure is the sum of the CPU utilization of both cores. You can appreciate this situation much better if you execute top with the -H option (show threads).

root@ubuntu-server:~# top -b -H -n 1 -p 1561
top - 19:44:22 up 27 min,  1 user,  load average: 0.69, 0.72, 0.54
Tasks:   3 total,   2 running,   1 sleeping,   0 stopped,   0 zombie
Cpu(s):  4.3%us, 38.9%sy,  0.0%ni, 56.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1024800k total,   205624k used,   819176k free,    23192k buffers
Swap:  1048572k total,        0k used,  1048572k free,    98712k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1563 root      20   0 29280 4180 2416 R   65  0.4  15:10.19 script.py
 1562 root      20   0 29280 4180 2416 R   64  0.4  15:12.39 script.py
 1561 root      20   0 29280 4180 2416 S    0  0.4   0:00.05 script.py