Monitor CPU, MEM and I/O usage
While working with your database, you might notice a DB2 process consuming an unusually high amount of CPU time. This section describes some AIX utilities and commands that you can use either to analyze the issue yourself or to gather data before submitting a PMR to IBM Technical Support:
ps
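The ps command reports the current status of active processes. You can use ps auxw | sort -r +3 | head -10 to list the heaviest processes on the system. Here "-r" reverses the sort order, and "+3" is the older positional key syntax: it skips the first three fields, so the sort key starts at the fourth column (%MEM) and continues to the end of the line. To rank strictly by CPU, sort numerically on the third column (%CPU) instead, for example ps auxw | sort -rn +2 | head -10. Listing 1 shows sample ps output: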
Listing 1. Sample ps output
root@vrcp $ ps auxw|sort -r +3|head -10
Output:
USER PID %CPU %MEM SZ RSS TTY STAT STIME TIME COMMAND
db2db2 5391608 0.0 0.0 9900 9500 - A Apr 23 1:32 db2agent (db4) 0
db2db2 4395022 3.0 0.0 9968 9572 - A 07:24:57 5:29 db2agent (db3) 0
db2db2 1495122 2.1 0.0 9964 9576 - A 08:30:02 5:11 db2agent (db4) 0
db2db2 1429562 1.7 0.0 9916 2260 - A 03:47:14 0:00 db2agent (db4) 0
db2db2 3391608 0.0 0.0 9900 9500 - A Apr 23 1:32 db2agent (db4) 0
db2db1 3305562 0.0 0.0 9848 9468 - A Apr 23 9:16 db2agent (db1) 0
db2db1 4260074 0.0 0.0 9840 2188 - A 00:13:28 0:00 db2agent (db2) 0
db2db2 4169808 0.0 0.0 9832 2184 - A 03:47:18 0:00 db2agent (db3) 0
db2db1 3457072 0.0 0.0 9820 9432 - A Apr 23 6:12 db2agent (db1) 0
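In the ps auxw layout shown in Listing 1, %CPU is the third column, so a strict CPU ranking needs a numeric sort on that column only. The following is a minimal sketch, assuming ps auxw-style text on standard input; the function name top_cpu is hypothetical, and it uses the POSIX -k key syntax rather than the older +N form.

```shell
# top_cpu (hypothetical helper): read ps auxw-style text on stdin and print
# the header line followed by the N rows with the highest %CPU (column 3).
top_cpu() {
    n="${1:-10}"
    {
        # Pass the header line through unchanged.
        IFS= read -r header && printf '%s\n' "$header"
        # Sort the remaining rows numerically, descending, on field 3 only.
        LC_ALL=C sort -k3,3nr | head -n "$n"
    }
}
```

A typical call would be ps auxw | top_cpu 10.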
topas
The ps -ef command shows the CPU usage of individual processes. You can also use the topas command to get further details. Like ps, topas retrieves selected statistics about activity on the local system. Listing 2 is sample topas output that shows a DB2 process consuming 33.3% CPU. The topas output gives you specific information such as the process ID, the CPU usage, and the instance owner who started the process. It is normal to see several db2sysc processes for a single instance owner. DB2 processes are displayed under different names depending on the utility used to list process information, which is why the ps listing shows db2agent while topas shows db2sysc.
Listing 2. Sample topas output
root@vrcp $ topas
Topas Monitor for host: serverZ
Fri Apr 25 06:51:52 2008 Interval: 2
Kernel 11.9 |#### |
User 72.5 |##################### |
Wait 1.4 |# |
Idle 14.2 |##### |
Physc = 3.39 %Entc= 94.3
Name PID CPU% PgSp Owner
db2sysc 105428 33.3 11.7 udbtest
db2sysc 38994 14.0 11.9 udbtest
test 14480 1.4 0.0 root
db2sysc 36348 0.8 1.6 udbtest
db2sysc 116978 0.5 1.6 udbtest
db2sysc 120548 0.5 1.5 udbtest
sharon 30318 0.3 0.5 root
lrud 9030 0.3 0.0 root
db2sysc 130252 0.3 1.6 udbtest
db2sysc 130936 0.3 1.6 udbtest
topas 120598 0.3 3.0 udbtest
db2sysc 62248 0.2 1.6 udbtest
db2sysc 83970 0.2 1.6 udbtest
db2sysc 113870 0.2 1.7 root
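Because one instance owner usually runs several db2sysc processes, it can help to total the CPU% column per owner. This is a rough sketch, assuming the topas screen has been saved as plain text in the five-column layout shown above (Name, PID, CPU%, PgSp, Owner); cpu_by_owner is a hypothetical name, not a topas feature.

```shell
# cpu_by_owner (hypothetical helper): sum the CPU% column per owner from
# saved topas process rows of the form "Name PID CPU% PgSp Owner".
cpu_by_owner() {
    awk '
        # Process rows have exactly five fields and a numeric PID in field 2;
        # banner, header and utilization-bar lines fail this test.
        NF == 5 && $2 ~ /^[0-9]+$/ { total[$5] += $3 }
        END { for (owner in total) printf "%s %.1f\n", owner, total[owner] }
    ' | LC_ALL=C sort -k2,2nr
}
```

The output has one line per owner, highest total first, so a DB2 instance that dominates the CPU stands out even when no single process does.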
vmstat
The vmstat command can be used to monitor CPU utilization; it reports user CPU usage as well as system CPU usage. Listing 3 shows the output from a vmstat command that reports system utilization every 5 seconds:
root@vrcp $ vmstat 5
kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
32 3 1673185 44373 0 0 0 0 0 0 4009 60051 9744 62 38 0 0
24 0 1673442 44296 0 0 0 0 0 0 4237 63775 9214 67 33 0 0
30 3 1678417 39478 0 0 0 0 0 0 3955 70833 8457 69 31 0 0
33 1 1677126 40816 0 0 0 0 0 0 4101 68745 8336 68 31 0 0
28 0 1678606 39183 0 0 0 0 0 0 4525 75183 8708 63 37 0 0
35 1 1676959 40793 0 0 0 0 0 0 4085 70195 9271 72 28 0 0
23 0 1671318 46504 0 0 0 0 0 0 4780 68416 9360 64 36 0 0
30 0 1677740 40178 0 0 0 0 0 0 4326 58747 9201 66 34 0 0
30 1 1683402 34425 0 0 0 0 0 0 4419 76528 10042 60 40 0 0
In the listing above, the system is averaging about 65% user CPU usage and 35% system CPU usage. The pi and po values are equal to 0, so there are no paging issues. The wa column is also 0, so there do not seem to be any I/O issues.
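When reading vmstat output such as the above, you can ignore the first line, which reports statistics since system startup rather than activity during the sampling interval. The important columns to look at are us, sy, id and wa.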
- id: Time spent idle.
- wa: Time spent waiting for I/O.
- us: Time spent running non-kernel code. (user time)
- sy: Time spent running kernel code. (system time)
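The averages of these four columns can be computed directly from captured vmstat output. The sketch below assumes the 17-column layout shown in the listings (us, sy, id and wa as fields 14 to 17); systems that print extra columns would need different field numbers. The function name vmstat_avg is hypothetical, and the first data row is skipped, following the advice above.

```shell
# vmstat_avg (hypothetical helper): average the us, sy, id and wa columns
# of vmstat output read from stdin, ignoring the first report.
vmstat_avg() {
    awk '
        # Data rows have 17 fields and start with a number; headers do not.
        NF == 17 && $1 ~ /^[0-9]+$/ {
            if (++seen == 1) next   # skip the first report
            us += $14; sy += $15; id += $16; wa += $17; n++
        }
        END {
            if (n) printf "us=%.1f sy=%.1f id=%.1f wa=%.1f\n", us/n, sy/n, id/n, wa/n
        }
    '
}
```

For example, vmstat 5 10 | vmstat_avg summarizes ten 5-second samples in one line.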
The listing below shows wa (time spent waiting on I/O) to be unusually high. This indicates that there might be I/O bottlenecks on the system, which in turn cause the CPU to be used inefficiently. You can check the errpt -a output to see whether any issues have been reported with the media or I/O on the system.
Sample vmstat output showing I/O issues
root@vrcp $ vmstat 5
kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
2 8 495803 3344 0 0 0 929 1689 0 998 6066 1832 4 3 76 16
0 30 495807 3340 0 0 0 0 0 0 1093 4697 1326 0 2 0 98
0 30 495807 3340 0 0 0 0 0 0 1055 2291 1289 0 1 0 99
0 30 495807 3676 0 2 0 376 656 0 1128 6803 2210 1 2 0 97
0 29 495807 3292 0 1 3 2266 3219 0 1921 8089 2528 14 4 0 82
1 29 495810 3226 0 1 0 5427 7572 0 3175 16788 4257 37 11 0 52
4 24 495810 3247 0 3 0 6830 10018 0 2483 10691 2498 40 7 0 53
4 25 495810 3247 0 0 0 3969 6752 0 1900 14037 1960 33 5 1 61
2 26 495810 3262 0 2 0 5558 9587 0 2162 10629 2695 50 8 0 42
3 22 495810 3245 0 1 0 4084 7547 0 1894 10866 1970 53 17 0 30
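One way to quantify "unusually high" is to count how many sampling intervals exceed a chosen wa threshold. This is a minimal sketch under the same 17-column assumption as the listing above; high_wa is a hypothetical name, and the default threshold of 50 is an arbitrary illustration rather than a documented limit.

```shell
# high_wa (hypothetical helper): count the vmstat intervals whose wa column
# (field 17) is at or above a threshold given as the first argument.
high_wa() {
    awk -v limit="${1:-50}" '
        # Data rows have 17 fields and start with a number; headers do not.
        NF == 17 && $1 ~ /^[0-9]+$/ {
            total++
            if ($17 + 0 >= limit + 0) high++
        }
        END { printf "%d of %d intervals with wa >= %d\n", high, total, limit }
    '
}
```

Applied to the ten rows of the listing above with a threshold of 50, it reports 7 of 10 intervals.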
iostat
The iostat command quickly tells you whether your system has a disk I/O-bound performance problem. Below is an example of iostat output:
Sample iostat output
root@vrcp $ iostat
System configuration: lcpu=4 disk=331
tty: tin tout avg-cpu: % user % sys % idle % iowait
0.0 724.0 17.9 12.3 0.0 69.7
Disks: % tm_act Kbps tps Kb_read Kb_wrtn
hdisk119 100.0 5159.2 394.4 1560 24236
hdisk115 100.0 5129.6 393.0 1656 23992
hdiskpower26 100.0 10288.8 790.8 3216 48228
- %tm_act : The percentage of time that the physical disk was active, that is, busy servicing requests.
- Kbps : The amount of data transferred to or from the drive, in kilobytes per second.
- tps : The number of transfers per second issued to the physical disk.
- Kb_read : The total data (in kilobytes) read from the physical volume during the measured interval.
- Kb_wrtn : The total data (in kilobytes) written to the physical volume during the measured interval.
To check whether you are experiencing resource contention, focus on the %tm_act value in the output above. An increase in this value, especially above 40%, implies that processes are waiting for I/O to complete and that you have an I/O issue on your hands. Checking which disks show the highest activity percentage, and whether DB2 uses those disks, gives you a better idea of whether the two are related.
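The 40% rule of thumb can be applied mechanically to captured iostat output. The sketch below assumes the disk table layout shown in the sample (a "Disks:" header followed by rows whose second field is %tm_act); busy_disks is a hypothetical name, and the threshold is passed as an optional argument.

```shell
# busy_disks (hypothetical helper): print the name and %tm_act of every
# disk whose %tm_act exceeds a limit (default 40) in iostat output on stdin.
busy_disks() {
    awk -v limit="${1:-40}" '
        # Rows before the "Disks:" header (tty and CPU summary) are ignored.
        $1 == "Disks:" { in_disks = 1; next }
        in_disks && NF >= 6 && $2 + 0 > limit + 0 { print $1, $2 }
    '
}
```

For example, iostat | busy_disks 40 lists the disks worth cross-checking against the containers and log paths that DB2 uses.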
What to collect
You should collect the following information before opening a PMR with IBM Technical Support:
- db2support.zip
- Output of truss -f -o truss.out -p <pid> for the high-CPU process
- Output of db2pd -stack <pid> for the high-CPU process
Technical Support might also send you the db2service.perf1 script, which collects data repeatedly over a period of time. Bundle the output of the script and send it back to the support team for further analysis.
Source: IBM developerWorks