Load Average high, but CPU utilization is low...?

By Doyle_Reece posted May 09, 2016 12:57 PM

  

So, I had this Pre-Prod box with a high Load Average, which didn't make sense to me, as this box shouldn't be taking any traffic. What made things more confusing was that the CPU didn't appear to be used much at all either. Based on my understanding of the metric, the two usually go hand in hand.

 

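After a little 'Googling', I discovered that Load Average on Linux also takes disk I/O into account: it counts processes blocked in uninterruptible sleep (typically waiting on disk) as well as those running or waiting for CPU.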
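
One way to sanity-check this on a live box is to count the processes sitting in uninterruptible sleep (state D in ps), since those add to Load Average without using any CPU. A rough sketch, not something from the original troubleshooting session; the helper name count_d_state is made up for illustration, and it assumes a ps that accepts the -o state= and -o comm= output options:

```shell
# count_d_state: hypothetical helper. Reads "STATE COMMAND" lines on stdin
# and prints how many of them are in uninterruptible sleep (state starts with D).
count_d_state() {
  awk '$1 ~ /^D/ { n++ } END { print n + 0 }'
}

# Typical use on the box itself (the trailing "=" suppresses the ps header):
#   cat /proc/loadavg
#   ps -e -o state= -o comm= | count_d_state
```

If the count stays well above zero while the CPU is mostly idle, I/O wait is a likely suspect for the high Load Average.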

 

Hmmm, how do I figure out if I'm running high amounts of I/O? That led me back to Mr. Google, who showed me how to interpret the output of the vmstat command.

 

I came across this article, which pointed me to the 'bo' metric of the vmstat output: Linux Performance Measurements using vmstat - Thomas-Krenn-Wiki

 

'bo' stands for 'blocks out', meaning blocks written out to a block device per second (blocks written to the filesystem is how I interpreted it).

 

My server's 'bo' metric was around 8-9k. That seemed high, but I didn't have anything to base that on, so I took a quick glance at a prod server, which confirmed that the value on 'Pre-Prod' was abnormal.
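
For comparing two boxes, it can help to boil a few vmstat samples down to one number. Below is a minimal sketch, assuming the usual vmstat layout where the second header row starts with "r" and contains a "bo" column; the helper name avg_bo is hypothetical. It skips the first sample, because vmstat's first row reports averages since boot rather than current activity:

```shell
# avg_bo: hypothetical helper. Reads vmstat output on stdin and prints the
# average of the "bo" column, ignoring the first (since-boot) sample.
avg_bo() {
  awk '
    # Header row: remember which column is labelled "bo".
    $1 == "r" { for (i = 1; i <= NF; i++) if ($i == "bo") col = i; next }
    # Data rows: skip the first one, accumulate the rest.
    col && $1 ~ /^[0-9]+$/ { if (seen++) { sum += $col; n++ } }
    END { if (n) printf "%d\n", sum / n; else print "no samples" }
  '
}

# Typical use: six samples one second apart, five of which are averaged.
#   vmstat 1 6 | avg_bo
```

Running the same thing on a known-good server gives a baseline to compare against.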

 

So now it appears I have something writing a lot of stuff to the filesystem, and I need to track it down...

 

Back to my friend Google on the topic, and I came across an article that said a full disk can affect this 'bo' value.

 

"Interesting," I thought, "that would be a quick fix," so I ran my usual 'df -h' command and didn't see anything obvious... hmmm

 

In the past, inodes have bitten me, so I ran the command to check inode utilization ('df -i') and found a breadcrumb...

 

Inode usage on /var was at 100%...
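
Since 'df -h' looks perfectly healthy in this situation, a small filter over the inode report makes the problem harder to miss. This is only a sketch: the helper name inodes_over and the default 90% threshold are my own assumptions, and it expects the six-column layout printed by 'df -iP' (mount points containing spaces would confuse it):

```shell
# inodes_over: hypothetical helper. Reads `df -iP` output on stdin and prints
# "mountpoint IUse%" for every filesystem at or above the given percentage.
inodes_over() {
  limit="${1:-90}"
  awk -v limit="$limit" '
    NR > 1 {
      use = $5
      sub(/%$/, "", use)   # "100%" -> "100"; a "-" entry stays non-numeric
      if (use ~ /^[0-9]+$/ && use + 0 >= limit) print $6, $5
    }'
}

# Typical use:
#   df -iP | inodes_over 90
```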

 

Inode usage is tied to the number of files, so I found this cool command to help me track down which directory in /var has a lot of files (run it from inside /var)...

for i in *; do echo -e "$(find "$i" | wc -l)\t$i"; done | sort -n

 

Eventually, I found that the mail drop had almost 20,000 files in it. That is a lot of files...

 

What the heck is attempting to send all this mail?

 

I dug into one of the files and found it was related to one of the cron jobs failing on the Gateway; when it fails, it tries to send mail to tell us it isn't happy...

 

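So, the resolution was to append '2>&1' to the end of the cron job command, so that stderr goes wherever stdout is already being sent and, in the event of a failure, the job isn't supposed to shoot out an angry email... Note that this relies on the job's stdout already being redirected (to a log file or /dev/null); otherwise cron would still mail the output.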
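
Cron mails the job owner whatever the job writes to stdout or stderr, so the order of the redirections matters. Here is a small sketch of the difference, using a made-up noisy_job function as a stand-in for the failing Gateway job (the real job and its crontab entry are not shown in this post, so every name here is hypothetical):

```shell
# noisy_job: hypothetical stand-in for a failing cron job.
# It writes one line to stdout and one line to stderr.
noisy_job() {
  echo "normal output"
  echo "something failed" >&2
}

# what_cron_sees: runs a command and collects everything it emits on
# stdout and stderr, which is what cron would put in the email.
what_cron_sees() { "$@" 2>&1; }

silenced()     { noisy_job >/dev/null 2>&1; }  # stdout discarded, stderr follows it
stderr_leaks() { noisy_job 2>&1 >/dev/null; }  # reversed order: stderr still escapes
only_merge()   { noisy_job 2>&1; }             # nothing discarded: both lines escape

# Try it:
#   what_cron_sees silenced       (prints nothing, so no mail)
#   what_cron_sees stderr_leaks   (prints the error line)
#   what_cron_sees only_merge     (prints both lines)
```

In a crontab entry, the same idea would be the job's command followed by '>/dev/null 2>&1', or by '>> some-log-file 2>&1' if the output is worth keeping.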

 

Once I did this and removed all the files from the mail drop folder using the following command (/bin/rm didn't like the number of arguments), the box appeared to start behaving.

find . -type f -print0 | xargs -0 rm
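
The reason a plain 'rm *' chokes here is that the shell expands the wildcard into one giant argument list, which exceeds what the kernel accepts for a single command. Piping names from find into xargs gets around that by running rm in batches. A slightly more defensive sketch of the same idea, assuming GNU find and xargs; the helper name purge_files is hypothetical:

```shell
# purge_files: hypothetical helper. Removes the regular files directly inside
# the given directory without building one huge argument list.
purge_files() {
  dir="$1"
  # -print0 / -0 keep unusual file names intact; xargs splits the list into
  # batches that fit the kernel's argument-size limit. -maxdepth 1 leaves
  # subdirectories and their contents alone.
  find "$dir" -maxdepth 1 -type f -print0 | xargs -0 rm -f --
}

# Typical use (double-check the path before running this for real):
#   purge_files /path/to/mail/drop/folder
```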

 

'bo' dropped to less than 100, the 'df -i' output for /var dropped from 100% to 3%, and Load Average dropped significantly.

 

Yayyy!!
