DX Infrastructure Management


Real time UNIX/Linux Disk I/O and Network I/O monitoring

  • 1.  Real time UNIX/Linux Disk I/O and Network I/O monitoring

    Posted 01-23-2012 04:10 AM

    I have an immediate need for real-time UNIX/Linux disk I/O and network I/O monitoring with QoS.

    Disk IO Stats:
    - Average Queued Disk Requests
    - Average disk service time
    - Average disk wait time
    - Blocks Per Second
    - Percentage disk is busy
    - Read Write OPs per second

    Network IO Stats:
    - Network collisions
    - Bytes In
    - Bytes Out
    - Errors In
    - Errors Out

    How can I accomplish this?



  • 2.  Re: Real time UNIX/Linux Disk I/O and Network I/O monitoring

    Posted 05-23-2012 03:30 PM

    You can run a command from the logmon probe, capture the I/O values from its output into variables, and store those variables as QoS.



  • 3.  Re: Real time UNIX/Linux Disk I/O and Network I/O monitoring

    Posted 08-08-2012 01:40 PM


     

    Advanced logmon tutorial using Linux commands and their standard output to generate parsable output. Using regex group variables, QoS data can be generated for monitoring.

     

    As there is currently no Nimsoft probe available for disk I/O QoS monitoring, we will configure the logmon probe to provide this functionality. This guide is intended as a general overview; bear in mind that some of the Linux commands may not be available, or may produce different output, depending on your system. This guide was written using NMS 5.11, logmon version 3.03 and CentOS Linux 5.5.

     

    Getting the Raw Data

     

    First, let's take a look at the Linux command for disk I/O stats:

    # vmstat --help
    usage: vmstat [-V] [-n] [delay [count]]
    -V prints version.
    -n causes the headers not to be reprinted regularly.
    -a print inactive/active page stats.
    -d prints disk statistics
    -D prints disk table
    -p prints disk partition statistics
    -s prints vm table
    -m prints slabinfo
    -S unit size
    delay is the delay between updates in seconds.
    unit size k:1000 K:1024 m:1000000 M:1048576 (default is K)
    count is the number of updates.

     

    # vmstat
    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
    r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
    0  0 253668  21240 104600 116712    0    0     3    46   11   11  1  1 98  0  0

     

    The simple command vmstat provides various information including IO usage:

    Procs
    r: The number of processes waiting for run time.
    b: The number of processes in uninterruptible sleep.

    Memory
    swpd: the amount of virtual memory used.
    free: the amount of idle memory.
    buff: the amount of memory used as buffers.
    cache: the amount of memory used as cache.
    inact: the amount of inactive memory. (-a option)
    active: the amount of active memory. (-a option)

    Swap
    si: Amount of memory swapped in from disk (/s).
    so: Amount of memory swapped to disk (/s).

    IO
    bi: Blocks received from a block device (blocks/s).
    bo: Blocks sent to a block device (blocks/s).

    System
    in: The number of interrupts per second, including the clock.
    cs: The number of context switches per second.

    CPU
    These are percentages of total CPU time.
    us: Time spent running non-kernel code. (user time, including nice time)
    sy: Time spent running kernel code. (system time)
    id: Time spent idle. Prior to Linux 2.5.41, this includes IO-wait time.
    wa: Time spent waiting for IO. Prior to Linux 2.5.41, included in idle.
    st: Time stolen from a virtual machine. Prior to Linux 2.6.11, unknown.
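
    To make the column layout concrete, here is a minimal Python sketch that reads the sample vmstat output shown above and looks up bi and bo by column name. It only illustrates how the fields line up; it is not how logmon parses the output, and the function name parse_vmstat is made up for this example.

```python
# Sample copied from the vmstat output shown earlier in this post.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0 253668  21240 104600 116712    0    0     3    46   11   11  1  1 98  0  0
"""


def parse_vmstat(text):
    """Map each vmstat column name to the integer on the last data line."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    # The column-name row is the one whose first field is "r".
    header = next(row for row in rows if row[0] == "r")
    values = rows[-1]
    return dict(zip(header, (int(v) for v in values)))


stats = parse_vmstat(SAMPLE)
print(stats["bi"], stats["bo"])  # prints: 3 46
```

    Note that bi and bo are the 9th and 10th fields on the data line, which is what the regex later in this guide relies on.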

     

    Logmon Configuration

     

    Deploy the logmon probe from the archive, then right-click it and choose to configure the probe:

     

     

    Create a new profile by right-clicking in the left-hand window and selecting "new". Name this profile Linux Disk IO, or whatever you choose:

    General Tab:

     


     

    Change the mode drop-down box to "command" to issue a bash command to the Linux robot. Type "vmstat" in the command input text box. Select an appropriate "Check Interval"; this is the time between successive runs of the vmstat command, each of which generates a new QoS entry.

    Tick the checkbox "Generate Quality of Service"

    "Generate Alarm" is outside the scope of this document, but it could be used, for example in another profile, to watch for I/O values over a specific threshold. For this tutorial we are only interested in QoS.

     

    Format Rules Tab:

    This can be left blank; ensure nothing is active here.

     

    Watcher Tab:

    This is the core of the configuration: it parses the output of the vmstat command and selects the required variables.

     


     

    Create a new watcher profile by right-clicking in the vertical pane and selecting new. For this example I call the profile "BI and BO Values".

    The match expression text box is where we place the regex that pattern-matches our variables, and this is where things get a little more complex. We will explain the regex syntax in greater detail in a later section.

     

    Next select the variables tab:

     


     

    Right click and select new in the left hand column to create a new variable profile. In this example we will call it "BI and BO Values".

    Ensuring this new profile is highlighted, right-click in the variables window and select new to create a new variable.

     


     

    In the name text box enter "bi". Under "Source FROM position" select Match and ensure the character position is 1. This section is not the most intuitive for a new user of logmon, so I will take the time to explain what is happening here.

    Selecting "Match Expression" uses the regex from the watcher as our variable input. Regex is a very powerful tool for pattern matching, and any given regex may include "variable groups". Let's look at a basic example:

    The\squick\sbrown\s(fox)\sjumps\sover\sthe\slazy\s(dog)

    The above syntax matches the whole expression "The quick brown fox jumps over the lazy dog"; however, the words "fox" and "dog" are in brackets, which creates "groups". Groups are numbered in the order their opening brackets appear, so the first group here is fox and the second is dog. A group could equally contain a complex regex expression if required.

    Taking this understanding of regex groups and applying it to our scenario above, we can use the "Match Expression" and "Character Position" preferences to select one of the regex groups to use as a variable.
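
    If you want to see group numbering in action outside logmon, the fox/dog expression above can be tried in Python. This is only a sketch using Python's re module as a stand-in; logmon's own regex engine may differ in details, but numbered groups work the same way in this simple case.

```python
import re

# Same expression as above. Python does not use the surrounding "/" delimiters.
pattern = re.compile(r"The\squick\sbrown\s(fox)\sjumps\sover\sthe\slazy\s(dog)")

match = pattern.search("The quick brown fox jumps over the lazy dog")

# group(0) is the whole match; group(1) and group(2) are the bracketed parts,
# numbered in the order their opening brackets appear.
first, second = match.group(1), match.group(2)
print(first, second)  # prints: fox dog
```

    In logmon terms, a variable with character position 1 would receive "fox" and a variable with position 2 would receive "dog".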

     

    When we look at the REGEX syntax used for this tutorial in closer detail we will see that the regex group 1 is in fact the value for the Disk IO BI value. This will be used as our QoS data.

     

    Go ahead and repeat the above steps to create another variable for "bo" using character position 2 to take the 2nd regex group as its input.

     

    Next is the QoS tab:

     


     

    In the bottom right-hand corner of the window is the "QoS on Variables" section. Ensure each of our newly created variables is checked. This will generate QoS data based on these variables.

    NOTE: For future reference, only numeric values can be used as QoS data. Logmon variables can also match words and phrases, which is useful for building alarm messages, but no QoS can be generated from anything non-numeric.

     

    All done! Select the apply button and reload the probe. All being well, the logmon probe will start to capture QoS data.

     

    Unfortunately, things are rarely that easy, and first time round you will probably come across a problem or typo somewhere along the line. Under the "settings" section of logmon (the grey cog to the left of the window), set the log level to 3:

     


     

    Right clicking the probe and selecting "View Log" or simply CTRL+V on the probe will show the log window.

    TIP: When diagnosing logmon problems it is a good idea to change the "Check interval" to just a few seconds to speed up diagnostics. Remember to set it back to something practical once complete.

    Let's take a look at a successful pattern match:

    logmon: [Linux Disk IO] NO MATCH [BI and BO Values] offset now 0
    logmon: [Linux Disk IO] FORMAT START [default] - ' 1  0 227820  23316  94108  95284    0    0     3    43    1    2  1  1 98  0  0'
    logmon: [Linux Disk IO] FORMAT LINES [default] - ' 1  0 227820  23316  94108  95284    0    0     3    43    1    2  1  1 98  0  0'
    logmon: (scan) BI and BO Values offset 0
    logmon: [Linux Disk IO] MATCH [BI and BO Values] on line 0
    logmon: SREQUEST: post ->192.168.1.50/48001
    logmon: RREPLY: status=OK(0) <-192.168.1.50/48001  h=37 d=28
    logmon: SREQUEST: _close ->192.168.1.50/48001
    logmon: (LogMon) CiOpen Device Success.
    logmon: (LogMon) CiOpen Device Success.
    logmon: RREPLY: status=OK(0) <-192.168.1.50/48001  h=37 d=28
    logmon: SREQUEST: _close ->192.168.1.50/48001
    logmon: RREPLY: status=OK(0) <-192.168.1.50/48001  h=37 d=28
    logmon: SREQUEST: _close ->192.168.1.50/48001
    logmon: [Linux Disk IO] used 14 ms scanning 241 bytes
    logmon: (scan) - before ptScanClose...
    logmon: (ptScanClose) - closing Linux Disk IO modified
    logmon: (ptScanClose) - before storeInDB
    logmon: (ptScanClose) - after storeInDB
    logmon: (ptScanClose) - leaving
    logmon: (scan) - after ptScanClose...

    Let's not go into too much detail over what is happening here, but it's useful to know what you're looking for when things are working!
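
    If the log gets long, the lines worth looking for are the MATCH and NO MATCH entries for your watcher. As a rough aid, here is a small Python sketch that pulls those entries out of log text. The line format is inferred only from the excerpt above, so treat the pattern as an assumption that may need adjusting for other logmon versions; the function name match_results is made up for this example.

```python
import re

# A few lines copied from the log excerpt above.
LOG = """\
logmon: [Linux Disk IO] NO MATCH [BI and BO Values] offset now 0
logmon: (scan) BI and BO Values offset 0
logmon: [Linux Disk IO] MATCH [BI and BO Values] on line 0
logmon: [Linux Disk IO] used 14 ms scanning 241 bytes
"""

# Assumed layout: "logmon: [profile] MATCH|NO MATCH [watcher] ..."
RESULT_LINE = re.compile(
    r"^logmon: \[(?P<profile>[^\]]+)\] (?P<result>NO MATCH|MATCH) \[(?P<watcher>[^\]]+)\]"
)


def match_results(log_text):
    """Return (profile, watcher, result) for every MATCH / NO MATCH line."""
    results = []
    for line in log_text.splitlines():
        m = RESULT_LINE.match(line)
        if m:
            results.append((m.group("profile"), m.group("watcher"), m.group("result")))
    return results


for entry in match_results(LOG):
    print(entry)
```

    A watcher that only ever shows NO MATCH usually points at a problem in the match expression.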

     

    The full config for this exercise can be found at the bottom of this document, just copy it into the logmon.cfg file

     

    As promised, let's take a look at the actual syntax for the regex "Pattern Match".

    NOTE: I seriously recommend a copy of RegexBuddy or a similar regex highlighting tool if you're going to work through this example.

     

    Using the unix command

    # vmstat

    We can return the current disk I/O usage for a given machine:

    procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
    r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
    2  0 2257564 232040   2880 206620    0    1    14    11    4    9  2 14 84  0

    In this example we will use regex to pull out the bi and bo values for disk I/O and store them in group variables. The two values we are going to capture are 14 (bi) and 11 (bo).

    We use the following syntax to capture these values and place them in individual groups to use as QoS variables.

    /[0-9]{1,}\s{1,}[0-9]{1,}\s{1,}[0-9]{1,}\s{1,}[0-9]{1,}\s{1,}[0-9]{1,}\s{1,}[0-9]{1,}\s{1,}[0-9]{1,}\s{1,}[0-9]{1,}\s{1,}([0-9]{1,})\s{1,}([0-9]{1,})/

    This looks a little confusing and overwhelming for new users of regex, but fear not: it's actually quite simple when broken down.

    Logmon regex is always "wrapped" in an opening and closing "/". This is how we tell Nimsoft NMS that we are using regex and not standard pattern matching.

    If you look closely you will see a repeating pattern in the above syntax:

    [0-9]{1,}\s{1,}

    This breaks our row of numbers into columns:

    2 0 2257564 232040 2880 206620 0 1 14 11 4 9 2 14 84 0

    Let's look a little closer:

    [0-9] - Match any numeric character from 0 to 9

    {1,} - One or more times

    \s - Match a whitespace character

    {1,} - One or more times

     

    Repeat that syntax for each "column"; in this case there are 8 columns before the first value we want to capture for our QoS data:

    [0-9]{1,}\s{1,}[0-9]{1,}\s{1,}[0-9]{1,}\s{1,}[0-9]{1,}\s{1,}[0-9]{1,}\s{1,}[0-9]{1,}\s{1,}[0-9]{1,}\s{1,}[0-9]{1,}\s{1,}

    We then simply place brackets around the next numeric pattern match:

    ([0-9]{1,})

    This gives us our first group for the BI value in the 9th column. Adding:

    \s{1,} - Match whitespace one or more times

    matches the whitespace between the BI and BO columns. We then use a second numeric group to capture the BO value as our second QoS value.

    ([0-9]{1,})\s{1,}([0-9]{1,})

    ^Group 1          ^Group 2

     

    Combining the above gives us a full syntax of:

    /[0-9]{1,}\s{1,}[0-9]{1,}\s{1,}[0-9]{1,}\s{1,}[0-9]{1,}\s{1,}[0-9]{1,}\s{1,}[0-9]{1,}\s{1,}[0-9]{1,}\s{1,}[0-9]{1,}\s{1,}([0-9]{1,})\s{1,}([0-9]{1,})/

     

    Logmon CFG

    Below is a snippet from the logmon CFG file. This should be quite portable so feel free to insert into your existing logmon.cfg file.

    <Linux Disk IO>
    active = yes
    interval = 5 min
    scanfile = vmstat
    scanmode = command
    alarm = no
    qos = yes
    message =
    subject =
    max_alarms =
    max_alarm_msg =
    password =
    <watchers>
    <BI and BO Values>
    active = yes
    match = /[0-9]{1,}\s{1,}[0-9]{1,}\s{1,}[0-9]{1,}\s{1,}[0-9]{1,}\s{1,}[0-9]{1,}\s{1,}[0-9]{1,}\s{1,}[0-9]{1,}\s{1,}[0-9]{1,}\s{1,}([0-9]{1,})\s{1,}([0-9]{1,})/
    level = information
    subsystemid =
    message =
    i18n_token =
    restrict =
    expect = no
    abort = no
    sendclear = no
    count = no
    separator =
    suppid =
    source =
    qos =
    runcommandonmatch = no
    commandexecutable =
    commandarguments =
    expect_message =
    expect_level =
    <variables>
    <bi>
    definition = $1
    operator =
    threshold =
    qosactive = yes
    qosname = <Default>
    qostarget =
    </bi>
    <bo>
    definition = $2
    operator =
    threshold =
    qosactive = yes
    qosname = <Default>
    qostarget =
    </bo>
    </variables>
    </BI and BO Values>
    </watchers>
    </Linux Disk IO>



  • 4.  Re: Real time UNIX/Linux Disk I/O and Network I/O monitoring

    Posted 10-15-2012 09:37 PM

    Thank you for the detailed explanation.  This will be very useful.  However, from the documentation on vmstat, the first line is always the average value since boot, and every subsequent sampling provides instantaneous values.  So I don't think reporting on the first value will tell the true story of the server performance.

     

    I have tried setting this up using 'vmstat 1 2' in order to get a second line and then in my variable definitions I specified source line 4 in order to have it ignore the first line of averages.  However, when I look at my QOS, it appears it is capturing both lines.  Do you know what it might take to get logmon to ignore the first line of output?  I suspect it has to do with the 1 second delay in producing the next line, but I'm not sure.

     

    Thanks!



  • 5.  Re: Real time UNIX/Linux Disk I/O and Network I/O monitoring

    Posted 10-16-2012 01:43 AM

    I think I have answered my own question by modifying the vmstat command used in the logmon probe.  Rather than using just simply vmstat 1 2 as the command, I am using vmstat 1 2 | sed '3d' to simply delete the 3rd line of output.
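
    For anyone wondering why this works, here is a rough Python simulation. It assumes the watcher expression is tried against every line of the command output, which would be consistent with both data rows showing up in QoS as described in the previous post. The two data rows below are reused from numbers quoted earlier in this thread purely as stand-ins; they are not a real paired `vmstat 1 2` sample, and the helper names are made up for illustration.

```python
import re

# Same expression as the watcher, written with a repeat count for brevity.
PATTERN = re.compile(r"(?:[0-9]{1,}\s{1,}){8}([0-9]{1,})\s{1,}([0-9]{1,})")

# Stand-in for `vmstat 1 2` output: two header lines, the since-boot
# averages (line 3), then the one-second sample (line 4).
VMSTAT_1_2 = """\
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0 253668  21240 104600 116712    0    0     3    46   11   11  1  1 98  0  0
 1  0 227820  23316  94108  95284    0    0     3    43    1    2  1  1 98  0  0
"""


def delete_line(text, n):
    """Mimic sed 'Nd': drop the n-th line (1-based) and keep the rest."""
    lines = text.splitlines()
    return "\n".join(line for i, line in enumerate(lines, start=1) if i != n)


def bi_bo_rows(text):
    """Return the (bi, bo) pair from every line the expression matches."""
    rows = []
    for line in text.splitlines():
        m = PATTERN.search(line)
        if m:
            rows.append(m.groups())
    return rows


before = bi_bo_rows(VMSTAT_1_2)
after = bi_bo_rows(delete_line(VMSTAT_1_2, 3))
print(before)  # prints: [('3', '46'), ('3', '43')]
print(after)   # prints: [('3', '43')]
```

    With line 3 removed, only the one-second sample is left for the watcher to match.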



  • 6.  Re: Real time UNIX/Linux Disk I/O and Network I/O monitoring

    Posted 02-15-2017 01:32 PM

    Has this been added as a feature in the product?