DX Application Performance Management

  • 1.  Packets TIM

    Posted Jan 19, 2020 12:14 AM
    Hi Teams,
    
    Is there any documentation that explains the meaning of the different TIM Packet Statistics?
    
    It is not very clear when forwarded, analyzed, and captured packets are shown.
    
    I would like to understand what each packet counter means.

    Thanks
    Richard B


  • 2.  RE: Packets TIM

    Broadcom Employee
    Posted Jan 20, 2020 04:04 AM
    Hia,

    There are no docs explaining exactly this, but from experience, here is what it means.
    What you need to understand is that TIM 9.6+ uses a third-party app, apm-packet, to collect the data from the network sources (a network capture board such as Napatech, or a network interface). apm-packet copies the captured traffic to a dedicated directory per TIM worker (usually on a tmpfs/RAM filesystem of 4 GBytes in size), and each TIM worker in turn looks for the pcaps assigned to it. Once a TIM worker has retrieved a pcap to work on, it deletes it from the file queue on the tmpfs.
    That is a short overview of the process.
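
    A minimal sketch of that hand-off, to make the file-queue idea concrete. This is not TIM code: the function name, the *.pcap file naming, and the temporary directory are all assumptions made for illustration. It only shows a worker retrieving capture files from its own queue directory and deleting each one once it has been picked up.

```python
import tempfile
from pathlib import Path


def drain_queue(queue_dir, analyze):
    """Process every queued capture file in name order, deleting each after retrieval."""
    processed = 0
    for pcap in sorted(Path(queue_dir).glob("*.pcap")):  # hypothetical file naming
        data = pcap.read_bytes()  # the worker retrieves the capture
        pcap.unlink()             # and removes it from the file queue
        analyze(data)
        processed += 1
    return processed


# Stand-in for a per-worker queue directory (a tmpfs mount in a real setup).
with tempfile.TemporaryDirectory() as queue:
    for n in range(3):
        Path(queue, f"{n:04d}.pcap").write_bytes(b"x" * (n + 1))
    sizes = []
    count = drain_queue(queue, lambda data: sizes.append(len(data)))
    print(count, sizes)
```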

    Now, to the actual explanation of the content:
    1. Captured: The number of packets the apm-packet process has captured from the network source.
    2. Dropped: The number of packets that have been dropped from the network source due to overload, missing space, etc. If a drop happens, something is wrong and a health-check should be done.
    3. Forwarded: The number of packets that have been forwarded from apm-packet to the TIM workers.
    4. No-Space: The TIM workers (the most common cause) cannot process (actually analyze) the amount of data provided by apm-packet. There are various reasons this could happen. In that case, a health-check should be done.
    5. Too Short: Don't know that one, sorry. Maybe @Hallett German remembers what it was...
    6. Too Old: The data in the tmpfs filesystem is too old and is discarded by the TIM workers. This usually happens when the TIM process was stopped for a while without stopping apmpacket. This should not happen either. It also causes the No-Space counter to rise.
    7. Analyzed: The number of data packets that have been analyzed by the TIM workers (actually retrieved by the TIM workers).
    8. Bytes Analyzed: The number of bytes the TIMs have actually analyzed. Note that the size of a packet can range from a few bytes up to about 9 KBytes (for Jumbo Frames) per packet.
    9. Throughput in Kilo Bytes per second: The computed throughput of data analyzed by the TIM.
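
    As a rough illustration of the "should stay at zero" counters in the list above, here is a small Python sketch. The counter names are borrowed from the protocol statistics file format described later in this thread, and mapping them onto the web page labels is an assumption. The function name and the sample numbers are made up for the example.

```python
def check_packet_counters(counters):
    """Return a warning for each counter that should normally stay at zero."""
    # Assumption: pkts-drop, pkts-nospace and pkts-tooold correspond to the
    # Dropped, No-Space and Too Old values on the statistics page.
    suspicious = {
        "pkts-drop": "packets dropped at the network source",
        "pkts-nospace": "TIM workers cannot keep up with apm-packet",
        "pkts-tooold": "queued data was too old and was discarded",
    }
    warnings = []
    for name, meaning in suspicious.items():
        value = int(counters.get(name, 0))
        if value > 0:
            warnings.append(f"{name}={value}: {meaning}")
    return warnings


# Made-up sample values, not taken from a real TIM.
sample = {"pkts-capture": 1000, "pkts-drop": 12, "pkts-nospace": 0, "pkts-tooold": 3}
for line in check_packet_counters(sample):
    print(line)
```

    Any non-empty result would be a reason to run the health-check rather than a diagnosis in itself.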

    BONUS: If you want to have some more details, try adding a: "&unsupported=1" at the end of the URL of that page in the browser. You'll have the full per TIM Worker details displayed.

    Note: A TIM health-check can be sped up by using the cem-healthcheck scripts: https://github.com/CA-APM/cem-healthcheck-scripts





  • 3.  RE: Packets TIM
    Best Answer

    Broadcom Employee
    Posted Jan 20, 2020 07:52 AM
    This is the first time I have sent a post on the community in months. There is an internal document (was for MTP) that was created explaining this. I see no reason why it cannot be shared. 

    Tim protocol statistics files

     

    This describes the format of the Tim packet statistics files stored in the /etc/wily/cem/tim/logs/protocolstats and  /etc/wily/cem/tim/logs/protocolstats1 directories.  These files are written by the tim program and read by the viewstats web page.

     

    The files are in csv (comma-separated value) format.  Each line contains one protocol statistics entry.  There are no header or trailer lines.

     

    Each line consists of a number of fields.  Fields can be scalars or one-dimensional arrays.  A scalar field consists of two comma-separated values: the name and the value.  An array field consists of a number of sets of 3 comma-separated values; in each set, the first value is the field name, the second is the index (starting at 0), and the third is the value for that index.  The currently defined arrays have names starting with "cpu-" and "w-".  Numbers are in decimal.

     

    A partial example is:

      ver,1,fmt,m,cpu-use,0,70%,cpu-use,1,85%

    The first field consists of the name "ver" (version) and the value "1".  The second field consists of the name "fmt" (format) and the value "m" (multi-process).  The third and fourth fields form the "cpu-use" array, where  cpu-use[0] is 70% and cpu-use[1] is 85%.
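
    A minimal parsing sketch for the line format described above, written in Python. It relies only on what the description states: scalar fields are name,value pairs, and array fields are name,index,value triples whose names start with "cpu-" or "w-". The function name is hypothetical, and the sketch assumes that no value contains a comma, which the description does not confirm (the "time" field in particular might need special handling).

```python
ARRAY_PREFIXES = ("cpu-", "w-")  # array field prefixes named in the description


def parse_stats_line(line):
    """Split one protocol statistics line into scalar and array fields."""
    tokens = line.strip().split(",")
    scalars, arrays = {}, {}
    i = 0
    while i < len(tokens):
        name = tokens[i]
        if name.startswith(ARRAY_PREFIXES):
            # Array field: name, index, value
            if i + 2 >= len(tokens):
                raise ValueError(f"incomplete array field {name!r}")
            index = int(tokens[i + 1])
            arrays.setdefault(name, {})[index] = tokens[i + 2]
            i += 3
        else:
            # Scalar field: name, value
            if i + 1 >= len(tokens):
                raise ValueError(f"incomplete scalar field {name!r}")
            scalars[name] = tokens[i + 1]
            i += 2
    return scalars, arrays


scalars, arrays = parse_stats_line("ver,1,fmt,m,cpu-use,0,70%,cpu-use,1,85%")
print(scalars)
print(arrays)
```

    Values are kept as strings so that entries such as "70%" survive unchanged. Converting them to numbers is left to the caller.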

     

    Each line can have either of two formats: one for a single-process Tim and one for a multi-process Tim.  The latter includes arrays that are indexed by the worker number.  If Tim is restarted in a different mode, a single file can contain lines in both formats.  The "fmt" field indicates which.

     

    The following tables list the fields.  A value in the "index" column indicates that the field is an array, in which case the index is a decimal number starting at 0.

     

    The following fields are common to a single-process and multi-process Tim.  The ones that mention MTP are present only when running on an MTP machine.

     

    Name            Index        Value
    ver                          Version number of this line; always 1
    fmt                          m for multi-process or s for single-process
    time                         Date and time of this entry
    pkts-capture                 Number of packets captured
    pkts-drop                    Number of packets dropped
    pkts-forward                 Packets forwarded from MTP
    pkts-nospace                 Packets dropped by MTP because of no space to write them
    pkts-short                   Short packets received by MTP, possibly because the user defined hardware filters incorrectly
    pkts-tooold                  Old packets not processed. Their timestamp is older than (now - TimMtpLimitPeriodInMinute * 60). Default value of TimMtpLimitPeriodInMinute is 15
    pkts-analyze                 Number of packets analyzed
    bytes-analyze                Number of bytes analyzed
    thruput                      Throughput
    stats                        Number of statistics records open
    cpu-use         CPU number   Time used by this CPU

     

    The following fields are present for a multi-process Tim:

     

    Name      Index           Value
    hub-cpu                   Percentage of CPU time used by the hub
    hub-mem                   Memory used by the hub
    w-cpu     Worker number   Percentage of CPU time used by this worker
    w-mem     Worker number   Memory used by this worker
    w-conn    Worker number   TCP connections managed by this worker
    w-ts      Worker number   Transets open for this worker
    w-tu      Worker number   Transets open for this worker
    w-tc      Worker number   Transets open for this worker
    w-ssl     Worker number   SSL sessions open for this worker
    w-lgn     Worker number   Logins managed by this worker

     

    Records for a single-process Tim have the following fields:

     

    Name      Value
    tim-cpu   Percentage of CPU time used by Tim
    tim-mem   Memory used by Tim
    conn      TCP connections managed by Tim
    ts        Transets open
    tu        Transets open
    tc        Transets open
    ssl       SSL sessions open
    lgn       Logins managed

     




  • 4.  RE: Packets TIM

    Posted Jan 28, 2020 02:55 PM
    Hi Jörg,
    Thanks for your reply. Is there a way to resolve the problem of packets being too old? Currently almost all packets end up as too old, and I suspect this started after the services were restarted incorrectly.

    Richard



  • 5.  RE: Packets TIM

    Broadcom Employee
    Posted Jan 29, 2020 04:09 AM
    Hi Richard,
    There can be many reasons for this to happen.
    If you could provide me with the health-check logs for TIM and SYS and the TIMPERF performance data archive, I may be able to point directly to what is going wrong. Without these logs, it is just a wild guess. Check the README of the cem-healthcheck scripts for instructions: https://github.com/CA-APM/cem-healthcheck-scripts


  • 6.  RE: Packets TIM

    Broadcom Employee
    Posted Jan 29, 2020 12:30 PM
    Edited by Jörg Mertin Jan 29, 2020 12:32 PM
    Thx for the logs.
    The TIM workers do barely a thing:

      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
     6530 root      20   0 4793m 290m 148m S 32.7  1.9  54:24.97 apmpacket          
     6460 root      20   0 3054m 7676 2724 S  5.8  0.0   3:27.06 tim                
     7556 root      20   0 6658m  40m 3324 S  3.8  0.3 356:25.07 timworker          
      229 root      20   0     0    0    0 S  1.9  0.0  11:06.80 kswapd0            
     2249 root      20   0     0    0    0 S  1.9  0.0   0:04.31 jbd2/dm-5-8        
     7541 root      20   0 6658m  41m 3356 S  1.9  0.3 356:54.60 timworker          
     7544 root      20   0 6658m  43m 3460 S  1.9  0.3 356:47.38 timworker          
     7547 root      20   0 6658m  30m 3324 S  1.9  0.2 356:24.39 timworker          
     7562 root      20   0 6658m  38m 3368 S  1.9  0.2 357:03.38 timworker          
     7568 root      20   0 6658m  45m 3464 S  1.9  0.3 357:09.25 timworker          
     7571 root      20   0 6658m  43m 3464 S  1.9  0.3 356:57.26 timworker          
     7577 root      20   0 6658m  45m 3584 S  1.9  0.3 356:47.41 timworker
    They are not loaded at all, which means they are not doing much.
    Also, it seems the TIM processes are constantly being restarted.
    The TIM memory utilization is a good indicator of a well-functioning TIM. If the memory utilization graph drops sharply from top to bottom, the TIM has been restarted. Make sure this does not happen. If it does, check why the TIM crashes!

    Some packets are also already dropped before the TIM processes these:
    === Dropped packets !  ========================================================
    Wed Jan 29 10:40:00 2020  6460 ! Warning: hub: ProbeStats: 5587 packets dropped by apmpacket in the last 300 seconds
    Wed Jan 29 10:35:00 2020  6460 ! Warning: hub: ProbeStats: 161815 packets dropped by apmpacket in the last 300 seconds
    Wed Jan 29 10:30:00 2020  6460 ! Warning: hub: ProbeStats: 124645 packets dropped by apmpacket in the last 300 seconds
    Wed Jan 29 10:25:00 2020  6460 ! Warning: hub: ProbeStats: 128496 packets dropped by apmpacket in the last 300 seconds
    Wed Jan 29 10:20:00 2020  6460 ! Warning: hub: ProbeStats: 196754 packets dropped by apmpacket in the last 300 seconds
    Wed Jan 29 10:15:00 2020  6460 ! Warning: hub: ProbeStats: 28440 packets dropped by apmpacket in the last 300 seconds
    Wed Jan 29 10:10:00 2020  6460 ! Warning: hub: ProbeStats: 10499 packets dropped by apmpacket in the last 300 seconds
    Wed Jan 29 10:05:00 2020  6460 ! Warning: hub: ProbeStats: 12996 packets dropped by apmpacket in the last 300 seconds
    Wed Jan 29 10:00:00 2020  6460 ! Warning: hub: ProbeStats: 46934 packets dropped by apmpacket in the last 300 seconds
    Wed Jan 29 09:55:00 2020  6460 ! Warning: hub: ProbeStats: 29737 packets dropped by apmpacket in the last 300 seconds
    Wed Jan 29 09:50:00 2020  6460 ! Warning: hub: ProbeStats: 13201 packets dropped by apmpacket in the last 300 seconds
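
    To put the warnings above into perspective, the counts can be totaled with a few lines of Python. This is only a sketch: the regular expression is derived from the log lines quoted in this post, and whether other TIM versions use the same wording is an assumption. The function name is hypothetical.

```python
import re

# Pattern taken from the warning lines quoted in this post.
DROP_RE = re.compile(
    r"ProbeStats: (\d+) packets dropped by apmpacket in the last (\d+) seconds"
)


def summarize_drops(log_lines):
    """Return (total dropped packets, total seconds covered, drops per second)."""
    total_packets = 0
    total_seconds = 0
    for line in log_lines:
        match = DROP_RE.search(line)
        if match:
            total_packets += int(match.group(1))
            total_seconds += int(match.group(2))
    rate = total_packets / total_seconds if total_seconds else 0.0
    return total_packets, total_seconds, rate


sample = [
    "Wed Jan 29 10:40:00 2020  6460 ! Warning: hub: ProbeStats: 5587 packets dropped by apmpacket in the last 300 seconds",
    "Wed Jan 29 10:35:00 2020  6460 ! Warning: hub: ProbeStats: 161815 packets dropped by apmpacket in the last 300 seconds",
]
packets, seconds, rate = summarize_drops(sample)
print(f"{packets} packets dropped in {seconds} s ({rate:.1f} per second)")
```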
    The old packets are indeed there as well. The definition is:
    Old packets not processed. Their timestamp is older than (now - TimMtpLimitPeriodInMinute * 60). Default value of TimMtpLimitPeriodInMinute is 15
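
    Read literally, that definition is a simple cutoff comparison. A small Python sketch of it, assuming that "now" and the packet timestamp are comparable wall-clock times and that the limit is converted from minutes to seconds as the formula suggests. The function name and the sample times are made up for illustration.

```python
from datetime import datetime, timedelta

DEFAULT_LIMIT_MINUTES = 15  # default of TimMtpLimitPeriodInMinute per the definition


def is_too_old(packet_time, now, limit_minutes=DEFAULT_LIMIT_MINUTES):
    """True if the packet timestamp is older than now - limit_minutes * 60 seconds."""
    cutoff = now - timedelta(seconds=limit_minutes * 60)
    return packet_time < cutoff


now = datetime(2020, 1, 29, 10, 40, 0)
print(is_too_old(datetime(2020, 1, 29, 10, 20, 0), now))  # 20 minutes old: True
print(is_too_old(datetime(2020, 1, 29, 10, 30, 0), now))  # 10 minutes old: False
```

    With the default of 15 minutes, any pcap that has been waiting longer than that while the workers were down would be discarded on their return.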

    What seems to be happening is that the TIM workers are crashing or otherwise ceasing to process data, while apmpacket keeps stacking up pcaps for them. Because the TIMs are offline for too long, when they come back they have to discard the pcaps, as these are too old by then.

    As you interrupted the log gathering process (disk speed and pcap for packet analysis), I can't tell whether the traffic quality is the culprit for the TIM crashes overall.

    So what I'd need is a packet capture, to have a look at what traffic is actually sent to the TIM.

    From the partial logs I have received, I also see that you don't use a tmpfs filesystem for temporarily storing the pcaps. See the section on tmpfs in https://techdocs.broadcom.com/content/broadcom/techdocs/us/en/ca-enterprise-software/it-operations-management/application-performance-management/10-7/installing/apm-installation/install-and-configure-tim-for-ca-cem.html for how to set this up.



  • 7.  RE: Packets TIM

    Posted Jan 29, 2020 01:39 PM
    Hi Jörg,

    Thanks for answering. I just realized that the logs were incomplete, so I have attached the correct logs again. Some additional information:
    * I set up the TIM on disk, not in memory, so I did not create the tmpfs filesystem that the documentation describes.
    * The client has a Cisco ACI network device and cannot filter the traffic, so I am filtering it by IP through the CEM web server.


    I see in the graph that the TIM constantly restarts, but that is happening automatically, because the service has not been restarted manually that many times.

    Link:
    https://msldistribucionesyciasas-my.sharepoint.com/:u:/g/personal/rbriceno_msl_com_co/EckcZPQeHdpPrVvX_b-hGYIBabEgDD1GnoP_sF9KYNJvXA?e=pQwbLM

    Thx
    Richard


  • 8.  RE: Packets TIM

    Broadcom Employee
    Posted Jan 30, 2020 04:55 AM
    Thanks. Got the new log set and indeed, these are complete.

    You are using 4 network interfaces to collect data. On practically all interfaces I do see lots of missing packets (ACKed unseen segment). Below for eth4.
    === Manual TCP analysis !  ==================================================== 
    Total segments:        1000000 
    Out of Order segments: 1126 
    ACKed unseen Segment: 346066 
    TCP Retransmission: 407 
    Duplicate ACK: 226
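
    To see how large each of these counts is relative to the sample, the summary can be turned into percentages with a short Python sketch. It assumes the "label: number" layout quoted above and nothing more about the health-check output. The function names are hypothetical.

```python
def parse_tcp_summary(text):
    """Collect 'label: number' lines from the TCP analysis summary into a dict."""
    counts = {}
    for line in text.splitlines():
        if ":" not in line or line.lstrip().startswith("==="):
            continue  # skip the banner line and anything without a label
        label, _, value = line.rpartition(":")
        value = value.strip()
        if value.isdigit():
            counts[label.strip()] = int(value)
    return counts


def share_of_total(counts, total_key="Total segments"):
    """Express every counter as a fraction of the total segment count."""
    total = counts[total_key]
    return {name: n / total for name, n in counts.items() if name != total_key}


summary = """\
=== Manual TCP analysis !  ===
Total segments:        1000000
Out of Order segments: 1126
ACKed unseen Segment: 346066
TCP Retransmission: 407
Duplicate ACK: 226
"""
for name, share in share_of_total(parse_tcp_summary(summary)).items():
    print(f"{name}: {share:.2%}")
```

    For the eth4 numbers, roughly a third of the sampled segments are "ACKed unseen Segment", which is the figure that stands out.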

    This causes the TIM worker's memory consumption to grow until the watchdog restarts the process (which would explain the TIM restarts).
    My assumption is that the Cisco device is sending the traffic in random order, meaning the request and the response do not arrive on the same interface. To my knowledge, apmpacket will not merge the traffic flows, and the TIMs will not merge them either. Hence, if the request packets of stream X arrive on interface eth4 and the response packets on interface eth5, the TIM worker assigned to one set of pcaps will report "ACKed unseen Segment" (meaning the ACK has been seen, but not the payload), while the other TIM worker will ignore the traffic.

    On the other hand, I see quite a lot of text/xml data that the TIM has to analyze. Analysis of this type of data is very CPU-intensive.

    If you are not Monitoring XML traffic, disable the XML Parser in the TIM by adding: 

    XmlPlugin/Enabled: 0

    to the TIM configuration (TIM Settings, create a new field). This will make the TIM ignore all XML traffic that may come through and free up considerable resources. This has the best chance of reducing the TIM load and crashes.

    Also, disable tracing, as it costs a lot of resources, and set the TIM log file size to 100 MBytes to reduce the I/O caused by log rotation.

    === Tim active options: /app/CA/APM/tim/config/tim.options !  =================
    # This file is generated automatically -- do not edit.
    # The script that starts tim reads options from this file.
    -tracesessionsandlogins
    -tracetransets
    -tracetranunits
    -tracematchedtrancomps
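
    A quick way to confirm which trace options are still active is to scan that file for lines starting with "-trace". The Python sketch below assumes only the layout quoted above (comment lines starting with "#", one option per line). Since the file header says it is generated automatically, the sketch reads it and never modifies it. The function name is hypothetical.

```python
def active_trace_options(options_text):
    """Return the trace options listed in the text of a tim.options file."""
    flags = []
    for raw in options_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("-trace"):
            flags.append(line)
    return flags


options = """\
# This file is generated automatically -- do not edit.
# The script that starts tim reads options from this file.
-tracesessionsandlogins
-tracetransets
-tracetranunits
-tracematchedtrancomps
"""
print(active_trace_options(options))
```

    An empty list after changing the TIM settings would indicate that tracing is off.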
    

    The other thing that can be done is to check whether each communication stream (that is, the traffic between one server and one client) can be SPAN'ed to the same TIM interface. In my opinion, this would prevent the ACKed unseen Segments.


    Remember, the TIM is very sensitive to the quality of the traffic it receives. If the traffic is not clean, there is not much it can do, as it will spend more time trying to work through the bad traffic than actually analyzing data. This is why we recommend using network taps in inline mode, because that is the only way of getting a real copy of the traffic. SPAN will only give us a "look-alike" copy of the traffic.