Next-Generation Mainframers Community


Performance Monitoring: How It Works and Why to Do It

By Elliot Agnew posted Aug 02, 2023 03:49 PM

  

Much like you would use Task Manager to see which programs are consuming resources on a personal computer, a mainframe also needs a tool that monitors its performance. Unlike a personal computer, however, the stakes are business critical: if something happens to a mainframe, critical business operations are disrupted.

 

This post explains the essentials of performance monitoring, the data that supports it, and why it matters, from the perspective of an intern whose internship was a first exposure to the world of mainframe. Performance monitoring of mainframes is an important part of keeping business operations running.

SMF and the Data that Drives Monitoring

To conduct performance monitoring, you need data about the operational status of the system. Because mainframes are so critical, there is a wealth of data and metrics available, sourced from several different products. Since my project involved SMF (IBM’s System Management Facilities), I dove into SMF records to learn how useful metrics like CPUBSY_CORE (a CPU core’s busy percentage) are extracted and used to provide insight into the mainframe’s operation.

 

SMF collects performance data about the activities and processes happening on the mainframe and writes records into system data sets. For example, CPU utilization and related processor-activity data come from type 70 records, which store CPU activity.
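To make the record-to-metric step concrete, here is a minimal sketch of how a busy percentage can be derived from interval data like that carried in a type 70 record. The structure and field names below are hypothetical placeholders, not actual SMF fields; only the general formula (busy time as the share of the interval not spent waiting) reflects the idea.

# A minimal sketch (not real SMF parsing): computing a CPU-busy percentage
# from interval data similar to what an SMF type 70 record carries.
# The class and field names below are hypothetical placeholders.

from dataclasses import dataclass


@dataclass
class CpuIntervalSample:
    """Per-core sample for one SMF interval (hypothetical structure)."""
    core_id: int
    interval_seconds: float   # length of the SMF interval
    wait_seconds: float       # time the core spent waiting (idle)


def cpu_busy_percent(sample: CpuIntervalSample) -> float:
    """Busy % = time not spent waiting, as a share of the interval."""
    busy = sample.interval_seconds - sample.wait_seconds
    return 100.0 * busy / sample.interval_seconds


# Example: a core that waited 270s of a 900s (15-minute) interval was 70% busy.
print(cpu_busy_percent(CpuIntervalSample(core_id=0, interval_seconds=900, wait_seconds=270)))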

How MOI Uses Performance Metrics

MOI (Mainframe Operational Intelligence) is Broadcom’s performance monitoring solution. It analyzes incoming metrics with a statistical model to detect abnormal behavior. When abnormal behavior is detected, MOI signals an alert to notify system administrators.

 

A fixed normal range is not sufficient to detect abnormal behavior, because what is normal and expected varies over time. For example, CPU load may be higher during business hours, when there are more transactions. MOI uses machine learning to determine what normal looks like for any given time and alerts administrators when values fall outside that range.
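As a toy illustration of why a fixed range falls short (and not a description of MOI’s actual model), the sketch below learns a separate baseline for each hour of the day and flags values that stray far from that hour’s norm.

# A toy illustration of "normal depends on the time", not MOI's actual model:
# learn a per-hour baseline (mean and standard deviation) from history, then
# flag a new value as abnormal when it falls far outside that hour's baseline.

from statistics import mean, stdev
from collections import defaultdict


def build_hourly_baseline(history):
    """history: iterable of (hour_of_day, metric_value) pairs."""
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) > 1}


def is_abnormal(baseline, hour, value, threshold=3.0):
    """Alert when the value is more than `threshold` standard deviations
    away from what is normal for this hour of the day."""
    mu, sigma = baseline[hour]
    return abs(value - mu) > threshold * max(sigma, 1e-9)


# 90% CPU busy may be normal at 14:00 but abnormal at 03:00.
history = [(14, v) for v in (85, 88, 92, 90)] + [(3, v) for v in (10, 12, 9, 11)]
baseline = build_hourly_baseline(history)
print(is_abnormal(baseline, 14, 90))  # False: typical afternoon load
print(is_abnormal(baseline, 3, 90))   # True: far above the overnight norm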

 

By combining alerts with a visualization of each metric’s history, performance monitoring improves the ability to quickly identify and resolve issues before critical operations are disrupted.

 


High-level data flow of MOI.

SMF Processing Project Overview

My intern project involved streamlining the processing of SMF records, which benefits the system by reducing the amount of metric data it needs to handle. SMF alone outputs a large amount of data, and it is only one of the supported products.

 

Originally, SMF records were processed within the MOI system itself. Because the entire record was sent over, extra data was transmitted over the network, processed in the system, and persisted in the database. Even if only a couple of metrics were enabled, the whole record still had to be handled while only a few of its data fields were actually used.

A newer method had already been developed in which records are processed as close as possible to the point where the data is collected. Processing there lets us choose which metrics we would like to see, and only those metrics are computed from the record data. With this method, the system only needs to handle the desired metric values instead of the entire SMF record. When metrics are aggregated from multiple records, the reduction is even greater: what was several records is reduced to only the desired metrics.
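The sketch below illustrates the general idea under simplified assumptions; the record fields, metric names, and function are hypothetical and not the product’s code. The point is that only the enabled metrics are computed near the source, so only a handful of values travel onward instead of the whole record.

# An illustrative sketch (not the product's code) of shipping only enabled
# metrics instead of whole records. Record fields and names are hypothetical.

# A parsed record may carry dozens of fields...
raw_record = {
    "record_type": 70,
    "timestamp": "2023-07-01T14:00:00Z",
    "cpu_wait_seconds": 270.0,
    "interval_seconds": 900.0,
    # ... many more fields that no enabled metric actually uses
}

# ...but only a few metrics are enabled for monitoring.
ENABLED_METRICS = {"CPUBSY_CORE"}


def compute_enabled_metrics(record, enabled):
    """Compute just the requested metrics near the collection point,
    so only small metric values travel over the network."""
    metrics = {}
    if "CPUBSY_CORE" in enabled:
        busy = record["interval_seconds"] - record["cpu_wait_seconds"]
        metrics["CPUBSY_CORE"] = 100.0 * busy / record["interval_seconds"]
    return metrics


# The payload sent onward is a handful of numbers instead of the full record.
print(compute_enabled_metrics(raw_record, ENABLED_METRICS))  # {'CPUBSY_CORE': 70.0}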

When this method was created, only some record types were transitioned over; three types (70, 72, and 74) still used the old method. Migrating the remaining record types to the new method not only gains the same performance improvements but also streamlines the system by removing the need to maintain two different methods.

A Technical Look into SMF Challenges

One specific issue I encountered came up while implementing record type 72: some intervals had a value of zero while others were higher than expected. Having run into other issues with data not being read correctly, I knew some avenues to investigate.

I first checked the data being parsed, but it was correct. After investigating the timestamps, I found the cause to be variability in the exact time each interval was cut. Because this type uses multiple records, its metrics need to be aggregated. The records are sent in a batch, though, which behaves differently from the previously aggregated types. All previous types either send a single record per interval or are streamed, meaning records arrive on a frequent, irregular basis, and those records are simply placed in the “bucket” (interval) during which they arrive. With type 72 sent in batches, however, records need to be placed in the correct interval even when they arrive slightly before or after the exact interval boundary.
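As a simplified illustration of the bucketing idea (not the actual implementation), the sketch below snaps each record’s timestamp to the nearest interval boundary before aggregating, so records that arrive a few seconds on either side of the cut still land in the same interval.

# A simplified sketch of the bucketing idea (not the actual implementation):
# instead of filing a record under the interval in which it happens to arrive,
# snap its timestamp to the nearest interval boundary so records that land
# slightly before or after the cut still aggregate into the right bucket.

from datetime import datetime, timedelta

INTERVAL = timedelta(minutes=15)


def bucket_for(timestamp: datetime, interval: timedelta = INTERVAL) -> datetime:
    """Return the interval boundary nearest to the record's timestamp."""
    epoch = datetime(1970, 1, 1)
    offset = (timestamp - epoch) / interval
    return epoch + round(offset) * interval


def aggregate(records):
    """Group (timestamp, value) pairs by nearest interval and average them."""
    buckets = {}
    for ts, value in records:
        buckets.setdefault(bucket_for(ts), []).append(value)
    return {b: sum(v) / len(v) for b, v in buckets.items()}


# Two records cut a few seconds on either side of 14:00 land in the same bucket.
records = [
    (datetime(2023, 7, 1, 13, 59, 58), 40.0),
    (datetime(2023, 7, 1, 14, 0, 3), 60.0),
]
print(aggregate(records))  # one bucket at 14:00 with the average value 50.0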

 

After I developed a new aggregation method for the batch of records, the issue was resolved.

Summary

The performance of mainframes can be monitored by tracking a multitude of performance metrics over time. System administrators are notified when those metrics indicate abnormal behavior, enabling faster responses to issues, with valuable metric history on hand to help identify the cause.

 

This post covered the essentials of performance monitoring that I learned while working on reducing the resource demands of processing the data that makes it all possible. I hope you found the topic as interesting as I found it to work on.
