First Fault Problem Resolution
As ConicIT learns a system over time, it builds models of how system parameters should behave and how they may affect one another. In this example, ConicIT used its model of past CPU behavior to discover a non-standard pattern of CPU activity. Because the prediction model recognized that such behavior may indicate future performance problems, it used the anomaly as the basis for requesting additional monitoring information (e.g., from MVS) to analyze external resources (e.g., TSO) that could also have been accessing the CPU and causing the anomalous behavior, especially any dispatching waits. ConicIT also began requesting additional monitoring information from CICS to determine whether the anomaly could be explained as a temporal phenomenon (e.g., day of the week), by CICS transaction dispatching waits, or by contention on the internal CICS QR TCB.
As a result of its analysis of all this data, the ConicIT rules engine decides whether to issue an alert. The collected data, and the reasoning that triggered the collection, are stored by ConicIT for later analysis. The goal is that once an alert is issued, system programmers have all the relevant information at their fingertips, along with a basic causal analysis, without putting noticeable load on the system being monitored. It is still up to the system programmers to solve the problem, but the diagnosis becomes an order of magnitude easier.
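The flow described above (learn a baseline, flag an anomaly, request deeper monitoring data, let a rules engine decide on an alert, and store the data with its reasoning) can be sketched generically. This is an illustrative sketch only, not ConicIT's actual implementation; the z-score model and names such as `collect_extended_metrics` and `rules_engine` are hypothetical:

```python
# Sketch of anomaly-triggered data collection and alerting.
# NOT ConicIT's implementation; all names and thresholds are hypothetical.
from statistics import mean, stdev

def is_anomalous(history, sample, threshold=3.0):
    """Flag a sample that deviates from the learned baseline by more
    than `threshold` standard deviations (a simple z-score model)."""
    if len(history) < 2:
        return False
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return sample != mu
    return abs(sample - mu) / sigma > threshold

def handle_cpu_sample(history, sample, collect_extended_metrics, rules_engine):
    """On an anomaly, request deeper monitoring data (e.g. dispatching
    waits) and let the rules engine decide whether to raise an alert."""
    record = {"sample": sample, "anomaly": False, "alert": False}
    if is_anomalous(history, sample):
        record["anomaly"] = True
        record["extra"] = collect_extended_metrics()  # deep-dive request
        record["alert"] = rules_engine(record["extra"])
    history.append(sample)
    return record  # stored for later analysis, with the reasoning attached

# Usage: a quiet CPU baseline, then a spike triggers collection.
history = [20.0, 21.0, 19.5, 20.5, 20.0, 21.5, 19.0, 20.2]
result = handle_cpu_sample(
    history, 95.0,
    collect_extended_metrics=lambda: {"qr_tcb_wait_ms": 480},
    rules_engine=lambda extra: extra["qr_tcb_wait_ms"] > 100,
)
```

The point of the pattern is that the expensive, detailed collection happens only when the cheap baseline check fires, which is how the monitored system avoids carrying that load continuously.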
Batch Window Overrun
A current ConicIT customer experienced a problem over the weekend in handling transactions from Automated Teller Machines (ATMs). The following Monday, they were greeted by alerts regarding performance slowdowns over the weekend. None of the problems were severe enough to cause a panic over the weekend, but they needed to be resolved to ensure they weren't indicative of a deeper issue.
The standard logs and monitors showed nothing unusual about the ATM transactions that occurred during the slowdown. However, ConicIT’s prediction engine had noticed a database resource contention issue (through anomalies in the DB2 resource usage information) and used that information to predict possible CICS transaction response time problems.
That prediction caused ConicIT to start collecting system information regarding the state of the database involved and the transactions that took place during the slowdown. From that information, it was immediately clear to the system programmers that a standard overnight batch process had overrun its batch window, during which it had locked the database. This happened because the batch program had to process an unusually large amount of data received on the Friday before the weekend.
Intermittent Performance Slowdowns
A large organization was experiencing intermittent performance slowdowns that affected various mainframe transactions, with no apparent pattern to the slowdowns. Not surprisingly, by the time a slowdown was noticed, the relevant system information was no longer available from the monitor and had not been captured in any log, making problem determination extremely difficult. This led the organization to pilot ConicIT to see whether it could assist in diagnosing the problem.
ConicIT was installed to learn the standard behavior of the system. As part of that learning process, ConicIT's prediction models learned the system's standard behavior relating database resource availability and CPU thread activity to CICS transaction response time. As a result, when ConicIT sensed that DB2 resource wait time was rising above the norm, it recognized that this might be a predictor of a brewing transaction response time problem. This caused ConicIT's prediction engine to proactively collect monitor information on transaction response time, CPU thread activity, and DB2 resource usage, before the slowdown actually occurred.
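The leading-indicator idea in the preceding paragraph — learn the norm for DB2 resource wait time, then trigger proactive collection when it drifts above that norm — can be sketched as follows. Again, this is a generic illustration under assumed mechanics (an exponentially weighted baseline and a relative margin), not ConicIT's actual model; all names and values are hypothetical:

```python
# Sketch of leading-indicator prediction: when DB2 resource wait time
# drifts above its learned norm, proactively collect related metrics
# before the transaction slowdown becomes user-visible.
# NOT ConicIT's implementation; names and thresholds are hypothetical.

class LeadingIndicator:
    """Exponentially weighted baseline of a metric; fires when the
    current value exceeds the baseline by a relative margin."""
    def __init__(self, alpha=0.2, margin=0.5):
        self.alpha = alpha      # smoothing factor for the baseline
        self.margin = margin    # fire at 50% above the learned norm
        self.baseline = None

    def update(self, value):
        if self.baseline is None:
            self.baseline = value
            return False
        fire = value > self.baseline * (1 + self.margin)
        if not fire:
            # Keep learning the norm from non-anomalous values only.
            self.baseline += self.alpha * (value - self.baseline)
        return fire

def monitor(db2_wait_samples, collect):
    """Snapshot response-time, thread, and DB2 metrics the moment the
    wait-time indicator fires, ahead of the actual slowdown."""
    indicator = LeadingIndicator()
    snapshots = []
    for t, wait_ms in enumerate(db2_wait_samples):
        if indicator.update(wait_ms):
            snapshots.append((t, collect()))
    return snapshots

# Usage: steady waits, then a climb that trips the indicator early.
samples = [10, 11, 10, 12, 11, 25, 40]
snaps = monitor(samples, collect=lambda: {"resp_ms": 120, "threads": 14})
```

With these assumed parameters, the indicator fires on the samples at indices 5 and 6, so the detailed snapshots exist before anyone has noticed a slowdown, which is exactly the property the case study relies on.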
The next time the problem occurred, ConicIT predicted it and collected the relevant information before the slowdown was noticed. With this information, the system programmers understood exactly the state of the system during, and immediately before, the slowdown. The analysis showed that a specific set of transactions were heavy users of database resources, and when they ran concurrently they could cause a general response time slowdown for the whole system.