Al Sherkow's I/S Management Strategies, Ltd.
LPAR Capacity and Software Usage Analysis Software
Capacity Planning and Performance Management for z/OS


Message IWM063I, machine speed changes, what you need to know, and the potential impact on your software charges

Last Updated: Thursday, 8 January, 2009

Cooling Unit Failure May Cause Slowdown on z890, z990, z9BC, z9EC, z10EC and z10BC

 

Normally LCS Advantage Tips are provided only to licensed sites, but this issue is important to all sites with recent machines. This LCSTIP was provided to licensees on 14 October 2008.

 

There was an interesting discussion on MXG-L early in October 2008 that has specific implications for software pricing.

One of my customers had a cooling problem on a z9EC that started on Tuesday, September 30. When a cooling problem occurs, the machine runs slower to generate less heat, trying to avoid an outage due to overheating. This behavior was described by Horst Sinram in a 2005 SHARE session (“z/OS Workload Manager, The Latest and Greatest”, page 8), and there is an article, pointed out by Don Deese: "Hybrid cooling with cycle steering in the IBM eServer z990", by G. F. Goth, D. J. Kearney, U. Meyer, and D. W. Porter, IBM Journal of Research and Development, Volume 48, Number 3/4, 2004.

When the machine’s speed changes because of this problem, or because of other changes such as configuring an engine online or changing the speed of a machine with multiple capacity settings, the hardware notifies z/OS and the following occurs within z/OS:

  • STSI information changes
  • ENF20 posted in z/OS
  • WLM/SRM will readjust the hardware speed constant for a significant change
    • Message IWM063I WLM POLICY WAS REFRESHED DUE TO A PROCESSOR SPEED CHANGE
    • RMF interval synch’d, SMF99 records written

At this site, the cooling problem and the related slowing of the clock speed led to a 25% drop in capacity. (One other site reported an impact of 7 to 10%.) This occurred across their month-end processing. IBM did not notify the site of the cooling problem with their machine and did not fix the problem until Saturday, October 4th. Meanwhile, the site was searching for what had caused their service levels to degrade.

You should be certain that your site is handling message IWM063I with some type of console automation. If you are using capacity upgrade on demand, you will know when you are changing the machine's speed. But if you are not doing any of the activities that would cause an adjustment to the speed constant, this message is very serious. Because this information is so serious, we are not labeling this LCS Advantage Tip confidential.
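Real console automation would run in your automation product on z/OS. As an offline illustration only, here is a minimal Python sketch that scans saved log text for the message. The function name `find_speed_changes` and the layout of the sample log lines are assumptions made for this example; only the message ID and message text come from this tip.

```python
import re

# Message ID from this tip. The \b anchors avoid matching longer IDs.
SPEED_CHANGE_MSG = re.compile(r"\bIWM063I\b")

def find_speed_changes(log_lines):
    """Return every log line that contains message IWM063I."""
    return [line.rstrip("\n") for line in log_lines if SPEED_CHANGE_MSG.search(line)]

# Made-up sample lines; a real SYSLOG or OPERLOG extract is laid out differently.
sample = [
    "09:39:34.82 TST1 IWM063I WLM POLICY WAS REFRESHED DUE TO A PROCESSOR SPEED CHANGE",
    "09:40:01.10 TST1 SOME OTHER MESSAGE",
]

for hit in find_speed_changes(sample):
    print("ALERT:", hit)
```

If a hit appears at a time when nobody changed the configuration or capacity setting, treat it as a possible hardware problem and follow up.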

Here is an example report:

Sample SPEEDCHG Report
 
SPEEDCHG: Suspected Speed Change Due to Hardware Problem  

Search System Logs for Message IWM063I to Confirm the Problem
The Model, SW MSUs and Number of GP Engines Have Not Changed

 Machine Type: 2094
Machine Model: 732
Serial Number: WXYZ

                    Loss or
          Start of  Gain in                            Previous      Changed
          Interval  Computing              Time of       HW MSU       HW MSU  Percent
       With Change  Power             Speed Change     Constant     Constant   Change  SYSPLEX   SYSNAME   SYSM
  -------------------------------------------------------------------------------------------------------------
  12SEP08:09:29:00  Loss       12SEP08:09:39:34.82   28,368.794   20,997.375  (26.0%)  TSTRING   TST1      TST1
                               12SEP08:09:39:35.57   28,368.794   20,997.375  (26.0%)  RING1     LPRS      LPRS
                               12SEP08:09:39:35.59   19,300.362   14,324.082  (25.8%)  RING1     LPRD      LPRD
                               12SEP08:09:39:35.77   20,592.021   15,267.176  (25.9%)  RING1     LPRM      LPRM
  12SEP08:16:44:00  Gain       12SEP08:16:55:31.06   20,997.375   28,368.794   26.0%   TSTRING   TST1      TST1
                               12SEP08:16:55:32.83   20,997.375   28,368.794   26.0%   RING1     LPRS      LPRS
                               12SEP08:16:55:32.86   14,324.082   19,300.362   25.8%   RING1     LPRD      LPRD
                               12SEP08:16:55:32.88   15,267.176   20,592.021   25.9%   RING1     LPRM      LPRM
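
The Percent Change column can be reproduced from the two HW MSU constants. The sketch below is an inference from the sample values, not the report's actual code: dividing the difference by the larger of the two constants matches both the Loss and the Gain rows. The function name `percent_change` is made up for this example.

```python
def percent_change(previous, changed):
    """Signed change between two HW MSU constants, as a percent of the larger one.

    Using the larger constant as the baseline is an assumption inferred from
    the sample report, where Loss and Gain rows show the same magnitude.
    """
    baseline = max(previous, changed)
    return (changed - previous) / baseline * 100.0

# Values copied from the sample report: (SYSNAME, previous, changed).
rows = [
    ("TST1", 28368.794, 20997.375),  # Loss row
    ("LPRD", 19300.362, 14324.082),  # Loss row
    ("LPRM", 20592.021, 15267.176),  # Loss row
    ("TST1", 20997.375, 28368.794),  # Gain row
]

for sysname, prev, new in rows:
    print(f"{sysname}: {percent_change(prev, new):+.1f}%")
```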

We have verified that the constant for determining SW MSUs (SMF70CPA) does not change when the service degradation occurs. WLM updating R723MADJ changes the calculation to represent the drop in service from a Hardware MSUs perspective. Not updating SMF70CPA leads to WLM calculating the Software MSUs based on the "Announced Capacity SW MSUs" without any degradation in capabilities.

Besides the Software MSUs not being correct, you may notice the degradation in your ability to meet your service objectives. The size of the degradation is not always 25%, but the machine truly has less capacity and you may need that capacity for your processing. Your low priority work should suffer most of the impact, but if you have little low priority work, the important work will also be impacted.

Machines with this problem are doing less work, but the software MSUs used for billing are not changed. If the slower machine is running at 100% of its slower capabilities for 4 hours, then the 4-hour rolling average (SMF70LAC) will reflect 100% of the software MSUs, as if there were no loss of capability. At this site that would be about 25% too many MSUs.
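
To put rough numbers on this, here is a small sketch using the TST1 constants from the sample report. It assumes that delivered capacity scales in proportion to the HW MSU constant, and it uses a hypothetical reported 4-hour rolling average of 100 MSUs; neither the figure nor the function name comes from a real system.

```python
def delivered_fraction(normal_constant, degraded_constant):
    """Fraction of normal capacity delivered while degraded.

    Assumes capacity is proportional to the HW MSU constant.
    """
    return degraded_constant / normal_constant

# HW MSU constants for TST1 from the sample report.
frac = delivered_fraction(28368.794, 20997.375)

reported_msus = 100.0                   # hypothetical 4-hour rolling average at 100% busy
effective_msus = reported_msus * frac   # work the machine could actually deliver

print(f"delivered capacity: {frac:.1%}")
print(f"effective MSUs: {effective_msus:.1f} of {reported_msus:.1f} reported")
```

Under these assumptions, roughly a quarter of the reported MSUs correspond to capacity the machine did not deliver.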

You'll want to make sure the maximum simultaneous 4-hour rolling average MSUs calculated by SCRT does not occur between the starting and ending IWM063I messages. You also don't want the peak within 4 hours after the last IWM063I, that is, within 4 hours of the first time the SU_SEC of your LPARs is back to the "normal" value.

If your peak is during this period of degraded service due to cooling, I recommend excluding that time span from your SCRT reports.
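
As a minimal sketch of that check, the Python below flags a peak that falls between the first and last speed change, or within 4 hours after the last one. The function name `peak_is_affected` is made up, the times are taken from the sample report above (truncated to whole seconds), and treating the peak as a single timestamp is a simplification. It does not model how SCRT itself handles exclusions.

```python
from datetime import datetime, timedelta

ROLLING_WINDOW = timedelta(hours=4)  # length of the 4-hour rolling average

def peak_is_affected(peak_time, first_change, last_change):
    """True if the peak falls in the degraded period or within 4 hours after it."""
    return first_change <= peak_time <= last_change + ROLLING_WINDOW

# Speed change times from the sample report.
first = datetime(2008, 9, 12, 9, 39, 34)   # Loss
last = datetime(2008, 9, 12, 16, 55, 31)   # Gain, back to normal speed

print(peak_is_affected(datetime(2008, 9, 12, 19, 0), first, last))   # True
print(peak_is_affected(datetime(2008, 9, 12, 21, 30), first, last))  # False
```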


Copyright (c) 1998-2017 Alan M. Sherkow