Message IWM063I, machine speed changes, what you need to know, and the potential impact on your software charges
Thursday, 8 January, 2009
Cooling Unit Failure May Cause Slowdown on z890, z990, z9BC, z9EC, z10EC and z10BC
Normally LCS Advantage Tips are provided only to licensed sites, but this issue is important to all sites with recent machines. This LCSTIP was provided to licensees on 14 October 2008.
There was an interesting discussion on MXG-L early in October 2008 that has specific implications for software pricing.
One of my customers had a cooling problem on a z9EC that started on Tuesday, September 30. When a cooling problem occurs, the machine runs slower so that it generates less heat, in an attempt to avoid an outage due to overheating. This behavior was described by Horst Sinram in a 2005 SHARE session (“z/OS Workload Manager, The Latest and Greatest”, page 8), and there is an article, pointed out by Don Deese, in the IBM Journal of Research and Development: "Hybrid cooling with cycle steering in the IBM eServer z990", by G. F. Goth, D. J. Kearney, U. Meyer, and D. W. Porter, Volume 48, Number 3/4, 2004.
When the machine’s speed changes because of this problem, or because of other changes such as configuring an engine online or changing the speed of a machine with multiple capacity settings, the hardware notifies z/OS and the following occurs within z/OS:
- STSI information changes
- ENF20 posted in z/OS
- WLM/SRM will readjust the hardware speed constant for a significant change
- Message IWM063I WLM POLICY WAS REFRESHED DUE TO A PROCESSOR SPEED CHANGE
- RMF interval synch’d, SMF99 records written
At this site, the cooling problem and the related slowing of the clock speed led to a 25% drop in capacity. (One other site reported an impact of 7 to 10%.) This occurred across the site's month-end processing. IBM did not notify the site of the cooling problem with its machine and did not fix the problem until Saturday, October 4th. In the meantime, the site was searching for what had caused its service levels to degrade.
You should be certain that your site is handling message IWM063I with some type of console automation. If you are using capacity upgrade on demand, you will know you are doing it. But if you are not doing any of the activities that would cause an adjustment to the speed constant, this message is very serious. Because this information is so serious, we are not labeling this LCS Advantage Tip confidential.
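As a rough illustration of the kind of check that automation could perform, the sketch below scans log text for the message ID. It assumes the system log has been exported as plain text lines; the function name and the sample lines are hypothetical, and real console automation would normally be built in whatever automation product the site already runs.

```python
MESSAGE_ID = "IWM063I"

def find_speed_change_messages(log_lines):
    """Return (line_number, text) for every log line that mentions IWM063I."""
    hits = []
    for number, line in enumerate(log_lines, start=1):
        if MESSAGE_ID in line:
            hits.append((number, line.rstrip()))
    return hits

# Hypothetical log excerpt; the timestamp/system prefix is made up for illustration.
sample_log = [
    "09:39:30 TST1 SOME OTHER MESSAGE",
    "09:39:34 TST1 IWM063I WLM POLICY WAS REFRESHED DUE TO A PROCESSOR SPEED CHANGE",
    "16:55:31 TST1 IWM063I WLM POLICY WAS REFRESHED DUE TO A PROCESSOR SPEED CHANGE",
]

for number, text in find_speed_change_messages(sample_log):
    print(f"ALERT line {number}: {text}")
```

Any hit that does not coincide with a planned capacity change is worth escalating immediately.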
Here is an example report:
Sample SPEEDCHG Report
SPEEDCHG: Suspected Speed Change Due to Hardware Problem
Search System Logs for Message IWM063I to Confirm the Problem
The Model, SW MSUs and Number of GP Engines Have Not Changed
Machine Type:  2094
Machine Model: 732
Serial Number: WXYZ
Start of          Gain in                          Previous    Changed
Interval          Computing  Time of               HW MSU      HW MSU      Percent
With Change       Power      Speed Change          Constant    Constant    Change   SYSPLEX  SYSNAME  SYSM
12SEP08:09:29:00  Loss       12SEP08:09:39:34.82   28,368.794  20,997.375  (26.0%)  TSTRING  TST1     TST1
                             12SEP08:09:39:35.57   28,368.794  20,997.375  (26.0%)  RING1    LPRS     LPRS
                             12SEP08:09:39:35.59   19,300.362  14,324.082  (25.8%)  RING1    LPRD     LPRD
                             12SEP08:09:39:35.77   20,592.021  15,267.176  (25.9%)  RING1    LPRM     LPRM
12SEP08:16:44:00  Gain       12SEP08:16:55:31.06   20,997.375  28,368.794  26.0%    TSTRING  TST1     TST1
                             12SEP08:16:55:32.83   20,997.375  28,368.794  26.0%    RING1    LPRS     LPRS
                             12SEP08:16:55:32.86   14,324.082  19,300.362  25.8%    RING1    LPRD     LPRD
                             12SEP08:16:55:32.88   15,267.176  20,592.021  25.9%    RING1    LPRM     LPRM
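The Percent Change column in the sample report can be reproduced from the two constants. The sketch below is an inference from the sample numbers, not a documented formula: the figures match when the difference is divided by the larger (normal-speed) constant for both the loss and the gain rows. The function name is hypothetical.

```python
def percent_change(previous, changed):
    """Difference between two constants as a percentage of the larger one.

    Assumption: dividing by the larger value is inferred from the sample
    report, where loss and gain rows show the same percentage.
    """
    return abs(changed - previous) / max(previous, changed) * 100.0

# (system, previous constant, changed constant) taken from the sample report
rows = [
    ("TST1", 28368.794, 20997.375),
    ("LPRD", 19300.362, 14324.082),
    ("LPRM", 20592.021, 15267.176),
]

for name, previous, changed in rows:
    print(f"{name}: {percent_change(previous, changed):.1f}%")
```

A change of this size on several LPARs within the same second, with no change in model or engine count, is the pattern the report is flagging.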
We have verified that the constant for determining SW MSUs (SMF70CPA) does not change when the service degradation occurs. WLM updates R723MADJ, which changes the calculation to reflect the drop in service from a hardware MSU perspective. Because SMF70CPA is not updated, WLM calculates the software MSUs from the "Announced Capacity SW MSUs", with no allowance for the degraded capacity.
Besides the Software MSUs not being correct you may notice the degradation in your ability to meet your service objectives. The size of the degradation is not always 25%, but the machine truly has less capacity and you may need that capacity for your processing. Your low priority work should suffer most of the impact, but if you have little low priority work, the important work will be impacted also.
Machines with this problem are doing less work, but the software MSUs used for billing are not changed. If the slower machine runs at 100% of its reduced capacity for 4 hours, the 4-hour rolling average (SMF70LAC) will show 100% of the software MSUs, as if no capacity had been lost. At this site that would be about 25% too many MSUs.
You'll want to make sure the peak simultaneous 4-hour rolling average MSUs calculated by SCRT does not fall between the first and last IWM063I messages. You also don't want the peak to fall within 4 hours after the last IWM063I, or after the first time the SU_SEC of your LPARs is back to its "normal" value.
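To make the arithmetic concrete, here is a minimal sketch with made-up numbers: a machine rated at a hypothetical 1,000 software MSUs that has slowed by 26% and is running flat out. The function and parameter names are assumptions for illustration only.

```python
def overstated_msus(rated_msus, slowdown_fraction, utilization=1.0):
    """Compare the MSUs reported for billing with the capacity actually delivered.

    rated_msus:        announced software MSU rating (hypothetical value below)
    slowdown_fraction: fraction of capacity lost to the slowdown, e.g. 0.26
    utilization:       how busy the slowed machine is, 1.0 meaning 100%
    """
    billed = rated_msus * utilization            # rating is not reduced
    delivered = billed * (1.0 - slowdown_fraction)
    return billed, delivered, billed - delivered

billed, delivered, excess = overstated_msus(rated_msus=1000, slowdown_fraction=0.26)
print(f"billed {billed:.0f}, delivered {delivered:.0f}, "
      f"excess {excess:.0f} ({excess / billed:.0%} of billed)")
```

In this example roughly a quarter of the reported MSUs correspond to capacity the machine could not deliver.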
If your peak is during this period of degraded service due to cooling, I recommend excluding that time span from your SCRT reports.
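The check described above can be sketched as follows. It assumes you already know the timestamps of the first and last IWM063I messages and the time at which the peak 4-hour rolling average was recorded; the function names are hypothetical, and the timestamps are taken from the sample report or invented for illustration.

```python
from datetime import datetime, timedelta

ROLLING_WINDOW = timedelta(hours=4)

def exclusion_window(first_msg, last_msg):
    """Span from the first IWM063I to 4 hours after the last one."""
    return first_msg, last_msg + ROLLING_WINDOW

def peak_is_affected(peak_time, first_msg, last_msg):
    """True if the peak falls inside the degraded period or its 4-hour tail."""
    start, end = exclusion_window(first_msg, last_msg)
    return start <= peak_time <= end

# First and last speed-change times from the sample report
first_msg = datetime(2008, 9, 12, 9, 39, 34)
last_msg = datetime(2008, 9, 12, 16, 55, 32)

# Hypothetical peak times for illustration
print(peak_is_affected(datetime(2008, 9, 12, 18, 0), first_msg, last_msg))
print(peak_is_affected(datetime(2008, 9, 12, 22, 0), first_msg, last_msg))
```

The 4-hour tail matters because a rolling average recorded shortly after the machine returns to normal speed still includes intervals from the slowed period.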