Power management aims at reducing operating costs for energy and cooling systems while at the same time keeping the performance of a system at a level that matches the current requirements. Thus, power management is always a matter of balancing the actual performance needs and power saving options for a system. Power management can be implemented and used at different levels of the system. A set of specifications for power management functions of devices and the operating system interface to them has been defined in the Advanced Configuration and Power Interface (ACPI). As power savings in server environments can primarily be achieved on processor level, this chapter introduces some of the main concepts and highlights some tools for analyzing and influencing relevant parameters.
At CPU level, you can control power usage in various ways: for example, by using idling power states (C-states), changing CPU frequency (P-states), and throttling the CPU (T-states). The following sections give a short introduction to each approach and its significance for power savings. Detailed specifications can be found at http://www.acpi.info/spec.htm.
Modern processors have several power saving modes called C-states. They reflect the capability of an idle processor to turn off unused components in order to save power. Whereas C-states have been available for laptops for some time, they are a rather recent trend in the server market (for example, with Intel processors, C-states have only been available since Nehalem).
When a processor runs in the C0 state, it is executing instructions. A processor running in any other C-state is idle. The higher the C number, the deeper the CPU sleep mode: more components are shut down to save power. Deeper sleep states save more power, but the downside is that they have higher latency (the time the CPU needs to go back to C0).
Some states also have submodes with different power saving latency levels. Which C-states and submodes are supported depends on the respective processor. However, C1 is always available.
Table 11.1, “C-States” gives an overview of the most common C-states.
Table 11.1. C-States
Mode | Definition |
---|---|
C0 | Operational state. CPU fully turned on. |
C1 | First idle state. Stops CPU main internal clocks via software. Bus interface unit and APIC are kept running at full speed. |
C2 | Stops CPU main internal clocks via hardware. State where the processor maintains all software-visible states, but may take longer to wake up through interrupts. |
C3 | Stops all CPU internal clocks. The processor does not need to keep its cache coherent, but maintains other states. Some processors have variations of the C3 state that differ in how long it takes to wake the processor through interrupts. |
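To see which idle states a particular machine offers, the cpuidle information in sysfs can be read. The following is a minimal sketch, assuming that the kernel exposes the files `name` and `latency` under `/sys/devices/system/cpu/cpu0/cpuidle/state*/`. This location is not covered elsewhere in this chapter, and the function name `list_cstates` is made up for illustration.

```shell
# list_cstates [CPUIDLE_DIR]
# Print one line per idle state: directory name, state name and
# exit latency in microseconds.
# Assumption: the cpuidle sysfs layout with "name" and "latency" files.
list_cstates() {
    dir=${1:-/sys/devices/system/cpu/cpu0/cpuidle}
    for state in "$dir"/state[0-9]*; do
        # Skip quietly if the glob did not match (cpuidle not available).
        [ -r "$state/name" ] || continue
        printf '%s %s %s\n' "${state##*/}" \
            "$(cat "$state/name")" "$(cat "$state/latency")"
    done
}
```

Calling `list_cstates` without arguments inspects the first processor. A higher latency value corresponds to a deeper state that takes longer to return to C0.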
While a processor operates (in C0 state), it can be in one of several CPU performance states (P-states). Whereas C-states are idle states (all but C0), P-states are operational states that relate to CPU frequency and voltage.
The higher the P-state, the lower the frequency and voltage at which the processor runs. The number of P-states is processor-specific and the implementation differs across the various types. However, P0 is always the highest-performance state. Higher P-state numbers represent slower processor speeds and lower power consumption. For example, a processor in P3 state runs more slowly and uses less power than a processor running at P1 state. To operate at any P-state, the processor must be in the C0 state where the processor is working and not idling. The CPU P-states are also defined in the Advanced Configuration and Power Interface (ACPI) specification, see http://www.acpi.info/spec.htm.
C-states and P-states can vary independently of one another.
T-states refer to throttling the processor clock to lower frequencies in order to reduce thermal effects. This means that the CPU is forced to be idle a fixed percentage of its cycles per second. Throttling states range from T1 (the CPU has no forced idle cycles) to Tn, with the percentage of idle cycles increasing the greater n is.
This differs from changing the frequency (which makes the CPU have fewer cycles per second), and from running in a C-state other than C1. Note that throttling does not reduce voltage and since the CPU is forced to idle part of the time, processes will take longer to finish and will consume more power instead of saving any power.
T-states are a concept from the times when dynamic frequency scaling and C-states did not exist. With the implementation of the latter, T-states are only useful if reducing thermal effects is the primary goal. Since T-states can interfere with C-states (preventing the CPU from reaching higher C-states), they can even increase power consumption in a modern CPU capable of C-states.
Processor performance states (P-states) and processor operating states (C-states) are the capability of a processor to switch between different supported operating frequencies and voltages to modulate power consumption.
To dynamically scale processor frequencies at runtime, you can use the CPUfreq infrastructure to set a static or dynamic power policy for the system. Its main components are the CPUfreq subsystem (providing a common interface to the various low-level technologies and high-level policies), the in-kernel governors (policy governors that can change the CPU frequency based on different criteria) and CPU-specific drivers that implement the technology for the specific processor. Apart from that, user-space daemons may be available.
The dynamic scaling of the clock speed helps to consume less power and generate less heat when not operating at full capacity.
You can think of the in-kernel governors as a sort of pre-configured power schemes for the CPU. The CPUfreq governors use P-states to change frequencies and lower power consumption. The dynamic governors can switch between CPU frequencies, based on CPU utilization to allow for power savings while not sacrificing performance. These governors also allow for some tuning so you can customize and change the frequency scaling.
The following governors are available with the CPUfreq subsystem:
Performance Governor
The CPU frequency is statically set to the highest possible for maximum performance. Consequently, saving power is not the focus of this governor.
Tuning options: The range of maximum frequencies available to the governor can be adjusted. For details, see Section 11.3.2, “Modifying Current Settings with cpufreq-set”.
Powersave Governor
The CPU frequency is statically set to the lowest possible. This can have severe impact on the performance, as the system will never rise above this frequency no matter how busy the processors are.
However, using this governor often does not lead to the expected power savings, as the highest savings can usually be achieved at idle through entering C-states. Because the powersave governor runs processes at the lowest frequency, they take longer to finish, which delays the point at which the system can enter any idle C-states.
Tuning options: The range of minimum frequencies available to the governor can be adjusted. For details, see Section 11.3.2, “Modifying Current Settings with cpufreq-set”.
On-demand Governor
The kernel implementation of a dynamic CPU frequency policy: The governor monitors the processor utilization. As soon as it exceeds a certain threshold, the governor will set the frequency to the highest available. If the utilization is less than the threshold, the next lowest frequency is used. If the system continues to be underutilized, the frequency is again reduced until the lowest available frequency is set.
Tuning options: The range of available frequencies, the rate at which the governor checks utilization, and the utilization threshold can be adjusted.
Conservative Governor
Similar to the on-demand implementation, this governor also dynamically adjusts frequencies based on processor utilization, except that it allows for a more gradual increase in power. If processor utilization exceeds a certain threshold, the governor does not immediately switch to the highest available frequency (as the on-demand governor does), but only to the next higher frequency available.
Tuning options: The range of available frequencies, the rate at which the governor checks utilization, the utilization thresholds, and the frequency step rate can be adjusted.
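The difference between the two dynamic governors can be illustrated with a small model. The following sketch is not the kernel code. It only mimics the behavior described above with a single threshold, and it assumes that the conservative governor steps down in the same way as the on-demand governor. The function name `next_freq` and its argument order are hypothetical.

```shell
# next_freq GOVERNOR UTIL THRESHOLD CURRENT FREQ...
# Simplified model of one governor decision. The FREQ list must be in
# ascending order and CURRENT must be one of the listed values.
next_freq() {
    governor=$1; util=$2; threshold=$3; current=$4
    shift 4
    lower=""; higher=""; highest=""
    for f in "$@"; do
        highest=$f
        # Last value below CURRENT is the next lower step.
        if [ "$f" -lt "$current" ]; then lower=$f; fi
        # First value above CURRENT is the next higher step.
        if [ "$f" -gt "$current" ] && [ -z "$higher" ]; then higher=$f; fi
    done
    if [ "$util" -gt "$threshold" ]; then
        case $governor in
            ondemand)     echo "$highest" ;;            # jump to the top
            conservative) echo "${higher:-$current}" ;; # one step up
            *)            return 1 ;;
        esac
    else
        echo "${lower:-$current}"                       # one step down
    fi
}
```

With the frequencies 1000, 2000 and 3000 and a threshold of 80, a processor running at 1000 with 90 percent utilization moves to 3000 under `ondemand`, but only to 2000 under `conservative`.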
If the CPUfreq subsystem is enabled on your system (which it is by default with SUSE Linux Enterprise Server), you can find the relevant files and directories under /sys/devices/system/cpu/. If you list the contents of this directory, you will find a cpu{0..x} subdirectory for each processor, and several other files and directories. Each processor directory contains a cpufreq subdirectory, holding a number of files and directories that define the parameters for CPUfreq. Some of them are writable (for root), some of them are read-only. If your system currently uses the on-demand or conservative governor, you will see a separate subdirectory for that governor in cpufreq, containing the parameters for the governor.
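The files in the cpufreq subdirectories can be read with ordinary shell tools. The following sketch prints the active governor and the frequency limits for each processor. It assumes the file names `scaling_governor`, `scaling_min_freq` and `scaling_max_freq` (values in kHz), which are not listed in this chapter, and the function name `show_cpufreq` is made up for illustration.

```shell
# show_cpufreq [CPU_BASE]
# For every cpuN directory that has a cpufreq subdirectory, print the
# CPU number, the active governor and the allowed frequency range.
show_cpufreq() {
    base=${1:-/sys/devices/system/cpu}
    for cpu in "$base"/cpu[0-9]*; do
        policy=$cpu/cpufreq
        # Skip directories without CPUfreq information.
        [ -r "$policy/scaling_governor" ] || continue
        printf 'cpu%s %s %s-%s kHz\n' "${cpu##*/cpu}" \
            "$(cat "$policy/scaling_governor")" \
            "$(cat "$policy/scaling_min_freq")" \
            "$(cat "$policy/scaling_max_freq")"
    done
}
```

Because the function only reads files, it can be run as a normal user.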
Different Processor Settings | |
---|---|
The settings under the |
The CPUfreq subsystem offers several tuning options for P-states: You can switch between the different governors or change individual governor parameters.
Though you can view or adjust the current settings manually (in /sys/devices/system/cpu/cpufreq or in /sys/devices/system/cpu/cpu*/cpufreq for machines with multiple cores), we advise using the tools provided by cpufrequtils for that. After you have installed the cpufrequtils package, you can make use of the cpufreq-info and cpufreq-set command line tools as described below.
The cpufreq-info command helps you to retrieve CPUfreq kernel information. Run without any options, it collects the information available for your system and shows an output similar to the following:
cpufrequtils 004: cpufreq-info (C) Dominik Brodowski 2004-2006
Report errors and bugs to http://bugs.opensuse.org, please.
analyzing CPU 0:
  driver: acpi-cpufreq
  CPUs which need to switch frequency at the same time: 0
  hardware limits: 2.80 GHz - 3.40 GHz
  available frequency steps: 3.40 GHz, 2.80 GHz
  available cpufreq governors: conservative, userspace, powersave, ondemand, performance
  current policy: frequency should be within 2.80 GHz and 3.40 GHz.
                  The governor "performance" may decide which speed to use
                  within this range.
  current CPU frequency is 3.40 GHz.
analyzing CPU 1:
  driver: acpi-cpufreq
  CPUs which need to switch frequency at the same time: 1
  hardware limits: 2.80 GHz - 3.40 GHz
  available frequency steps: 3.40 GHz, 2.80 GHz
  available cpufreq governors: conservative, userspace, powersave, ondemand, performance
  current policy: frequency should be within 2.80 GHz and 3.40 GHz.
                  The governor "performance" may decide which speed to use
                  within this range.
  current CPU frequency is 3.40 GHz.
Using the appropriate options, you can view the current CPU frequency, the minimum and maximum CPU frequency allowed, show the currently used CPUfreq policy, the available CPUfreq governors, or determine the CPUfreq kernel driver used. For more details and the available options, refer to the cpufreq-info man page or run cpufreq-info --help.
To modify CPUfreq settings, use the cpufreq-set command as root. It allows you to set the minimum or maximum CPU frequency the governor may select, or to select a different governor. With the -c option, you can also specify the processor for which the settings should be modified. That makes it easy to use a consistent policy across all processors without adjusting the settings for each processor individually. For more details and the available options, refer to the cpufreq-set man page or run cpufreq-set --help.
You can switch to another governor at runtime with the -g option. For example, the following command will activate the on-demand governor:
cpufreq-set -g ondemand
If you want the change in the governor to persist after a reboot or shutdown, use the pm-profiler as described in Section 11.5, “Creating and Using Power Management Profiles”.
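To apply the same governor to every processor, the -c and -g options can be combined in a loop. The following sketch only prints the commands instead of executing them, so that they can be reviewed first. It assumes that each processor appears as a cpuN directory under /sys/devices/system/cpu, and the function name `governor_commands` is hypothetical.

```shell
# governor_commands GOVERNOR [CPU_BASE]
# Print one cpufreq-set call per processor directory found.
# Nothing is changed on the system; the output is a dry run.
governor_commands() {
    governor=$1
    base=${2:-/sys/devices/system/cpu}
    for cpu in "$base"/cpu[0-9]*; do
        # Skip if the glob did not match any processor directory.
        [ -d "$cpu" ] || continue
        echo "cpufreq-set -c ${cpu##*/cpu} -g $governor"
    done
}
```

If the printed commands look correct, they can be executed as root, for example with `governor_commands ondemand | sh`.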
Apart from the governor settings that can be influenced with cpufreq-set (like minimum or maximum CPU frequency to be used), you can also tune further governor parameters manually, for example, Ignoring Nice Values in Processor Utilization.
Another parameter that significantly impacts the performance loss caused by dynamic frequency scaling is the sampling rate (rate at which the governor checks the current CPU load and adjusts the processor's frequency accordingly). Its default value depends on a BIOS value and it should be as low as possible. However, in modern systems, an appropriate sampling rate is set by default and does not need manual intervention.
Procedure 11.1. Ignoring Nice Values in Processor Utilization
One parameter you might want to change for the on-demand or conservative governor is ignore_nice_load.
Each process has a niceness value associated with it. This value is used by the kernel to determine which processes require more processor time than others. The higher the nice value, the lower the priority of the process. Or: the “nicer” a process, the less CPU it will try to take from other processes.
If the ignore_nice_load parameter for the on-demand or conservative governor is set to 1, any processes with a nice value will not be counted toward the overall processor utilization. When ignore_nice_load is set to 0 (default value), all processes are counted toward the utilization. Adjusting this parameter can be useful if you are running something that requires a lot of processor capacity but you do not care about the runtime.
Change to the subdirectory of the governor whose settings you want to modify, for example:
cd /sys/devices/system/cpu/cpu0/cpufreq/conservative/
Show the current value of ignore_nice_load with:
cat ignore_nice_load
To set the value to 1, execute:
echo 1 > ignore_nice_load
When setting the ignore_nice_load value for cpu0, the same value is automatically used for all cores. In this case, you do not need to repeat the steps above for each of the processors where you want to modify this governor parameter.
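The steps of the procedure above can be wrapped in a small helper that validates the value and reports the old and new setting. This is a sketch only: the function name `set_ignore_nice` is hypothetical, and the default directory assumes that the conservative governor is active.

```shell
# set_ignore_nice VALUE [GOVERNOR_DIR]
# Write 0 or 1 to ignore_nice_load and show the value before and after.
set_ignore_nice() {
    value=$1
    dir=${2:-/sys/devices/system/cpu/cpu0/cpufreq/conservative}
    file=$dir/ignore_nice_load
    case $value in
        0|1) ;;
        *) echo "value must be 0 or 1" >&2; return 2 ;;
    esac
    # The file only exists while the matching governor is in use,
    # and it is writable for root only.
    if [ ! -w "$file" ]; then
        echo "cannot write $file" >&2
        return 1
    fi
    echo "old value: $(cat "$file")"
    echo "$value" > "$file"
    echo "new value: $(cat "$file")"
}
```

For the on-demand governor, pass its directory as the second argument instead.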
By default, openSUSE uses C-states appropriately. The only parameter you might want to touch for optimization is the sched_mc_power_savings scheduler parameter. Instead of distributing a workload across all cores with the effect that all cores are utilized only at a minimum level, the kernel can try to schedule processes on as few cores as possible so that the others can go idle. This helps to save power as it allows some processors to be idle for a longer time so they can reach a higher C-state. However, the actual savings depend on a number of factors, for example how many processors are available and which C-states are supported by them (especially deeper ones such as C3 to C6).
If sched_mc_power_savings is set to 0 (default value), no special scheduling is done. If it is set to 1, the scheduler tries to consolidate the work onto the fewest number of processors possible in the case that all processors are a little busy.
To modify this parameter, proceed as follows:
Procedure 11.2. Scheduling Processes on Cores
Change to the subdirectory where the scheduler is located:
cd /sys/devices/system/cpu/
Show the current value of sched_mc_power_savings with:
cat sched_mc_power_savings
To set the value to 1, execute:
echo 1 > sched_mc_power_savings
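Before changing the parameter, it can be useful to check whether the file is present at all, since not every kernel provides it. The following sketch reports the current state in words. The function name `sched_mc_status` and the output strings are made up for illustration, and any value other than 0 is treated as consolidation being enabled.

```shell
# sched_mc_status [CPU_BASE]
# Report the state of sched_mc_power_savings in a readable form.
sched_mc_status() {
    file=${1:-/sys/devices/system/cpu}/sched_mc_power_savings
    if [ ! -r "$file" ]; then
        # The running kernel does not offer this tunable.
        echo "unavailable"
        return 1
    fi
    case $(cat "$file") in
        0) echo "off" ;;            # default: no special scheduling
        *) echo "consolidating" ;;  # work is packed onto fewer cores
    esac
}
```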
openSUSE includes pm-profiler, intended for server use. It is a script infrastructure to enable or disable certain power management functions via configuration files. It allows you to define different profiles, each with its own configuration file. A configuration template for new profiles can be found at /usr/share/doc/packages/pm-profiler/config.template. The template contains a number of parameters you can use for your profile, including comments on usage and links to further documentation. The individual profiles are stored in /etc/pm-profiler/. The profile that will be activated on system start is defined in /etc/pm-profiler.conf.
Procedure 11.3. Creating and Switching Power Profiles
To create a new profile, proceed as follows:
Create a directory in /etc/pm-profiler/, containing the profile name, for example:
mkdir /etc/pm-profiler/testprofile
To create the configuration file for the new profile, copy the profile template to the newly created directory:
cp /usr/share/doc/packages/pm-profiler/config.template /etc/pm-profiler/testprofile/config
Edit the settings in /etc/pm-profiler/testprofile/config and save the file. You can also remove variables that you do not need. They are handled like empty variables, which means the corresponding settings are not touched at all.
Edit /etc/pm-profiler.conf. The PM_PROFILER_PROFILE variable defines which profile will be activated on system start. If it has no value, the default system or kernel settings will be used. To set the newly created profile:
PM_PROFILER_PROFILE="testprofile"
The profile name you enter here must match the name you used in the path to the profile configuration file (/etc/pm-profiler/testprofile/config), not necessarily the NAME you used for the profile in the /etc/pm-profiler/testprofile/config.
To activate the profile, run
rcpm-profiler start
or
/usr/lib/pm-profiler/enable-profile testprofile
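The first two steps of the procedure above (creating the directory and copying the template) can be combined in a helper that refuses to overwrite an existing profile. This is a sketch under the paths named in this section. The function name `create_profile` and its optional arguments are hypothetical.

```shell
# create_profile NAME [TEMPLATE] [PROFILE_ROOT]
# Create PROFILE_ROOT/NAME/config from the template unless it exists.
create_profile() {
    name=$1
    template=${2:-/usr/share/doc/packages/pm-profiler/config.template}
    root=${3:-/etc/pm-profiler}
    if [ -z "$name" ]; then
        echo "usage: create_profile NAME [TEMPLATE] [PROFILE_ROOT]" >&2
        return 2
    fi
    # Never overwrite a profile that has already been edited.
    if [ -e "$root/$name/config" ]; then
        echo "profile $name already exists" >&2
        return 1
    fi
    mkdir -p "$root/$name" && cp "$template" "$root/$name/config"
}
```

After `create_profile testprofile` has been run as root, continue with editing the new configuration file as described in the procedure.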
Though you have to manually create or modify a profile by editing the respective profile configuration file, you can use YaST to switch between different profiles. Start the YaST power management module, for example by executing yast2 power-management as root on a command line. The drop-down list shows the available profiles. Default means that the system default settings will be kept. Select the profile to use and confirm your choice.
A useful tool for monitoring system power consumption is powerTOP. It helps you to identify the reasons for unnecessarily high power consumption (for example, processes that are mainly responsible for waking up a processor from its idle state) and to optimize your system settings to avoid these. It supports both Intel and AMD processors. The powertop package is available from the SUSE Linux Enterprise SDK. For information on how to access the SDK, refer to About This Guide.
powerTOP combines various sources of information (analysis of programs, device drivers, kernel options, amounts and sources of interrupts waking up processors from sleep states) and shows them in one screen. Example 11.1, “Example powerTOP Output” shows which information categories are available:
Example 11.1. Example powerTOP Output
Cn                  Avg residency       P-states (frequencies)
C0 (cpu running)          (11.6%)       2.00 Ghz     0.1%
polling             0.0ms ( 0.0%)       2.00 Ghz     0.0%
C1                  4.4ms (57.3%)       1.87 Ghz     0.0%
C2                 10.0ms (31.1%)       1064 Mhz    99.9%

Wakeups-from-idle per second : 11.2     interval: 5.0s
no ACPI power usage estimate available

Top causes for wakeups:
  96.2% (826.0)   <interrupt> : extra timer interrupt
   0.9% (  8.0)   <kernel core> : usb_hcd_poll_rh_status (rh_timer_func)
   0.3% (  2.4)   <interrupt> : megasas
   0.2% (  2.0)   <kernel core> : clocksource_watchdog (clocksource_watchdog)
   0.2% (  1.6)   <interrupt> : eth1-TxRx-0
   0.1% (  1.0)   <interrupt> : eth1-TxRx-4
[...]

Suggestion: Enable SATA ALPM link power management via:
echo min_power > /sys/class/scsi_host/host0/link_power_management_policy
or press the S key.
The column shows the C-states. When working, the CPU is in state
The column shows average time in milliseconds spent in the particular C-state.
The column shows the percentages of time spent in various C-states. For considerable power savings during idle, the CPU should be in deeper C-states most of the time. In addition, the longer the average time spent in these C-states, the more power is saved.
The column shows the frequencies the processor and kernel driver support on your system.
The column shows the amount of time the CPU cores stayed in different frequencies during the measuring period.
Shows how often the CPU is awoken per second (number of interrupts). The lower the number the better. The
When running powerTOP on a laptop, this line displays the ACPI information on how much power is currently being used and the estimated time until discharge of the battery. On servers, this information is not available.
Shows what is causing the system to be more active than needed. powerTOP displays the top items causing your CPU to awake during the sampling period.
Suggestions on how to improve power usage for this machine.
For more information, refer to the powerTOP project page at http://www.lesswatts.org/projects/powertop/. It also provides tips and tricks and an informative FAQ section.
In order to make use of C-states or P-states, check your BIOS options:
To use C-states, make sure to enable CPU C State or similar options to benefit from power savings at idle.
To use P-states and the CPUfreq governors, make sure to enable Processor Performance States options or similar.
In case of a CPU upgrade, make sure to upgrade your BIOS, too. The BIOS needs to know the new CPU and its valid frequency steps in order to pass this information on to the operating system.
In openSUSE, the CPUfreq subsystem is enabled by default. To find out if the subsystem is currently enabled, check for the following path in your system: /sys/devices/system/cpu/cpufreq (or /sys/devices/system/cpu/cpu*/cpufreq for machines with multiple cores). If the cpufreq subdirectory exists, the subsystem is enabled.
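The check described above can be scripted, which is convenient when several machines have to be verified. The following is a minimal sketch. The function name `cpufreq_enabled` is hypothetical, and the optional argument exists only so that a different base directory can be tested.

```shell
# cpufreq_enabled [CPU_BASE]
# Return 0 if a cpufreq directory exists at the top level or in any
# cpuN directory, otherwise return 1.
cpufreq_enabled() {
    base=${1:-/sys/devices/system/cpu}
    [ -d "$base/cpufreq" ] && return 0
    for dir in "$base"/cpu[0-9]*/cpufreq; do
        [ -d "$dir" ] && return 0
    done
    return 1
}
```

A typical call would be `cpufreq_enabled && echo "CPUfreq is enabled"`.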
Check syslog (usually /var/log/messages) for any output regarding the CPUfreq subsystem. Only severe errors are reported there.
If you suspect problems with the CPUfreq subsystem on your machine, you can also enable additional debug output. To do so, either use cpufreq.debug=7 as boot parameter or execute the following command as root:
echo 7 > /sys/module/cpufreq/parameters/debug
This will cause CPUfreq to log more information to dmesg on state transitions, which is useful for diagnosis. But as this additional output of kernel messages can be rather comprehensive, use it only if you are fairly sure that a problem exists.
A comprehensive three-part article about tuning components with regard to power efficiency is available at the following URLs:
Reduce Linux power consumption, Part 1: The CPUfreq subsystem, available at http://www.ibm.com/developerworks/linux/library/l-cpufreq-1/?ca=dgr-lnxw03ReduceLXPWR-P1dth-LX&S_TACT=105AGX59&S_CMP=grlnxw03
Reduce Linux power consumption, Part 2: General and governor-specific settings, available at http://www.ibm.com/developerworks/linux/library/l-cpufreq-2/?ca=dgr-lnxw03ReduceLXPWR-P1dth-LX&S_TACT=105AGX59&S_CMP=grlnxw03
Reduce Linux power consumption, Part 3: Tuning results, available at http://www.ibm.com/developerworks/linux/library/l-cpufreq-3/?ca=dgr-lnxw03ReduceLXPWR-P1dth-LX&S_TACT=105AGX59&S_CMP=grlnxw03
The LessWatts.org project deals with how to save power, reduce costs and increase efficiency on Linux systems. Find the project home page at http://www.lesswatts.org/. The project page also holds an informative FAQs section at http://www.lesswatts.org/documentation/faq/index.php and provides useful tips and tricks. For tips dealing with the CPU level, refer to http://www.lesswatts.org/tips/cpu.php. For more information about powerTOP, refer to http://www.lesswatts.org/projects/powertop/.
There is also platform-specific power saving information available, for example: HP ProLiant Server Power Management on SUSE Linux Enterprise Server 11—Integration Note , available from http://h18004.www1.hp.com/products/servers/technology/whitepapers/os-techwp.html