.\" Copyright (c) 2008 Silicon Graphics, Inc.
.\"
.\" Author: Paul Jackson (http://oss.sgi.com/projects/cpusets)
.\"
.\" This is free documentation; you can redistribute it and/or
.\" modify it under the terms of the GNU General Public License
.\" version 2 as published by the Free Software Foundation.
.\"
.\" The GNU General Public License's references to "object code"
.\" and "executables" are to be interpreted as the output of any
.\" document formatting or typesetting system, including
.\" intermediate and printed output.
.\"
.\" This manual is distributed in the hope that it will be useful,
.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
.\" GNU General Public License for more details.
.\"
.\" You should have received a copy of the GNU General Public
.\" License along with this manual; if not, write to the Free
.\" Software Foundation, Inc., 59 Temple Place, Suite 330, Boston,
.\" MA 02111, USA.
.\"
.TH CPUSET 7 2008-07-03 "Linux" "Linux Programmer's Manual"
.SH NAME
cpuset \- confine processes to processor and memory node subsets
.SH DESCRIPTION
The cpuset file system is a pseudo-file-system interface
to the kernel cpuset mechanism,
which is used to control the processor placement
and memory placement of processes.
It is commonly mounted at
.IR /dev/cpuset .
.PP
On systems with kernels compiled with built-in support for cpusets,
all processes are attached to a cpuset, and cpusets are always present.
If a system supports cpusets, then it will have the entry
.B nodev cpuset
in the file
.IR /proc/filesystems .
By mounting the cpuset file system (see the
.B EXAMPLE
section below),
the administrator can configure the cpusets on a system
to control the processor and memory placement of processes
on that system.
By default, if the cpuset configuration
on a system is not modified or if the cpuset file
system is not even mounted, then the cpuset mechanism,
though present, has no effect on the system's behavior.
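.PP
For example, kernel support can be checked, and the file system then mounted, with shell commands like the following (a sketch; the mount step requires root privilege):

```shell
# Check /proc/filesystems for cpuset support; if the kernel has it,
# the cpuset file system can then be mounted (as root) at the
# conventional mount point.
if grep -qw cpuset /proc/filesystems; then
    echo "kernel supports cpusets"
    # As root:
    #   mkdir -p /dev/cpuset
    #   mount -t cpuset cpuset /dev/cpuset
else
    echo "no cpuset support in this kernel"
fi
```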
.PP
A cpuset defines a list of CPUs and memory nodes.
.PP
The CPUs of a system include all the logical processing
units on which a process can execute, including, if present,
multiple processor cores within a package and Hyper-Threads
within a processor core.
Memory nodes include all distinct
banks of main memory; small and SMP systems typically have
just one memory node that contains all the system's main memory,
while NUMA (non-uniform memory access) systems have multiple memory nodes.
.PP
Cpusets are represented as directories in a hierarchical
pseudo-file system, where the top directory in the hierarchy
.RI ( /dev/cpuset )
represents the entire system (all online CPUs and memory nodes)
and any cpuset that is the child (descendant) of
another parent cpuset contains a subset of that parent's
CPUs and memory nodes.
The directories and files representing cpusets have normal
file-system permissions.
.PP
Every process in the system belongs to exactly one cpuset.
A process is confined to run only on the CPUs in
the cpuset it belongs to, and to allocate memory only
on the memory nodes in that cpuset.
When a process
.BR fork (2)s,
the child process is placed in the same cpuset as its parent.
With sufficient privilege, a process may be moved from one
cpuset to another and the allowed CPUs and memory nodes
of an existing cpuset may be changed.
.PP
When the system begins booting, a single cpuset is
defined that includes all CPUs and memory nodes on the
system, and all processes are in that cpuset.
During the boot process, or later during normal system operation,
other cpusets may be created, as subdirectories of this top cpuset,
under the control of the system administrator,
and processes may be placed in these other cpusets.
.PP
Cpusets are integrated with the
.BR sched_setaffinity (2)
scheduling affinity mechanism and the
.BR mbind (2)
and
.BR set_mempolicy (2)
memory-placement mechanisms in the kernel.
None of these mechanisms lets a process make use
of a CPU or memory node that is not allowed by that process's cpuset.
If changes to a process's cpuset placement conflict with these
other mechanisms, then cpuset placement is enforced
even if it means overriding these other mechanisms.
The kernel accomplishes this overriding by silently
restricting the CPUs and memory nodes requested by
these other mechanisms to those allowed by the
invoking process's cpuset.
This can result in these
other calls returning an error if, for example, such
a call ends up requesting an empty set of CPUs or
memory nodes after that request is restricted to
the invoking process's cpuset.
.PP
Typically, a cpuset is used to manage
the CPU and memory-node confinement for a set of
cooperating processes such as a batch scheduler job, and these
other mechanisms are used to manage the placement of
individual processes or memory regions within that set or job.
.SH FILES
Each directory below
.I /dev/cpuset
represents a cpuset and contains a fixed set of pseudo-files
describing the state of that cpuset.
.PP
New cpusets are created using the
.BR mkdir (2)
system call or the
.BR mkdir (1)
command.
The properties of a cpuset, such as its flags, allowed
CPUs and memory nodes, and attached processes, are queried and modified
by reading or writing to the appropriate file in that cpuset's directory,
as listed below.
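.PP
The create-query-modify cycle can be sketched as follows; to stay runnable without root, this example uses a scratch directory standing in for a mounted cpuset file system (on a real system, the pseudo-files under each cpuset directory are created by the kernel, and the cpuset name "jobs1" is hypothetical):

```shell
# Scratch directory standing in for /dev/cpuset (a real cpuset file
# system would require root privilege and kernel support).
root=$(mktemp -d)

mkdir "$root/jobs1"                # create a cpuset
echo "1-3" > "$root/jobs1/cpus"    # set its allowed CPUs
echo "0" > "$root/jobs1/mems"      # set its allowed memory nodes
cat "$root/jobs1/cpus"             # query the allowed CPUs: prints 1-3

rm -r "$root"                      # clean up the scratch directory
```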
.PP
The pseudo-files in each cpuset directory are automatically created when
the cpuset is created, as a result of the
.BR mkdir (2)
invocation.
It is not possible to directly add or remove these pseudo-files.
.PP
A cpuset directory that contains no child cpuset directories,
and has no attached processes, can be removed using
.BR rmdir (2)
or
.BR rmdir (1).
It is not necessary, or possible,
to remove the pseudo-files inside the directory before removing it.
.PP
The pseudo-files in each cpuset directory are
small text files that may be read and
written using traditional shell utilities such as
.BR cat (1)
and
.BR echo (1),
or from a program by using file I/O library functions or system calls,
such as
.BR open (2),
.BR read (2),
.BR write (2),
and
.BR close (2).
.PP
The pseudo-files in a cpuset directory represent internal kernel
state and do not have any persistent image on disk.
Each of these per-cpuset files is listed and described below.
.\" ====================== tasks ======================
.TP
.I tasks
List of the process IDs (PIDs) of the processes in that cpuset.
The list is formatted as a series of ASCII
decimal numbers, each followed by a newline.
A process may be added to a cpuset (automatically removing
it from the cpuset that previously contained it) by writing its
PID to that cpuset's
.I tasks
file (with or without a trailing newline).

.B Warning:
only one PID may be written to the
.I tasks
file at a time.
If a string is written that contains more
than one PID, only the first one will be used.
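.PP
Consequently, attaching several processes requires one write per PID, as in this sketch (a scratch file stands in for a real tasks file, and the PIDs are hypothetical):

```shell
# Attach several (hypothetical) PIDs: one write each.  On a real
# cpuset the kernel accumulates the attached tasks; this scratch
# stand-in simply ends up holding the last PID written.
dir=$(mktemp -d)
for pid in 1234 1235 1236; do
    echo "$pid" > "$dir/tasks"
done
cat "$dir/tasks"     # prints 1236 on the stand-in
rm -r "$dir"
```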
.\" =================== notify_on_release ===================
.TP
.I notify_on_release
Flag (0 or 1).
If set (1), that cpuset will receive special handling
after it is released, that is, after all processes cease using
it (i.e., terminate or are moved to a different cpuset)
and all child cpuset directories have been removed.
See the \fBNotify On Release\fR section, below.
.\" ====================== cpus ======================
.TP
.I cpus
List of the physical numbers of the CPUs on which processes
in that cpuset are allowed to execute.
See \fBList Format\fR below for a description of the
format of
.IR cpus .

The CPUs allowed to a cpuset may be changed by
writing a new list to its
.I cpus
file.
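.PP
In \fBList Format\fR, a value such as "0-4,9" denotes CPUs 0 through 4 plus CPU 9.
A small helper to expand such a list (the helper name and approach are illustrative only, not part of the cpuset interface):

```shell
# Expand a cpuset List Format string into one CPU number per line.
expand_list() {
    echo "$1" | tr ',' '\n' | while IFS=- read -r lo hi; do
        seq "$lo" "${hi:-$lo}"   # a bare number is a one-element range
    done
}
expand_list "0-4,9"    # prints 0 1 2 3 4 9, one number per line
```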
.\" ==================== cpu_exclusive ====================
.TP
.I cpu_exclusive
Flag (0 or 1).
If set (1), the cpuset has exclusive use of
its CPUs (no sibling or cousin cpuset may overlap CPUs).
By default this is off (0).
Newly created cpusets also initially default this to off (0).

Two cpusets are
.I sibling
cpusets if they share the same parent cpuset in the
.I /dev/cpuset
hierarchy.
Two cpusets are
.I cousin
cpusets if neither is the ancestor of the other.
Regardless of the
.I cpu_exclusive
setting, if one cpuset is the ancestor of another,
and if both of these cpusets have non-empty
.IR cpus ,
then their
.I cpus
must overlap, because the
.I cpus
of any cpuset are always a subset of the
.I cpus
of its parent cpuset.
.\" ====================== mems ======================
.TP
.I mems
List of memory nodes on which processes in this cpuset are
allowed to allocate memory.
See \fBList Format\fR below for a description of the
format of
.IR mems .
.\" ==================== mem_exclusive ====================
.TP
.I mem_exclusive
Flag (0 or 1).
If set (1), the cpuset has exclusive use of
its memory nodes (no sibling or cousin may overlap).
Also if set (1), the cpuset is a \fBHardwall\fR cpuset (see below).
By default this is off (0).
Newly created cpusets also initially default this to off (0).

Regardless of the
.I mem_exclusive
setting, if one cpuset is the ancestor of another,
then their memory nodes must overlap, because the memory
nodes of any cpuset are always a subset of the memory nodes
of that cpuset's parent cpuset.
.\" ==================== mem_hardwall ====================
.TP
.IR mem_hardwall " (since Linux 2.6.26)"
Flag (0 or 1).
If set (1), the cpuset is a \fBHardwall\fR cpuset (see below).
Unlike \fBmem_exclusive\fR,
there is no constraint on whether cpusets
marked \fBmem_hardwall\fR may have overlapping
memory nodes with sibling or cousin cpusets.
By default this is off (0).
Newly created cpusets also initially default this to off (0).
.\" ==================== memory_migrate ====================
.TP
.IR memory_migrate " (since Linux 2.6.16)"
Flag (0 or 1).
If set (1), then memory migration is enabled.
By default this is off (0).
See the \fBMemory Migration\fR section, below.
.\" ==================== memory_pressure ====================
.TP
.IR memory_pressure " (since Linux 2.6.16)"
A measure of how much memory pressure the processes in this
cpuset are causing.
See the \fBMemory Pressure\fR section, below.
Unless
.I memory_pressure_enabled
is enabled, this file always has the value zero (0).
This file is read-only.
See the
.B WARNINGS
section, below.
.\" ================= memory_pressure_enabled =================
.TP
.IR memory_pressure_enabled " (since Linux 2.6.16)"
Flag (0 or 1).
This file is present only in the root cpuset, normally
.IR /dev/cpuset .
If set (1), the
.I memory_pressure
calculations are enabled for all cpusets in the system.
By default this is off (0).
See the
\fBMemory Pressure\fR section, below.
.\" ================== memory_spread_page ==================
.TP
.IR memory_spread_page " (since Linux 2.6.17)"
Flag (0 or 1).
If set (1), pages in the kernel page cache
(file-system buffers) are uniformly spread across the cpuset.
By default this is off (0) in the top cpuset,
and inherited from the parent cpuset in
newly created cpusets.
See the \fBMemory Spread\fR section, below.
.\" ================== memory_spread_slab ==================
.TP
.IR memory_spread_slab " (since Linux 2.6.17)"
Flag (0 or 1).
If set (1), the kernel slab caches
for file I/O (directory and inode structures) are
uniformly spread across the cpuset.
By default this is off (0) in the top cpuset,
and inherited from the parent cpuset in
newly created cpusets.
See the \fBMemory Spread\fR section, below.
.\" ================== sched_load_balance ==================
.TP
.IR sched_load_balance " (since Linux 2.6.24)"
Flag (0 or 1).
If set (1, the default), the kernel will
automatically load balance processes in that cpuset over
the allowed CPUs in that cpuset.
If cleared (0), the
kernel will avoid load balancing processes in this cpuset,
.I unless
some other cpuset with overlapping CPUs has its
.I sched_load_balance
flag set.
See \fBScheduler Load Balancing\fR, below, for further details.
.\" ================== sched_relax_domain_level ==================
.TP
.IR sched_relax_domain_level " (since Linux 2.6.26)"
Integer, between \-1 and a small positive value.
The
.I sched_relax_domain_level
controls the width of the range of CPUs over which the kernel scheduler
performs immediate rebalancing of runnable tasks across CPUs.
If
.I sched_load_balance
is disabled, then the setting of
.I sched_relax_domain_level
does not matter, as no such load balancing is done.
If
.I sched_load_balance
is enabled, then the higher the value of the
.IR sched_relax_domain_level ,
the wider
the range of CPUs over which immediate load balancing is attempted.
See \fBScheduler Relax Domain Level\fR, below, for further details.
.\" ================== proc cpuset ==================
.PP
In addition to the above pseudo-files in each directory below
.IR /dev/cpuset ,
each process has a pseudo-file,
.IR /proc/<pid>/cpuset ,
that displays the path of the process's cpuset directory
relative to the root of the cpuset file system.
.\" ================== proc status ==================
.PP
Also the
.I /proc/<pid>/status
file for each process has four added lines,
displaying the process's
.I Cpus_allowed
(on which CPUs it may be scheduled) and
.I Mems_allowed
(on which memory nodes it may obtain memory),
in the two formats \fBMask Format\fR and \fBList Format\fR (see below),
as shown in the following example:
.PP
.RS
.nf
Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
Cpus_allowed_list:      0-127
Mems_allowed:   ffffffff,ffffffff
Mems_allowed_list:      0-63
.fi
.RE
.PP
The "allowed" fields were added in Linux 2.6.24;
the "allowed_list" fields were added in Linux 2.6.26.
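.PP
For example, the four lines for the current shell can be displayed with:

```shell
# Show the calling process's allowed CPUs and memory nodes in both
# Mask Format and List Format (the *_list lines require Linux 2.6.26
# or later).
grep -E '^(Cpus|Mems)_allowed' /proc/self/status
```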
.\" ================== EXTENDED CAPABILITIES ==================
.SH EXTENDED CAPABILITIES
In addition to controlling which
.I cpus
and
.I mems
a process is allowed to use, cpusets provide the following
extended capabilities.
.\" ================== Exclusive Cpusets ==================
.SS Exclusive Cpusets
If a cpuset is marked
.I cpu_exclusive
or
.IR mem_exclusive ,
no other cpuset, other than a direct ancestor or descendant,
may share any of the same CPUs or memory nodes.
.PP
A cpuset that is
.I mem_exclusive
restricts kernel allocations for
buffer cache pages and other internal kernel data pages
commonly shared by the kernel across
multiple users.
All cpusets, whether
.I mem_exclusive
or not, restrict allocations of memory for user space.
This enables configuring a
system so that several independent jobs can share common kernel data,
while isolating each job's user allocation in
its own cpuset.
To do this, construct a large
.I mem_exclusive
cpuset to hold all the jobs, and construct child,
.RI non- mem_exclusive
cpusets for each individual job.
Only a small amount of kernel memory,
such as requests from interrupt handlers, is allowed to be
placed on memory nodes
outside even a
.I mem_exclusive
cpuset.
.\" ================== Hardwall ==================
.SS Hardwall
A cpuset that has
.I mem_exclusive
or
.I mem_hardwall
set is a
.I hardwall
cpuset.
A
.I hardwall
cpuset restricts kernel allocations for page, buffer,
and other data commonly shared by the kernel across multiple users.
All cpusets, whether
.I hardwall
or not, restrict allocations of memory for user space.
.PP
This enables configuring a system so that several independent
jobs can share common kernel data, such as file-system pages,
while isolating each job's user allocation in its own cpuset.
To do this, construct a large
.I hardwall
cpuset to hold
all the jobs, and construct child cpusets for each individual
job which are not
.I hardwall
cpusets.
.PP
Only a small amount of kernel memory, such as requests from
interrupt handlers, is allowed to be taken outside even a
.I hardwall
cpuset.
.\" ================== Notify On Release ==================
.SS Notify On Release
If the
.I notify_on_release
flag is enabled (1) in a cpuset,
then whenever the last process in the cpuset leaves
(exits or attaches to some other cpuset)
and the last child cpuset of that cpuset is removed,
the kernel will run the command
.IR /sbin/cpuset_release_agent ,
supplying the pathname (relative to the mount point of the
cpuset file system) of the abandoned cpuset.
This enables automatic removal of abandoned cpusets.
.PP
The default value of
.I notify_on_release
in the root cpuset at system boot is disabled (0).
The default value of other cpusets at creation
is the current value of their parent's
.I notify_on_release
setting.
.PP
The command
.I /sbin/cpuset_release_agent
is invoked, with the name
.RI ( /dev/cpuset
relative path)
of the to-be-released cpuset in
.IR argv[1] .
.PP
The usual contents of the command
.I /sbin/cpuset_release_agent
are simply the shell script:
.in +4n
.nf

#!/bin/sh
rmdir /dev/cpuset/$1
.fi
.in
.PP
As with other flag values below, this flag can
be changed by writing an ASCII
number 0 or 1 (with optional trailing newline)
into the file, to clear or set the flag, respectively.
.\" ================== Memory Pressure ==================
.SS Memory Pressure
The
.I memory_pressure
of a cpuset provides a simple per-cpuset running average of
the rate at which the processes in the cpuset are attempting to free up in-use
memory on the nodes of the cpuset to satisfy additional memory requests.
.PP
This enables batch managers that are monitoring jobs running in dedicated
cpusets to efficiently detect what level of memory pressure that job
is causing.
.PP
This is useful both on tightly managed systems running a wide mix of
submitted jobs, which may choose to terminate or reprioritize jobs that
are trying to use more memory than allowed on the nodes assigned to them,
and with tightly coupled, long-running, massively parallel scientific
computing jobs that will dramatically fail to meet required performance
goals if they start to use more memory than allowed to them.
.PP
This mechanism provides a very economical way for the batch manager
to monitor a cpuset for signs of memory pressure.
It's up to the batch manager or other user code to decide
what action to take if it detects signs of memory pressure.
.PP
Unless memory pressure calculation is enabled by setting the pseudo-file
.IR /dev/cpuset/memory_pressure_enabled ,
it is not computed for any cpuset, and reads from any
.I memory_pressure
file always return zero, as represented by the ASCII string "0\en".
See the \fBWARNINGS\fR section, below.
.PP
A per-cpuset, running average is employed for the following reasons:
.IP * 3
Because this meter is per-cpuset rather than per-process or per virtual
memory region, the system load imposed by a batch scheduler monitoring
this metric is sharply reduced on large systems, because a scan of
the task list can be avoided on each set of queries.
.IP *
Because this meter is a running average rather than an accumulating
counter, a batch scheduler can detect memory pressure with a
single read, instead of having to read and accumulate results
for a period of time.
.IP *
Because this meter is per-cpuset rather than per-process,
the batch scheduler can obtain the key information \(em memory
pressure in a cpuset \(em with a single read, rather than having to
query and accumulate results over all the (dynamically changing)
set of processes in the cpuset.
.PP
The
.I memory_pressure
of a cpuset is calculated using a per-cpuset simple digital filter
that is kept within the kernel.
For each cpuset, this filter tracks
the recent rate at which processes attached to that cpuset enter the
kernel direct reclaim code.
.PP
The kernel direct reclaim code is entered whenever a process has to
satisfy a memory page request by first finding some other page to
repurpose, due to lack of any readily available already free pages.
Dirty file-system pages are repurposed by first writing them
to disk.
Unmodified file-system buffer pages are repurposed
by simply dropping them, though if that page is needed again, it
will have to be re-read from disk.
.PP
The
.I memory_pressure
file provides an integer number representing the recent (half-life of
10 seconds) rate of entries to the direct reclaim code caused by any
process in the cpuset, in units of reclaims attempted per second,
times 1000.
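.PP
For example, under those units, a hypothetical reading of 4500 corresponds to an average of 4.5 direct-reclaim attempts per second over the recent (10-second half-life) past:

```shell
# Convert a memory_pressure reading (here a hypothetical value; on a
# real system it would come from a cpuset's memory_pressure file)
# from "reclaims per second, times 1000" back to reclaims per second.
reading=4500
awk -v r="$reading" 'BEGIN { printf "%.1f reclaims/sec\n", r / 1000 }'
```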
.\" ================== Memory Spread ==================
.SS Memory Spread
There are two Boolean flag files per cpuset that control where the
kernel allocates pages for the file-system buffers and related
in-kernel data structures.
They are called
.I memory_spread_page
and
.IR memory_spread_slab .
.PP
If the per-cpuset Boolean flag file
.I memory_spread_page
is set, then
the kernel will spread the file-system buffers (page cache) evenly
over all the nodes that the faulting process is allowed to use, instead
of preferring to put those pages on the node where the process is running.
.PP
If the per-cpuset Boolean flag file
.I memory_spread_slab
is set,
then the kernel will spread some file-system-related slab caches,
such as those for inodes and directory entries, evenly over all the nodes
that the faulting process is allowed to use, instead of preferring to
put those pages on the node where the process is running.
.PP
The setting of these flags does not affect the data segment
(see
.BR brk (2))
or stack segment pages of a process.
.PP
By default, both kinds of memory spreading are off and the kernel
prefers to allocate memory pages on the node local to where the
requesting process is running.
If that node is not allowed by the
process's NUMA memory policy or cpuset configuration, or if there are
insufficient free memory pages on that node, then the kernel looks
for the nearest node that is allowed and has sufficient free memory.
.PP
When new cpusets are created, they inherit the memory spread settings
of their parent.
.PP
Setting memory spreading causes allocations for the affected page or
slab caches to ignore the process's NUMA memory policy and be spread
instead.
However, the effect of these changes in memory placement
caused by cpuset-specified memory spreading is hidden from the
.BR mbind (2)
or
.BR set_mempolicy (2)
calls.
These two NUMA memory policy calls always appear to behave as if
no cpuset-specified memory spreading is in effect, even if it is.
If cpuset memory spreading is subsequently turned off, the NUMA
memory policy most recently specified by these calls is automatically
re-applied.
.PP
Both
.I memory_spread_page
and
.I memory_spread_slab
are Boolean flag files.
By default they contain "0", meaning that the feature is off
for that cpuset.
If a "1" is written to that file, that turns the named feature on.
.PP
Cpuset-specified memory spreading behaves similarly to what is known
(in other contexts) as round-robin or interleave memory placement.
.PP
Cpuset-specified memory spreading can provide substantial performance
improvements for jobs that:
.IP a) 3
need to place thread-local data on
memory nodes close to the CPUs which are running the threads that most
frequently access that data; but also
.IP b)
need to access large file-system data sets that must be spread
across the several nodes in the job's cpuset in order to fit.
.PP
Without this policy,
the memory allocation across the nodes in the job's cpuset
can become very uneven,
especially for jobs that might have just a single
thread initializing or reading in the data set.
.\" ================== Memory Migration ==================
.SS Memory Migration
Normally, under the default setting (disabled) of
.IR memory_migrate ,
once a page is allocated (given a physical page
of main memory), then that page stays on whatever node it
was allocated, so long as it remains allocated, even if the
cpuset's memory-placement policy
.I mems
subsequently changes.
.PP
When memory migration is enabled in a cpuset, if the
.I mems
setting of the cpuset is changed, then any memory page in use by any
process in the cpuset that is on a memory node that is no longer
allowed will be migrated to a memory node that is allowed.
.PP
Furthermore, if a process is moved into a cpuset with
.I memory_migrate
enabled, any memory pages it uses that were on memory nodes allowed
in its previous cpuset, but which are not allowed in its new cpuset,
will be migrated to a memory node allowed in the new cpuset.
.PP
The relative placement of a migrated page within
the cpuset is preserved during these migration operations if possible.
For example,
if the page was on the second valid node of the prior cpuset,
then the page will be placed on the second valid node of the new cpuset,
if possible.
.\" ================== Scheduler Load Balancing ==================
.SS Scheduler Load Balancing
The kernel scheduler automatically load balances processes.
If one CPU is underutilized,
the kernel will look for processes on other, more
overloaded CPUs and move those processes to the underutilized CPU,
within the constraints of such placement mechanisms as cpusets and
.BR sched_setaffinity (2).
.PP
The algorithmic cost of load balancing and its impact on key shared
kernel data structures such as the process list increases more than
linearly with the number of CPUs being balanced.
For example, it
costs more to load balance across one large set of CPUs than it does
to balance across two smaller sets of CPUs, each of half the size
of the larger set.
(The precise relationship between the number of CPUs being balanced
and the cost of load balancing depends
on implementation details of the kernel process scheduler, which is
subject to change over time, as improved kernel scheduler algorithms
are implemented.)
.PP
The per-cpuset flag
.I sched_load_balance
provides a mechanism to suppress this automatic scheduler load
balancing in cases where it is not needed and suppressing it would have
worthwhile performance benefits.
.PP
By default, load balancing is done across all CPUs, except those
marked isolated using the kernel boot time "isolcpus=" argument.
(See \fBScheduler Relax Domain Level\fR, below, to change this default.)
.PP
This default load balancing across all CPUs is not well suited to
the following two situations:
.IP * 3
On large systems, load balancing across many CPUs is expensive.
If the system is managed using cpusets to place independent jobs
on separate sets of CPUs, full load balancing is unnecessary.
.IP *
Systems supporting real-time scheduling on some CPUs need to minimize
system overhead on those CPUs, including avoiding process load
balancing if that is not needed.
.PP
When the per-cpuset flag
.I sched_load_balance
is enabled (the default setting),
it requests load balancing across
all the CPUs in that cpuset's allowed CPUs,
ensuring that load balancing can move a process (not otherwise pinned,
as by
.BR sched_setaffinity (2))
from any CPU in that cpuset to any other.
.PP
When the per-cpuset flag
.I sched_load_balance
is disabled, then the
scheduler will avoid load balancing across the CPUs in that cpuset,
\fIexcept\fR in so far as is necessary because some overlapping cpuset
has
.I sched_load_balance
enabled.
.PP
So, for example, if the top cpuset has the flag
.I sched_load_balance
enabled, then the scheduler will load balance across all
CPUs, and the setting of the
.I sched_load_balance
flag in other cpusets has no effect,
as we're already fully load balancing.
.PP
Therefore in the above two situations, the flag
.I sched_load_balance
should be disabled in the top cpuset, and only some of the smaller,
child cpusets would have this flag enabled.
.PP
When doing this, you don't usually want to leave any unpinned processes in
the top cpuset that might use nontrivial amounts of CPU, as such processes
may be artificially constrained to some subset of CPUs, depending on
the particulars of this flag setting in descendant cpusets.
Even if such a process could use spare CPU cycles in some other CPUs,
the kernel scheduler might not consider the possibility of
load balancing that process to the underused CPU.
.PP
Of course, processes pinned to a particular CPU can be left in a cpuset
that disables
.I sched_load_balance
as those processes aren't going anywhere else anyway.
.\" ================== Scheduler Relax Domain Level ==================
|
|
.SS Scheduler Relax Domain Level
|
|
The kernel scheduler performs immediate load balancing whenever
|
|
a CPU becomes free or another task becomes runnable.
|
|
This load
|
|
balancing works to ensure that as many CPUs as possible are usefully
|
|
employed running tasks.
|
|
The kernel also performs periodic load
|
|
balancing off the software clock described in
|
|
.IR time (7).
|
|
The setting of
|
|
.I sched_relax_domain_level
|
|
only applies to immediate load balancing.
|
|
Regardless of the
|
|
.I sched_relax_domain_level
|
|
setting, periodic load balancing is attempted over all CPUs
|
|
(unless disabled by turning off
|
|
.IR sched_load_balance .)
|
|
In any case, of course, tasks will only be scheduled to run on
|
|
CPUs allowed by their cpuset, as modified by
|
|
.BR sched_setaffinity (2)
|
|
system calls.
|
|
.PP
On small systems, such as those with just a few CPUs, immediate load
balancing is useful to improve system interactivity and to minimize
wasteful idle CPU cycles.
But on large systems, attempting immediate
load balancing across a large number of CPUs can be more costly than
it is worth, depending on the particular performance characteristics
of the job mix and the hardware.
.PP
The exact meaning of the small integer values of
.I sched_relax_domain_level
will depend on internal
implementation details of the kernel scheduler code and on the
non-uniform architecture of the hardware.
Both of these will evolve
over time and vary by system architecture and kernel version.
.PP
As of this writing, when this capability was introduced in Linux
2.6.26, on certain popular architectures, the positive values of
.I sched_relax_domain_level
have the following meanings.
.sp
.PD 0
.IP \fB(1)\fR 4
Perform immediate load balancing across Hyper-Thread
siblings on the same core.
.IP \fB(2)\fR
Perform immediate load balancing across other cores in the same package.
.IP \fB(3)\fR
Perform immediate load balancing across other CPUs
on the same node or blade.
.IP \fB(4)\fR
Perform immediate load balancing across several
(implementation-dependent) nodes [on NUMA systems].
.IP \fB(5)\fR
Perform immediate load balancing across all CPUs
in the system [on NUMA systems].
.PD
.PP
The
.I sched_relax_domain_level
value of zero (0) always means
don't perform immediate load balancing;
hence load balancing is done only periodically,
not immediately when a CPU becomes available or another task becomes
runnable.
.PP
The
.I sched_relax_domain_level
value of minus one (\-1)
always means use the system default value.
The system default value can vary by architecture and kernel version.
This system default value can be changed by the kernel
boot-time "relax_domain_level=" argument.
.PP
In the case of multiple overlapping cpusets which have conflicting
.I sched_relax_domain_level
values, the highest such value
applies to all CPUs in any of the overlapping cpusets.
In such cases,
the value \fBminus one (\-1)\fR is the lowest value, overridden by any
other value, and the value \fBzero (0)\fR is the next lowest value.
.SH FORMATS
The following formats are used to represent sets of
CPUs and memory nodes.
.\" ================== Mask Format ==================
.SS Mask Format
The \fBMask Format\fR is used to represent CPU and memory-node bitmasks
in the
.I /proc/<pid>/status
file.
.PP
This format displays each 32-bit
word in hexadecimal (using ASCII characters "0" - "9" and "a" - "f");
words are filled with leading zeros, if required.
For masks longer than one word, a comma separator is used between words.
Words are displayed in big-endian
order, with the most significant word first.
The hex digits within a word are also in big-endian order.
.PP
The number of 32-bit words displayed is the minimum number needed to
display all bits of the bitmask, based on the size of the bitmask.
.PP
Examples of the \fBMask Format\fR:
.PP
.RS
.nf
00000001                        # just bit 0 set
40000000,00000000,00000000      # just bit 94 set
00000001,00000000,00000000      # just bit 64 set
000000ff,00000000               # bits 32-39 set
00000000,000E3862               # bits 1,5,6,11-13,17-19 set
.fi
.RE
.PP
A mask with bits 0, 1, 2, 4, 8, 16, 32, and 64 set displays as:
.PP
.RS
.nf
00000001,00000001,00010117
.fi
.RE
.PP
The first "1" is for bit 64, the
second for bit 32, the third for bit 16, the fourth for bit 8, the
fifth for bit 4, and the "7" is for bits 2, 1, and 0.
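The Mask Format rules above can be sketched as a small encoder; this is an illustrative Python function of our own naming (not a kernel or libc API), assuming only the format described above:

```python
def mask_format(bits):
    """Render a set of bit numbers in the Mask Format described above:
    comma-separated 32-bit hex words, most significant word first,
    each word zero-padded to 8 hex digits."""
    if not bits:
        return "00000000"
    nwords = max(bits) // 32 + 1
    words = [0] * nwords
    for b in bits:
        words[b // 32] |= 1 << (b % 32)
    # Reverse so that the most significant word comes first.
    return ",".join("%08x" % w for w in reversed(words))

print(mask_format({94}))                         # 40000000,00000000,00000000
print(mask_format({0, 1, 2, 4, 8, 16, 32, 64}))  # 00000001,00000001,00010117
```

Running it against the examples shown above reproduces each mask, including the three-word mask for bits 0, 1, 2, 4, 8, 16, 32, and 64.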
.\" ================== List Format ==================
.SS List Format
The \fBList Format\fR for
.I cpus
and
.I mems
is a comma-separated list of CPU or memory-node
numbers and ranges of numbers, in ASCII decimal.
.PP
Examples of the \fBList Format\fR:
.PP
.RS
.nf
0-4,9           # bits 0, 1, 2, 3, 4, and 9 set
0-2,7,12-14     # bits 0, 1, 2, 7, 12, 13, and 14 set
.fi
.RE
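A List Format string can be expanded into the numbers it denotes with a few lines of Python; this is an illustrative sketch (the function name is ours), assuming only well-formed input as described above:

```python
def parse_list_format(s):
    """Expand a List Format string such as "0-2,7,12-14" into a sorted
    list of CPU or memory-node numbers."""
    nums = []
    for part in s.split(","):
        if "-" in part:
            # A range "lo-hi" is inclusive at both ends.
            lo, hi = part.split("-")
            nums.extend(range(int(lo), int(hi) + 1))
        else:
            nums.append(int(part))
    return sorted(nums)

print(parse_list_format("0-4,9"))        # [0, 1, 2, 3, 4, 9]
print(parse_list_format("0-2,7,12-14"))  # [0, 1, 2, 7, 12, 13, 14]
```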
.\" ================== RULES ==================
.SH RULES
The following rules apply to each cpuset:
.IP * 3
Its CPUs and memory nodes must be a (possibly equal)
subset of its parent's.
.IP *
It can be marked
.I cpu_exclusive
only if its parent is.
.IP *
It can be marked
.I mem_exclusive
only if its parent is.
.IP *
If it is
.IR cpu_exclusive ,
its CPUs may not overlap those of any sibling.
.IP *
If it is
.IR mem_exclusive ,
its memory nodes may not overlap those of any sibling.
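The subset and exclusivity rules above can be expressed as a small validity check. The following Python sketch (function and parameter names are ours, for illustration only) models the checks for a child cpuset's CPU set:

```python
def valid_child(parent_cpus, child_cpus, siblings_cpus=(), cpu_exclusive=False):
    """Model the RULES above for a child cpuset's CPUs: they must be a
    (possibly equal) subset of the parent's, and a cpu_exclusive child
    may not overlap any sibling's CPUs."""
    child, parent = set(child_cpus), set(parent_cpus)
    if not child <= parent:
        return False
    if cpu_exclusive:
        for sib in siblings_cpus:
            if child & set(sib):
                return False
    return True

print(valid_child({0, 1, 2, 3}, {2, 3}))   # True: subset of parent
print(valid_child({0, 1, 2, 3}, {2, 4}))   # False: CPU 4 not in parent
print(valid_child({0, 1, 2, 3}, {2, 3}, [{3}], cpu_exclusive=True))
                                           # False: overlaps a sibling
```

The same checks apply, with memory nodes in place of CPUs, to the mem_exclusive rules.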
.\" ================== PERMISSIONS ==================
.SH PERMISSIONS
The permissions of a cpuset are determined by the permissions
of the directories and pseudo-files in the cpuset file system,
normally mounted at
.IR /dev/cpuset .
.PP
For instance, a process can put itself in some other cpuset (than
its current one) if it can write the
.I tasks
file for that cpuset.
This requires execute permission on the encompassing directories
and write permission on the
.I tasks
file.
.PP
An additional constraint is applied to requests to place some
other process in a cpuset.
One process may not attach another to
a cpuset unless it would have permission to send that process
a signal (see
.BR kill (2)).
.PP
A process may create a child cpuset if it can access and write the
parent cpuset directory.
It can modify the CPUs or memory nodes
in a cpuset if it can access that cpuset's directory (execute
permission on each of the parent directories) and write the
corresponding
.I cpus
or
.I mems
file.
.PP
There is one minor difference between the manner in which these
permissions are evaluated and the manner in which normal file-system
operation permissions are evaluated.
The kernel interprets
relative pathnames starting at a process's current working directory.
Even if one is operating on a cpuset file, relative pathnames
are interpreted relative to the process's current working directory,
not relative to the process's current cpuset.
The only ways that
cpuset paths relative to a process's current cpuset can be used are
if either the process's current working directory is its cpuset
(it first did a
.B cd
or
.BR chdir (2)
to its cpuset directory beneath
.IR /dev/cpuset ,
which is a bit unusual)
or if some user code converts the relative cpuset path to a
full file-system path.
.PP
In theory, this means that user code should specify cpusets
using absolute pathnames, which requires knowing the mount point of
the cpuset file system (usually, but not necessarily,
.IR /dev/cpuset ).
In practice, all user-level code that this author is aware of
simply assumes that if the cpuset file system is mounted, then
it is mounted at
.IR /dev/cpuset .
Furthermore, it is common practice for carefully written
user code to verify the presence of the pseudo-file
.I /dev/cpuset/tasks
in order to verify that the cpuset pseudo-file system
is currently mounted.
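The mount check described above is a one-liner in practice. Here is an illustrative Python sketch (the function name is ours) of the convention of probing for the tasks pseudo-file:

```python
import os

def cpuset_fs_mounted(mount_point="/dev/cpuset"):
    """Heuristic described above: treat the cpuset pseudo-file system
    as mounted if its "tasks" pseudo-file is present at the
    conventional mount point."""
    return os.path.exists(os.path.join(mount_point, "tasks"))

# On a path where no cpuset file system is mounted, this is False.
print(cpuset_fs_mounted("/nonexistent/cpuset"))  # False
```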
.\" ================== WARNINGS ==================
.SH WARNINGS
.SS Enabling memory_pressure
By default, the per-cpuset file
.I memory_pressure
always contains zero (0).
Unless this feature is enabled by writing "1" to the pseudo-file
.IR /dev/cpuset/memory_pressure_enabled ,
the kernel does
not compute per-cpuset
.IR memory_pressure .
.SS Using the echo command
When using the
.B echo
command at the shell prompt to change the values of cpuset files,
beware that the built-in
.B echo
command in some shells does not display an error message if the
.BR write (2)
system call fails.
.\" Gack! csh(1)'s echo does this
For example, if the command:
.in +4n
.nf

echo 19 > mems

.fi
.in
failed because memory node 19 was not allowed (perhaps
the current system does not have a memory node 19), then the
.B echo
command might not display any error.
It is better to use the
.B /bin/echo
external command to change cpuset file settings, as this
command will display
.BR write (2)
errors, as in the example:
.in +4n
.nf

/bin/echo 19 > mems
/bin/echo: write error: Invalid argument
.fi
.in
.\" ================== EXCEPTIONS ==================
.SH EXCEPTIONS
.SS Memory placement
Not all allocations of system memory are constrained by cpusets,
for the following reasons.
.PP
If hot-plug functionality is used to remove all the CPUs that are
currently assigned to a cpuset, then the kernel will automatically
update the
.I cpus_allowed
of all processes attached to CPUs in that cpuset
to allow all CPUs.
When memory hot-plug functionality for removing
memory nodes is available, a similar exception is expected to apply
there as well.
In general, the kernel prefers to violate cpuset placement
rather than starve a process that has had all its allowed CPUs or
memory nodes taken offline.
User code should reconfigure cpusets to refer only to online CPUs
and memory nodes when using hot-plug to add or remove such resources.
.PP
A few kernel-critical, internal memory-allocation requests, marked
GFP_ATOMIC, must be satisfied immediately.
The kernel may drop some
request or malfunction if one of these allocations fails.
If such a request cannot be satisfied within the current process's cpuset,
then we relax the cpuset, and look for memory anywhere we can find it.
It's better to violate the cpuset than to stress the kernel.
.PP
Allocations of memory requested by kernel drivers while processing
an interrupt lack any relevant process context, and are not confined
by cpusets.
.SS Renaming cpusets
You can use the
.BR rename (2)
system call to rename cpusets.
Only simple renaming is supported; that is, changing the name of a cpuset
directory is permitted, but moving a directory into
a different directory is not permitted.
.\" ================== ERRORS ==================
.SH ERRORS
The Linux kernel implementation of cpusets sets
.I errno
to specify the reason for a failed system call affecting cpusets.
.PP
The possible
.I errno
settings and their meaning when set on
a failed cpuset call are as listed below.
.TP
.B E2BIG
Attempted a
.BR write (2)
on a special cpuset file
with a length larger than some kernel-determined upper
limit on the length of such writes.
.TP
.B EACCES
Attempted to
.BR write (2)
the process ID (PID) of a process to a cpuset
.I tasks
file when one lacks permission to move that process.
.TP
.B EACCES
Attempted to add, using
.BR write (2),
a CPU or memory node to a cpuset, when that CPU or memory node was
not already in its parent.
.TP
.B EACCES
Attempted to set, using
.BR write (2),
.I cpu_exclusive
or
.I mem_exclusive
on a cpuset whose parent lacks the same setting.
.TP
.B EACCES
Attempted to
.BR write (2)
a
.I memory_pressure
file.
.TP
.B EACCES
Attempted to create a file in a cpuset directory.
.TP
.B EBUSY
Attempted to remove, using
.BR rmdir (2),
a cpuset with attached processes.
.TP
.B EBUSY
Attempted to remove, using
.BR rmdir (2),
a cpuset with child cpusets.
.TP
.B EBUSY
Attempted to remove
a CPU or memory node from a cpuset
that is also in a child of that cpuset.
.TP
.B EEXIST
Attempted to create, using
.BR mkdir (2),
a cpuset that already exists.
.TP
.B EEXIST
Attempted to
.BR rename (2)
a cpuset to a name that already exists.
.TP
.B EFAULT
Attempted to
.BR read (2)
or
.BR write (2)
a cpuset file using
a buffer that is outside the writing process's accessible address space.
.TP
.B EINVAL
Attempted to change a cpuset, using
.BR write (2),
in a way that would violate a
.I cpu_exclusive
or
.I mem_exclusive
attribute of that cpuset or any of its siblings.
.TP
.B EINVAL
Attempted to
.BR write (2)
an empty
.I cpus
or
.I mems
list to a cpuset which has attached processes or child cpusets.
.TP
.B EINVAL
Attempted to
.BR write (2)
a
.I cpus
or
.I mems
list which included a range with the second number smaller than
the first number.
.TP
.B EINVAL
Attempted to
.BR write (2)
a
.I cpus
or
.I mems
list which included an invalid character in the string.
.TP
.B EINVAL
Attempted to
.BR write (2)
a list to a
.I cpus
file that did not include any online CPUs.
.TP
.B EINVAL
Attempted to
.BR write (2)
a list to a
.I mems
file that did not include any online memory nodes.
.TP
.B EINVAL
Attempted to
.BR write (2)
a list to a
.I mems
file that included a node that held no memory.
.TP
.B EIO
Attempted to
.BR write (2)
a string to a cpuset
.I tasks
file that
does not begin with an ASCII decimal integer.
.TP
.B EIO
Attempted to
.BR rename (2)
a cpuset into a different directory.
.TP
.B ENAMETOOLONG
Attempted to
.BR read (2)
a
.I /proc/<pid>/cpuset
file for a cpuset path that is longer than the kernel page size.
.TP
.B ENAMETOOLONG
Attempted to create, using
.BR mkdir (2),
a cpuset whose base directory name is longer than 255 characters.
.TP
.B ENAMETOOLONG
Attempted to create, using
.BR mkdir (2),
a cpuset whose full pathname,
including the mount point (typically "/dev/cpuset/") prefix,
is longer than 4095 characters.
.TP
.B ENODEV
The cpuset was removed by another process at the same time as a
.BR write (2)
was attempted on one of the pseudo-files in the cpuset directory.
.TP
.B ENOENT
Attempted to create, using
.BR mkdir (2),
a cpuset in a parent cpuset that doesn't exist.
.TP
.B ENOENT
Attempted to
.BR access (2)
or
.BR open (2)
a nonexistent file in a cpuset directory.
.TP
.B ENOMEM
Insufficient memory is available within the kernel; can occur
on a variety of system calls affecting cpusets, but only if the
system is extremely short of memory.
.TP
.B ENOSPC
Attempted to
.BR write (2)
the process ID (PID)
of a process to a cpuset
.I tasks
file when the cpuset had an empty
.I cpus
or empty
.I mems
setting.
.TP
.B ENOSPC
Attempted to
.BR write (2)
an empty
.I cpus
or
.I mems
setting to a cpuset that
has tasks attached.
.TP
.B ENOTDIR
Attempted to
.BR rename (2)
a nonexistent cpuset.
.TP
.B EPERM
Attempted to remove a file from a cpuset directory.
.TP
.B ERANGE
Specified a
.I cpus
or
.I mems
list to the kernel which included a number too large for the kernel
to set in its bitmasks.
.TP
.B ESRCH
Attempted to
.BR write (2)
the process ID (PID) of a nonexistent process to a cpuset
.I tasks
file.
.\" ================== VERSIONS ==================
.SH VERSIONS
Cpusets appeared in version 2.6.12 of the Linux kernel.
.\" ================== NOTES ==================
.SH NOTES
Despite its name, the
.I pid
parameter is actually a thread ID,
and each thread in a threaded group can be attached to a different
cpuset.
The value returned from a call to
.BR gettid (2)
can be passed in the argument
.IR pid .
.\" ================== BUGS ==================
.SH BUGS
.I memory_pressure
cpuset files can be opened
for writing, creation, or truncation, but then the
.BR write (2)
fails with
.I errno
set to
.BR EACCES ,
and the creation and truncation options on
.BR open (2)
have no effect.
.\" ================== EXAMPLE ==================
.SH EXAMPLE
The following examples demonstrate querying and setting cpuset
options using shell commands.
.SS Creating and attaching to a cpuset
To create a new cpuset and attach the current command shell to it,
the steps are:
.sp
.PD 0
.IP 1) 4
mkdir /dev/cpuset (if not already done)
.IP 2)
mount \-t cpuset none /dev/cpuset (if not already done)
.IP 3)
Create the new cpuset using
.BR mkdir (1).
.IP 4)
Assign CPUs and memory nodes to the new cpuset.
.IP 5)
Attach the shell to the new cpuset.
.PD
.PP
For example, the following sequence of commands will set up a cpuset
named "Charlie", containing just CPUs 2 and 3, and memory node 1,
and then attach the current shell to that cpuset.
.in +4n
.nf

mkdir /dev/cpuset
mount \-t cpuset cpuset /dev/cpuset
cd /dev/cpuset
mkdir Charlie
cd Charlie
/bin/echo 2-3 > cpus
/bin/echo 1 > mems
/bin/echo $$ > tasks
# The current shell is now running in cpuset Charlie
# The next line should display '/Charlie'
cat /proc/self/cpuset
.fi
.in
.SS Migrating a job to different memory nodes
To migrate a job (the set of processes attached to a cpuset)
to different CPUs and memory nodes in the system, including moving
the memory pages currently allocated to that job,
perform the following steps.
.sp
.PD 0
.IP 1) 4
Let's say we want to move the job in cpuset
.I alpha
(CPUs 4-7 and memory nodes 2-3) to a new cpuset
.I beta
(CPUs 16-19 and memory nodes 8-9).
.IP 2)
First create the new cpuset
.IR beta .
.IP 3)
Then allow CPUs 16-19 and memory nodes 8-9 in
.IR beta .
.IP 4)
Then enable
.I memory_migrate
in
.IR beta .
.IP 5)
Then move each process from
.I alpha
to
.IR beta .
.PD
.PP
The following sequence of commands accomplishes this.
.in +4n
.nf

cd /dev/cpuset
mkdir beta
cd beta
/bin/echo 16-19 > cpus
/bin/echo 8-9 > mems
/bin/echo 1 > memory_migrate
while read i; do /bin/echo $i; done < ../alpha/tasks > tasks
.fi
.in
.PP
The above should move any processes in
.I alpha
to
.IR beta ,
and any memory held by these processes on memory nodes 2-3 to memory
nodes 8-9, respectively.
.PP
Notice that the last step of the above sequence did not do:
.in +4n
.nf

cp ../alpha/tasks tasks
.fi
.in
.PP
The
.I while
loop, rather than the seemingly easier use of the
.BR cp (1)
command, was necessary because
only one process PID at a time may be written to the
.I tasks
file.
.PP
The same effect (writing one PID at a time) as the
.I while
loop can be accomplished more efficiently, in fewer keystrokes and in
syntax that works on any shell, but alas more obscurely, by using the
.I \-u
(unbuffered) option of
.BR sed (1):
.in +4n
.nf

sed \-un p < ../alpha/tasks > tasks
.fi
.in
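The one-PID-per-write requirement can also be scripted outside the shell. The following Python sketch is illustrative only: the function name is ours, and the demonstration uses ordinary temporary files in place of the real cpuset tasks pseudo-files (which would require root and a mounted cpuset file system):

```python
import tempfile

def migrate_tasks(src_tasks, dst_tasks):
    """Copy every PID listed in src_tasks into dst_tasks, one PID per
    write, mirroring the while-loop above: the cpuset "tasks" file
    accepts only one PID per write(2)."""
    with open(src_tasks) as src, open(dst_tasks, "w", buffering=1) as dst:
        for line in src:
            pid = line.strip()
            if pid:
                # Line buffering flushes on each newline, so each PID
                # goes to the file in its own write.
                dst.write(pid + "\n")

# Demonstration on ordinary files standing in for the tasks pseudo-files.
src = tempfile.NamedTemporaryFile("w", delete=False)
src.write("101\n102\n103\n")
src.close()
dst = src.name + ".out"
migrate_tasks(src.name, dst)
print(open(dst).read())  # prints the three PIDs, one per line
```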
.\" ================== SEE ALSO ==================
.SH SEE ALSO
.BR taskset (1),
.BR get_mempolicy (2),
.BR getcpu (2),
.BR mbind (2),
.BR sched_getaffinity (2),
.BR sched_setaffinity (2),
.BR sched_setscheduler (2),
.BR set_mempolicy (2),
.BR proc (5),
.BR migratepages (8),
.BR numactl (8)
.PP
The kernel source file
.IR Documentation/cpusets.txt .