mirror of https://github.com/mkerrisk/man-pages
254 lines
10 KiB
Groff
254 lines
10 KiB
Groff
.\" Copyright (C) 2015 Serge Hallyn <serge@hallyn.com>
|
|
.\"
|
|
.\" %%%LICENSE_START(VERBATIM)
|
|
.\" Permission is granted to make and distribute verbatim copies of this
|
|
.\" manual provided the copyright notice and this permission notice are
|
|
.\" preserved on all copies.
|
|
.\"
|
|
.\" Permission is granted to copy and distribute modified versions of this
|
|
.\" manual under the conditions for verbatim copying, provided that the
|
|
.\" entire resulting derived work is distributed under the terms of a
|
|
.\" permission notice identical to this one.
|
|
.\"
|
|
.\" Since the Linux kernel and libraries are constantly changing, this
|
|
.\" manual page may be incorrect or out-of-date. The author(s) assume no
|
|
.\" responsibility for errors or omissions, or for damages resulting from
|
|
.\" the use of the information contained herein. The author(s) may not
|
|
.\" have taken the same level of care in the production of this manual,
|
|
.\" which is licensed free of charge, as they might when working
|
|
.\" professionally.
|
|
.\"
|
|
.\" Formatted or processed versions of this manual, if unaccompanied by
|
|
.\" the source, must acknowledge the copyright and authors of this work.
|
|
.\" %%%LICENSE_END
|
|
.\"
|
|
|
|
Name: cgroups - linux process control groups
|
|
|
|
Description
|
|
|
|
Control cgroups, usually referred to as cgroups, are a Linux kernel feature
|
|
which provides for grouping of tasks and resource tracking and limitations for those group.
|
|
While several systems have been introduced to help in configuring and
|
|
managing cgroups, the kernel's cgroup interface is provided through
|
|
a pseudo-filesystem called cgroupfs. Task grouping is implemented in the
|
|
core cgroup kernel code, while resource tracking and limits are implemented in
|
|
a set of per-resource-type subsystems - memory, cpu, etc - which may be
|
|
enabled as separate hierarchies, or joined into comounted hierarchies.
|
|
Each hierarchy constitutes a separate mount of the cgroupfs filesystem,
|
|
with the subsystems enabled in that hierarchy listed in the mount options.
|
|
For each mounted hierarchy, the directory tree mirrors the control group hierarchy.
|
|
Each control group is represented by a directory, with each of its child
|
|
control cgroups represented as a child directory.
|
|
For instance, /user/joe/1.session represents control group
|
|
1.session, which is a child of cgroup joe, which is a child of /user.
|
|
Under each cgroup directory are a set of files which can be read or
|
|
written to, reflecting resource limits and a few general cgroup
|
|
properties.
|
|
|
|
In general, cgroup limits are hierarchical, meaning that the limits placed
|
|
on /user/joe cannot be exceeded by /usr/joe/1.session. There are currently
|
|
exceptions to this, but stricter adherence is a goal as cgroups are being
|
|
largely reworked.
|
|
|
|
The existing subsystems include
|
|
|
|
. cpusets
|
|
. blkio
|
|
. cpuacct
|
|
. devices
|
|
. freezer
|
|
. hugetlb
|
|
. memory
|
|
. net_cls
|
|
. net_pri
|
|
. cpu
|
|
. perf_event
|
|
|
|
In addition, cgroups can be mounted with no bound subsystem, in which case
|
|
they serve only to track processes. An example of this is the name=systemd
|
|
cgroup which is used by systemd to track services and user sessions.
|
|
|
|
Mounting
|
|
|
|
To be available, a given cgroup subsystem must be compiled into the
|
|
kernel. Since they are exposed through a virtual filesystem, subsystems
|
|
must be mounted before they can be controlled. The usual place for this
|
|
is under /sys/fs/cgroup. If all the desired subsystems can be co-mounted,
|
|
then the system may simply
|
|
|
|
mount -t cgroup cgroup /sys/fs/cgroup
|
|
|
|
If multiple, separately mounted subsystems are desired, then this is
|
|
usually done in per-subsystem subdirectories. This requires first mounting
|
|
a tmpfs under /sys/fs/cgroup so that subdirectories can be created. For
|
|
instance, to mount cpu, memory and devices cgroups, you could
|
|
|
|
mount -t tmpfs -o size=100000,mode=755 cgroups /sys/fs/cgroup
|
|
for s in cpu memory devices; do
|
|
mkdir /sys/fs/cgroup/$s
|
|
mount -t cgroup -o $s $s /sys/fs/cgroup/$s
|
|
done
|
|
|
|
Co-mounting subsystems has the effect that a task is in the same cgroup for
|
|
all co-mounted subsystems. Separately mounting subsystems allows a task to
|
|
be in cgroup /foo1 for one subsystem while being in /foo2/foo3 for another.
|
|
|
|
Introspection
|
|
|
|
The list of subsystems compiled into the kernel can be seen in the file
|
|
/proc/cgroups. The file /proc/pid/cgroup lists the task's current cgroup
|
|
membership for each mounted hierarchy.
|
|
|
|
Creating cgroups and moving tasks
|
|
|
|
The system begins with a single root cgroup (per hierarchy), '/', which all tasks belong to.
|
|
A new cgroup is created using mkdir(2):
|
|
|
|
mkdir /sys/fs/cgroup/cpu/cg1
|
|
|
|
This creates a new empty cgroup. Tasks may be moved to this cgroup by writing
|
|
their pids into the cgroup's "cgroup.procs" (deprecated) "tasks" file:
|
|
|
|
echo $$ > /sys/fs/cgroup/cpu/cg1/cgroup.procs
|
|
|
|
The same file can be read to obtain a list of the processes currently in cg1.
|
|
By using the cgroup.procs file instead of the tasks file, all tasks in the
|
|
threadgroup are moved into the new cgroup at once.
|
|
|
|
At fork(2), the new child is created as a member of the parent's cgroup, leading
|
|
to implicit grouping of process hierarchies.
|
|
|
|
Note: in the upcoming unified hierarchy, a new restriction is imposed such
|
|
that tasks may only exist in leaf cgroups. For instance, if cgroup /cg1/cg2
|
|
exists, then a task may exist in /cg1/cg2, but not in /cg1. This is to
|
|
avoid the current ambiguity in the delegation of resources between tasks in /cg1
|
|
and its children. The recommended workaround is to create a subdirectory called
|
|
leaf for any non-leaf cgroup which should contain tasks, and make sure not to
|
|
create child cgroups of it. In the above example, tasks which previously would
|
|
have gone into /cg1 would now go into /cg1/leaf. This has the advantage of
|
|
making explicit the relationship between tasks in /cg1/leaf and /cg1's other
|
|
children.
|
|
|
|
Removing cgroups
|
|
|
|
To remove a cgroup, it must first have no child cgroups and no tasks. So long
|
|
as that is the case, the cgroup is removed using rmdir(2).
|
|
|
|
A special file in each cgroup hierarchy, called 'release_agent', can be used
|
|
to register a program to handle cgroups which become newly empty. The program
|
|
will be called each time a cgroup marked for autoremove becomes empty and childless.
|
|
The cgroup path will be listed as the first argument. The cgroup must be marked
|
|
as eligible for autoremove by writing '1' into its notify_on_release file, and
|
|
this value is inherited by newly created child cgroups.
|
|
|
|
A new feature in 3.15 (?) is the 'cgroup.populated' file. This reads 0 if
|
|
there are no tasks in the cgroup or its descendants, and 1 otherwise. It
|
|
can be watched for changes using inotify. This allows userspace to efficiently
|
|
watch cgroups for autoremove conditions.
|
|
|
|
Unified Hierarchy
|
|
|
|
In order to address a number of shortcomings in the original Control Groups
|
|
design, new semantics are being gradually introduced. In order not to break
|
|
existing applications, the new semantics are hidden behind a mount option
|
|
(subject to change):
|
|
|
|
mount -t cgroup -o __DEVEL__sane_behavior cgroup /sys/fs/cgroup
|
|
|
|
By default all controllers are co-mounted in the unified hierarchy. While
|
|
controllers may be mounted under the legacy hierarchy, they may not be
|
|
mounted at the same time in legacy and unified hierarchies.
|
|
|
|
The new behaviors are summarized below:
|
|
|
|
1 Tasks only in leave nodes
|
|
|
|
With the exception of the root cgroup, tasks may only belong in leaf nodes.
|
|
This avoids the need to decide how to partition resources between tasks which
|
|
are members of cgroup A and tasks in child cgroups of A.
|
|
|
|
2. Active cgroups must be specified
|
|
|
|
The unified hierarchy presents two new files, "cgroup.controllers" and
|
|
"cgroup.subtree_control". When a cgroup A/b is created, its 'cgroup.controllers"
|
|
file contains the list of controllers which were active in its parent, A.
|
|
This is the list of controllers which are available to this cgroup. No
|
|
controllers are active until they are enabled through the "cgroup.subtree_control"
|
|
file, by writing the name of the space-separate list of controllers, each preceded
|
|
by '+' (to enable) or '-' (to disable). If the freezer controller is not
|
|
enabled in /A/B, then it cannot be enabled in /A/B/C.
|
|
|
|
3. No "tasks" or "cgroup.clone_children" files
|
|
|
|
4. Empty cgroup notification
|
|
|
|
A new file "cgroup.populated" under each cgroup contains '0' when the
|
|
cgroup is empty, and 1 when it is populated. It therefore may be
|
|
watched to detect when a cgroup becomes (non-)empty. This replaces
|
|
the original notify-on-release mechanism.
|
|
|
|
For more changes, please see the Documentation/cgroups/unified-hierarchy
|
|
file in the kernel source.
|
|
|
|
Subsystems # give details on each subsystem
|
|
|
|
. cpusets
|
|
|
|
Cpusets bind the tasks in a cgroup to a specified set of cpus and
|
|
numa nodes.
|
|
|
|
. blkio
|
|
|
|
The blkio cgroup controls and limits access to specified block devices by
|
|
applying IO control in the form of throttling and upper limits against leaf
|
|
nodes and intermediate nodes in the storage hierarchy.
|
|
|
|
Two policies are available. The first is a proportional weight time based division
|
|
of disk implemented with CFQ. This is in effect for leaf nodes using CFQ. The
|
|
second is a throttling policy which specifies upper IO rate limits on a device.
|
|
|
|
. cpuacct
|
|
|
|
This provides accounting for cpu usage by groups of tasks.
|
|
|
|
. devices
|
|
|
|
This supports controlling which tasks may create (mknod) devices as
|
|
well as open them for reading or writing. The policies may be specified
|
|
as whitelists and blacklists. Hierarchy is enforced, so new rules must not
|
|
violate existing rules for the target or ancestor cgroups.
|
|
|
|
. freezer
|
|
|
|
The freezer cgroup can suspend and un-suspend all tasks in a cgroup.
|
|
Freezing a cgroup /A also causes its children, i.e. tasks in /A/B,
|
|
to be frozen.
|
|
|
|
. hugetlb
|
|
|
|
This supports limiting the use of huge pages by cgroups.
|
|
|
|
. memory
|
|
|
|
The memory controller supports reporting and limiting of process memory, kernel
|
|
memory, and swap used by cgroups.
|
|
|
|
. net_cls
|
|
|
|
This places a classid, specified for the cgroup, on network packets
|
|
created by a cgroup. These classids can then be used in firewall rules,
|
|
as well as used to shape traffic using tc. This only applies to packets
|
|
leaving the cgroup, not to traffic arriving at the cgroup.
|
|
|
|
. net_prio
|
|
|
|
This allows priorities to be specified, per network interfaces, for cgroups.
|
|
|
|
. cpu
|
|
|
|
Cgroups can be guaranteed a minimum number of "cpu shares" when a system is
|
|
busy. This does not limit a cgroup's cpu usage if the cpus are not busy.
|
|
|
|
. perf_event
|