mirror of https://github.com/mkerrisk/man-pages
+ changed the "policy" parameter to "mode" through out the
descriptions in an attempt to promote the concept that the memory policy is a tuple consisting of a mode and optional set of nodes. + rewrite portions of description for clarification. ++ clarify interaction of policy with mmap()'d files and shared memory regions, including SHM_HUGE regions. ++ defined how "empty set of nodes" specified and what this means for MPOL_PREFERRED. ++ mention what happens if local/target node contains no free memory. ++ clarify semantics of multiple nodes to BIND policy. Note: subject to change. We'll fix the man pages when/if this happens. + added all errors currently returned by sys call. + added mmap(2), shmget(2), shmat(2) to See Also list.
This commit is contained in:
parent
03d2434e5c
commit
9f5682a8ed
344
man2/mbind.2
344
man2/mbind.2
|
@ -18,15 +18,17 @@
|
|||
.\" the source, must acknowledge the copyright and authors of this work.
|
||||
.\"
|
||||
.\" 2006-02-03, mtk, substantial wording changes and other improvements
|
||||
.\" 2007-08-27, Lee Schermerhorn <Lee.Schermerhorn@hp.com>
|
||||
.\" more precise specification of behavior.
|
||||
.\"
|
||||
.TH MBIND 2 2006-02-07 "Linux" "Linux Programmer's Manual"
|
||||
.TH MBIND 2 "2007-06-01" "SuSE Labs" "Linux Programmer's Manual"
|
||||
.SH NAME
|
||||
mbind \- Set memory policy for a memory range
|
||||
.SH SYNOPSIS
|
||||
.nf
|
||||
.B "#include <numaif.h>"
|
||||
.sp
|
||||
.BI "int mbind(void *" start ", unsigned long " len ", int " policy ,
|
||||
.BI "int mbind(void *" start ", unsigned long " len ", int " mode ,
|
||||
.BI " unsigned long *" nodemask ", unsigned long " maxnode ,
|
||||
.BI " unsigned " flags );
|
||||
.sp
|
||||
|
@ -34,76 +36,178 @@ mbind \- Set memory policy for a memory range
|
|||
.fi
|
||||
.SH DESCRIPTION
|
||||
.BR mbind ()
|
||||
sets the NUMA memory
|
||||
.I policy
|
||||
sets the NUMA memory policy,
|
||||
which consists of a policy mode and zero or more nodes,
|
||||
for the memory range starting with
|
||||
.I start
|
||||
and continuing for
|
||||
.IR len
|
||||
bytes.
|
||||
The memory of a NUMA machine is divided into multiple nodes.
|
||||
The memory policy defines in which node memory is allocated.
|
||||
The memory policy defines from which node memory is allocated.
|
||||
|
||||
If the memory range specified by the
|
||||
.IR start " and " len
|
||||
arguments includes an "anonymous" region of memory\(emthat is
|
||||
a region of memory created using the
|
||||
.BR mmap (2)
|
||||
system call with the
|
||||
.BR MAP_ANONYMOUS \(emor
|
||||
a memory mapped file, mapped using the
|
||||
.BR mmap (2)
|
||||
system call with the
|
||||
.B MAP_PRIVATE
|
||||
flag, pages will only be allocated according to the specified
|
||||
policy when the application writes [stores] to the page.
|
||||
For anonymous regions, an initial read access will use a shared
|
||||
page in the kernel containing all zeros.
|
||||
For a file mapped with
|
||||
.BR MAP_PRIVATE ,
|
||||
an initial read access will allocate pages according to the
|
||||
process policy of the process that causes the page to be allocated.
|
||||
This may not be the process that called
|
||||
.BR mbind ().
|
||||
|
||||
The specified policy will be ignored for any
|
||||
.B MAP_SHARED
|
||||
mappings in the specified memory range.
|
||||
Rather the pages will be allocated according to the process policy
|
||||
of the process that caused the page to be allocated.
|
||||
Again, this may not be the process that called
|
||||
.BR mbind ().
|
||||
|
||||
If the specified memory range includes a shared memory region
|
||||
created using the
|
||||
.BR shmget (2)
|
||||
system call and attached using the
|
||||
.BR shmat (2)
|
||||
system call,
|
||||
pages allocated for the anonymous or shared memory region will
|
||||
be allocated according to the policy specified, regardless which
|
||||
process attached to the shared memory segment causes the allocation.
|
||||
If, however, the shared memory region was created with the
|
||||
.B SHM_HUGETLB
|
||||
flag,
|
||||
the huge pages will be allocated according to the policy specified
|
||||
only if the page allocation is caused by the task that calls
|
||||
.BR mbind ()
|
||||
for that region.
|
||||
|
||||
By default,
|
||||
.BR mbind ()
|
||||
only has an effect for new allocations; if the pages inside
|
||||
the range have been already touched before setting the policy,
|
||||
then the policy has no effect.
|
||||
This default behavior may be overridden by the
|
||||
.BR MPOL_MF_MOVE
|
||||
and
|
||||
.B MPOL_MF_MOVE_ALL
|
||||
flags described below.
|
||||
|
||||
Available policies are
|
||||
The
|
||||
.I mode
|
||||
argument must specify one of
|
||||
.BR MPOL_DEFAULT ,
|
||||
.BR MPOL_BIND ,
|
||||
.BR MPOL_INTERLEAVE ,
|
||||
and
|
||||
.B MPOL_INTERLEAVE
|
||||
or
|
||||
.BR MPOL_PREFERRED .
|
||||
All policies except
|
||||
All policy modes except
|
||||
.B MPOL_DEFAULT
|
||||
require the caller to specify the nodes to which the policy applies in the
|
||||
require the caller to specify via the
|
||||
.I nodemask
|
||||
parameter.
|
||||
parameter,
|
||||
the node or nodes to which the mode applies.
|
||||
|
||||
.I nodemask
|
||||
is a bit mask of nodes containing up to
|
||||
points to a bitmask of nodes containing up to
|
||||
.I maxnode
|
||||
bits.
|
||||
The actual number of bytes transferred via this argument
|
||||
is rounded up to the next multiple of
|
||||
The bit mask size is rounded to the next multiple of
|
||||
.IR "sizeof(unsigned long)" ,
|
||||
but the kernel will only use bits up to
|
||||
.IR maxnode .
|
||||
A NULL argument means an empty set of nodes.
|
||||
A NULL value of
|
||||
.I nodemask
|
||||
or a
|
||||
.I maxnode
|
||||
value of zero specifies the empty set of nodes.
|
||||
If the value of
|
||||
.I maxnode
|
||||
is zero,
|
||||
the
|
||||
.I nodemask
|
||||
argument is ignored.
|
||||
|
||||
The
|
||||
.B MPOL_DEFAULT
|
||||
policy is the default and means to use the underlying process policy
|
||||
(which can be modified with
|
||||
.BR set_mempolicy (2)).
|
||||
Unless the process policy has been changed this means to allocate
|
||||
memory on the node of the CPU that triggered the allocation.
|
||||
mode specifies that the default policy be used.
|
||||
When applied to a range of memory via
|
||||
.IR mbind (),
|
||||
this means to use the process policy,
|
||||
which may have been set with
|
||||
.BR set_mempolicy (2).
|
||||
If the mode of the process policy is also
|
||||
.BR MPOL_DEFAULT ,
|
||||
the system-wide default policy will be used.
|
||||
The system-wide default policy will allocate
|
||||
pages on the node of the CPU that triggers the allocation.
|
||||
For
|
||||
.BR MPOL_DEFAULT ,
|
||||
the
|
||||
.I nodemask
|
||||
should be specified as NULL.
|
||||
and
|
||||
.I maxnode
|
||||
arguments must be specify the empty set of nodes.
|
||||
|
||||
The
|
||||
.B MPOL_BIND
|
||||
policy is a strict policy that restricts memory allocation to the
|
||||
nodes specified in
|
||||
mode specifies a strict policy that restricts memory allocation to
|
||||
the nodes specified in
|
||||
.IR nodemask .
|
||||
If
|
||||
.I nodemask
|
||||
specifies more than one node, page allocations will come from
|
||||
the node with the lowest numeric node id first, until that node
|
||||
contains no free memory.
|
||||
Allocations will then come from the node with the next highest
|
||||
node id specified in
|
||||
.I nodemask
|
||||
and so forth, until none of the specified nodes contain free memory.
|
||||
Pages will not be allocated from any node not specified in the
|
||||
.IR nodemask .
|
||||
There won't be allocations on other nodes.
|
||||
|
||||
The
|
||||
.B MPOL_INTERLEAVE
|
||||
interleaves allocations to the nodes specified in
|
||||
mode specifies that page allocations be interleaved across the
|
||||
set of nodes specified in
|
||||
.IR nodemask .
|
||||
This optimizes for bandwidth instead of latency.
|
||||
This optimizes for bandwidth instead of latency
|
||||
by spreading out pages and memory accesses to those pages across
|
||||
multiple nodes.
|
||||
To be effective the memory area should be fairly large,
|
||||
at least 1MB or bigger.
|
||||
at least 1MB or bigger with a fairly uniform access pattern.
|
||||
Accesses to a single page of the area will still be limited to
|
||||
the memory bandwidth of a single node.
|
||||
|
||||
.B MPOL_PREFERRED
|
||||
sets the preferred node for allocation.
|
||||
The kernel will try to allocate in this
|
||||
The kernel will try to allocate pages from this
|
||||
node first and fall back to other nodes if the
|
||||
preferred nodes is low on free memory.
|
||||
Only the first node in the
|
||||
If
|
||||
.I nodemask
|
||||
is used.
|
||||
If no node is set in the mask, then the memory is allocated on
|
||||
the node of the CPU that triggered the allocation allocation).
|
||||
specifies more than one node id, the first node in the
|
||||
mask will be selected as the preferred node.
|
||||
If the
|
||||
.I nodemask
|
||||
and
|
||||
.I maxnode
|
||||
arguments specify the empty set, then the memory is allocated on
|
||||
the node of the CPU that triggered the allocation.
|
||||
This is the only way to specify "local allocation" for a
|
||||
range of memory via
|
||||
.IR mbind (2).
|
||||
|
||||
If
|
||||
.B MPOL_MF_STRICT
|
||||
|
@ -115,17 +219,18 @@ is not
|
|||
.BR MPOL_DEFAULT ,
|
||||
then the call will fail with the error
|
||||
.B EIO
|
||||
if the existing pages in the mapping don't follow the policy.
|
||||
In 2.6.16 or later the kernel will also try to move pages
|
||||
to the requested node with this flag.
|
||||
if the existing pages in the memory range don't follow the policy.
|
||||
.\" According to the kernel code, the following is not true --lts
|
||||
.\" In 2.6.16 or later the kernel will also try to move pages
|
||||
.\" to the requested node with this flag.
|
||||
|
||||
If
|
||||
.B MPOL_MF_MOVE
|
||||
is passed in
|
||||
is specified in
|
||||
.IR flags ,
|
||||
then an attempt will be made to
|
||||
move all the pages in the mapping so that they follow the policy.
|
||||
Pages that are shared with other processes are not moved.
|
||||
then the kernel will attempt to move all the existing pages
|
||||
in the memory range so that they follow the policy.
|
||||
Pages that are shared with other processes will not be moved.
|
||||
If
|
||||
.B MPOL_MF_STRICT
|
||||
is also specified, then the call will fail with the error
|
||||
|
@ -136,8 +241,8 @@ If
|
|||
.B MPOL_MF_MOVE_ALL
|
||||
is passed in
|
||||
.IR flags ,
|
||||
then all pages in the mapping will be moved regardless of whether
|
||||
other processes use the pages.
|
||||
then the kernel will attempt to move all existing pages in the memory range
|
||||
regardless of whether other processes use the pages.
|
||||
The calling process must be privileged
|
||||
.RB ( CAP_SYS_NICE )
|
||||
to use this flag.
|
||||
|
@ -146,6 +251,7 @@ If
|
|||
is also specified, then the call will fail with the error
|
||||
.B EIO
|
||||
if some pages could not be moved.
|
||||
.\" ---------------------------------------------------------------
|
||||
.SH RETURN VALUE
|
||||
On success,
|
||||
.BR mbind ()
|
||||
|
@ -153,11 +259,9 @@ returns 0;
|
|||
on error, \-1 is returned and
|
||||
.I errno
|
||||
is set to indicate the error.
|
||||
.\" ---------------------------------------------------------------
|
||||
.SH ERRORS
|
||||
.TP
|
||||
.B EFAULT
|
||||
There was a unmapped hole in the specified memory range
|
||||
or a passed pointer was not valid.
|
||||
.\" I think I got all of the error returns. --lts
|
||||
.TP
|
||||
.B EINVAL
|
||||
An invalid value was specified for
|
||||
|
@ -169,55 +273,102 @@ or
|
|||
was less than
|
||||
.IR start ;
|
||||
or
|
||||
.I policy
|
||||
was
|
||||
.B MPOL_DEFAULT
|
||||
.I start
|
||||
is not a multiple of the system page size.
|
||||
Or,
|
||||
.I mode
|
||||
is
|
||||
.I MPOL_DEFAULT
|
||||
and
|
||||
.I nodemask
|
||||
pointed to a non-empty set;
|
||||
specified a non-empty set;
|
||||
or
|
||||
.I policy
|
||||
was
|
||||
.B MPOL_BIND
|
||||
.I mode
|
||||
is
|
||||
.I MPOL_BIND
|
||||
or
|
||||
.B MPOL_INTERLEAVE
|
||||
.I MPOL_INTERLEAVE
|
||||
and
|
||||
.I nodemask
|
||||
pointed to an empty set,
|
||||
is empty.
|
||||
Or,
|
||||
.I maxnode
|
||||
specifies more than a page worth of bits.
|
||||
Or,
|
||||
.I nodemask
|
||||
specifies one or more node ids that are
|
||||
greater than the maximum supported node id,
|
||||
or are not allowed in the calling task's context.
|
||||
.\" "calling task's context" refers to cpusets. No man page avail to ref. --lts
|
||||
Or, none of the node ids specified by
|
||||
.I nodemask
|
||||
are on-line, or none of the specified nodes contain memory.
|
||||
.TP
|
||||
.B EFAULT
|
||||
Part of all of the memory range specified by
|
||||
.I nodemask
|
||||
and
|
||||
.I maxnode
|
||||
points outside your accessible address space.
|
||||
Or, there was a unmapped hole in the specified memory range.
|
||||
.TP
|
||||
.B EIO
|
||||
.B MPOL_MF_STRICT
|
||||
was specified and an existing page was already on a node
|
||||
that does not follow the policy.
|
||||
that does not follow the policy;
|
||||
or
|
||||
.B MPOL_MF_MOVE
|
||||
or
|
||||
.B MPOL_MF_MOVE_ALL
|
||||
was specified and the kernel was unable to move all existing
|
||||
pages in the range.
|
||||
.TP
|
||||
.B ENOMEM
|
||||
System out of memory.
|
||||
.SH CONFORMING TO
|
||||
This system call is Linux specific.
|
||||
Insufficient kernel memory was available.
|
||||
.TP
|
||||
.B EPERM
|
||||
The
|
||||
.I flags
|
||||
argument included the
|
||||
.B MPOL_MF_MOVE_ALL
|
||||
flag and the caller does not have the
|
||||
.B CAP_SYS_NICE
|
||||
privilege.
|
||||
.\" ---------------------------------------------------------------
|
||||
.SH NOTES
|
||||
NUMA policy is not supported on file mappings.
|
||||
NUMA policy is not supported on a memory mapped file range
|
||||
that was mapped with the
|
||||
.I MAP_SHARED
|
||||
flag.
|
||||
|
||||
.B MPOL_MF_STRICT
|
||||
is ignored on huge page mappings right now.
|
||||
is ignored on huge page mappings.
|
||||
|
||||
It is unfortunate that the same flag,
|
||||
The
|
||||
.BR MPOL_DEFAULT ,
|
||||
has different effects for
|
||||
mode has different effects for
|
||||
.BR mbind (2)
|
||||
and
|
||||
.BR set_mempolicy (2).
|
||||
To select "allocation on the node of the CPU that
|
||||
triggered the allocation" (like
|
||||
.BR set_mempolicy (2)
|
||||
.BR MPOL_DEFAULT )
|
||||
when calling
|
||||
When
|
||||
.B MPOL_DEFAULT
|
||||
is specified for a range of memory using
|
||||
.BR mbind (),
|
||||
any pages subsequently allocated for that range will use
|
||||
the process' policy, as set by
|
||||
.BR set_mempolicy (2).
|
||||
This effectively removes the explicit policy from the
|
||||
specified range.
|
||||
To select "local allocation" for a memory range,
|
||||
specify a
|
||||
.I policy
|
||||
.I mode
|
||||
of
|
||||
.B MPOL_PREFERRED
|
||||
with an empty
|
||||
.IR nodemask .
|
||||
with an empty set of nodes.
|
||||
This method will work for
|
||||
.BR set_mempolicy (2),
|
||||
as well.
|
||||
.\" ---------------------------------------------------------------
|
||||
.SS "Versions and Library Support"
|
||||
The
|
||||
.BR mbind (),
|
||||
|
@ -228,16 +379,18 @@ system calls were added to the Linux kernel with version 2.6.7.
|
|||
They are only available on kernels compiled with
|
||||
.BR CONFIG_NUMA .
|
||||
|
||||
Support for huge page policy was added with 2.6.16.
|
||||
For interleave policy to be effective on huge page mappings the
|
||||
policied memory needs to be tens of megabytes or larger.
|
||||
You can link with
|
||||
.I -lnuma
|
||||
to get system call definitions.
|
||||
.I libnuma
|
||||
and the required
|
||||
.I numaif.h
|
||||
header.
|
||||
are available in the
|
||||
.I numactl
|
||||
package.
|
||||
|
||||
.B MPOL_MF_MOVE
|
||||
and
|
||||
.B MPOL_MF_MOVE_ALL
|
||||
are only available on Linux 2.6.16 and later.
|
||||
|
||||
These system calls should not be used directly.
|
||||
However, applications should not use these system calls directly.
|
||||
Instead, the higher level interface provided by the
|
||||
.BR numa (3)
|
||||
functions in the
|
||||
|
@ -247,20 +400,25 @@ The
|
|||
.I numactl
|
||||
package is available at
|
||||
.IR ftp://ftp.suse.com/pub/people/ak/numa/ .
|
||||
|
||||
You can link with
|
||||
.I \-lnuma
|
||||
to get system call definitions.
|
||||
.I libnuma
|
||||
is available in the
|
||||
.I numactl
|
||||
The package is also included in some Linux distributions.
|
||||
Some distributions include the development library and header
|
||||
in the separate
|
||||
.I numactl-devel
|
||||
package.
|
||||
This package also has the
|
||||
.I numaif.h
|
||||
header.
|
||||
|
||||
Support for huge page policy was added with 2.6.16.
|
||||
For interleave policy to be effective on huge page mappings the
|
||||
policied memory needs to be tens of megabytes or larger.
|
||||
|
||||
.B MPOL_MF_MOVE
|
||||
and
|
||||
.B MPOL_MF_MOVE_ALL
|
||||
are only available on Linux 2.6.16 and later.
|
||||
.SH SEE ALSO
|
||||
.BR numa (3),
|
||||
.BR numactl (8),
|
||||
.BR set_mempolicy (2),
|
||||
.BR get_mempolicy (2),
|
||||
.BR mmap (2)
|
||||
.BR mmap (2),
|
||||
.BR set_mempolicy (2),
|
||||
.BR shmat (2),
|
||||
.BR shmget (2),
|
||||
.BR numa (3),
|
||||
.BR numactl (8)
|
||||
|
|
Loading…
Reference in New Issue