diff --git a/man2/mbind.2 b/man2/mbind.2 index 841f33642..0d8f9f482 100644 --- a/man2/mbind.2 +++ b/man2/mbind.2 @@ -18,15 +18,17 @@ .\" the source, must acknowledge the copyright and authors of this work. .\" .\" 2006-02-03, mtk, substantial wording changes and other improvements +.\" 2007-08-27, Lee Schermerhorn +.\" more precise specification of behavior. .\" -.TH MBIND 2 2006-02-07 "Linux" "Linux Programmer's Manual" +.TH MBIND 2 "2007-06-01" "SuSE Labs" "Linux Programmer's Manual" .SH NAME mbind \- Set memory policy for a memory range .SH SYNOPSIS .nf .B "#include <numaif.h>" .sp -.BI "int mbind(void *" start ", unsigned long " len ", int " policy , +.BI "int mbind(void *" start ", unsigned long " len ", int " mode , .BI " unsigned long *" nodemask ", unsigned long " maxnode , .BI " unsigned " flags ); .sp @@ -34,76 +36,178 @@ mbind \- Set memory policy for a memory range .fi .SH DESCRIPTION .BR mbind () -sets the NUMA memory -.I policy +sets the NUMA memory policy, +which consists of a policy mode and zero or more nodes, for the memory range starting with .I start and continuing for .IR len bytes. The memory of a NUMA machine is divided into multiple nodes. -The memory policy defines in which node memory is allocated. +The memory policy defines from which node memory is allocated. + +If the memory range specified by the +.IR start " and " len +arguments includes an "anonymous" region of memory\(emthat is +a region of memory created using the +.BR mmap (2) +system call with the +.BR MAP_ANONYMOUS \(emor +a memory mapped file, mapped using the +.BR mmap (2) +system call with the +.B MAP_PRIVATE +flag, pages will only be allocated according to the specified +policy when the application writes [stores] to the page. +For anonymous regions, an initial read access will use a shared +page in the kernel containing all zeros.
+For a file mapped with +.BR MAP_PRIVATE , +an initial read access will allocate pages according to the +process policy of the process that causes the page to be allocated. +This may not be the process that called +.BR mbind (). + +The specified policy will be ignored for any +.B MAP_SHARED +mappings in the specified memory range. +Rather, the pages will be allocated according to the process policy +of the process that caused the page to be allocated. +Again, this may not be the process that called +.BR mbind (). + +If the specified memory range includes a shared memory region +created using the +.BR shmget (2) +system call and attached using the +.BR shmat (2) +system call, +pages allocated for the anonymous or shared memory region will +be allocated according to the policy specified, regardless of which +process attached to the shared memory segment causes the allocation. +If, however, the shared memory region was created with the +.B SHM_HUGETLB +flag, +the huge pages will be allocated according to the policy specified +only if the page allocation is caused by the task that calls +.BR mbind () +for that region. + +By default, .BR mbind () only has an effect for new allocations; if the pages inside the range have been already touched before setting the policy, then the policy has no effect. +This default behavior may be overridden by the +.B MPOL_MF_MOVE +and +.B MPOL_MF_MOVE_ALL +flags described below. -Available policies are +The +.I mode +argument must specify one of .BR MPOL_DEFAULT , .BR MPOL_BIND , -.BR MPOL_INTERLEAVE , -and +.B MPOL_INTERLEAVE +or .BR MPOL_PREFERRED . -All policies except +All policy modes except .B MPOL_DEFAULT -require the caller to specify the nodes to which the policy applies in the +require the caller to specify via the .I nodemask -parameter. +parameter, +the node or nodes to which the mode applies. + .I nodemask -is a bit mask of nodes containing up to +points to a bitmask of nodes containing up to .I maxnode bits.
-The actual number of bytes transferred via this argument -is rounded up to the next multiple of +The bit mask size is rounded up to the next multiple of .IR "sizeof(unsigned long)" , but the kernel will only use bits up to .IR maxnode . -A NULL argument means an empty set of nodes. +A NULL value of +.I nodemask +or a +.I maxnode +value of zero specifies the empty set of nodes. +If the value of +.I maxnode +is zero, +the +.I nodemask +argument is ignored. The .B MPOL_DEFAULT -policy is the default and means to use the underlying process policy -(which can be modified with -.BR set_mempolicy (2)). -Unless the process policy has been changed this means to allocate -memory on the node of the CPU that triggered the allocation. +mode specifies that the default policy be used. +When applied to a range of memory via +.BR mbind (), +this means to use the process policy, +which may have been set with +.BR set_mempolicy (2). +If the mode of the process policy is also +.BR MPOL_DEFAULT , +the system-wide default policy will be used. +The system-wide default policy will allocate +pages on the node of the CPU that triggers the allocation. +For +.BR MPOL_DEFAULT , +the .I nodemask -should be specified as NULL. +and +.I maxnode +arguments must specify the empty set of nodes. The .B MPOL_BIND -policy is a strict policy that restricts memory allocation to the -nodes specified in +mode specifies a strict policy that restricts memory allocation to +the nodes specified in +.IR nodemask . +If +.I nodemask +specifies more than one node, page allocations will come from +the node with the lowest numeric node id first, until that node +contains no free memory. +Allocations will then come from the node with the next highest +node id specified in +.I nodemask +and so forth, until none of the specified nodes contain free memory. +Pages will not be allocated from any node not specified in the .IR nodemask . -There won't be allocations on other nodes.
+The .B MPOL_INTERLEAVE -interleaves allocations to the nodes specified in +mode specifies that page allocations be interleaved across the +set of nodes specified in .IR nodemask . -This optimizes for bandwidth instead of latency. +This optimizes for bandwidth instead of latency +by spreading out pages and memory accesses to those pages across +multiple nodes. To be effective the memory area should be fairly large, -at least 1MB or bigger. +at least 1MB, with a fairly uniform access pattern. +Accesses to a single page of the area will still be limited to +the memory bandwidth of a single node. .B MPOL_PREFERRED sets the preferred node for allocation. -The kernel will try to allocate in this +The kernel will try to allocate pages from this node first and fall back to other nodes if the preferred nodes is low on free memory. -Only the first node in the +If .I nodemask -is used. -If no node is set in the mask, then the memory is allocated on -the node of the CPU that triggered the allocation allocation). +specifies more than one node id, the first node in the +mask will be selected as the preferred node. +If the +.I nodemask +and +.I maxnode +arguments specify the empty set, then the memory is allocated on +the node of the CPU that triggered the allocation. +This is the only way to specify "local allocation" for a +range of memory via +.BR mbind (2). If .B MPOL_MF_STRICT @@ -115,17 +219,18 @@ is not .BR MPOL_DEFAULT , then the call will fail with the error .B EIO -if the existing pages in the mapping don't follow the policy. -In 2.6.16 or later the kernel will also try to move pages -to the requested node with this flag. +if the existing pages in the memory range don't follow the policy. +.\" According to the kernel code, the following is not true --lts +.\" In 2.6.16 or later the kernel will also try to move pages +.\" to the requested node with this flag.
If .B MPOL_MF_MOVE -is passed in +is specified in .IR flags , -then an attempt will be made to -move all the pages in the mapping so that they follow the policy. -Pages that are shared with other processes are not moved. +then the kernel will attempt to move all the existing pages +in the memory range so that they follow the policy. +Pages that are shared with other processes will not be moved. If .B MPOL_MF_STRICT is also specified, then the call will fail with the error @@ -136,8 +241,8 @@ If .B MPOL_MF_MOVE_ALL is passed in .IR flags , -then all pages in the mapping will be moved regardless of whether -other processes use the pages. +then the kernel will attempt to move all existing pages in the memory range +regardless of whether other processes use the pages. The calling process must be privileged .RB ( CAP_SYS_NICE ) to use this flag. @@ -146,6 +251,7 @@ If is also specified, then the call will fail with the error .B EIO if some pages could not be moved. +.\" --------------------------------------------------------------- .SH RETURN VALUE On success, .BR mbind () @@ -153,11 +259,9 @@ returns 0; on error, \-1 is returned and .I errno is set to indicate the error. +.\" --------------------------------------------------------------- .SH ERRORS -.TP -.B EFAULT -There was a unmapped hole in the specified memory range -or a passed pointer was not valid. +.\" I think I got all of the error returns. --lts .TP .B EINVAL An invalid value was specified for @@ -169,55 +273,102 @@ or was less than .IR start ; or -.I policy -was -.B MPOL_DEFAULT +.I start +is not a multiple of the system page size. +Or, +.I mode +is +.B MPOL_DEFAULT and .I nodemask -pointed to a non-empty set; +specifies a non-empty set; or -.I policy -was -.B MPOL_BIND +.I mode +is +.B MPOL_BIND or -.B MPOL_INTERLEAVE +.B MPOL_INTERLEAVE and .I nodemask -pointed to an empty set, +is empty. +Or, +.I maxnode +specifies more than a page's worth of bits.
+Or, +.I nodemask +specifies one or more node ids that are +greater than the maximum supported node id, +or are not allowed in the calling task's context. +.\" "calling task's context" refers to cpusets. No man page avail to ref. --lts +Or, none of the node ids specified by +.I nodemask +are on-line, or none of the specified nodes contain memory. +.TP +.B EFAULT +Part or all of the memory range specified by +.I nodemask +and +.I maxnode +points outside your accessible address space. +Or, there was an unmapped hole in the specified memory range. .TP .B EIO .B MPOL_MF_STRICT was specified and an existing page was already on a node -that does not follow the policy. +that does not follow the policy; +or +.B MPOL_MF_MOVE +or +.B MPOL_MF_MOVE_ALL +was specified and the kernel was unable to move all existing +pages in the range. .TP .B ENOMEM -System out of memory. -.SH CONFORMING TO -This system call is Linux specific. +Insufficient kernel memory was available. +.TP +.B EPERM +The +.I flags +argument included the +.B MPOL_MF_MOVE_ALL +flag and the caller does not have the +.B CAP_SYS_NICE +privilege. +.\" --------------------------------------------------------------- .SH NOTES -NUMA policy is not supported on file mappings. +NUMA policy is not supported on a memory mapped file range +that was mapped with the +.I MAP_SHARED +flag. .B MPOL_MF_STRICT -is ignored on huge page mappings right now. +is ignored on huge page mappings. -It is unfortunate that the same flag, +The .BR MPOL_DEFAULT , -has different effects for +mode has different effects for .BR mbind (2) and .BR set_mempolicy (2). -To select "allocation on the node of the CPU that -triggered the allocation" (like -.BR set_mempolicy (2) -.BR MPOL_DEFAULT ) -when calling +When +.B MPOL_DEFAULT +is specified for a range of memory using .BR mbind (), +any pages subsequently allocated for that range will use +the process's policy, as set by +.BR set_mempolicy (2).
+This effectively removes the explicit policy from the +specified range. +To select "local allocation" for a memory range, specify a -.I policy +.I mode of .B MPOL_PREFERRED -with an empty -.IR nodemask . +with an empty set of nodes. +This method will work for +.BR set_mempolicy (2), +as well. +.\" --------------------------------------------------------------- .SS "Versions and Library Support" The .BR mbind (), @@ -228,16 +379,18 @@ system calls were added to the Linux kernel with version 2.6.7. They are only available on kernels compiled with .BR CONFIG_NUMA . -Support for huge page policy was added with 2.6.16. -For interleave policy to be effective on huge page mappings the -policied memory needs to be tens of megabytes or larger. +You can link with +.I \-lnuma +to get system call definitions. +.I libnuma +and the required +.I numaif.h +header +are available in the +.I numactl +package. -.B MPOL_MF_MOVE -and -.B MPOL_MF_MOVE_ALL -are only available on Linux 2.6.16 and later. - -These system calls should not be used directly. +However, applications should not use these system calls directly. Instead, the higher level interface provided by the .BR numa (3) functions in the @@ -247,20 +400,25 @@ The .I numactl package is available at .IR ftp://ftp.suse.com/pub/people/ak/numa/ . - -You can link with -.I \-lnuma -to get system call definitions. -.I libnuma -is available in the -.I numactl +The package is also included in some Linux distributions. +Some distributions include the development library and header +in the separate +.I numactl\-devel package. -This package also has the -.I numaif.h -header. + +Support for huge page policy was added with 2.6.16. +For interleave policy to be effective on huge page mappings the +policied memory needs to be tens of megabytes or larger. + +.B MPOL_MF_MOVE +and +.B MPOL_MF_MOVE_ALL +are only available on Linux 2.6.16 and later.
.SH SEE ALSO -.BR numa (3), -.BR numactl (8), -.BR set_mempolicy (2), .BR get_mempolicy (2), -.BR mmap (2) +.BR mmap (2), +.BR set_mempolicy (2), +.BR shmat (2), +.BR shmget (2), +.BR numa (3), +.BR numactl (8)
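
For reviewers who want to see the nodemask/maxnode wording and the "local allocation" note in concrete terms, here is a minimal C sketch. It is not part of the patch. The helper names (nodemask_longs, nodemask_set, set_local_allocation) and the EX_MPOL_PREFERRED constant are hypothetical, introduced only for illustration; the constant is assumed to match the kernel's MPOL_PREFERRED value, which real code would take from numaif.h. The sketch calls the raw system call so that it does not depend on libnuma being installed, although the page itself recommends the numa(3) interface for applications.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed to equal MPOL_PREFERRED from <numaif.h>; hypothetical name. */
#define EX_MPOL_PREFERRED 1

#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

/* Number of unsigned longs needed to hold maxnode bits: the mask size
 * is rounded up to a multiple of sizeof(unsigned long). */
size_t nodemask_longs(unsigned long maxnode)
{
    return (maxnode + BITS_PER_ULONG - 1) / BITS_PER_ULONG;
}

/* Set the bit for one node id in a caller-provided mask.
 * Returns 0 on success, -1 if the node id does not fit in maxnode bits. */
int nodemask_set(unsigned long *mask, unsigned long maxnode,
                 unsigned long node)
{
    assert(mask != NULL);
    if (node >= maxnode)
        return -1;
    mask[node / BITS_PER_ULONG] |= 1UL << (node % BITS_PER_ULONG);
    return 0;
}

/* Request "local allocation" for [start, start + len): MPOL_PREFERRED
 * with an empty set of nodes (NULL nodemask, maxnode of zero).
 * Returns 0 on success, or the errno value reported by the kernel. */
int set_local_allocation(void *start, unsigned long len)
{
    long rc = syscall(SYS_mbind, start, len, (long)EX_MPOL_PREFERRED,
                      (unsigned long *)NULL, 0UL, 0U);
    return rc == -1 ? errno : 0;
}
```

A caller would typically pass a page-aligned region obtained from mmap(2) to set_local_allocation() and treat a nonzero return as one of the errors listed in the ERRORS section; on kernels built without CONFIG_NUMA, or in sandboxes that filter the call, a failure is expected.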