proc.5: Document inaccurate RSS due to SPLIT_RSS_COUNTING

[mtk: Manually applied patch, because of conflicts with other
merged changes; also added an edit suggested by Jann; see the
thread at
https://lore.kernel.org/linux-man/20201012114940.1317510-1-jannh@google.com/]

Since commit 34e55232e59f7b19050267a05ff1226e5cd122a5 (introduced in
v2.6.34), Linux uses per-thread RSS counters to reduce cache
contention on the per-mm counters. With a 4K page size, this means
that the per-mm counters can be off by up to 252KiB per thread.
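
The 252KiB figure is plain arithmetic. Below is a minimal sketch,
assuming that at most 63 page-sized updates can sit unsynced in one
thread's private counter. The value 63 is inferred here from
252KiB / 4KiB and is not quoted from the kernel source; the macro and
function names are hypothetical.

```c
#include <assert.h>

/* Assumption: 63 = 252 KiB / 4 KiB, inferred from the text above,
 * not taken from the kernel source. */
#define MAX_UNSYNCED_PAGES 63L

/* Worst-case gap (in KiB) between the per-mm RSS counter and the real
 * RSS, for a given page size and number of threads. */
static long max_rss_drift_kib(long page_size_bytes, long nthreads)
{
    assert(page_size_bytes >= 1024 && nthreads >= 0);
    return MAX_UNSYNCED_PAGES * (page_size_bytes / 1024) * nthreads;
}
```

Under this assumption, max_rss_drift_kib(4096, 1) gives 252, and the
bound grows linearly with the number of threads.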

Example:

$ cat rsstest.c
#include <stdlib.h>
#include <err.h>
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/eventfd.h>
#include <sys/prctl.h>
void dump(int pid) {
  char cmd[1000];
  sprintf(cmd,
    "grep '^VmRSS' /proc/%d/status;"
    "grep '^Rss:' /proc/%d/smaps_rollup;"
    "echo",
    pid, pid
  );
  system(cmd);
}
int main(void) {
  eventfd_t dummy;
  int child_wait = eventfd(0, EFD_SEMAPHORE|EFD_CLOEXEC);
  int child_resume = eventfd(0, EFD_SEMAPHORE|EFD_CLOEXEC);
  if (child_wait == -1 || child_resume == -1) err(1, "eventfd");
  pid_t child = fork();
  if (child == -1) err(1, "fork");
  if (child == 0) {
    if (prctl(PR_SET_PDEATHSIG, SIGKILL)) err(1, "PDEATHSIG");
    if (getppid() == 1) exit(0);
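    /* Map 80 pages of anonymous memory; nothing is faulted in yet. */
    char *mapping = mmap(NULL, 80 * 0x1000, PROT_READ|PROT_WRITE,
                         MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    if (mapping == MAP_FAILED) err(1, "mmap");
    eventfd_write(child_wait, 1);
    eventfd_read(child_resume, &dummy);
    /* Touch the first 40 pages (160 KiB with 4K pages). */
    for (int i=0; i<40; i++) mapping[0x1000 * i] = 1;
    eventfd_write(child_wait, 1);
    eventfd_read(child_resume, &dummy);
    /* Touch the remaining 40 pages. */
    for (int i=40; i<80; i++) mapping[0x1000 * i] = 1;
    eventfd_write(child_wait, 1);
    eventfd_read(child_resume, &dummy);
    exit(0);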
  }

  eventfd_read(child_wait, &dummy);
  dump(child);
  eventfd_write(child_resume, 1);

  eventfd_read(child_wait, &dummy);
  dump(child);
  eventfd_write(child_resume, 1);

  eventfd_read(child_wait, &dummy);
  dump(child);
  eventfd_write(child_resume, 1);

  exit(0);
}
$ gcc -o rsstest rsstest.c && ./rsstest
VmRSS:	      68 kB
Rss:                 616 kB

VmRSS:	      68 kB
Rss:                 776 kB

VmRSS:	     812 kB
Rss:                 936 kB

$

Let's document that those counters aren't entirely accurate.
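
For callers that need to compare the two sources programmatically, the
shell pipeline in dump() can be replaced by a small parser. This is
only a sketch: read_kb_field() is a hypothetical helper, not part of
any library, and it assumes the "Key:   <number> kB" line layout shown
in the output above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return the numeric value (in kB) of the first line in `path` that
 * starts with `key`, e.g. "VmRSS:" in /proc/<pid>/status or "Rss:" in
 * /proc/<pid>/smaps_rollup. Returns -1 if the file cannot be opened,
 * the key is not found, or no number follows the key. */
static long read_kb_field(const char *path, const char *key)
{
    assert(path != NULL && key != NULL);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    char line[256];
    long value = -1;
    size_t keylen = strlen(key);

    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, key, keylen) == 0) {
            /* %ld skips the blanks or tab that pad the value. */
            if (sscanf(line + keylen, "%ld", &value) != 1)
                value = -1;
            break;
        }
    }
    fclose(f);
    return value;
}
```

Calling read_kb_field("/proc/self/smaps_rollup", "Rss:") and
read_kb_field("/proc/self/status", "VmRSS:") back to back gives the
accurate and the approximate figure for the calling process.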

Reported-by: Mark Mossberg <mark.mossberg@gmail.com>
Signed-off-by: Jann Horn <jannh@google.com>

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Jann Horn 2020-10-27 14:35:35 +01:00 committed by Michael Kerrisk
parent 14948ad6ec
commit 20e43cd694
1 changed file with 34 additions and 2 deletions


@@ -2265,6 +2265,9 @@ This is just the pages which
 count toward text, data, or stack space.
 This does not include pages
 which have not been demand-loaded in, or which are swapped out.
+This value is inaccurate; see
+.I /proc/[pid]/statm
+below.
 .TP
 (25) \fIrsslim\fP \ %lu
 Current soft limit in bytes on the rss of the process;
@@ -2409,10 +2412,11 @@ The columns are:
 size (1) total program size
 (same as VmSize in \fI/proc/[pid]/status\fP)
 resident (2) resident set size
-(same as VmRSS in \fI/proc/[pid]/status\fP)
+(inaccurate; same as VmRSS in \fI/proc/[pid]/status\fP)
 shared (3) number of resident shared pages
 (i.e., backed by a file)
-(same as RssFile+RssShmem in \fI/proc/[pid]/status\fP)
+(inaccurate; same as RssFile+RssShmem in
+\fI/proc/[pid]/status\fP)
 text (4) text (code)
 .\" (not including libs; broken, includes data segment)
 lib (5) library (unused since Linux 2.6; always 0)
@@ -2421,6 +2425,16 @@ data (6) data + stack
 dt (7) dirty pages (unused since Linux 2.6; always 0)
 .EE
 .in
+.IP
+.\" See SPLIT_RSS_COUNTING in the kernel.
+.\" Inaccuracy is bounded by TASK_RSS_EVENTS_THRESH.
+Some of these values are inaccurate because
+of a kernel-internal scalability optimization.
+If accurate values are required, use
+.I /proc/[pid]/smaps
+or
+.I /proc/[pid]/smaps_rollup
+instead, which are much slower but provide accurate, detailed information.
 .TP
 .I /proc/[pid]/status
 Provides much of the information in
@@ -2597,6 +2611,9 @@ directly access physical memory.
 .TP
 .IR VmHWM
 Peak resident set size ("high water mark").
+This value is inaccurate; see
+.I /proc/[pid]/statm
+above.
 .TP
 .IR VmRSS
 Resident set size.
@@ -2605,16 +2622,25 @@ Note that the value here is the sum of
 .IR RssFile ,
 and
 .IR RssShmem .
+This value is inaccurate; see
+.I /proc/[pid]/statm
+above.
 .TP
 .IR RssAnon
 Size of resident anonymous memory.
 .\" commit bf9683d6990589390b5178dafe8fd06808869293
 (since Linux 4.5).
+This value is inaccurate; see
+.I /proc/[pid]/statm
+above.
 .TP
 .IR RssFile
 Size of resident file mappings.
 .\" commit bf9683d6990589390b5178dafe8fd06808869293
 (since Linux 4.5).
+This value is inaccurate; see
+.I /proc/[pid]/statm
+above.
 .TP
 .IR RssShmem
 Size of resident shared memory (includes System V shared memory,
@@ -2626,6 +2652,9 @@ and shared anonymous mappings).
 .TP
 .IR VmData ", " VmStk ", " VmExe
 Size of data, stack, and text segments.
+This value is inaccurate; see
+.I /proc/[pid]/statm
+above.
 .TP
 .IR VmLib
 Shared library code size.
@@ -2641,6 +2670,9 @@ Size of second-level page tables (added in Linux 4.0; removed in Linux 4.15).
 .\" commit b084d4353ff99d824d3bc5a5c2c22c70b1fba722
 Swapped-out virtual memory size by anonymous private pages;
 shmem swap usage is not included (since Linux 2.6.34).
+This value is inaccurate; see
+.I /proc/[pid]/statm
+above.
 .TP
 .IR HugetlbPages
 Size of hugetlb memory portions