mirror of https://github.com/mkerrisk/man-pages
vdso.7: New page documenting the vDSO mapped into each process by the kernel
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
commit 2800db8247
parent 10850212d0

@@ -0,0 +1,457 @@
.\" Written by Mike Frysinger <vapier@gentoo.org>
.\"
.\" %%%LICENSE_START(PUBLIC_DOMAIN)
.\" This page is in the public domain.
.\" %%%LICENSE_END
.\"
.TH VDSO 7 2013-04-09 "Linux" "Linux Programmer's Manual"
.SH NAME
vDSO \- overview of the virtual ELF dynamic shared object
.SH SYNOPSIS
.B #include <sys/auxv.h>

.B void *vdso = (void *) getauxval(AT_SYSINFO_EHDR);
.SH DESCRIPTION
The "vDSO" is a small shared library that the kernel automatically maps into the
address space of all user-space applications.
Applications usually do not need to concern themselves with these details
as the vDSO is most commonly called by the C library.
This way you can write code using standard functions and the C library will
take care of using any functionality that is available via the vDSO.

Why does the vDSO exist at all?
There are some system calls the kernel provides that user-space code ends up
using frequently, to the point that such calls can dominate overall performance.
This is due both to the frequency of the call as well as the context-switch
overhead that results from exiting user space and entering the kernel.

The rest of this documentation is geared toward the curious and/or C library
writers rather than general developers.
If you're trying to call the vDSO in your own application rather than using
the C library, you're most likely doing it wrong.
.SS Example background
Making system calls can be slow.
On x86 32-bit systems, you can trigger a software interrupt (int $0x80) to tell
the kernel you wish to make a system call.
However, this instruction is expensive: it goes through the full
interrupt-handling paths in the processor's microcode as well as in the kernel.
Newer processors have faster (but backward-incompatible) instructions to
initiate system calls.
Rather than require the C library to figure out at run time whether this
functionality is available, the C library can use functions provided by the
kernel in the vDSO.

Note that the terminology can be confusing.
On x86 systems, the vDSO function used to determine the preferred method of
making a system call is named "__kernel_vsyscall", but on x86_64,
the term "vsyscall" also refers to an obsolete way to ask the kernel what time
it is or what CPU the caller is on.

One frequently used system call is
.BR gettimeofday (2).
It is called both directly by user-space applications and indirectly by
the C library.
Think timestamps or timing loops or polling -- all of these frequently need to
know what time it is right now.
This information is also not secret -- any application, whether it runs as
root or as an unprivileged user, will get the same answer.
Thus the kernel arranges for the information required to answer this question
to be placed in memory the process can access.
Now a call to gettimeofday() changes from a system call to a normal function
call and a few memory accesses.
.SS Finding the vDSO
The base address of the vDSO (if one exists) is passed by the kernel to each
program in the initial auxiliary vector, via the
.B AT_SYSINFO_EHDR
tag.

You must not assume the vDSO is mapped at any particular location in the
user's memory map.
The base address will usually be randomized at run time every time a new
process image is created (at
.BR execve (2)
time).
This is done for security reasons, to prevent standard "return-to-libc" attacks.
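
One way to see the randomization is to print where the vDSO landed in the
current process and compare the result across several runs.
The sketch below assumes that /proc/self/maps is available (see proc(5)) and
that the kernel labels the mapping "[vdso]"; both are assumptions about the
running system, and find_vdso_mapping is a hypothetical helper name.

```c
#include <stdio.h>
#include <string.h>

/* Scan /proc/self/maps for the "[vdso]" mapping.
   Returns 1 and fills *start and *end on success, 0 if no such line exists. */
static int find_vdso_mapping(unsigned long *start, unsigned long *end)
{
    FILE *fp = fopen("/proc/self/maps", "r");
    char line[512];
    int found = 0;

    if (fp == NULL)
        return 0;

    while (fgets(line, sizeof(line), fp) != NULL) {
        /* Each line begins with "start-end" in hexadecimal. */
        if (strstr(line, "[vdso]") != NULL &&
            sscanf(line, "%lx-%lx", start, end) == 2) {
            found = 1;
            break;
        }
    }
    fclose(fp);
    return found;
}
```

Printing the returned range from main() in two separate invocations of the
same binary will usually show two different addresses when address-space
randomization is enabled.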

For some architectures, there is also an
.B AT_SYSINFO
tag.
This is used only for locating the vsyscall entry point and is frequently
omitted or set to 0 (meaning it's not available).
This tag is a throwback to the initial vDSO work (see
.I History
below) and its use should be avoided.

Refer to
.BR getauxval (3)
for more details on accessing these fields.
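
As a minimal sketch of the lookup described in this section, the following
helpers fetch the base address with getauxval(3) and check that it points at
an ELF header.
The function names vdso_base and looks_like_elf are illustrative, and the
code assumes a C library that provides <sys/auxv.h>.

```c
#include <elf.h>
#include <stdint.h>
#include <string.h>
#include <sys/auxv.h>

/* Return the vDSO base address from the auxiliary vector,
   or NULL if the kernel did not provide one. */
static const void *vdso_base(void)
{
    return (const void *) (uintptr_t) getauxval(AT_SYSINFO_EHDR);
}

/* Check that the mapping starts with the 4-byte ELF magic number. */
static int looks_like_elf(const void *base)
{
    return base != NULL && memcmp(base, ELFMAG, SELFMAG) == 0;
}
```

A caller would typically treat a NULL result from vdso_base() as "no vDSO on
this system" and fall back to ordinary system calls.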
.SS File format
Since the vDSO is a fully formed ELF image, you can do symbol lookups on it.
This allows new symbols to be added with newer kernel releases, and allows the
C library to detect available functionality at run time when running under
different kernel versions.
Often the C library will do detection with the first call and then
cache the result for subsequent calls.

All symbols are also versioned (using the GNU version format).
This allows the kernel to update the function signature without breaking
backward compatibility.
This means changing the arguments that the function accepts as well as the
return value.
Thus, when looking up a symbol in the vDSO, you must always include the version
to match the ABI you expect.

Typically the vDSO follows the naming convention of prefixing all symbols with
"__vdso_" or "__kernel_" so as to distinguish them from other standard symbols.
For example, the "gettimeofday" function is named "__vdso_gettimeofday".

You use the standard C calling conventions when calling any of these functions.
No need to worry about weird register or stack behavior.
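
Because the vDSO is an ordinary ELF image, its dynamic symbol table can be
walked in place.
The sketch below lists the defined symbol names and makes several
simplifying assumptions: it relies on a DT_HASH table with 32-bit words to
learn the symbol count (some ABIs use a different layout or provide only a
GNU-style hash), and it ignores symbol versions, which a real lookup must
check as described above.
list_vdso_symbols is a hypothetical helper, not an existing API.

```c
#include <elf.h>
#include <link.h>      /* ElfW() selects Elf32_* or Elf64_* for this ABI */
#include <stdint.h>
#include <stdio.h>
#include <sys/auxv.h>

/* Walk the vDSO's dynamic symbol table; print each defined symbol when
   print is nonzero.  Returns the number of defined symbols, or -1 if there
   is no vDSO, no PT_DYNAMIC segment, or no DT_HASH table. */
static int list_vdso_symbols(int print)
{
    uintptr_t base = (uintptr_t) getauxval(AT_SYSINFO_EHDR);
    if (base == 0)
        return -1;

    const ElfW(Ehdr) *ehdr = (const ElfW(Ehdr) *) base;
    const ElfW(Phdr) *phdr = (const ElfW(Phdr) *) (base + ehdr->e_phoff);
    const ElfW(Dyn) *dyn = NULL;
    uintptr_t bias = base;   /* run-time address minus link-time address */
    int seen_load = 0;

    for (int i = 0; i < ehdr->e_phnum; i++) {
        if (phdr[i].p_type == PT_LOAD && !seen_load) {
            bias = base + phdr[i].p_offset - phdr[i].p_vaddr;
            seen_load = 1;
        } else if (phdr[i].p_type == PT_DYNAMIC) {
            dyn = (const ElfW(Dyn) *) (base + phdr[i].p_offset);
        }
    }
    if (dyn == NULL)
        return -1;

    const ElfW(Sym) *symtab = NULL;
    const char *strtab = NULL;
    const uint32_t *hash = NULL;   /* assumption: 32-bit hash words */

    /* The addresses stored in the dynamic section are link-time values. */
    for (; dyn->d_tag != DT_NULL; dyn++) {
        const void *p = (const void *) (bias + dyn->d_un.d_ptr);
        if (dyn->d_tag == DT_SYMTAB)
            symtab = p;
        else if (dyn->d_tag == DT_STRTAB)
            strtab = p;
        else if (dyn->d_tag == DT_HASH)
            hash = p;
    }
    if (symtab == NULL || strtab == NULL || hash == NULL)
        return -1;

    int count = 0;
    uint32_t nsyms = hash[1];   /* nchain equals the number of symbols */
    for (uint32_t i = 0; i < nsyms; i++) {
        if (symtab[i].st_shndx == SHN_UNDEF || symtab[i].st_name == 0)
            continue;           /* skip the null entry and undefined symbols */
        if (print)
            printf("%s\n", strtab + symtab[i].st_name);
        count++;
    }
    return count;
}
```

Calling list_vdso_symbols(1) from main() prints one name per line; the names
that appear depend on the architecture and kernel version.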
.SH NOTES
.SS Source
When you compile the kernel, it will automatically compile and link the vDSO
code for you.
You will frequently find it under the architecture-specific directory:

    find arch/$ARCH/ -name '*vdso*.so*' -o -name '*gate*.so*'

Note that the vDSO that is used is based on the ABI of your user-space code
and not the ABI of the kernel.
For example, if you run an i386 32-bit ELF binary, you'll get the same vDSO
whether it runs under an i386 32-bit kernel or under an x86_64 64-bit kernel.
So when referring to sections below, use the user-space ABI.
.SS vDSO names
The name of this shared object varies across architectures.
It will often show up in things like glibc's
.BR ldd (1)
output.
The exact name should not matter to any code, so do not hardcode it.
.if t \{\
.ft CW
\}
.TS
l l.
user ABI	vDSO name
_
aarch64	linux-vdso.so.1
ia64	linux-gate.so.1
ppc/32	linux-vdso32.so.1
ppc/64	linux-vdso64.so.1
s390	linux-vdso32.so.1
s390x	linux-vdso64.so.1
sh	linux-gate.so.1
i386	linux-gate.so.1
x86_64	linux-vdso.so.1
x86/x32	linux-vdso.so.1
.TE
.if t \{\
.in
.ft P
\}
.SS arm functions
.\" See linux/arch/arm/kernel/entry-armv.S
.\" See linux/Documentation/arm/kernel_user_helpers.txt
The arm port has a code page full of utility functions.
Since it's just a raw page of code, there is no ELF information for doing
symbol lookups or versioning.
It does provide support for different versions though.

For information on this code page, it's best to refer to the kernel
documentation, as it's extremely detailed and covers everything you need to
know:
.br
Documentation/arm/kernel_user_helpers.txt
.SS aarch64 functions
.\" See linux/arch/arm64/kernel/vdso/vdso.lds.S
.if t \{\
.ft CW
\}
.TS
l l.
symbol	version
_
__kernel_rt_sigreturn	LINUX_2.6.39
__kernel_gettimeofday	LINUX_2.6.39
__kernel_clock_gettime	LINUX_2.6.39
__kernel_clock_getres	LINUX_2.6.39
.TE
.if t \{\
.in
.ft P
\}
.SS bfin (Blackfin) functions
.\" See linux/arch/blackfin/kernel/fixed_code.S
.\" See http://docs.blackfin.uclinux.org/doku.php?id=linux-kernel:fixed-code
As this CPU lacks a memory management unit (MMU), it doesn't set up a vDSO in
the normal sense.
Instead, it maps at boot time a few raw functions into a fixed location in
memory.
User-space applications then call directly into that region.
There is no provision for backward compatibility beyond sniffing raw opcodes,
but as this is an embedded CPU, it can get away with things -- some of the
object formats it runs aren't even ELF based (they're bFLT/FLAT).

For information on this code page, it's best to refer to the public
documentation:
.br
http://docs.blackfin.uclinux.org/doku.php?id=linux-kernel:fixed-code
.SS ia64 (Itanium) functions
.\" See linux/arch/ia64/kernel/gate.lds.S
.\" Also linux/arch/ia64/kernel/fsys.S and linux/Documentation/ia64/fsys.txt
.if t \{\
.ft CW
\}
.TS
l l.
symbol	version
_
__kernel_sigtramp	LINUX_2.5
__kernel_syscall_via_break	LINUX_2.5
__kernel_syscall_via_epc	LINUX_2.5
.TE
.if t \{\
.in
.ft P
\}

The Itanium port is somewhat tricky.
In addition to the vDSO above, it also has "light-weight system calls" (also
known as "fast syscalls" or "fsys").
You can invoke these via the __kernel_syscall_via_epc vDSO helper.
The system calls listed here have the same semantics as if you called them
directly via
.BR syscall (2),
so refer to the relevant
documentation for each.
The table below lists the functions available via this mechanism.
.if t \{\
.ft CW
\}
.TS
l.
function
_
clock_gettime
getcpu
getpid
getppid
gettimeofday
set_tid_address
.TE
.if t \{\
.in
.ft P
\}
.SS parisc (hppa) functions
.\" See linux/arch/parisc/kernel/syscall.S
.\" See linux/Documentation/parisc/registers
The parisc port has a code page full of utility functions called a gateway page.
Rather than use the normal ELF auxiliary vector approach, it passes the address
of the page to the process via the SR2 register.
The permissions on the page are such that merely executing those addresses
automatically executes with kernel privileges and not in user space.
This is done to match the way HP-UX works.

Since it's just a raw page of code, there is no ELF information for doing
symbol lookups or versioning.
Simply call into the appropriate offset via the branch instruction, for example:
.br
ble <offset>(%sr2, %r0)
.if t \{\
.ft CW
\}
.TS
l l.
offset	function
_
00b0	lws_entry
00e0	set_thread_pointer
0100	linux_gateway_entry (syscall)
0268	syscall_nosys
0274	tracesys
0324	tracesys_next
0368	tracesys_exit
03a0	tracesys_sigexit
03b8	lws_start
03dc	lws_exit_nosys
03e0	lws_exit
03e4	lws_compare_and_swap64
03e8	lws_compare_and_swap
0404	cas_wouldblock
0410	cas_action
.TE
.if t \{\
.in
.ft P
\}
.SS ppc/32 functions
.\" See linux/arch/powerpc/kernel/vdso32/vdso32.lds.S
The functions marked with a
.I *
below are only available when the kernel is
a powerpc64 (64-bit) kernel.
.if t \{\
.ft CW
\}
.TS
l l.
symbol	version
_
__kernel_clock_getres	LINUX_2.6.15
__kernel_clock_gettime	LINUX_2.6.15
__kernel_datapage_offset	LINUX_2.6.15
__kernel_get_syscall_map	LINUX_2.6.15
__kernel_get_tbfreq	LINUX_2.6.15
__kernel_getcpu \fI*\fR	LINUX_2.6.15
__kernel_gettimeofday	LINUX_2.6.15
__kernel_sigtramp_rt32	LINUX_2.6.15
__kernel_sigtramp32	LINUX_2.6.15
__kernel_sync_dicache	LINUX_2.6.15
__kernel_sync_dicache_p5	LINUX_2.6.15
.TE
.if t \{\
.in
.ft P
\}
.SS ppc/64 functions
.\" See linux/arch/powerpc/kernel/vdso64/vdso64.lds.S
.if t \{\
.ft CW
\}
.TS
l l.
symbol	version
_
__kernel_clock_getres	LINUX_2.6.15
__kernel_clock_gettime	LINUX_2.6.15
__kernel_datapage_offset	LINUX_2.6.15
__kernel_get_syscall_map	LINUX_2.6.15
__kernel_get_tbfreq	LINUX_2.6.15
__kernel_getcpu	LINUX_2.6.15
__kernel_gettimeofday	LINUX_2.6.15
__kernel_sigtramp_rt64	LINUX_2.6.15
__kernel_sync_dicache	LINUX_2.6.15
__kernel_sync_dicache_p5	LINUX_2.6.15
.TE
.if t \{\
.in
.ft P
\}
.SS s390 functions
.\" See linux/arch/s390/kernel/vdso32/vdso32.lds.S
.if t \{\
.ft CW
\}
.TS
l l.
symbol	version
_
__kernel_clock_getres	LINUX_2.6.29
__kernel_clock_gettime	LINUX_2.6.29
__kernel_gettimeofday	LINUX_2.6.29
.TE
.if t \{\
.in
.ft P
\}
.SS s390x functions
.\" See linux/arch/s390/kernel/vdso64/vdso64.lds.S
.if t \{\
.ft CW
\}
.TS
l l.
symbol	version
_
__kernel_clock_getres	LINUX_2.6.29
__kernel_clock_gettime	LINUX_2.6.29
__kernel_gettimeofday	LINUX_2.6.29
.TE
.if t \{\
.in
.ft P
\}
.SS sh (SuperH) functions
.\" See linux/arch/sh/kernel/vsyscall/vsyscall.lds.S
.if t \{\
.ft CW
\}
.TS
l l.
symbol	version
_
__kernel_rt_sigreturn	LINUX_2.6
__kernel_sigreturn	LINUX_2.6
__kernel_vsyscall	LINUX_2.6
.TE
.if t \{\
.in
.ft P
\}
.SS i386 functions
.\" See linux/arch/x86/vdso/vdso32/vdso32.lds.S
.if t \{\
.ft CW
\}
.TS
l l.
symbol	version
_
__kernel_sigreturn	LINUX_2.5
__kernel_rt_sigreturn	LINUX_2.5
__kernel_vsyscall	LINUX_2.5
.TE
.if t \{\
.in
.ft P
\}
.SS x86_64 functions
.\" See linux/arch/x86/vdso/vdso.lds.S
All of these symbols are also available without the "__vdso_" prefix, but
you should ignore those and stick to the names below.
.if t \{\
.ft CW
\}
.TS
l l.
symbol	version
_
__vdso_clock_gettime	LINUX_2.6
__vdso_getcpu	LINUX_2.6
__vdso_gettimeofday	LINUX_2.6
__vdso_time	LINUX_2.6
.TE
.if t \{\
.in
.ft P
\}
.SS x86/x32 functions
.\" See linux/arch/x86/vdso/vdsox32.lds.S
.if t \{\
.ft CW
\}
.TS
l l.
symbol	version
_
__vdso_clock_gettime	LINUX_2.6
__vdso_getcpu	LINUX_2.6
__vdso_gettimeofday	LINUX_2.6
__vdso_time	LINUX_2.6
.TE
.if t \{\
.in
.ft P
\}
.SS History
The vDSO was originally just a single function: the vsyscall.
In older kernels, you might see that name in a process's memory map rather than
"vdso".
Over time, people realized that this mechanism was a great way to pass more
functionality to user space, so it was reconceived as a vDSO in the current
format.
.SH SEE ALSO
.BR syscalls (2),
.BR getauxval (3),
.BR proc (5)

The documents, examples, and source code in the Linux source tree:
.nf
Documentation/ABI/stable/vdso
Documentation/ia64/fsys.txt
Documentation/vDSO/* (includes examples of using the vDSO)
find arch/ -iname '*vdso*' -o -iname '*gate*'
.fi