mirror of https://github.com/mkerrisk/man-pages
551 lines
15 KiB
Groff
551 lines
15 KiB
Groff
.\" Copyright (C) 2003 Davide Libenzi
|
|
.\"
|
|
.\" %%%LICENSE_START(GPLv2+_SW_3_PARA)
|
|
.\" This program is free software; you can redistribute it and/or modify
|
|
.\" it under the terms of the GNU General Public License as published by
|
|
.\" the Free Software Foundation; either version 2 of the License, or
|
|
.\" (at your option) any later version.
|
|
.\"
|
|
.\" This program is distributed in the hope that it will be useful,
|
|
.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
|
|
.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
|
.\" GNU General Public License for more details.
|
|
.\"
|
|
.\" You should have received a copy of the GNU General Public
|
|
.\" License along with this manual; if not, see
|
|
.\" <http://www.gnu.org/licenses/>.
|
|
.\" %%%LICENSE_END
|
|
.\"
|
|
.\" Davide Libenzi <davidel@xmailserver.org>
|
|
.\"
|
|
.TH EPOLL 7 2012-04-17 "Linux" "Linux Programmer's Manual"
|
|
.SH NAME
|
|
epoll \- I/O event notification facility
|
|
.SH SYNOPSIS
|
|
.B #include <sys/epoll.h>
|
|
.SH DESCRIPTION
|
|
The
|
|
.B epoll
|
|
API performs a similar task to
|
|
.BR poll (2):
|
|
monitoring multiple file descriptors to see if I/O is possible on any of them.
|
|
The
|
|
.B epoll
|
|
API can be used either as an edge-triggered or a level-triggered
|
|
interface and scales well to large numbers of watched file descriptors.
|
|
The following system calls are provided to
|
|
create and manage an
|
|
.B epoll
|
|
instance:
|
|
.IP * 3
|
|
.BR epoll_create (2)
|
|
creates an
|
|
.B epoll
|
|
instance and returns a file descriptor referring to that instance.
|
|
(The more recent
|
|
.BR epoll_create1 (2)
|
|
extends the functionality of
|
|
.BR epoll_create (2).)
|
|
.IP *
|
|
Interest in particular file descriptors is then registered via
|
|
.BR epoll_ctl (2).
|
|
The set of file descriptors currently registered on an
|
|
.B epoll
|
|
instance is sometimes called an
|
|
.I epoll
|
|
set.
|
|
.IP *
|
|
.BR epoll_wait (2)
|
|
waits for I/O events,
|
|
blocking the calling thread if no events are currently available.
|
|
.SS Level-triggered and edge-triggered
|
|
The
|
|
.B epoll
|
|
event distribution interface is able to behave both as edge-triggered
|
|
(ET) and as level-triggered (LT).
|
|
The difference between the two mechanisms
|
|
can be described as follows.
|
|
Suppose that
|
|
this scenario happens:
|
|
.IP 1. 3
|
|
The file descriptor that represents the read side of a pipe
|
|
.RI ( rfd )
|
|
is registered on the
|
|
.B epoll
|
|
instance.
|
|
.IP 2.
|
|
A pipe writer writes 2 kB of data on the write side of the pipe.
|
|
.IP 3.
|
|
A call to
|
|
.BR epoll_wait (2)
|
|
is done that will return
|
|
.I rfd
|
|
as a ready file descriptor.
|
|
.IP 4.
|
|
The pipe reader reads 1 kB of data from
|
|
.IR rfd .
|
|
.IP 5.
|
|
A call to
|
|
.BR epoll_wait (2)
|
|
is done.
|
|
.PP
|
|
If the
|
|
.I rfd
|
|
file descriptor has been added to the
|
|
.B epoll
|
|
interface using the
|
|
.B EPOLLET
|
|
(edge-triggered)
|
|
flag, the call to
|
|
.BR epoll_wait (2)
|
|
done in step
|
|
.B 5
|
|
will probably hang despite the available data still present in the file
|
|
input buffer;
|
|
meanwhile the remote peer might be expecting a response based on the
|
|
data it already sent.
|
|
The reason for this is that edge-triggered mode
|
|
delivers events only when changes occur on the monitored file descriptor.
|
|
So, in step
|
|
.B 5
|
|
the caller might end up waiting for some data that is already present inside
|
|
the input buffer.
|
|
In the above example, an event on
|
|
.I rfd
|
|
will be generated because of the write done in
|
|
.B 2
|
|
and the event is consumed in
|
|
.BR 3 .
|
|
Since the read operation done in
|
|
.B 4
|
|
does not consume the whole buffer data, the call to
|
|
.BR epoll_wait (2)
|
|
done in step
|
|
.B 5
|
|
might block indefinitely.
|
|
|
|
An application that employs the
|
|
.B EPOLLET
|
|
flag should use nonblocking file descriptors to avoid having a blocking
|
|
read or write starve a task that is handling multiple file descriptors.
|
|
The suggested way to use
|
|
.B epoll
|
|
as an edge-triggered
|
|
.RB ( EPOLLET )
|
|
interface is as follows:
|
|
.RS
|
|
.TP 4
|
|
.B i
|
|
with nonblocking file descriptors; and
|
|
.TP
|
|
.B ii
|
|
by waiting for an event only after
|
|
.BR read (2)
|
|
or
|
|
.BR write (2)
|
|
return
|
|
.BR EAGAIN .
|
|
.RE
|
|
.PP
|
|
By contrast, when used as a level-triggered interface
|
|
(the default, when
|
|
.B EPOLLET
|
|
is not specified),
|
|
.B epoll
|
|
is simply a faster
|
|
.BR poll (2),
|
|
and can be used wherever the latter is used since it shares the
|
|
same semantics.
|
|
|
|
Since even with edge-triggered
|
|
.BR epoll ,
|
|
multiple events can be generated upon receipt of multiple chunks of data,
|
|
the caller has the option to specify the
|
|
.B EPOLLONESHOT
|
|
flag, to tell
|
|
.B epoll
|
|
to disable the associated file descriptor after the receipt of an event with
|
|
.BR epoll_wait (2).
|
|
When the
|
|
.B EPOLLONESHOT
|
|
flag is specified,
|
|
it is the caller's responsibility to rearm the file descriptor using
|
|
.BR epoll_ctl (2)
|
|
with
|
|
.BR EPOLL_CTL_MOD .
|
|
.SS /proc interfaces
|
|
The following interfaces can be used to limit the amount of
|
|
kernel memory consumed by epoll:
|
|
.\" Following was added in 2.6.28, but them removed in 2.6.29
|
|
.\" .TP
|
|
.\" .IR /proc/sys/fs/epoll/max_user_instances " (since Linux 2.6.28)"
|
|
.\" This specifies an upper limit on the number of epoll instances
|
|
.\" that can be created per real user ID.
|
|
.TP
|
|
.IR /proc/sys/fs/epoll/max_user_watches " (since Linux 2.6.28)"
|
|
This specifies a limit on the total number of
|
|
file descriptors that a user can register across
|
|
all epoll instances on the system.
|
|
The limit is per real user ID.
|
|
Each registered file descriptor costs roughly 90 bytes on a 32-bit kernel,
|
|
and roughly 160 bytes on a 64-bit kernel.
|
|
Currently,
|
|
.\" 2.6.29 (in 2.6.28, the default was 1/32 of lowmem)
|
|
the default value for
|
|
.I max_user_watches
|
|
is 1/25 (4%) of the available low memory,
|
|
divided by the registration cost in bytes.
|
|
.SS Example for suggested usage
|
|
While the usage of
|
|
.B epoll
|
|
when employed as a level-triggered interface does have the same
|
|
semantics as
|
|
.BR poll (2),
|
|
the edge-triggered usage requires more clarification to avoid stalls
|
|
in the application event loop.
|
|
In this example, listener is a
|
|
nonblocking socket on which
|
|
.BR listen (2)
|
|
has been called.
|
|
The function
|
|
.I do_use_fd()
|
|
uses the new ready file descriptor until
|
|
.B EAGAIN
|
|
is returned by either
|
|
.BR read (2)
|
|
or
|
|
.BR write (2).
|
|
An event-driven state machine application should, after having received
|
|
.BR EAGAIN ,
|
|
record its current state so that at the next call to
|
|
.I do_use_fd()
|
|
it will continue to
|
|
.BR read (2)
|
|
or
|
|
.BR write (2)
|
|
from where it stopped before.
|
|
|
|
.in +4n
|
|
.nf
|
|
#define MAX_EVENTS 10
|
|
struct epoll_event ev, events[MAX_EVENTS];
|
|
int listen_sock, conn_sock, nfds, epollfd;
|
|
|
|
/* Set up listening socket, \(aqlisten_sock\(aq (socket(),
|
|
bind(), listen()) */
|
|
|
|
epollfd = epoll_create(10);
|
|
if (epollfd == \-1) {
|
|
perror("epoll_create");
|
|
exit(EXIT_FAILURE);
|
|
}
|
|
|
|
ev.events = EPOLLIN;
|
|
ev.data.fd = listen_sock;
|
|
if (epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_sock, &ev) == \-1) {
|
|
perror("epoll_ctl: listen_sock");
|
|
exit(EXIT_FAILURE);
|
|
}
|
|
|
|
for (;;) {
|
|
nfds = epoll_wait(epollfd, events, MAX_EVENTS, \-1);
|
|
if (nfds == \-1) {
|
|
perror("epoll_pwait");
|
|
exit(EXIT_FAILURE);
|
|
}
|
|
|
|
for (n = 0; n < nfds; ++n) {
|
|
if (events[n].data.fd == listen_sock) {
|
|
conn_sock = accept(listen_sock,
|
|
(struct sockaddr *) &local, &addrlen);
|
|
if (conn_sock == \-1) {
|
|
perror("accept");
|
|
exit(EXIT_FAILURE);
|
|
}
|
|
setnonblocking(conn_sock);
|
|
ev.events = EPOLLIN | EPOLLET;
|
|
ev.data.fd = conn_sock;
|
|
if (epoll_ctl(epollfd, EPOLL_CTL_ADD, conn_sock,
|
|
&ev) == \-1) {
|
|
perror("epoll_ctl: conn_sock");
|
|
exit(EXIT_FAILURE);
|
|
}
|
|
} else {
|
|
do_use_fd(events[n].data.fd);
|
|
}
|
|
}
|
|
}
|
|
.fi
|
|
.in
|
|
|
|
When used as an edge-triggered interface, for performance reasons, it is
|
|
possible to add the file descriptor inside the
|
|
.B epoll
|
|
interface
|
|
.RB ( EPOLL_CTL_ADD )
|
|
once by specifying
|
|
.RB ( EPOLLIN | EPOLLOUT ).
|
|
This allows you to avoid
|
|
continuously switching between
|
|
.B EPOLLIN
|
|
and
|
|
.B EPOLLOUT
|
|
calling
|
|
.BR epoll_ctl (2)
|
|
with
|
|
.BR EPOLL_CTL_MOD .
|
|
.SS Questions and answers
|
|
.TP 4
|
|
.B Q0
|
|
What is the key used to distinguish the file descriptors registered in an
|
|
.B epoll
|
|
set?
|
|
.TP
|
|
.B A0
|
|
The key is the combination of the file descriptor number and
|
|
the open file description
|
|
(also known as an "open file handle",
|
|
the kernel's internal representation of an open file).
|
|
.TP
|
|
.B Q1
|
|
What happens if you register the same file descriptor on an
|
|
.B epoll
|
|
instance twice?
|
|
.TP
|
|
.B A1
|
|
You will probably get
|
|
.BR EEXIST .
|
|
However, it is possible to add a duplicate
|
|
.RB ( dup (2),
|
|
.BR dup2 (2),
|
|
.BR fcntl (2)
|
|
.BR F_DUPFD )
|
|
descriptor to the same
|
|
.B epoll
|
|
instance.
|
|
.\" But a descriptor duplicated by fork(2) can't be added to the
|
|
.\" set, because the [file *, fd] pair is already in the epoll set.
|
|
.\" That is a somewhat ugly inconsistency. On the one hand, a child process
|
|
.\" cannot add the duplicate file descriptor to the epoll set. (In every
|
|
.\" other case that I can think of, descriptors duplicated by fork have
|
|
.\" similar semantics to descriptors duplicated by dup() and friends.) On
|
|
.\" the other hand, the very fact that the child has a duplicate of the
|
|
.\" descriptor means that even if the parent closes its descriptor, then
|
|
.\" epoll_wait() in the parent will continue to receive notifications for
|
|
.\" that descriptor because of the duplicated descriptor in the child.
|
|
.\"
|
|
.\" See http://thread.gmane.org/gmane.linux.kernel/596462/
|
|
.\" "epoll design problems with common fork/exec patterns"
|
|
.\"
|
|
.\" mtk, Feb 2008
|
|
This can be a useful technique for filtering events,
|
|
if the duplicate file descriptors are registered with different
|
|
.I events
|
|
masks.
|
|
.TP
|
|
.B Q2
|
|
Can two
|
|
.B epoll
|
|
instances wait for the same file descriptor?
|
|
If so, are events reported to both
|
|
.B epoll
|
|
file descriptors?
|
|
.TP
|
|
.B A2
|
|
Yes, and events would be reported to both.
|
|
However, careful programming may be needed to do this correctly.
|
|
.TP
|
|
.B Q3
|
|
Is the
|
|
.B epoll
|
|
file descriptor itself poll/epoll/selectable?
|
|
.TP
|
|
.B A3
|
|
Yes.
|
|
If an
|
|
.B epoll
|
|
file descriptor has events waiting then it will
|
|
indicate as being readable.
|
|
.TP
|
|
.B Q4
|
|
What happens if one attempts to put an
|
|
.B epoll
|
|
file descriptor into its own file descriptor set?
|
|
.TP
|
|
.B A4
|
|
The
|
|
.BR epoll_ctl (2)
|
|
call will fail
|
|
.RB ( EINVAL ).
|
|
However, you can add an
|
|
.B epoll
|
|
file descriptor inside another
|
|
.B epoll
|
|
file descriptor set.
|
|
.TP
|
|
.B Q5
|
|
Can I send an
|
|
.B epoll
|
|
file descriptor over a UNIX domain socket to another process?
|
|
.TP
|
|
.B A5
|
|
Yes, but it does not make sense to do this, since the receiving process
|
|
would not have copies of the file descriptors in the
|
|
.B epoll
|
|
set.
|
|
.TP
|
|
.B Q6
|
|
Will closing a file descriptor cause it to be removed from all
|
|
.B epoll
|
|
sets automatically?
|
|
.TP
|
|
.B A6
|
|
Yes, but be aware of the following point.
|
|
A file descriptor is a reference to an open file description (see
|
|
.BR open (2)).
|
|
Whenever a descriptor is duplicated via
|
|
.BR dup (2),
|
|
.BR dup2 (2),
|
|
.BR fcntl (2)
|
|
.BR F_DUPFD ,
|
|
or
|
|
.BR fork (2),
|
|
a new file descriptor referring to the same open file description is
|
|
created.
|
|
An open file description continues to exist until all
|
|
file descriptors referring to it have been closed.
|
|
A file descriptor is removed from an
|
|
.B epoll
|
|
set only after all the file descriptors referring to the underlying
|
|
open file description have been closed
|
|
(or before if the descriptor is explicitly removed using
|
|
.BR epoll_ctl (2)
|
|
.BR EPOLL_CTL_DEL ).
|
|
This means that even after a file descriptor that is part of an
|
|
.B epoll
|
|
set has been closed,
|
|
events may be reported for that file descriptor if other file
|
|
descriptors referring to the same underlying file description remain open.
|
|
.TP
|
|
.B Q7
|
|
If more than one event occurs between
|
|
.BR epoll_wait (2)
|
|
calls, are they combined or reported separately?
|
|
.TP
|
|
.B A7
|
|
They will be combined.
|
|
.TP
|
|
.B Q8
|
|
Does an operation on a file descriptor affect the
|
|
already collected but not yet reported events?
|
|
.TP
|
|
.B A8
|
|
You can do two operations on an existing file descriptor.
|
|
Remove would be meaningless for
|
|
this case.
|
|
Modify will reread available I/O.
|
|
.TP
|
|
.B Q9
|
|
Do I need to continuously read/write a file descriptor
|
|
until
|
|
.B EAGAIN
|
|
when using the
|
|
.B EPOLLET
|
|
flag (edge-triggered behavior) ?
|
|
.TP
|
|
.B A9
|
|
Receiving an event from
|
|
.BR epoll_wait (2)
|
|
should suggest to you that such
|
|
file descriptor is ready for the requested I/O operation.
|
|
You must consider it ready until the next (nonblocking)
|
|
read/write yields
|
|
.BR EAGAIN .
|
|
When and how you will use the file descriptor is entirely up to you.
|
|
.sp
|
|
For packet/token-oriented files (e.g., datagram socket,
|
|
terminal in canonical mode),
|
|
the only way to detect the end of the read/write I/O space
|
|
is to continue to read/write until
|
|
.BR EAGAIN .
|
|
.sp
|
|
For stream-oriented files (e.g., pipe, FIFO, stream socket), the
|
|
condition that the read/write I/O space is exhausted can also be detected by
|
|
checking the amount of data read from / written to the target file
|
|
descriptor.
|
|
For example, if you call
|
|
.BR read (2)
|
|
by asking to read a certain amount of data and
|
|
.BR read (2)
|
|
returns a lower number of bytes, you
|
|
can be sure of having exhausted the read I/O space for the file
|
|
descriptor.
|
|
The same is true when writing using
|
|
.BR write (2).
|
|
(Avoid this latter technique if you cannot guarantee that
|
|
the monitored file descriptor always refers to a stream-oriented file.)
|
|
.SS Possible pitfalls and ways to avoid them
|
|
.TP
|
|
.B o Starvation (edge-triggered)
|
|
.PP
|
|
If there is a large amount of I/O space,
|
|
it is possible that by trying to drain
|
|
it the other files will not get processed causing starvation.
|
|
(This problem is not specific to
|
|
.BR epoll .)
|
|
.PP
|
|
The solution is to maintain a ready list
|
|
and mark the file descriptor as ready
|
|
in its associated data structure, thereby allowing the application to
|
|
remember which files need to be processed but still round robin amongst
|
|
all the ready files.
|
|
This also supports ignoring subsequent events you
|
|
receive for file descriptors that are already ready.
|
|
.TP
|
|
.B o If using an event cache...
|
|
.PP
|
|
If you use an event cache or store all the file descriptors returned from
|
|
.BR epoll_wait (2),
|
|
then make sure to provide a way to mark
|
|
its closure dynamically (i.e., caused by
|
|
a previous event's processing).
|
|
Suppose you receive 100 events from
|
|
.BR epoll_wait (2),
|
|
and in event #47 a condition causes event #13 to be closed.
|
|
If you remove the structure and
|
|
.BR close (2)
|
|
the file descriptor for event #13, then your
|
|
event cache might still say there are events waiting for that
|
|
file descriptor causing confusion.
|
|
.PP
|
|
One solution for this is to call, during the processing of event 47,
|
|
.BR epoll_ctl ( EPOLL_CTL_DEL )
|
|
to delete file descriptor 13 and
|
|
.BR close (2),
|
|
then mark its associated
|
|
data structure as removed and link it to a cleanup list.
|
|
If you find another
|
|
event for file descriptor 13 in your batch processing,
|
|
you will discover the file descriptor had been
|
|
previously removed and there will be no confusion.
|
|
.SH VERSIONS
|
|
The
|
|
.B epoll
|
|
API was introduced in Linux kernel 2.5.44.
|
|
.\" Its interface should be finalized in Linux kernel 2.5.66.
|
|
Support was added to glibc in version 2.3.2.
|
|
.SH CONFORMING TO
|
|
The
|
|
.B epoll
|
|
API is Linux-specific.
|
|
Some other systems provide similar
|
|
mechanisms, for example, FreeBSD has
|
|
.IR kqueue ,
|
|
and Solaris has
|
|
.IR /dev/poll .
|
|
.SH SEE ALSO
|
|
.BR epoll_create (2),
|
|
.BR epoll_create1 (2),
|
|
.BR epoll_ctl (2),
|
|
.BR epoll_wait (2)
|