.\"
.\" epoll by Davide Libenzi ( efficient event notification retrieval )
.\" Copyright (C) 2003 Davide Libenzi
.\"
.\" This program is free software; you can redistribute it and/or modify
.\" it under the terms of the GNU General Public License as published by
.\" the Free Software Foundation; either version 2 of the License, or
.\" (at your option) any later version.
.\"
.\" This program is distributed in the hope that it will be useful,
.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
.\" GNU General Public License for more details.
.\"
.\" You should have received a copy of the GNU General Public License
.\" along with this program; if not, write to the Free Software
.\" Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
.\"
.\" Davide Libenzi <davidel@xmailserver.org>
.\"
.TH EPOLL 7 2007-06-22 "Linux" "Linux Programmer's Manual"
.SH NAME
epoll \- I/O event notification facility
.SH SYNOPSIS
.B #include <sys/epoll.h>
.SH DESCRIPTION
.B epoll
is a variant of
.BR poll (2)
that can be used either as an edge-triggered or a level-triggered
interface and scales well to large numbers of watched file descriptors.
Three system calls are provided to
set up and control an
.B epoll
set:
.BR epoll_create (2),
.BR epoll_ctl (2),
and
.BR epoll_wait (2).
.PP
An
.B epoll
set is connected to a file descriptor created by
.BR epoll_create (2).
Interest in particular file descriptors is then registered via
.BR epoll_ctl (2).
Finally, the actual wait is started by
.BR epoll_wait (2).
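.PP
The three calls fit together roughly as in the following minimal sketch.
The helper name
.IR wait_readable ()
is hypothetical and not part of the
.B epoll
API; it creates a throwaway set, registers one file descriptor for
.BR EPOLLIN ,
and waits for it with a timeout given in milliseconds.
.nf
```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_ms for fd to become readable.
   Returns 1 if readable, 0 on timeout, -1 on error. */
int wait_readable(int fd, int timeout_ms)
{
    struct epoll_event ev = { 0 }, out;
    int epfd, n;

    epfd = epoll_create(1);        /* create the epoll set; size is a hint */
    if (epfd < 0)
        return -1;

    ev.events = EPOLLIN;           /* level-triggered unless EPOLLET is set */
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {  /* register interest */
        close(epfd);
        return -1;
    }

    n = epoll_wait(epfd, &out, 1, timeout_ms);          /* the actual wait */
    close(epfd);
    if (n < 0)
        return -1;
    return (n == 1 && (out.events & EPOLLIN)) ? 1 : 0;
}
```
.fi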
.SS Level-Triggered and Edge-Triggered
The
.B epoll
event distribution interface can behave as either edge-triggered
(ET) or level-triggered (LT).
The difference between the two mechanisms
can be described as follows.
Suppose that
the following scenario occurs:
.IP 1. 3
The file descriptor that represents the read side of a pipe
.RI ( rfd )
is added to the
.B epoll
set.
.IP 2.
A pipe writer writes 2 kB of data on the write side of the pipe.
.IP 3.
A call to
.BR epoll_wait (2)
is done that will return
.I rfd
as a ready file descriptor.
.IP 4.
The pipe reader reads 1 kB of data from
.IR rfd .
.IP 5.
A call to
.BR epoll_wait (2)
is done.
.PP
If the
.I rfd
file descriptor has been added to the
.B epoll
interface using the
.B EPOLLET
flag, the call to
.BR epoll_wait (2)
done in step
.B 5
will probably hang even though data is still available in the file
input buffer;
meanwhile the remote peer might be expecting a response based on the
data it already sent.
The reason for this is that edge-triggered mode
delivers events only when changes occur on the monitored file descriptor.
So, in step
.B 5
the caller might end up waiting for some data that is already present inside
the input buffer.
In the above example, an event on
.I rfd
will be generated because of the write done in
.B 2
and the event is consumed in
.BR 3 .
Since the read operation done in
.B 4
does not consume the whole buffer data, the call to
.BR epoll_wait (2)
done in step
.B 5
might block indefinitely.
.PP
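.PP
The five steps can be replayed on a real pipe, as in the sketch below.
The helper name
.IR second_wait_result ()
is hypothetical, and a 100 ms timeout stands in for the "hang" so that
the program terminates.
On a typical Linux kernel the wait in step 5 is expected to report
.I rfd
again in level-triggered mode and to time out in edge-triggered mode,
but treat that as an illustration rather than a guarantee.
.nf
```c
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: replays steps 1 to 5 on a fresh pipe.
   Pass 0 for level-triggered or EPOLLET for edge-triggered.
   Returns what the epoll_wait() in step 5 returned (1 = rfd reported,
   0 = timed out), or -1 if an earlier step failed. */
int second_wait_result(unsigned int extra_flags)
{
    struct epoll_event ev = { 0 }, out;
    char buf[2048];
    int p[2], epfd, n = -1;

    if (pipe(p) < 0)
        return -1;
    epfd = epoll_create(1);
    if (epfd < 0)
        goto out_pipe;

    ev.events = EPOLLIN | extra_flags;
    ev.data.fd = p[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) < 0)   /* step 1 */
        goto out_all;
    memset(buf, 'x', sizeof(buf));
    if (write(p[1], buf, 2048) != 2048)                  /* step 2 */
        goto out_all;
    if (epoll_wait(epfd, &out, 1, 100) != 1)             /* step 3 */
        goto out_all;
    if (read(p[0], buf, 1024) != 1024)                   /* step 4 */
        goto out_all;
    n = epoll_wait(epfd, &out, 1, 100);                  /* step 5 */

out_all:
    close(epfd);
out_pipe:
    close(p[0]);
    close(p[1]);
    return n;
}
```
.fi
.PP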
An application that employs the
.B EPOLLET
flag (edge-triggered)
should use non-blocking file descriptors to avoid having a blocking
read or write starve a task that is handling multiple file descriptors.
The suggested way to use
.B epoll
as an edge-triggered
.RB ( EPOLLET )
interface is as follows:
.RS
.TP
.B i
with non-blocking file descriptors
.TP
.B ii
by waiting for an event only after
.BR read (2)
or
.BR write (2)
return
.BR EAGAIN .
.RE
.PP
By contrast, when used as a level-triggered interface,
.B epoll
is simply a faster
.BR poll (2),
and can be used wherever the latter is used since it shares the
same semantics.
.PP
Since even with edge-triggered
.B epoll
multiple events can be generated upon receipt of multiple chunks of data,
the caller has the option to specify the
.B EPOLLONESHOT
flag, to tell
.B epoll
to disable the associated file descriptor after the receipt of an event with
.BR epoll_wait (2).
When the
.B EPOLLONESHOT
flag is specified,
it is the caller's responsibility to rearm the file descriptor using
.BR epoll_ctl (2)
with
.BR EPOLL_CTL_MOD .
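.PP
Rearming a one-shot descriptor might look like the following sketch.
The helper name
.IR rearm_oneshot ()
is hypothetical; it assumes the descriptor was originally added with
.B EPOLLONESHOT
and that
.I data.fd
is how the application identifies it.
.nf
```c
#include <sys/epoll.h>

/* Hypothetical helper: re-enable a one-shot descriptor once its last
   event has been handled. Returns 0 on success, -1 on error. */
int rearm_oneshot(int epfd, int fd, unsigned int events)
{
    struct epoll_event ev = { 0 };

    ev.events = events | EPOLLONESHOT;   /* stay one-shot after rearming */
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}
```
.fi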
.SS Example for Suggested Usage
While the usage of
.B epoll
when employed as a level-triggered interface does have the same
semantics as
.BR poll (2),
the edge-triggered usage requires more clarification to avoid stalls
in the application event loop.
In this example,
.I listener
is a
non-blocking socket on which
.BR listen (2)
has been called.
The function
.IR do_use_fd ()
uses the new ready
file descriptor until
.B EAGAIN
is returned by either
.BR read (2)
or
.BR write (2).
An event-driven state machine application should, after having received
.BR EAGAIN ,
record its current state so that at the next call to
.IR do_use_fd ()
it will continue to
.BR read (2)
or
.BR write (2)
from where it stopped before.
.PP
.nf
struct epoll_event ev, *events;

/* kdpfd is the epoll file descriptor; listener is already registered */
for (;;) {
    nfds = epoll_wait(kdpfd, events, maxevents, \-1);
    if (nfds < 0) {
        if (errno == EINTR)
            continue;           /* interrupted by a signal; retry */
        perror("epoll_wait");
        return \-1;
    }

    for (n = 0; n < nfds; ++n) {
        if (events[n].data.fd == listener) {
            addrlen = sizeof(local);    /* value-result argument */
            client = accept(listener, (struct sockaddr *) &local,
                            &addrlen);
            if (client < 0) {
                perror("accept");
                continue;
            }
            setnonblocking(client);
            ev.events = EPOLLIN | EPOLLET;
            ev.data.fd = client;
            if (epoll_ctl(kdpfd, EPOLL_CTL_ADD, client, &ev) < 0) {
                fprintf(stderr, "epoll set insertion error: fd=%d\\n",
                        client);
                return \-1;
            }
        } else {
            do_use_fd(events[n].data.fd);
        }
    }
}
.fi
.PP
When used as an edge-triggered interface, for performance reasons, it is
possible to add the file descriptor to the epoll set
.RB ( EPOLL_CTL_ADD )
once by specifying
.RB ( EPOLLIN | EPOLLOUT ).
This allows you to avoid
continuously switching between
.B EPOLLIN
and
.B EPOLLOUT
by calling
.BR epoll_ctl (2)
with
.BR EPOLL_CTL_MOD .
.SS Questions and Answers
.TP
.B Q1
What happens if you add the same file descriptor to an epoll set twice?
.TP
.B A1
You will probably get
.BR EEXIST .
However, it is possible that two
threads may add the same file descriptor twice.
This is a harmless condition.
.TP
.B Q2
Can two
.B epoll
sets wait for the same file descriptor?
If so, are events reported to both
.B epoll
file descriptors?
.TP
.B A2
Yes, and events would be reported to both.
However, it is not recommended.
.TP
.B Q3
Is the
.B epoll
file descriptor itself poll/epoll/selectable?
.TP
.B A3
Yes.
An
.B epoll
file descriptor that has events waiting is reported as readable.
.TP
.B Q4
What happens if the
.B epoll
file descriptor is put into its own file descriptor set?
.TP
.B A4
The
.BR epoll_ctl (2)
call will fail with
.BR EINVAL .
However, you can add an
.B epoll
file descriptor inside another epoll file descriptor set.
.TP
.B Q5
Can I send the
.B epoll
file descriptor over a Unix domain socket to another process?
.TP
.B A5
Yes, but it is rarely useful, since the receiving process
does not have copies of the file descriptors registered in the
.B epoll
set.
.TP
.B Q6
Will closing a file descriptor cause it to be removed from all
.B epoll
sets automatically?
.TP
.B A6
Yes, but only once every file descriptor that refers to the same
open file description (for example, a duplicate created with
.BR dup (2)
or inherited across
.BR fork (2))
has been closed.
.TP
.B Q7
If more than one event occurs between
.BR epoll_wait (2)
calls, are they combined or reported separately?
.TP
.B A7
They will be combined.
.TP
.B Q8
Does an operation on a file descriptor affect the
already collected but not yet reported events?
.TP
.B A8
You can do two operations on an existing file descriptor:
remove and modify.
Remove would be meaningless for
this case.
Modify will re-read available I/O.
.TP
.B Q9
Do I need to continuously read/write a file descriptor
until
.B EAGAIN
when using the
.B EPOLLET
flag (edge-triggered behavior)?
.TP
.B A9
Not necessarily.
Receiving an event from
.BR epoll_wait (2)
tells you that the file descriptor is ready
for the requested I/O operation.
You should simply consider it ready until the
next read or write fails with
.BR EAGAIN .
When and how you use the file descriptor is entirely up
to you.
For stream-oriented files (e.g., a pipe or a stream socket),
the condition that the read/write I/O space is exhausted can also
be detected by checking the amount of data read from / written to the target
file descriptor.
For example, if you call
.BR read (2)
asking for a certain amount of data and
.BR read (2)
returns fewer bytes,
you can be sure of having exhausted the read
I/O space for the file descriptor.
The same is true when writing using
.BR write (2).
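.PP
Answers A1 and A4 can be checked directly by looking at the error that
.BR epoll_ctl (2)
reports.
The helper name
.IR try_add ()
below is hypothetical and exists only for this sketch.
.nf
```c
#include <errno.h>
#include <sys/epoll.h>

/* Hypothetical helper: attempt EPOLL_CTL_ADD for fd on epfd.
   Returns 0 on success, or the errno value on failure. */
int try_add(int epfd, int fd)
{
    struct epoll_event ev = { 0 };

    ev.events = EPOLLIN;
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0)
        return errno;
    return 0;
}
```
.fi
.PP
With this helper, adding the same descriptor a second time is expected to
yield
.BR EEXIST ,
adding an
.B epoll
file descriptor to itself is expected to yield
.BR EINVAL ,
and adding it to a different
.B epoll
set is expected to succeed.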
.SS Possible Pitfalls and Ways to Avoid Them
.TP
.B o Starvation (edge-triggered)
.PP
If there is a large amount of I/O space,
it is possible that by trying to drain
it the other files will not get processed, causing starvation.
(This problem is not specific to
.BR epoll .)
.PP
The solution is to maintain a ready list
and mark the file descriptor as ready
in its associated data structure, thereby allowing the application to
remember which files need to be processed but still round robin amongst
all the ready files.
This also supports ignoring subsequent events you
receive for file descriptors that are already ready.
.TP
.B o If using an event cache...
.PP
If you use an event cache or store all the file descriptors returned from
.BR epoll_wait (2),
then make sure to provide a way to mark
a file descriptor as closed dynamically (i.e., as a result of
processing a previous event).
Suppose you receive 100 events from
.BR epoll_wait (2),
and in event #47 a condition causes event #13 to be closed.
If you remove the structure and
.BR close (2)
the file descriptor for event #13, then your
event cache might still say there are events waiting for that
file descriptor, causing confusion.
.PP
One solution for this is to call, during the processing of event #47,
.BR epoll_ctl ( EPOLL_CTL_DEL )
to delete the file descriptor for event #13 and
.BR close (2)
it, then mark its associated
data structure as removed and link it to a cleanup list.
If you find another
event for that file descriptor in your batch processing,
you will discover the file descriptor had been
previously removed and there will be no confusion.
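.PP
The cleanup-list approach might be sketched as follows.
Everything here is an assumption made for illustration: the
.I struct conn
record, the names
.IR conn_retire ()
and
.IR conn_reap (),
and the choice to store a pointer to the record in the
.I data.ptr
field of each event.
.nf
```c
#include <stdlib.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical per-descriptor record, reachable from event data.ptr. */
struct conn {
    int fd;
    int removed;             /* set once the descriptor has been closed */
    struct conn *next_dead;  /* link in the cleanup list */
};

/* Close a connection now, but defer freeing its record so that later
   events in the same batch can still see the removed mark. */
void conn_retire(int epfd, struct conn *c, struct conn **dead)
{
    struct epoll_event unused = { 0 };  /* old kernels reject NULL here */

    if (c->removed)
        return;
    epoll_ctl(epfd, EPOLL_CTL_DEL, c->fd, &unused);
    close(c->fd);
    c->removed = 1;
    c->next_dead = *dead;
    *dead = c;
}

/* Free retired records; call only after the whole batch is processed.
   Returns the number of records freed. */
int conn_reap(struct conn **dead)
{
    int n = 0;

    while (*dead != NULL) {
        struct conn *c = *dead;

        *dead = c->next_dead;
        free(c);
        n++;
    }
    return n;
}
```
.fi
.PP
In the batch loop, an event whose record has
.I removed
set would simply be skipped, and
.IR conn_reap ()
would run once after the loop finishes.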
.SH VERSIONS
The
.B epoll
API was introduced in Linux kernel 2.5.44.
Its interface should be finalized in Linux kernel 2.5.66.
.SH CONFORMING TO
The epoll API is Linux-specific.
Some other systems provide similar
mechanisms; for example, FreeBSD has
.IR kqueue ,
and Solaris has
.IR /dev/poll .
.SH "SEE ALSO"
.BR epoll_create (2),
.BR epoll_ctl (2),
.BR epoll_wait (2)