When doing cross-platform network programming, the only API available on Windows that corresponds to the epoll/kevent style reactor event model is select. However, it has a limitation: the number of sockets passed into select cannot exceed FD_SETSIZE, which is set to 64.
Therefore, select in Java's NIO on Windows has the same limit. Many services ported to Windows that use the reactor model run into this constraint, which often gives the impression that server programs on Windows have weaker performance.
This limit is clearly insufficient for developing a high-concurrency server. So, is there a way to overcome it? And how do systems like Cygwin, which use the Win32 API to emulate POSIX APIs, manage to provide an unrestricted poll call?
Fortunately, there are roughly three ways to bypass the 64-socket limit.
Method 1: Redefine FD_SETSIZE
First, according to the MSDN documentation on Winsock2's select, the FD_SETSIZE value can be customized:
Four macros are defined in the header file Winsock2.h for manipulating and checking the descriptor sets. The variable FD_SETSIZE determines the maximum number of descriptors in a set. (The default value of FD_SETSIZE is 64, which can be modified by defining FD_SETSIZE to another value before including Winsock2.h.)
In winsock2.h, you can see that this value can indeed be predefined:
#ifndef FD_SETSIZE
#define FD_SETSIZE 64
#endif
As long as you define FD_SETSIZE before including winsock2.h, you can break the 64-socket limit. For example, in Cygwin's poll implementation poll.cc, FD_SETSIZE is redefined at the top of the file:
#define FD_SETSIZE 16384 // lots of fds
#include "winsup.h"
#include <sys/poll.h>
#include <sys/param.h>
This sets FD_SETSIZE to a generous 16,384, allowing up to 16K sockets to be passed to select. Cygwin then implements the POSIX poll interface on top of select.
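The same trick works directly in application code. A minimal sketch (the value 4096 is only an example, and the define must come before any Windows header that pulls in winsock2.h):
#define FD_SETSIZE 4096            /* raise the select() limit for this translation unit */
#include <winsock2.h>
/* From here on, fd_set can hold up to 4096 sockets, and select() accepts that many. */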
This approach generally works well, but it has two drawbacks. First, there is the question of how large FD_SETSIZE should be: defining it too large wastes memory, while allocating the sets dynamically for each select call can cause memory fragmentation. Second, it hurts portability: if you forget to keep the define ahead of the headers, or move the code to another environment, it can quietly stop working.
Therefore, Method 2 offers a more universal solution.
Method 2: Custom fd_set Structure
This method is more universal. According to the MSDN documentation, the fd_set structure is defined as follows:
typedef struct fd_set {
    u_int  fd_count;
    SOCKET fd_array[FD_SETSIZE];
} fd_set, FD_SET, *PFD_SET, *LPFD_SET;
The first member, fd_count, is the number of sockets currently stored in the fd_set. The fd_array array holds the socket values, and its size is determined by the FD_SETSIZE macro, which therefore caps how many sockets can be stored.
Let's look at how the macros in winsock2.h operate on an fd_set:
#ifndef FD_ZERO
#define FD_ZERO(set) (((fd_set *)(set))->fd_count=0)
#endif
The FD_ZERO macro simply clears the fd_set by setting fd_count to zero. Adding a socket is done with FD_SET:
#define FD_SET(fd, set) do { u_int __i;\
    for (__i = 0; __i < ((fd_set *)(set))->fd_count; __i++) {\
        if (((fd_set *)(set))->fd_array[__i] == (fd)) {\
            break;\
        }\
    }\
    if (__i == ((fd_set *)(set))->fd_count) {\
        if (((fd_set *)(set))->fd_count < FD_SETSIZE) {\
            ((fd_set *)(set))->fd_array[__i] = (fd);\
            ((fd_set *)(set))->fd_count++;\
        }\
    }\
} while(0)
Essentially, this macro checks whether the socket is already in the array; if it is not, and fd_count is less than FD_SETSIZE, it appends the socket to fd_array and increments fd_count.
The solution here is to create a dynamically sized structure that simulates fd_set, then cast it to an fd_set pointer when passing it to select. Microsoft's Dev Blogs discuss this approach in "A history of the fd_set, FD_SETSIZE, and how it relates to WinSock".
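Roughly, the idea looks like the following sketch (the names are illustrative, not taken from the blog post): a structure laid out exactly like fd_set, but with a capacity chosen at run time.
/* Sketch of a dynamically sized fd_set (requires <stddef.h>, <stdlib.h>, <winsock2.h>). */
struct big_fd_set {
    u_int  fd_count;
    SOCKET fd_array[1];   /* actually 'capacity' entries, allocated below */
};

struct big_fd_set *big_fd_set_new(u_int capacity) {
    size_t size = offsetof(struct big_fd_set, fd_array) + capacity * sizeof(SOCKET);
    struct big_fd_set *set = (struct big_fd_set*)malloc(size);
    if (set) set->fd_count = 0;
    return set;
}
Because the layout matches the real fd_set, a pointer to it can be cast when calling select, e.g. select(0, (fd_set*)readers, NULL, NULL, &timeout).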
In contrast to the blog's approach of using templates to build a new fd_set structure, here is a more portable implementation:
#define ISOCK_ERECV 1 /* event - recv */
#define ISOCK_ESEND 2 /* event - send */
#define ISOCK_ERROR 4 /* event - error */
/* iselect: fds(fd set), events(mask), revents(received events) */
int iselect(const int *fds, const int *events, int *revents,
            int count, long millisec, void *workmem);
The fds parameter is the array of file descriptors, events specifies the events to capture (like poll's events field), and revents receives the resulting events. The count parameter is the total number of file descriptors. workmem is the working memory the call needs; if it is NULL, the function only calculates the required memory size and returns it, without calling select/poll.
static char *current_buffer = NULL;   /* scratch memory reused across calls */
static int current_buffer_size = 0;

int my_select1(const int *fds, const int *event, int *revent, int count, long millisec) {
    /* first call asks iselect how much working memory 'count' descriptors require */
    int require = iselect(NULL, NULL, NULL, count, 0, NULL);
    if (require > current_buffer_size) {
        current_buffer = (char*)realloc(current_buffer, require);
        current_buffer_size = require;
    }
    return iselect(fds, event, revent, count, millisec, current_buffer);
}
In this example, current_buffer could be a global variable, as above, or stored in a selector/poller object.
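If you prefer to avoid globals, the same two fields can live inside whatever object owns the event loop; a hypothetical sketch:
/* Hypothetical poller object owning the scratch memory (not part of the original code): */
typedef struct {
    char *select_buffer;        /* grown on demand, reused across iselect calls */
    int select_buffer_size;
} poller_t;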
Alternatively, allocate the working memory on the stack for small select calls, and fall back to dynamic allocation when the required size exceeds a predefined limit:
int my_select2(const int *fds, const int *event, int *revent, int count, long millisec) {
    #define MAX_BUFFER_SIZE 2048
    char stack[MAX_BUFFER_SIZE];
    char *buffer = stack;
    /* ask iselect how much working memory 'count' descriptors require */
    int require = iselect(NULL, NULL, NULL, count, 0, NULL);
    if (require > MAX_BUFFER_SIZE) buffer = (char*)malloc(require);
    if (buffer == NULL) return -1;   /* allocation failed */
    int hr = iselect(fds, event, revent, count, millisec, buffer);
    if (buffer != stack) free(buffer);
    return hr;
}
This approach avoids the need to maintain a global variable.
The following iselect implementation fully simulates the behavior of poll while bypassing the FD_SETSIZE limit; it uses poll on non-Windows platforms and select on Windows.
/* iselect: fds(fd set), events(mask), revents(received events) */
int iselect(const int *fds, const int *events, int *revents, int count,
            long millisec, void *workmem)
{
    int retval = 0;
    int i;
    if (workmem == NULL) {
        /* size-query mode: report how much working memory 'count' descriptors need */
#if defined(__unix) || defined(__linux)
        return count * sizeof(struct pollfd);
#else
        size_t unit = (size_t)(&(((FD_SET*)0)->fd_array[1]));
        size_t size = count * sizeof(SOCKET) + unit + 8;
        return (int)(size * 3);
#endif
    }
    else {
#if defined(__unix) || defined(__linux)
        /* POSIX: translate the event masks into pollfd entries and call poll */
        struct pollfd *pfds = (struct pollfd*)workmem;
        for (i = 0; i < count; i++) {
            pfds[i].fd = fds[i];
            pfds[i].events = 0;
            pfds[i].revents = 0;
            if (events[i] & ISOCK_ERECV) pfds[i].events |= POLLIN;
            if (events[i] & ISOCK_ESEND) pfds[i].events |= POLLOUT;
            if (events[i] & ISOCK_ERROR) pfds[i].events |= POLLERR;
        }
        poll(pfds, count, millisec);
        for (i = 0; i < count; i++) {
            int event = events[i];
            int pevent = pfds[i].revents;
            int revent = 0;
            if ((event & ISOCK_ERECV) && (pevent & POLLIN))
                revent |= ISOCK_ERECV;
            if ((event & ISOCK_ESEND) && (pevent & POLLOUT))
                revent |= ISOCK_ESEND;
            if ((event & ISOCK_ERROR) && (pevent & POLLERR))
                revent |= ISOCK_ERROR;
            revents[i] = revent & event;
            if (revents[i]) retval++;
        }
#else
        /* Windows: carve three fd_sets (read/write/error) out of workmem */
        struct timeval tmx = { 0, 0 };
        size_t unit = (size_t)(&(((FD_SET*)0)->fd_array[1]));
        size_t size = count * sizeof(SOCKET) + unit + 8;
        FD_SET *fdr = (FD_SET*)(((char*)workmem) + 0);
        FD_SET *fdw = (FD_SET*)(((char*)workmem) + size);
        FD_SET *fde = (FD_SET*)(((char*)workmem) + size * 2);
        void *dr, *dw, *de;
        int maxfd = 0;
        int j;
        fdr->fd_count = fdw->fd_count = fde->fd_count = 0;
        for (i = 0; i < count; i++) {
            int event = events[i];
            int fd = fds[i];
            if (event & ISOCK_ERECV) fdr->fd_array[(fdr->fd_count)++] = fd;
            if (event & ISOCK_ESEND) fdw->fd_array[(fdw->fd_count)++] = fd;
            if (event & ISOCK_ERROR) fde->fd_array[(fde->fd_count)++] = fd;
            if (fd > maxfd) maxfd = fd;
        }
        dr = fdr->fd_count? fdr : NULL;
        dw = fdw->fd_count? fdw : NULL;
        de = fde->fd_count? fde : NULL;
        tmx.tv_sec = millisec / 1000;
        tmx.tv_usec = (millisec % 1000) * 1000;
        select(maxfd + 1, (fd_set*)dr, (fd_set*)dw, (fd_set*)de,
               (millisec >= 0)? &tmx : 0);
        /* select() shrinks each set to the ready sockets, so scan each set per fd */
        for (i = 0; i < count; i++) {
            int event = events[i];
            int fd = fds[i];
            int revent = 0;
            if (event & ISOCK_ERECV) {
                for (j = 0; j < (int)fdr->fd_count; j++) {
                    if (fdr->fd_array[j] == (SOCKET)fd) {
                        revent |= ISOCK_ERECV;
                        break;
                    }
                }
            }
            if (event & ISOCK_ESEND) {
                for (j = 0; j < (int)fdw->fd_count; j++) {
                    if (fdw->fd_array[j] == (SOCKET)fd) {
                        revent |= ISOCK_ESEND;
                        break;
                    }
                }
            }
            if (event & ISOCK_ERROR) {
                for (j = 0; j < (int)fde->fd_count; j++) {
                    if (fde->fd_array[j] == (SOCKET)fd) {
                        revent |= ISOCK_ERROR;
                        break;
                    }
                }
            }
            revents[i] = revent & event;
            if (revent) retval++;
        }
#endif
    }
    return retval;
}
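For reference, here is a minimal usage sketch (sock1 and sock2 stand for sockets you have already created; the int casts follow the interface above, which takes descriptors as int):
int fds[2] = { (int)sock1, (int)sock2 };
int events[2] = { ISOCK_ERECV, ISOCK_ERECV | ISOCK_ESEND };
int revents[2];

/* wait up to 100 ms using the stack/heap wrapper defined earlier */
int n = my_select2(fds, events, revents, 2, 100);
if (n > 0) {
    if (revents[0] & ISOCK_ERECV) { /* sock1 is readable */ }
    if (revents[1] & ISOCK_ESEND) { /* sock2 is writable */ }
}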
This is exactly the method I’m currently using, and it’s just over a hundred lines. I’ve tested it, and on my desktop, it can maintain around 10,000 socket connections without issues. Running an echo server, where each connection exchanges one message per second, keeps CPU usage at around 70%.
For Windows client programs, which typically don’t maintain many connections, this function is more than sufficient. For server programs, it’s workable and allows server applications that usually run on Linux to operate reliably on Windows, handling any number of connections smoothly during development and debugging.
The only drawback is the high CPU usage. So, is there an asynchronous event model on Windows that is as smooth as kevent or epoll, one that can handle tens of thousands of sockets without heavy CPU usage? There is, but before discussing Method 3, let's look at two common mistakes.
Wrong Choices: WSAEventSelect and WSAAsyncSelect
Some suggest using WSAEventSelect, which binds socket events to a WSAEVENT:
int WSAAPI WSAEventSelect(
    [in] SOCKET s,
    [in] WSAEVENT hEventObject,
    [in] long lNetworkEvents
);
A WSAEVENT is essentially an event object and appears to be free of the FD_SETSIZE restriction, but WSAWaitForMultipleEvents still runs into the WSA_MAXIMUM_WAIT_EVENTS limit from winsock2.h:
#define WSA_MAXIMUM_WAIT_EVENTS (MAXIMUM_WAIT_OBJECTS)
MAXIMUM_WAIT_OBJECTS is 64, so you still can't escape the limitation. Another function, WSAAsyncSelect, associates socket events with a window handle:
int WSAAsyncSelect(
    [in] SOCKET s,
    [in] HWND hWnd,
    [in] u_int wMsg,
    [in] long lEvent
);
This does remove the connection limit, but it requires a window handle (HWND), meaning you'd need to create a virtual window. To emulate poll behavior, where would you place this virtual window? Would you need a dedicated thread to run its message loop?
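A rough sketch of what WSAAsyncSelect forces on you (WM_SOCKET and hidden_hwnd are hypothetical names): every socket event arrives as a window message that some thread must pump.
#define WM_SOCKET (WM_USER + 1)
WSAAsyncSelect(s, hidden_hwnd, WM_SOCKET, FD_READ | FD_WRITE | FD_CLOSE);

/* ...and somewhere, a message loop has to dispatch those notifications: */
MSG msg;
while (GetMessage(&msg, NULL, 0, 0) > 0) {
    if (msg.message == WM_SOCKET) {
        SOCKET sock = (SOCKET)msg.wParam;            /* which socket fired */
        int event = WSAGETSELECTEVENT(msg.lParam);   /* FD_READ, FD_WRITE, ... */
        /* hand the event to the reactor here */
    }
    DispatchMessage(&msg);
}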
Unix’s philosophy is that everything is a file, whereas Windows’ philosophy is that everything is a window. It’s surprising to find network programming involving windows! But ultimately, this approach is not very clean.
Method 3: Using IOCP to Implement epoll
Yes, it's possible to use IOCP (I/O Completion Ports) to fully emulate epoll, giving you a high-performance reactor event model that can handle hundreds of thousands of sockets effortlessly. While this sounds appealing, it is challenging to implement yourself. Luckily, there is already a project that does it:
The wepoll project provides high-performance epoll functionality on Windows on top of IOCP. It supports Windows Vista and later and consists of only two files, wepoll.h and wepoll.c, making it easy to integrate. The interface mirrors epoll:
HANDLE epoll_create(int size);
HANDLE epoll_create1(int flags);
int epoll_close(HANDLE ephnd);
int epoll_ctl(HANDLE ephnd,
              int op,
              SOCKET sock,
              struct epoll_event* event);
int epoll_wait(HANDLE ephnd,
               struct epoll_event* events,
               int maxevents,
               int timeout);
It works exactly like epoll, though it only supports level-triggered mode, not edge-triggered. However, performance tests suggest that edge-triggered mode doesn't bring a significant advantage, and it isn't widely portable anyway, since most asynchronous event APIs on other platforms don't support it. Level-triggered mode is therefore good enough in most cases.
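For reference, a minimal wepoll event loop might look like the sketch below (listen_sock is assumed to be an already-created listening socket; error handling is omitted):
#include "wepoll.h"

HANDLE ep = epoll_create1(0);

struct epoll_event ev;
ev.events = EPOLLIN;            /* level-triggered readability */
ev.data.sock = listen_sock;
epoll_ctl(ep, EPOLL_CTL_ADD, listen_sock, &ev);

struct epoll_event ready[64];
for (;;) {
    int n = epoll_wait(ep, ready, 64, 1000);   /* wait up to 1000 ms */
    for (int i = 0; i < n; i++) {
        SOCKET s = ready[i].data.sock;
        /* accept / recv on s here */
    }
}
epoll_close(ep);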
Summary
So, if you're building a cross-platform poll module, which of the three methods above is best on Windows? My approach is to implement Method 2 (the custom fd_set) as the built-in mechanism, since it works all the way back to Windows 95 and makes a reliable fallback. On top of that, I include a plugin mechanism that allows external enhancements.
Then, if the main program detects that the system is Vista or newer and wepoll is available, it wraps wepoll as a plugin and plugs it in. This layered approach offers flexibility and optimal performance on Windows.
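In code, such layering could be as simple as a table of function pointers; the names below are purely illustrative and not from the actual project:
/* Hypothetical backend interface: one implementation wraps iselect, another wraps wepoll. */
typedef struct {
    void *(*create)(void);
    int (*add_socket)(void *obj, int fd, int events);
    int (*wait)(void *obj, long millisec);
    void (*destroy)(void *obj);
} poll_backend;

/* At startup, pick the best backend: wepoll on Vista+, otherwise the select()-based one. */
const poll_backend *backend = wepoll_usable() ? &wepoll_backend : &iselect_backend;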
Translated from https://skywind.me/blog/archives/3112