How to make select in WinSock exceed the 64-socket limit

  tr_cn        2024-10-31 23:47:50       1,596        0         

When doing cross-platform network programming, the only API available on Windows that corresponds to the epoll/kevent style reactor event model is select. However, it has a limitation: the number of sockets passed into select cannot exceed FD_SETSIZE, which is set to 64.

Therefore, select in Java’s NIO on Windows also has the same limit. Many services ported to Windows that use the reactor model face this constraint, which often gives the impression that server programs on Windows have weaker performance.

This limit is clearly insufficient for developing a high-concurrency server. So, is there a way to overcome it? And how do systems like Cygwin, which use Win32 API to emulate POSIX APIs, simulate an unrestricted poll call?

Fortunately, there are roughly three ways to bypass the 64-socket limit.

Method 1: Redefine FD_SETSIZE

First, according to the MSDN documentation on Winsock2's select, the FD_SETSIZE value can be customized:

Four macros are defined in the header file Winsock2.h for manipulating and checking the descriptor sets. The variable FD_SETSIZE determines the maximum number of descriptors in a set. (The default value of FD_SETSIZE is 64, which can be modified by defining FD_SETSIZE to another value before including Winsock2.h.)

In winsock2.h, you can see that this value can indeed be predefined:

#ifndef FD_SETSIZE
#define FD_SETSIZE 64
#endif

As long as you define FD_SETSIZE before including winsock2.h, you can break the 64-socket limit. For example, in Cygwin's poll implementation poll.cc, FD_SETSIZE is redefined at the beginning as follows:

#define FD_SETSIZE 16384        // lots of fds
#include "winsup.h"
#include <sys/poll.h>
#include <sys/param.h>

This sets FD_SETSIZE to a large value of 16,384, allowing up to 16K sockets to be used with select. Cygwin then continues to implement POSIX poll functionality using select.

This approach generally works well, but it has two limitations. First, there’s the question of how large FD_SETSIZE should be. Defining it too large wastes memory, while allocating memory dynamically with each select call can cause memory fragmentation. Secondly, this method reduces portability—if you forget to change header file order, or move the code to another environment, it may not run.

Therefore, a more universal solution is available in Method 2.

Method 2: Custom fd_set Structure

This method is more universal. According to the MSDN documentation, the fd_set structure is defined as follows:

typedef struct fd_set {
  u_int  fd_count;
  SOCKET fd_array[FD_SETSIZE];
} fd_set, FD_SET, *PFD_SET, *LPFD_SET;

The first member, fd_count, represents the total number of sockets in this fd_set. The fd_array array stores each socket’s value, with its size determined by the FD_SETSIZE macro, which sets the maximum number of sockets that can be stored.

Let’s look at how some macros in winsock2.h handle fd_set operations:

#ifndef FD_ZERO
#define FD_ZERO(set) (((fd_set *)(set))->fd_count=0)
#endif

The FD_ZERO macro simply clears the fd_set by setting fd_count to zero. Adding a socket:

#define FD_SET(fd, set) do { u_int __i;\
    for (__i = 0; __i < ((fd_set *)(set))->fd_count ; __i++) {\
        if (((fd_set *)(set))->fd_array[__i] == (fd)) {\
            break;\
        }\
    }\
    if (__i == ((fd_set *)(set))->fd_count) {\
        if (((fd_set *)(set))->fd_count < FD_SETSIZE) {\
            ((fd_set *)(set))->fd_array[__i] = (fd);\
            ((fd_set *)(set))->fd_count++;\
        }\
    }\
    } while(0)

Essentially, this macro checks if the socket already exists in the array; if it doesn’t and fd_count is less than FD_SETSIZE, it appends the socket to fd_array and increments fd_count.

The solution here is to create a dynamic structure to simulate fd_set. Then, cast it to an fd_set pointer when passing it to select. Microsoft’s devblogs have discussed this approach:(A history of the fd_set, FD_SETSIZE, and how it relates to WinSock)

In contrast to using templates to create a new fd_set structure as in the blog example, here is a more portable implementation:

#define ISOCK_ERECV     1       /* event - recv           */
#define ISOCK_ESEND     2       /* event - send           */
#define ISOCK_ERROR     4       /* event - error          */

/* iselect: fds(fd set), events(mask), revents(received events) */
int iselect(const int *fds, const int *events, int *revents, 
    int count, long millisec, void *workmem);

The fds parameter is the array of file descriptors, events specifies the events to capture (like poll’s events), and revents receives the resulting events. The count parameter represents the total number of file descriptors. workmem represents the memory required; if NULL, the function calculates the needed memory and returns it without calling select/poll.

int my_select1(const int *fds, const int *event, int *revent, int count, long millisec) {
    int require = iselect(NULL, NULL, NULL, count, 0, NULL);
    if (require > current_buffer_size) {
        current_buffer = realloc(current_buffer, require);
        current_buffer_size = require;
    }
    return iselect(fds, event, revent, count, millisec, current_buffer);
}

In this example, current_buffer could be a global variable or stored in a selector/poller object.

Alternatively, allocate memory on the stack for small select calls, and use dynamic allocation if the required size exceeds a predefined limit:

int my_select2(const int *fds, const int *event, int *revent, int count, long millisec) {
    #define MAX_BUFFER_SIZE 2048
    char stack[MAX_BUFFER_SIZE];
    char *buffer = stack;
    int require = iselect(NULL, NULL, NULL, count, 0, NULL);
    if (require > MAX_BUFFER_SIZE) buffer = (char*)malloc(require);
    int hr = iselect(fds, event, revent, count, millisec, buffer);
    if (buffer != stack) free(buffer);
    return hr;
}

This approach avoids the need to maintain a global variable.

The following select function fully simulates the behavior of poll, bypassing the FD_SETSIZE limit, and uses poll on non-Windows platforms and select on Windows.

/* iselect: fds(fd set), events(mask), revents(received events) */
int iselect(const int *fds, const int *events, int *revents, int count, 
    long millisec, void *workmem)
{
    int retval = 0;
    int i;

    if (workmem == NULL) {
        #if defined(__unix) || defined(__linux)
        return count * sizeof(struct pollfd);
        #else
        size_t unit = (size_t)(&(((FD_SET*)0)->fd_array[1]));
        size_t size = count * sizeof(SOCKET) + unit + 8;
        return (int)(size * 3);
        #endif
    }    
    else {
        #if defined(__unix) || defined(__linux)
        struct pollfd *pfds = (struct pollfd*)workmem;

        for (i = 0; i < count; i++) {
            pfds[i].fd = fds[i];
            pfds[i].events = 0;
            pfds[i].revents = 0;
            if (events[i] & ISOCK_ERECV) pfds[i].events |= POLLIN;
            if (events[i] & ISOCK_ESEND) pfds[i].events |= POLLOUT;
            if (events[i] & ISOCK_ERROR) pfds[i].events |= POLLERR;
        }

        poll(pfds, count, millisec);

        for (i = 0; i < count; i++) {
            int event = events[i];
            int pevent = pfds[i].revents;
            int revent = 0;
            if ((event & ISOCK_ERECV) && (pevent & POLLIN)) 
                revent |= ISOCK_ERECV;
            if ((event & ISOCK_ESEND) && (pevent & POLLOUT))
                revent |= ISOCK_ESEND;
            if ((event & ISOCK_ERROR) && (pevent & POLLERR))
                revent |= ISOCK_ERROR;
            revents[i] = revent & event;
            if (revents[i]) retval++;
        }

        #else
        struct timeval tmx = { 0, 0 };
        size_t unit = (size_t)(&(((FD_SET*)0)->fd_array[1]));
        size_t size = count * sizeof(SOCKET) + unit + 8;
        FD_SET *fdr = (FD_SET*)(((char*)workmem) + 0);
        FD_SET *fdw = (FD_SET*)(((char*)workmem) + size);
        FD_SET *fde = (FD_SET*)(((char*)workmem) + size * 2);
        void *dr, *dw, *de;
        int maxfd = 0;
        int j;

        fdr->fd_count = fdw->fd_count = fde->fd_count = 0;

        for (i = 0; i < count; i++) {
            int event = events[i];
            int fd = fds[i];
            if (event & ISOCK_ERECV) fdr->fd_array[(fdr->fd_count)++] = fd;
            if (event & ISOCK_ESEND) fdw->fd_array[(fdw->fd_count)++] = fd;
            if (event & ISOCK_ERROR) fde->fd_array[(fde->fd_count)++] = fd;
            if (fd > maxfd) maxfd = fd;
        }

        dr = fdr->fd_count? fdr : NULL;
        dw = fdw->fd_count? fdw : NULL;
        de = fde->fd_count? fde : NULL;

        tmx.tv_sec = millisec / 1000;
        tmx.tv_usec = (millisec % 1000) * 1000;

        select(maxfd + 1, (fd_set*)dr, (fd_set*)dw, (fd_set*)de, 
            (millisec >= 0)? &tmx : 0);

        for (i = 0; i < count; i++) {
            int event = events[i];
            int fd = fds[i];
            int revent = 0;
            if (event & ISOCK_ERECV) {
                for (j = 0; j < (int)fdr->fd_count; j++) {
                    if (fdr->fd_array[j] == (SOCKET)fd) { 
                        revent |= ISOCK_ERECV; 
                        break; 
                    }
                }
            }
            if (event & ISOCK_ESEND) {
                for (j = 0; j < (int)fdw->fd_count; j++) {
                    if (fdw->fd_array[j] == (SOCKET)fd) { 
                        revent |= ISOCK_ESEND; 
                        break; 
                    }
                }
            }
            if (event & ISOCK_ERROR) {
                for (j = 0; j < (int)fde->fd_count; j++) {
                    if (fde->fd_array[j] == (SOCKET)fd) { 
                        revent |= ISOCK_ERROR; 
                        break; 
                    }
                }
            }
            revents[i] = revent & event;
            if (revent) retval++;
        }
        #endif
    }

    return retval;
}

This is exactly the method I’m currently using, and it’s just over a hundred lines. I’ve tested it, and on my desktop, it can maintain around 10,000 socket connections without issues. Running an echo server, where each connection exchanges one message per second, keeps CPU usage at around 70%.

For Windows client programs, which typically don’t maintain many connections, this function is more than sufficient. For server programs, it’s workable and allows server applications that usually run on Linux to operate reliably on Windows, handling any number of connections smoothly during development and debugging.

The only drawback is high CPU usage. So, is there an asynchronous event model on Windows that’s as smooth as kevent or epoll—one that can handle tens of thousands of sockets without heavy CPU usage? There is, but before discussing Method 3, let’s look at two common mistakes.

Wrong Choices: WSAEventSelect and WSAAsyncSelect

Some suggest using WSAEventSelect, which binds socket events to a WSAEVENT:

int WSAAPI WSAEventSelect(
  [in] SOCKET   s,
  [in] WSAEVENT hEventObject,
  [in] long     lNetworkEvents
);

A WSAEVENT is similar to an event object and seems unrestricted by FD_SETSIZE, but WSAWaitForMultipleEvents still faces the WSA_MAXIMUM_WAIT_EVENTS limit in winsock2.h:

#define WSA_MAXIMUM_WAIT_EVENTS (MAXIMUM_WAIT_OBJECTS)

This MAXIMUM_WAIT_OBJECTS count is 64, so you still can’t escape the limitation. Another function, WSAAsyncSelect, associates a socket event with a window handle:

int WSAAsyncSelect(
  [in] SOCKET s,
  [in] HWND   hWnd,
  [in] u_int  wMsg,
  [in] long   lEvent
);

This does remove the connection limit, but it requires a window handle (HWND), meaning you’d need to create a virtual window. To emulate poll behavior, where would you place this virtual window? Would you need a dedicated thread to run its message loop?

Unix’s philosophy is that everything is a file, whereas Windows’ philosophy is that everything is a window. It’s surprising to find network programming involving windows! But ultimately, this approach is not very clean.

Method 3: Using IOCP to Implement epoll

Yes, it’s possible to use IOCP (I/O Completion Ports) to fully emulate epoll, giving you a high-performance reactor event model that can handle hundreds of thousands of sockets effortlessly. While this sounds appealing, it can be challenging to implement—but luckily, there’s already a solution for this:

The wepoll project aims to provide high-performance epoll functionality on Windows using IOCP. It supports systems from Vista onward and consists of only two files, wepoll.h and wepoll.c, making it easy to integrate. The interface aligns with epoll:

HANDLE epoll_create(int size);
HANDLE epoll_create1(int flags);

int epoll_close(HANDLE ephnd);

int epoll_ctl(HANDLE ephnd,
              int op,
              SOCKET sock,
              struct epoll_event* event);

int epoll_wait(HANDLE ephnd,
               struct epoll_event* events,
               int maxevents,
               int timeout);

It works exactly like epoll, though it only supports level-triggered mode, not edge-triggered. However, performance tests suggest that edge-triggered mode doesn’t provide a significant advantage and isn’t widely compatible across platforms, as most asynchronous event APIs don’t support this mode. Therefore, level-triggered mode works well enough in most cases.

Summary

So, if you’re building a cross-platform poll module, which of the three methods above is best on Windows? My approach is to implement Method 2 (custom fd_set) as the internal mechanism since it’s compatible back to Windows 95, making it a reliable fallback. Additionally, I include a plugin mechanism to allow external enhancements.

Then, if the main program detects that the system is Vista or newer and wepoll is available, it wraps wepoll as a plugin and integrates it. This layered approach offers flexibility and optimal performance on Windows.

Translated from https://skywind.me/blog/archives/3112

JAVA  WINDOWS  SELECT  EXCEED 

       

  RELATED


  0 COMMENT


No comment for this article.



  RANDOM FUN

Love with Python

For Python lover.