I apologize in advance for the length of this question, but I wanted to make it clear what I have already attempted.
Setup:
* Clients and server are running Ubuntu and have their file descriptor limits raised to 102400.
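For reference, this is roughly how I confirm the raised limit on each box (a minimal sketch; it only reads the nofile rlimit of the process running it, not any system-wide setting):

    # Quick check of the per-process file descriptor limit (the "nofile" rlimit).
    # Only reflects the process running this script, not other users or services.
    import resource

    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"nofile soft={soft} hard={hard}")
    assert soft >= 102400, "file descriptor limit was not raised as expected"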
Run Case:
The 4 clients try to make n connections per second (each connection is a simple HTTP GET request), with n ranging from 400 to 1000, until 80,000 requests have been made. The server imposes a hard response wait time, y, tested at 500, 1000, 2000, and 3000 milliseconds, before it responds with "hello".
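To be concrete about what the server is doing, here is a minimal sketch of it (not the actual code; the handler name and the RESPONSE_DELAY_MS value are just illustrative). It sleeps for the hard wait time y and then returns "hello" to each GET:

    # Minimal sketch of the test server: wait a fixed delay, then answer "hello".
    import os
    import time
    from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

    RESPONSE_DELAY_MS = int(os.environ.get("RESPONSE_DELAY_MS", "500"))  # the wait time y

    class DelayedHello(BaseHTTPRequestHandler):
        def do_GET(self):
            time.sleep(RESPONSE_DELAY_MS / 1000.0)  # hard wait before responding
            body = b"hello"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            pass  # keep benchmark output quiet

    if __name__ == "__main__":
        ThreadingHTTPServer(("0.0.0.0", 8080), DelayedHello).serve_forever()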
Problem:
At anything more than 500 connections/second there is a halt of several seconds (up to 10 or 15) during which the server no longer responds to any of the clients, and the clients are left idle waiting for responses. This happens consistently at exactly 31449 requests. During the halt the clients show the appropriate number of ESTABLISHED connections (using netstat), while the server shows around 31550 connections in TIME_WAIT. After a few seconds the number reported by the server begins to drop, and eventually it starts responding to the clients again. The same stall then occurs at some later total request count, e.g. 62198 (though this is not as consistent). The file descriptor count for that port also drops to 0.
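For anyone who wants to reproduce the observation, this is the kind of thing I run to watch the socket states (a rough sketch that tallies the state column of /proc/net/tcp; it is the same information netstat summarizes):

    # Tally TCP socket states from procfs. 01 = ESTABLISHED, 06 = TIME_WAIT.
    from collections import Counter

    STATES = {
        "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
        "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
        "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
        "0A": "LISTEN", "0B": "CLOSING",
    }

    def socket_state_counts(path="/proc/net/tcp"):
        counts = Counter()
        with open(path) as f:
            next(f)  # skip the header line
            for line in f:
                st = line.split()[3]  # the "st" column, a hex state code
                counts[STATES.get(st, st)] += 1
        return counts

    if __name__ == "__main__":
        print(socket_state_counts())  # IPv4 sockets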
Attempted Resolutions:
Increasing the ephemeral port range. The default was 32768-61000, or roughly 30k ports. Note that despite coming from 4 different physical clients, the traffic is routed through the local IP of the ELB, and all ports are thus assigned to that IP. Effectively, all 4 clients are treated as 1 instead of each of them being able to use the full port range as you would normally expect. So instead of 30k x 4 total ports, all 4 are limited to 30k. I therefore increased the port range to 1024-65535 with net.ipv4.ip_local_port_range and restarted the server, but the behavior described above did not change.
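After the change I verified the range the kernel actually uses by reading the procfs file behind net.ipv4.ip_local_port_range (a small sketch; the default 32768-61000 gives roughly 28k ports, while 1024-65535 gives roughly 64k):

    # Report the effective ephemeral port range and how many ports it allows.
    with open("/proc/sys/net/ipv4/ip_local_port_range") as f:
        low, high = map(int, f.read().split())

    print(f"ephemeral port range: {low}-{high} ({high - low + 1} ports)")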
Other TCP settings were also changed, both independently and in conjunction with each other, such as tcp_fin_timeout, tcp_tw_recycle, tcp_tw_reuse, and several others, without any sizable improvement. tcp_tw_recycle seems to help the most, but it makes the status results on the clients print out strangely and in the wrong order, and it still doesn't guarantee that the connections don't get stuck. I also understand that this is a dangerous option to enable.
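For completeness, this is how I double-checked which values were actually in effect after each change (just reading the corresponding procfs entries; note that tcp_tw_recycle does not exist on newer kernels, so it is only read if present):

    # Print the TIME_WAIT-related sysctls that were tuned, straight from procfs.
    import os

    SYSCTLS = [
        "net.ipv4.tcp_fin_timeout",
        "net.ipv4.tcp_tw_reuse",
        "net.ipv4.tcp_tw_recycle",  # removed in newer kernels (4.12+)
    ]

    for name in SYSCTLS:
        path = "/proc/sys/" + name.replace(".", "/")
        if os.path.exists(path):
            with open(path) as f:
                print(f"{name} = {f.read().strip()}")
        else:
            print(f"{name} is not available on this kernel")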
Question:
I simply want to sustain as many connections as possible, so that the real server that ends up on the c1.medium has a high baseline when it is benchmarked. What else can I do to avoid hitting this 31449-connection wall, short of recompiling the kernel or making the server unstable? I feel like I should be able to go much higher than 500/s, and I thought that increasing the port range alone should have shown some improvement, but I am clearly missing something else.
Thanks!