TL;DR - Are there any best practices when configuring the globalAgent that allow for high throughput with a high volume of concurrent requests?
Here's our issue:
As far as I can tell, connection pools in Node are managed by the http module, which queues requests in a globalAgent object, which is global to the Node process. The number of requests pulled from the globalAgent queue at any given time is determined by the number of open socket connections, which is determined by the maxSockets property of globalAgent (defaults to 5).
When using "keep-alive" connections, I would expect that as soon as a request is resolved, the connection that handled the request would be available and can handle the next request in the globalAgent's queue.
Instead, however, it appears that each connection up to the max number is resolved before any additional queued requests are handled.
When watching networking traffic between components, we see that if maxSockets is 10, then 10 requests resolve successfully. Then there is a pause 3-5 second pause (presumably while new tcp connections are established), then 10 more requests resolve, then another pause, etc.
This seems wrong. Node is supposed to excel at handling a high volume of concurrent requests. So if, even with 1000 available socket connections, if request 1000 cannot be handled until 1-999 resolve, you'd hit a bottleneck. Yet I can't figure out what we're doing incorrectly.
Update
Here's an example of how we're making requests -- though it's worth noting that this behavior occurs whenever a node process makes an http request, including when that request is initiated by widely-used third-party libs. I don't believe it is specific to our implementation. Nevertheless...
class Client
constructor: (@endpoint, @options = {}) ->
@endpoint = @_cleanEndpoint(@endpoint)
throw new Error("Endpoint required") unless @endpoint && @endpoint.length > 0
_.defaults @options,
maxCacheItems: 1000
maxTokenCache: 60 * 10
clientId : null
bearerToken: null # If present will be added to the request header
headers: {}
@cache = {}
@cards = new CardMethods @
@lifeStreams = new LifeStreamMethods @
@actions = new ActionsMethods @
_cleanEndpoint: (endpoint) =>
return null unless endpoint
endpoint.replace /\/+$/, ""
_handleResult: (res, bodyBeforeJson, callback) =>
return callback new Error("Forbidden") if res.statusCode is 401 or res.statusCode is 403
body = null
if bodyBeforeJson and bodyBeforeJson.length > 0
try
body = JSON.parse(bodyBeforeJson)
catch e
return callback( new Error("Invalid Body Content"), bodyBeforeJson, res.statusCode)
return callback(new Error(if body then body.message else "Request failed.")) unless res.statusCode >= 200 && res.statusCode < 300
callback null, body, res.statusCode
_reqWithData: (method, path, params, data, headers = {}, actor, callback) =>
headers['Content-Type'] = 'application/json' if data
headers['Accept'] = 'application/json'
headers['authorization'] = "Bearer #{@options.bearerToken}" if @options.bearerToken
headers['X-ClientId'] = @options.clientId if @options.clientId
# Use method override (AWS ELB problems) unless told not to do so
if (not config.get('clients:useRealHTTPMethods')) and method not in ['POST', 'PUT']
headers['x-http-method-override'] = method
method = 'POST'
_.extend headers, @options.headers
uri = "#{@endpoint}#{path}"
#console.log "making #{method} request to #{uri} with headers", headers
request
uri: uri
headers: headers
body: if data then JSON.stringify data else null
method: method
timeout: 30*60*1000
, (err, res, body) =>
if err
err.status = if res && res.statusCode then res.statusCode else 503
return callback(err)
@_handleResult res, body, callback
To be honest, coffeescript isn't my strong point so can't comment really on the code.
However, I can give you some thoughts: in what we are working on, we use nano to connect to cloudant and we're seeing up to 200requests/s into cloudant from a micro AWS instance. So you are right, node should be up to it.
Try using request https://github.com/mikeal/request if you're not already. (I don't think it will make a difference, but nevertheless worth a try as that is what nano uses).
These are the areas I would look into:
The server doesn't deal well with multiple requests and throttles it. Have you run any performance tests against your server? If it can't handle the load for some reason or your requests are throttled in the os, then it doesn't matter what your client does.
Your client code has a long running function somewhere which prevents node processing any reponses you get back from the server. Perhaps 1 specific response causes a response callback to spend far too long.
Are the endpoints all different servers/hosts?