I've been trying to implement a simple long polling service for use in my own projects and maybe release it as a SAAS if I succeed. These are the two approaches I've tried so far, both using Node.js (polling PostgreSQL in the back).
Every new connection is pushed onto a queue of connections, which is being walked through in an interval.
var queue = [];
function acceptConnection(req, res) {
res.setTimeout(5000);
queue.push({ req: req, res: res });
}
function checkAll() {
queue.forEach(function(client) {
// respond if there is something new for the client
});
}
// this could be replaced with a timeout after all the clients are served
setInterval(checkAll, 500);
Every client gets his own ticker which checks for new data
function acceptConnection(req, res) {
// something which periodically checks data for the client
// and responds if there is anything new
new Ticker(req, res);
}
While this keeps the minimum latency for each client lower, it also introduces overhead by setting a lot of timeouts.
Both of these approaches solve the problem quite easily, but I don't feel that this will scale up easily to something like 10 million open connections, especially since I'm polling the database on every check for every client.
I thought about doing this without the database and just immediately broadcast new messages to all open connections, but that will fail if a client's connection dies for a few seconds while the broadcast is happening, because it is not persistent. Which means I basically need to be able to look up messages in history when the client polls for the first time.
I guess one step up here would be to have a data source where I can subscribe to new data coming in (CouchDB change notifications?), but maybe I'm missing something in the big picture here?
What is the usual approach for doing highly scalable long polling? I'm not specifically bound to Node.js, I'd actually prefer any other suggestion with a reasoning why.
Not sure if this answers your question, but I like the approach of PushPin (+ explanation of concepts).
I love the idea (using reverse proxy and communicating with return codes + delayed REST return requests), but I do have reservations about the implementation. I might be underestimating the problem, but is seems to me that the technologies used are a bit on an overkill. Not sure if I will use it or not yet, would prefer a more lightweight solution, but I find the concept phenomenal.
Would love to hear what you used eventually.
Since you mentioned scalability, I have to get a little bit theoretical, as the only practical measure is load testing. Therefore, all I can offer is advice.
Generally speaking, once-per anything is bad for scalability. Especially once-per-connection or once-per-request since that makes part of your app proportional to the amount of traffic. Node.js removed the thread-per-connection dependency with its single-threaded asynchronous I/O model. Of course, you can't completely eliminate having something per-connection, like a request and response object and a socket.
I suggest avoiding anything that opens a database connection for every HTTP connection. This is what connections pools are for.
As for choosing between your two options above, I would personally go for the second choice because it keeps each connection isolated. The first option uses a loop over connections, which means actual execution time per connection. It's probably not a big deal given that I/O is asynchronous, but given a choice between an iteration-per-connection and the mere existence of an object-per-connection, I would prefer to just have an object. Then I have less to worry about when suddenly there are 10,000 connections.
The C10K problem seems like a good reference for this, though this is really personal judgement to be honest.