I have a collaborative web application that handles JSON objects like the following:
var post = {
  id: 123,
  title: 'Sterling Archer',
  comments: [
    { text: 'Comment text', tags: ['tag1', 'tag2', 'tag3'] },
    { text: 'Comment test', tags: ['tag2', 'tag5'] }
  ]
};
My approach is to use the RFC 6902 (JSON Patch) specification with a jsonpatch library to patch the JSON documents. All such documents are stored in a MongoDB database, and as you know MongoDB is very slow for frequent writes.
To get more speed and handle higher load, I use Redis as a queue for patch operations like the following:
{ "op": "add", "path": "/comments/2", "value": {text: 'Comment test3', tags: ['tag4']}" }
I just store all such patch operations in the queue, and at midnight a cron script runs that takes all the patches, constructs the full document, and updates it in the MongoDB database.
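For context, a minimal sketch of what such a nightly job might look like, assuming Node with the node-redis v4 and official MongoDB drivers plus an RFC 6902 helper such as fast-json-patch (the libraries, key name, and queue-entry shape here are my assumptions, not part of the original setup):

// Hypothetical nightly job: drain the patch queue and persist the results.
const { createClient } = require('redis');            // assumed: node-redis v4
const { MongoClient } = require('mongodb');           // assumed: official driver
const { applyPatch } = require('fast-json-patch');    // assumed RFC 6902 library

async function applyQueuedPatches() {
  const redis = createClient();
  await redis.connect();
  const mongo = await MongoClient.connect('mongodb://localhost:27017');
  const posts = mongo.db('app').collection('posts');

  // Each queue entry is assumed to look like { postId, ops: [ ...RFC 6902 ops ] }
  const raw = await redis.lRange('patch-queue', 0, -1);

  for (const entry of raw) {
    const { postId, ops } = JSON.parse(entry);
    const doc = await posts.findOne({ id: postId });
    // applyPatch throws when an operation is invalid (e.g. index out of bounds),
    // which is exactly the failure mode described below
    const patched = applyPatch(doc, ops, /* validate */ true).newDocument;
    await posts.replaceOne({ id: postId }, patched);
  }

  await redis.del('patch-queue');
  await redis.quit();
  await mongo.close();
}

Replaying queued patches against whatever is in MongoDB at midnight is exactly where an invalid operation would surface.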
What I don't understand yet is what I should do with a corrupted patch like:
{ "op": "add", "path": "/comments/0/tags/5", "value": 'tag4'}
The patch above cannot be applied to the document above because the tags array only has length 3 (according to the official spec, http://tools.ietf.org/html/rfc6902#page-5):
The specified index MUST NOT be greater than the number of elements in the array.
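To illustrate, this is roughly how that failure shows up when you try to apply the operation, assuming fast-json-patch with validation enabled (the library and error handling are my assumption; any compliant RFC 6902 implementation should reject it similarly):

const { applyPatch } = require('fast-json-patch'); // assumed RFC 6902 library

const badPatch = [
  { op: 'add', path: '/comments/0/tags/5', value: 'tag4' }
];

try {
  // Third argument enables validation; the add targets index 5 of a 3-element array
  applyPatch(post, badPatch, true);
} catch (err) {
  // The whole patch is rejected, so nothing is written
  console.error('Rejected patch:', err.message);
}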
So while the user is online he doesn't see any errors, because his patch operations are simply stored in the Redis queue, but the next day he gets a broken document because of the broken patch that the cron script could not apply.
So my question is: how can I guarantee that all patches stored in the Redis queue are correct and don't corrupt the primary document?
As with any system that can become inconsistent, you must allow for patches to be applied as quickly as possible if you wish to catch conflicts sooner and decrease the likelihood of running into them. That is likely your main issue if you are not notifying the other clients of any updated data as soon as possible (and are just waiting for the CRON to run to update the shared data that the other clients can access).
As others have asked, it's important to understand how a "bad" patch got into the operation queue in the first place. My guess is that it comes from several users updating the same shared document concurrently, before anyone has seen the others' changes.
Although I have no code to go off of, I can take a shot in the dark and help you analyze that point. The first thing we need to analyze is the different scenarios that may come up when updating a "shared" resource. It's important to note that, in any system that must eventually be consistent, we care about the order in which operations are applied and about how clients are told about the resulting state.
The latter is really up to you, and you will need a good notification/messaging system to update the "truth" that clients see.
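As a rough illustration of that notification piece, assuming Redis pub/sub as the messaging layer (Socket.IO or any message broker would work just as well; none of this is from the original post):

const { createClient } = require('redis'); // assumed: node-redis v4

// After the server accepts a user's operations, broadcast the new "truth"
async function broadcastAcceptedOps(publisher, postId, ops) {
  await publisher.publish(`post:${postId}:ops`, JSON.stringify(ops));
}

// Every connected client (or its socket gateway) subscribes and rebases on arrival
async function listenForOps(postId, onRemoteOps) {
  const subscriber = createClient(); // pub/sub needs its own connection
  await subscriber.connect();
  await subscriber.subscribe(`post:${postId}:ops`, (message) => {
    onRemoteOps(JSON.parse(message));
  });
}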
User A applies operations 1 & 2. The document is updated on the server and then User B is notified of this. User B was going to apply operations 3 & 4, but these operations (in this order) do not conflict with operations 1 & 2. All is well in the world. This is a good situation.
Now suppose User A applies operations 1 & 2 while User B applies operations 3 & 4 at roughly the same time, before either is notified of the other's changes.
If you apply the operations atomically per user, you can get the following queues:
[1,2,3,4] [3,4,1,2]
Anywhere along the line, if there is a conflict, you must notify either User A or User B based on "who got there first" (or any other weighting semantics you wish to use). Again, how you deal with conflicts is up to you. If you have not read up on vector clocks, you should do so.
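For reference, a vector clock is just a per-client counter map; comparing two of them tells you whether one update causally precedes the other or whether they are concurrent, i.e. a real conflict. A minimal sketch, not tied to any particular library:

// Minimal vector clock sketch: a clock is { clientId: counter, ... }
function increment(clock, clientId) {
  return { ...clock, [clientId]: (clock[clientId] || 0) + 1 };
}

// Returns 'before', 'after', 'equal', or 'concurrent' (a genuine conflict)
function compare(a, b) {
  const ids = new Set([...Object.keys(a), ...Object.keys(b)]);
  let aLess = false;
  let bLess = false;
  for (const id of ids) {
    const av = a[id] || 0;
    const bv = b[id] || 0;
    if (av < bv) aLess = true;
    if (bv < av) bLess = true;
  }
  if (aLess && bLess) return 'concurrent'; // neither saw the other: conflict
  if (aLess) return 'before';
  if (bLess) return 'after';
  return 'equal';
}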
If you don't apply operations atomically per user, you can get the following queues:
[1,2,3,4] [3,4,1,2] [1,3,2,4] [3,1,4,2] [3,1,2,4] [1,3,4,2]
As you can see, forgoing atomic updates per user increases the combinations of updates and will therefore increase the likelihood of a collision happening. I urge you to ensure that operations are being added to the queue atomically per user.
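One simple way to get that atomicity, assuming the queue stays in Redis, is to enqueue each user's whole batch as a single entry rather than one entry per operation (a sketch; the key name and entry shape are mine):

// Enqueue the whole batch in one RPUSH so other users' ops can never interleave
async function enqueueUserOps(redis, userId, postId, ops) {
  const entry = JSON.stringify({ userId, postId, ops });
  await redis.rPush('patch-queue', entry); // a single atomic list append
}

// Wrong: pushing operations one by one lets another user's ops slip in between
// for (const op of ops) await redis.rPush('patch-queue', JSON.stringify(op));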
Some important things you should remember:
EDIT:
It seems that Google Docs handles conflict resolution with operational transformations, that is, by shifting whole characters/lines over to make way for a hybrid application of all operations: http://googledocs.blogspot.com/2010/09/whats-different-about-new-google-docs_22.html
As I had said before, it's all up to how you want to handle your own conflicts, which should largely be determined by the application/product itself and its use cases.
IMHO you are introducing unneeded complexity instead of a simpler solution. These would be my alternative suggestions to your approach of a JSON Patch cron, which is very hard to make consistent and atomic.
1. Use MongoDB only: With proper database design and indexing, and proper hardware allocation/sharding, the write performance of MongoDB is really fast. And the kinds of operations you are using in JSON Patch are natively supported on MongoDB BSON documents by its query language, e.g. $push, $set, $inc, $pull, etc. (see the first sketch after this list). Perhaps you don't want to interrupt users' activities with a synchronous write to MongoDB; for that, the solution is using async queues as mentioned in point #2.
2. Use task queues & MongoDB: Instead of storing patches in Redis like you do now, you can push the patching task to a task queue, which will asynchronously do the MongoDB update, so the user will not experience any slow performance. One very good task queue is Celery, which can use Redis as a broker and messaging backend (see the second sketch after this list). So each user's update becomes a single task that the task queue applies to MongoDB, and there is no performance hit.
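For example, the "add a comment at index 2" patch from the question maps directly onto a single atomic MongoDB update; $push with $each/$position is standard MongoDB syntax, and the collection name here is my assumption:

// Inside an async handler; `posts` is the MongoDB collection holding the documents above.
// The JSON Patch { "op": "add", "path": "/comments/2", "value": {...} } becomes:
await posts.updateOne(
  { id: 123 },
  {
    $push: {
      comments: {
        $each: [{ text: 'Comment test3', tags: ['tag4'] }],
        $position: 2   // insert at index 2 instead of appending at the end
      }
    }
  }
);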
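Celery itself is Python; if the stack stays on Node, a Redis-backed job queue such as Bull plays the same role (my substitution, sketched below; applyOpsToMongo is a hypothetical helper that does the $push/$set translation shown above):

const Queue = require('bull');                            // assumed: Bull, Redis-backed
const patchQueue = new Queue('patches', 'redis://127.0.0.1:6379');

// Producer: called from the request handler, returns immediately
function schedulePatch(postId, ops) {
  return patchQueue.add({ postId, ops });
}

// Consumer: a worker process applies each job to MongoDB as it arrives
patchQueue.process(async (job) => {
  const { postId, ops } = job.data;
  await applyOpsToMongo(postId, ops); // hypothetical helper
});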