How to store JSON Patch operations in a Redis queue and guarantee their consistency?

I have a collaborative web application that handles JSON objects like the following:

var post = {
  id: 123,
  title: 'Sterling Archer',    
  comments: [
    {text: 'Comment text', tags: ['tag1', 'tag2', 'tag3']},
    {text: 'Comment test', tags: ['tag2', 'tag5']}
  ]  
};

My approach is to use the RFC 6902 (JSON Patch) specification with a jsonpatch library for patching the JSON documents. All such documents are stored in a MongoDB database which, as you know, is quite slow for frequent writes.

To get more speed and handle higher load, I use Redis as a queue for patch operations like the following:

{ "op": "add", "path": "/comments/2", "value":  {text: 'Comment test3', tags: ['tag4']}" }

I simply store all such patch operations in the queue, and at midnight a cron script reads all the patches, constructs the full document, and updates it in the MongoDB database.
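
Roughly, the flow looks like this (a simplified sketch; I'm showing the callback-style node_redis client here, and the queue key name is just for illustration):

var redis = require('redis');
var client = redis.createClient();

// while the user is online: append each patch operation to a per-document list
var operation = { "op": "add", "path": "/comments/2", "value": { "text": "Comment test3", "tags": ["tag4"] } };
client.rpush('patches:post:123', JSON.stringify(operation));

// at midnight the cron script drains the list, rebuilds the document and writes it to MongoDB
client.lrange('patches:post:123', 0, -1, function (err, items) {
  var patch = items.map(function (item) { return JSON.parse(item); });
  // load the post from MongoDB, apply `patch` with the jsonpatch library, save it back
});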

What I don't understand yet is what I should do in the case of a corrupted patch like:

{ "op": "add", "path": "/comments/0/tags/5", "value": 'tag4'}

The patch above doesn't get applied to the document above, because the tags array only has 3 elements (see the official spec, http://tools.ietf.org/html/rfc6902#page-5):

 The specified index MUST NOT be greater than the number of elements in the array.

So while the user is online he doesn't get any errors, because his patch operations simply get stored in the Redis queue, but the next day he gets a broken document because the broken patch couldn't be applied by the cron script.
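
For illustration, here is roughly what happens if the patch is validated against the document (a sketch using the fast-json-patch package; other jsonpatch libraries report this differently):

var jsonpatch = require('fast-json-patch');

var badPatch = [{ "op": "add", "path": "/comments/0/tags/5", "value": "tag4" }];

// validate() returns an error object because index 5 is beyond the end of the 3-element tags array
var error = jsonpatch.validate(badPatch, post);
console.log(error);

// applyPatch() with validation enabled throws instead of silently producing a broken document
jsonpatch.applyPatch(post, badPatch, true);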

So my question is: how can I guarantee that all patches stored in the Redis queue are correct and won't corrupt the primary document?

As with any system that can become inconsistent, you must allow patches to be applied as quickly as possible if you wish to catch conflicts sooner and decrease the likelihood of running into them. That is likely your main issue if you are not notifying the other clients of updated data as soon as possible (and are just waiting for the cron job to run to update the shared data that the other clients can access).

As others have asked, it's important to understand how a "bad" patch got into the operation queue in the first place. Here are some guesses from my standpoint:

  1. A user had applied some operations that got lost in translation. How? I don't know, but it would explain the discrepancy.
  2. Operations are not being applied in the correct order. How? I don't know. I have no code to go off of.

Although I have no code to go off of, I can take a shot in the dark and help you analyze the latter point. The first thing we need to analyze is the different scenarios that may come up when updating a "shared" resource. It's important to note that, in any system that must eventually be consistent, we care about two things:

  1. Order of the operations.
  2. How we will deal with conflicts.

The latter is really up to you, and you will need a good notification/messaging system to update the "truth" that clients see.

Scenario 1

User A applies operations 1 & 2. The document is updated on the server and then User B is notified of this. User B was going to apply operations 3 & 4, but these operations (in this order) do not conflict with operations 1 & 2. All is well in the world. This is a good situation.

Scenario 2

User A applies operations 1 & 2. User B applies operations 3 & 4.

If you apply the operations atomically per user, you can get the following queues:

[1,2,3,4] [3,4,1,2]

Anywhere along the line, if there is a conflict, you must notify either User A or User B based on "who got there first" (or any other weighting semantics you wish to use). Again, how you deal with conflicts is up to you. If you have not read up on vector clocks, you should do so.

If you don't apply operations atomically per user, you can get the following queues:

[1,2,3,4] [3,4,1,2] [1,3,2,4] [3,1,4,2] [3,1,2,4] [1,3,4,2]

As you can see, forgoing atomic updates per user increases the combinations of updates and will therefore increase the likelihood of a collision happening. I urge you to ensure that operations are being added to the queue atomically per user.
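
To make that concrete: Redis executes each command atomically, so pushing all of a user's operations with a single multi-value RPUSH guarantees they land in the queue as one contiguous block (a sketch assuming the node_redis client; the key and operation contents are made up):

var redis = require('redis');
var client = redis.createClient();

// User A's operations 1 & 2 and User B's operations 3 & 4 (placeholder patch operations)
var a1 = { op: 'replace', path: '/title', value: 'New title' };
var a2 = { op: 'add', path: '/comments/2', value: { text: 'New comment', tags: [] } };
var b3 = { op: 'add', path: '/comments/0/tags/3', value: 'tag6' };
var b4 = { op: 'replace', path: '/comments/1/text', value: 'Edited comment' };

// one RPUSH per user: the other user's push can only land entirely before or entirely after it,
// so the queue is either [1,2,3,4] or [3,4,1,2], never an interleaving like [1,3,2,4]
client.rpush('patches:post:123', JSON.stringify(a1), JSON.stringify(a2));
client.rpush('patches:post:123', JSON.stringify(b3), JSON.stringify(b4));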

A Recap

Some important things you should remember:

  1. Make sure updates to the queue are atomically applied per user.
  2. Figure out how you will deal with several versions of a shared resource arising from multiple mutations from different clients (again I suggest you read up on vector clocks).
  3. Don't update a shared resource that may be accessed by several clients in real-time as a cron job.
  4. When there is a conflict that cannot be resolved, figure out how you will deal with it.
  5. As a result of point 3, you will need to come up with a notification system so that clients can get updated resources quickly. As a result of point 4, you may choose to tell clients that something went wrong with their update. Conveniently, you're already using Redis, which has pub/sub capabilities (a minimal sketch follows this list).
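
To make point 5 concrete, here is a minimal pub/sub sketch (again assuming the node_redis client; the channel name and message shape are made up):

var redis = require('redis');
var pub = redis.createClient();
var sub = redis.createClient();

// each client subscribes to the channel of the document it currently has open
sub.subscribe('post:123');
sub.on('message', function (channel, message) {
  var event = JSON.parse(message);
  // if event.patch is present, apply it to the local copy;
  // if event.error is present, tell the user their update was rejected
});

// the server publishes every accepted patch (or a conflict notice) as soon as it is processed
var acceptedPatch = { op: 'replace', path: '/title', value: 'New title' };
pub.publish('post:123', JSON.stringify({ patch: acceptedPatch }));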

EDIT:

It seems like Google Docs handles conflict resolution with transformations, that is, by shifting whole characters/lines over to make way for a hybrid application of all operations: http://googledocs.blogspot.com/2010/09/whats-different-about-new-google-docs_22.html

As I had said before, it's all up to how you want to handle your own conflicts, which should largely be determined by the application/product itself and its use cases.

IMHO you are introducing unneeded complexity instead of a simpler solution. These would be my alternative suggestions instead of your approach of a JSON patch cron, which is very hard to make consistent and atomic.

  1. Use MongoDB only: With proper database design and indexing, and proper hardware allocation/sharding, the write performance of MongoDB is really fast. And the kind of operations you are using in JSON Patch are natively supported by MongoDB BSON documents and its query language, e.g. $push, $set, $inc, $pull, etc. (see the first sketch after this list). Perhaps you don't want to interrupt the user's activity with a synchronous write to MongoDB; for that, the solution is using async queues, as mentioned in point #2.

  2. Use task queues & MongoDB: Instead of storing patches in Redis like you do now, you can push the patching task to a task queue, which will asynchronously do the MongoDB update, so the user will not experience any slow performance. One very good task queue is Celery, which can use Redis as its broker & messaging backend. So each user's update becomes a single task that the task queue applies to MongoDB, and there is no performance hit (see the second sketch after this list).
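
To illustrate point 1: the "add a comment" patch from the question maps directly onto one atomic MongoDB update (a sketch using the 2.x-style Node.js MongoDB driver; the database, collection and field names follow the example document):

var MongoClient = require('mongodb').MongoClient;

MongoClient.connect('mongodb://localhost:27017/blog', function (err, db) {
  // equivalent of { "op": "add", "path": "/comments/2", "value": {...} }:
  // $push appends to the array atomically inside MongoDB, no read-modify-write needed
  db.collection('posts').updateOne(
    { id: 123 },
    { $push: { comments: { text: 'Comment test3', tags: ['tag4'] } } },
    function (err, result) {
      db.close();
    }
  );
});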
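
And to illustrate point 2 in a Node stack: Celery itself is Python, but the same pattern works with a Node task queue such as Bull, which also uses Redis as its broker (a sketch; the queue name and payload shape are made up):

var Queue = require('bull');
var patchQueue = new Queue('post-updates', 'redis://127.0.0.1:6379');

// producer (web process): one task per user update, so the HTTP request returns immediately
patchQueue.add({
  postId: 123,
  update: { $push: { comments: { text: 'Comment test3', tags: ['tag4'] } } }
});

// worker process: applies the update to MongoDB asynchronously
patchQueue.process(function (job, done) {
  // run the MongoDB update here, e.g.
  // posts.updateOne({ id: job.data.postId }, job.data.update), then signal completion
  done();
});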