How to handle mongodb "schema" change in production

Question

I use a MongoDB + Node.js + Mongoose (ODM) backend.

Let's say I have a nested array of objects without an _id field:

mongoose.Schema({
  nested: [{
    _id: false, prop: 'string'
  }]
})

And then I want to add an _id field to all nested objects, so the Mongoose schema would be:

mongoose.Schema({
  nested: [{
    prop: 'string'
  }]
})

Then I should run some script to modify the production DB, right? What is the best way to handle such a change? Which tool (or approach) is best for implementing it?

node.js
mongodb
mongoose

Answer 1

One of the significant advantages of schema-less databases is that you don't have to update the entire database with new schema layouts. If some of the documents in the DB don't have particular information, then your code can do the appropriate thing instead, or elect not to do anything with that record.

Another option is to update the documents lazily, only when they are next looked at. In this case, you might give each record/document a version flag, which initially may not even be present (its absence signifying 'version 0'). Even that is optional, though. Instead, your database access code can look for the data it requires, and if it does not exist because it is new information added after a code update, fill it in to the best of its ability.
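A minimal sketch of the version-flag idea, assuming plain JavaScript objects. The field name schemaVersion and the function upgradeDocument are hypothetical names chosen for illustration, not part of MongoDB or Mongoose:

```javascript
// upgrades[n] is a function that turns a version-n document into version n + 1.
// A document with no schemaVersion field is treated as version 0.
function upgradeDocument(doc, upgrades) {
  let version = doc.schemaVersion || 0;
  let current = doc;
  while (version < upgrades.length) {
    current = upgrades[version](current);
    version += 1;
  }
  // Stamp the result with the version it now conforms to.
  return { ...current, schemaVersion: version };
}
```

The access layer would call this on every document it reads and only write the result back if the version actually changed.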

For your example, converting nested objects declared with _id: false into ones with a standard ObjectId field: when a document is read (or written back after an update) and its nested objects still have no _id, make the change at that point, and write it back only when it is absolutely required.
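For the nested-array case this could look roughly like the sketch below. It assumes the document is a plain object (for example from the native driver or a lean query) with the nested array from the question. ensureNestedIds and makeId are hypothetical names; with Mongoose, makeId could be something like () => new mongoose.Types.ObjectId().

```javascript
// Adds an _id to every nested object that lacks one.
// Returns true if the document was modified and needs to be written back.
function ensureNestedIds(doc, makeId) {
  let changed = false;
  for (const item of doc.nested || []) {
    if (item._id === undefined) {
      item._id = makeId(); // makeId is supplied by the caller
      changed = true;
    }
  }
  return changed;
}
```

The return value lets the caller skip the write for documents that were already migrated.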

Answer 2

You do indeed have to write a script that goes over the collection and adds the new field to each document. However, exactly how you do it depends on the size of your DB and the performance of your storage system. Adding a field to a document changes its size and thus, in most cases, causes it to be relocated. This operation has an impact on IO and is also bounded by it. If your collection is just a few thousand documents, maybe up to a hundred thousand, then you can simply iterate over it in one loop, because the whole collection probably fits into memory and all IO will happen afterwards. But if the collection spans far beyond available memory, the approach is more complicated. We usually follow this approach in production use of MongoDB:

Open cursor with timeout=False
Read a chunk of documents into memory
Run update queries on these documents
Sleep for some time to avoid overloading IO subsystem and hurting production application
Repeat until done
Close the cursor :)
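The steps above can be sketched in Node.js roughly as follows. This is only an illustration under stated assumptions: migrateInBatches, applyBatch, batchSize and pauseMs are hypothetical names, the cursor can be any async iterable (the MongoDB Node.js driver's cursors are), and the default values are placeholders to be tuned.

```javascript
// Resolves after the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Reads documents from `cursor` in chunks, hands each chunk to `applyBatch`
// (which should issue the update queries), and pauses between chunks.
// Returns the number of documents processed.
async function migrateInBatches(cursor, applyBatch, { batchSize = 500, pauseMs = 1000 } = {}) {
  let batch = [];
  let migrated = 0;
  for await (const doc of cursor) {
    batch.push(doc);
    if (batch.length >= batchSize) {
      await applyBatch(batch);
      migrated += batch.length;
      batch = [];
      await sleep(pauseMs); // give the IO subsystem room for production traffic
    }
  }
  if (batch.length > 0) {
    await applyBatch(batch); // final, possibly smaller, chunk
    migrated += batch.length;
  }
  return migrated;
}
```

With the Node.js driver, the cursor might come from collection.find(filter, { noCursorTimeout: true }) and applyBatch might call collection.bulkWrite with one update per document; closing the cursor afterwards is left to the caller.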

The size of the document chunk and the sleep period must be determined experimentally. Usually you want to avoid nonzero QR/QW (queued reads and writes) in mongostat for the duration of the migration. For larger collections on slower drives (like EBS on Amazon) this IO-safe approach can take from hours to days.