Parsing inconsistent data

Here's what the data's supposed to look like:

Some junk data
More junk data 

1. fairly long key, all on one line
value: some other text with spaces and stuff

2. hey look! another long key. still on one line
value: a different value with some different information

There are several of these per file, usually between twenty and thirty. In total there are more than 20,000 key-value pairs, so manually correcting each file is not an option. The number prefacing each key is supposed to increment properly. There is supposed to be a newline between a value and the following key. Each value should be prefaced with the string "value: ".

Right now, I go line by line and classify each line as either key, value, or junk. I then parse the number out of the key and store the number, key, and value in an object.
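For reference, here is a minimal sketch of that line-by-line pass. It is not my exact code: the patterns KEY_RE and VALUE_RE and the function names classifyLine and parseLines are hypothetical and only illustrate the approach on well-formed input.

```javascript
// Hypothetical patterns for well-formed lines; real rules may differ.
const KEY_RE = /^(\d+)\.\s*(.+)$/;   // "12. some key text"
const VALUE_RE = /^value:\s*(.*)$/;  // "value: some value text"

// Classify a single line as key, value, or junk.
function classifyLine(line) {
  const key = KEY_RE.exec(line);
  if (key) return { type: "key", number: Number(key[1]), text: key[2] };
  const value = VALUE_RE.exec(line);
  if (value) return { type: "value", text: value[1] };
  return { type: "junk", text: line };
}

// Pair each key with the next value line; junk lines are skipped.
function parseLines(text) {
  const records = [];
  let current = null;
  for (const line of text.split(/\r?\n/)) {
    const c = classifyLine(line);
    if (c.type === "key") {
      current = { number: c.number, key: c.text, value: null };
    } else if (c.type === "value" && current) {
      current.value = c.text;
      records.push(current);
      current = null;
    }
  }
  return records;
}
```

Because every line is judged in isolation, a key and value that share one line, or a key that wraps onto a second line, end up misclassified.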

Issues arise when the data is improperly formatted. Here are a few issues I've encountered thus far:

  • no newline between the key and value.
  • an unexpected newline in the middle of the key or value, which results in the program viewing a portion of each key or value as junk data.
  • the word "value" being spelled wrong.

I handle the third scenario by computing the Levenshtein distance between the first six characters of each line and a master string "value:". How can I fix the other two issues?
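The misspelling check can be sketched roughly as follows. The threshold MAX_DISTANCE and the helper name looksLikeValueLine are assumptions made for illustration; the distance function is the standard dynamic-programming Levenshtein algorithm.

```javascript
// Standard Levenshtein distance, keeping only two rows of the DP table.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

const MASTER = "value:";
const MAX_DISTANCE = 2; // assumed threshold; tune it against real data

// Compare the first six characters of a line with the master string.
function looksLikeValueLine(line) {
  const prefix = line.trimStart().slice(0, MASTER.length).toLowerCase();
  return levenshtein(prefix, MASTER) <= MAX_DISTANCE;
}
```

A swap such as "vlaue:" counts as two substitutions, so a threshold of 1 would miss it, while a threshold much above 2 starts to accept unrelated six-character prefixes.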

If it matters, the parsing is happening on a node.js server, but I'm open to other languages if they can work with this inconsistent data more easily.

Take a look at this:

RegEx:

^(\d+)\. ?(.+?)(?:value|vlaue|balue|valie): ?(.+?)[\n\r]{2,}

Explained demo here: http://regex101.com/r/gG0wH8

If you have your 'misspelled value' issue fixed, you can simplify it to:

^(\d+)\. ?(.+?)value: ?(.+?)[\n\r]{2,}

Otherwise, add as many misspellings as you need to that part of the RegEx, separated by |.

For this to work I relied on the following:

  • line must start with digit(s) and a dot with an optional space
  • key is everything after the id and before the value
  • value ends after at least 2 line breaks

You should also remove the correctly matched entries and then re-examine what is left of the file to check whether anything was missed.
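Here is a rough sketch of how this could be wired up in Node.js. Several things in it are assumptions rather than part of the pattern above: the g and m flags, the use of [\s\S] in place of . so that keys and values may span line breaks (in JavaScript, . does not match a newline unless the s flag is set), the normalisation of line endings, and the names ENTRY and parseEntries.

```javascript
// Assumed flags: g = find every entry, m = let ^ match at each line start.
// [\s\S] replaces . so a key or value can contain a single line break.
const ENTRY =
  /^(\d+)\. ?([\s\S]+?)(?:value|vlaue|balue|valie): ?([\s\S]+?)\n{2,}/gm;

function parseEntries(text) {
  // Normalise CRLF/CR to LF so a single Windows line break is not
  // mistaken for two, and terminate the last entry with a blank line.
  const normalized = text.replace(/\r\n?/g, "\n") + "\n\n";

  const entries = [];
  for (const m of normalized.matchAll(ENTRY)) {
    entries.push({
      number: Number(m[1]),
      // Collapse stray line breaks inside the key or value into spaces.
      key: m[2].replace(/\s+/g, " ").trim(),
      value: m[3].replace(/\s+/g, " ").trim(),
    });
  }

  // Remove everything that matched; what remains is junk or broken
  // entries that need another look.
  const leftover = normalized.replace(ENTRY, "").trim();
  return { entries, leftover };
}
```

Calling parseEntries on a file's contents returns the parsed entries plus the leftover text. If leftover contains more than the expected junk lines, something was missed. Note that a key which itself contains "value:" would be split too early, and a blank line in the middle of a value would cut it short; both cases should show up in the leftover text or as gaps in the numbering.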