Here's what the data's supposed to look like:
Some junk data
More junk data
1. fairly long key, all on one line
value: some other text with spaces and stuff
2. hey look! another long key. still on one line
value: a different value with some different information
There are several of these per file, usually between twenty and thirty. The total number of key-value pairs exceeds 20,000, so manually correcting each file is not an option. The number prefacing each key is supposed to increment properly. There is supposed to be a newline between a value and the following key. Each value should be prefaced with the string "value: ".
Right now, I go line by line and classify each line as either key, value, or junk. I then parse the number out of the key and store the number, key, and value in an object.
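For reference, the line-by-line approach described above could be sketched roughly as follows. This is a minimal illustration, not the actual parser: the names KEY_RE, VALUE_RE, classify and parseRecords are hypothetical, and it assumes that any line starting with digits and a period is a key and that any line starting with "value:" is a value.

```javascript
// Hypothetical patterns: a key starts with "<number>.", a value with "value:".
const KEY_RE = /^(\d+)\.\s*(.*)$/;
const VALUE_RE = /^value:\s*(.*)$/;

// Classify a single line as key, value, or junk.
function classify(line) {
  let m;
  if ((m = KEY_RE.exec(line))) {
    return { type: 'key', number: Number(m[1]), text: m[2] };
  }
  if ((m = VALUE_RE.exec(line))) {
    return { type: 'value', text: m[1] };
  }
  return { type: 'junk', text: line };
}

// Walk the file line by line and pair each key with the first value after it.
function parseRecords(text) {
  const records = [];
  let pending = null; // most recent key that has no value yet
  for (const rawLine of text.split(/\r?\n/)) {
    const c = classify(rawLine.trim());
    if (c.type === 'key') {
      pending = { number: c.number, key: c.text, value: null };
      records.push(pending);
    } else if (c.type === 'value' && pending && pending.value === null) {
      pending.value = c.text;
    }
    // junk lines (including blank lines) are ignored
  }
  return records;
}
```

A record whose value is still null after parsing, or whose number does not follow the previous one, marks a spot where the formatting is off.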
Issues arise when the data is improperly formatted. Here are a few issues I've encountered thus far:
I handle the third scenario by computing the Levenshtein distance between the first six characters of each line and the master string "value:". How can I fix the other two issues?
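The distance check mentioned above might look something like this. It is only a sketch: the function names and the threshold of 2 edits are assumptions chosen for illustration, and the cutoff would need tuning against the real data.

```javascript
// Classic dynamic-programming Levenshtein distance, keeping two rows at a time.
function levenshtein(a, b) {
  // prev[j] = distance between the first i-1 chars of a and the first j chars of b
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,       // deletion
        curr[j - 1] + 1,   // insertion
        prev[j - 1] + cost // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumed threshold: treat a line as a value if its first six characters
// are within 2 edits of "value:".
function isValueLine(line, maxDistance = 2) {
  const prefix = line.slice(0, 6).toLowerCase();
  return levenshtein(prefix, 'value:') <= maxDistance;
}
```

Note that a plain Levenshtein distance counts a swapped pair of letters (as in "vlaue:") as two edits, so a threshold of 1 would miss that kind of typo.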
If it matters, the parsing is happening on a Node.js server, but I'm open to other languages if they can work with this inconsistent data more easily.
Take a look at this:
RegEx: ^(\d+)\. ?(.+?)(?:value|vlaue|balue|valie): ?(.+?)[\n\r]{2,}
Explained demo here: http://regex101.com/r/gG0wH8
If you have your 'misspelled value' issue fixed you can simplify it to:
^(\d+)\. ?(.+?)value: ?(.+?)[\n\r]{2,}
Otherwise, add as many misspellings as you need to that part of the RegEx, separated by |.
For this to work, I relied on these assumptions:
The key is everything after the id and before the value.
The value ends after at least 2 line breaks.
You should also remove the correct entries and then reexamine the file to check if anything else is missing.
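Since the parsing runs on Node.js, here is a rough sketch of how the RegEx above could be applied there. The flags are an assumption on my part: g to find every entry, m so that ^ matches at the start of each line, and s so that . can cross the line break between a key and its value. The names ENTRY_RE, extractEntries and leftover are made up for this example, and matchAll needs a reasonably recent Node.js version.

```javascript
// The pattern from above, with assumed flags g, m and s.
const ENTRY_RE = /^(\d+)\. ?(.+?)(?:value|vlaue|balue|valie): ?(.+?)[\n\r]{2,}/gms;

// Collect every well-formed entry as { number, key, value }.
function extractEntries(text) {
  const entries = [];
  for (const m of text.matchAll(ENTRY_RE)) {
    entries.push({
      number: Number(m[1]),
      key: m[2].trim(),   // trim the line break captured before "value:"
      value: m[3].trim(),
    });
  }
  return entries;
}

// Remove the entries that matched, leaving junk and malformed entries for review.
function leftover(text) {
  return text.replace(ENTRY_RE, '').trim();
}
```

Two caveats with this sketch: the last entry in a file only matches if it is followed by at least two line breaks, and because . can cross line breaks here, an entry with no recognizable "value:" marker can swallow the entry after it. A key that itself contains a line break is a good hint that this happened.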