Node.JS Regex engine fails on large input

Question

Node.JS Regex engine fails on large input

The question is a bit complicated, and googling didn't really help. I will try to put in only relevant aspects of it.

I have a large document in approximately the following format:

Sample Input:

ABC is a word from one line of this document. It is followed by
some random line
PQR which happens to be another word.
This is just another line
I have to fix my regular expression.
Here GHI appears in the middle.
This may be yet another line.
VWX is a line
this is the last line

I am trying to remove the section of the text according to the below:

From either of:
- ABC
- DEF
- GHI
To either of(while retaining this word):
- PQR
- STU
- VWX

The words that make up "From" can appear anywhere in a line (Look at GHI). But for removal the entire line needs to be removed. (The entire line containing GHI needs to be removed as in the sample output below)

Sample Output:

PQR which happens to be another word.
This is just another line
I have to fix my regular expression.
VWX is a line
this is the last line

The above example actually seemed easy for me until I ran it against very large input files ( 49KB)

What I have tried:

The regular expression I am currently using is (with case insensitive and multiline modifier):

^.*\b(abc|def|ghi)\b(.|\s)*?\b(pqr|stu|vwx)\b

Problem

The above regexp works wonderfully on small text files. But fails/crashes the engine on large files. I have tried it against the below:

V8 (Node.js) : Hangs
Rhino : Hangs
Python : Hangs
Java : StackoverflowError (Stack trace posted at the end of this question)
IonMonkey (Firefox) : WORKS!

Actual Input:

My original Input: http://ideone.com/W4sZmB

My regular expression (split across multiple lines for clarity):

^.*\\b(patient demographics|electronically signed|md|rn|mspt|crnp|rt)\\b
 (.|\\s)*?
 \\b(history of present illness|hpi|chief complaint|cc|reason for consult|patientis|inpatient is|inpatientpatient|pt is|pts are|start end frequency user)\\b

Question:

Is my regular expression correct? Can it be optimized further to avoid this problem?
In case it is correct, why do other engines hang infinitely? A section of stack trace is below:

Stack Trace:

Exception in thread "main" java.lang.StackOverflowError
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4218)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)

^{PS: I'm adding several tags to this question since I have tried it on those environments and the experiment failed.}

java
python
regex
node.js
v8

Answer 1

The problem is the (.|\s)* because any space character will match both and it will allow it to go down both options. This makes it get exponentially larger.

You can see the issue with this regex in ruby

str = "b" + "a" * 200 + "cbab"

/b(a|a)*b/.match str

which takes forever, while a basically identical one

/ba*b/.match str

matches quickly.

You can fix this by either using just .* or if . doesn't match newlines (.|\n)*

Answer 2

I'd be tempted to try simplifying the re. It's not very complicated at the moment to be honest but how about:

\b(abc|def|ghi)\b.*\b(pqr|stu|vwx)\b

Doesn't that still do what you're after but with the start of line anchor and the unnecessary optional element in the middle? Might not make any difference but it might be worth trying.

Answer 3

I think your problem might lie in the fact that as files get longer and longer, you can match pairs of from and to blocks going by approximately n x m / 2. This means that you get exponentially more results that take up more and more of the source file. If the file began with ABC and ended with VWX, then one of the matches would be the entire file.

To give the regex engine less matches to deal with, my first approach would be to only regex on (abc|def|ghi) and (pqr|stu|vwx) separately. After you get the results back, you can go through each from match and try and find the first matching to block. Some psuedo-code to accomplish this would be

from = regex.match(file, '(abc|def|ghi)')
to = regex.match(file, '(pqr|stu|vwx)')
for each match in from:
  for index in to:
    if index > match:
      add index, match to results
      break
for each result:
  parse backwards to the beginning of the line
  edit the file to remove the matching text

Although this creates more work for yourself, it means that the regex parser doesn't have to keep the entire n kB file in memory at once, and can parse through small blocks much more effectively.