How can I match high-value Unicode characters using a regex?

Specifically, I want to match the range [#x10000-#xEFFFF]. AFAIK, the \u escape sequences only accept 4 hex digits, not 5. Is there a way to match higher values?

Internally, JavaScript exposes strings as 16-bit code units (UCS-2/UTF-16), so a single \u escape can only reach the Basic Multilingual Plane (BMP). For higher-range characters, you will have to use surrogate pairs. For instance, to find U+13FFA, you can match \uD80F\uDFFA.
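To show where \uD80F\uDFFA comes from, here is a minimal sketch of the standard UTF-16 surrogate arithmetic. The helper name toSurrogatePair is hypothetical and only for illustration; it assumes the input is a code point above U+FFFF.

```javascript
// Hypothetical helper: split a code point above U+FFFF into its
// UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  var offset = codePoint - 0x10000;        // 20-bit value
  var high = 0xD800 + (offset >> 10);      // top 10 bits
  var low = 0xDC00 + (offset & 0x3FF);     // bottom 10 bits
  return [high, low];
}

var pair = toSurrogatePair(0x13FFA);
var text = String.fromCharCode(pair[0], pair[1]);

console.log(pair[0].toString(16), pair[1].toString(16)); // d80f dffa
console.log(/\uD80F\uDFFA/.test(text));                  // true
```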

More details can be found here.

Unfortunately, this doesn't work well within character classes in a regex. With BMP characters, you can do things like /[a-z]/. You can't do that with higher-range characters because JavaScript doesn't understand that surrogate pairs should be treated as a unit. You may be able to hunt around for third-party libraries that deal with this. Sadly, I don't know of any to recommend. This one might be worth a look. I've never used it, so I cannot attest to its quality.
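For the specific range in the question, one possible workaround is to spell the range out as two character classes, one per surrogate. This is only a sketch: it works cleanly here because U+10000 maps to \uD800\uDC00 and U+EFFFF maps to \uDB7F\uDFFF, so the low surrogate spans its full range. A range that starts or ends mid-block would need extra alternatives. The last lines assume an engine that supports the ES2015 u flag, which lets a character class treat a code point as a unit.

```javascript
// Assumption: the target range is exactly U+10000..U+EFFFF.
// High surrogates \uD800-\uDB7F cover those planes; any low surrogate may follow.
var astral = /[\uD800-\uDB7F][\uDC00-\uDFFF]/;

var inRange = "\uD80F\uDFFA";    // U+13FFA
var outOfRange = "\uDB80\uDC00"; // U+F0000, just past the range

console.log(astral.test(inRange));    // true
console.log(astral.test(outOfRange)); // false
console.log(astral.test("a"));        // false

// Alternative for engines with the ES2015 "u" flag (an assumption about
// the runtime): code point escapes work directly inside a character class.
var astralU = /[\u{10000}-\u{EFFFF}]/u;
console.log(astralU.test(inRange));    // true
console.log(astralU.test(outOfRange)); // false
```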

P.S. You may find this shim useful for dealing with higher-order characters generally.

Maybe something like this?

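// Matches the literal text "#x" followed by five hex digits, the first in 1-E.
// No g flag: test() on a global regex carries lastIndex over between calls.
var regex = /#x[1-9a-eA-E][0-9a-fA-F]{4}/;

console.log(regex.test("#x03FFA")); // false
console.log(regex.test("#x13FFA")); // true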

http://jsbin.com/awidew/1
