How to detect and remove invalid unicode sequences in JavaScript

Question

How to detect and remove invalid unicode sequences in JavaScript

Given an input string, generate an output string where all invalid sequences are either removed or replaced with U+FFFD.

Is there a better method than implementing a state-machine char-by-char, or a non-native node.JS module available?

Invalid sequences are, for example, orphaned surrogates "\uD800", or other invalid multi-char sequences.

javascript
node.js
unicode

Answer 1

The regex needed to match invalid sequences depends on what you want to include. To replace orphaned surrogates with U+FFFD, you can use something like this:

var surrogates = /[\ud800-\udbff][\udc00-\udfff]|[\ud800-\udfff]/g;
str = str.replace(surrogates , function ($0) {
    return $0.length > 1 ? $0 : '\ufffd';
});

If you use the XRegExp library with its Unicode addons, you can use the \p{Cs} or \p{Surrogate} Unicode category instead of [\ud800-\udfff]. Using XRegExp will also give you easy access to other potentially relevant Unicode properties such as \p{Noncharacter_Code_Point}, \p{Co} or \p{Private_Use}, and \p{Cn} or \p{Unassigned}.

Since you're using Node.js, you can install XRegExp via npm using npm install xregexp. XRegExp's npm module automatically includes the Unicode addons.