Given an input string, generate an output string where all invalid sequences are either removed or replaced with U+FFFD.
Is there a better method than implementing a state-machine char-by-char, or a non-native node.JS module available?
Invalid sequences are, for example, orphaned surrogates "\uD800"
, or other invalid multi-char sequences.
The regex needed to match invalid sequences depends on what you want to include. To replace orphaned surrogates with U+FFFD, you can use something like this:
var surrogates = /[\ud800-\udbff][\udc00-\udfff]|[\ud800-\udfff]/g;
str = str.replace(surrogates , function ($0) {
return $0.length > 1 ? $0 : '\ufffd';
});
If you use the XRegExp library with its Unicode addons, you can use the \p{Cs}
or \p{Surrogate}
Unicode category instead of [\ud800-\udfff]
. Using XRegExp will also give you easy access to other potentially relevant Unicode properties such as \p{Noncharacter_Code_Point}
, \p{Co}
or \p{Private_Use}
, and \p{Cn}
or \p{Unassigned}
.
Since you're using Node.js, you can install XRegExp via npm using npm install xregexp
. XRegExp's npm module automatically includes the Unicode addons.