How to detect and remove invalid unicode sequences in JavaScript

Given an input string, generate an output string where all invalid sequences are either removed or replaced with U+FFFD.

Is there a better method than implementing a state-machine char-by-char, or a non-native node.JS module available?

Invalid sequences are, for example, orphaned surrogates "\uD800", or other invalid multi-char sequences.

The regex needed to match invalid sequences depends on what you want to include. To replace orphaned surrogates with U+FFFD, you can use something like this:

var surrogates = /[\ud800-\udbff][\udc00-\udfff]|[\ud800-\udfff]/g;
str = str.replace(surrogates , function ($0) {
    return $0.length > 1 ? $0 : '\ufffd';
});

If you use the XRegExp library with its Unicode addons, you can use the \p{Cs} or \p{Surrogate} Unicode category instead of [\ud800-\udfff]. Using XRegExp will also give you easy access to other potentially relevant Unicode properties such as \p{Noncharacter_Code_Point}, \p{Co} or \p{Private_Use}, and \p{Cn} or \p{Unassigned}.

Since you're using Node.js, you can install XRegExp via npm using npm install xregexp. XRegExp's npm module automatically includes the Unicode addons.