Thursday 19 May 2022

Regex Match All Except a Specific Word, Character, or Pattern

https://regexland.com/regex-match-all-except/

A regular expression that matches everything except a specific pattern or word makes use of a negative lookahead. Inside the negative lookahead, one or more unwanted words, characters, or regex patterns can be listed, separated by the OR symbol |.

For example, here’s an expression that will match any input that does not contain the text “ignoreThis”.

/^(?!.*ignoreThis).*/
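As a quick illustration, here is a minimal Python sketch of the expression in use. The helper name is_allowed is hypothetical, and the sketch assumes single-line input, since the dot does not match a line break by default. The surrounding slashes are delimiters and are not part of the pattern in Python.

```python
import re

# The slashes in /^(?!.*ignoreThis).*/ are delimiters, so they are dropped here.
PATTERN = re.compile(r"^(?!.*ignoreThis).*")

def is_allowed(text: str) -> bool:
    """Return True when the (single-line) text does not contain 'ignoreThis'."""
    return PATTERN.match(text) is not None

print(is_allowed("keep this line"))         # True
print(is_allowed("please ignoreThis one"))  # False
```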

Note that you can replace the text ignoreThis above with just about any regular expression, including:

  • A word (e.g. apple or password)
  • A set of unwanted characters in square brackets (e.g. [aeiou])
  • A regex pattern (e.g. mis{2}is{2}ip{2}i)
  • A list of regex patterns separated by the OR symbol | (e.g. (cats?|dogs?))

Before we dive into each of these, let’s first discuss how the whole thing works:

How The Main Expression Works

We start by allowing everything to be matched. This is done with the dot symbol ., which matches any character (except a line break, by default), followed by the zero-or-more quantifier *. Together they match zero or more of any character:

/.*/

Next, we add a negative lookahead, written in the form (?!abc). A negative lookahead checks whether the specified expression (abc in this case) appears immediately after the current position, and the match fails at that position if it does. It only checks: it does not consume any characters or add anything to the match.

/(?!abc).*/

Note that we place the negative lookahead at the start of the expression to ensure that it is validated before anything else is checked.

The expression above will now start from the first character in the string and won’t match there if it finds abc. Without an anchor, however, the regex engine is free to retry at the next position. Take the string abcdef: the attempt at the first character fails, so the engine moves on to the second character. There the remaining text, bcdef, does not start with abc, so the lookahead passes and the remainder of the string is matched. To prevent this from happening, we need to provide a start-of-string anchor ^:

/^(?!abc).*/

This anchor forces the matched expression to start at the beginning of the string and ensures that no subsequent sub-strings can be matched.
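The difference the anchor makes can be seen in a short Python sketch. The sample string abcdef is chosen here only for illustration.

```python
import re

unanchored = re.compile(r"(?!abc).*")
anchored = re.compile(r"^(?!abc).*")

# Without ^ the engine retries at the second character and matches from there.
m = unanchored.search("abcdef")
print(m.start(), m.group())             # 1 bcdef

# With ^ the only allowed starting point is position 0, where the lookahead fails.
print(anchored.search("abcdef"))        # None
print(anchored.search("xyz").group())   # xyz
```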

The anchored expression will reject any string starting with abc, but it will still accept a string in which abc is preceded by other characters. In other words, it will accept aabc or xabc.

To prevent this from happening, the lookahead must also account for any characters that come before the unwanted expression. To do this, we add another dot character . and zero-or-more quantifier * in front of the unwanted expression, so that the lookahead skips over zero or more characters before looking for it:

/^(?!.*abc).*/

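Notice that we place the .* inside the negative lookahead. If we placed it in front of the negative lookahead instead, it would consume the entire string before the lookahead is checked. The lookahead would then find nothing after the end of the string and always succeed, so every string would be matched.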
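To compare the variants side by side, the following Python sketch runs three versions of the expression against a few sample strings. The variable names and sample strings are illustrative only.

```python
import re

starts_only = re.compile(r"^(?!abc).*")    # rejects only a leading abc
in_front = re.compile(r"^.*(?!abc).*")     # .* placed before the lookahead
inside = re.compile(r"^(?!.*abc).*")       # .* placed inside the lookahead

results = {}
for text in ["abc", "xabc", "aabc", "xyz"]:
    results[text] = (
        bool(starts_only.match(text)),
        bool(in_front.match(text)),
        bool(inside.match(text)),
    )
    print(text, results[text])

# Only the last variant rejects every string that contains abc:
# "xabc" gives (True, True, False), while "xyz" gives (True, True, True).
```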

And this completes the general expression required. We can now tweak it to suit specific use-cases.

Let’s look at some examples.

Match All Except a Specific Word

To match everything except a specific word, we simply enter the unwanted word inside the negative lookahead. The following expression will not match any string containing the word foo:

/^(?!.*foo).*/

We can list multiple unwanted words by separating them with the OR symbol |. The following expression will ignore strings that contain any of the words dollar, euro, or pound:

/^(?!.*(dollar|euro|pound)).*/

Notice that we need to enclose the list of unwanted words in round brackets () for this to work correctly. If the round brackets are omitted, the .* at the front of the negative lookahead applies to dollar only. The words euro and pound are then checked only at the very start of the string, so sentences that contain other characters before these unwanted words will still be matched.
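The effect of the round brackets can be checked with a small Python sketch. The sample sentences are made up for illustration.

```python
import re

grouped = re.compile(r"^(?!.*(dollar|euro|pound)).*")
ungrouped = re.compile(r"^(?!.*dollar|euro|pound).*")  # brackets omitted

samples = ["ten dollar", "ten euro", "euro ten", "ten yen"]

grouped_ok = [s for s in samples if grouped.match(s)]
ungrouped_ok = [s for s in samples if ungrouped.match(s)]

print(grouped_ok)    # ['ten yen']
print(ungrouped_ok)  # ['ten euro', 'ten yen']  ("ten euro" slips through)
```

If the captured group is not needed, a non-capturing group (?:dollar|euro|pound) behaves the same way for this purpose.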

Match All Except a Specific Character

To match everything except a specific character, simply insert the character inside the negative lookahead. This expression will ignore any string containing an a:

/^(?!.*a).*/

If the character you want to exclude is a reserved character in regex (such as ? or *) you need to include a backslash \ in front of the character to escape it, as shown:

/^(?!.*\?).*/

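For a set of characters, one can include them in square brackets. Note that most special characters don’t need to be escaped inside square brackets; the exceptions are the backslash \, the closing bracket ], the caret ^ when it comes first, and the hyphen - when it sits between two characters. The following expression will not match any string that contains a vowel: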

/^(?!.*[aeiou]).*/
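Here is a brief Python sketch of the three character-based expressions. The helper exclude_char is hypothetical; it relies on the standard re.escape function to add the backslash where one is needed. Note that the match is case-sensitive unless a flag such as re.IGNORECASE is passed.

```python
import re

no_a = re.compile(r"^(?!.*a).*")
no_question = re.compile(r"^(?!.*\?).*")
no_vowel = re.compile(r"^(?!.*[aeiou]).*")

def exclude_char(ch: str) -> re.Pattern:
    """Hypothetical helper: build the expression for any single character."""
    return re.compile(r"^(?!.*" + re.escape(ch) + r").*")

print(bool(no_a.match("hello")))                  # True
print(bool(no_a.match("banana")))                 # False
print(bool(no_question.match("really?")))         # False
print(bool(no_vowel.match("rhythm")))             # True
print(bool(exclude_char("*").match("2 * 3")))     # False
```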

Match All Except a Specific Pattern

In addition to unwanted words or characters, one can specify a pattern that must be avoided in all matches. The pattern must be placed inside the negative lookahead.

For example, this expression will not match any string that contains three consecutive digits \d:

/^(?!.*\d{3}).*/

The following expression will not match any string that contains either spelling of the word, grey or gray:

/^(?!.*gr(e|a)y).*/
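Both pattern examples can be tried out with a few lines of Python. The sample strings are illustrative only.

```python
import re

no_three_digits = re.compile(r"^(?!.*\d{3}).*")
no_grey = re.compile(r"^(?!.*gr(e|a)y).*")

print(bool(no_three_digits.match("room 12")))    # True
print(bool(no_three_digits.match("room 123")))   # False
print(bool(no_three_digits.match("a1b2c3")))     # True (digits are not consecutive)

print(bool(no_grey.match("a grey cat")))         # False
print(bool(no_grey.match("a gray cat")))         # False
print(bool(no_grey.match("a green cat")))        # True
```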

Match All Except a List of Patterns

Finally, patterns can be combined by enclosing them in parentheses () and separating them using the OR symbol |.

The following expression will not match any string containing three consecutive digits \d, nor a string containing a vowel:

/^(?!.*(\d{3}|[aeiou])).*/
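When the list of patterns is long or built at run time, the expression can be assembled programmatically. The sketch below is one possible approach, not a definitive implementation: build_exclusion is a hypothetical helper, and it assumes each entry is already a valid regex fragment.

```python
import re
from typing import Iterable

def build_exclusion(patterns: Iterable[str]) -> re.Pattern:
    """Hypothetical helper: join regex fragments with | inside the lookahead."""
    alternatives = "|".join(patterns)
    # A non-capturing group (?:...) plays the role of the round brackets.
    return re.compile(r"^(?!.*(?:" + alternatives + r")).*")

combined = build_exclusion([r"\d{3}", r"[aeiou]"])

print(bool(combined.match("rhythm 12")))    # True
print(bool(combined.match("rhythm 123")))   # False
print(bool(combined.match("hello")))        # False
```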

Lookahead Support

It should be noted that some programming languages do not support lookaheads in their regex implementations and will therefore not be able to run the expressions above. Examples include Go’s standard regexp package and Rust’s regex crate.
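Where lookaheads are unavailable, one common workaround is to search for the unwanted pattern directly and negate the result in code. The sketch below shows the idea in Python purely for comparison with the lookahead version; the helper name excludes is hypothetical, and the equivalence is only assumed for single-line input.

```python
import re

def excludes(text: str, unwanted: str) -> bool:
    """True when the unwanted pattern does not occur anywhere in text."""
    return re.search(unwanted, text) is None

lookahead = re.compile(r"^(?!.*abc).*")

# Both approaches agree on these single-line samples.
for text in ["xabc", "xyz"]:
    print(text, excludes(text, r"abc"), bool(lookahead.match(text)))
```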
