Jay Taylor's notes

Regular expression to match line that doesn't contain a word?

Original source (stackoverflow.com)
Tags: regular-expressions stackoverflow.com
Clipped on: 2016-08-13

I know it's possible to match a word and then reverse the matches using other tools (e.g. grep -v). However, I'd like to know if it's possible to match lines that don't contain a specific word (e.g. hede) using a regular expression?

Input:

hoho
hihi
haha
hede

Code:

# grep "Regex for doesn't contain hede" Input

Desired output:

hoho
hihi
haha

54 upvote
Probably a couple years late, but what's wrong with: ([^h]*(h([^e]|$)|he([^d]|$)|hed([^e]|$)))*? The idea is simple. Keep matching until you see the start of the unwanted string, then only match in the N-1 cases where the string is unfinished (where N is the length of the string). These N-1 cases are "h followed by non-e", "he followed by non-d", and "hed followed by non-e". If you managed to pass these N-1 cases, you successfully didn't match the unwanted string so you can start looking for [^h]* again – stevendesu Sep 29 '11 at 3:44
175 upvote
@stevendesu: try this for 'a-very-very-long-word' or even better half a sentence. Have fun typing. BTW, it is nearly unreadable. Don't know about the performance impact. – Peter Schuetze Jan 30 '12 at 18:45
9 upvote
@PeterSchuetze: Sure it's not pretty for very very long words, but it is a viable and correct solution. Although I haven't run tests on the performance, I wouldn't imagine it being too slow since most of the latter rules are ignored until you see an h (or the first letter of the word, sentence, etc.). And you could easily generate the regex string for long strings using iterative concatenation. If it works and can be generated quickly, is legibility important? That's what comments are for. – stevendesu Feb 2 '12 at 3:14
32 upvote
@stevendesu: i'm even later, but that answer is almost completely wrong. for one thing, it requires the subject to contain "h" which it shouldn't have to, given the task is "match lines which [do] not contain a specific word". let us assume you meant to make the inner group optional, and that the pattern is anchored: ^([^h]*(h([^e]|$)|he([^d]|$)|hed([^e]|$))?)*$ this fails when instances of "hede" are preceded by partial instances of "hede" such as in "hhede". – jaytea Sep 10 '12 at 10:41
3 upvote
This question has been added to the Stack Overflow Regular Expression FAQ, under "Advanced Regex-Fu". – aliteralmind Apr 10 '14 at 1:30
3489 upvotes, accepted answer

The notion that regex doesn't support inverse matching is not entirely true. You can mimic this behavior by using negative look-arounds:

^((?!hede).)*$

The regex above will match any string, or line without a line break, not containing the (sub) string 'hede'. As mentioned, this is not something regex is "good" at (or should do), but still, it is possible.
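As a quick check, here is a minimal Python sketch that applies the pattern to the four sample lines from the question. It assumes Python's built-in re module; the variable names are only illustrative.

```python
import re

# Pattern from this answer: no position in the line may be the start of "hede".
NOT_HEDE = re.compile(r'^((?!hede).)*$')

lines = ["hoho", "hihi", "haha", "hede"]

# Keep only the lines the pattern matches, i.e. those without "hede".
kept = [line for line in lines if NOT_HEDE.match(line)]
print(kept)
```

This plays the role of the grep call in the question, with the negation expressed inside the regex rather than with -v.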

And if you need to match line break chars as well, use the DOT-ALL modifier (the trailing s in the following pattern):

/^((?!hede).)*$/s

or use it inline:

/(?s)^((?!hede).)*$/

(where the /.../ are the regex delimiters, i.e., not part of the pattern)

If the DOT-ALL modifier is not available, you can mimic the same behavior with the character class [\s\S]:

/^((?!hede)[\s\S])*$/
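The three variants can be compared in a short Python sketch. This assumes Python's re module, where the DOT-ALL modifier is spelled re.DOTALL (or (?s) inline); the two test strings are made up for illustration.

```python
import re

text_ok = "hoho\nhihi"     # two lines, no "hede"
text_bad = "hoho\nhede"    # second line contains "hede"

plain = re.compile(r'^((?!hede).)*$')                # dot stops at line breaks
dotall = re.compile(r'^((?!hede).)*$', re.DOTALL)    # like the trailing /s
inline = re.compile(r'(?s)^((?!hede).)*$')           # inline modifier
charclass = re.compile(r'^((?!hede)[\s\S])*$')       # no modifier needed

# Without DOT-ALL the dot cannot cross the line break, so the
# whole two-line string is never consumed.
print(plain.fullmatch(text_ok))

for name, pattern in [("dotall", dotall), ("inline", inline), ("charclass", charclass)]:
    print(name, bool(pattern.fullmatch(text_ok)), bool(pattern.fullmatch(text_bad)))
```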

Explanation

A string is just a list of n characters. Before, and after each character, there's an empty string. So a list of n characters will have n+1 empty strings. Consider the string "ABhedeCD":

    +--+---+--+---+--+---+--+---+--+---+--+---+--+---+--+---+--+
S = |e1| A |e2| B |e3| h |e4| e |e5| d |e6| e |e7| C |e8| D |e9|
    +--+---+--+---+--+---+--+---+--+---+--+---+--+---+--+---+--+

index    0      1      2      3      4      5      6      7

where the e's are the empty strings. The regex (?!hede). looks ahead to see if there's no substring "hede" to be seen, and if that is the case (so something else is seen), then the . (dot) will match any character except a line break. Look-arounds are also called zero-width-assertions because they don't consume any characters. They only assert/validate something.

So, in my example, every empty string is first validated to see if there's no "hede" up ahead, before a character is consumed by the . (dot). The regex (?!hede). will do that only once, so it is wrapped in a group, and repeated zero or more times: ((?!hede).)*. Finally, the start- and end-of-input are anchored to make sure the entire input is consumed: ^((?!hede).)*$

As you can see, the input "ABhedeCD" will fail because on e3, the regex (?!hede) fails (there is "hede" up ahead!).
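The walk over the empty strings can be mimicked in a few lines of Python. This is only a sketch of the idea, not how a regex engine is implemented; the helper name first_blocked_index is hypothetical.

```python
import re

def first_blocked_index(s, word="hede"):
    """Return the first index where the lookahead (?!word) would fail, or None."""
    for i in range(len(s)):
        # Same question the lookahead asks at each position: does `word` start here?
        if s.startswith(word, i):
            return i
    return None

s = "ABhedeCD"
blocked = first_blocked_index(s)

# Without the anchors, the repeated group consumes characters up to that position.
consumed = re.match(r'((?!hede).)*', s).end()
print(blocked, consumed)
```

The blocked index is the position of the "h" in the diagram, which is the character right after the empty string e3.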

4 upvote
I would not go so far as to say that this is something regex is bad at. The convenience of this solution is pretty obvious and the performance hit compared to a programmatic search is often going to be unimportant. – Archimaredes Mar 3 at 16:09

Note that the solution to does not start with “hede”:

^(?!hede).*$

is generally much more efficient than the solution to does not contain “hede”:

^((?!hede).)*$

The former checks for “hede” only at the input string’s first position, rather than at every position.
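The difference between the two patterns shows up on strings where "hede" appears somewhere other than the start. A minimal Python sketch, assuming the built-in re module and a few made-up inputs:

```python
import re

does_not_start_with = re.compile(r'^(?!hede).*$')   # one lookahead, at position 0
does_not_contain = re.compile(r'^((?!hede).)*$')    # one lookahead per character

results = {}
for s in ["hede", "hedex", "xhede", "hoho"]:
    results[s] = (bool(does_not_start_with.match(s)), bool(does_not_contain.match(s)))
    print(s, results[s])
```

Only "xhede" separates the two: it does not start with "hede" but does contain it.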

3 upvote
You're right. But I saw the anchors ^ and $ in another answer which threw me off (implying we are checking the entire string only). – FireCoding Mar 17 '11 at 19:52
14 upvote
Maybe you could add some text to your answer to make it more obvious that it's only a solution to 'does not start with'. – Rory Jun 19 '12 at 11:40
2 upvote
Thanks, I used it to validate that the string doesn't contain a sequence of digits: ^((?!\d{5,}).)* – Samih A May 10 '15 at 10:42
   upvote
^((?!hede).)*$ worked for me using the jQuery DataTable plugin to exclude a string from the dataset – Alex Jun 26 '15 at 10:34
1 upvote
Hello! I can't compose does not end with "hede" regex. Can you help with it? – Aleks Ya Oct 18 '15 at 21:33

If you're just using it for grep, you can use grep -v hede to get all lines which do not contain hede.

ETA Oh, rereading the question, grep -v is probably what you meant by "tools options".

community wiki

12 upvote
Tip: for progressively filtering out what you don't want: grep -v "hede" | grep -v "hihi" | ...etc. – Olivier Lalonde May 5 '14 at 22:08
16 upvote
Or using only one process grep -v -e hede -e hihi -e ... – Olaf Dietsche Apr 26 '15 at 5:42

The given answers are perfectly fine, just an academic point:

Regular expressions in the sense of theoretical computer science ARE NOT ABLE to do it like this. For them, it would have to look something like this:

^([^h].*$)|(h([^e].*$|$))|(he([^h].*$|$))|(heh([^e].*$|$))|(hehe.+$) 

This only does a FULL match. Doing it for sub-matches would even be more awkward.

community wiki

   upvote
Important to note this only uses basic POSIX.2 regular expressions and thus whilst terse is more portable for when PCRE is not available. – Steve-o Feb 19 '14 at 17:25
3 upvote
I agree. Many if not most regular expressions are not regular languages and could not be recognized by a finite automaton. – ThomasMcLeod Mar 22 '14 at 21:36
   upvote
@ThomasMcLeod, Hades32: Is it within the realms of any possible regular language to be able to say ‘not’ and ‘and’ as well as the ‘or’ of an expression such as ‘(hede|Hihi)’? (This maybe a question for CS.) – James Haigh Jun 13 '14 at 16:54
7 upvote
@JohnAllen: ME!!! …Well, not the actual regex but the academic reference, which also relates closely to computational complexity; PCREs fundamentally can not guarantee the same efficiency as POSIX regular expressions. – James Haigh Jun 13 '14 at 17:04
1 upvote
Sorry, this answer just doesn't work: it will match hhehe, and will even match hehe partially (the second half). – Falco Aug 13 '14 at 12:57

^((?!hede).)*$

Explanation:

^ the beginning of the string

( group and capture to \1 (0 or more times, matching the most amount possible)

(?! look ahead to see if there is not:

hede your string

) end of look-ahead

. any character except \n

)* end of \1 (NOTE: because you are using a quantifier on this capture, only the LAST repetition of the captured pattern will be stored in \1)

$ before an optional \n, and the end of the string
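The note about \1 can be observed directly. A minimal sketch, assuming Python's re module, where group(1) corresponds to \1:

```python
import re

m = re.match(r'^((?!hede).)*$', "hoho")
whole = m.group(0)        # the entire matched line
last_char = m.group(1)    # only the LAST repetition of the group is kept

# With zero repetitions the group never participates, so it is None.
empty = re.match(r'^((?!hede).)*$', "")

# A non-capturing group (?:...) avoids storing anything at all.
m2 = re.match(r'^(?:(?!hede).)*$', "hoho")

# $ also matches just before a final line break.
trailing = re.match(r'^((?!hede).)*$', "hoho\n")

print(whole, last_char, empty.group(1), m2.groups(), repr(trailing.group(0)))
```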

community wiki

2 upvote
awesome that worked for me in sublime text 2 using multiple words '^((?!DSAU_PW8882WEB2|DSAU_PW8884WEB2|DSAU_PW8884WEB).)*$' – Damodar Bashyal Aug 11 '15 at 2:07

Here's a good explanation of why it's not easy to negate an arbitrary regex. I have to agree with the other answers, though: if this is anything other than a hypothetical question, then a regex is not the right choice here.

community wiki

6 upvote
Some tools, and specifically mysqldumpslow, only offer this way to filter data, so in such a case, finding a regex to do this is the best solution apart from rewriting the tool (various patches for this have not been included by MySQL AB / Sun / Oracle). – FGM Aug 7 '12 at 12:21
1 upvote
Exactly analogous to my situation. Velocity template engine uses regular expressions to decide when to apply a transformation (escape html) and I want it to always work EXCEPT in one situation. – Henno Vermeulen Oct 18 '13 at 14:43
1 upvote
This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post. – Strawberry Dec 24 '14 at 1:04

If you want the regex test to only fail if the entire string matches, the following will work:

^(?!hede$).*

e.g. -- If you want to allow all values except "foo" (i.e. "foofoo", "barfoo", and "foobar" will pass, but "foo" will fail), use: ^(?!foo$).*

Of course, if you're checking for exact equality, a better general solution in this case is to check for string equality, i.e.

myStr !== 'foo'

You could even put the negation outside the test if you need any regex features (here, case insensitivity and range matching):

!/^[a-f]oo$/i.test(myStr)

The regex solution at the top may be helpful, however, in situations where a positive regex test is required (perhaps by an API).
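For comparison, a small Python sketch of the same idea, using the "foo" example from above. It assumes the built-in re module; the JavaScript snippets above are the answer's own, this is only a translation for illustration.

```python
import re

# Fails only when the ENTIRE string is "foo".
not_exactly_foo = re.compile(r'^(?!foo$).*')

values = ["foo", "foofoo", "barfoo", "foobar"]
results = {s: bool(not_exactly_foo.match(s)) for s in values}
print(results)

# The plain string comparison gives the same verdicts for these inputs.
same_as_equality = all(results[s] == (s != "foo") for s in values)
print(same_as_equality)
```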

community wiki

Not regex, but I've found it logical and useful to use serial greps with pipe to eliminate noise.

e.g. search an Apache config file without all the comments:

grep -v '\#' /opt/lampp/etc/httpd.conf      # this gives all the non-comment lines

and

grep -v '\#' /opt/lampp/etc/httpd.conf |  grep -i dir

The logic of the serial greps is: (not a comment) and (matches dir)

2 upvote
I think he is asking for the regex version of the grep -v – Angel.King.47 Jul 12 '11 at 15:27
6 upvote
This is dangerous. Also misses lines like good_stuff #comment_stuff – Xavi Montero Mar 1 '13 at 19:54

Benchmarks

I decided to evaluate some of the presented options and compare their performance, as well as try some newer features. Benchmarked on the .NET regex engine: http://regexhero.net/tester/

Benchmark Text:

The first 7 lines should not match, since they contain the searched Expression, while the lower 7 lines should match!

Regex Hero is a real-time online Silverlight Regular Expression Tester.
XRegex Hero is a real-time online Silverlight Regular Expression Tester.
Regex HeroRegex HeroRegex HeroRegex HeroRegex Hero is a real-time online Silverlight Regular Expression Tester.
Regex Her Regex Her Regex Her Regex Her Regex Her Regex Her Regex Hero is a real-time online Silverlight Regular Expression Tester.
Regex Her is a real-time online Silverlight Regular Expression Tester.Regex Hero
egex Hero egex Hero egex Hero egex Hero egex Hero egex Hero Regex Hero is a real-time online Silverlight Regular Expression Tester.
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRegex Hero is a real-time online Silverlight Regular Expression Tester.

Regex Her
egex Hero
egex Hero is a real-time online Silverlight Regular Expression Tester.
Regex Her is a real-time online Silverlight Regular Expression Tester.
Regex Her Regex Her Regex Her Regex Her Regex Her Regex Her is a real-time online Silverlight Regular Expression Tester.
Nobody is a real-time online Silverlight Regular Expression Tester.
Regex Her o egex Hero Regex  Hero Reg ex Hero is a real-time online Silverlight Regular Expression Tester.

Results:

Results are Iterations per second as the median of 3 runs - Bigger Number = Better

01: ^((?!Regex Hero).)*$                    3.914   // Accepted Answer
02: ^(?:(?!Regex Hero).)*$                  5.034   // With Non-Capturing group
03: ^(?>[^R]+|R(?!egex Hero))*$             6.137   // Lookahead only on the right first letter
04: ^(?>(?:.*?Regex Hero)?)^.*$             7.426   // Match the word and check if you're still at linestart
05: ^(?(?=.*?Regex Hero)(?#fail)|.*)$       7.371   // Logic Branch: Find Regex Hero? match nothing, else anything

P1: ^(?(?=.*?Regex Hero)(*FAIL)|(*ACCEPT))  ?????   // Logic Branch in Perl - Quick FAIL
P2: .*?Regex Hero(*COMMIT)(*FAIL)|(*ACCEPT) ?????   // Direct COMMIT & FAIL in Perl

Since .NET doesn't support action Verbs (*FAIL, etc.) I couldn't test the solutions P1 and P2.

Summary:

I tried to test most of the proposed solutions. Some optimizations are possible for certain words. For example, if the first two letters of the search string are not the same, answer 03 can be expanded to ^(?>[^R]+|R+(?!egex Hero))*$, resulting in a small performance gain.

But the most readable and fastest solution overall seems to be 05, using a conditional statement, or 04, with the atomic group. I think the Perl solutions should be even faster and more easily readable.
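A comparable measurement can be sketched for other engines. The following is a rough Python harness, not a reproduction of the .NET numbers above: it uses the standard timeit module, a handful of lines from the benchmark text, and only patterns 01 and 02 plus a whole-line lookahead variant. Patterns 03 to 05 use atomic groups or lookahead conditionals, which are not portable to Python's re module, so they are left out. Timings will vary by machine and are not shown.

```python
import re
import timeit

PATTERNS = {
    "01 capturing": r'^((?!Regex Hero).)*$',
    "02 non-capturing": r'^(?:(?!Regex Hero).)*$',
    "lookahead once": r'^(?!.*Regex Hero).*$',
}

# A few lines taken from the benchmark text above.
SAMPLE = [
    "Regex Hero is a real-time online Silverlight Regular Expression Tester.",
    "Regex Her is a real-time online Silverlight Regular Expression Tester.Regex Hero",
    "Regex Her",
    "egex Hero",
    "Nobody is a real-time online Silverlight Regular Expression Tester.",
]

def matching_lines(pattern):
    """Lines of SAMPLE accepted by the pattern (i.e. without 'Regex Hero')."""
    compiled = re.compile(pattern)
    return [line for line in SAMPLE if compiled.match(line)]

def time_pattern(pattern, number=1000):
    """Seconds needed to run the pattern over SAMPLE `number` times."""
    compiled = re.compile(pattern)
    return timeit.timeit(lambda: [compiled.match(line) for line in SAMPLE], number=number)

for name, pattern in PATTERNS.items():
    print(f"{name:18s} {time_pattern(pattern):.4f}s  {len(matching_lines(pattern))} lines match")
```

Checking that all patterns accept the same lines before comparing timings guards against benchmarking a pattern that is fast only because it is wrong.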

community wiki

With a negative lookahead, a regular expression can match text that does not contain a specific pattern. This is answered and explained by Bart Kiers. Great explanation!

However, with Bart Kiers' answer, the lookahead part will test 1 to 4 characters ahead while matching any single character. We can avoid this and let the lookahead part check out the whole text, ensure there is no 'hede', and then the normal part (.*) can eat the whole text all at one time.

Here is the improved regex:

/^(?!.*?hede).*$/

Note that the lazy quantifier (*?) in the negative lookahead part is optional; you can use the greedy quantifier (*) instead, depending on your data: if 'hede' is present and lies in the first half of the text, the lazy quantifier can be faster; otherwise, the greedy quantifier can be faster. However, if 'hede' is not present, both are equally slow.
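A short Python sketch, assuming the built-in re module, confirms that the lazy and greedy forms accept the same lines as the per-character lookahead pattern; they differ only in how much work the engine does. The sample strings are made up.

```python
import re

lazy = re.compile(r'^(?!.*?hede).*$')      # improved regex, lazy lookahead
greedy = re.compile(r'^(?!.*hede).*$')     # same, greedy lookahead
per_char = re.compile(r'^((?!hede).)*$')   # lookahead at every character

samples = ["hoho", "hede", "xxhedexx", "hed", "", "hehede"]
verdicts = {
    s: (bool(lazy.match(s)), bool(greedy.match(s)), bool(per_char.match(s)))
    for s in samples
}
for s, v in verdicts.items():
    print(repr(s), v)
```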

Here is the demo code.

For more information about lookahead, please check out the great article: Mastering Lookahead and Lookbehind.

Also, please check out RegexGen.js, a JavaScript Regular Expression Generator that helps to construct complex regular expressions. With RegexGen.js, you can construct the regex in a more readable way:

var _ = regexGen;

var regex = _(
    _.startOfLine(),             
    _.anything().notContains(       // match anything that not contains:
        _.anything().lazy(), 'hede' //   zero or more chars that followed by 'hede',
                                    //   i.e., anything contains 'hede'
    ), 
    _.endOfLine()
);

community wiki

FWIW, since regular languages (aka rational languages) are closed under complementation, it's always possible to find a regular expression (aka rational expression) that negates another expression. But not many tools implement this.

Vcsn supports this operator (which it denotes {c}, postfix).

You first define the type of your expressions: labels are letters (lal_char) to pick from a to z, for instance (defining the alphabet when working with complementation is, of course, very important), and the "value" computed for each word is just a Boolean: true if the word is accepted, false if it is rejected.

In Python:

In [5]: import vcsn
        c = vcsn.context('lal_char(a-z), b')
        c
Out[5]: {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z}  𝔹

then you enter your expression:

In [6]: e = c.expression('(hede){c}'); e
Out[6]: (hede)^c

convert this expression to an automaton:

In [7]: a = e.automaton(); a

[Image: rendering of the automaton a]

finally, convert this automaton back to a simple expression.

In [8]: print(a.expression())
        \e+h(\e+e(\e+d))+([^h]+h([^e]+e([^d]+d([^e]+e[^]))))[^]*

where + is usually denoted |, \e denotes the empty word, and [^] is usually written . (any character). So, with a bit of rewriting ()|h(ed?)?|([^h]|h([^e]|e([^d]|d([^e]|e.)))).*.

You can see this example here, and try Vcsn online there.
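The rewritten expression can be checked by brute force with Python's re module, without Vcsn. As written, (hede){c} is the complement of the single word hede, so the expression should accept every word except exactly "hede". The sketch below tests this over a small alphabet chosen only for illustration ("hedx"); the helper name words is hypothetical.

```python
import re
from itertools import product

# The rewritten expression from above, used as a full-string match.
COMPLEMENT = re.compile(r'()|h(ed?)?|([^h]|h([^e]|e([^d]|d([^e]|e.)))).*')

ALPHABET = "hedx"   # assumption: a tiny alphabet keeps the search small

def words(max_len):
    """Yield every word over ALPHABET with length 0..max_len."""
    for n in range(max_len + 1):
        for letters in product(ALPHABET, repeat=n):
            yield "".join(letters)

# Collect the words the expression does NOT accept.
rejected = [w for w in words(5) if not COMPLEMENT.fullmatch(w)]
print(rejected)
```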

community wiki

1 upvote
True, but ugly, and only doable for small character sets. You don't want to do this with Unicode strings :-) – reinierpost Nov 8 '15 at 23:43

With this, you avoid testing a lookahead at every position:

/^(?:[^h]+|h++(?!ede))*+$/

equivalent to (for .net):

/^(?>(?:[^h]+|h+(?!ede))*)$/

Old answer:

/^(?>[^R]+|R+(?!egex Hero))*$/

6 upvote
Good point; I'm surprised nobody mentioned this approach before. However, that particular regex is prone to catastrophic backtracking when applied to text that doesn't match. Here's how I would do it: /^[^h]*(?:h+(?!ede)[^h]*)*$/ – Alan Moore Apr 14 '13 at 5:26
   upvote
...or you can just make all the quantifiers possessive. ;) – Alan Moore Apr 15 '13 at 15:17
   upvote
@Alan Moore - I'm surprised too. I saw your comment (and best regex in the pile) here only after posting this same pattern in an answer below. – ridgerunner Dec 20 '13 at 3:08
   upvote
@ridgerunner, doesn't have to be the best tho. I've seen benchmarks where the top answer performs better. (I was surprised about that tho.) – Qtax Feb 20 '14 at 13:10

Here's how I'd do it:

^[^h]*(h(?!ede)[^h]*)*$

Accurate and more efficient than the other answers. It implements Friedl's "unrolling-the-loop" efficiency technique and requires much less backtracking.
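The accuracy claim can be checked by brute force. The sketch below, assuming Python's re module, compares the unrolled pattern against a plain substring test for every word up to length 6 over a small illustrative alphabet; it says nothing about speed. The helper name all_words is hypothetical.

```python
import re
from itertools import product

UNROLLED = re.compile(r'^[^h]*(h(?!ede)[^h]*)*$')

def all_words(alphabet, max_len):
    """Yield every word over `alphabet` with length 0..max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)

# A mismatch is any word where the regex and the substring test disagree.
mismatches = [
    w for w in all_words("hedx", 6)
    if bool(UNROLLED.match(w)) != ("hede" not in w)
]
print(len(mismatches))
```

The alphabet includes h, e and d so that partial occurrences such as "hhede" and "hehede" are covered.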

community wiki

If you want to negate a word at each character position, much like a negated character class negates single characters:

For example, a string:

<?
$str="aaa        bbb4      aaa     bbb7";
?>

Do not use:

<?
preg_match('/aaa[^bbb]+?bbb7/s', $str, $matches);
?>

Use:

<?
preg_match('/aaa(?:(?!bbb).)+?bbb7/s', $str, $matches);
?>

Notice "(?!bbb)." is neither lookbehind nor lookahead, it's lookcurrent, for example:

"(?=abc)abcde", "(?!abc)abcde"
3 upvote
There is no "lookcurrent" in perl regexp's. This is truly a negative lookahead (prefix (?!). Positive lookahead's prefix would be (?= while the corresponding lookbehind prefixes would be (?<! and (?<= respectively. A lookahead means that you read the next characters (hence “ahead”) without consuming them. A lookbehind means that you check characters that have already been consumed. – Didier L May 21 '12 at 16:35
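To see why the answer above advises against [^bbb], here is a small Python sketch using the built-in re module. The character class [^bbb] is the same as [^b], so it rejects any single "b", while the lookahead form only stops at the full word "bbb". The second input string is hypothetical, chosen so the two patterns disagree; on the answer's own string both find the same substring.

```python
import re

CHAR_CLASS = re.compile(r'aaa[^bbb]+?bbb7', re.DOTALL)
LOOKAHEAD = re.compile(r'aaa(?:(?!bbb).)+?bbb7', re.DOTALL)

original = "aaa        bbb4      aaa     bbb7"   # string from the answer
lone_b = "aaa xbx bbb7"                          # hypothetical: a single "b" in between

orig_class = CHAR_CLASS.search(original)
orig_look = LOOKAHEAD.search(original)
lone_class = CHAR_CLASS.search(lone_b)   # blocked by the single "b"
lone_look = LOOKAHEAD.search(lone_b)     # passes, since "bx " is not "bbb"

print(orig_class.group(0) == orig_look.group(0), lone_class, lone_look.group(0))
```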

The OP did not specify or Tag the post to indicate the context (programming language, editor, tool) the Regex will be used within.

For me, I sometimes need to do this while editing a file using Textpad.

Textpad supports some Regex, but does not support lookahead or lookbehind, so it takes a few steps.

If I am looking to retain all lines that Do NOT contain the string hede, I would do it like this:

1. Search/replace the entire file to add a unique "Tag" to the beginning of each line containing any text.

    Search string:^(.)  
    Replace string:<@#-unique-#@>\1  
    Replace-all  

2. Delete all lines that contain the string hede (replacement string is empty):

    Search string:<@#-unique-#@>.*hede.*\n  
    Replace string:<nothing>  
    Replace-all  

3. At this point, all remaining lines Do NOT contain the string hede. Remove the unique "Tag" from all lines (replacement string is empty):

    Search string:<@#-unique-#@>
    Replace string:<nothing>  
    Replace-all  

Now you have the original text with all lines containing the string hede removed.
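The three search/replace steps can be sketched in Python to see what each one does. This only imitates the editor workflow with re.sub; it assumes every line ends with a line break (the step 2 pattern needs the trailing \n), and the sample text is taken from the question.

```python
import re

TAG = "<@#-unique-#@>"                 # marker text from the steps above
text = "hoho\nhihi\nhede\nhaha\n"

# Step 1: tag the beginning of every line that contains any text.
tagged = re.sub(r'^(.)', lambda m: TAG + m.group(1), text, flags=re.MULTILINE)

# Step 2: delete every tagged line that contains "hede", line break included.
pruned = re.sub(re.escape(TAG) + r'.*hede.*\n', '', tagged)

# Step 3: remove the tag from the lines that are left.
result = pruned.replace(TAG, "")
print(result)
```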


If I am looking to Do Something Else to only lines that Do NOT contain the string hede, I would do it like this:

1. Search/replace the entire file to add a unique "Tag" to the beginning of each line containing any text.

    Search string:^(.)  
    Replace string:<@#-unique-#@>\1  
    Replace-all  

2. For all lines that contain the string hede, remove the unique "Tag":

    Search string:<@#-unique-#@>(.*hede)
    Replace string:\1  
    Replace-all  

3. At this point, all lines that begin with the unique "Tag", Do NOT contain the string hede. I can now do my Something Else to only those lines.

4. When I am done, I remove the unique "Tag" from all lines (replacement string is empty):

    Search string:<@#-unique-#@>
    Replace string:<nothing>  
    Replace-all  

community wiki

Through PCRE verb (*SKIP)(*F)

^hede$(*SKIP)(*F)|^.*$

This completely skips any line that consists of the exact string hede and matches all the remaining lines.

DEMO

Execution of the parts:

Let us consider the above regex by splitting it into two parts.

  1. Part before the | symbol. Part shouldn't be matched.

    ^hede$(*SKIP)(*F)

  2. Part after the | symbol. Part should be matched.

    ^.*$

PART 1

Regex engine will start its execution from the first part.

^hede$(*SKIP)(*F)

Explanation:

  • ^ Asserts that we are at the start.
  • hede Matches the string hede
  • $ Asserts that we are at the line end.

So a line consisting of exactly the string hede is matched. When the regex engine then reaches the (*SKIP)(*F) verbs (note: (*F) can also be written (*FAIL)), it skips the text matched so far and makes the match fail. The | that follows the PCRE verbs is the alternation (logical OR) operator; on its own, the empty alternative after it would match at every boundary between characters on all lines except the line that is exactly hede. See the demo here. That is, the engine goes on to try to match the remaining text. Now the regex in the second part is executed.

PART 2

^.*$

Explanation:

  • ^ Asserts that we are at the start, i.e., it matches every line start except the one on the hede line. See the demo here.
  • .* In multiline mode, . matches any character except newline or carriage return characters, and * repeats the preceding token zero or more times. So .* matches the whole line. See the demo here.

    Why .* instead of .+?

    Because .* matches a blank line but .+ does not. We want to match every line except hede, and the input may contain blank lines, so use .* rather than .+ (.+ repeats the preceding token one or more times). See .* matching a blank line here.

  • $ End of the line anchor is not necessary here.
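Python's built-in re module has no (*SKIP)(*F) verbs, so the sketch below swaps in a different technique to get a similar effect: the unwanted alternative is matched without capturing, and the wanted lines are captured in group 1. This is an emulation for illustration, not the PCRE behavior itself; the sample text is made up and includes a blank line.

```python
import re

text = "hoho\nhede\nhedehede\n\nhaha"

# Alternative 1 consumes a line that is exactly "hede" and captures nothing.
# Alternative 2 captures any other line, including a blank one, in group 1.
pattern = re.compile(r'^hede$|(^.*$)', re.MULTILINE)

# Keep only the matches where group 1 took part.
kept = [m.group(1) for m in pattern.finditer(text) if m.group(1) is not None]
print(kept)
```

As with the PCRE version, a line that merely contains hede (here "hedehede") is kept; only the line that is exactly hede is dropped.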

community wiki

It may be more maintainable to use two regexes in your code: one to do the first match, and then, if it matches, a second regex to check for the outlier cases you wish to block (for example ^.(hede).), with the appropriate logic in your code.

OK, I admit this is not really an answer to the posted question, and it may also use slightly more processing than a single regex. But for developers who came here looking for a fast emergency fix for an outlier case, this solution should not be overlooked.

community wiki

The TXR Language supports regex negation.

$ txr -c '@(repeat)
@{nothede /~hede/}
@(do (put-line nothede))
@(end)'  Input

A more complicated example: match all lines that start with a and end with z, but do not contain the substring hede:

$ txr -c '@(repeat)
@{nothede /a.*z&~.*hede.*/}
@(do (put-line nothede))
@(end)' -
az         <- echoed
az
abcz       <- echoed
abcz
abhederz   <- not echoed; contains hede
ahedez     <- not echoed; contains hede
ace        <- not echoed; does not end in z
ahedz      <- echoed
ahedz

Regex negation is not particularly useful on its own but when you also have intersection, things get interesting, since you have a full set of boolean set operations: you can express "the set which matches this, except for things which match that".
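Engines without negation and intersection operators can approximate the second TXR example with a negative lookahead. A minimal Python sketch, assuming the built-in re module and reusing the input lines from the session above:

```python
import re

# "starts with a and ends with z" AND NOT "contains hede"
pattern = re.compile(r'^(?!.*hede)a.*z$')

inputs = ["az", "abcz", "abhederz", "ahedez", "ace", "ahedz"]
echoed = [s for s in inputs if pattern.match(s)]
print(echoed)
```

The lookahead plays the role of ~.*hede.* and the rest of the pattern plays the role of a.*z; unlike TXR's operators, this trick does not generalize to negating arbitrary sub-expressions.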

community wiki

asked

7 years ago

viewed

1714078 times

active

8 months ago

site design / logo © 2016 Stack Exchange Inc; user contributions licensed under cc by-sa 3.0 with attribution required