
Lookaround, which I introduced in detail in the previous topic, is a very powerful concept. Unfortunately, it is often underused by people new to regular expressions, because lookaround is a bit confusing. The confusing part is that the lookaround is zero-width. So if you have a regex in which a lookahead is followed by another piece of regex, or a lookbehind is preceded by another piece of regex, then the regex will traverse part of the string twice.
To make this clear, I would like to give you a more practical example. Let's say we want to find a word that is six letters long and contains the three consecutive letters cat. Actually, we can match this without lookaround. We just specify all the options and lump them together using alternation: cat\w{3}|\wcat\w{2}|\w{2}cat\w|\w{3}cat. Easy enough. But this method gets unwieldy if you want to find any word between 6 and 12 letters long containing either "cat", "dog" or "mouse".
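As a rough illustration, here is a minimal Python sketch of the alternation approach, assuming Python's standard re module. The surrounding \b(?:...)\b wrapper is my own addition, not part of the regex above; without it the alternation would also match six-letter stretches inside longer words. The sample text is made up.

```python
import re

# The four alternatives place "cat" at offset 0, 1, 2 or 3 of a six-letter run.
# Assumption: the \b(?:...)\b wrapper is added here so only whole words match.
alternation = re.compile(r"\b(?:cat\w{3}|\wcat\w{2}|\w{2}cat\w|\w{3}cat)\b")

sample_text = "catnip bobcat locate wildcat cat"
alternation_matches = alternation.findall(sample_text)
print(alternation_matches)  # ['catnip', 'bobcat', 'locate']
```

Each extra word length or keyword multiplies the number of alternatives, which is why this approach does not scale.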
In this example, we basically have two requirements for a successful match. First, we want a word that is 6 letters long. Second, the word we found must contain the word "cat".
Matching a 6-letter word is easy with \b\w{6}\b. Matching a word containing "cat" is equally easy: \b\w*cat\w*\b.
Combining the two, we get: (?=\b\w{6}\b)\b\w*cat\w*\b. Easy! Here's how this works. At each character position in the string where the regex is attempted, the engine will first attempt the regex inside the positive lookahead. This sub-regex, and therefore the lookahead, matches only when the current character position in the string is at the start of a 6-letter word in the string. If not, the lookahead will fail, and the engine will continue trying the regex from the start at the next character position in the string.
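To see the combined regex in action, here is a small Python sketch using the standard re module. The function name six_letter_cat_words and the sample sentence are invented for illustration.

```python
import re

# Lookahead checks "exactly six word characters"; the rest checks "contains cat".
combined = re.compile(r"(?=\b\w{6}\b)\b\w*cat\w*\b")

def six_letter_cat_words(text):
    """Return every 6-letter word in text that contains 'cat' (hypothetical helper)."""
    return combined.findall(text)

sample = "The bobcat will locate catnip near a wildcat, a cat, and a kitten."
print(six_letter_cat_words(sample))  # ['bobcat', 'locate', 'catnip']
```

"wildcat" and "cat" fail the length requirement in the lookahead, while "kitten" passes the lookahead but fails the second requirement.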
The lookahead is zero-width. So when the regex inside the lookahead has found the 6-letter word, the current position in the string is still at the beginning of the 6-letter word. At this position, the regex engine attempts the remainder of the regex. Because we already know that a 6-letter word can be matched at the current position, we know that \b matches and that the first \w* will match 6 times. The engine will then backtrack, reducing the number of characters matched by \w*, until cat can be matched. If cat cannot be matched, the engine has no other choice but to restart at the beginning of the regex, at the next character position in the string. This is at the second letter in the 6-letter word we just found, where the lookahead will fail, causing the engine to advance character by character until the next 6-letter word.
If cat can be successfully matched, the second \w* will consume the remaining letters, if any, in the 6-letter word. After that, the last \b in the regex is guaranteed to match where the second \b inside the lookahead matched. Our double-requirement-regex has matched successfully.
While the above regex works just fine, it is not the most efficient solution. This is not a problem if you are just doing a search in a text editor. But optimizing things is a good idea if this regex will be used repeatedly and/or on large chunks of data in an application you are developing.
You can discover these optimizations by yourself if you carefully examine the regex and follow how the regex engine applies it, as I did above. I said the third and last \b are guaranteed to match. Since they are zero-width, and therefore do not change the result returned by the regex engine, we can remove them, leaving: (?=\b\w{6}\b)\w*cat\w*. Though the last \w* is also guaranteed to match, we cannot remove it because it adds characters to the regex match. Remember that the lookahead discards its match, so it does not contribute to the match returned by the regex engine. If we omitted the \w*, the resulting match would be the start of a 6-letter word containing "cat", up to and including "cat", instead of the entire word.
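The effect of these two changes can be checked with a short Python sketch (standard re module; the variable names and sample text are my own). It compares the original regex, the version without the redundant \b tokens, and a version that also drops the trailing \w*.

```python
import re

sample = "bobcat locate catnip wildcat kitten"

original = re.compile(r"(?=\b\w{6}\b)\b\w*cat\w*\b")
without_extra_b = re.compile(r"(?=\b\w{6}\b)\w*cat\w*")
without_trailing_w = re.compile(r"(?=\b\w{6}\b)\w*cat")

original_matches = original.findall(sample)
simplified_matches = without_extra_b.findall(sample)
truncated_matches = without_trailing_w.findall(sample)

print(original_matches)    # ['bobcat', 'locate', 'catnip']
print(simplified_matches)  # ['bobcat', 'locate', 'catnip']
print(truncated_matches)   # ['bobcat', 'locat', 'cat']
```

Dropping the two \b tokens leaves the results unchanged, while dropping the trailing \w* cuts each match off right after "cat".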
But we can optimize the first \w*. As it stands, it will match 6 letters and then backtrack. But we know that in a successful match, there can never be more than 3 letters before "cat". So we can optimize this to \w{0,3}. Note that making the asterisk lazy would not have optimized this sufficiently. The lazy asterisk would find a successful match sooner, but if a 6-letter word does not contain "cat", it would still cause the regex engine to try matching "cat" at the last two letters, at the last single letter, and even at one character beyond the 6-letter word.
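The following Python sketch (standard re module, made-up sample text) only confirms that the greedy, lazy and bounded variants return the same matches. It does not measure how many steps the engine takes, so the performance argument above still rests on reasoning about backtracking rather than on this code.

```python
import re

sample = "bobcat locate catnip catcat wildcat kitten cat"

variants = {
    "greedy": r"(?=\b\w{6}\b)\w*cat\w*",
    "lazy": r"(?=\b\w{6}\b)\w*?cat\w*",
    "bounded": r"(?=\b\w{6}\b)\w{0,3}cat\w*",
}

# Collect the matches of each variant on the same text.
variant_results = {name: re.findall(rx, sample) for name, rx in variants.items()}

for name, found in variant_results.items():
    print(name, found)
```

All three print the same four words. The variants differ only in how much work the engine does on words such as "kitten" that pass the lookahead but do not contain "cat".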
So we have (?=\b\w{6}\b)\w{0,3}cat\w*. One last, minor, optimization involves the first \b. Since it is zero-width itself, there's no need to put it inside the lookahead. So the final regex is: \b(?=\w{6}\b)\w{0,3}cat\w*.
So, what would you use to find any word between 6 and 12 letters long containing either "cat", "dog" or "mouse"? Again we have two requirements, which we can easily combine using a lookahead: \b(?=\w{6,12}\b)\w{0,9}(cat|dog|mouse)\w*. The \w{0,9} follows the same reasoning as before: in a 12-letter word, at most 12 - 3 = 9 letters can precede the shortest keyword. Very easy, once you get the hang of it. This regex will also put "cat", "dog" or "mouse" into the first backreference.
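The same recipe can be generalized. Below is a hedged Python sketch in which build_word_pattern is a hypothetical helper of my own, not something from this tutorial: it assembles the regex from a length range and a keyword list, using the maximum length minus the shortest keyword as the upper bound on the prefix.

```python
import re

def build_word_pattern(min_len, max_len, keywords):
    """Hypothetical helper: regex for words of min_len..max_len letters containing a keyword."""
    shortest = min(len(k) for k in keywords)
    max_prefix = max_len - shortest  # most letters that can precede a keyword
    alternation = "|".join(re.escape(k) for k in keywords)
    return rf"\b(?=\w{{{min_len},{max_len}}}\b)\w{{0,{max_prefix}}}({alternation})\w*"

pattern_text = build_word_pattern(6, 12, ["cat", "dog", "mouse"])
print(pattern_text)  # \b(?=\w{6,12}\b)\w{0,9}(cat|dog|mouse)\w*

pattern = re.compile(pattern_text)
sample = "A bulldog chased the mousetrap past a cat and a caterpillar."
found = [(m.group(0), m.group(1)) for m in pattern.finditer(sample)]
print(found)  # [('bulldog', 'dog'), ('mousetrap', 'mouse'), ('caterpillar', 'cat')]
```

Group 1 holds the keyword that was found, which corresponds to the first backreference mentioned above.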
Page URL: http://www.Regular-Expressions.info/lookaround2.html
Page last updated: 09 February 2007
Site last updated: 01 December 2008
Copyright © 2003-2008 Jan Goyvaerts. All rights reserved.