
Learn X in Y minutes

Where X=PCRE

A regular expression (regex or regexp for short) is a special text string that describes a search pattern. For example, to extract the scheme (protocol) from a URL we can write /^[a-z]+:/, which matches http: at the start of an http URL.
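
As a quick sketch of the pattern above in action, the snippet below uses Python's built-in re module. It is not PCRE itself, but it accepts the same syntax for this pattern. The URL is a made-up placeholder, not one taken from this page.

```python
import re

# Hypothetical input; any URL with a lowercase scheme would do.
url = "http://example.com/index.html"

# ^ anchors at the start, [a-z]+ takes one or more lowercase letters,
# and : is matched literally.
match = re.search(r"^[a-z]+:", url)
scheme = match.group(0) if match else None
print(scheme)  # http:
```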

PCRE (Perl Compatible Regular Expressions) is a C library that implements regular expressions. It was written in 1997, when Perl was the de facto choice for complex text processing tasks, and its pattern syntax closely resembles Perl's. PCRE is used in many large projects, including PHP, Apache, and R.

There are two different sets of metacharacters:

* Those that are recognized anywhere in the pattern except within square brackets

  \      general escape character with several uses
  ^      assert start of string (or line, in multiline mode)
  $      assert end of string (or line, in multiline mode)
  .      match any character except newline (by default)
  [      start character class definition
  |      start of alternative branch
  (      start subpattern
  )      end subpattern
  ?      extends the meaning of (
         also 0 or 1 quantifier
         also quantifier minimizer
  *      0 or more quantifier
  +      1 or more quantifier
         also "possessive quantifier"
  {      start min/max quantifier

* Those that are recognized within square brackets

  \      general escape character
  ^      negate the class, but only if the first character
  -      indicates character range
  [      POSIX character class (only if followed by POSIX syntax)
  ]      terminates the character class
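
To make the metacharacters above more concrete, here is a minimal sketch using Python's built-in re module. This is an assumption of convenience: re is not PCRE, but it treats the metacharacters shown here the same way. The sample strings are invented for illustration, and possessive quantifiers are left out because older Python versions do not support them.

```python
import re

# Hypothetical sample text, chosen only to exercise the metacharacters.
text = "cat hat bat\ncot"

# . matches any character except newline; [ch] is a class of two characters.
dot_and_class = re.findall(r"[ch].t", text)            # ['cat', 'hat', 'cot']

# Inside a class, a leading ^ negates it and - denotes a range.
negated = re.findall(r"[^a-c]at", text)                # ['hat']

# Outside a class, ^ is an anchor; MULTILINE makes it match at every line start.
line_starts = re.findall(r"^c.t", text, re.MULTILINE)  # ['cat', 'cot']

# | separates alternatives and ( ) groups them; (?: ) is a non-capturing group,
# one of the ways ? extends the meaning of (.
alternation = re.findall(r"(?:c|h)at", text)           # ['cat', 'hat']

# ? as a quantifier minimizer: .+ is greedy, .+? stops as early as possible.
tags = "<b>bold</b>"
greedy = re.findall(r"<.+>", tags)                     # ['<b>bold</b>']
lazy = re.findall(r"<.+?>", tags)                      # ['<b>', '</b>']

# { starts a min/max quantifier: two or three digits in a row.
bounded = re.findall(r"\d{2,3}", "1 22 333 4444")      # ['22', '333', '444']

# \ escapes a metacharacter so that it is matched literally.
literal_dot = re.findall(r"\d\.\d", "2.1 and 2x1")     # ['2.1']
```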

PCRE provides some generic character types, also called character classes.

  \d     any decimal digit
  \D     any character that is not a decimal digit
  \h     any horizontal white space character
  \H     any character that is not a horizontal white space character
  \s     any white space character
  \S     any character that is not a white space character
  \v     any vertical white space character
  \V     any character that is not a vertical white space character
  \w     any "word" character
  \W     any "non-word" character
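
The character types can be tried out with a short sketch like the one below. It uses Python's built-in re module as a stand-in, which is an assumption: re supports \d, \D, \s, \S, \w and \W, but not PCRE's \h, \H and \V, and it reads \v as the vertical tab character only. The class [ \t] is used here as a rough approximation of \h. The sample string is invented.

```python
import re

# Hypothetical sample text containing a space, a tab, and a newline.
sample = "Order 66:\tshipped 2004-09-18\n"

digits = re.findall(r"\d+", sample)       # ['66', '2004', '09', '18']
non_digits = re.findall(r"\D+", sample)   # ['Order ', ':\tshipped ', '-', '-', '\n']
words = re.findall(r"\w+", sample)        # ['Order', '66', 'shipped', '2004', '09', '18']
non_words = re.findall(r"\W+", sample)    # [' ', ':\t', ' ', '-', '-', '\n']
spaces = re.findall(r"\s", sample)        # [' ', '\t', ' ', '\n']
non_spaces = re.findall(r"\S+", sample)   # ['Order', '66:', 'shipped', '2004-09-18']

# Rough stand-in for PCRE's \h (space and tab only; PCRE's \h covers more).
horizontal = re.findall(r"[ \t]", sample)  # [' ', '\t', ' ']
```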


We will test our examples on the following line from a standard Apache access log:

  - - [18/Sep/2004:11:07:48 +1000] "GET /robots.txt HTTP/1.0" 200 468 "-" "Googlebot/2.1"

| Regex | Result | Comment |
| --- | --- | --- |
| GET | GET | GET matches the characters GET literally (case sensitive) |
| \d+\.\d+\.\d+\.\d+ | | \d+ matches a digit [0-9] one or more times, as defined by the + quantifier; \. matches . literally |
| (\d+\.){3}\d+ | | (\d+\.){3} matches the group (\d+\.) exactly three times |
| \[.+\] | [18/Sep/2004:11:07:48 +1000] | \[ and \] match the brackets literally; .+ matches any character (except newline) one or more times |
| ^\S+ | | ^ means start of the line; \S+ matches one or more non-whitespace characters |
| \+[0-9]+ | +1000 | \+ matches the character + literally; the [0-9] character class matches a single digit. The same can be achieved using \+\d+ |
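
The patterns from the table can also be checked locally. The sketch below runs each one with Python's built-in re module, which is not PCRE but handles these particular patterns the same way. Note one assumption: the client address 192.0.2.1 at the start of the log line is a hypothetical placeholder added so that the address patterns have something to match; the rest of the line is the sample string used above.

```python
import re

# 192.0.2.1 is a hypothetical placeholder address (reserved for documentation);
# everything after it is the sample log line from the text.
log = ('192.0.2.1 - - [18/Sep/2004:11:07:48 +1000] '
       '"GET /robots.txt HTTP/1.0" 200 468 "-" "Googlebot/2.1"')

patterns = [
    r"GET",                  # literal text
    r"\d+\.\d+\.\d+\.\d+",   # four dot-separated runs of digits
    r"(\d+\.){3}\d+",        # same idea, using a repeated group
    r"\[.+\]",               # everything between literal brackets
    r"^\S+",                 # first run of non-whitespace on the line
    r"\+[0-9]+",             # a literal + followed by digits
]

results = {}
for pattern in patterns:
    m = re.search(pattern, log)
    # group(0) is the whole match, regardless of any capturing groups.
    results[pattern] = m.group(0) if m else None
    print(f"{pattern:22} -> {results[pattern]}")
```

With the placeholder address, the second, third, and fifth patterns all return 192.0.2.1, since each finds the leftmost match at the start of the line.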

All these examples can be tried at

  1. Copy the example string into the TEST STRING section
  2. Copy the regex into the Regular Expression section
  3. The web application will show the matching result



Originally contributed by Sachin Divekar, and updated by 0 contributor(s).