Regular Expressions Simple Guide
Regular expressions provide a way to describe and match data according to defined syntax rules. A regular expression is itself nothing more than a pattern of characters, which is matched against a piece of text.
A common operation when editing text is to search for a given string of characters, sometimes with the purpose of replacing it with another string. Many "search and replace" facilities have the option of using regular expressions instead of simple strings of characters.
You can think of a regular expression as a pattern that matches certain strings, namely all the strings in the language described by the regular expression. When a regular expression is used in a search operation, the goal is to find a string that matches the expression. This type of pattern matching is very useful.
In regular expressions, the alphabet usually includes all the characters on the keyboard. This leads to a problem, because regular expressions actually use two types of symbols: symbols that are members of the alphabet and special symbols such as * and ) that are used to construct expressions. These special symbols, which are not part of the language being described but are used in the description, are called metacharacters.
Metacharacters
There are 12 special characters (also called metacharacters), each with its own special meaning.
S.No. | Metacharacter | Name | Meaning
1 | [ | Opening Square Bracket | Starts a character class
2 | ] | Closing Square Bracket | Ends a character class
3 | \ | Backslash | Escape sequence; removes the special meaning of the metacharacter that follows
4 | ^ | Caret | Matches the start of the string the regex is applied to
5 | $ | Dollar Sign | Matches the end of the string the regex is applied to
6 | . | Period / Dot | Matches any single character
7 | | | Pipe Symbol | Or; matches either the left side or the right side of the symbol
8 | ? | Question Mark | Repeats previous item zero or one time (previous item optional)
9 | * | Asterisk | Repeats previous item zero or more times
10 | + | Plus Sign | Repeats previous item one or more times
11 | ( | Opening Round Bracket | Starts a group
12 | ) | Closing Round Bracket | Ends a group
Escape Sequence
An escape sequence indicates that you want to use one of the metacharacters as a literal. In a regular expression, an escape sequence is formed by placing the metacharacter \ (backslash) in front of the metacharacter.
A backslash escapes a special character to suppress its special meaning. For example, + (plus sign) has a special meaning, but \+ matches a literal +.
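As a minimal sketch of escaping, assuming Python's standard re module (the guide itself is not tied to any language), the following shows \+ matching a literal plus sign. The search text in needle is a made-up example; re.escape is the module's helper for escaping every metacharacter in a string.

```python
import re

# A bare + is a quantifier; \+ is a literal plus sign.
literal_plus = re.compile(r"\+")
plus_hits = literal_plus.findall("1+1=2")
print(plus_hits)  # ['+']

# Hypothetical search text that happens to contain metacharacters.
needle = "a.b*c"

# Used unescaped, . and * keep their special meaning.
print(re.search(needle, "aXbbbc") is not None)  # True

# re.escape() puts a backslash in front of each metacharacter.
escaped = re.escape(needle)
print(re.search(escaped, "aXbbbc") is not None)       # False
print(re.search(escaped, "xx a.b*c yy") is not None)  # True
```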
Square Brackets / Character Class
Brackets ([]) are used to represent a list, or range, of characters to be matched.
To make it easier to deal with the large number of characters in the alphabet, character classes are introduced. A character class consists of a list of characters enclosed between brackets, [ and ]. A character class matches a single character, which can be any of the characters in the list. For example, [0123456789] matches any one of the digits 0 through 9. The same thing could be expressed as (0|1|2|3|4|5|6|7|8|9).
For convenience, a hyphen can be included in a character class to indicate a range of characters. This means that [0123456789] could also be written as [0-9] and that the regular expression [a-z] will match any single lowercase letter. A character class can include multiple ranges, so that [a-zA-Z] will match any letter, lowercase or uppercase.
[ ] matches any one of the characters inside the square brackets, for a single character position. For example:
- [ab] matches any single character that is either a or b
- [a-d] matches any single lowercase letter from a through d
Several commonly used character ranges:
- [0-9] matches any decimal digit from 0 through 9.
- [a-z] matches any character from lowercase a through lowercase z.
- [A-Z] matches any character from uppercase A through uppercase Z.
- [A-Za-z] matches any letter, either uppercase A through Z or lowercase a through z.
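The character classes above can be tried out with a short sketch, again assuming Python's re module. The sample strings are invented for illustration; re.findall returns every non-overlapping match, so each list entry is one character matched by the class.

```python
import re

sample = "Room 42B"  # made-up input text

# Each class matches exactly one character per match.
digits = re.findall(r"[0-9]", sample)
letters = re.findall(r"[a-zA-Z]", sample)
a_to_d = re.findall(r"[a-d]", "abcdef")

print(digits)   # ['4', '2']
print(letters)  # ['R', 'o', 'o', 'm', 'B']
print(a_to_d)   # ['a', 'b', 'c', 'd']
```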
Quantifiers
Sometimes you might want to create regular expressions that look for characters based on their frequency or position. For example, you might want to find strings containing one or more instances of the letter p, strings containing at least two p’s, or even strings with the letter p as the beginning or terminating character.
- p+ matches any string containing at least one p.
- p* matches zero or more p’s in a row, so it also matches where there is no p at all.
- p? matches zero or one p, which makes the p optional.
- p{2} matches any string containing a sequence of two p’s.
- p{2,3} matches any string containing a sequence of two or three p’s.
- p{2,} matches any string containing a sequence of at least two p’s.
- ^p matches any string with p at the beginning of it.
- p$ matches any string with p at the end of it.
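The quantifiers in this list can be compared side by side with a small sketch, assuming Python's re module. The helper name whole and the sample strings are hypothetical. re.fullmatch requires the whole string to match the pattern, while re.search looks for a match anywhere inside it, which is the "string containing" sense used above.

```python
import re

samples = ["", "p", "pp", "ppp", "apple", "top"]  # made-up test strings

def whole(pattern, text):
    """Return True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

plus = [s for s in samples if whole(r"p+", s)]
star = [s for s in samples if whole(r"p*", s)]
optional = [s for s in samples if whole(r"p?", s)]
exactly_two = [s for s in samples if whole(r"p{2}", s)]
two_or_three = [s for s in samples if whole(r"p{2,3}", s)]

print(plus)          # ['p', 'pp', 'ppp']
print(star)          # ['', 'p', 'pp', 'ppp']
print(optional)      # ['', 'p']
print(exactly_two)   # ['pp']
print(two_or_three)  # ['pp', 'ppp']

# Searching anywhere in the string instead of matching all of it.
contains_pp = [s for s in samples if re.search(r"p{2}", s)]
ends_with_p = [s for s in samples if re.search(r"p$", s)]

print(contains_pp)  # ['pp', 'ppp', 'apple']
print(ends_with_p)  # ['p', 'pp', 'ppp', 'top']
```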
Negated Character Classes
When ^ is the first character inside [ ], it means "not any of the following characters".
Typing a caret after the opening square bracket will negate the character class. The result is that the character class will match any character that is not in the character class.
- [^abc] matches any single character that is not a, b, or c
It is important to note that a negated character class still must match a character. For example, q[^u] does not mean: "q not followed by u". It means: "q followed by a character that is not u".
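The difference can be seen in a short sketch, assuming Python's re module and made-up sample words. The last line uses a negative lookahead, q(?!u), which is one way that module expresses "q not followed by u"; it is shown only for contrast and is not covered elsewhere in this guide.

```python
import re

pattern = re.compile(r"q[^u]")

# No character follows the q in "Iraq", so there is nothing for [^u] to match.
no_follower = pattern.search("Iraq")
print(no_follower)  # None

# Here the q is followed by a space, which is a character that is not u.
with_space = pattern.search("Iraq is").group()
print(repr(with_space))  # 'q '

# The q in "quit" is followed by u, so the pattern fails.
print(pattern.search("quit"))  # None

# For contrast: "q not followed by u" written as a negative lookahead.
lookahead = re.search(r"q(?!u)", "Iraq").group()
print(lookahead)  # q
```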
Repeating Character Classes
If you repeat a character class by using the ?, * or + operators, you will repeat the entire character class, and not just the character that it matched. For example, the regex [0-9]+ can match 837 as well as 222.
If you want to repeat the matched character, rather than the class, you will need to use back references. For example, ([0-9])\1+ will match 222 but not 837. When applied to the string 833337, it will match 3333 in the middle of this string.
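A minimal sketch of the two behaviors, assuming Python's re module, where \1 inside a pattern refers back to whatever the first group matched:

```python
import re

# Repeating the class: each digit may differ.
any_digits = re.compile(r"[0-9]+")
print(any_digits.fullmatch("837") is not None)  # True
print(any_digits.fullmatch("222") is not None)  # True

# Repeating the matched character: \1 must equal the captured digit.
same_digit = re.compile(r"([0-9])\1+")
print(same_digit.fullmatch("222") is not None)  # True
print(same_digit.fullmatch("837") is not None)  # False

# Searching inside a longer string finds the run of identical digits.
run = same_digit.search("833337").group()
print(run)  # 3333
```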
Caret and Dollar
In most implementations, the metacharacter ^ can be used in a regular expression to match the beginning of a line of text, so that the expression ^[a-zA-Z]+ will only match a word that occurs at the start of a line. Similarly, $ is used as a metacharacter to match the end of a line.
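Whether ^ and $ work per line or per string depends on the implementation. As a sketch, assuming Python's re module: by default they anchor to the start and end of the whole string, and the re.MULTILINE flag switches them to line-by-line behavior. The sample text is invented.

```python
import re

text = "alpha 1\n2 beta\ngamma"  # three made-up lines

# Default: ^ only matches at the very start of the string.
string_start = re.findall(r"^[a-zA-Z]+", text)
print(string_start)  # ['alpha']

# With re.MULTILINE, ^ also matches right after each newline.
line_starts = re.findall(r"^[a-zA-Z]+", text, flags=re.MULTILINE)
print(line_starts)  # ['alpha', 'gamma']

# Likewise, $ then matches at the end of every line.
line_ends = re.findall(r"[a-zA-Z]+$", text, flags=re.MULTILINE)
print(line_ends)  # ['beta', 'gamma']
```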
Back References / Parentheses
When regular expressions are used in search-and-replace operations, a regular expression is used for the search pattern. A search is made in a (typically long) string for a substring that matches the pattern, and then the substring is replaced by a specified replacement pattern. The replacement pattern is not used for matching and is not a regular expression. However, it can be more than just a simple string. It’s possible to include parts of the substring that is being replaced in the replacement string.
The notations \0, \1, ... , \9 are used for this purpose. The first of these, \0, stands for the entire substring that is being replaced. The others are only available when parentheses are used in the search pattern. The notation \1 stands for "the part of the substring that matched the part of the search pattern beginning with the first ( in the pattern and ending with the matching )." Similarly, \2 represents whatever matched the part of the search pattern between the second pair of parentheses, and so on.
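The exact notation varies between tools. As a sketch, assuming Python's re module: \1 and \2 work in the replacement as described, but the entire match is written \g<0> rather than \0. The name-swapping pattern and sample strings are hypothetical.

```python
import re

# Two groups: the word before the comma and the word after it.
name_pattern = r"([A-Za-z]+), ([A-Za-z]+)"

# \2 and \1 insert what the second and first groups matched.
swapped = re.sub(name_pattern, r"\2 \1", "Lovelace, Ada")
print(swapped)  # Ada Lovelace

# \g<0> stands for the entire matched substring in Python.
wrapped = re.sub(r"[0-9]+", r"<\g<0>>", "room 42")
print(wrapped)  # room <42>
```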
Examples
gray|grey
It can match "gray" or "grey".
gr(a|e)y
It can match "gray" or "grey".
gr[a|e]y
It can match "gray" or "grey", but also "gr|y": inside a character class, | is a literal character, not alternation.
colou?r
It matches "color" or "colour". (zero or one occurrence of u)
.at
It matches any three-character string ending with at. For example: "hat", "cat", "bat", "rat", "fat"
[bc]at
It matches "bat" and "cat". (any single character inside square bracket)
<b>(.*)</b>
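It matches any string enclosed within <b> and </b>. Because .* matches as much as it can, the match runs from the first <b> to the last </b> on the line.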
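The examples above can be checked with a short sketch, assuming Python's re module. The word lists and the helper name matching are made up for illustration; re.fullmatch is used so that each word must match the pattern in its entirety.

```python
import re

def matching(pattern, words):
    """Return the words that match the pattern as a whole."""
    return [w for w in words if re.fullmatch(pattern, w)]

grays = ["gray", "grey", "gr|y", "griy"]
alt = matching(r"gray|grey", grays)
grouped = matching(r"gr(a|e)y", grays)
bracketed = matching(r"gr[a|e]y", grays)
print(alt)        # ['gray', 'grey']
print(grouped)    # ['gray', 'grey']
print(bracketed)  # ['gray', 'grey', 'gr|y']

colors = matching(r"colou?r", ["color", "colour", "colouur"])
print(colors)  # ['color', 'colour']

dot_at = matching(r".at", ["hat", "cat", "at", "flat"])
bc_at = matching(r"[bc]at", ["bat", "cat", "hat"])
print(dot_at)  # ['hat', 'cat']
print(bc_at)   # ['bat', 'cat']

# .* is greedy: it runs to the last </b>, not the first one.
bold = re.search(r"<b>(.*)</b>", "<b>one</b> and <b>two</b>").group(1)
print(bold)  # one</b> and <b>two
```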