PHP Regular Expressions
PHP Regular Expression Reference
This reference is an accumulation of things I've learnt from the sites listed at the bottom and through my own learning. I recommend using this program to try things out.
http://weitz.de/regex-coach/
Thanks.
Identifying Positions
The caret symbol (^) identifies the beginning of a string, while the dollar symbol ($) represents the end of it.
/^Hello/ - This would match a string beginning with the word "Hello".
/Goodbye$/ - while this would match a string ending with "Goodbye".
These symbols can also be used together.
/^Hello Goodbye$/ - This matches the exact string "Hello Goodbye".
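As a quick way to try these out in PHP, here is a minimal sketch using preg_match(), which returns 1 when the pattern matches and 0 when it does not. The variable names and test strings are only illustrative.

```php
<?php
// preg_match() returns 1 on a match, 0 on no match, and false on error.
$startsWithHello = preg_match('/^Hello/', 'Hello world');             // 1
$endsWithGoodbye = preg_match('/Goodbye$/', 'Well, Goodbye');         // 1

// Using both symbols together requires the whole string to match.
$exactMatch = preg_match('/^Hello Goodbye$/', 'Hello Goodbye');       // 1
$notExact   = preg_match('/^Hello Goodbye$/', 'Oh, Hello Goodbye');   // 0

var_dump($startsWithHello, $endsWithGoodbye, $exactMatch, $notExact);
```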
However, the caret and dollar symbols act differently when using the multiple line pattern modifier.
/^Hello$\n^Goodbye$/m - Would match the string:
"Hello
Goodbye"
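With this modifier, the two symbols match the start and finish of each line rather than of the whole string. To avoid any complications, "\A" and "\Z" can be used instead; they always refer to the start and end of the whole string, whatever modifier is in use. (\Z also matches just before a final newline; \z matches only at the very end.)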
/\AHello Goodbye\Z/ - Matches the string "Hello Goodbye"
/\AHello\Z\n\AGoodbye\Z/m - Would not match the string:
"Hello
Goodbye"
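The difference is easier to see side by side. This is a small sketch, assuming a two line test string of my own choosing; note the single quotes around the patterns, so PHP passes the backslashes through to the regex engine untouched.

```php
<?php
$text = "Hello\nGoodbye";

// With the m modifier, ^ and $ match at the start and end of each line.
$perLine = preg_match('/^Hello$\n^Goodbye$/m', $text);          // 1

// Without m, $ cannot match before a newline in the middle of the string.
$noModifier = preg_match('/^Hello$\n^Goodbye$/', $text);        // 0

// \A and \Z ignore the m modifier, so \Z fails after "Hello".
$wholeString = preg_match('/\AHello\Z\n\AGoodbye\Z/m', $text);  // 0

// Used once at each end, they anchor the whole two line string.
$anchored = preg_match('/\AHello\nGoodbye\Z/m', $text);         // 1

var_dump($perLine, $noModifier, $wholeString, $anchored);
```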
Identifying Characters
Note that these are case sensitive: the uppercase form of a symbol usually means the opposite of the lowercase form.
\b - Word boundary
This allows you to distinguish whole words, for instance in the strings "This is invalid" and "This is valid". A word boundary is located between a word character (see \w) and a non-word character, or between a word character and the start or end of the string.
/valid/ - This search for valid would match in both strings, but by using a word boundary, we can differentiate between the words "invalid" and "valid".
/\bvalid\b/ - This would only match the string "This is valid". Notice that this word boundary is used on both sides; if it were only used before "valid", words such as "validate" would be matched as well.
\B - Non-word boundary
This is the opposite of a word boundary. In the string "This is a word", a non-word boundary can be thought of as the position between "T" and "h", or between any two characters of the same word.
/T\Bhis/ - Would match "This" in the string "This is a word".
/This\B/ - Since "s" is followed by a space, which is a non-word character, the position after "s" is classed as a word boundary and therefore the pattern does not match.
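The boundary examples above can be checked with a short script. This is just a sketch; the extra "Please validate" string is my own addition to show what happens when the boundary is only used on one side.

```php
<?php
// Without boundaries, "valid" is found inside "invalid".
$plain = preg_match('/valid/', 'This is invalid');             // 1

// With a boundary on both sides, only the whole word matches.
$boundedInvalid = preg_match('/\bvalid\b/', 'This is invalid'); // 0
$boundedValid   = preg_match('/\bvalid\b/', 'This is valid');   // 1

// A boundary on the left only still lets "validate" through.
$leadingOnly = preg_match('/\bvalid/', 'Please validate');     // 1

// \B matches between two word characters, but not at the end of a word.
$insideWord = preg_match('/T\Bhis/', 'This is a word');        // 1
$endOfWord  = preg_match('/This\B/', 'This is a word');        // 0

var_dump($plain, $boundedInvalid, $boundedValid, $leadingOnly, $insideWord, $endOfWord);
```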
\d - Numerical character
A single character which is a digit (0 to 9).
/\d/ - Would match any single digit from 0 to 9.
\D - Non-numerical character
A single character which is not numeric.
/\D/ - Would match any single character that is not a digit, such as "a" or "-".
\n - Newline character
A single newline character. ASCII Number 10.
\r - Carriage return character
A single carriage return character. ASCII Number 13.
\s - Whitespace character
A single whitespace character, such as a carriage return, new line, space, tab or form feed.
\S - Non-whitespace character
A single non-whitespace character. This represents any character besides those mentioned for Whitespace character.
\t - Tab character
A single tab character. ASCII Number 9.
\w - Word character
A single word character. This represents any letter of the alphabet, any digit from 0 to 9 or the underscore character.
/\w/ - Would match strings such as "a", "x", "4", "_".
\W - Non-word character
A single non-word character. This represents any character besides those mentioned for Word character.
/\W/ - Would not match strings like "a", "x", "4", "_".
. - Dot character
A single dot character represents any character, with the exception of the new line character.
/./ - Will match any single character string, with the exception of new lines.
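To get a feel for which characters each symbol covers, one option is to count matches with preg_match_all(), which returns the number of matches found. This is a rough sketch with a made-up input string; the variable names are only for illustration.

```php
<?php
// 17 characters: letters, digits, a space, a tab, an underscore, a dot and a newline.
$input = "Room 42\tis_free.\n";

$digitCount = preg_match_all('/\d/', $input, $digits);  // the digits 4 and 2
$wordCount  = preg_match_all('/\w/', $input);           // letters, digits and the underscore
$spaceCount = preg_match_all('/\s/', $input);           // the space, the tab and the newline
$nonWord    = preg_match_all('/\W/', $input);           // everything \w does not cover
$dotCount   = preg_match_all('/./', $input);            // every character except the newline

var_dump($digits[0], $digitCount, $wordCount, $spaceCount, $nonWord, $dotCount);
```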
Repeating Characters
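Single characters are rarely searched for. Instead, we may be searching for a word of a specific length.
/\w{4}/ - This would match any run of 4 word characters, such as a 4 letter word. Without anchors or word boundaries, it also matches the first 4 characters of a longer word.
On other occasions we may have a minimum length.
/\w{4,}/ - This would match any word of 4 characters or more.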
On other occasions we may have a maximum length.
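/\w{0,4}/ - This would match any word with a length from 0 up to a length of 4. Write the range without a space after the comma; older PCRE versions treat "{0, 4}" as literal text rather than a repeat count.
With a slight variation of the one above, we can search for words within a set range.
/\w{2,4}/ - This would match words with a length between 2 and 4.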
There are also special symbols that do this.
* This is the same as using {0,}
+ This is the same as using {1,}
? This is the same as using {0,1}
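Here is a small sketch of the repeat counts in action. I have anchored most of the patterns with ^ and $ so that the length applies to the whole string; the test words and the "colou?r" pattern are my own examples.

```php
<?php
// Exact length: anchored, so the whole string must be 4 word characters.
$exactlyFour = preg_match('/^\w{4}$/', 'word');      // 1
$tooLong     = preg_match('/^\w{4}$/', 'words');     // 0

// Unanchored, the pattern is satisfied by the first 4 characters of "words".
$unanchored  = preg_match('/\w{4}/', 'words');       // 1

// Minimum length and a set range.
$atLeastFour = preg_match('/^\w{4,}$/', 'wording');  // 1
$twoToFour   = preg_match('/^\w{2,4}$/', 'a');       // 0

// The shorthand symbols.
$optional   = preg_match('/^colou?r$/', 'color');    // 1, ? allows zero or one "u"
$oneOrMore  = preg_match('/^\d+$/', '');             // 0, + needs at least one digit
$zeroOrMore = preg_match('/^\d*$/', '');             // 1, * accepts no digits at all

var_dump($exactlyFour, $tooLong, $unanchored, $atLeastFour, $twoToFour, $optional, $oneOrMore, $zeroOrMore);
```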
The OR operator
The OR operator (the vertical bar) matches either the pattern on its left OR the pattern on its right.
/Hello|Goodbye/ - This will match the string "Hello" OR the string "Goodbye"
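A short sketch of the OR operator with preg_match(). The third argument receives the text that matched. One thing worth knowing is that the OR splits the whole pattern, so anchors belong to one side only unless the alternatives are grouped; the test strings here are my own.

```php
<?php
// Either alternative is enough for a match; $found[0] holds the matched text.
$either  = preg_match('/Hello|Goodbye/', 'Goodbye for now', $found);   // 1
$neither = preg_match('/Hello|Goodbye/', 'See you');                   // 0

// This reads as "starts with Hello" OR "ends with Goodbye".
$ungrouped = preg_match('/^Hello|Goodbye$/', 'Hello Goodbye');         // 1

// Grouping makes the anchors apply to both alternatives.
$grouped = preg_match('/^(?:Hello|Goodbye)$/', 'Hello Goodbye');       // 0

var_dump($either, $found[0], $neither, $ungrouped, $grouped);
```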
Sub patterns
Characters can be grouped together to create more complex patterns.
/\w{4}(\d{2}\w{2})?/ - This pattern searches for 4 word characters, followed by either nothing or 2 numerical digits and 2 word characters.
Usually when using sub patterns, each matched sub pattern is also placed into the matches array (for example, the one filled in by preg_match). To prevent this, place ?: at the beginning of the sub pattern.
/\w{4}(?:\d{2}\w{2})?/
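The effect of ?: shows up in the array that preg_match() fills in. A minimal sketch, assuming the test string "abcd12ef" (my own choice):

```php
<?php
$subject = 'abcd12ef';

// A normal sub pattern adds its own entry after the full match.
preg_match('/\w{4}(\d{2}\w{2})?/', $subject, $capturing);
// $capturing is ['abcd12ef', '12ef']

// With ?: the sub pattern still has to match, but it is not stored.
preg_match('/\w{4}(?:\d{2}\w{2})?/', $subject, $nonCapturing);
// $nonCapturing is ['abcd12ef']

var_dump($capturing, $nonCapturing);
```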
Character Classes
In a character class you can specify a list of single characters, or a range of characters, that one character of the string can match.
To match certain characters literally inside a class, you must escape them with a backslash. These are the backslash, caret, hyphen and closing square bracket.
/[abc]/ - This will match any one of these strings: "a", "b" and "c".
/[a-c]/ - This will also match the three strings, but is a short method of writing a large list.
However, you should be careful when using a range.
/[A-z]/ - This seems harmless, but it will actually match underscores and carets, amongst other non-alphabetic characters that sit between "Z" and "a" in ASCII. To prevent this, specify two ranges.
/[A-Za-z]/ - This overcomes the problem.
The caret has a special function when it is not backslashed and is the first character in the class. It negates the character class.
/[^0-9]/ - This would match a single character that is not numeric.
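The range pitfall is easy to reproduce. This is a small sketch with test strings I picked myself; the patterns are anchored so that every character of the string has to belong to the class.

```php
<?php
// Any one of the listed characters is enough.
$listed = preg_match('/[abc]/', 'xyzb');                  // 1

// [A-z] lets the underscore through, because it lies between Z and a in ASCII.
$looseRange  = preg_match('/^[A-z]+$/', 'snake_case');    // 1
$strictRange = preg_match('/^[A-Za-z]+$/', 'snake_case'); // 0
$lettersOnly = preg_match('/^[A-Za-z]+$/', 'snakeCase');  // 1

// A negated class needs at least one character that is not a digit.
$allDigits = preg_match('/[^0-9]/', '12345');             // 0
$hasLetter = preg_match('/[^0-9]/', '123a5');             // 1

var_dump($listed, $looseRange, $strictRange, $lettersOnly, $allDigits, $hasLetter);
```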
Pattern Modifiers
Modifiers alter the behaviour of the pattern. They are placed after the closing delimiter, and several can be combined.
/pattern/Z
Where Z is the modifier.
i - Case insensitive
When this modifier is in use the whole pattern becomes case insensitive.
m - Multiple Lines
When this modifier is active, the string is treated as multiple lines rather than as a single line. This is the situation where the caret and dollar symbols act differently: they match at the start and end of each line.
s - Dot All
When this modifier is being used the dot character can also represent the new line character.
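Each modifier can be compared against the same pattern without it. A minimal sketch, assuming a two line test string of my own; the last line combines two modifiers.

```php
<?php
// i: case insensitive.
$caseSensitive   = preg_match('/hello/', 'HELLO');     // 0
$caseInsensitive = preg_match('/hello/i', 'HELLO');    // 1

$text = "first\nsecond";

// m: ^ and $ match at the start and end of each line.
$singleLine = preg_match('/^second$/', $text);         // 0
$multiLine  = preg_match('/^second$/m', $text);        // 1

// s: the dot can also match the new line character.
$dotDefault = preg_match('/first.second/', $text);     // 0
$dotAll     = preg_match('/first.second/s', $text);    // 1

// Modifiers can be combined after the closing delimiter.
$combined = preg_match('/^SECOND$/im', $text);         // 1

var_dump($caseSensitive, $caseInsensitive, $singleLine, $multiLine, $dotDefault, $dotAll, $combined);
```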
Resources
http://www.regular-expressions.info/tutorial.html
http://weblogtoolscollection.com/regex/regex.php