Saturday, July 21, 2007

PHP Regular Expressions

PHP Regular Expression Reference

This reference is an accumulation of things I've learnt from the sites listed at the bottom and through my own learning. I recommend using this program to try things out.

http://weitz.de/regex-coach/

Thanks.


Identifying Positions

The caret symbol identifies the beginning of a string, while the dollar symbol represents the end of it.

/^Hello/ - This would match a string beginning with the word "Hello".
/Goodbye$/ - while this would match a string ending with "Goodbye".

These symbols can also be used together.

/^Hello Goodbye$/ - This matches the exact string "Hello Goodbye".

However, the caret and dollar symbols act differently when using the multiple line pattern modifier.

/^Hello$\n^Goodbye$/m - Would match the string:
"Hello
Goodbye"

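With this modifier, the two symbols match the start and end of each line as well as of the whole string. To anchor to the whole string regardless of the modifier, "\A" and "\Z" can be used instead. Note that "\Z" also matches just before a newline at the very end of the string; "\z" matches only at the absolute end.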

/\AHello Goodbye\Z/ - Matches the string "Hello Goodbye"

/\AHello\Z\n\AGoodbye\Z/m - Would not match the string:
"Hello
Goodbye"
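As a rough sketch of the anchors above in PHP, assuming preg_match() is used to run the patterns (the variable names are made up for illustration). preg_match() returns 1 when the pattern matches and 0 when it does not.

```php
<?php
// Illustrative strings; the variable names are not from the original post.
$single = "Hello Goodbye";
$multi  = "Hello\nGoodbye"; // double quotes so \n is a real newline

// By default ^ and $ anchor to the start and end of the whole string.
$startsWithHello = preg_match('/^Hello/', $single);          // 1
$endsWithGoodbye = preg_match('/Goodbye$/', $single);        // 1
$exact           = preg_match('/^Hello Goodbye$/', $single); // 1

// Without the m modifier, $ cannot match in front of the inner newline.
$withoutM = preg_match('/^Hello$\n^Goodbye$/', $multi);      // 0
// With the m modifier, ^ and $ also match at every line break.
$withM    = preg_match('/^Hello$\n^Goodbye$/m', $multi);     // 1

// \A and \Z are not affected by the m modifier.
$absolute = preg_match('/\AHello\Z\n\AGoodbye\Z/m', $multi); // 0

var_dump($startsWithHello, $endsWithGoodbye, $exact, $withoutM, $withM, $absolute);
```

The patterns are written in single quotes so that PHP passes the backslashes and dollar signs through to the regex engine unchanged.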


Identifying Characters

Note that these are case sensitive: the upper case form of each one matches the opposite of the lower case form.

\b - Word boundary

This allows you to distinguish whole words, for instance in the strings "This is invalid" and "This is valid". A word boundary is located between a word character (see \w below) and a non-word character, or between a word character and the start or end of the string.

/valid/ - This search for valid would match in both strings, but by using a word boundary, we can differentiate between the words "invalid" and "valid".

/\bvalid\b/ - This would only match the string "This is valid". Notice that this word boundary is used on both sides; if it were only used before "valid", words such as "validate" would be matched as well.

\B - Non-word boundary
This is the opposite of a word boundary. In the string "This is a word", a non-word boundary can be thought of as the position between "T" and "h", or between any two characters of the same word.

/T\Bhis/ - Would match "This" in the string "This is a word".

/This\B/ - Since the "s" is followed by a space, which is not a word character, the position after "s" is a word boundary and therefore the pattern does not match.
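The boundary examples above could be checked in PHP along these lines. This is only a sketch using preg_match(); the extra test strings ("Please validate this" and "Thistle") are my own additions for illustration.

```php
<?php
$valid   = "This is valid";
$invalid = "This is invalid";

// Without boundaries, "valid" is found inside "invalid" too.
$plainInInvalid   = preg_match('/valid/', $invalid);                   // 1
// With a boundary on both sides only the whole word matches.
$boundedInValid   = preg_match('/\bvalid\b/', $valid);                 // 1
$boundedInInvalid = preg_match('/\bvalid\b/', $invalid);               // 0
// A boundary on the left only still lets "validate" through.
$leftOnly         = preg_match('/\bvalid/', "Please validate this");   // 1

// \B needs word characters (or non-word characters) on both sides.
$insideWord = preg_match('/T\Bhis/', "This is a word"); // 1
$atWordEnd  = preg_match('/This\B/', "This is a word"); // 0
$midWord    = preg_match('/This\B/', "Thistle");        // 1

var_dump($plainInInvalid, $boundedInValid, $boundedInInvalid, $leftOnly);
var_dump($insideWord, $atWordEnd, $midWord);
```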

\d - Numerical character
A single character which is a digit (0 to 9).
/\d/ - Would match any digit from 0 to 9.

\D - Non-numerical character
A single character which is not numeric.
/\D/ - Would not match a number.

\n - Newline character
A single newline character. ASCII Number 10.

\r - Carriage return character

A single carriage return character. ASCII Number 13.

\s - Whitespace character
A single whitespace character, which represents a carriage return, new line, space, tab or form feed.

\S - Non-whitespace character
A single non-whitespace character: any character besides those mentioned for Whitespace character.

\t - Tab character
A single tab character. ASCII Number 9.

\w - Word character
A single word character represents all the characters of the alphabet, numbers 0-9 and the underscore character.

/\w/ - Would match strings such as "a", "x", "4", "_".

\W - Non-word character
A single non-word character: any character besides those mentioned for Word character.

/\W/ - Would not match strings like "a", "x", "4", "_".

. - Dot character

A single dot character represents any character, with the exception of the new line character.

/./ - Will match any single character string, with the exception of new lines.
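To see how the character types above divide up a string, here is a small sketch that counts matches with preg_match_all(), which returns the number of matches found. The sample string is made up for the example.

```php
<?php
// Hypothetical sample: letters, digits, a comma, a space, a tab and a newline.
// Double quotes are used so \t and \n become real tab and newline characters.
$sample = "Room 42,\tfloor_3\n"; // 17 characters in total

$counts = [
    'digits (\d)'         => preg_match_all('/\d/', $sample), // 3
    'non-digits (\D)'     => preg_match_all('/\D/', $sample), // 14
    'whitespace (\s)'     => preg_match_all('/\s/', $sample), // 3
    'non-whitespace (\S)' => preg_match_all('/\S/', $sample), // 14
    'word (\w)'           => preg_match_all('/\w/', $sample), // 13
    'non-word (\W)'       => preg_match_all('/\W/', $sample), // 4
    'dot (.)'             => preg_match_all('/./', $sample),  // 16, newline excluded
];

print_r($counts);
```

Each lower case type and its upper case opposite always add up to the length of the string, since every character belongs to exactly one of the two.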


Repeating Characters

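Single characters are rarely searched for. Instead we may be searching for a word of a specific length.

/\w{4}/ - This would match 4 word characters in a row, such as a 4 letter word. Without anchors or word boundaries it also matches the first 4 characters of a longer word.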

On other occasions we may have a minimum length.

/\w{4,}/ - This would match any word of 4 or more characters.

On other occasions we may have a maximum length.

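/\w{0,4}/ - This would match any word with a length from 0 up to a length of 4. Do not put a space after the comma: many PCRE versions treat {0, 4} as literal text rather than as a quantifier.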

With a slight variation of the one above, we can search for words within a set range.

/\w{2,4}/ - This would match words with a length between 2 and 4.

There are also special symbols that do this.

* This is the same as using {0,}
+ This is the same as using {1,}
? This is the same as using {0,1}
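The quantifiers above can be tried out with a sketch like the following. I have added ^ and $ anchors (not part of the patterns above) so that the whole string has to fit the quantifier; the test words are arbitrary.

```php
<?php
// Exactly 4 word characters.
$exactFour      = preg_match('/^\w{4}$/', "word");   // 1
$exactFourLong  = preg_match('/^\w{4}$/', "words");  // 0
// Unanchored, the same quantifier finds "word" inside "words".
$unanchored     = preg_match('/\w{4}/', "words");    // 1

// 4 or more, and between 2 and 4.
$fourOrMore     = preg_match('/^\w{4,}$/', "words"); // 1
$twoToFourShort = preg_match('/^\w{2,4}$/', "a");    // 0
$twoToFourOk    = preg_match('/^\w{2,4}$/', "abc");  // 1

// The shorthand symbols: * is {0,}, + is {1,} and ? is {0,1}.
$starOnEmpty    = preg_match('/^\d*$/', "");          // 1, zero digits is allowed
$plusOnEmpty    = preg_match('/^\d+$/', "");          // 0, at least one digit needed
$optionalU      = preg_match('/^colou?r$/', "color"); // 1, the "u" may be absent

var_dump($exactFour, $exactFourLong, $unanchored, $fourOrMore);
var_dump($twoToFourShort, $twoToFourOk, $starOnEmpty, $plusOnEmpty, $optionalU);
```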


The OR operator


The OR operator, written as a vertical bar, matches either the pattern on its left OR the pattern on its right.

/Hello|Goodbye/ - This will match the string "Hello" OR the string "Goodbye"
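A short sketch of the OR operator in PHP, assuming preg_match() with its optional third argument to capture which alternative was found. The subject strings are invented for the example.

```php
<?php
$subjects = ["Hello world", "Goodbye world", "Good morning"];

$found = [];
foreach ($subjects as $subject) {
    // $m[0] holds the text that matched, i.e. whichever alternative was found.
    $found[$subject] = preg_match('/Hello|Goodbye/', $subject, $m) ? $m[0] : null;
}

// "Good morning" contains neither alternative, so its entry stays null.
print_r($found);
```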


Sub patterns

Characters can be grouped together to create more complex patterns.

/\w{4}(\d{2}\w{2})?/ - This pattern searches for 4 word characters, followed by either nothing or 2 numerical digits and 2 word characters.

Usually when using sub patterns, the text matched by each sub pattern is placed into the matches array as its own element, after the full match. To prevent this, place ?: at the beginning of the sub pattern.

/\w{4}(?:\d{2}\w{2})?/
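The difference between the two patterns shows up in the matches array. A minimal sketch, assuming preg_match() is the function filling the array and using a made-up subject string:

```php
<?php
$subject = "abcd12ef"; // hypothetical input: 4 word characters, 2 digits, 2 word characters

// Capturing sub pattern: element 0 is the full match, element 1 the sub pattern.
preg_match('/\w{4}(\d{2}\w{2})?/', $subject, $capturing);

// Non-capturing sub pattern: only the full match is stored.
preg_match('/\w{4}(?:\d{2}\w{2})?/', $subject, $nonCapturing);

print_r($capturing);    // [0] => abcd12ef, [1] => 12ef
print_r($nonCapturing); // [0] => abcd12ef
```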

Character Classes

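In a character class you can specify a list of single characters, or a range of characters, that one character of the string can match.
Certain characters must be escaped with a backslash to be matched literally inside a class: the backslash, the caret, the hyphen and the closing square bracket.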

/[abc]/ - This will match any one of the strings "a", "b" or "c".

/[a-c]/ - This will also match the three strings, but is a short method of writing a large list.

However, you should be careful when using a range.

/[A-z]/ - This seems harmless, but it will actually match underscores and carets, amongst other non-alphabetic characters that sit between "Z" and "a" in ASCII. To prevent this, specify two ranges.

/[A-Za-z]/ - This overcomes the problem.

The caret has a special function when it is not backslashed and is the first character in the class. It negates the character class.

/[^0-9]/ - This would match a single character that is not numeric.
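The character class examples can be checked in PHP roughly as follows. The test strings are my own, and the last pattern is an extra illustration of escaping inside a class.

```php
<?php
$listMatch  = preg_match('/[abc]/', "b");  // 1
$rangeMatch = preg_match('/[a-c]/', "b");  // 1

// [A-z] covers the ASCII characters between "Z" and "a" as well.
$sloppyRange = preg_match('/[A-z]/', "_");    // 1, underscore slips through
$twoRanges   = preg_match('/[A-Za-z]/', "_"); // 0

// A leading caret negates the class.
$allDigits = preg_match('/[^0-9]/', "123"); // 0, every character is a digit
$hasOther  = preg_match('/[^0-9]/', "12a"); // 1, "a" is not a digit

// Escaped caret, hyphen and backslash as literal class members.
// '\\\\' in a single-quoted PHP string reaches the regex engine as \\ (one literal backslash).
$literalHyphen = preg_match('/[\^\-\\\\]/', "a-b"); // 1
$noSpecials    = preg_match('/[\^\-\\\\]/', "abc"); // 0

var_dump($listMatch, $rangeMatch, $sloppyRange, $twoRanges);
var_dump($allDigits, $hasOther, $literalHyphen, $noSpecials);
```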



Pattern Modifiers

Modifiers alter the behaviour of the pattern. They are placed after the closing delimiter.

/pattern/Z


Where Z is the modifier.

i - Case insensitive
When this modifier is in use the whole pattern becomes case insensitive.


m - Multiple Lines


When this modifier is active the string is treated as multiple lines rather than as a single line. The caret and dollar symbols then match at the start and end of every line, not just at the start and end of the whole string.

s - Dot All

When this modifier is being used the dot character can also represent the new line character.
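A quick sketch of the three modifiers in PHP, using preg_match() and preg_match_all() with made-up strings. Modifiers can also be combined, as the last line shows.

```php
<?php
// i: case insensitive matching.
$caseSensitive   = preg_match('/hello/', "HELLO");  // 0
$caseInsensitive = preg_match('/hello/i', "HELLO"); // 1

// m: ^ and $ match at the start and end of every line.
$text = "one\ntwo\nthree"; // double quotes so \n is a real newline
$linesWithoutM = preg_match_all('/^\w+$/', $text);  // 0, the string is more than one word
$linesWithM    = preg_match_all('/^\w+$/m', $text); // 3, one match per line

// s: the dot also matches the newline character.
$dotWithoutS = preg_match('/one.two/', $text);  // 0
$dotWithS    = preg_match('/one.two/s', $text); // 1

// Several modifiers can follow the closing delimiter together.
$combined = preg_match('/^ONE.TWO$/ism', $text); // 1

var_dump($caseSensitive, $caseInsensitive, $linesWithoutM, $linesWithM);
var_dump($dotWithoutS, $dotWithS, $combined);
```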



Resources
http://www.regular-expressions.info/tutorial.html
http://weblogtoolscollection.com/regex/regex.php