Regular expressions (regexes) have evolved into a small language of their own. They are a quick and efficient way to search for string patterns across a large collection of text.

So, for those who came in late...

What is a regex?

A regex is a pattern that describes a piece of text.

Regexes are a combination of literals and meta characters.

Literals:

Literals are literal pieces of text to be searched for within a given text. For example, the literal ‘a’ occurs once in the text ‘cat & dog’. But literals alone can’t do much; for advanced searching and text manipulation they have to be combined with metacharacters.

Metacharacters:

Metacharacters are a group of 11 symbols that hold a special meaning in regex syntax (most modern flavors also treat the opening curly brace ‘{’ as special). These symbols are:

1.The square bracket–>[

2.The caret–>^

3.The backslash–>\

4.The dollar–>$

5.The period–>.

6.The pipe–>|

7.The asterisk–>*

8.The question mark–>?

9.The plus sign–>+

10.The opening parenthesis–>(

11.The closing parenthesis–>)

All these metacharacters carry a definite instruction, or ‘special meaning’, for the regex engine (we’ll discuss these instructions in detail in a little while). So if you need to search for an expression that includes a plus sign or any other metacharacter, you need to escape the metacharacter with a backslash. For example, if you want to search for ‘1+1’, then ‘1\+1’ is the correct regex. In C/C++/Java source code you have to write the string literal as "1\\+1", because the compiler processes the backslash escapes in the literal first and only then hands the resulting 1\+1 to the regex engine.
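
The effect of escaping can be sketched in Python with the standard `re` module. The sample text and variable names below are made up for illustration; the doubled backslash in the third call is the Python equivalent of the C/C++/Java situation, since a plain (non-raw) string literal also processes backslashes before the regex engine sees them.

```python
import re

text = "Everyone knows that 1+1 = 2, but 11 is not 1+1."

# Unescaped, '+' means "one or more of the preceding token",
# so 1+1 matches runs of ones such as "11", never the text "1+1".
unescaped = re.findall(r"1+1", text)

# Escaped with a backslash, '+' is an ordinary plus sign.
escaped = re.findall(r"1\+1", text)

# In a non-raw string literal the backslash itself must be doubled.
doubled = re.findall("1\\+1", text)

# re.escape() adds the backslashes for you.
auto_pattern = re.escape("1+1")

print(unescaped, escaped, doubled, auto_pattern)
```

Using a raw string (the `r"..."` prefix) is the usual way to avoid the doubled backslash in Python.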

Combining a metacharacter (here the backslash) with a literal character gives a regex token.

Regex Tokens:

\d->will match a single digit from 0 to 9

\s->a whitespace character

\w->a word character(a letter, a digit or an underscore)

\t->tab character

\r->carriage return

\n->line feed or newline (the Unix line ending; Windows text files end lines with \r\n)

\a->bell

\e->escape

\f->form feed or new page or page break

\v->vertical tab
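
A few of these tokens can be tried out quickly in Python. This is only a sketch: the sample string is invented, and it exercises just \d, \s and \w (with \t and \n appearing in the text being searched), not the whole list.

```python
import re

# A made-up sample containing a space, a tab and a line feed.
sample = "Room 42\tis_open\n"

digits = re.findall(r"\d", sample)       # single digits
whitespace = re.findall(r"\s", sample)   # space, tab, line feed
words = re.findall(r"\w+", sample)       # runs of word characters

print(digits)      # ['4', '2']
print(whitespace)  # [' ', '\t', '\n']
print(words)       # ['Room', '42', 'is_open']
```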

Regex Engines:

Regex pattern matching is carried out by a regex engine. Regex engines are of 2 types:

1.Text-directed (DFA).

2.Regex-directed (NFA).

Regex engines are case sensitive by default.

To find out whether an engine is text-directed or regex-directed, apply the regex “regex|regex not” to the string “regex not”. If the match is “regex”, the engine is regex-directed; if it is “regex not”, the engine is text-directed. As a developer you will mostly encounter regex-directed engines.

Regex-directed engines always return the leftmost match, and do not check the test string any further once a match is found.

Now you can see why a regex-directed engine returns “regex” from “regex not” for the regex “regex|regex not”. This is called the eager behaviour of regex-directed engines: at each position they try the alternatives from left to right and report the first one that matches, even if a later alternative would give a longer match.
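
The test described above can be run against Python's `re` module, which behaves as a regex-directed engine. The variable names are only illustrative.

```python
import re

test_string = "regex not"

# The leftmost alternative that matches wins, even though the
# second alternative would match a longer piece of text.
eager = re.search(r"regex|regex not", test_string).group()

# Swapping the alternatives changes the result, which shows that
# their order matters to a regex-directed engine.
swapped = re.search(r"regex not|regex", test_string).group()

print(eager)    # regex
print(swapped)  # regex not
```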

Character class: Character classes, or character sets, tell the regex engine to match exactly one character out of several, where the allowed characters or ranges are listed inside square brackets in your regex. E.g.:

1.The regex “T[ae]d” will find both, “Ted” and “Tad” in a given text.

2.You can also specify ranges:

[0-9] will match a single digit from 0 to 9 in a given text.

3.You can also specify multiple ranges:

[0-9a-fA-F]–>we’re defining 3 separate ranges here (0-9, a-f and A-F), so this matches a single hexadecimal digit. The regex has absolutely nothing to do with searching for ‘9a’ or ‘fA’.
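
The three character classes above can be checked with a short Python sketch. The sample strings are invented for the demonstration.

```python
import re

# One character out of 'a' or 'e' between T and d.
names = re.findall(r"T[ae]d", "Ted met Tad and Tod.")

# A single range: each match is exactly one digit.
digits = re.findall(r"[0-9]", "a1b22")

# Three ranges in one class: each match is one hexadecimal digit.
hex_digits = re.findall(r"[0-9a-fA-F]", "9a fA xyz G7")

print(names)       # ['Ted', 'Tad']
print(digits)      # ['1', '2', '2']
print(hex_digits)  # ['9', 'a', 'f', 'A', '7']
```

Note that “9a” and “fA” come back as separate one-character matches, since a character class always matches a single character.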

Negation, the significance of a caret(^):

Remember the metacharacters? Here we’ll discuss the meaning of one of them, the caret. A literal ‘q’ will look for a q in the given text, and ‘q[^u]’ will look for a ‘q that is followed by a character that is not a u’ in the test text.

The ‘character’ part is important: the regex won’t match the q in the string “Iraq”, because no character follows it, but it will match the q and the space after it in “Iraq is”.
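
A small Python sketch of this behaviour, using invented sample strings:

```python
import re

pattern = re.compile(r"q[^u]")

# Nothing follows the q, so there is no character for [^u] to match.
at_end = pattern.search("Iraq")

# Here the q is followed by a space, which is "a character that is not u".
followed = pattern.search("Iraq is a country")

# A q followed by u never matches.
queen = pattern.search("queen")

print(at_end)                  # None
print(repr(followed.group()))  # 'q '
print(queen)                   # None
```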

How do meta characters work inside character classes?

Inside character classes the metacharacters behave in 2 ways:

1.As Meta/special characters(^,-,\,]),

(a.)\–>To search for a ‘\’ you need to escape it with  a ‘\’;

[\\x]–>will search for ‘\’ or ‘x’.

(b.)^–>Place it anywhere except immediately after the opening bracket; otherwise it takes on its special meaning of negation. Eg.: [x^]–> will look for an ‘x’ or a ‘^’.

(c.)]–>A closing bracket has to be put directly after the opening bracket or the negating caret (this works in most flavors, though not in JavaScript).

[]x]–>will look for a ‘]’ or ‘x’.

[^]x]–>will look for a character that is neither a ‘]’ nor an ‘x’.

(d.)‘-’ –> To be taken literally, the hyphen has to be placed right after the opening bracket or right before the closing bracket.

[a-d]–>will look for a character in the range a to d,

[-ad]–>will look for ‘a’ or ‘d’ or ‘-’.

[ad-]–>same as [-ad].

2.As normal characters (all metacharacters other than the above 4)–>All the other metacharacters behave as ordinary literals inside character classes. E.g.: To search for a ‘+’, you simply write ‘+’, without escaping it with a backslash;

[+*]–>will simply search for  ‘+’ or  ‘*’.
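
The character classes from this section can be tried in Python, whose `re` module accepts the bracket-first and hyphen-first/last forms shown above. The test strings are made up.

```python
import re

# (a.) An escaped backslash inside a class; the text is a, \, b, x.
backslash_or_x = re.findall(r"[\\x]", r"a\bx")

# (b.) A caret that is not first in the class is a literal caret.
caret_or_x = re.findall(r"[x^]", "x^y")

# (c.) A closing bracket placed first (or right after ^) is literal.
bracket_or_x = re.findall(r"[]x]", "a]x")
neither = re.findall(r"[^]x]", "a]x")

# (d.) A hyphen placed first or last is literal.
hyphen_first = re.findall(r"[-ad]", "a-b-d")
hyphen_last = re.findall(r"[ad-]", "a-b-d")

# 2. Other metacharacters need no escaping inside a class.
plus_or_star = re.findall(r"[+*]", "1+2*3")

print(backslash_or_x, caret_or_x, bracket_or_x, neither)
print(hyphen_first, hyphen_last, plus_or_star)
```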

Repetition, the use of  ‘+’:

+–>means that the preceding token is to be matched one or more times.

Depending on how you use ‘+’, you can repeat an entire character class or repeat the character that was actually matched.

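1.‘[0-9]+’ –> will match 837 and 222 (the entire class is repeated, so each digit can be different).

2.‘([0-9])\1+’–>This is an example of backreferencing, which we’ll study later. The parentheses capture the digit that was matched and \1 refers back to it, so this will match 222 but not 837.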
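
The difference between repeating the class and repeating the matched character can be sketched in Python. A backreference such as \1 needs a capturing group to refer to, so the sketch writes the second pattern as ([0-9])\1+.

```python
import re

# Repeating the class: every digit may be different.
runs = re.findall(r"[0-9]+", "837 and 222")

# Repeating the matched character: \1 must equal the captured digit.
same_digit = re.compile(r"([0-9])\1+")
triple_two = same_digit.fullmatch("222")
mixed = same_digit.search("837")

print(runs)                # ['837', '222']
print(triple_two.group())  # 222
print(mixed)               # None
```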

A match for everything, the use of ‘.’:

The ‘.’ matches any single character (in most flavors, anything except a line break), which makes it both useful and dangerous.

How?

Eg: Let’s say we want to match a date in the dd/mm/yy format, but the choice of separator is left to the user,

so we have a valid regex, \d\d.\d\d.\d\d

It matches a proper date, but it also matches 29505585, as the dot will happily match a ‘5’ too.

So a better, but not the best, solution is:

\d\d[- /.]\d\d[- /.]\d\d–>This allows a choice of four separators: ‘-’, ‘/’, ‘ ’ and ‘.’. The point to keep in mind is that a ‘.’ does not behave like a metacharacter once it is inside a character class.
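
Both date patterns can be compared in Python. The names `loose` and `strict` are just labels for this sketch.

```python
import re

loose = re.compile(r"\d\d.\d\d.\d\d")
strict = re.compile(r"\d\d[- /.]\d\d[- /.]\d\d")

# The dot accepts a real separator, but also a digit.
loose_date = loose.fullmatch("29/05/85") is not None
loose_digits = loose.fullmatch("29505585") is not None

# The character class only accepts the four listed separators.
strict_digits = strict.fullmatch("29505585") is not None
strict_dates = [
    strict.fullmatch(d) is not None
    for d in ("29-05-85", "29/05/85", "29 05 85", "29.05.85")
]

# Still not the best solution: mixed separators are accepted too.
mixed = strict.fullmatch("29-05.85") is not None

print(loose_date, loose_digits)  # True True
print(strict_digits)             # False
print(strict_dates, mixed)       # [True, True, True, True] True
```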

Anchors:

Literals and character classes match characters. Anchors do not match characters; rather, they match a position before, after or between characters.

^ and $ are the 2 most commonly used anchors, called the start of line (^) and end of line ($) anchors.

(The caret has a totally different use here, as it is not inside a character class, where it means negation.)

E.g.: Applied to abc, ^a matches a, but ^b will not match b, as b is not at the start of the line.

Similarly, c$ will match c, but b$ will fail, as b is not at the end of the line.
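
The anchor examples, tried against the string abc in Python:

```python
import re

text = "abc"

start_a = re.search(r"^a", text)  # a is at the start: match
start_b = re.search(r"^b", text)  # b is not at the start: no match
end_c = re.search(r"c$", text)    # c is at the end: match
end_b = re.search(r"b$", text)    # b is not at the end: no match

print(start_a.group(), start_b)  # a None
print(end_c.group(), end_b)      # c None
```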

Word Boundaries:

There are 3 different positions in any string that qualify as a word boundary:

1.Before the first character in the string, if the first character is a word character.

2.After the last character in the string, if the last character is a word character.

3.Between 2 characters in the string, where one is a word character and the other is not.
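
In most flavors, including Python, a word boundary is matched with the \b token (the token is not named in the text above, so treat its use here as an addition). The sketch below lists the boundary positions in a made-up string and then uses \b to find a whole word.

```python
import re

text = "a cat!"

# Positions where \b matches: before 'a', after 'a',
# before 'c', and between 't' and '!'.
boundaries = [m.start() for m in re.finditer(r"\b", text)]

# \bis\b only matches "is" as a whole word, not inside other words.
whole_words = re.findall(r"\bis\b", "This island is nice")

print(boundaries)   # [0, 1, 2, 5]
print(whole_words)  # ['is']
```

There is no boundary at the very end of “a cat!”, because the last character ‘!’ is not a word character.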

?, One of the greedy meta characters:

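?–>represents a regex operation to be done zero or one time, i.e., it makes the preceding token optional.

? is said to be one of the greedy metacharacters, because the engine first tries to match the optional part and skips it only if the match cannot succeed otherwise.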

Eg.:

Feb(ruary)? 23(rd)? matches February 23rd, February 23, Feb 23rd and Feb 23.
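
The optional parts and the greedy behaviour of ‘?’ can be seen in a short Python sketch. The sentence in the last call is invented.

```python
import re

pattern = re.compile(r"Feb(ruary)? 23(rd)?")

# All four spellings are accepted in full.
dates = ["February 23rd", "February 23", "Feb 23rd", "Feb 23"]
all_match = all(pattern.fullmatch(d) is not None for d in dates)

# Greedy: when the optional "rd" is present, it is included in the match.
greedy = pattern.search("Today is Feb 23rd, a Friday").group()

print(all_match)  # True
print(greedy)     # Feb 23rd
```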
