Regular Expression Basics

Regular Expressions are a powerful feature of the JavaScript language. Essentially, they allow you to search and manipulate strings of text using both simple and complex patterns.

For example, you’ll normally find a lot of form validation is done using Regular Expressions (see: http://codecanyon.net/item/validate/54689).

Regular Expressions are available in one form or another in most programming languages. One minor point to make here is that although most programming languages include Regular Expressions, they can differ in implementation from language to language. So you may find the particular Regular Expression feature you want to use is not supported in your language of choice.

JavaScript, for example, models its syntax on Perl 5 but implements only a subset of it (at the time of writing it doesn’t support certain features such as ‘look-behind’ constructs).

When writing a Regular Expression you will start with the string you want to match a pattern against. You’ll write your Regular Expression ‘pattern’ and then assign one (or more) ‘modifiers’ to the Regular Expression object, which will affect how the pattern is run.

For example, if you wanted to search a string for the word “JavaScript” your pattern would be /JavaScript/, but if your string had that word only in lowercase (e.g. “I love javascript”) then the pattern match would fail. So you would assign a “case insensitive” modifier to let the Regular Expression engine know that it is free to find matches as either “JavaScript” or “javascript” or even “JAVASCRIPT”.

There are three types of modifiers:

  • i – “ignore case” – the case (uppercase/lowercase) of all letters is ignored during matching.
  • g – “global search” – the search continues through the entire string and finds every match, rather than stopping after the first one.
  • m – “multiline search” – the ^ and $ anchors match at the start and end of each line, not just at the start and end of the whole string.
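
As a rough illustration of the three modifiers, here is a minimal sketch. The sample text and variable names are made up for this example and are not part of the examples further down.

```javascript
// Hypothetical two-line sample text, chosen only for this illustration.
var text = "JavaScript is fun.\njavascript is everywhere.";

// No modifier: case-sensitive, and only the first match is returned.
var plain = text.match(/javascript/);           // first element: "javascript"

// i: case is ignored, but still only the first match is returned.
var ignoreCase = text.match(/javascript/i);     // first element: "JavaScript"

// g (combined with i): every match in the string is returned.
var globalMatches = text.match(/javascript/ig); // ["JavaScript", "javascript"]

// m: ^ also matches at the start of each line, not just the start of the string.
var withoutM = text.match(/^javascript/ig);     // ["JavaScript"]
var withM = text.match(/^javascript/igm);       // ["JavaScript", "javascript"]
```

Modifiers can be combined in any order, so /javascript/gi and /javascript/ig behave the same way.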

In JavaScript there are essentially three main methods for handling Regular Expressions:

  1. String.match (returns an Array of matches, or null if there are none)
  2. RegExp.exec (returns an Array for a single match, or null if there is none)
  3. RegExp.test (returns a Boolean true/false)

The main difference between String.match and RegExp.exec shows up when the global modifier is used: String.match returns all of the full matches but no capture groups, whereas RegExp.exec returns one match per call (starting with the first) together with its capture groups. Without the global modifier the two return the same result. Note: a capture group is created when you wrap part of a pattern in parentheses; any text matched inside () is remembered after the Regular Expression has run and can be reused or moved in a replacement (see below for an example of this).
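
To make that difference concrete, here is a small sketch. The input string and the names fullMatches, pairs and result are assumptions for this illustration only.

```javascript
var text = "width=100 height=200"; // hypothetical sample input
var pattern = /(\w+)=(\d+)/g;      // two capture groups: the name and the number

// String.match with the g modifier: full matches only, capture groups are dropped.
var fullMatches = text.match(pattern); // ["width=100", "height=200"]

// RegExp.exec: one match per call, capture groups included.
// With the g modifier the regex tracks its position in pattern.lastIndex,
// so calling exec in a loop walks through the matches one by one.
pattern.lastIndex = 0; // defensive reset before looping
var pairs = [];
var result;
while ((result = pattern.exec(text)) !== null) {
	// result[0] is the full match, result[1] and result[2] are the groups
	pairs.push(result[1] + " -> " + result[2]);
}
console.log(pairs); // ["width -> 100", "height -> 200"]
```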

The RegExp.test method is probably the most useful for validation, because the majority of the time you just want to check whether a specific value is present.
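
A validation check with RegExp.test might look something like the sketch below. The username rule and the isValidUsername helper are hypothetical, picked only to show the shape of such a check.

```javascript
// Hypothetical rule: a username is 3 to 16 letters, digits or underscores.
// ^ and $ anchor the pattern so the whole value has to match.
var usernamePattern = /^\w{3,16}$/;

function isValidUsername(value) {
	return usernamePattern.test(value);
}

console.log(isValidUsername("mark_2010")); // true
console.log(isValidUsername("no spaces")); // false
console.log(isValidUsername("ab"));        // false (too short)
```

Note that the pattern deliberately has no g modifier: a global regex remembers its position between calls to test, which can give surprising results when the same object is reused.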

I’ve included some extremely basic examples below as a way to gently introduce newcomers to the terminology of Regular Expressions. A great tool for testing Regular Expressions can be found here: http://gskinner.com/RegExr/ (I highly recommend you try it).

In my examples I’ve used the String.match method of matching patterns and the ‘regexp literal’ form of creating a Regular Expression (e.g. /my-regexp-pattern/), but there is also the RegExp() constructor, which allows you to create Regular Expression objects.

The main benefit of the RegExp constructor is that it lets you build patterns dynamically, i.e. from data that isn’t known until runtime (for example, when the user specifies a word they want to search for).
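
Here is a minimal sketch of a dynamic pattern built with the RegExp constructor. The helper names escapeRegExp and countOccurrences are made up for this example, and it assumes the user's word should be matched literally, so any special characters are escaped first.

```javascript
// Escape characters that have a special meaning in a pattern,
// so that the user's word is matched literally.
function escapeRegExp(text) {
	return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: count occurrences of a user-supplied word, ignoring case.
function countOccurrences(haystack, word) {
	// The second argument holds the modifiers, just like the letters after a literal.
	var pattern = new RegExp(escapeRegExp(word), "ig");
	var matches = haystack.match(pattern);
	return matches ? matches.length : 0; // match returns null when nothing is found
}

var sample = "I love javascript. JavaScript (a.k.a. JScript) is fun.";
console.log(countOccurrences(sample, "javascript")); // 2
console.log(countOccurrences(sample, "a.k.a."));     // 1
```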

Regular Expression Basic Examples…

NOTE: Two of the examples use what is called a ‘look-ahead’ construct (I demonstrate a positive and a negative variation of the look-ahead). Although Regular Expressions do have a ‘look-behind’ construct, JavaScript didn’t implement it at the time of writing (it was added later, in ECMAScript 2018). There are ways to mimic it in older engines, but that is outside the scope of this article so I’ll refer you to here: http://blog.stevenlevithan.com/archives/mimic-lookbehind-javascript

<script>
	var string = "Here is some text about JavaScript (also known as JScript). It contains multiple references to jscript/javascript because this article is all about JAVASCRIPT.";
 
	// returns: [JavaScript]
	// Although there are multiple references to the word "javascript" I've not included the "global" modifier so it has only returned the first match
	console.log(string.match(/javascript/i));
 
	// returns: [JavaScript,javascript,JAVASCRIPT]
	console.log(string.match(/javascript/ig));
 
	// returns: ["Java", "java", "JAVA"]
	console.log(string.match(/java/ig));
 
	// returns: [a,J,a,v,a,a,a,J,a,j,j,a,v,a,a,a,a,a,J,A,V,A]
	// Match any single character in the set (e.g. /defen[cs]e/ would match either defense or defence).
	// Here the set is j, a and v (repeating the "a" inside the brackets has no effect).
	console.log(string.match(/[java]/ig));
 
	// returns: [e,r,e,i,s,s,o,m,e,t,e,x,t,a,b,o,u,t,a,v,a,c,r,i,p,t,a,l,s,o,k,n,o,w,n,a,s,c,r,i,p,t,t,c,o,n,t,a,i,n,s,m,u,l,t,i,p,l,e,r,e,f,e,r,e,n,c,e,s,t,o,j,s,c,r,i,p,t,j,a,v,a,s,c,r,i,p,t,b,e,c,a,u,s,e,t,h,i,s,a,r,t,i,c,l,e,i,s,a,l,l,a,b,o,u,t] 
	// Match any single character in the range a to z (lowercase only, as there is no "i" modifier).
	// Notice the capital letters are missing from the result set.
	console.log(string.match(/[a-z]/g));
 
	// returns: [H, , , , , ,J,S, ,(, , , ,J,S,),., ,I, , , , , ,/, , , , , , , ,J,A,V,A,S,C,R,I,P,T,.] 
	// Match any single character that is not in the set.
	// Matches anything that isn't a lowercase letter (so capitals, punctuation and the spaces in between words are all included)
	console.log(string.match(/[^a-z]/g));
 
	// returns: [J,J,j,j,J]
	// Alternation. Equivalent of "or". Matches the full expression before or after the |.
	console.log(string.match(/J|j/g));
 
	// returns: [J,j,J] 
	// Positive lookahead. Matches a group after your main expression without including it in the result.
	// Matches j that have "ava" following them (so jscript or JScript don't match)
	console.log(string.match(/J(?=ava)/ig));
 
	// returns: [J,j] 
	// Negative lookahead. Specifies a group that cannot match after your main expression (i.e. if it matches, the result is discarded).
	// Matches any j that ISN'T followed by "ava" (so jscript & JScript return a match)
	console.log(string.match(/J(?!ava)/ig));
 
	// returns: [ , , , , , , , , , , , , , , , , , , , , , ]
	// \s matches any whitespace character (spaces, tabs, line breaks).
	console.log(string.match(/\s/g));
 
	// returns: [JavaScript,javascript,JAVASCRIPT] 
	// \b matches a word boundary: the position between a word character and a non-word character (such as whitespace or punctuation), or the beginning or end of the string.
	console.log(string.match(/J.{6}ipt\b/ig));
 
	// returns: [JavaScript,javascript,JAVASCRIPT]
	// {} means to match a pattern a specified number of times (in this case any character 6 times after "J" and ending in "ipt")
	console.log(string.match(/J.{6}ipt/ig));
 
	// returns: "Here is some text about JJavaScript (also known as JScript). It contains multiple references to jscript/jjavaScript because this article is all about JJavaScript."
	// Groups multiple tokens together. This allows you to apply quantifiers to the full group. This creates a capturing group.
	// Notice the double "JJ" where we've captured the letter j and then replace it with itself twice.
	// If you capture more than one item then all subsequent capture groups are referenced by number incrementally (e.g. $2, $3, $4 etc.)
	console.log(string.replace(/(j)avascript/ig, '$1$1avaScript'));
</script>
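
The last example only uses $1. As a small sketch of how numbered references work with several capture groups, the snippet below reorders a date; the date string and the variable names are hypothetical.

```javascript
var date = "2010-02-07"; // hypothetical input in year-month-day form

// Three capture groups, numbered left to right: $1 = year, $2 = month, $3 = day.
var reordered = date.replace(/(\d{4})-(\d{2})-(\d{2})/, "$3/$2/$1");

console.log(reordered); // "07/02/2010"
```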
Feb 7, 2010 | JavaScript, Regular Expressions