Full Flavour Behaviour!

Regex URL matcher
December 18th @ 12:00pm

You may have noticed that typing a web address into my comments makes a link automatically - as long as you include the http:// at the beginning. This happens because of a wonderful thing called a regular expression, which looks for patterns in a stream of text and picks out specific sequences of characters. Until recently the regex looked like this:

/(http:[^ <]*)/i

This basically looks for http: and then any number of characters that aren't a space or a < (the < catches a line break or some other HTML tag). It isn't by any means foolproof, but it does mean that URLs become links.
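To make that concrete, here is a minimal sketch of the old pattern in Python. The linkify_old helper and the sample comment are made up for illustration, and this is not the blog's actual code, just the same pattern applied with Python's re module.

```python
import re

# The old pattern: "http:" followed by anything up to a space or "<".
OLD_URL = re.compile(r"(http:[^ <]*)", re.IGNORECASE)


def linkify_old(comment):
    """Wrap every match in an anchor tag (hypothetical helper)."""
    return OLD_URL.sub(r'<a href="\1">\1</a>', comment)


print(linkify_old("See http://example.com/page for details<br>"))
# prints: See <a href="http://example.com/page">http://example.com/page</a> for details<br>
```

Note that an https:// address slips through untouched, because the pattern wants a colon straight after "http".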

Recently I've been working on a big database of contact information, a lot of which includes website URLs. Most of them don't include the http:// so I was thinking "okay, I'll just look for www. instead". However, then I remembered the no-www.org campaign - in fact this very blog has no www in its address - and realised that it was time to search out a proper regex.

And that's where it went wrong, because although there are many such patterns out there, they all try to validate the URL as well as find it. This means they try to list all the protocols (like ftp:// or even gopher://) and all the TLDs (.com, .uk, .museum etc.) which, to my mind, is a losing battle.

In this case I don't need to match the protocol and I'm really not bothered about false positives as long as all the real URLs do become clickable links. I noticed that gchat makes a link out of anything that has no spaces and at least one dot which is pretty much all I need. Sure, it means that a typo of "oh.my.god." becomes a link but, well, I can live with that. Users will understand what happened, right?
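As a rough sketch, the "no spaces and at least one dot" idea could be written like this in Python. This is only my reading of the behaviour, not gchat's actual implementation, and the loose_candidates name is made up for the example.

```python
import re

# "No spaces and at least one dot": a run of non-space characters
# with a dot that has something on both sides of it.
LOOSE = re.compile(r"\S+\.\S+")


def loose_candidates(text):
    """Return every chunk of text that the loose rule would turn into a link."""
    return LOOSE.findall(text)


print(loose_candidates("try example.com or oh.my.god. maybe"))
# prints: ['example.com', 'oh.my.god.']
```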

6th May 2009: Fiddled with it a little so that it only looks for things preceded by whitespace or http/https.

So screw those ludicrously long regexes; here's mine:
/(\s|https{0,1}:\/\/)([^ <>@\/.]+(\.[^ <>@.]{2,})+)/i

Works beautifully. I excluded @ so it wouldn't pick up emails, and required at least two characters after each dot so that "e.g.this" doesn't match.
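Here is a small sketch of the pattern at work in Python, with the backslashes for whitespace and literal dots written out. The find_urls helper and the sample sentence are invented for the example. Python has no regex delimiters, so the slashes don't need escaping there.

```python
import re

# Whitespace or http(s):// first, then a dotted name in which every
# part after a dot is at least two characters long.
URL = re.compile(
    r"(\s|https{0,1}://)([^ <>@/.]+(\.[^ <>@.]{2,})+)",
    re.IGNORECASE,
)


def find_urls(text):
    """Return the URL part (group 2) of every match."""
    return [m.group(2) for m in URL.finditer(text)]


sample = "Visit example.com/about or http://no-www.org but not e.g.this or me@example.com"
print(find_urls(sample))
# prints: ['example.com/about', 'no-www.org']
```

Because the pattern wants whitespace or a protocol in front, a bare address at the very start of the text is not picked up in this sketch.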
