pineapplecharm.com
Regex URL matcher
Posted Tuesday, 18 December 2007
You may have noticed that typing a web address into my comments makes a link automatically - as long as you include the http:// at the beginning. This happens because of a wonderful thing called a regular expression which looks for patterns in a stream of text and isolates specific sequences of letters. Until recently the regex looked like this:
/(http:[^ <]*)/i

Which basically looks for http: and then any number of characters which aren't a space or a < (i.e. the start of a line break or some other HTML tag). This isn't by any means foolproof but it does mean that URLs become links.
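
For anyone who wants to poke at it, here is a minimal Python sketch of that old pattern. The code behind the comments isn't shown in this post, so the linkify_old helper and the sample strings are made up for illustration; only the pattern itself comes from above.

```python
import re

# The old pattern from the post: "http:" followed by any run of characters
# that are neither a space nor "<". The /i flag becomes re.IGNORECASE.
OLD_URL = re.compile(r"(http:[^ <]*)", re.IGNORECASE)


def linkify_old(text: str) -> str:
    """Wrap every match in an anchor tag (hypothetical helper, not the blog's code)."""
    return OLD_URL.sub(r'<a href="\1">\1</a>', text)


print(linkify_old("See http://example.com/page for details<br>"))
# → See <a href="http://example.com/page">http://example.com/page</a> for details<br>

# Two of the ways it isn't foolproof:
print(OLD_URL.findall("Go to http://example.com."))  # → ['http://example.com.']
print(OLD_URL.search("https://example.com"))         # → None
```

The trailing full stop gets swallowed into the link, and anything that doesn't start with exactly http: (https included) is skipped.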

Recently I've been working on a big database of contact information, a lot of which includes website URLs. Most of them don't include the http:// so I was thinking "okay, I'll just look for www. instead". However, then I remembered the no-www.org campaign - in fact this very blog has no www in its address - and realised that it was time to search out a proper regex.

And that's where it went wrong because although there are many such patterns out there, they all try to validate the URL as well as find it. This means they try to list all the protocols (like ftp:// or even gopher://) and all the TLD extensions (.com, .uk, .museum etc) which, to my mind, is a losing battle.

In this case I don't need to match the protocol and I'm really not bothered about false positives as long as all the real URLs do become clickable links. I noticed that gchat makes a link out of anything that has no spaces and at least one dot which is pretty much all I need. Sure, it means that a typo of "oh.my.god." becomes a link but, well, I can live with that. Users will understand what happened, right?

So screw those ludicrously long regexes; here's mine:
/([^ <>@\/]+(\.[^ <>@.][^ <>@.]+)+)/i

Works beautifully. I excluded @ so that a match can never swallow a whole email address, and adjusted it to require at least two characters after each dot so that an abbreviation like "e.g." doesn't match.
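
Here is a rough Python sketch for trying the new pattern on a few strings. Two assumptions are baked in: the dot is written as an escaped, literal dot (which is what "at least one dot" implies), and the find_url_like helper and the sample sentences are invented for the example rather than taken from the real comment code.

```python
import re

# The new pattern, with the dot escaped so it only matches a literal "."
# (an assumption based on the "at least one dot" description).
URL_LIKE = re.compile(r"([^ <>@/]+(\.[^ <>@.][^ <>@.]+)+)", re.IGNORECASE)


def find_url_like(text: str) -> list[str]:
    """Return every substring the pattern picks out (hypothetical helper)."""
    return [m.group(1) for m in URL_LIKE.finditer(text)]


print(find_url_like("visit pineapplecharm.com or no-www.org today"))
# → ['pineapplecharm.com', 'no-www.org']

print(find_url_like("see http://example.com/page"))
# → ['example.com/page']

print(find_url_like("i.e. and e.g. are left alone"))
# → []

print(find_url_like("write to carl@example.com"))
# → ['example.com']
```

The protocol is dropped because the first character class refuses a slash, so the match starts after the "//". The last line is worth noting: the @ keeps the address as a whole from matching, but an unanchored search still finds the domain after it.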