A good url regular expression? (repost)

by Ivan Porto Carrero
Posted November 8th, 2009 at 12:02 pm

I’m moving this post from http://geekswithblogs.net/casualjim/archive/2005/12/01/61722.aspx

I started out blogging on Geeks with Blogs, but I can’t allow comments there anymore without getting too much spam, so I’m moving the post here. Various people contributed through the comments on the original post. Here I have better control over the spam and can open the comments again.

I have been looking for a good first layer of validation for a URL: a check that its format is valid.

For checking the format of a URL, regular expressions seem to me the most logical approach. Up until now I always discarded them as being too “geeky”, meaning I don’t consider it my life’s biggest goal to be typing (/?[]\w) all day long (so why did I become a programmer? Aaaah yes, to make life easier for other people).

Anyway, I wanted a good regular expression that validates full URLs, not just domain names: one that doesn’t allow spaces in the domain name and where the domain can be followed by a port number. I also need support for ~/ paths.

This is what I came up with. If somebody has a better idea or finds a mistake, please let me know. I’m always happy to learn something new.

^(((ht|f)tps?\:\/\/)|~/|/)?([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5})(:[\d]{1,5})?)/?(\w+\.[\w]{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?

I was a bit too quick in using this regex. Simeon Pilgrim pointed out that FTP URLs won’t validate when you add a username and a password.

I don’t really need to validate FTP, so I should have removed the ftp protocol from the list of choices. I need this just to validate URLs for web links and the link element in an RSS feed. When I need them for FTP I will post the FTP version, but for now I don’t have time to spend on elaborating the regex.

Anyway, here is the corrected one, without ftp: ^(http(s?)\:\/\/|~/|/)?([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?/?(\w+\.[\w]{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?

A full URL validation would include resolving the name through DNS or making a web request to the provided URL to see if we get a 200 response. In my opinion, the only way to be sure is to test whether it is actually there.
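The web-request check could look roughly like this in Python. It is only a minimal sketch using the standard library: the function name is_reachable and the five-second timeout are assumptions made for illustration, not something from the original post. urlopen resolves the host name through DNS as part of the request and follows redirects, so a 200 here means the final response was a 200.

```python
import http.client
import urllib.request
from urllib.parse import urlsplit


def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if an HTTP(S) request to `url` ends with a 200 response.

    Hypothetical helper: the name and the default timeout are illustrative.
    """
    try:
        parts = urlsplit(url)
        # Only try web URLs; anything else is rejected without a request.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            return False
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError, http.client.HTTPException):
        # DNS failures, refused connections, timeouts and HTTP error
        # statuses (4xx/5xx) all end up here.
        return False


# No request is made for input that is not an http(s) URL.
print(is_reachable("not a url"))
```

A check like this is slow and depends on the network, which is why the regex stays useful as a cheap first layer.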

Thanks Simeon.

And for those who really want the ftp validation : ^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?/?(\w+\.[\w]{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?

I am not sure about numbers in the username, but I believe you can have a username made of numbers alone.
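Whether the expression accepts such a username is easy to check, because \w matches digits as well as letters and the underscore. A small Python sketch, assuming Python's re module; has_valid_userinfo is a hypothetical helper name, and the pattern is just the user:password group from the FTP expression without its trailing "?":

```python
import re

# The user:password@ group from the FTP expression, made mandatory here.
USERINFO = re.compile(r"[\w]+:\w+@")


def has_valid_userinfo(text: str) -> bool:
    """True if `text` is exactly a user:password@ prefix the group accepts."""
    return USERINFO.fullmatch(text) is not None


print(has_valid_userinfo("joe:password@"))     # True: letters
print(has_valid_userinfo("12345:secret@"))     # True: digits only
print(has_valid_userinfo("john.doe:secret@"))  # False: "." is not matched by \w
```

So a numbers-only username passes, while usernames or passwords containing characters such as "." or "-" do not.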

Comments don’t seem to work on this blog engine, so just send me a mail through the contact form. Thanks.

Two days later …

I discovered there is still a problem with my regular expressions: folders don’t get parsed.

I’ve solved the path issue, so now it should also match URLs that contain folders.

Expression:

^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((/?\w+/)+|/?)(\w+\.[\w]{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?

It should match the URL below:

http://hh-1hallo.msn.blabla.com:80800/test/test/test.aspx?dd=dd&id=dki

But not this one, which has a space in the domain name:

http://hh-1hallo. msn.blablabla.com:80800/test/test.aspx?dd=dd&id=dki
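Both cases can be run through the expression as a quick check. This is a minimal Python sketch, assuming Python's re module behaves like the engine the expression was written for; is_valid_url is a hypothetical helper name. The expression has no trailing $, so the sketch uses re.fullmatch to require that the whole string matches.

```python
import re

# The path-aware expression from this post, split over several lines only
# for readability.
URL_PATTERN = re.compile(
    r"^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?"
    r"([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?"
    r"((/?\w+/)+|/?)(\w+\.[\w]{3,4})?"
    r"((\?\w+=\w+)?(&\w+=\w+)*)?"
)


def is_valid_url(candidate: str) -> bool:
    """True if the whole candidate string matches the expression."""
    return URL_PATTERN.fullmatch(candidate) is not None


good = "http://hh-1hallo.msn.blabla.com:80800/test/test/test.aspx?dd=dd&id=dki"
bad = "http://hh-1hallo. msn.blablabla.com:80800/test/test.aspx?dd=dd&id=dki"

print(is_valid_url(good))  # True
print(is_valid_url(bad))   # False
```

Note that the port group only checks for one to five digits, so 80800 passes even though it is higher than the largest real port number.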

Update 29/11/2008:

Joe posted what seems to be a great regular expression in the comments.

He tested it with the following URLs:

http://www.google.com/search?q=good+url+regex&rls=com.microsoft:*&ie=UTF-8&oe=UTF-8&startIndex=&startPage=1

ftp://joe:password@ftp.filetransferprotocal.com

google.ru

https://some-url.com?query=&name=joe?filter=*.*#some_anchor

Expression:

^(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/|~/|/)?(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel Domains)(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)(?::[\d]{1,5})?(?#Directories)(?:(?:(?:/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|/)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d]{2})+=(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d]{2})+=(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?$

Update 8/11/2009:

Expression:

^(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/|~\/|\/)?(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel Domains)(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)(?::[\d]{1,5})?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d]{2})+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d]{2})+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?$
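To make the long expression easier to read and test, it can be assembled from its labelled parts. The following is a Python sketch under a few assumptions: the constant names and the is_valid_url helper are made up for illustration, the re.IGNORECASE flag is my guess at how the lowercase classes are meant to be used, and percent escapes are written as %[a-f\d]{2} throughout. The sample URLs are the ones Joe tested with.

```python
import re

# One constant per (?#...) label in the expression.
PROTOCOL = r"(?:(?:ht|f)tp(?:s?)\:\/\/|~\/|\/)?"
USERINFO = r"(?:\w+:\w+@)?"
HOST = (
    r"(?:(?:[-\w]+\.)+"
    r"(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))"
)
PORT = r"(?::[\d]{1,5})?"
DIRECTORIES = r"(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/)+|\?|#)?"
# A single key=value pair; the "=" is optional, as in the 8/11/2009 update.
QUERY_PAIR = r"(?:[-\w~!$+|.,*:]|%[a-f\d]{2})+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*"
QUERY = r"(?:(?:\?" + QUERY_PAIR + r")(?:&" + QUERY_PAIR + r")*)*"
ANCHOR = r"(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?"

URL_PATTERN = re.compile(
    "^" + PROTOCOL + USERINFO + HOST + PORT + DIRECTORIES + QUERY + ANCHOR + "$",
    re.IGNORECASE,  # assumption: the post does not say which flags to use
)


def is_valid_url(candidate: str) -> bool:
    """True if the whole candidate string matches the assembled expression."""
    return URL_PATTERN.fullmatch(candidate) is not None


SAMPLE_URLS = [
    "http://www.google.com/search?q=good+url+regex&rls=com.microsoft:*"
    "&ie=UTF-8&oe=UTF-8&startIndex=&startPage=1",
    "ftp://joe:password@ftp.filetransferprotocal.com",
    "google.ru",
    "https://some-url.com?query=&name=joe?filter=*.*#some_anchor",
]

for url in SAMPLE_URLS:
    print(is_valid_url(url), url)
```

All four sample URLs match. A host given as an IP address, such as http://127.0.0.1/, does not, because the host part must end in one of the listed top-level domains or two letters.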

There is a wave for this regex:

https://wave.google.com/wave/?pli=1#restored:wave:googlewave.com!w%252BsFbGJUukA

16 Responses
  • Beni
    November 29, 2009 at 12:59

    .asia ?

  • wilma
    November 30, 2009 at 17:57

    Nice work. But this regex doesn’t cover IP-Addresses as hostnames. Unfortunately i was unable to fix it in a good way.

  • Manasto
    December 3, 2009 at 08:12

    um…. i think you would absolutely be safe with something like (?:[a-zA-Z]{2,8}) for Top Level Domains. I don’t know what the max is, but i know the minimum is 2 characters for a dn.

  • Esben stien
    December 5, 2009 at 21:01

    Well, this is not a general URL regular expression

    I’ve never tried to create one myself, cause I believe the IETF or some friends of them has already done this.

    Btw, something like http/ftp should not be included in the regex. as long as it follow a general format of : it’s a URL

  • ivan
    December 5, 2009 at 23:00

    I think you missed the part of the post where it said what the purpose was. I’m sorry if URL isn’t used in the exact definition of the abbreviation URL. But lets not dwell on that.

  • testestest
    December 7, 2009 at 14:47

    what about domains with asian or russian characters? i think now there are even chinese or korean top level domains possible, for example “中国”…

  • smb
    January 7, 2010 at 17:37

    Don’t care about russian and asian characters. Before you validate an URL, you have to convert it to PunyCode, which only uses A-Z, 0-9 and -. So “中国” ain’t in URLs

  • petrelevich
    February 5, 2010 at 21:08

    What would you say about this?

    ^(?:(?:(?:http[s]?):\/\/)|(?:www.))(?:[-_0-9a-z]+.)+[-_0-9a-z]{2,4}[:0-9]*[\/]*$

  • J. Benitez
    February 7, 2010 at 05:16

    There’s no browser that can handle it

  • matapult
    February 8, 2010 at 11:55

    Nope, didn’t work for me. I tried this ultra-basic implementation in python

    —————————————————————
    #!/usr/bin/env python

    import re
    line = "this is an example of a url that someone might write in some text http://www.google.com test test test"
    urls = []

    urls.append( re.findall( r"^(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/|~\/|\/)?(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel Domains)(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)(?::[\d]{1,5})?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d]{2})+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d]{2})+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?$", line ) )

    print(urls)
    —————————————————

    and it didnt find anything, not even the basic google.com :(

  • Rene
    February 15, 2010 at 20:12

    I tried it with PHP and works flawlessly, I tried like 8 different ones from regexlibrary.com and this one killed them all. I’m just wondering if it will work with JS, I’ll keep you guys posted…

  • zerkms
    February 16, 2010 at 01:40

    what about “&” in urls?

  • zerkms
    February 16, 2010 at 01:42

    in the last comment i mean “& amp;” (without space)

  • Mina
    February 23, 2010 at 08:45

    Thanks a lot for this wonderful article

  • Alexey Novikov
    March 14, 2010 at 11:09

    I’ve enhanced it. Now it works with IPs
    Even if you write 1.01.001.000

    ^(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/|~\/|\/)?(?#Username:Password)(?:\w+:\w+@)?((?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel Domains)(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))|(?#IP)((\b25[0-5]\b|\b[2][0-4][0-9]\b|\b[0-1]?[0-9]?[0-9]\b)(\.(\b25[0-5]\b|\b[2][0-4][0-9]\b|\b[0-1]?[0-9]?[0-9]\b)){3}))(?#Port)(?::[\d]{1,5})?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d]{2})+=?(?:[-\w~!$+|.,*:;=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d]{2})+=?(?:[-\w~!$+|.,*:;=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:;=]|%[a-f\d]{2})*)?$

  • Alexey Novikov
    March 14, 2010 at 18:21

    By the way, there can be slashes after the anchor symbol (octothorp or dies or sharp). For example, in some gmail links. So if we do not want to mark such URLs as invalid, we should include the slash in the last part of this regexp:

    (?#Anchor)(?:#(?:[-\w~!$+|/.,*:;=]|%[a-f\d]{2})*)?$
