I’m moving this post from http://geekswithblogs.net/casualjim/archive/2005/12/01/61722.aspx
I started out blogging on Geeks with Blogs, but I can’t allow comments there anymore without getting too much spam, so I’m moving the post from there to this place. Various people have contributed through the comments on the other blog post. Here I have better control over the spam and can open the comments again.
I have been looking for a good first layer of validation to check whether a URL is valid.
For checking the format of a URL, regular expressions seem to me the most logical approach. Up until now I always dismissed them as being too “geeky”, meaning I don’t consider it my life’s biggest goal to be typing (/?[]\w) all day long (so why did I become a programmer? Aaaah yes, to make life easier for other people).
Anyway, I wanted to find a good regular expression that validates whole URLs, not just domain names: one that doesn’t allow spaces in the domain name and where the domain can be suffixed with a port number. I also need support for ~/ paths.
This is what I came up with. If somebody has a better idea or finds a mistake, please let me know. Always happy to learn something new.
^(((ht|f)tps?\:\/\/)|~/|/)?([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5})(:[\d]{1,5})?)/?(\w+\.[\w]{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?
I was a bit too quick in using this regex. Simeon Pilgrim pointed out that FTP URLs won’t validate when you add a username and a password.
I don’t really need to validate ftp so I should have removed the ftp protocol from the list of choices. I need this just to validate urls for weblinks and the link element in an rss feed. When I need them for ftp I will post the ftp version.. but for now I don’t have time to spend on elaborating the regex.
Anyway here is the right one : ^(http(s?)\:\/\/|~/|/)?([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?/?(\w+\.[\w]{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?
A full URL validation would include resolving names through DNS or making a web request to the provided URL to see if we get a 200 response. In my opinion, the only way to be sure is to test whether it is there.
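The DNS half of that idea can be sketched in a few lines. This is only an illustration in Python, not something the post itself uses: `host_resolves` and its `resolver` parameter are names made up here, and the resolver is injectable purely so the function can be exercised without a network.

```python
import socket
from urllib.parse import urlsplit

def host_resolves(url, resolver=socket.getaddrinfo):
    """Return True if the host part of `url` resolves.

    `resolver` is a hypothetical hook (defaults to socket.getaddrinfo)
    so the check can be tested without touching the network.
    """
    host = urlsplit(url).hostname  # None when the URL has no "//host" part
    if not host:
        return False
    try:
        resolver(host, None)
    except socket.gaierror:
        # Name resolution failed: the host is unknown (or DNS is unreachable).
        return False
    return True
```

Calling `host_resolves("http://flanders.co.nz/")` performs a real lookup. Note that a scheme-less string such as `google.ru` has no host part as far as `urlsplit` is concerned, so this sketch reports False for it; a successful lookup also says nothing about whether the page behind the URL exists.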
Thanks Simeon.
And for those who really want the ftp validation : ^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?/?(\w+\.[\w]{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?
I am not sure about numbers in the username but I believe you can have a username of numbers alone.
Comments don’t seem to work on this blog engine.. so just send me a mail through the contact form. thanks
Two days later …
I discovered there is still a problem with my regular expressions… folders don’t get parsed.
I’ve solved the path issue, so now it should be finding all URLs.
Expression:
^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((/?\w+/)+|/?)(\w+\.[\w]{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?
Should parse the url below
http://hh-1hallo.msn.blabla.com:80800/test/test/test.aspx?dd=dd&id=dki
But not :
http://hh-1hallo. msn.blablabla.com:80800/test/test.aspx?dd=dd&id=dki
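A quick way to check the two examples above is to run the expression against them. The sketch below uses Python's `re` module, which is a choice made for this illustration rather than anything the post prescribes; `PATTERN` and `looks_like_url` are made-up names.

```python
import re

# The expression from the post, copied verbatim and split over several
# lines for readability. Note that it has no trailing "$".
PATTERN = re.compile(
    r"^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?"
    r"([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?"
    r"((/?\w+/)+|/?)(\w+\.[\w]{3,4})?"
    r"((\?\w+=\w+)?(&\w+=\w+)*)?"
)

def looks_like_url(candidate):
    """Return True when the expression matches from the start of `candidate`."""
    return PATTERN.match(candidate) is not None

good = "http://hh-1hallo.msn.blabla.com:80800/test/test/test.aspx?dd=dd&id=dki"
bad = "http://hh-1hallo. msn.blablabla.com:80800/test/test.aspx?dd=dd&id=dki"

print(looks_like_url(good))  # the URL that should parse
print(looks_like_url(bad))   # the URL with a space in the domain
```

Because the expression is not anchored at the end, `match` only tells you that the string starts with something URL-shaped; trailing junk after a valid prefix is still accepted. `PATTERN.fullmatch` would be the stricter check.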
Update 29/11/2008:
Joe posted what seems to be a great regular expression in the comments.
He tested it with the following URLs:
http://www.google.com/search?q=good+url+regex&rls=com.microsoft:*&ie=UTF-8&oe=UTF-8&startIndex=&startPage=1
ftp://joe:password@ftp.filetransferprotocal.com
google.ru
https://some-url.com?query=&name=joe?filter=*.*#some_anchor
Expression:
^(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/|~/|/)?(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel Domains)(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)(?::[\d]{1,5})?(?#Directories)(?:(?:(?:/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|/)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?$
Update 8/11/2009:
Expression:
^(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/|~\/|\/)?(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel Domains)(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)(?::[\d]{1,5})?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?$
There is a wave for this regex:
https://wave.google.com/wave/?pli=1#restored:wave:googlewave.com!w%252BsFbGJUukA
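To see how the 2009 expression behaves on the four URLs Joe tested with, it can be run as-is in any engine that understands `(?#...)` comment groups. The following is a sketch in Python (my choice for illustration); `URL_RE` and `JOES_URLS` are names invented here, and the expression is copied verbatim, including the `%[a-f\d{2}]` fragments.

```python
import re

# The 8/11/2009 expression, split at its (?#...) comment groups.
URL_RE = re.compile(
    r"^(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/|~\/|\/)?"
    r"(?#Username:Password)(?:\w+:\w+@)?"
    r"(?#Subdomains)(?:(?:[-\w]+\.)+"
    r"(?#TopLevel Domains)(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))"
    r"(?#Port)(?::[\d]{1,5})?"
    r"(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/)+|\?|#)?"
    r"(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)"
    r"(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*"
    r"(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?$"
)

JOES_URLS = [
    "http://www.google.com/search?q=good+url+regex&rls=com.microsoft:*&ie=UTF-8&oe=UTF-8&startIndex=&startPage=1",
    "ftp://joe:password@ftp.filetransferprotocal.com",
    "google.ru",
    "https://some-url.com?query=&name=joe?filter=*.*#some_anchor",
]

for url in JOES_URLS:
    # Print whether the whole string is accepted by the expression.
    print(URL_RE.match(url) is not None, url)
```

The expression is anchored with `^` and `$`, so it validates a whole string rather than finding URLs inside longer text.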
Update 29/09/2010
So people, if you don’t like it, don’t use it.
Now, this regex is troubled: it has a bunch of issues, but it works most of the time. If you want a more liberal regular expression to just capture URLs from text, there is a really good one on the blog of John Gruber.
Improved regex for matching urls @ daring fireball


.asia ?
Nice work. But this regex doesn’t cover IP-Addresses as hostnames. Unfortunately i was unable to fix it in a good way.
um…. I think you would absolutely be safe with something like (?:[a-zA-Z]{2,8}) for top-level domains. I don’t know what the max is, but I know the minimum is 2 characters for a dn.
Well, this is not a general URL regular expression
I’ve never tried to create one myself, because I believe the IETF or some friends of theirs have already done this.
Btw, something like http/ftp should not be included in the regex. as long as it follow a general format of : it’s a URL
I think you missed the part of the post where it said what the purpose was. I’m sorry if URL isn’t used in the exact definition of the abbreviation URL. But lets not dwell on that.
what about domains with asian or russian characters? i think now there are even chinese or korean top level domains possible, for example “中国”…
Don’t care about russian and asian characters. Before you validate an URL, you have to convert it to PunyCode, which only uses A-Z, 0-9 and -. So “中国” ain’t in URLs
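For readers who want to see what that conversion looks like, here is a small sketch using the `idna` codec that ships with Python. The helper name `to_ascii_host` and the host `example.中国` are made up for illustration, and the built-in codec follows the older IDNA 2003 rules, so treat it as a demonstration rather than a complete solution.

```python
def to_ascii_host(host):
    """Encode an internationalized host name to its ASCII (Punycode) form."""
    return host.encode("idna").decode("ascii")

ascii_host = to_ascii_host("example.中国")
print(ascii_host)  # each non-ASCII label becomes an "xn--..." label
```

After this step the host contains only ASCII letters, digits, hyphens and dots, so an ASCII-only expression like the ones in this post can be applied to it.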
What would you say about this?
^(?:(?:(?:http[s]?):\/\/)|(?:www.))(?:[-_0-9a-z] .) [-_0-9a-z]{2,4}[:0-9]*[\/]*$
There’s no browser that can handle it
Nope, didn’t work for me. I tried this ultra-basic implementation in python
—————————————————————
#!/usr/bin/env python
import re
line = "this is an example of a url that someone might write in some text http://www.google.com test test test"
urls = []
urls.append( re.findall( r"^(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/|~\/|\/)?(?#Username:Password)(?:\w :\w @)?(?#Subdomains)(?:(?:[-\w] \.) (?#TopLevel Domains)(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)(?::[\d]{1,5})?(?#Directories)(?:(?:(?:\/(?:[-\w~!$ |.,=]|%[a-f\d]{2}) ) |\/) |\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$ |.,*:]|%[a-f\d{2}]) =?(?:[-\w~!$ |.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$ |.,*:]|%[a-f\d{2}]) =?(?:[-\w~!$ |.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$ |.,*:=]|%[a-f\d]{2})*)?$", line ) )
print(urls)
—————————————————
and it didn’t find anything, not even the basic google.com
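One likely contributor to the empty result, independent of the details of the expression: it starts with `^` and ends with `$`, so it can only match a string that is a URL from start to finish, and `re.findall` has nothing to find in a sentence that merely contains one. The sketch below shows the effect with a deliberately simplified stand-in pattern (not the expression from the post).

```python
import re

line = ("this is an example of a url that someone might write in some text "
        "http://www.google.com test test test")

# Simplified stand-in patterns, NOT the expression from the post.
anchored = re.compile(r"^https?://\S+$")   # must match the whole string
unanchored = re.compile(r"https?://\S+")   # may match anywhere in the string

print(anchored.findall(line))    # [] because the whole line is not a URL
print(unanchored.findall(line))  # ['http://www.google.com']
```

So an anchored validation expression is the wrong tool for pulling URLs out of running text; for that job the anchors have to go, or the text has to be split into candidate tokens first.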
I tried it with PHP and works flawlessly, I tried like 8 different ones from regexlibrary.com and this one killed them all. I’m just wondering if it will work with JS, I’ll keep you guys posted…
what about “&” in urls?
in the last comment i mean “& amp;” (without space)
Thanks a lot for this wonderful article
I’ve enhanced it. Now it works with IPs
Even if you write 1.01.001.000
^(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/|~\/|\/)?(?#Username:Password)(?:\w :\w @)?((?#Subdomains)(?:(?:[-\w] \.) (?#TopLevel Domains)(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))|(?#IP)((\b25[0-5]\b|\b[2][0-4][0-9]\b|\b[0-1]?[0-9]?[0-9]\b)(\.(\b25[0-5]\b|\b[2][0-4][0-9]\b|\b[0-1]?[0-9]?[0-9]\b)){3}))(?#Port)(?::[\d]{1,5})?(?#Directories)(?:(?:(?:\/(?:[-\w~!$ |.,=]|%[a-f\d]{2}) ) |\/) |\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$ |.,*:]|%[a-f\d{2}]) =?(?:[-\w~!$ |.,*:;=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$ |.,*:]|%[a-f\d{2}]) =?(?:[-\w~!$ |.,*:;=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$ |.,*:;=]|%[a-f\d]{2})*)?$
By the way, there can be slashes after the anchor symbol (octothorp or dies or sharp), for example in some Gmail links. So if we do not want to mark such URLs as invalid, we should include the slash in the last part of this regexp:
(?#Anchor)(?:#(?:[-\w~!$ |/.,*:;=]|%[a-f\d]{2})*)?$
Thanks Ivan. You have really helped me with this extensive regular expression. I was wanting to validate a URL in Java and was surprised that there was no decent library to do this (Apache commons which has one doesn’t build properly and seems outdated – and anyway you have to import thousands of lines of code to validate a URL… yawn! lol).
Here’s the Java code for anyone interested (where URL_REGEXP is your regular expression – don’t forget to replace all ‘\’ with ‘\\’ when setting it as a constant):
String url = "http://flanders.co.nz/2009/11/08/a-good-url-regular-expression-repost/";
if (!url.matches(URL_REGEXP))
{
// Invalid URL.
}
else
{
// Valid URL.
}
Here’s a Java class I wrote using your regular expression:
/**
* This class implements URL validation.
* @author Mike Youell (wrote class) + Ivan Porto Carrero (wrote regular expression).
* @date 06 May 2010
*/
public class URLValidator
{
// The regular expression which validates a URL.
// Found here: http://flanders.co.nz/2009/11/08/a-good-url-regular-expression-repost/
private static final String URL_VALIDATION_REG_EXP = "^(?:(?:ht|f)tp(?:s?)\\:\\/\\/|~\\/|\\/)?(?:\\w+:\\w+@)?(?:(?:[-\\w]+\\.)+(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?::[\\d]{1,5})?(?:(?:(?:\\/(?:[-\\w~!$+|.,=]|%[a-f\\d]{2})+)+|\\/)+|\\?|#)?(?:(?:\\?(?:[-\\w~!$+|.,*:]|%[a-f\\d{2}])+=?(?:[-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)(?:&(?:[-\\w~!$+|.,*:]|%[a-f\\d{2}])+=?(?:[-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)*)*(?:#(?:[-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)?$";
/**
* Is the URL valid?
* @param[in] url - The URL to validate.
* @return true if it's a valid URL, otherwise false.
*/
static public boolean isValid(String url)
{
// Is the URL valid?
if (url.matches(URL_VALIDATION_REG_EXP))
{
// yes.
return true;
}
// no.
return false;
}
}
p.s. there’s a bug in your <code> tag – it removes all leading whitespace which messes up the indentation of code.
Thanks for this great regex.
However I think there is a small mistake on line 4 column 37:
%[a-f\d{2}]
should actually be
%[a-f\d]{2}
Cheers!
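The difference between the two spellings is easy to demonstrate in isolation. In the sketch below (Python, chosen for illustration; `buggy` and `fixed` are made-up names) the first pattern is a percent sign followed by ONE character from a set that happens to include `{`, `2` and `}`, while the second is a percent sign followed by exactly two characters from `a-f` or a digit.

```python
import re

buggy = re.compile(r"%[a-f\d{2}]")  # "%" + one char out of: a-f, digits, "{", "2", "}"
fixed = re.compile(r"%[a-f\d]{2}")  # "%" + exactly two chars out of: a-f, digits

print(bool(fixed.fullmatch("%2f")))  # True: a proper two-character escape
print(bool(buggy.fullmatch("%2f")))  # False: only "%2" fits, the "f" is left over
print(bool(buggy.fullmatch("%{")))   # True: a brace is accepted as an "escape"
```

Inside the full expression the mistake is partly hidden, because the surrounding group repeats and the leftover character is usually picked up by the `\w` alternative, but the buggy form also lets through sequences such as `%{` that are not valid escapes.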
Mine is simple: the beginning should be a protocol and the end should be a space. I manually add a space to my data so that a URL at the very end matches too. I guess it’s the simplest and most general.
/((http|https|ftp):\/\/([^ ]*)) /i
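The same idea translated to Python, as a rough sketch: `SIMPLE_RE` and `find_urls` are names invented here, and the `.example` hosts are placeholders.

```python
import re

# Same expression as above: protocol, then everything up to the next space.
SIMPLE_RE = re.compile(r"((http|https|ftp)://([^ ]*)) ", re.IGNORECASE)

def find_urls(text):
    """Append a space so a URL at the very end still has its terminator,
    then collect group 1 (the URL without the trailing space)."""
    return [m.group(1) for m in SIMPLE_RE.finditer(text + " ")]

print(find_urls("see http://a.example/x and FTP://b.example"))
```

This only splits on spaces, so punctuation glued to the end of a URL (a trailing comma or closing parenthesis) ends up inside the match.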
Very nice regex. Thank You. But be careful, your last posts miss the plus “+” character.
I don’t want to post the whole regex again, so replace every space character within this regex with “+”, then it works perfectly (also with python).
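The repair described here is a plain string replacement. A minimal sketch in Python, applied to one fragment of the garbled expression (the variable names are made up for illustration):

```python
import re

garbled = r"(?:\w :\w @)?"            # fragment as it appears in the garbled comments
restored = garbled.replace(" ", "+")  # the fix described: every space was a "+"

print(restored)
print(bool(re.fullmatch(restored, "joe:password@")))
print(bool(re.fullmatch(garbled, "joe:password@")))
```

Applied to the whole expression, the blanket replacement also turns the space inside the `(?#TopLevel Domains)` comment group into a plus, which is harmless because the engine ignores the contents of comment groups.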
Thanks for the great regex. But it does not work with redirection URLs:
http://www.redirectme.com/redirector.php?url=http://toberedirected.com/files/1.html
IMHO:
$regexUrl = '/\b(([\w-]+:\/\/?|www[.]|(\w+\.))([^\s()]+(?:\([\w\d]+\)|([^[:punct:]\s]|\/))))/';
This one matches all kinds of links OK, even ones like asd.domain.com .
Recommendation:
function urls_to_links($originalString) {
return preg_replace('/\b(([\w-]+:\/\/?|(www[.]|\w+\.))([^\s()]+(?:\([\w\d]+\)|([^[:punct:]\s]|\/))))/', '$1', $originalString);
}
¡Ahh! how do you paste code?
You can view my original blog post at http://droope.wordpress.com/2010/07/08/regexpara-convertir-urls-a-links/
But here I get stuck with the short URL http://to./, which makes the provided regex fail. Any ideas? Thanks.
Hi,
just wanted to thank Ivan et al for the pattern – this is the most reasonable one that I’ve seen so far (I’m not counting monstrosities which validate international URIs or IPv6 addresses).
I’ve made 3 modifications which I wanted to share back as thanks for the provided pattern:
1.) added validation for IPv4 addresses
2.) made the hostname part optional, thus allowing relative URLs (so make sure to validate for empty strings outside of the code below)
3.) list of TLDs has been updated according to http://data.iana.org/TLD/tlds-alpha-by-domain.txt (incorporating newly approved TLDs)
Note: single part hostnames (e.g. “localhost”) and non-ASCII URIs were kept unsupported.
Here is the code for PHP (pattern should be language agnostic):
function validate_uri( $uri ) {
$pattern = '/^'
.'('
.'(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/|~\/|\/)?'
.'(?#Username:Password)(?:\w+:\w+@)?'
.'(' // allow domain name or IP address
.'(?#Subdomains)(?:(?:[-\w]+\.)+'
.'(?#TopLevelDomains)(?:aero|arpa|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel|[a-z]{2}))'
.'|'
.'(?#IpAddress)(?:(?:\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])\.){3}(?:\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])'
.')'
.'(?#Port)(?::[\d]{1,5})?'
.')?'
.'(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/)+|\?|#)?'
.'(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*'
.'(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?'
.'$/i';
$result = preg_match($pattern, $uri); // use the function's own parameter ($uri)
if($result > 0) {
return true;
}
return false;
}
Thanks,
M.
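The IPv4 part of that pattern can be tried on its own. The sketch below lifts the octet alternation from the `(?#IpAddress)` line into Python (my choice of language for the illustration); `OCTET`, `IPV4_RE` and `is_ipv4` are names made up here.

```python
import re

# One octet, 0-255, without leading zeros (taken from the IpAddress part above).
OCTET = r"(?:\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])"
IPV4_RE = re.compile(r"(?:" + OCTET + r"\.){3}" + OCTET)

def is_ipv4(text):
    """Return True when the whole string is a dotted-quad IPv4 address."""
    return IPV4_RE.fullmatch(text) is not None

for candidate in ["192.168.0.1", "255.255.255.255", "256.1.1.1", "1.01.001.000"]:
    print(candidate, is_ipv4(candidate))
```

Unlike the earlier IP variant in this thread, which deliberately accepts `1.01.001.000`, this octet pattern rejects leading zeros.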
I am not admiring the regular expressions on this blog page.
The standard RegEx is:
http://labs.apache.org/webarch/uri/rfc/rfc3986.html#regexp
(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
If you want to match only specific schemes, don’t want an optional scheme or host, and want to disallow spaces so that you can search in text, then use:
((http(s)?|ftp):)(//([^/?#\s]*))([^?#\s]*)(\?([^#\s]*))?(#([^\s]*))?
If you want to allow for www. without the http(s)?:// then use:
(?:(())(www\.([^/?#\s]*))|((http(s)?|ftp):)(//([^/?#\s]*)))([^?#\s]*)(\?([^#\s]*))?(#([^\s]*))?
If you want to validate the host, then I suggest doing it as a separate RegEx, so at least the RegEx above won’t fail to match any valid urls.
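To make the first of these expressions concrete, here is a sketch in Python that applies it and names the capture groups the way RFC 3986 Appendix B does (groups 2, 4, 5, 7 and 9). `RFC3986_RE` and `split_uri` are names invented for this illustration.

```python
import re

# The generic expression quoted above; re.match anchors it at the start.
RFC3986_RE = re.compile(r"(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def split_uri(uri):
    """Split a URI reference into its five generic components."""
    m = RFC3986_RE.match(uri)  # every part is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

parts = split_uri(
    "http://www.redirectme.com/redirector.php?url=http://toberedirected.com/files/1.html"
)
for name, value in parts.items():
    print(name, "=", value)
```

Because every component is optional, this expression never rejects anything; it only splits. That is why the comment suggests validating the host with a separate expression afterwards.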
An improved version.
// http://labs.apache.org/webarch/uri/rfc/rfc3986.html#regexp
//
// (?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
//
// $@ matches the entire url
// $1 matches scheme (http, ftp, mailto, ym, mshelp, etc)
// $2 matches authority (host, ftp user:pwd@host, etc)
// $3 matches path
// $4 matches query (http GET REST api, etc)
// $5 matches fragment (html anchor, etc)
//
// Match specific schemes, non-optional authority, disallow white-space so can delimit in text, and allow ‘www.’ w/o scheme
// Note the schemes must match ^[^|:/?#\s]+(?:\|[^|:/?#\s]+)*$
//
// (?:()(www\.[^/?#\s]+\.[^/?#\s]+)|(schemes)://([^/?#\s]*))([^?#\s]*)(\?([^#\s]*))?(#([^\s]*))?
//
// Validate the authority with an orthogonal RegEx, so at least the RegEx above won’t fail to match any valid urls.
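function urlRegEx ( schemes, flags )
{
    // Use a regex literal here: a quoted string would keep the slashes and lose the backslashes.
    if( !/^[^|:\/?#\s]+(?:\|[^|:\/?#\s]+)*$/.test( schemes ) )
        throw TypeError( "Expected schemes" )
    // Backslashes are doubled because the pattern is built from a string.
    return new RegExp( '(?:()(www\\.[^/?#\\s]+\\.[^/?#\\s]+)|(' + schemes + ')://([^/?#\\s]*))([^?#\\s]*)(\\?([^#\\s]*))?(#([^\\s]*))?', flags )
}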
I use (s?https?|s?ftp|telnet|news|imap|mailto|mms|s?news|nntp|prospero|rsync|rtspu?|sips?|k?svn(\++(ssh|https?))?|telnet|wais) to match the protocol, which includes all those I’ve come across. I’m sure there are more; though it depends on how ridiculous you want to get.
Thank you so much for bringing my search for the perfect URL regex to an end.