
Extract all links / URLs?


#1 Extract all links / URLs?

Posted by: Dirk | Date: 2014-02-20 10:47 | IP: IP Logged

Can one extract all links / URLs from a text file (and / or htm(l) file) so they can be added to another txt file? Or at least show / list all URLs / links so you could easily copy them?



#2 Re: Extract all links / URLs?

Posted by: pspad | Date: 2014-02-20 11:06 | IP: IP Logged

Read my first answer:
forum.pspad.com



#3 Re: Extract all links / URLs?

Posted by: Dirk | Date: 2014-02-20 22:15 | IP: IP Logged

Thank you very much.

It works great. I would not have supposed there was such an easy-to-find function in PSPad; sorry for not searching there.

How could I make PSPad also find / copy links with only a leading www.,
e.g.
www.domain.com
or
www.domain.com/?123
etc.?

Are there any kinds of links / URLs the function cannot cope with?

Many thanks again.

Edited 1 time(s). Last edit at 2014-02-20 22:31 by Dirk.



#4 Re: Extract all links / URLs?

Posted by: vbr | Date: 2014-02-21 08:32 | IP: IP Logged

Dirk:
...
How could I make PSPad also find / copy links with only a leading www., e.g. www.domain.com or www.domain.com/?123?

Are there any kinds of links / URLs the function cannot cope with?

Hi,
you can just modify the regular expression to match what you want,
i.e. instead of the current:

Quote:
(news|http|ftp|https):\/\/[\w\-_]+(\.[\w]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?

maybe

Quote:
((news|http|ftp|https):\/\/)?[\w\-_]+(\.[\w]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?

the added ( ... )? makes the part between parentheses optional.

Be sure to test this properly on your data; it is quite possible that you get some false positives, i.e. matches that are not URLs.
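If you want to sanity-check the modified pattern outside PSPad, a rough equivalent can be tried with Python's re module (PSPad's regex engine may differ in details, and the sample text below is invented):

```python
import re

# The modified pattern with the protocol part made optional via ( ... )?
pattern = re.compile(
    r"((news|http|ftp|https)://)?[\w\-_]+(\.[\w]+)+"
    r"([\w\-.,@?^=%&:/~+#]*[\w\-@?^=%&/~+#])?"
)

text = "See http://example.com/page and www.domain.com/?123 or note.txt"

# group(0) is the whole match, so both protocol-prefixed and bare URLs appear
matches = [m.group(0) for m in pattern.finditer(text)]
print(matches)
# note.txt is picked up as well - exactly the kind of false positive to watch for
```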

hth,
vbr

Edited 1 time(s). Last edit at 2014-02-21 08:33 by vbr.



#5 Re: Extract all links / URLs?

Posted by: Dirk | Date: 2014-02-21 09:38 | IP: IP Logged

Hi,

Many thanks.

Better to have some false positives and catch all links / URLs than to miss one of them without false positives.

But I have no idea about those expressions, unfortunately.

OK, extracting this now (I added an extra ":" so the source code is not rendered):

http:://well.me/dfdfddddf 200 ok text/html 1 1 1 nginx 00:00.799 utf-8
http:://well.me/999 200 ok text/html 2 1 1 nginx 00:00.285 utf-8
http:://well.me/456 200 ok text/html 2 1 2 nginx 00:00.323 utf-8
http:://well.me/8887kku 200 ok text/html 2 1 1 nginx 00:00.311 utf-8

extracts that:

http:://well.me/dfdfddddf
00.799
http:://well.me/999
00.285
http:://well.me/456
00.323
http:://well.me/8887kku
00.311

Maybe one could change that.

Many thanks again.

Edited 1 time(s). Last edit at 2014-02-21 09:38 by Dirk.



#6 Re: Extract all links / URLs?

Posted by: vbr | Date: 2014-02-21 13:38 | IP: IP Logged

Dirk:
...
OK, extracting this now (I added an extra ":" so the source code is not rendered):

http:://well.me/999 200 ok text/html 2 1 1 nginx 00:00.285 utf-8
...

extracts that:

http:://well.me/999
00.285
...

Maybe one could change that.

Hi,
well, these matched numbers are the mentioned false positives ... :-)
you may try the following modified pattern

Quote:
((news|http|ftp|https):\/\/)?[\w\-_]+(\.[\w]+)*?(\.[a-z]{2,3})([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?

This should ensure the presence of a top-level domain consisting of 2-3 letters; again, make sure to test it on your data - I only tested it in a very limited way.
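To see the effect of the required 2-3 letter top-level domain, here is a small Python check (Python's re standing in for PSPad's engine; a real http:// is used instead of the escaped http:// from the post):

```python
import re

# Pattern requiring a top-level domain of 2-3 lowercase letters
pattern = re.compile(
    r"((news|http|ftp|https)://)?[\w\-_]+(\.[\w]+)*?(\.[a-z]{2,3})"
    r"([\w\-.,@?^=%&:/~+#]*[\w\-@?^=%&/~+#])?"
)

line = "http://well.me/999 200 ok text/html 2 1 1 nginx 00:00.285 utf-8"

# Numbers like 00:00.285 no longer match: ".285" is not a 2-3 letter TLD
matches = [m.group(0) for m in pattern.finditer(line)]
print(matches)
```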

Alternatively, if you know the form of the URLs you want to match, it might be workable to write a simpler pattern from scratch - a large part of this version seems to deal with the query part after ?.

hth,
vbr



#7 Re: Extract all links / URLs?

Posted by: Dirk | Date: 2014-02-21 15:09 | IP: IP Logged

Yes, those false positives, I see.

Now this happens: some URLs are killed:

---------------------------------------------------------------------------
To extract:


http:://www.vibeuiaductor.com/auoid/njs_pt1.mp3 200 ok audio/mpeg 71089796 njs_pt1.mp3 17.01.2012 23:47:49 2 3 Apache 00:00.501 utf-8

The extraction:


njs_pt1.mp3

---------------------------------------------------------------------------
To extract (actually it should not be extracted - it would not be needed):


''.phorum_html_encode('maurioues3@gmail.com').'' 2 4 00:00.000 utf-8

The extraction:


gmail.com

---------------------------------------------------------------------------
To extract:


http:://sites.google.com/site/mauriciowhysou38/unove-suns/MidnightStar-SearchingForLove.mp3 200 ok text/html MidnightStar-SearchingForLove.mp3 25.01.2011 19:30:47 2 1 1 GSE 00:01.983 utf-8

The extraction:


MidnightStar-SearchingForLove.mp3

---------------------------------------------------------------------------

Quote:
Alternatively, if you know the form of the urls you want to match, it might be workable to write a simpler pattern from scratch - a large part of this version seems to deal with the query part after ?.

I assume there are too many different, unknown forms.

Maybe it is easier to extract the www. URLs in a first step and, in a second step, all of the remaining URLs containing http and similar.

Thank you very much.



#8 Re: Extract all links / URLs?

Posted by: vbr | Date: 2014-02-21 18:00 | IP: IP Logged

Dirk:
...
May be it is easier to extract the www. URLs in a first step and in a second step all of the remaining URLs containing http and similar.

Thank you very much.

Hi,
it may be difficult to solve this generally with one single regex, but chances are the format of your input data allows some assumptions which would make the extraction simpler,

e.g. if your URLs were always at the beginning of the line, were guaranteed to start with either http or www, and were followed by some whitespace, a naive pattern might simply be

^(https?://|www\.)\S{3,}

this would still have some false positives like www.abc (which could be handled separately), but these are probably not that likely; this assumes there should not be any spaces in the URL.
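As a quick illustration of this anchored approach (Python's re with MULTILINE standing in for PSPad's line-based search; the sample lines are invented):

```python
import re

# Naive pattern: URL at line start, beginning with http(s):// or www.,
# running up to the first whitespace
pattern = re.compile(r"^(https?://|www\.)\S{3,}", re.MULTILINE)

text = """http://well.me/456 200 ok text/html
www.domain.com/?123 some trailing fields
just a line without any url
"""

urls = [m.group(0) for m in pattern.finditer(text)]
print(urls)
```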

However, if there isn't such regularity, searching in multiple steps seems simplest.

hth,
vbr



#9 Re: Extract all links / URLs?

Posted by: Dirk | Date: 2014-02-21 19:12 | IP: IP Logged

Hi.

Quote:
it may be difficult to solve this generally with one single regex, but chances are the format of your input data allows some assumptions which would make the extraction simpler,

e.g. if your URLs were always at the beginning of the line, were guaranteed to start with either http or www, and were followed by some whitespace, a naive pattern might simply be

OK, yes, I understand; regrettably the URLs / links can be all over the place in many forms.

Or maybe one could first add (by search & replace)
http://
to each single
www.
not preceded by
http
or
https
and then use the pattern provided by PSPad.

And wouldn't a regex that only extracts the www. URLs in a first step be easier to handle?



#10 Re: Extract all links / URLs?

Posted by: vbr | Date: 2014-02-21 20:08 | IP: IP Logged

Dirk:
...
Or maybe one could first add (by search & replace) http:// to each www. not preceded by http or https, and then use the pattern provided by PSPad.

And wouldn't a regex that only extracts the www. URLs in a first step be easier to handle?

This would be possible, however, in multiple steps - the simplest way I can see would be to replace:
www\.
with:
http://www.

first, and then remove the possibly duplicated part by replacing:
(https?://)http://

with
$1

(A direct replacement respecting the needed condition "not preceded by ..." would require so-called lookbehind assertions, but the regex engine in PSPad doesn't support this feature.)
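The two replacement steps can be mimicked in Python to check the idea (re.sub here; in PSPad you would run the same two search & replace operations, and the sample text is made up):

```python
import re

text = "visit www.domain.com and https://www.other.org today"

# Step 1: prefix every www. with http:// (this also hits already-prefixed URLs)
step1 = re.sub(r"www\.", "http://www.", text)

# Step 2: collapse the doubled prefix, e.g. https://http:// back to https://
step2 = re.sub(r"(https?://)http://", r"\1", step1)

print(step2)
```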

Dirk:
And wouldn't a regex that only extracts the www. URLs in a first step be easier to handle?

this would be easier; it just seemed that you can also have URLs without this part. You may try:
www(\.[\w]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?
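Trying that www-only pattern in Python (again, PSPad's engine may behave slightly differently; the sample text is invented):

```python
import re

# Match URLs that begin with a literal "www" followed by dotted parts
pattern = re.compile(r"www(\.[\w]+)+([\w\-.,@?^=%&:/~+#]*[\w\-@?^=%&/~+#])?")

text = "see www.domain.com/?123 and http://other.net and a bare www.abc"

urls = [m.group(0) for m in pattern.finditer(text)]
print(urls)
# www.abc also matches - the kind of short false positive mentioned earlier
```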

hth,
vbr

Edited 1 time(s). Last edit at 2014-02-21 20:26 by vbr.








Editor PSPad - freeware editor, © 2001 - 2024 Jan Fiala, Hosted by Webhosting TOJEONO.CZ, design by WebDesign PAY & SOFT, code Petr Dvořák, Privacy policy and GDPR