Composite: thoughts on poetics & tech: Cleaning up urls with awk

Wednesday, January 28, 2009

Cleaning up urls with awk

Here's my stupid awk trick of the day: using the field separator option to mess with URLs. I spent something like an hour trying to write regular expressions, and then reading other people's solutions for cleaning up URLs from log files and other sources.

For example, given a list of about a million URLs like this:

http://bloggggggggy.com/path/to/the/post/2009/1/26/blahblahblah.html
http://www.bloggggggggy.com/morejunk.html
https://www.bloggggggggy.com
http://yetanotherblogomigod.blogspot.com/
http://yetanotherblogomigod.blogspot.com/somejunk.php?stuff&morestuff

I want to end up with a list that's just

bloggggggggy.com
yetanotherblogomigod.blogspot.com

You can do this in PHP with a regular expression:

preg_match("/^(https?:\/\/)?([^\/]+)/i", $URLstring, $result);
$domain = $result[2];

(Though I saw a lot of other solutions that were much longer and more involved.)

Or, here's one method in Perl:

$url =~ s!^https?://(?:www\.)?!!i;
$url =~ s!/.*!!;
$url =~ s/[\?\#\:].*//;

But for some reason I was trying to do it in one line in awk, because that's how my brain is working these days, and I couldn't get the regular expression right.

Suddenly I realized that if I split the lines on "/", the domain name would always be the third field.

So,
awk -F"/" '{print $3}' hugelistofurls.txt > cleanlist.txt
gave me a nicer list: just the hostnames, one per line.
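
To see why the third field is the one that matters, here's a small sketch that prints every field awk sees when a URL is split on "/". The helper name show_fields is made up for illustration, and the URL is one of the samples from the list above. The two slashes in "http://" leave an empty second field between them, which is what pushes the hostname into $3.

```shell
# Hypothetical helper: print each "/"-separated field of a URL, numbered.
show_fields() {
    printf '%s\n' "$1" |
        awk -F"/" '{ for (i = 1; i <= NF; i++) printf "$%d = [%s]\n", i, $i }'
}

show_fields "http://www.bloggggggggy.com/morejunk.html"
# prints:
# $1 = [http:]
# $2 = []
# $3 = [www.bloggggggggy.com]
# $4 = [morejunk.html]
```

This only holds when every line starts with a scheme like http:// or https://. A bare "example.com/page" would put the hostname in $1 instead.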

and
awk -F"/" '{print $1 "//" $3}' hugelistofurls.txt | sort | uniq -c | sort -nr > counted-sorted-cleanlist.txt

gave me just about what I wanted.
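
Here's roughly how that counting pipeline behaves on the five sample URLs from earlier, fed in with printf instead of a file so it can be pasted straight into a shell. It's a sketch: the sample_urls helper is my own invention, and the awk program joins $1, "//" and $3 by plain concatenation so no spaces end up inside the URL.

```shell
# Hypothetical helper: emit the five sample URLs, one per line.
sample_urls() {
    printf '%s\n' \
        "http://bloggggggggy.com/path/to/the/post/2009/1/26/blahblahblah.html" \
        "http://www.bloggggggggy.com/morejunk.html" \
        "https://www.bloggggggggy.com" \
        "http://yetanotherblogomigod.blogspot.com/" \
        "http://yetanotherblogomigod.blogspot.com/somejunk.php?stuff&morestuff"
}

# Keep scheme + host, then count duplicates, most frequent first.
counted=$(sample_urls |
    awk -F"/" '{print $1 "//" $3}' |
    sort | uniq -c | sort -nr)

printf '%s\n' "$counted"
# The first line is the count 2 for http://yetanotherblogomigod.blogspot.com;
# the other three lines each have a count of 1.
```

On this tiny sample the three bloggggggggy.com lines stay separate, because the http, https, and www variants are all different strings as far as uniq is concerned.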

After I did that, finished squeaking with happiness, and wished I could show someone who would care (which unfortunately I couldn't, which is why I'm blogging it now), I realized I wanted the www stuff taken out. So I backed up and did it in two steps:

awk -F"/" '{print $1 "//" $3}' hugelistofurls.txt > cleanlistofurls.txt
awk -F"www[.]" '{print $1 $2}' cleanlistofurls.txt | sort | uniq -c | sort -nr > reallyclean-sorted-listofurls.txt

which gave me something like this:

3 http://bloggggggggy.com
2 http://yetanotherblogomigod.blogspot.com

Exactly what I wanted!
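
If the intermediate file bothers you, the two steps can probably be folded into a single awk program. This is only a sketch, not what I actually ran: the clean_hosts name is hypothetical, it uses awk's sub() to strip a leading "www." from the host field, and it drops the scheme entirely, so http and https versions of a host land in the same bucket. That gives the bare-host list I said I wanted at the start.

```shell
# Hypothetical helper: read URLs on stdin, print bare hostnames without "www.".
clean_hosts() {
    awk -F"/" '{
        host = $3                    # third "/"-separated field is the hostname
        sub(/^www\./, "", host)      # strip "www." only at the start of the host
        if (host != "") print host   # skip lines that have no third field
    }'
}

printf '%s\n' \
    "http://bloggggggggy.com/path/to/the/post/2009/1/26/blahblahblah.html" \
    "http://www.bloggggggggy.com/morejunk.html" \
    "https://www.bloggggggggy.com" \
    "http://yetanotherblogomigod.blogspot.com/" \
    "http://yetanotherblogomigod.blogspot.com/somejunk.php?stuff&morestuff" |
    clean_hosts | sort | uniq -c | sort -nr
# prints (leading spaces may vary by platform):
#   3 bloggggggggy.com
#   2 yetanotherblogomigod.blogspot.com
```

Anchoring the pattern with ^ means a "www." that shows up in the middle of a hostname is left alone, which the field-separator version doesn't guarantee.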

While I appreciate a nice regular expression, and figuring one out can be a fun challenge, getting the job done with awk felt a lot simpler, and I'm more likely to remember how to do it off the cuff the next time I have a giant list of URLs to wrestle with.

How would you approach this same problem, either in awk or using another tool or language? Do you think one way or another is superior, and why?