totus
February 19, 2009, 9:49pm
1
I have a list of urls for example:
Google
Google Base
Yahoo!
Yahoo!
Yahoo! Video - It's On
Google
The problem is that Google and Google are duplicates, as are Yahoo! and Yahoo!.
I need to find these canonical www duplicates and append the text "DUP#" in front of both Google entries for delimited import into Excel, so I can sort and review them by eye.
I have no idea how to begin... sed, awk, Perl, cut, etc.?
Many thanks for any input.
#!/usr/bin/perl
use strict;
use warnings;

open my $fh, '<', 'a.txt' or die "Cannot open a.txt: $!";
my (@arr, %hash);
while (<$fh>) {
    chomp;
    push @arr, $_;      # keep original order
    $hash{$_}++;        # count occurrences of each line
}
close $fh;

# Prefix every line that appears more than once with DUP#
$_ = "DUP#" . $_ for grep { $hash{$_} > 1 } @arr;
print join("\n", @arr), "\n";
Shahul
February 20, 2009, 1:37am
3
Hi totus,
Hope this also works:
inputfile:
www.Google.com
www.Google Base.com
www.Yahoo!.com
www.Yahoo!.com
www.Yahoo ! Video - It's On.com
www.Google.com
command:
sort inputfile | uniq -D | awk '{print $0 "_DUP#"}' > out.csv
output:
www.Google.com_DUP#
www.Google.com_DUP#
www.Yahoo!.com_DUP#
www.Yahoo!.com_DUP#
Thanks
Sha
totus
February 20, 2009, 2:11am
4
Hello both of you! Thanks for the tips! However, I made a mistake in representing my data, as vBulletin mucked it up. Here it is in a code snippet:
http://www.google.com
http://google.com
http://www.yahoo.com
http://video.yahoo.com
http://www.yahoo.com
http://knol.google.com
The issue is that www.domain.com and domain.com are duplicates. I need to identify these in large lists by appending some delimiter to the matches, e.g.
DUP#http://www.google.com
DUP#http://google.com
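One way to catch the www/non-www pairs is a two-pass awk: the first pass counts each line's canonical form (scheme and a leading "www." stripped), the second pass prefixes DUP# on lines whose canonical form appears more than once, keeping the original order. This is a sketch, not the thread's accepted answer; the filename urls.txt is a placeholder for your list, and it assumes only the scheme and a leading "www." need normalizing (so video.yahoo.com stays distinct from www.yahoo.com).

```shell
# First pass (NR == FNR): count canonical hosts.
# Second pass: print each line, prefixed with DUP# if its canonical
# form occurred more than once. urls.txt is a hypothetical input file.
awk '
  function canon(u) {
    sub(/^https?:\/\//, "", u)   # drop the scheme
    sub(/^www\./, "", u)         # drop a leading www.
    return u
  }
  NR == FNR { count[canon($0)]++; next }
  { prefix = (count[canon($0)] > 1) ? "DUP#" : ""; print prefix $0 }
' urls.txt urls.txt
```

On the sample list above, this would tag http://www.google.com and http://google.com (and the www.yahoo.com pair) while leaving http://video.yahoo.com and http://knol.google.com untouched; the DUP#-prefixed output can then be imported into Excel and sorted as described.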