Compare and Remove duplicate lines from txt

rmarcano · August 16, 2008, 2:33pm

Hello,
I am a total Linux newbie and I can't seem to find a solution to this little problem.
I have two text files, each with a huge list of URLs. Let's call them file1.txt and file2.txt.

What I want to do is take each URL from file2.txt, search file1.txt for it and, if found, delete it from file1.txt.

What would be the best way to go about doing this? Would I need a full bash script, or does anyone know a simple one-liner to do it?

Thanks,
Rafael

vidyadhar85 · August 16, 2008, 2:46pm

I hope this works; I don't have a terminal to test it on right now.

while read line
do
sed "/$line/d" file1.txt >tempfile
# or, where sed supports -i (newer versions of Linux): sed -i "/$line/d" file1.txt
done < file2.txt

rmarcano · August 16, 2008, 3:52pm

Thanks for your quick answer vidyadhar!
I'm getting the following error:
sed: -e expression #1, char 8: unknown command: `/'

This is my bash script:

#!/bin/sh
while read line
do
sed "/$line/d" file1.txt >tempfile
done < file2.txt

I tried searching Google for more sed info to see if I could fix it, but had no luck.

Thanks again!
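
A note on the error above: the lines in file2.txt are URLs, so they contain "/", which is also the delimiter in the sed address "/$line/d". sed ends the pattern at the first slash inside the URL and then fails on what follows. Below is a minimal sketch of one way around it, assuming a sed that accepts the \|...| form for choosing a different address delimiter (POSIX describes this form). The URLs are made-up sample data; only the file names come from the thread. Each pass also writes its result back to file1.txt so the deletions accumulate.

```shell
# Sample data: the file names come from the thread, the URLs are made up.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'http://a.example/1' 'http://b.example/2' 'http://c.example/3' > file1.txt
printf '%s\n' 'http://b.example/2' > file2.txt

while IFS= read -r line
do
  # \|...| selects "|" as the address delimiter, so slashes in $line are harmless.
  # $line is still a regular expression here: "." matches any character.
  sed "\|$line|d" file1.txt > tempfile && mv tempfile file1.txt
done < file2.txt

cat file1.txt
```

This still runs sed once per line of file2.txt, so it is likely to be slow on large files, and a URL containing "|" or other regex characters could still misbehave.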

redoubtable · August 16, 2008, 4:03pm

Another way to do it is just:

for i in `cat file1`; do echo $i|grep -v -f file2; done

That will output only the lines in file1 which are not found in file2. If you want, you can then redirect the output to another file to create a new file1:

for i in `cat file1`; do echo $i|grep -v -f file2; done > file1.new

rmarcano · August 16, 2008, 4:05pm

Just changed the double quotes to single quotes and I got no errors.
Thanks again!

vidyadhar85 · August 16, 2008, 4:39pm

Yeah, sorry, I typed double quotes instead of single quotes.
Usually in sed, "" is used to expand the variable... when you are using variables inside sed you should be extra careful.
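
A small sketch of what the quoting change does, since it affects whether anything gets deleted. Inside double quotes the shell replaces $line with the URL before sed sees it; inside single quotes sed receives the literal text $line, which matches no URL, so there is no error but also no deletion. The URL is made-up sample data.

```shell
line='http://b.example/2'

double="/$line/d"   # the shell expands $line
single='/$line/d'   # the shell leaves $line untouched

printf '%s\n' "$double"   # prints /http://b.example/2/d
printf '%s\n' "$single"   # prints /$line/d

# With single quotes, sed looks for the literal text "$line" and deletes nothing.
kept=$(printf '%s\n' "$line" | sed '/$line/d')
printf '%s\n' "$kept"     # prints http://b.example/2
```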

rmarcano · August 16, 2008, 5:13pm

vidyadhar85, that didn't seem to work. I don't know what might be wrong, but it's not finding any common lines.
file1.txt has 1687 lines, of which 472 are in file2.txt.

I am trying redoubtable's solution but it's a bit slow. It's been running for a bit over 10 minutes, which is understandable since the files are big.

I'll let you know how that works.

vidyadhar85 · August 16, 2008, 5:25pm

Then just try:
grep -v -f file2 file1 >> newfile
mv newfile file1
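
A slightly stricter variant of the same idea, as a sketch: by default grep treats every line of file2 as a regular expression and matches it anywhere in a line, so a URL in file2 would also remove longer URLs in file1 that merely contain it. Adding -F (fixed strings) and -x (whole-line match) avoids both effects. The URLs below are made-up sample data.

```shell
# Sample data: the file names come from the thread, the URLs are made up.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'http://a.example/1' 'http://b.example/2' 'http://b.example/22' > file1
printf '%s\n' 'http://b.example/2' > file2

# -v: print non-matching lines, -x: match whole lines only,
# -F: treat patterns as fixed strings, -f file2: read patterns from file2
grep -vxFf file2 file1 > newfile
mv newfile file1

cat file1
```

Without -x, http://b.example/22 would have been dropped as well, because it contains http://b.example/2.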

rmarcano · August 16, 2008, 5:37pm

That worked great vidyadhar. Thanks!

redoubtable · August 16, 2008, 5:52pm

I gave you a line-by-line solution because I figured you might want to do some processing on each line.
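
If per-line processing is the goal, a while/read loop is one way to sketch it without relying on word splitting of the cat output. This is only an illustration: the URLs are made-up sample data, and the printf stands in for whatever processing is actually wanted. It still starts one grep per line, so it will be much slower than a single grep over the whole file.

```shell
# Sample data: the file names come from the thread, the URLs are made up.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'http://a.example/1' 'http://b.example/2' 'http://c.example/3' > file1
printf '%s\n' 'http://b.example/2' > file2

while IFS= read -r url
do
  # -q: no output, -x: whole-line match, -F: fixed string, --: end of options
  if ! grep -qxF -- "$url" file2
  then
    printf '%s\n' "$url"   # placeholder for real per-line processing
  fi
done < file1 > file1.new

cat file1.new
```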

matrixmadhan · August 17, 2008, 2:08am

Dirty, but here is a superfast way to do it:

1) Load all the URLs from file1.txt into a hashmap
2) Read file2.txt one line at a time
3) If the entry from file2.txt is found in the hashmap (built from file1.txt), delete that hashmap entry
4) Repeat steps 2 and 3 until EOF of file2.txt
5) Write the remaining hashmap entries to another file; this should give the output you intend
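
The hashmap idea can be sketched with awk's associative arrays. Note that this is a variation on the steps above, not a literal transcription: it loads file2.txt into the array and then filters file1.txt against it, which keeps the original line order of file1.txt (awk does not guarantee any order when looping over an array). The URLs are made-up sample data and the array name "drop" is arbitrary.

```shell
# Sample data: the file names come from the thread, the URLs are made up.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'http://a.example/1' 'http://b.example/2' 'http://c.example/3' > file1.txt
printf '%s\n' 'http://b.example/2' 'http://d.example/4' > file2.txt

# While reading the first file (file2.txt), NR equals FNR: store each line as a key.
# While reading the second file (file1.txt), print only lines that are not keys.
awk 'NR == FNR { drop[$0] = 1; next } !($0 in drop)' file2.txt file1.txt > file1.new

cat file1.new
```

One caveat with this sketch: the NR == FNR test assumes file2.txt is not empty.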

bakunin · August 18, 2008, 5:09am

Just a question: what is wrong with merging the files first and then using "sort -u"? I suppose you want to add the content of one file to the other but don't want to create duplicates, right? If so:

sort -u file1 file2 > tempfile

I hope this helps.

bakunin
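
For comparison, a sketch of how the merge with sort -u differs from the removal asked for at the top of the thread. sort -u on both files gives every distinct line from either file, while comm -23 on sorted copies gives the lines that appear only in file1. The URLs and the extra file names (merged, file1.sorted, file2.sorted) are made up for this example; note that both commands return sorted output, so the original line order is lost.

```shell
# Sample data: the URLs are made up.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'http://c.example/3' 'http://a.example/1' 'http://b.example/2' > file1
printf '%s\n' 'http://b.example/2' 'http://d.example/4' > file2

# Union: every distinct line from either file.
sort -u file1 file2 > merged

# Difference: lines that are only in file1. comm expects sorted input.
sort -u file1 > file1.sorted
sort -u file2 > file2.sorted
comm -23 file1.sorted file2.sorted > tempfile

cat tempfile
```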