removing duplicate lines while maintaining coherence with a second file

So I have two files. The first file, file1.txt, has lines of numbers separated by commas.

file1.txt

10,2,30,50
22,6,3,15,16,100
73,55
78,40,33,30,11
73,55
99,82,85
22,6,3,15,16,100

The second file, file2.txt, has sentences.

file2.txt

"the cat is fat"
"I like eggs"
"fish live in water"
"the moon is made of cheese"
"fish have houses in the water"
"the cake is a lie"
"I like my eggs scrambled"

What I would like to do is remove the duplicate lines from file1.txt, the file with the numbers, and at the same time remove the corresponding lines from file2.txt.

In other words: If line 5 from file1.txt is removed because it is a duplicate, then line 5 from file2.txt should also be removed.

I know that

awk '!x[$0]++' file1.txt > file1.txt.new

will work for removing duplicates, but it does not provide coherence with file2.txt.
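To make the mismatch concrete, here is a quick sketch with the sample data (file names are the ones from the question; run it in a scratch directory):

```shell
# Recreate the sample file1.txt from the question.
printf '%s\n' '10,2,30,50' '22,6,3,15,16,100' '73,55' \
    '78,40,33,30,11' '73,55' '99,82,85' '22,6,3,15,16,100' > file1.txt

# x[$0]++ evaluates to 0 (false) the first time a line is seen, so
# !x[$0]++ is true only for first occurrences, which awk then prints.
awk '!x[$0]++' file1.txt > file1.txt.new

wc -l < file1.txt.new   # 5 lines survive of the original 7...
# ...but file2.txt still has 7 lines, so line numbers no longer match up.
```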

thanks for the help in advance

It's not clear, exactly, what output should be given, so here goes!

$ paste -d\| file1.txt file2.txt | sort -uk1,1 | awk -F\| '{print $2}'
"the cat is fat"
"I like eggs"
"fish live in water"
"the moon is made of cheese"
"the cake is a lie"

This should work:

awk 'NR==FNR{if(!x[$0]++)print; y[NR]=x[$0]; next}y[FNR]==1' file1.txt file2.txt
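For reference, a sketch of what this two-pass awk produces on the sample files. Both results land on the same stdout: the deduplicated file1.txt lines first, then the surviving file2.txt lines.

```shell
# Recreate the sample files from the question.
printf '%s\n' '10,2,30,50' '22,6,3,15,16,100' '73,55' \
    '78,40,33,30,11' '73,55' '99,82,85' '22,6,3,15,16,100' > file1.txt
cat > file2.txt <<'EOF'
"the cat is fat"
"I like eggs"
"fish live in water"
"the moon is made of cheese"
"fish have houses in the water"
"the cake is a lie"
"I like my eggs scrambled"
EOF

# Pass 1 (NR==FNR, reading file1.txt): print first occurrences, and record
# in y[] whether each line number held a first occurrence (1) or a repeat (>1).
# Pass 2 (reading file2.txt): print only the lines where y[FNR] is 1.
awk 'NR==FNR{if(!x[$0]++)print; y[NR]=x[$0]; next} y[FNR]==1' file1.txt file2.txt
```

To get two separate output files instead of one combined stream, the first-pass print would need its own redirection.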

To manipulate both files at the same time, try:

awk '{getline l<f}!x[$0]++{print l>fnew;print}' f=file2.txt fnew=file2.txt.new file1.txt > file1.txt.new
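A sketch of running that on the sample data: `getline l<f` reads one line of file2.txt for every file1.txt line, so the two files advance in lockstep and duplicates are dropped from both at the same positions.

```shell
# Recreate the sample files from the question.
printf '%s\n' '10,2,30,50' '22,6,3,15,16,100' '73,55' \
    '78,40,33,30,11' '73,55' '99,82,85' '22,6,3,15,16,100' > file1.txt
cat > file2.txt <<'EOF'
"the cat is fat"
"I like eggs"
"fish live in water"
"the moon is made of cheese"
"fish have houses in the water"
"the cake is a lie"
"I like my eggs scrambled"
EOF

# For every file1 line, read the matching file2 line into l.  When the
# file1 line is a first occurrence, write l to file2.txt.new and the line
# itself to stdout (redirected to file1.txt.new).
awk '{getline l<f} !x[$0]++ {print l>fnew; print}' \
    f=file2.txt fnew=file2.txt.new file1.txt > file1.txt.new

wc -l file1.txt.new file2.txt.new   # both end up with 5 lines
```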

Sweet, it works now. Thanks for the quick replies.

Nice solution, but you missed the "-t\|" option in your sort.

I agree that it's a nice solution. However, if the first file's content is as advertised (comma-delimited numbers), there's really no need to specify an alternate delimiter; the default \t suffices.

paste file1.txt file2.txt | sort -uk1,1 | cut -f2-
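With the sample data this behaves as intended, with two caveats worth knowing: when two lines share the same first field, which one sort -u keeps is implementation-dependent, and the output is ordered by the sort key rather than by original line number (here the keys 10, 22, 73, 78, 99 happen to sort in file order anyway):

```shell
# Recreate the sample files from the question.
printf '%s\n' '10,2,30,50' '22,6,3,15,16,100' '73,55' \
    '78,40,33,30,11' '73,55' '99,82,85' '22,6,3,15,16,100' > file1.txt
cat > file2.txt <<'EOF'
"the cat is fat"
"I like eggs"
"fish live in water"
"the moon is made of cheese"
"fish have houses in the water"
"the cake is a lie"
"I like my eggs scrambled"
EOF

# paste joins the files with a tab by default; sort -u -k1,1 keeps one
# line per first (whitespace-delimited) field, i.e. per number list;
# cut -f2- strips the numbers again, leaving only the sentences.
paste file1.txt file2.txt | sort -uk1,1 | cut -f2-
```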

Regards,
Alister
