Compare multiple files and print unique lines

jacobs.smith · February 3, 2012, 10:51am

Hi friends,

I have multiple files. For now, let's say I have two, in the following format:

cat 1.txt

cat 2.txt

output.txt

Please note that my files are not sorted. In the output file I need an extra column that names the file each line comes from. I have more than 100 files to process.

Any help is highly appreciated.

Thanks

radoulov · February 3, 2012, 11:01am

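awk 'END {
  # print every record whose three-column key was seen exactly once
  for (R in r) {
    split(r[R], t, SUBSEP)  # t[1]: earlier occurrences, t[2]: file name, t[3]: record
    if (!t[1])
      print t[3], t[2]
  }
}
{
  k = $1 SUBSEP $2 SUBSEP $3               # key: the first three columns
  r[k] = c[k]++ SUBSEP FILENAME SUBSEP $0  # keep the latest record for this key
}' [12].txt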

jacobs.smith · February 3, 2012, 11:08am

It works great, but will it compare the first three columns?

radoulov · February 3, 2012, 11:09am

Yes.
Do you want to compare the absolute numeric values of the rest of the columns?

jacobs.smith · February 3, 2012, 11:18am

Yes.

Also, instead of letters in the first three columns, I might have some numbers too.

radoulov · February 3, 2012, 11:19am

Please post bigger samples of the input files and, again, an example of the desired output based on that exact input.

For example, I don't understand why these two lines shouldn't be considered unique ...:

D E F 9 8 7 6 5 4 3 2 
D E F 90 88 76 54 32 1 0 1

jacobs.smith · February 3, 2012, 11:29am

I was going to ask whether you needed data. Sorry about that. Here is the data.

1.txt

A 2 3 1 2 3 4 5 6 7 8
D 4 5 9 8 7 6 5 4 3 2 
G 5 6 0 1 2 3 4 5 6 7
K 7 8 1 32 33 45 67 98 76 34
I 7 8 I A M A N I N D
L 2 3 G O T O H E L L

2.txt

A 2 3 1 2 3 4 5 6 7 8
D 4 5 9 8 7 6 5 4 3 2 
G 5 6 0 1 2 3 4 5 6 7
D O L K I N H J K I L
J G H J L K M N J U I 
M A A T U J H E S A L

3.txt

A 2 3 1 2 3 4 5 6 7 8
D 4 5 9 8 7 6 5 4 3 2 
G 5 6 0 1 2 3 4 5 6 7

4.txt

A 2 3 -1 -2 3 4 5 6 -7 80
D 4 5 90 88 76 54 32 1 0 1
M N O 99 65 34 22 13 9 4 3

Output.txt

M N O 99 65 34 22 13 9 4 3 4.txt
K 7 8 1 32 33 45 67 98 76 34 1.txt
I 7 8 I A M A N I N D 1.txt
L 2 3 G O T O H E L L 1.txt
D O L K I N H J K I L 2.txt
J G H J L K M N J U I 2.txt
M A A T U J H E S A L 2.txt

I just need a match on the first three columns: if that combination is present in more than one file, it should be eliminated. Thanks for all your help.

radoulov · February 3, 2012, 11:33am

OK,
in this case the code I've provided is sufficient, isn't it?

jacobs.smith · February 3, 2012, 11:38am

I think so. My only question is whether your code will also match numbers in the first three columns, because when you wrote the code, all you had were letters.

Thanks for all your help.

Sorry, mods, for forgetting to include the code tags.

radoulov · February 3, 2012, 11:39am

No problem, just use code tags in the future.
Yes, the code compares the first three columns only and doesn't care about their content.
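
A quick way to check this is to run the same idea against a few of the sample lines posted above, including keys made of digits. The sketch below is a restatement, not radoulov's exact script: it uses two arrays (the names count and line are made up for illustration) instead of the packed SUBSEP string, and it assumes mktemp -d is available. Array subscripts in awk are strings, so "7 8" and "N O" are handled the same way.

```shell
#!/bin/sh
# Work in a scratch directory so the sample files do not clobber anything.
dir=$(mktemp -d) || exit 1
cd "$dir" || exit 1

# Shortened copies of 1.txt, 2.txt and 4.txt from the thread.
printf '%s\n' 'A 2 3 1 2 3 4 5 6 7 8' 'K 7 8 1 32 33 45 67 98 76 34' > 1.txt
printf '%s\n' 'A 2 3 1 2 3 4 5 6 7 8' 'D O L K I N H J K I L' > 2.txt
printf '%s\n' 'A 2 3 -1 -2 3 4 5 6 -7 80' 'M N O 99 65 34 22 13 9 4 3' > 4.txt

# The order of a for-in loop is unspecified, so the result is sorted
# to make the output stable.
unique=$(awk '
{
  k = $1 SUBSEP $2 SUBSEP $3   # key: first three columns, compared as text
  count[k]++                   # how many records share this key
  line[k] = $0 " " FILENAME    # last record seen for the key, plus its file
}
END {
  for (k in count)
    if (count[k] == 1)
      print line[k]
}' [124].txt | LC_ALL=C sort)

printf '%s\n' "$unique"

cd / && rm -rf "$dir"
```

With these three files the script prints the D O L line tagged 2.txt, the K 7 8 line tagged 1.txt and the M N O line tagged 4.txt. Every A 2 3 line is dropped, even though the remaining columns differ in 4.txt.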

jacobs.smith · February 3, 2012, 11:42am

Perfect and cool.

Made my weekend happy. Cheers!!!

radoulov · February 3, 2012, 11:43am

You're welcome!

jacobs.smith · February 3, 2012, 5:23pm

cat 1.txt

cat 2.txt

cat 3.txt

output.txt

Earlier I asked for only the unique records. Now I need the unique ones plus the duplicates: whether a record is present once or more than once across multiple files, it should be printed only once in my final output.

Thanks in advance.

radoulov · February 3, 2012, 5:28pm

I believe you're missing one line in your example output:

awk '!_[$0]++' [1-3].txt

jacobs.smith · February 3, 2012, 5:34pm

I just added it.

Sorry about that.

radoulov · February 3, 2012, 5:42pm

If you want to restrict the uniqueness to certain columns only (1, 2 and 3 in this example):

awk '!_[$1, $2, $3]++' [1-3].txt
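
To see how the two one-liners differ, here is a small sketch. The sample files for this second question are not shown in the thread, so the two files below are assumptions built from lines posted earlier. The array is named seen instead of _ purely for readability, and mktemp -d is assumed to be available.

```shell
#!/bin/sh
dir=$(mktemp -d) || exit 1
cd "$dir" || exit 1

# Hypothetical input: 2.txt repeats the A 2 3 key twice, once with
# different trailing columns and once as an exact copy of the 1.txt line.
printf '%s\n' 'A 2 3 1 2 3 4 5 6 7 8' 'K 7 8 1 32 33 45 67 98 76 34' > 1.txt
printf '%s\n' 'A 2 3 -1 -2 3 4 5 6 -7 80' 'A 2 3 1 2 3 4 5 6 7 8' > 2.txt

# Uniqueness judged on the whole line: only the exact copy is dropped.
by_line=$(awk '!seen[$0]++' [12].txt)

# Uniqueness judged on columns 1-3: only the first A 2 3 line survives.
by_key=$(awk '!seen[$1, $2, $3]++' [12].txt)

printf 'whole line:\n%s\n\nfirst three columns:\n%s\n' "$by_line" "$by_key"

cd / && rm -rf "$dir"
```

Both variants keep the first occurrence and preserve input order, so no sorting is needed.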

jacobs.smith · February 3, 2012, 5:43pm

I owe you a lifetime. Thanks. So [1-3] indicates a range and not just the file names. Did I get it right?

Also, what does this mean: !_

I know ! means not equal to and ++ means increment.

How could you read my mind? That was my next question.

radoulov · February 3, 2012, 5:45pm

[1-3].txt is a shell glob pattern which expands to the filenames:

zsh-4.3.14[t]% printf '%s\n' [1-3].txt
1.txt
2.txt
3.txt
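
For contrast, a small sketch of how a bracket expression with a hyphen differs from one without. A bracket expression always matches exactly one character: [13] is the set "1 or 3", while [1-3] is the range 1 through 3. The file 13.txt is a made-up extra to show that neither pattern matches two characters; mktemp -d is assumed to be available.

```shell
#!/bin/sh
dir=$(mktemp -d) || exit 1
cd "$dir" || exit 1
touch 1.txt 2.txt 3.txt 13.txt

# Set: one character that is either 1 or 3.
set_match=$(printf '%s\n' [13].txt)

# Range: one character between 1 and 3 inclusive.
range_match=$(printf '%s\n' [1-3].txt)

printf 'set [13]:\n%s\n\nrange [1-3]:\n%s\n' "$set_match" "$range_match"

cd / && rm -rf "$dir"
```

The set form lists 1.txt and 3.txt only, the range form lists 1.txt, 2.txt and 3.txt, and 13.txt appears in neither.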

jacobs.smith · February 3, 2012, 5:48pm

But what does !_ mean exactly?

Corona688 · February 3, 2012, 5:59pm

_ is the name of a variable.

He's using it as an array, hence _[ ].

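The ! is a logical NOT, so the test is true while _[$1, $2, $3] is still blank, that is, the first time that combination of columns appears. Since there is no action block, awk prints the line whenever the test is true.

The ++ increments the value at _[$1,$2,$3] after the test, so it's not blank anymore, meaning that further lines with the same $1, $2, $3 won't be printed.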
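
Spelled out, the one-liner behaves roughly like the longer script below. This is only an illustrative expansion: the names key and seen and the three input lines are made up.

```shell
#!/bin/sh
# Compact form, as used in the thread.
short=$(printf '%s\n' 'a b c 1' 'a b c 2' 'x y z 3' |
  awk '!_[$1, $2, $3]++')

# Same logic written out step by step.
long=$(printf '%s\n' 'a b c 1' 'a b c 2' 'x y z 3' |
  awk '
{
  key = $1 SUBSEP $2 SUBSEP $3   # what the subscript $1, $2, $3 builds internally
  if (seen[key] == 0)            # an unset element compares equal to 0
    print                        # first time this key appears: print the line
  seen[key]++                    # mark the key as seen
}')

printf '%s\n' "$short"
printf '%s\n' "$long"
```

Both commands print the same two lines, a b c 1 and x y z 3. The second a b c line is skipped because its counter is already 1 when it is tested.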