Compare multiple files and print unique lines

Hi friends,

I have multiple files. For now, let's say I have two, in the following format:

cat 1.txt

cat 2.txt

output.txt

Please note that my files are not sorted, and in the output file I need an extra column that says which file each line comes from. I have more than 100 files to process.

Any help is highly appreciated.

Thanks

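awk 'END {
  # print every record whose key was seen exactly once, followed by its file name
  for (R in r) {
    split(r[R], t, SUBSEP)
    if (!t[1])
      print t[3], t[2]
  }
}
{
  # key on the first three columns; store the previous count, file name and line
  k = $1 SUBSEP $2 SUBSEP $3
  r[k] = c[k]++ SUBSEP FILENAME SUBSEP $0
}' [12].txt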

It works great, but will it compare the first three columns?

Yes.
Do you want to compare the absolute numeric values of the rest of the columns?

Yes.

And also, instead of letters in the first three columns, I might have some numbers too.

Please post bigger samples of the input files and, again, an example of the desired output based on that exact input.

For example, I don't understand why these two lines shouldn't be considered unique ...:

D E F 9 8 7 6 5 4 3 2 
D E F 90 88 76 54 32 1 0 1 

I thought of asking you whether you needed data. Sorry about that; here is the data.

1.txt

A 2 3 1 2 3 4 5 6 7 8
D 4 5 9 8 7 6 5 4 3 2 
G 5 6 0 1 2 3 4 5 6 7
K 7 8 1 32 33 45 67 98 76 34
I 7 8 I A M A N I N D
L 2 3 G O T O H E L L

2.txt

A 2 3 1 2 3 4 5 6 7 8
D 4 5 9 8 7 6 5 4 3 2 
G 5 6 0 1 2 3 4 5 6 7
D O L K I N H J K I L
J G H J L K M N J U I 
M A A T U J H E S A L

3.txt

A 2 3 1 2 3 4 5 6 7 8
D 4 5 9 8 7 6 5 4 3 2 
G 5 6 0 1 2 3 4 5 6 7

4.txt

A 2 3 -1 -2 3 4 5 6 -7 80
D 4 5 90 88 76 54 32 1 0 1
M N O 99 65 34 22 13 9 4 3

Output.txt

M N O 99 65 34 22 13 9 4 3 4.txt
K 7 8 1 32 33 45 67 98 76 34 1.txt
I 7 8 I A M A N I N D 1.txt
L 2 3 G O T O H E L L 1.txt
D O L K I N H J K I L 2.txt
J G H J L K M N J U I 2.txt
M A A T U J H E S A L 2.txt

I just need a match on the first three columns: if a key is present in more than one file, it should be eliminated. Thanks for all your help.

OK,
in this case the code I've provided is sufficient, isn't it?

I think so. My only question is whether your code will also match numbers in the first three columns, because when you wrote the code all you had was letters.

Thanks for all your help.

Sorry, mods, for overlooking the code tags.

No problem, just use code tags in the future.
Yes, the code compares the first three columns only and doesn't care about their content.
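To illustrate the point about content, here is a small sketch that runs the same logic on keys mixing letters and numbers. The scratch directory and the file contents are made up for this demo, and the output is piped through sort only because the order of a for (R in r) loop is unspecified in awk.

```shell
#!/bin/sh
# Hypothetical demo data in a scratch directory (not the poster's real files).
dir=$(mktemp -d)
cd "$dir" || exit 1

printf '%s\n' 'A 2 3 1 2 3' 'K 7 8 1 32 33' > 1.txt
printf '%s\n' 'A 2 3 9 9 9' '7 8 9 x y z'   > 2.txt

unique=$(awk '
{
  # the key is the first three columns, whatever they contain
  k = $1 SUBSEP $2 SUBSEP $3
  r[k] = c[k]++ SUBSEP FILENAME SUBSEP $0
}
END {
  for (R in r) {
    split(r[R], t, SUBSEP)
    if (!t[1])            # previous count 0: key occurred only once
      print t[3], t[2]
  }
}' [12].txt | LC_ALL=C sort)

printf '%s\n' "$unique"

cd / && rm -rf "$dir"
```

The key "A 2 3" occurs in both files and is dropped even though the remaining columns differ; the purely numeric key "7 8 9" is handled just like the alphabetic ones. For more than 100 files, a wider glob such as *.txt in place of [12].txt should work the same way.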

Perfect and cool.

Made my weekend happy. Cheers!!!

You're welcome!

cat 1.txt

cat 2.txt

cat 3.txt

output.txt

Earlier I asked for the unique records only. Now I need the unique records as well as the duplicated ones: whether a record is present once or more than once across multiple files, it should be printed only once in my final output.

Thanks in advance.

I believe you're missing one line in your example output:

awk '!_[$0]++' [1-3].txt

I just added it.

Sorry about that.

If you want to restrict the uniqueness to certain columns only (1, 2 and 3 in this example):

awk '!_[$1, $2, $3]++' [1-3].txt

I owe you a lifetime. Thanks. So, the numbers in the brackets indicate a range and not just the file names. Did I get that right?

Also, what does this mean: !_

I know ! means not and ++ means increment.

How could you read my mind? That was my next question.

[1-3].txt is a shell glob pattern which expands to the filenames:

zsh-4.3.14[t]% printf '%s\n' [1-3].txt
1.txt
2.txt
3.txt
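As a rough illustration of the difference between listing characters and giving a range inside the brackets, the sketch below compares the set form [12].txt (as in the first awk command) with the range form [1-3].txt. The scratch directory and the empty files are created only for this demo.

```shell
#!/bin/sh
dir=$(mktemp -d)   # scratch directory, only for this demo
cd "$dir" || exit 1
touch 1.txt 2.txt 3.txt

set_match=$(echo [12].txt)     # a set: the single character 1 or 2
range_match=$(echo [1-3].txt)  # a range: any single character from 1 to 3

echo "$set_match"     # 1.txt 2.txt
echo "$range_match"   # 1.txt 2.txt 3.txt

cd / && rm -rf "$dir"
```

Either way the shell does the expansion before awk starts, so awk only ever sees the resulting list of file names.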

But what does !_ mean exactly?

_ is the name of a variable.

He's using it as an array, hence _[ ].

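The ! is logical negation: !_[$1, $2, $3] is true while _[$1, $2, $3] is still blank (zero), that is, the first time that key is seen. A true pattern with no action prints the line.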

The ++ increments the value at _[$1,$2,$3], so it's not blank anymore, meaning that further repeats of _[$1,$2,$3] won't be printed.
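The explanation above can be spelled out in a longer form. This is only a sketch: the array name seen and the three input lines are made up for illustration, and the long version is meant to behave like the one-liner, not to replace it.

```shell
#!/bin/sh
# Made-up input: the first and third lines share the same first three columns.
input='A 2 3 first
D 4 5 second
A 2 3 third'

# The one-liner from the thread, reading standard input.
short=$(printf '%s\n' "$input" | awk '!_[$1, $2, $3]++')

# The same logic written out step by step.
long=$(printf '%s\n' "$input" | awk '
{
  key = $1 SUBSEP $2 SUBSEP $3   # the same key that _[$1, $2, $3] builds
  if (seen[key] == 0)            # zero or unset: key not seen before
    print
  seen[key]++                    # remember the key for later lines
}')

printf '%s\n' "$short"
```

Both versions print the "first" and "second" lines and skip "third", because its key "A 2 3" has already been counted.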
