Deleting all occurrences of a duplicate row

ragavhere · July 10, 2008, 4:13am

Hi,

I need to delete all occurrences of repeated lines from a file and retain only the lines that are not repeated elsewhere in the file. As seen below, the first two lines are the same except for the strings "From BaseLine" and "From SMS". I shouldn't consider the strings "From SMS" and "From BaseLine" when checking for repeated lines. I want to retain only the third line.

From BaseLine - 0T001 000 999999999 00101 20080411000000T1023.27
From SMS - 0T001 000 999999999 00101 20080411000000T1023.27
From BaseLine - 0T001 000 999999999 00101 20080411000000T109.019

My output should be the third line alone.

The file sizes range from 100 MB to 900 MB, so performance should also be considered. Can you please help me out?

Regards,

Ragav.

radoulov · July 10, 2008, 6:22am

Use nawk or /usr/xpg4/bin/awk on Solaris:

awk -F- 'END {
  for (p in r)
    if (u[p] == 1)
      print r[p]
}
!u[$2] ++ {
  r[$2] = $0
}' input

ragavhere · July 10, 2008, 8:01am

Thanks. Can you please explain?

Regards,

Ragav.

radoulov · July 10, 2008, 8:06am

Which part of the code is not obvious?

ragavhere · July 10, 2008, 8:38am

Can you please explain the entire code???

Regards
Ragav

ghostdog74 · July 10, 2008, 8:55am

uniq -u -f 3 file

radoulov · July 10, 2008, 9:00am

OK.

awk -F- ...

Use '-' as a field separator.

The following expression/action pair is executed first, once for every input record:

!u[$2] ++ {
  r[$2] = $0
}

When the string in the second field is seen for the first time, the value of u[$2] in the associative array u is 0 (false in AWK) because of implicit variable initialization. In its general form, this AWK idiom is written as:

!array[key] ++

Which actually means:

array[key] ++ == 0

So, when !array[key]++ is true (the old value 0 is false, so !0 is true), the action runs: it fills another associative array r (r for record, because it holds the entire record), with $2 as the key and $0 as the value. Because the action only runs the first time a key is seen, r keeps one copy (the first one) of the record for each distinct $2, while the expression part, u[$2] ++, counts how many times each $2 value occurs.
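As a minimal sketch of the idiom on its own (the sample letters and the names seen and first_seen are made up for illustration), using it as a bare pattern with no action prints each distinct line only the first time it appears:

```shell
# Hypothetical sample input: "a" and "b" each appear twice.
first_seen=$(printf '%s\n' a b a c b |
  awk '!seen[$0]++')   # true only the first time a given line is seen

printf '%s\n' "$first_seen"
```

Note that this keeps one copy of every line, repeated or not. The script in this thread goes a step further and uses the final counts to discard repeated lines altogether.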

END {
  for (p in r)
    if (u[p] == 1)
      print r[p]
}

After all the input has been read, the END block is executed.
For every key (p) in the r array: if the value in the u array for the same key equals 1 (that is, the key occurred only once in the entire input), print the corresponding value of the r (record) array.

That's all.
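To see the whole thing at work, here is a small sketch that feeds the three sample lines from the question through the same program. The shell variable names (sample, unrepeated) are only illustrative, and the main rule is written before END purely for readability.

```shell
# The three sample lines from the question.
sample='From BaseLine - 0T001 000 999999999 00101 20080411000000T1023.27
From SMS - 0T001 000 999999999 00101 20080411000000T1023.27
From BaseLine - 0T001 000 999999999 00101 20080411000000T109.019'

unrepeated=$(printf '%s\n' "$sample" | awk -F- '
  !u[$2]++ { r[$2] = $0 }     # count every key; keep the first record per key
  END {
    for (p in r)
      if (u[p] == 1)          # key occurred exactly once in the whole input
        print r[p]
  }')

printf '%s\n' "$unrepeated"
```

Two things to keep in mind: the order of a for (p in r) loop is unspecified in AWK, so with several surviving lines the output order may differ from the input order, and -F- assumes the only hyphen in a record is the one after the "From ..." prefix.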

radoulov · July 10, 2008, 9:01am

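And the input should be sorted on the compared fields (the fourth field onward), since a plain sort would order the lines by the "From ..." prefix ... so ITYM:

sort -k4 input | uniq -u -f 3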

The benchmarking would be interesting.
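One detail is worth checking before benchmarking: uniq only compares neighbouring lines, and -f 3 makes it ignore the first three fields, so the sort has to bring lines together by the remaining fields. A rough sketch, assuming whitespace-separated fields and using a reordered copy of the sample in which the two matching records are not adjacent (the variable names are illustrative):

```shell
# Sample lines reordered so the two matching records are not adjacent.
shuffled='From BaseLine - 0T001 000 999999999 00101 20080411000000T1023.27
From BaseLine - 0T001 000 999999999 00101 20080411000000T109.019
From SMS - 0T001 000 999999999 00101 20080411000000T1023.27'

# Without sorting, no two neighbouring lines match, so nothing is dropped.
unsorted_result=$(printf '%s\n' "$shuffled" | uniq -u -f 3)

# Sorting on field 4 onward makes the matching records adjacent first.
sorted_result=$(printf '%s\n' "$shuffled" | sort -k4 | uniq -u -f 3)

printf 'without sort:\n%s\n' "$unsorted_result"
printf 'with sort -k4:\n%s\n' "$sorted_result"
```

On a real file, wrapping each pipeline and the awk command in time would give the comparison; the sort-based version needs far less memory than the two AWK arrays but pays for sorting the whole file.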