Removing duplicates except the last occurrence

Hi All,

I have a file like the one below:

@DB_FCTS\src\Data\Scripts\Delete_CU_OM_BIL_PRT_STMT_TYP.sql
@DB_FCTS\src\Data\Scripts\Delete_CDP_BILL_LBL_MSG.sql
@DB_FCTS\src\Data\Scripts\Delete_OM_BIDDR.sql
@DB_FCTS\src\Data\Scripts\Insert_CU_OM_LBL_MSG.sql
@DB_FCTS\src\Data\Scripts\Insert_CU_OM_BT_STMT_TYP.sql
@DB_FCTS\src\Data\Scripts\Insert_OM_BIL_T_ADDR.sql
@DB_FCTS\src\Data\Scripts\Delete_CU_OM_BIL_PRT_STMT_TYP.sql
@DB_FCTS\src\Scripts\MC400_PreDb_Script.sql
@DB_FCTS\src\Data\Scripts\Delete_OM_BIDDR.sql
@DB_FCTS\src\Data\Scripts\Delete_OM_BIDDR.sql

==========================================
In the file,
line 1 is repeated at line 7, and
line 3 is repeated at lines 9 and 10.

My requirement is to remove the duplicate lines and keep only the last occurrence of each.

The output should look like below:

@DB_FCTS\src\Data\Scripts\Delete_CDP_BILL_LBL_MSG.sql
@DB_FCTS\src\Data\Scripts\Insert_CU_OM_LBL_MSG.sql
@DB_FCTS\src\Data\Scripts\Insert_CU_OM_BT_STMT_TYP.sql
@DB_FCTS\src\Data\Scripts\Insert_OM_BIL_T_ADDR.sql
@DB_FCTS\src\Data\Scripts\Delete_CU_OM_BIL_PRT_STMT_TYP.sql
@DB_FCTS\src\Scripts\MC400_PreDb_Script.sql
@DB_FCTS\src\Data\Scripts\Delete_OM_BIDDR.sql

My environment details:

SunOS sasbsd27c1 5.10 Generic_150400-10 sun4u sparc SUNW,SPARC-Enterprise

Please suggest a script to achieve this; I have been trying since morning, but nothing has worked.

Thanks in advance

Please use code tags as required by forum rules!

How about looking into existing solutions on this site first (see bottom of page: More UNIX and Linux Forum Topics You Might Find Helpful):
http://www.unix.com/shell-programming-and-scripting/158497-removing-duplicates.html
http://www.unix.com/shell-programming-and-scripting/179947-help-removing-duplicates.html

etc ...

The suggested solutions remove subsequent duplicates, keeping the first instance.
The requirement here, keeping the last instance, is more complex.
The most compact solution is a Perl one-liner:

perl -ne '$s{$_}=++$i; if (eof()){print sort {$s{$a}<=>$s{$b}} keys %s}' file
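
Here $s{$_}=++$i overwrites the stored position every time a line reappears, so after the last input line each distinct line maps to its final (last) line number; the eof() block then prints the keys sorted by that number, which reproduces the original order of the surviving lines.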

Another one is awk | sort | cut:

awk '{ 
      x[$0] = NR
     }
 END {
      for ( l in x ) printf "%d\t%s\n", x[l], l
     }' file | sort -n | cut -f2-
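
The awk part records, for each distinct line, the line number of its last occurrence and prints number-tab-line pairs; sort -n then puts those last occurrences back into file order, and cut -f2- strips the prepended number again.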

Another, less efficient, solution would be tac | awk 'remove subsequent duplicates' | tac:

tac file | awk '!($0 in S) {print; S[$0]}' | tac
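
Note that tac is a GNU tool and may well be missing on SunOS 5.10. As an untested sketch only, the same idea should work with the Solaris tail -r reverse option, assuming your tail supports -r and the file is not too large for it:

tail -r file | awk '!($0 in S) {print; S[$0]}' | tail -r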

Doing it entirely in awk isn't that hard:

/usr/xpg4/bin/awk '
$0 in N {
	delete O[N[$0]]
}
{	N[$0] = NR
	O[NR] = $0
}
END {	for(i = 1; i <= NR; i++)
		if(i in O)
			print O[i]
}' file
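
Here N[$0] remembers the last line number at which each line was seen and O[NR] maps line numbers back to their text; when a line turns up again, its earlier entry is deleted from O, so the END loop prints only the surviving (last) occurrences, in their original order.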

And also:

awk 'NR==FNR{L[$0]=FNR; next} L[$0]==FNR' infile infile
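
This reads the same file twice: the first pass (NR==FNR) records the last FNR at which each line occurs, and the second pass prints a line only when the current FNR equals that stored value, i.e. at its last occurrence. Since the input has to be read twice, it will not work on a pipe or on standard input.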

Hi.

If you were to run out of memory, you could use the tac file | awk '!($0 in S) {print; S[$0]}' | tac pipeline posted by MadeInGermany.

Similar code in shell, with the filename in variable FILE:

nl $FILE |
tee f1 |
sort -k 2 -k 1,1rn |
tee f2 |
uniq --skip-fields=1 |
tee f3 |
sort -k 1,1n |
tee f4 |
sed 's/^.*\t//'

Line numbers are added, then the body is sorted on the line text, with a secondary reverse-numeric sort on the line number, so that within each group of duplicates the last occurrence comes first. GNU uniq allows fields to be skipped, so the comparison ignores the number and keeps that first (i.e. last-occurring) copy. The result is then sorted numerically, which restores the original order, after which the line number is stripped. Before stripping, this looks like:

     2	@DB_FCTS\src\Data\Scripts\Delete_CDP_BILL_LBL_MSG.sql
     4	@DB_FCTS\src\Data\Scripts\Insert_CU_OM_LBL_MSG.sql
     5	@DB_FCTS\src\Data\Scripts\Insert_CU_OM_BT_STMT_TYP.sql
     6	@DB_FCTS\src\Data\Scripts\Insert_OM_BIL_T_ADDR.sql
     7	@DB_FCTS\src\Data\Scripts\Delete_CU_OM_BIL_PRT_STMT_TYP.sql
     8	@DB_FCTS\src\Scripts\MC400_PreDb_Script.sql
    10	@DB_FCTS\src\Data\Scripts\Delete_OM_BIDDR.sql

Pipelines are useful for doing large-granularity parallel computing, and the pipes themselves do not touch the disk because they are simply memory buffers (typically 64 KiB). The tee commands in the above are only there so that the intermediate results can be inspected in f1 through f4.
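
As an untested adaptation for the OP's SunOS 5.10 box, where the GNU long option --skip-fields is unlikely to be available, the same pipeline without the tee taps could use the POSIX-style uniq -f option (the /usr/xpg4/bin versions of sort and uniq may be needed for -k and -f):

nl "$FILE" |
sort -k 2 -k 1,1rn |
uniq -f 1 |
sort -k 1,1n |
cut -f2-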

I have run across some uniq versions that keep the most recent version of a duplicate (Solaris, if memory serves).

This was done on:

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian 5.0.8 (lenny, workstation) 
bash GNU bash 3.2.39
nl (GNU coreutils) 6.10
sort (GNU coreutils) 6.10
uniq (GNU coreutils) 6.10
sed GNU sed version 4.1.5

Best wishes ... cheers, drl

If there are dupes, why does it matter which one is kept... the first one or the last one, etc.?

Sometimes the order matters.
For example, your input file is

1
2
3
2
8
5
3
4
5
9
6
7
8
9

Then it essentially becomes a funky sort scenario... assuming that the low values and their duplicates always come before the high ones.

Hi.

Observations, comments.

My earlier suggestion that the tac ... tac pipeline avoids holding the file in memory was a wrong assertion on my part: it still saves all the lines in memory (except duplicates).

@shamrock:
I am confused by your comment; in my solution, the line numbers are added specifically so that the ordering is preserved, as the OP requested. Do you have a non-memory solution that does not do something like that?

Best wishes ... cheers, drl

That means that the input data posted by "MadeInGermany" has a pattern: the low values and their duplicates always come before the high values and their duplicates. For example, the last "2" is far away from the first "2" but still comes before the last "3". And since the OP requested only the last of the dupes, the final output comes out sorted...

2
3
2
8
5
3

No, I don't... there is no such thing as a non-memory solution in the computing world. Data to be processed on disk is first brought into memory before it can be worked on: firstly because the processor can only address data that is in memory, not on disk, and secondly because the operations would otherwise be very slow...

Hi, Shamrock.

I think the OP meant that the last value of the duplicates was important because of its position in the file, not because it had some other magical properties; after all, they are duplicates.

When I wrote about memory, I was referring to all the solutions that keep all of the data in an array (except for the duplicates, of course). My solution was different in that nothing was kept in arrays like that; everything was pipelined, so there would be little or no risk of running out of memory (say, with awk's arrays).

Does this make my comments more clear? ... cheers, drl