Removing Dupes from huge file - awk/perl/uniq

Hi,

I have the following command in place

nawk -F, '!a[$1,$2,$3]++' file > file.uniq

It has been working perfectly as per requirements, removing duplicates based only on the first 3 fields. Recently it has started giving the error below:

bash-3.2$   nawk -F, '!a[$1,$2,$3]++' OTCTempD.dat > OTCTemp.uniq
nawk: symbol table overflow at 4044735840353890OTC
 input record number 5.42076e+07, file OTCTempD.dat
 source line number 1

More information:

  1. Number of records in the file:
bash-3.2$ cat OTCTempD.dat | wc -l
 179128368
  2. Size of the file:
-rw-r--r--   1 magt2    grip     7338355879 Apr 12 14:08 OTCTempD.dat

Contents of file:

a,b,c,2,2,3
a,b,c,1,2,3
a,b,E,1,2,3
a,b,c,1,2,3

Output should be:

a,b,c,2,2,3    //take first record only out of dupes 
a,b,E,1,2,3

Now how do I resolve this?
Does awk use some kind of memory, and is it exceeding its limit?
What would be the best approach to achieve the desired result?
Can't we use uniq directly to remove dupes based on criteria like this?

Kindly Suggest.

What's your OS? There are limitations, but they depend on which (n)awk you're using.

Is the file sorted?

Can't you use unique sort?

sort -ut, -k1,3 infile > outfile

File is not sorted

Version:
System = SunOS
Node = gmmagappu1
Release = 5.10
KernelID = Generic_144489-17
Machine = i86pc

---------- Post updated at 07:26 PM ---------- Previous update was at 07:25 PM ----------

This is giving a "No space left on device" error. I'll try this after clearing some space, but can you please explain how it actually works?

Shouldn't it be "-k1,2,3", since we consider the first three fields of a line to find dupes?

nawk does use memory for a hash table, and you are exceeding its limit. sort uses temporary files, which are written to whatever directory TMPDIR points to. Find a filesystem with free space and use a directory there that you have full access to; the files are temporary and exist only during sorting.

Scrutinizer's syntax is correct for the sort command you should use.
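
For example, a minimal sketch (the path below is just a placeholder; use any filesystem with enough free space):

export TMPDIR=/path/to/bigdisk/sorttmp      # hypothetical directory on a filesystem with room for the temp files
mkdir -p "$TMPDIR"
# -t,   : fields are comma-separated
# -k1,3 : one sort key running from field 1 through field 3 (so no need for -k1,2,3)
# -u    : output only one line per key
sort -ut, -k1,3 OTCTempD.dat > OTCTemp.uniq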

Using a2p to convert the awk to perl...

perl -e '$[=1; $FS=","; while(<>){ @Fld=split(/[,\n]/,$_,-1); print $_ if !$a{$Fld[1],$Fld[2],$Fld[3]}++ }' file > file.uniq
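
A side note: $[=1 (1-based array indexing) is deprecated in newer perls and no longer works in recent ones; a rough 0-based equivalent using autosplit would be something like the line below. It builds the same in-memory hash, though, so it can run into the same kind of limit as nawk on a file this size.

perl -F, -ane 'print unless $a{$F[0],$F[1],$F[2]}++' file > file.uniq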

The nawk in post #1 and the sort in post #2 give different results on my system. This is because the nawk always keeps the first of the duplicate key records but the sort selects a random one.

Shell tools were never designed to process multi-gigabyte files. Do you have a database engine and access to a programmer?


Yes, I see that behavior too... so the OP would have to stick with [n]awk to make sure that the first of the dupes is written out instead of a random one.

You can change the environment variable TMPDIR to point to a large temporary directory like jim mcnamara suggested... or use the -T option to point to a directory with sufficient space...:

sort -T /path/to/temporary/directory -ut, -k1,3 infile

I believe this works by first sorting on fields 1, 2 and 3 with a "stable sort" (keeping records with equal keys in the order in which they are read) and then discarding all duplicates according to fields 1, 2 and 3.

But it doesn't pick up and print the first dupe from the input file... instead it picks up and prints the first dupe from the intermediate sorted file... which isn't what the OP wants.

I believe it is not a regular sort, it is a stable sort. So the first dupe for each combination of fields 1, 2 and 3 of the (stable) sorted file would be the same as in the input file...

No it isn't... consider the input below...

a,b,E,10,2,3
a,b,c,2,2,3
a,b,c,0,2,3
a,b,c,2,2,3
a,b,c,1,2,3
a,b,E,1,2,3
a,b,c,1,2,3

The output should be...

a,b,E,10,2,3
a,b,c,2,2,3

but the sort command gives output as...

a,b,E,1,2,3
a,b,c,0,2,3

which isn't what the OP wants.

On my system I get:

$ sort -ut, -k1,3 infile
a,b,E,10,2,3
a,b,c,2,2,3

On a different platform it also works. I get:

$ sort -ut, -k1,3 infile
a,b,c,2,2,3
a,b,E,10,2,3

I also used to think that order would be changed, but later I found that on every system I tested this on, the order was always preserved.

What is your platform, shamrock?

S.

I ran your sort on an AIX and HPUX machine.


Thanks, that is very useful to know: this cannot be relied upon to work on every system. So performing a stable sort appears to be an extension and not standard behavior. I guess a possible indicator might then be whether a particular sort supports a "stable sort" option in the first place. For example, if this option is called "-s", then if this works:

sort -st, -k1,3 infile | awk -F, '{n=$1 FS $2 FS $3}p!=n;{p=n}'

then probably this works too:

sort -ut, -k1,3 infile
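
For reference, the first pipeline above expands to this commented version (same logic):

sort -st, -k1,3 infile | awk -F, '
    { n = $1 FS $2 FS $3 }    # build the key from the first three fields
    p != n                    # print the line whenever its key differs from the previous line's key
    { p = n }                 # remember the key for the next comparison
'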

I checked some man pages, and on systems that provide a stable "-u" it says:

-u      with -c, check for strict ordering; without -c, output only the first of an equal run

instead of something like this

-u      [..] If used with the -c option, check that there are no lines with duplicate keys, in addition to checking that the input file is sorted.

..

Just tested it on a Solaris 10 box with the same results... looks like the legacy Unix offerings don't have a stable sort, whereas Linux has it.

None of the legacy Unix flavors (HPUX, Solaris or AIX) have the "-s" switch, so it looks like Linux may have optimized it to spit out the first record of a group.

That must only be on Linux systems...

Yes, that is pretty much the case on all of AIX, Solaris and HPUX... so it looks like sort will break if the code is deployed on different platforms... unless they are all different flavors of Linux.

Not just Linux systems, but any system that uses GNU sort, so also FreeBSD and OS X and any system that has GNU sort installed. But it does not work with POSIX sort, so in conclusion it is not a solution for UNIX systems in general... Thanks again, Shamrock, for your effort...

So then I guess a portable solution would be to force a stable sort, for example like this:

nl -s, infile | sort -t, -k2,4 -k1 | cut -f2- -d, | awk -F, '{n=$1 FS $2 FS $3}p!=n;{p=n}'
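
Spelled out with comments (the same pipeline, with one small tweak: the nl field is compared numerically via -k1,1n, since with roughly 179 million input lines the space-padded line numbers would no longer collate correctly as plain text):

nl -s, infile |                              # prepend the input line number: "     1,a,b,c,2,2,3"
sort -t, -k2,4 -k1,1n |                      # sort on the original fields 1-3; ties broken by line number, i.e. input order
cut -f2- -d, |                               # strip the line number again
awk -F, '{n=$1 FS $2 FS $3} p!=n; {p=n}'     # print only the first line of each group of equal keys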