Nawk Problem - nawk out of space in tostring on

Abhiraj_Singh · February 17, 2014, 8:52am

Hi. I am running nawk scripts on a Solaris system to get the records of file1 that are not in file2, and to find duplicate records in a file, with the following scripts.

compare -

nawk 'NR==FNR{a[$0]++;next;} !a[$0] {print"line"FNR $0}' file1 file2

duplicate -

nawk '{a[$0]++}END{for(i in a){if(a[i]-1)print i,a[i]}}' file1

In the middle of the script I get an error message saying nawk: out of space in tostring on record 971360... I am using a file with 2 million records. Please suggest a fix; it is very important.

I searched and found that gawk can solve this, but it won't run on Solaris.

jim_mcnamara · February 17, 2014, 9:08am

Two suggestions -

gawk is available for Solaris 8->11 (Sharing /opt/csw - OpenCSW Solaris packages)

To find duplicates in a single file, use an online method:

  sort file | nawk 'NR==1 {old=$0; next}  {if (old==$0) {print $0}; old=$0}'
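To see what the pipeline above reports, here is a minimal sketch on made-up sample data. It assumes a POSIX awk is on the PATH as `awk` (substitute `nawk` on Solaris); the sample records are hypothetical. Each duplicated record is printed once per extra occurrence.

```shell
# Hypothetical sample: "a" appears twice and "b" three times.
dups=$(printf 'b\na\nb\nc\na\nb\n' | sort |
  awk 'NR==1 {old=$0; next}  {if (old==$0) {print $0}; old=$0}')

# Prints "a" once and "b" twice (one line per repeat after the first).
printf '%s\n' "$dups"
```

Only the previous record is held in memory, so awk itself stays small; the cost is moved to sort.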

Abhiraj_Singh · February 17, 2014, 9:54am

Thanks Jim. But I want to avoid using sort, as that would reorder my file and change the order in which the records are displayed. Is there any solution other than gawk? I don't have much control over my machine.

drl · February 17, 2014, 10:23am

Hi.
The nawk on the Solaris system I use doesn't like the code:

$ nawk 'NR==FNR{a[$0]++;next;} !a[$0] {print"line"FNR"$0}' file1 file2
nawk: syntax error at source line 1
 context is
        NR==FNR{a[$0]++;next;} !a[$0] >>>  {print"line"FNR"$0} <<< 
nawk: illegal statement at source line 1
        missing }

Possibly an un-matched double quote?

For this system:

OS, ker|rel, machine: SunOS, 5.10, i86pc
Distribution        : Solaris 10 10/08 s10x_u6wos_07b X86
bash GNU bash 3.00.16
nawk - ( /usr/bin/nawk, Jan 8 2007 )

I was trying to convert this to perl, which generally has better memory management.

cheers, drl

Abhiraj_Singh · February 17, 2014, 10:35am

Yes drl.. Double quote mismatch was there.. Please let me know if you can convert this to perl

drl · February 17, 2014, 10:48am

Hi.

Please post the corrected code ... cheers, drl

Abhiraj_Singh · February 17, 2014, 10:52am

Code is edited in the original post

drl · February 17, 2014, 11:17am

Hi.

Suppose this is in file p1:

#!/usr/bin/perl
eval 'exec /usr/bin/perl -S $0 ${1+"$@"}'
    if $running_under_some_shell;
			# this emulates #! processing on NIH machines.
			# (remove #! line above if indigestible)

eval '$'.$1.'$2;' while $ARGV[0] =~ /^([A-Za-z_0-9]+=)(.*)/ && shift;
			# process any FOO=bar switches

$, = ' ';		# set output field separator
$\ = "\n";		# set output record separator

line: while (<>) {
    chomp;	# strip record separator
    if ($. == ($.-$FNRbase)) {
	$a{$_}++;
	next line;
    }
    if (!$a{$_}) {
	print 'line' . ($.-$FNRbase) . $_;
    }
}
continue {
    $FNRbase = $. if eof;
}

then try running it on a sample of your data to be sure it seems to do the right thing. Supply file names as you did for nawk (or possibly with the order reversed).

The Solaris box I have seems to have omitted a2p, the processor that automates the work of converting awk to perl. This was done on:

OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian 5.0.8 (lenny, workstation) 
perl 5.10.0

Best wishes ... cheers, drl

Don_Cragun · February 17, 2014, 2:05pm

Have you tried using /usr/xpg4/bin/awk instead of nawk ? I don't remember if there is much difference between those two versions of awk on Solaris systems in the way they handle memory management, but it might be worth a try if the perl script doesn't work for you.

Akshay_Hegde · February 17, 2014, 2:09pm

Correct your code, your double quote is mismatched also..
To compare files

nawk 'NR==FNR{a[$0];next;} !($0 in a){print "line:" FNR $0}' file1 file2

for duplicate try this

nawk '{A[$0]++}END{for(i in A)if(A[i]>1)print i,A[i]}' file

!a[$0] --> referencing a[$0] while reading the second file creates an extra empty array element for every $0 that does not already exist in array a, so it is better to use !($0 in a)

I really don't trust a[$0] for comparisons; that is my personal experience.
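The difference can be checked with a small sketch that counts the array elements at END. It assumes a POSIX awk available as `awk` (use `nawk` on Solaris); the temporary files and their contents are hypothetical sample data.

```shell
# Hypothetical sample data: file1 has 2 records, file2 has 3 records not in file1.
tmp=$(mktemp -d)
printf 'x\ny\n' > "$tmp/file1"
printf 'p\nq\nr\n' > "$tmp/file2"

# Testing with !a[$0] references the element, which creates it if missing.
with_ref=$(awk 'NR==FNR{a[$0]; next} !a[$0]{m++} END{n=0; for(k in a) n++; print n}' "$tmp/file1" "$tmp/file2")

# Testing with ($0 in a) only looks the key up; nothing is created.
with_in=$(awk 'NR==FNR{a[$0]; next} !($0 in a){m++} END{n=0; for(k in a) n++; print n}' "$tmp/file1" "$tmp/file2")

echo "elements after !a[\$0]:     $with_ref"   # 5
echo "elements after !(\$0 in a): $with_in"    # 2
rm -rf "$tmp"
```

With a 2 million record file, those extra empty elements add to the memory awk has to hold, so the `in` test is the cheaper choice.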

AbelLuis · February 17, 2014, 2:56pm

jim mcnamara:

Two suggestions -

gawk is available for Solaris 8->11 (URL OMITTED)

to find duplicates in a single file use an online method
  sort file | nawk 'NR==1 {old=$0; next}  {if (old==$0) {print $0}; old=$0}'

If you want to keep track of the original order, you can do something like this:

  nawk '{ print $0, length($0), NR }' file | sort | nawk 'NR==1 {old=substr($0, 1, $NF-1); next}  {if ( old==substr($0, 1, $NF-1) ) {print $0, $NF; old=substr($0, 1, $NF-1)}'

Just append the length and the original line number to each record, then sort; the last two fields are then used to rebuild the original record and to display its original position.

Greetings!

drl · February 17, 2014, 3:00pm

Hi.

That's a good point. The lengths are certainly different here:

$ ls -lig /usr/bin/awk /usr/bin/nawk /usr/xpg4/bin/awk /usr/bin/oawk
      7456 -r-xr-xr-x   2 bin        80184 Jan  8  2007 /usr/bin/awk
      7500 -r-xr-xr-x   1 bin       110100 Jan  8  2007 /usr/bin/nawk
      7456 -r-xr-xr-x   2 bin        80184 Jan  8  2007 /usr/bin/oawk
     35654 -r-xr-xr-x   1 bin        66816 Oct 10  2007 /usr/xpg4/bin/awk

but I have no idea about the internals. Note that oawk ("old awk") and awk are the same binary.

This system:

OS, ker|rel, machine: SunOS, 5.10, i86pc
Distribution        : Solaris 10 10/08 s10x_u6wos_07b X86

Best wishes ... cheers, drl

AbelLuis · February 17, 2014, 3:40pm

nawk '{ print $0, length($0), NR }' file | sort | nawk 'NR==1 {old=substr($0, 1, $(NF-1)); next}  {if ( old==substr($0, 1, $(NF-1)) ) {print old, $NF}; old=substr($0, 1, $(NF-1))}'

Just correcting a mistake: the penultimate field is $(NF-1), not $NF-1; the latter is the value of the last field minus one.
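A quick way to see the two expressions side by side, assuming a POSIX awk available as `awk` (use `nawk` on Solaris); the sample line is made up for illustration.

```shell
# Hypothetical record whose last two fields are 10 and 7.
line='alpha beta 10 7'

# $(NF-1): the field at position NF-1, i.e. the penultimate field.
penultimate=$(printf '%s\n' "$line" | awk '{ print $(NF-1) }')

# $NF-1: parsed as ($NF)-1, the value of the last field minus one.
last_minus_one=$(printf '%s\n' "$line" | awk '{ print $NF-1 }')

echo "\$(NF-1) = $penultimate"      # 10
echo "\$NF-1   = $last_minus_one"   # 6
```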