Validating a file based on conditions

trichyselva · December 31, 2008, 6:10am

I have a file on Unix in which the records look like this:

aaa 123 233
aaa 234 222
aaa 242 222
bbb 122 111
bbb 122 123
ccc 124 222

In the output I want only the records below:

aaa
ccc

The validation logic considers the 1st and 2nd columns.
If records share the same 1st-column value but their 2nd-column values all differ,
then the 1st-column value needs to be picked up.

If two records match in both the first and the second column, those records need to be dropped.

Please let me know how to do this validation.

radoulov · December 31, 2008, 8:32am

Use nawk or /usr/xpg4/bin/awk on Solaris:

awk 'END {
  for (_ in u) if (u[_])
    print _
}
{
  u[$1] = k[$1,$2]++ ? x : 1
}' infile

Christoph_Spohr · December 31, 2008, 10:03am

Hi radoulov,

Another one of your astonishing awk scripts. Could you go into
some detail about how it works?

Regards

Chris

radoulov · December 31, 2008, 12:12pm

Astonishing :),
thank you!

I'll try to explain.
Code followed by comments.

{
  u[$1] = k[$1,$2]++ ? x : 1
}

While reading the input, build an associative array named u (unique) keyed by $1. The values are built/chosen based on the following expression:

k[$1,$2]++ ? x : 1

If the value of the auto-incremented associative array k, built en passant with $1 SUBSEP $2 as its keys, is different from 0 (i.e. true in boolean context), the pair has already been seen (remember Ed Morton's !arr[val]++?). The post-increment returns the count from before the current record, so it is 0 the first time a pair appears.
If the pair has been seen, return and assign the value of the variable x (never used, so auto-initialized to null, which is 0 in numeric context and false in boolean context; if I had written 0, it would have been clearer :)). Otherwise return and assign the value 1 (the opposite of the previous case).
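
To see the post-increment test in isolation, here is a minimal sketch run on a few of the sample records from the first post. The function name classify_pairs, the array name seen, and the printed labels are made up for illustration and are not part of the original script.

```shell
# classify_pairs is a hypothetical helper used only for this illustration.
classify_pairs() {
  awk '{
    # seen[$1,$2]++ returns the count from before this record:
    # 0 (false) the first time a pair shows up, non-zero (true) afterwards.
    if (seen[$1,$2]++)
      print $1, $2, "repeated"
    else
      print $1, $2, "first"
  }'
}

printf '%s\n' 'aaa 123 233' 'bbb 122 111' 'bbb 122 123' | classify_pairs
```

The second bbb 122 record is the only one reported as repeated, which is exactly the case where the original expression assigns x instead of 1.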

END { 
  for (_ in u) if (u[_])
      print _
    }

After reading all the input, print only those u keys whose values are true when evaluated in boolean context (that is, those equal to 1).
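
One edge case worth noting: because u[$1] is reassigned on every record, a key whose repeated pair is followed by a new second-column value (for example bbb 122, bbb 122, bbb 333) ends up with u[bbb] set back to 1 and is printed. The sample data in this thread does not trigger it. Below is a rough sketch of a variant that keeps the flag once it is set, assuming that is the behaviour wanted. The names unique_keys, seen, bad, first and order are illustrative only, and the keys are printed in first-seen order rather than in the unspecified order of for (key in array).

```shell
# unique_keys is a hypothetical helper; it reads records from stdin or from files given as arguments.
unique_keys() {
  awk '{
    # Remember each first-column value in the order it first appears.
    if (!($1 in first)) { first[$1] = 1; order[++n] = $1 }
    # A repeated (column 1, column 2) pair flags the key permanently.
    if (seen[$1,$2]++) bad[$1] = 1
  }
  END {
    for (i = 1; i <= n; i++)
      if (!(order[i] in bad)) print order[i]
  }' "$@"
}

printf '%s\n' 'aaa 123 233' 'aaa 234 222' 'aaa 242 222' \
              'bbb 122 111' 'bbb 122 123' 'ccc 124 222' | unique_keys
```

On the sample records from the first post this prints aaa and ccc, one per line.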

Happy holidays!

trichyselva · January 2, 2009, 12:36am

Hi,
thanks for the response.
Actually, as in my message, if there are 3 records like

aaa 123 233
aaa 234 222
aaa 242 222

then only ONE aaa needs to be printed,
but the output is showing all 3 values.

Actually, my input file will contain nearly 10 fields, each separated by a pipe symbol.
Will this solution work for that (by replacing k[$1,$2]++ with all the fields, like $3...), or do I have to use another approach?
I have to consider only the first 2 fields for validation; the remaining fields I can leave as they are.

Awaiting your reply.

Thanks

summer_cherry · January 2, 2009, 2:35am

Hi, you may try the Perl script below:

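#!/usr/bin/perl
use strict;
use warnings;

# %hash maps column 1 to a space-delimited list of the column 2 values seen so far, or to 'DUP'
my %hash;
open my $fh, '<', 'a.txt' or die "Cannot open a.txt: $!";
while (<$fh>) {
	my @tmp = split ' ', $_;
	next if @tmp < 2;    # skip blank or short lines
	if (!exists $hash{$tmp[0]}) {
		$hash{$tmp[0]} = " $tmp[1] ";
		next;
	}
	next if $hash{$tmp[0]} eq 'DUP';
	# Surrounding spaces and \Q...\E make this a whole-value match, so 122 does not match inside 1122
	$hash{$tmp[0]} = ($hash{$tmp[0]} =~ m/ \Q$tmp[1]\E /) ? 'DUP' : $hash{$tmp[0]} . "$tmp[1] ";
}
close $fh;
# Keys are sorted only to make the output order predictable
print join("\n", grep { $hash{$_} ne 'DUP' } sort keys %hash), "\n";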

trichyselva · January 2, 2009, 3:24am

Hi,
I have to use a shell script.
Please suggest some logic in shell script itself;
I haven't used Perl.

Thanks

radoulov · January 2, 2009, 3:27am

Is this the correct output, or am I missing something?

% cat file
aaa 123 233
aaa 234 222
aaa 242 222
% awk 'END { 
  for (_ in u) if (u[_])
  print _
}
{
  u[$1] = k[$1,$2]++ ? x : 1
  }' file
aaa

If your fields are separated by |, use awk -F\| ...

If you post a sample of your real data, it would be easier to help.
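
As a rough illustration of the -F suggestion, the sketch below runs the same script on pipe-separated records. The records are made up, since no real data was posted, and the function name unique_keys_piped is hypothetical. Only the first two fields take part in the check, so the remaining fields can stay as they are. The output goes through sort because the order of for (_ in u) is unspecified.

```shell
# unique_keys_piped is a hypothetical wrapper; it expects pipe-separated records on stdin.
unique_keys_piped() {
  awk -F'|' 'END {
    for (_ in u) if (u[_])
      print _
  }
  {
    # Only fields 1 and 2 are compared; any further fields are ignored.
    u[$1] = k[$1,$2]++ ? x : 1
  }' | sort
}

# Made-up sample records with four pipe-separated fields each.
printf '%s\n' 'aaa|123|x|y' 'aaa|234|x|y' 'bbb|122|x|y' 'bbb|122|z|w' 'ccc|124|x|y' |
  unique_keys_piped
```

With these records, bbb is dropped because the pair bbb|122 occurs twice, and aaa and ccc are printed.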

methyl · January 2, 2009, 7:51am

(Post withdrawn)