finding first instance

I have a file with 3 columns. Often the first two will hold the same values and the third may differ. I only want one row for each.

i.e.

aaa,bbb,1
aaa,bbb,2
aaa,bbb,3
xxx,yyy,1
xxx,yyy,2
zzz,rrr,5

Should be

aaa,bbb,1
xxx,yyy,1
zzz,rrr,5

I just want the first instance of each pairing of column 1 and 2 - and its third-column value - and ignore any subsequent rows with the same first and second column pairing.

Do you need a shell script that would do this process when given this file as an input???

-Nisha

This should do it

#!/bin/sh
toggle=0
field1=""
field2=""
lastfield1=""
lastfield2=""
exec < inputfile
exec > outputfile
while read line ; do
        echo "$line" > linetemp
        field1=`awk 'BEGIN {FS=","} {print $1}' linetemp`
        field2=`awk 'BEGIN {FS=","} {print $2}' linetemp`

        # new col1/col2 pairing - remember it and allow one print
        if [ "$field1" != "$lastfield1" -o "$field2" != "$lastfield2" ]
        then
                lastfield1=$field1
                lastfield2=$field2
                toggle=0
        fi

        # only the first line of each pairing gets echoed
        if [ $toggle -eq 0 ]
        then
                echo "$line"
                toggle=1
        fi
done
rm linetemp

Sorry about the indenting if it's messed up, I haven't quite figured out how to keep it when I paste into this.

Moderator's note: I took the liberty of adding the code tag to fix the indenting. - Perderabo

Here is a solution using Perl and a hash. This Perl routine will work even if the data is not in sorted order.

open(INFILE, "myFile.txt") || die "$!";

my %dataHash;

while ($inputLine = <INFILE>) {
  chomp($inputLine);
  my ($column1, $column2, $column3) = split(/,/, $inputLine);
  my $key = "$column1,$column2";

  # remember only the first third-column value seen for each col1,col2 pair
  if (!exists($dataHash{$key})) {
    $dataHash{$key} = $column3;
  }
}
close(INFILE);

foreach $outputLine (sort(keys(%dataHash))) {
  print "$outputLine,$dataHash{$outputLine}\n";
}

Well I've found the answer from a colleague ..... so I'll share.....for anyone who's interested.

nawk -F\| ' { if ( $1 != field1 || $2 != field2 )
{ print $1 "," $2 "," $3 ; field1 = $1; field2 = $2 }
} ' TMPSERVICE_mappings > SERVICE_mappings

Hey Peter,

I am unable to execute the script that you have sent....
:o

Help me!!!!

Thanks,
Nisha

P.S. I too tried writing a script for this problem, but couldn't really succeed. I would like to post it and ask doubts about it..

Otherwise, I have worked with similar types of records, but with two different files... long ago.

I assume you have nawk in your path? (/usr/bin/nawk usually).

If not, it should be fine with awk.

If you want to see why it's not working, break it down....
you need a file with pipe '|' delimiters
you need a file with at least 3 fields
you need an input file (mine is TMPSERVICE_mappings)
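
For example, with a small pipe-delimited test file, the run would look something like this (same one-liner as before, just typed at the prompt):

$ cat TMPSERVICE_mappings
aaa|bbb|1
aaa|bbb|2
xxx|yyy|1
zzz|rrr|5
$ nawk -F\| '{ if ($1 != field1 || $2 != field2) { print $1 "," $2 "," $3; field1 = $1; field2 = $2 } }' TMPSERVICE_mappings
aaa,bbb,1
xxx,yyy,1
zzz,rrr,5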

That's all you should need. What's the problem with your nawk script?

Yeah, now it's working, Peter... The problem was that the file I used, going by your first post, had "," as the delimiter...

But now, as per your reply, I changed it to "|" and it's working fine..

My awk is this... See if you can throw some light on it..

BEGIN {
    max=11                          # max width of the record
    while (getline > 0)
    {
        line=substr($0,0,max);
        firstcol=substr(line,0,3);
        secondcol=substr(line,5,2);
        thirdcol=substr(line,6,1);
        print (firstcol secondcol thirdcol);
    }
}

In this above piece of code, firstcol, secondcol and thirdcol respectively extract the three columns of the file. After that, I need a comparison between the fields, and I need to use a loop.
And here is where I am stuck.

Perhaps you can execute this code and see if you can help me.

Thanks,
Nisha

This will do what you need in less code - I assume you want this data from every row in the file?

awk ' { max=11; line=substr($0,0,max) ;
firstcol=substr(line,0,3) ; secondcol=substr(line,5,2); thirdcol=substr(line,6,1) ;
print firstcol secondcol thirdcol} '
input_filename > output_filename

So... that will allow you to get the 6-character string... what are you wanting to do with it after you have it?

Compare between rows - or between columns in a row?

I want to compare columns between rows... something like this:
compare the first col of the first and second row, and the second col of the first and the second row. If they are the same, then take the first occurrence of the entire row...

Am I right...????

Thanks,
Nisha

Okay...so if I have this right, basically the same as the original question I asked. This will give you the entire row where col1 and col2 together are unique.

i.e.

a,3,xxx
a,3,yyy
a,4,ccc
a,5,ccc
a,5,fff

Will give you:

a,3,xxx
a,4,ccc
a,5,ccc

awk '
{
max=11;
line=substr($0,0,max) ;
firstcol=substr(line,0,3) ;
secondcol=substr(line,5,2);
thirdcol=substr(line,6,1)
}
{
if ( firstcol != field1 || secondcol != field2 )
{ print firstcol secondcol thirdcol ; field1 = firstcol; field2 = secondcol }
}
' input_file > output_file

Basically field1 and field2 start empty, get set with the first pairing of col1 and col2 - and the 3 cols are printed. Then it compares with the next row, and where either is different it prints the three cols.

I haven't tried this code but it should be good.

Hey Peter,

With a small modification, my code also works here... :slight_smile:

But I have very important doubts to be cleared.....

Here is the code...

{
    max=11;
    line=substr($0,0,max) ;
    firstcol=substr(line,0,3) ;
    secondcol=substr(line,5,3);
    thirdcol=substr(line,9,1)

    if ( firstcol != field1 || secondcol != field2 )
    { print firstcol "," secondcol "," thirdcol; field1=firstcol; field2=secondcol }
}

And I executed it as awk -f cmp_rec test.dat > log

Of course, I can execute it as you said too....

First let me clarify with you whether my understanding is right...

It is a col-to-col comparison, right??

for example,

this input:

aaa|bbb|1
aaa|bbb|2
xxx|yyy|3
xxx|yyy|4
zzz|rrr|5

would give me:

aaa|bbb|1
xxx|yyy|3
zzz|rrr|5

Because it compares aaa and bbb of the first row with aaa and bbb of the second row...

If the first col and the second col of the first row are not equal to the first and second col of the second row, then that first instance is printed..

Now coming to the doubt,

In the above code, why should I assign

field1=firstcol; field2=secondcol

Are these field1 and field2 key fields???? Meaning, are they system-defined???

Can you explain???

Thanks,
Nisha

You understand it perfectly...and I would guess you could see the importance of sorting the file before you start this too.

field1 and field2 are just variables (call them anything you want). At the start of the awk they are empty, so for the very first row of the file when the firstcol and secondcol are compared to them they won't be equal ...therefore triggering the columns to be printed AND triggering the field1 and field2 to be set with the values of firstcol and secondcol respectively.

With field1 and 2 now holding the values you just printed - when you compare to the next line you're looking to see if the new firstcol is different to the variable field1 OR if the new secondcol is different to the variable field2.

If they are BOTH the same - then the if statement will fail and not execute the print, then awk will move onto the next row and compare again. The if statement will continue to fail until it finds a situation where col1 and col2 do not match the variables - meaning that you have got a different combination - and should print those columns - and again update the variables with the last match.

In logic terms it's simply

x=''
y=''
if (firstcol is not = x OR second col is not = y ) then print your columns, and make x = firstcol and y = secondcol.
Then repeat for the next line.
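
In awk terms, a minimal sketch of that loop would be something along these lines (pipe-delimited input assumed; f1, f2 and the file names are just placeholders):

awk -F'|' '
$1 != f1 || $2 != f2 {      # a new col1/col2 combination
    print                   # keep this first row of the group
    f1 = $1; f2 = $2        # remember the pair for the next comparison
}' input_file > output_file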

so for the example again....

this input:

aaa|bbb|1
aaa|bbb|2
xxx|yyy|3
xxx|yyy|4
zzz|rrr|5

would give me:

aaa|bbb|1
xxx|yyy|3
zzz|rrr|5

first time through - (field1 and field 2 are empty)
aaa != field1(empty) and bbb != field2(empty) => so PRINT and make field1 = aaa and field2 =bbb
second row-
aaa = field1(aaa) and bbb = field2(bbb) so do nothing and move onto next row.
third row...
xxx !=field1(aaa) and yyy != field2(bbb) => so PRINT and make field1 = xxx and field2 =yyy

etc. etc. etc.

Yeah, perfect... As you said, the sorting is assumed by default; otherwise the results would be ambiguous..

But did you notice something? Unlike other languages, in awk there is no need to declare a variable - field1 and field2 are an example - and no initialization is required either....
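
For example, an unset awk variable compares equal to both the empty string and zero, so no declaration or initialization is needed:

awk 'BEGIN { if (x == "" && x == 0) print "x starts out as empty/zero" }'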

Too good a program... In the program that I posted, I need to change the max variable every time the width of the record changes... and so do the offsets for firstcol, secondcol and thirdcol...

So I think the one that you sent would be the more efficient one..
Thanks a lot for your time and explanation...

Thanks,
Nisha
:stuck_out_tongue:

You don't really need the max at all. It is effectively doing nothing in your script anyway, as you are only taking substrings from 1-9. If there's more than your 'max' it won't grab them, and if there's less than max you've not achieved anything by having it. So just skip it altogether.

Of course if your offsets are changing ...that's a hassle for you.

instead of these 3 lines....
max=11;
line=substr($0,0,max) ;
firstcol=substr(line,0,3) ;

just use => firstcol=substr($0,0,3) ;

etc.
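
Putting that together, the whole thing would shrink to something like this (an untested sketch using the offsets from your modified script; input_file and output_file are just placeholders):

awk '
{
    firstcol  = substr($0, 1, 3)    # characters 1-3
    secondcol = substr($0, 5, 3)    # characters 5-7
    thirdcol  = substr($0, 9, 1)    # character 9
    if (firstcol != field1 || secondcol != field2) {
        print firstcol "," secondcol "," thirdcol
        field1 = firstcol; field2 = secondcol
    }
}' input_file > output_file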

Hey, yes Peter... it's working perfectly...

Then does this $0 contain the whole record??? Or

Does it point to the starting position of the record???

Thanks,
Nisha

Ha!

My perl code doesn't require the input to be sorted!

[Dances around the office doing the "Nanny Nanny Boo Boo" Song]

Actually, that's not entirely true, as long as you don't care about the last parameter. It might be out of sync due to not being sorted. :smiley:

Nisha, $0 is the entire record.
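
A quick illustration of the difference between $0 and the individual fields (just a throwaway example):

echo "aaa|bbb|1" | awk -F'|' '{ print "whole record: " $0; print "first field: " $1 }'
# whole record: aaa|bbb|1
# first field: aaa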

Aus, Here's my equivalent in awk of your hash in perl. (nah nah na nah nah) - look mum - no sort! The awk associative array (hash). With a few more bells and whistles. ( This one is in use).

/^[^|]*\|[Ss][Ee]/ {
    key = $5 $6; edd = x[key];
    if ( edd !~ /^[0-9][0-9][0-9][0-9]$/ )
        { edd = substr($18,5,2) substr($18,3,2) ; x[key] = edd } ;
    print $1,$2,$3,$4,$5,edd,$6,$7,$8,$9,$13,$14,$15,$16,$17,$18
}
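
For the original comma-separated problem, the same associative-array idea boils down to a one-liner - just a sketch with placeholder file names; seen[] remembers which col1/col2 pairs have already been printed:

awk -F',' '!seen[$1 FS $2]++' input_file > output_file

No sort needed, and the first occurrence of each pair comes out in its original file order.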

<KungFuMovie>Huh? You may have defeated me with your Southern Hemisphere AwkFu coding style, but my Northern Hemisphere PerlFu coding style is easier to read! I'll get you next time! Ha!Ha!Ha!</KungFuMovie>

I'll remember this next time I need to do some slicing and dicing in awk. I usually save the simple list processing for awk and the more logic-ridden problems for perl. I have gotten a lot of good stuff out of awk in the past.

Hey Pete,

Now what????? I thought I had it all clarified, and you have inserted new stuff, and as auswipe says, it's tough to decipher...

You both have to take the responsibility of explaining it to me :wink:

Lots of learning.. uhhhhhhhhhhhhh