awk, conditional involving line and column

ariesto · February 25, 2012, 8:19pm

Dear All,

I need your help managing a research data file.

For example, I have

data1.txt :

type of atoms z vz
Si 34 54
O 20 56
H 14 13
Si 40 17
O 65 18
H 70 19
Si 24 20
H 85 21
O 90 12
Si 12 34

I want to use awk to extract the O and H lines whose z value is bigger than the maximum z value of Si (40),
so that I get a data file, say data2.txt:
type of atoms z vz
O 65 18
H 70 19
H 85 21
O 90 12

thank you in advance
with best regards,

---------- Post updated at 08:19 PM ---------- Previous update was at 08:13 PM ----------

Sorry, one more thing: I also need the opposite one.
data3.txt:

Si 34 54
O 20 56
H 14 13
Si 40 17
Si 12 34

codemaniac · February 25, 2012, 10:16pm

Hello ariesto,

Can you try the command line below?

maxSi=`awk '$1 ~/Si/{if(max < $2){max = $2}}END{print max}' data1.txt`;awk -v val=$maxSi '$1 ~ /O/ || $1 ~ /H/{if ($2 > val){print $0}}' data1.txt

agama · February 25, 2012, 10:17pm

I think this will do what you want. Reads data1.txt and generates data2.txt and data3.txt. I was a bit confused with your example of data3 -- I assumed you meant that it should have anything not written to data2.txt.

awk '
    NR == 1 { print >"data2.txt"; print >"data3.txt"; next; }   # needed if there is a header line
    {
        if( $1 == "Si"  &&  $2+0 > simax )
            simax = $2 + 0      # take note of max value 

        capture[++idx] = $0;   # capture for output lines
        type[idx] = $1;            # save type and value for easy check at end
        value[idx] = $2+0;
    }

    END {
        for( i = 1; i <= idx; i++ )     # for each input line we saw put it someplace based on the z value 
        {
            dest = ((type[i] == "O" || type[i] == "H") && value[i] > simax ) ? "data2.txt" : "data3.txt";
            print capture[i] >dest;
        }
    }
' data1.txt

codemaniac · February 25, 2012, 10:20pm

A shorter version

 
awk '$1 ~/Si/{if(max < $2){max = $2}} $1 ~ /O/ || $1 ~ /H/{if ($2 > max){print $0}}'

agama · February 25, 2012, 10:36pm

@codemaniac: your shorter version doesn't work for the case below:

type of atoms z vz
Si 34 54
O 35 56
H 14 13
Si 40 17
O 65 18
H 70 19
Si 24 20
H 85 21
O 90 12
Si 12 34

The first O value is larger than the previous Si, but not larger than the maximum Si value in the file; your script will print it because it's larger than the previously observed Si value.
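
The reason is that awk tests every pattern-action pair against each record as it is read, so the second block runs before the maximum is known. One way around this is to name the input file twice and use the NR == FNR test, which is true only while the first copy is being read. The sketch below is only an illustration: it assumes the input is a regular file that can be read twice, and sample.txt is a made-up name holding the data above.

```shell
# Hypothetical sample file reproducing the counterexample above.
cat > sample.txt <<'EOF'
type of atoms z vz
Si 34 54
O 35 56
H 14 13
Si 40 17
O 65 18
H 70 19
Si 24 20
H 85 21
O 90 12
Si 12 34
EOF

# The file is named twice, so awk reads it twice.
# NR == FNR holds only during the first read.
result=$(awk '
    FNR == 1 { next }                 # skip the header line on both passes
    NR == FNR {                       # pass 1: record the largest Si z value
        if ($1 == "Si" && $2+0 > simax) simax = $2+0
        next
    }
    ($1 == "O" || $1 == "H") && $2+0 > simax   # pass 2: print qualifying O/H lines
' sample.txt sample.txt)
printf '%s\n' "$result"
```

With this data only the four O and H lines with z above 40 are printed; "O 35 56" is not, because the maximum is already known when the second pass starts.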

codemaniac · February 25, 2012, 10:53pm

Thanks, Agama, for pointing this out. I was under the illusion that if there are two body {} blocks in an awk program, the second one starts only after the first one has parsed all the records of a file.

Can you shed some more light on how awk processes multiple body {} blocks?

agama · February 25, 2012, 11:12pm

I don't want to hijack this thread, so I'll just post a link to a pretty decent on-line overview of awk. If you still have questions, create a thread in shell programming and someone will be eager to expand further.

Awk - A Tutorial and Introduction - by Bruce Barnett

ariesto · February 26, 2012, 8:20am

@agama: thanks a lot, and sorry for causing confusion.
What you gave is exactly what I want.
May I ask for more explanation of this part: " $2+0 > simax "?
Why do we need to write +0, and what does it mean?

agama · February 26, 2012, 12:07pm

No problems; I find myself often missing little things and assuming incorrectly, so it can work both ways!

The $2+0 forces the value to be interpreted/stored as a number rather than a string. It's a habit that I've gotten into (not sure why I didn't do it on the value assignment) which prevents odd issues when comparing values. This practice stems from my experience with very early versions of awk; awk currently distributed with Solaris (not nawk) still exhibits the need for this in certain circumstances.

Consider a data file with colons used as field separators and the input line:
11 : 2

And the small awk programme:

awk -F : '{ if( $1 > $2 ) print "true" }' input-file

With most (all?) modern awk implementations, running the programme results in 'true' being written to stdout. However, older awks interpret the fields as strings because of the white space, and '11' in field 1 does not evaluate larger than '2' as might be expected. To prevent this, adding zero to the variable forces awk to convert it to a numeric value and there aren't any surprises. The programme

awk -F : '{ if( $1+0 > $2+0 ) print "true" }' input-file

works as expected when executed with an older version of awk; even when the input data has whitespace.

This practice probably isn't needed with modern implementations of awk, and even with older versions the conditions must be 'just right' to trigger the need. However, I've been bitten enough times (and spent countless hours tracking down the cause of the odd logic issues this causes) to err on the side of adding zero to variables pulled from the input fields, like I did in the programme I posted.
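
The difference between the two kinds of comparison can be seen without any old awk at hand. The sketch below uses string constants rather than input fields, which is my own simplification, because string constants are always compared as strings; it illustrates the idea and does not reproduce the old-awk field behaviour.

```shell
# String constants are compared character by character, and "1" sorts before "2".
as_string=$(awk 'BEGIN { if ("11" > "2") print "true"; else print "false" }')

# Adding zero converts both operands to numbers, so the test becomes 11 > 2.
as_number=$(awk 'BEGIN { if ("11"+0 > "2"+0) print "true"; else print "false" }')

echo "string comparison:  $as_string"
echo "numeric comparison: $as_number"
```

The first test reports false and the second true, which is the same surprise that adding zero guards against when a field ends up being treated as a string.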

pandeesh · February 26, 2012, 1:22pm

Good explanation Agama!

ariesto · February 28, 2012, 7:35pm

Sorry for asking a (probably) silly question. Is "simax" a general term for the maximum value of anything, or was it chosen because my question is about the maximum value for the Si type of atom, so that another name could be used?

---------- Post updated at 07:35 PM ---------- Previous update was at 07:20 PM ----------

Another question: what if the type of atom is also indicated by a number?
type z vz
2 34 54
1 20 56
3 14 13
2 40 17
1 65 18
3 70 19
2 24 20
3 85 21
1 90 12
2 12 34
So should this condition: if( $1 == "Si" && $2+0 > simax ) be changed to
if( $1 == "2" && $2+0 > simax ) or
if( $1 == 2 && $2+0 > simax )?
thank you very much

agama · February 28, 2012, 8:56pm

I chose simax as it represented the maximum value observed for Si.

Best to use the latter. I would code

if( $1+0 == 2 && $2+0 > simax )

Adding zero to the value forces awk to treat it as a numeric rather than a string. Probably not significant in your case, but a habit of mine to prevent surprises with old versions of awk.
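
With type codes written exactly as 1, 2 and 3, all three forms of the test match the same lines. They can differ when the code is written another way. The sketch below assumes a hypothetical input line in which the type appears as 2.0, purely to show the difference.

```shell
line='2.0 34 54'   # hypothetical line: the type code is written as 2.0

# Comparing with the string "2" is a string comparison: "2.0" is not "2".
str_cmp=$(printf '%s\n' "$line" | awk '{ print (($1 == "2") ? "match" : "no match") }')

# Comparing a numeric-looking field with the number 2 is a numeric comparison.
num_cmp=$(printf '%s\n' "$line" | awk '{ print (($1 == 2) ? "match" : "no match") }')

# Adding zero makes the numeric intent explicit.
add_zero=$(printf '%s\n' "$line" | awk '{ print (($1+0 == 2) ? "match" : "no match") }')

echo "\$1 == \"2\" : $str_cmp"
echo "\$1 == 2   : $num_cmp"
echo "\$1+0 == 2 : $add_zero"
```

Only the string form fails to match here, which is one more reason to prefer a numeric test when the type is a number.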