Parse by string and difference in substring?

dcfargo · July 1, 2008, 12:01pm

I have a big list like the following:

apple X:5_yes_a
apple X:12_no_b
apple X:45_yes_a
apple X:100_no_b
banana X:7_yes_a
banana X:13_yes_a
banana X:42_no_a
cat X:42_no_b
cat X:77_yes_d

I'd like to parse the file so that for each $1 value I return only the lines in which the number in $2, after the ":" and before the first "_", is more than 10 greater than the number on the previous line.

e.g.

apple X:5_yes_a
apple X:45_yes_a
apple X:100_no_b
banana X:7_yes_a
banana X:42_no_a
cat X:42_no_b
cat X:77_yes_d

So for each new $1 I'd like to print the line and then for each identical $1 I'd like to only print the line if the substring value in $2 between ":" and the next "_" for line X+1 - line X is > 10.

I guess the last line won't have an 'X+1' line???

It might be easiest to split $2 up first??

I have no idea where to start.

radoulov · July 1, 2008, 12:09pm

With AWK (using nawk or /usr/xpg4/bin/awk on Solaris):

awk -F'[:_]' '$2>s+10||!_[$1]++;{s=$2}' file

dcfargo · July 1, 2008, 12:15pm

Wow!

Thank you, thank you, thank you.

dcfargo · July 1, 2008, 12:21pm

Can you explain to me how that script works?

Thanks again

radoulov · July 1, 2008, 12:44pm

Sure.
First, split the record on the field separator (FS) so that the number ends up in $2:

awk -F'[:_]' ...
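To see what that split produces, here is a small sketch that prints every field of one sample line from the question. Note that with this FS the field numbering differs from the default whitespace split: $1 becomes "apple X" and $2 is the bare number. The variable names line and fields are only for illustration.

```shell
# One sample line from the question
line='apple X:5_yes_a'

# Split on ":" or "_" and print each field with its number
fields=$(printf '%s\n' "$line" |
  awk -F'[:_]' '{ for (i = 1; i <= NF; i++) printf "$%d=[%s]\n", i, $i }')

printf '%s\n' "$fields"
```

This should show four fields: "apple X", "5", "yes" and "a".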

For each record evaluate the following expression:

$2>s+10||!_[$1]++

The first part is easy: it tests whether the current $2 is greater than the previous one + 10. The variable s is set in the action ({s=$2}), which runs after the expression has been tested, so during the test s still holds the value from the previous record. !array[string]++ is a common AWK idiom: it is true only the first time string is seen. It may be easier to understand like this:

!_[string]++ is equivalent to _[string]++==0
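A quick way to try the idiom on its own is to use it to drop repeated lines. This is only a sketch: the array name seen and the input words are made up for the example.

```shell
# Keep only the first occurrence of each line, in first-seen order
uniq_lines=$(printf '%s\n' apple banana apple cat banana | awk '!seen[$0]++')

printf '%s\n' "$uniq_lines"
```

The pattern is true only the first time a given line appears, so the repeated "apple" and "banana" are skipped.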

AWK treats uninitialized variables and array elements as the empty string (in string context) or 0 (in numeric context). The first time the array element is post-incremented, the expression yields its old value, 0, so the negation is true (see AWK associative arrays for more info on this). The || operator is logical OR.
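One caveat about ||: AWK short-circuits it, so when $2>s+10 is already true, !_[$1]++ is never evaluated and the counter for that $1 is not incremented. With the sample data this never matters. But if a new $1 starts with a number more than 10 above the last number of the previous group, the second line of that group still counts as "first seen" and gets printed regardless of the difference. Below is a spelled-out sketch that avoids this by always running the first-seen test. The names filter_jumps, seen, first and prev are mine, not part of the original one-liner.

```shell
filter_jumps() {
  awk -F'[:_]' '
    {
      # True only for the first line of each $1; evaluated on every record
      first = !seen[$1]++
      # Print a new key, or a jump of more than 10 over the previous line
      if (first || $2 > prev + 10) print
      # Remember this number for the next record
      prev = $2
    }
  ' "$@"
}
```

Usage is the same as before, for example filter_jumps file. On the sample list from the question it should print the same seven lines as the one-liner.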

Hope this helps.