I have a big list like the following:
apple X:5_yes_a
apple X:12_no_b
apple X:45_yes_a
apple X:100_no_b
banana X:7_yes_a
banana X:13_yes_a
banana X:42_no_a
cat X:42_no_b
cat X:77_yes_d
I'd like to parse the file so that for each $1 value I return only the lines in which the number in $2 (after the ":" and before the "_") is more than 10 greater than the previous line's value.
e.g. the desired output would be:
apple X:5_yes_a
apple X:45_yes_a
apple X:100_no_b
banana X:7_yes_a
banana X:42_no_a
cat X:42_no_b
cat X:77_yes_d
So for each new $1 I'd like to print the line, and for each following line with the same $1 I'd like to print it only if its number in $2 (the substring between ":" and the next "_") exceeds the previous line's number by more than 10.
I guess the last line won't have an 'X+1' line?
It might be easiest to split $2 up first?
I have no idea where to start.
With AWK (using nawk or /usr/xpg4/bin/awk on Solaris):
awk -F'[:_]' '$2>s+10||!_[$1]++;{s=$2}' file
Can you explain to me how that script is working?
Thanks again
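Sure.
First, the record is split on the field separator (FS), here a colon or an underscore, so that the number ends up in $2 (and $1 becomes everything before the colon, e.g. "apple X"):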
awk -F'[:_]' ...
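To see what that field separator does to a single record, here is a small sketch. The function name show_fields is made up for illustration; it just prints every field awk sees after splitting on ":" or "_".

```shell
# Hypothetical helper: list the fields awk produces when FS is the
# bracket expression [:_], i.e. a colon or an underscore.
show_fields() {
    awk -F'[:_]' '{
        for (i = 1; i <= NF; i++)
            printf "$%d = [%s]\n", i, $i
    }'
}

printf 'apple X:5_yes_a\n' | show_fields
```

For the sample line this should report $1 as "apple X" and $2 as "5", so the name together with the X acts as the grouping key and $2 is the number being compared.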
For each record evaluate the following expression:
$2>s+10||!_[$1]++
The first part is easy: it tests whether the current $2 is greater than the previous one + 10. The variable s is set in the action (s=$2), which runs after the test, so when the expression is evaluated s still holds the value from the previous record. !array[string]++ is a common AWK idiom: it is true only the first time string is seen. It may be easier to understand like this:
!_[string]++ is equivalent to _[string]++ == 0
AWK auto-initializes variables to the empty string (in string context) or 0 (in numeric context). When the associative array element is post-incremented for the first time, its value is 0, so the negation is true (see AWK associative arrays for more info on this). The || operator is logical OR, so a line is printed when either test is true.
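A minimal sketch of the idiom on its own, plus one detail that may be worth checking on real data. The function and variable names (first_seen, big_jumps, seen, prev) are hypothetical. Because || stops evaluating as soon as its left side is true, the one-liner skips the counter increment whenever $2>s+10 is already true. The variant below is an assumption about the intended behaviour, not the original script: it puts the first-seen test on the left so the counter is updated for every record.

```shell
# Hypothetical helper names; both functions read records from stdin.

# The bare idiom: keep only the first line for each value of $1.
first_seen() {
    awk '!seen[$1]++'
}

# Assumed variant of the one-liner with the first-seen test on the left,
# so seen[$1] is incremented for every record. In the original order,
# || short-circuits when $2 > s+10 is true and the counter is skipped.
big_jumps() {
    awk -F'[:_]' '
        !seen[$1]++ || $2 > prev + 10   # no action: the default is to print
        { prev = $2 }                   # remember this value for the next record
    '
}

printf 'cat X:42_no_b\ndog X:60_yes_a\ndog X:62_no_a\n' | big_jumps
```

On that three-line input the variant prints the cat line and the first dog line only. With the original operand order the second dog line (62) would be printed too, because the dog counter is never incremented when 60 > 42 + 10 is true. On the sample list from the question both orders give the same result.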
Hope this helps.