Fuzzy sequence match in a text file

Hi Forum:

I have struggled with this and ended up eyeballing the file to get it done.

Basically, I am looking for sequences of dates inside a file.
If a date in a sequence repeats 2-3 times, or the sequence skips a day once, it should still be considered a match.

input text file:

Sep 6 A
Sep 6 A
Sep 10 A
Sep 7 B
Sep 8 B
Sep 9 B
Sep 10 B
Sep 11 B
Sep 7 C
Sep 7 C
Sep 7 C
Sep 11 C
Sep 8 D
Sep 9 E
Sep 7 F
Sep 8 F
Sep 9 F
Sep 10 F
Sep 11 F
Sep 7 G
Sep 8 G
Sep 9 G
Sep 10 G
Sep 7 H
Sep 8 H
Sep 9 H
Sep 7 I
Sep 8 I
Sep 8 I
Sep 9 I
Sep 10 I
Sep 7 J
Sep 7 J
Sep 7 J
Sep 9 J
Sep 10 J

Desired filtered output:

Sep 7 B
Sep 8 B
Sep 9 B
Sep 10 B
Sep 11 B
Sep 7 F
Sep 8 F
Sep 9 F
Sep 10 F
Sep 11 F
Sep 7 G
Sep 8 G
Sep 9 G
Sep 10 G
Sep 7 H
Sep 8 H
Sep 9 H
Sep 7 I
Sep 8 I
Sep 8 I
Sep 9 I
Sep 10 I
Sep 7 J
Sep 7 J
Sep 7 J
Sep 9 J
Sep 10 J

Cheers!!
Chirish.

Try this:

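awk '
{ min=day                        # a repeated day is still part of the run
  max=skip?day+1:day+2           # one skipped day is allowed, but only once
  if($1==mth && $2+0>=min && $2 <=max) {
    if($2+0>min)diff++           # count distinct days in the run
    skip=skip||$2+0==day+2       # latch: the single skip has been used
    day=$2+0
    out=start out"\n"$0          # buffer the run (start holds its first line)
    start=""
    next
  }
  if(diff>2) printf "%s\n",out   # run ended: print it if it has 3 or more distinct days
  mth=$1                         # begin a new run at the current line
  start=$0
  day=$2+0
  diff=1; skip=0; out="" }
END {if(diff>2) printf "%s\n",out}' infile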

Edit: renamed variables for more clarity
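For anyone who wants the same idea spelled out more slowly, here is a rough sketch of how the logic above could be written with longer names. The function name filter_runs and the awk names emit, ndays, skipped, limit and buf are my own, not from the original answer. Like the original, it only looks at the month and day columns and never compares the label in the third column.

```shell
# Hypothetical wrapper; reads "Mon DD label" lines on stdin.
filter_runs() {
  awk '
    # Print the buffered run if it covered more than two distinct days.
    function emit() { if (ndays > 2) print buf }
    {
      d = $2 + 0
      limit = skipped ? day + 1 : day + 2   # at most one skipped day per run
      if (NR > 1 && $1 == mth && d >= day && d <= limit) {
        if (d > day) ndays++                # repeated days do not count again
        if (d == day + 2) skipped = 1       # remember that the skip was used
        day = d
        buf = buf "\n" $0
        next
      }
      emit()                                # run is broken: print or discard it
      mth = $1; day = d; ndays = 1; skipped = 0; buf = $0
    }
    END { emit() }
  '
}
```

It would be called as filter_runs < infile. On the sample data in the question it should keep the same groups (B, F, G, H, I, J) as the one-liner.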

Wow, Chubler_XL, I stand in awe... After thirty years or so using Unix, Linux, and awk (among others, see, this is my work AND my hobby too), I am completely stupefied at:

skip=skip||$2+0==day+2

and

max=skip?day+1:day+2

Please, don't get me wrong: it is amazing to copy your code, paste it into my terminal and see the expected output... It is like hearing "Bazinga!" in the background!
Would you please give us mere mortals a bit of explanation?

Apologies, this site seems focused on the smallest number of characters, not clarity.

if(skip != 0 || $2+0 == day + 2) skip=1
 
# or
if (skip == 0) {
    if ($2+0 == day + 2) skip =1
}
# and for the max line:
if (skip == 0) max=day+2; else max=day+1

I don't follow the logic behind

if(skip != 0 || $2+0 == day + 2) skip=1

# or
if (skip == 0) {
    if ($2+0 == day + 2) skip =1
}

being equivalent. The first statement reads "if ((skip is not 0) OR (($2+0) equals (day + 2)))", while your second expression reads "if (skip equals 0) THEN if (($2+0) equals (day + 2))", so it is NOT equivalent to the first one. In the first statement, "(skip is not 0)" alone is enough to make "skip=1", but in the second expression this is not true. As I see it, the second expression is equivalent to "if ((skip equals 0) AND (($2+0) equals (day + 2)))", which is clearly different from the first statement.

hexram is correct. Those two expressions are not logically equivalent (you cannot manipulate one into the form of the other using boolean identities/properties/theorems). They behave differently when skip is not zero. However, the difference in behavior does not affect the outcome if the non-zero value is 1.
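To make the difference concrete, here is a small sketch that tabulates the terse assignment and the two expanded forms quoted above for skip = 0, 1, 2 and for the condition being false or true. The function name compare_forms and the column names are made up for this illustration, and cond stands in for $2+0==day+2.

```shell
# Hypothetical helper: prints one row per (skip, cond) combination.
compare_forms() {
  awk 'BEGIN {
    print "skip cond terse form1 form2"
    for (s = 0; s <= 2; s++)
      for (c = 0; c <= 1; c++) {
        terse = s || c                               # skip=skip||cond
        form1 = s; if (s != 0 || c) form1 = 1        # first expanded form
        form2 = s; if (s == 0) { if (c) form2 = 1 }  # second (nested) form
        print s, c, terse, form1, form2
      }
  }'
}
```

Running compare_forms shows all three columns agreeing in the rows where skip starts as 0 or 1. In the two rows where skip starts as 2, the nested form leaves the 2 in place while the other two yield 1, which is still "true" for any later test of skip.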

Regards,
Alister

Thanks for the clarification, Alister.

In this implementation, skip is supposed to be a boolean variable (it only ever holds 0 or 1).

I use this "locked on" expression with booleans a lot in programming, and using the var=var?var:expr one-liner method is almost unconscious. So while explaining what is really happening, it's quite easy to forget about any illegal (neither 0 nor 1) values the flag could have.
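As a minimal sketch of that "locked on" idiom outside the date filter: the flag below turns on the first time a value exceeds a limit and then stays on for every later line. The function name latch_demo, the variable seen and the limit argument are invented for this example.

```shell
# Hypothetical demo: prints each input number with the state of the latch.
latch_demo() {
  awk -v limit="${1:-5}" '{
    seen = seen ? seen : ($1 + 0 > limit)   # var=var?var:expr, stays set once true
    print $1, seen + 0
  }'
}
```

For example, printf '%s\n' 1 7 2 | latch_demo 5 prints "1 0", "7 1" and "2 1": the 2 on the last line is below the limit, but the flag was already locked on by the 7.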