Search for sequential pattern

cedenker · September 22, 2016, 8:47pm

input file:

(input file can be many millions of lines long)

I want to search the example input file above, and when I find 4 sequential rows with values of 1,2,3,4 return those values and the two previous ones.
In this case it should return

1,A,1,2,3,4

I know this can be done on various platforms, but I'd like to use awk in this case. I'm fairly certain I'll end up using a six element array, but y'all will probably figure this out before I do. Thanks in advance, brain too old to figure this stuff out anymore...

---------- Post updated at 07:47 PM ---------- Previous update was at 04:47 PM ----------

I started down the path of using grep to pull out the rows that I need, 2 before the match and 3 after the match. I was going to simplify the match to finding only the first entry that I needed, and filter the extra ones out later. After that it was a simple matter of formatting. That is, until the case where we had overlapping matches, like so.

Say I'm looking for rows with 1,2,3,4 - then I was only going to grep on "1", and extract the leading and following rows. Even if I got a lot of entries that were not a perfect match, I could easily filter those out. Here is the case that ruined it.

The grep will misbehave because it refuses to grep the value "1" more than once. In this case the "1" relates to the before part of one selection, and the after part of another, and it only reports it once. So unless there is a way of telling grep to not do this, can't use grep....
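
For anyone following along, here is a small sketch of that behaviour. The input is made up for illustration (it is not the data from the post above), and it assumes a grep that supports the -A and -B context options, such as GNU or BSD grep. The two "1" lines are close enough that their context windows overlap, and grep prints each line in the overlap only once.

```shell
# Hypothetical input: the lines equal to "1" are rows 3 and 5.
# Context for row 3 is rows 1-6, context for row 5 is rows 3-8; grep merges them.
out=$(printf '%s\n' a b 1 2 1 2 3 4 z | grep -x -B2 -A3 '1')
printf '%s\n' "$out"
```

The merged output holds rows 1 through 8 exactly once, so it cannot simply be cut into fixed six-line groups, which is the problem described above.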

Chubler_XL · September 22, 2016, 10:35pm

How about this using awk

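awk '
  {A=B; B=C; C=$0}                                  # keep the last three lines: C is current, B and A precede it
  N==5 { print F " found from row " NR-6 ; exit}    # 1..4 all matched on earlier lines: report and stop
  N&&$0==N { F=F","N++; next}                       # line equals the next expected value: append it, expect N+1
  $0==1 { F=A","B",1";N=2;next}                     # a "1" starts a run: save the two previous lines, expect 2
  {N=x}                                             # anything else: reset (x is unset, so N becomes empty)
' infile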

cedenker · September 22, 2016, 11:02pm

I can sort of follow this. Is it hardcoded to use "1,2,3,4" for the search criteria? Or at least 4 sequential numbers?
I need to have a little flexibility in selecting the 4 values to search for (I used 1,2,3,4 just as an oversimplified example).

I have confirmed that it works great for 1,2,3,4.......

Thanks for the first response!

Chubler_XL · September 22, 2016, 11:19pm

If you are looking for different strings (not "1" thru "4") a slightly different solution is required:

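awk '
  BEGIN{ L=split("one,two,three,four", M, ",") }           # M[1..L] holds the strings to find, in order
  {A=B; B=C; C=$0}                                         # keep the last three lines: C is current, B and A precede it
  N==L+1 { print F " found from row " NR-L-2 ; exit}       # every string matched on earlier lines: report and stop
  N&&$0==M[N] { F=F","M[N++]; next}                        # line is the next expected string: append it and advance N
  $0==M[1] { F=A","B","M[1];N=2;next}                      # first string seen: save the two previous lines, expect M[2]
  {N=x}                                                    # anything else: reset (x is unset, so N becomes empty)
' infile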

This version now searches for "one", "two", "three" and then "four", and can easily be converted to search for your list of specific strings. The split command builds an array M[], which is used to match each line in turn.

cedenker · September 23, 2016, 1:16am

initial test works fine. Let me add some of the other things I oversimplified into the script and see if I can break it. Thanks!

---------- Post updated 09-23-16 at 12:16 AM ---------- Previous update was 09-22-16 at 11:04 PM ----------

I should have made this part of the initial requirement, but thought I could add it in myself after the original problem was solved. I can't wrap my head around what the script is actually doing, so I can't really add to it, unfortunately.

The additional requirement is as follows.
Extra column in the input file.

1  cow
2  bird
3  horse
4  one
5  two
6  three
7  four
8  fff

The additional output would be the value in column 1 for the first row that is printed. In this case the output (looking for one,two,three,four) should be:

2, bird,horse,one,two,three,four

So I understood enough to read $2 instead of $0, and the script works the same now, just basically ignoring the first of the two input columns. I'm assuming all we need is a 2nd array to store the first column values, updating itself at the same time the 1st array updates. Then when it comes time to print out, just print the first array element of the 1st column.

I should have included this in the initial requirement, sorry about that....
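
A minimal sketch of that idea, for what it's worth. It keeps column 1 and column 2 of the two previous rows in plain variables rather than a second array. The function name find_run and the variable names are made up for this example, and it assumes the keywords are all distinct, so that a broken run can simply restart at the first keyword.

```shell
# find_run is a hypothetical helper: $1 = comma-separated keywords, $2 = input file
find_run() {
  awk -v pat="$1" '
    BEGIN { L = split(pat, M, ",") }            # M[1..L] = keywords in order
    {
      if (n > 0 && $2 == M[n+1]) {              # run continues with the next keyword
        hit = hit "," $2
        n++
      } else if ($2 == M[1]) {                  # first keyword: start (or restart) a run
        hit = id2 "," w2 "," w1 "," $2          # column 1 of two rows back, then both previous words
        n = 1
      } else {
        n = 0                                   # run broken
      }
      if (n == L) { print hit; n = 0 }          # all keywords seen on consecutive rows
      id2 = id1; w2 = w1                        # shift the two-row history
      id1 = $1;  w1 = $2
    }
  ' "$2"
}

tmp=$(mktemp)
printf '%s\n' '1 cow' '2 bird' '3 horse' '4 one' '5 two' '6 three' '7 four' '8 fff' > "$tmp"
result=$(find_run "one,two,three,four" "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

On the sample above this prints 2,bird,horse,one,two,three,four (without the space after the first comma). Because the line is printed as soon as the last keyword is read, a match that ends on the final row of the file is still reported.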

rovf · September 23, 2016, 3:01am

Wouldn't it be easier using grep?

For instance (assuming every line consists of exactly one character, as in your example, and that the line terminator is just a newline character), the following command would work:

grep -zo '....1.2.3.4' your_data.txt

RavinderSingh13 · September 23, 2016, 4:52am

Hello cedenker,

Let's say our Input_file is as follows; I am assuming that the strings one, two, etc. can appear in any order.

cat Input_file
1 cow
2 bird
3 horse
4 one
5 two
6 three
7 four
8 fff
9 one
10 two
11 one
12 two
13 one
14 two
15 three
16 four
11 one
12 two
13 three
14 one

Then the following code can be used.

awk 'BEGIN{num=split("one,two,three,four", A,",");for(i=1;i<=num;i++){B[A[i]]=i}} {;while(($2 in B) && ++e == B[$2]){A[FNR]=$2;W=W?W OFS $2:$2;getline;};A[FNR]=$2;if(e>=4){print FNR-6,A[FNR-6],A[FNR-5],W};e=W=""}' OFS=,   Input_file

Output will be as follows.

2,bird,horse,one,two,three,four
11,one,two,one,two,three,four

EDIT: Adding a non-one liner form of solution too now.

awk 'BEGIN{
                num=split("one,two,three,four", A,",");
                for(i=1;i<=num;i++){
                                        B[A[i]]=i
                                   }
          }
          {;
                while(($2 in B) && ++e == B[$2]){
                                                        A[FNR]=$2;
                                                        W=W?W OFS $2:$2;
                                                        getline;
                                                };
                A[FNR]=$2;
                if(e>=4){
                                print FNR-6,A[FNR-6],A[FNR-5],W
                        };
                e=W=""
          }
    ' OFS=,   Input_file

This enforces the rule that the strings one, two, three and four must appear on consecutive lines; a run that stops before all 4 have matched is not printed. Please let us know how it goes and whether this helps you.
EDIT2: Improving the above code by no longer filling array A inside the while loop.

awk 'BEGIN{num=split("one,two,three,four", A,",");for(i=1;i<=num;i++){B[A[i]]=i}} {A[++q]=$2;while(($2 in B) && ++e == B[$2]){;W=W?W OFS $2:$2;getline;};if(e>=4){print FNR-6,A[q],A[q-1],W};e=W=""}' OFS=,   Input_file
####OR a non-one liner form of solution too as follows.
awk 'BEGIN{
                num=split("one,two,three,four", A,",");
                for(i=1;i<=num;i++){
                                        B[A[i]]=i
                                   }
          }
          {
                A[++q]=$2;
                while(($2 in B) && ++e == B[$2]){;
                                                        W=W?W OFS $2:$2;
                                                        getline;
                                                };
                if(e>=4){
                                print FNR-6,A[q],A[q-1],W
                        };
                e=W=""
          }
    ' OFS=,   Input_file

Thanks,
R. Singh

RudiC · September 23, 2016, 6:30am

How about

awk  -v PAT="two,three,four" '                  # define search pattern
BEGIN   {KC = split(PAT, T, ",")                # fill temp array with keywords from pattern, find key count
         RB = KC + 2                            # compute ring buffer size
        }
        {A[NR%RB] = $1                          # fill ring buffer for line number
         B[NR%RB] = $2                          # fill ring buffer for keywords
         if ($2 == T[CNT+1])    CNT++           # if keyword match found: advance to next keyword
         else                   CNT = 0         # else start again from first keyword

         if (CNT == KC) {printf "%s,", A[(NR+1)%RB]
                                                # BINGO! ALL keywords in a row! print line number

                         for (i=1; i<=2; i++) printf "%s,", B[(NR+i)%RB]
                                                # print two preceding words

                         print PAT              # print the search words 
                         CNT = 0                # and start again from first keyword
                        }
        }
' file
3,horse,one,two,three,four
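
One edge case that may be worth checking: when a run breaks on a line that is itself the first keyword, the else branch sets CNT to 0, so that line is not counted as the start of a new run. The sketch below is a possible variation, not a tested replacement: it uses the same ring buffer but restarts at 1 in that situation. The sample input is made up, and the keywords are assumed to be distinct.

```shell
# Hypothetical input: "two" appears twice in a row before "three", "four".
out=$(printf '%s\n' '1 cow' '2 bird' '3 two' '4 two' '5 three' '6 four' |
awk -v PAT="two,three,four" '
BEGIN { KC = split(PAT, T, ","); RB = KC + 2 }    # ring buffer: keywords plus two preceding rows
{
  A[NR%RB] = $1                                   # ring buffer for column 1
  B[NR%RB] = $2                                   # ring buffer for column 2
  if ($2 == T[CNT+1]) CNT++                       # next keyword found
  else CNT = ($2 == T[1])                         # restart at 1 if this row is the first keyword, else 0
  if (CNT == KC) {
    printf "%s,", A[(NR+1)%RB]                    # oldest slot: two rows before the first keyword
    for (i = 1; i <= 2; i++) printf "%s,", B[(NR+i)%RB]
    print PAT
    CNT = 0
  }
}')
printf '%s\n' "$out"
```

With this input the variation reports the run in rows 4 to 6 as 2,bird,two,two,three,four, whereas resetting CNT to 0 on row 4 would report nothing.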

RudiC · September 23, 2016, 6:36am

And, if you want a dynamic number of preceding lines, try

awk  -v PRC=4 -v PAT="two,three,four" '         # define search pattern
BEGIN   {KC = split(PAT, T, ",")                # fill temp array with keywords from pattern, find key count
         RB = KC + PRC                          # compute ring buffer size
        }
        {A[NR%RB] = $1                          # fill ring buffer for line number
         B[NR%RB] = $2                          # fill ring buffer for keywords
         if ($2 == T[CNT+1])    CNT++           # if keyword match found: advance to next keyword
         else                   CNT = 0         # else start again from first keyword

         if (CNT == KC) {printf "%s,", A[(NR+1)%RB]
                                                # BINGO! ALL keywords in a row! print line number, -

                         for (i=1; i<=PRC; i++) printf "%s,", B[(NR+i)%RB]
                                                # print PRC preceding words

                         print PAT              # print the search words 
                         CNT = 0                # and start again from first keyword
                        }
        }
' file