Extracting sequential pattern

fuzzi · September 14, 2016, 1:49am

Hi,

Can someone advise/help me on how to write a script to extract sequential lines? I was able to find and get a script working that creates permutations of the inputs, but that is not what I want/need.

awk 'function perm(p,s,     i) {
       # append every p-item sequence (repeats allowed) from A[1..n] to prefix s
       for(i=1;i<=n;i++)
         if(p==1)
           printf "%s%s\n",s,A[i]
         else
           perm(p-1,s A[i]", ")
     }
     {
       A[++n]=$1
     }
     END{
       perm(n)
     }' infile

Unfortunately, I don't understand the script well enough to modify it (not for lack of trying). I need to extract runs of 2 to 5 sequential lines/words.

An illustration of what I need is as follows:

Eg.

inputfile.txt:

A
B
C
D
E
F
G

outputfile.txt:

A B
B C
C D
D E
E F
F G
A B C
B C D
C D E
D E F
E F G
A B C D
B C D E
C D E F
D E F G
A B C D E
B C D E F
C D E F G

rovf · September 14, 2016, 2:31am

I don't see the point in taking from somewhere a program, which is completely unrelated to your problem, and hoping that it will mysteriously turn into a correct run.

You should at least show some effort, for example by outlining the idea for an algorithm to do this task. At this point, this has nothing to do with shell programming; it's just about programming in general. Once this is done, we can discuss how to turn the algorithm into, say, a shell script. In that case, don't forget to indicate whether you are looking for a solution in some particular shell (bash, ksh, zsh, POSIX shell, ...), or whether any shell would be fine as long as it gets the job done.

BTW, while it is for sure possible and not too hard to write the whole thing in shell language, I would probably use a more convenient programming language, such as Ruby or Perl. In the end, it's a matter of taste.

fuzzi · September 14, 2016, 3:25am

lol. Thanks for the advice. Truth be told, I do not know where to start, but the code I put up was the best I could find that does something similar. That's why I wanted to start from there.

Preferably I would use awk/grep, as I used them to clean up the data, but if (as you mentioned) I might be able to crack this faster in Perl, then it's time to brush up on my Perl.

---------- Post updated at 03:25 PM ---------- Previous update was at 02:39 PM ----------

awk 'NR%3{printf "%s ",$0;next}{print;}' infile

The code above allows me to 'combine' 3 sequential lines.
Can I extend this to make it iterative?

rovf · September 14, 2016, 3:46am

It seems that you are already struggling with the *algorithm*, and not with the implementation.

Here is how I would approach the problem:

What you basically have is an ordered list of items (A B C D E ...), and you want to generate all consecutive runs from it. For example, B C D is such a run. Also, you don't consider a single element (D) by itself as a "run".

If you think of the whole list as an array, a run can be represented by two array indices which are different and where the first index is lower than the second. In the example above, the run B C D can be represented (assuming that we start counting array indices at 1) by the pair (2,4), because B is the second and D is the fourth element.

Since it is trivial to generate the actual list of elements when you have the index pair as described above, your problem boils down to generating all such index pairs.

Assuming that your array contains N elements and N>1, you generate all pairs matching the above restrictions by two nested loops. Without focusing on a particular programming language, the algorithm can be sketched as

    for i from 1 to N-1
      for j from i+1 to N
        generate run (i,j)

Of course you need to build your array first, but as this doesn't require a clever algorithm, I left out this part for brevity.
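For illustration only, here is a minimal Python sketch of the nested-loop idea above. The function name `index_runs` is made up for this example and is not part of the thread. Note that, like the pseudocode, it yields every run of two or more items, grouped by starting position; it does not apply the 2-to-5 length limit asked for in post #1.

```python
def index_runs(items):
    """Yield every consecutive run of two or more items.

    Each run corresponds to an index pair (i, j) with i < j,
    using 1-based indices as in the pseudocode.
    """
    n = len(items)
    for i in range(1, n):              # i from 1 to N-1
        for j in range(i + 1, n + 1):  # j from i+1 to N
            # run (i, j) covers elements i..j inclusive;
            # Python slices are 0-based and exclude the end index
            yield items[i - 1:j]


if __name__ == "__main__":
    for run in index_runs(["A", "B", "C", "D"]):
        print(" ".join(run))
```

With the pair (2,4) from the example, the slice `items[1:4]` picks out B C D.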

Now you have the algorithm, and you can implement it in any language of your choice, so the next step would be to choose the language. You *can* do it in awk, in the same way as you can fetch one bottle of milk from the grocer round the corner by using a truck, but there are plenty of languages around, and maybe the choice of language is also influenced by what you are going to do with the data afterwards.

I personally would do it in Ruby or in Zsh, but others might consider Python or ksh or C++ or LISP instead. Pick a language that you are familiar with, or one that you are eager to learn.

RudiC · September 14, 2016, 3:46pm

rovf's discussion couldn't entice you to come up with some ideas of your own? Pity ...

For exactly the problem given in post#1, try

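awk '
        {T[NR] = $1}                    # store the first field of every line

END     {for (i=2; i<=5; i++)           # i = run length, 2 to 5
           for (j=1; j<=NR-i+1; j++)    # j = index of the first line in the run
             {for (k=j; k<j+i; k++) printf "%s ", T[k]
              printf RS
             }
        }' file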
A B 
B C 
C D 
D E 
E F 
F G 
A B C 
B C D 
C D E 
D E F 
E F G 
A B C D 
B C D E 
C D E F 
D E F G 
A B C D E 
B C D E F 
C D E F G
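
For comparison, a rough Python sketch of the same length-by-length loop structure. The function name `runs_by_length` and its `min_len`/`max_len` parameters are assumptions made for this example, not something from the thread. It joins the items with single spaces, so unlike the awk output above the lines carry no trailing blank.

```python
def runs_by_length(items, min_len=2, max_len=5):
    """Return all consecutive runs, shortest runs first.

    The outer loop picks the run length, the inner loop the start
    position, mirroring the i and j loops of the awk script.
    """
    out = []
    for length in range(min_len, max_len + 1):
        # range() is empty when length exceeds len(items),
        # so over-long run lengths are silently skipped
        for start in range(len(items) - length + 1):
            out.append(" ".join(items[start:start + length]))
    return out


if __name__ == "__main__":
    words = ["A", "B", "C", "D", "E", "F", "G"]
    print("\n".join(runs_by_length(words)))
```

For the seven-line input from post #1 this produces the 18 lines listed there, in the same order.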