Filtering text with awk

pedro88 · February 6, 2019, 10:33am

I need to filter a file that is composed like that:

>Cluster 0
0	292nt, >last294258;size=1;... *
>Cluster 1
0	292nt, >last111510;size=1;... *
1	290nt, >last136280;size=1;... at -/98.62%
2	292nt, >last217336;size=1;... at +/99.66%
3	292nt, >last280937;size=1;... at -/99.32%
>Cluster 2
0	292nt, >last355423;size=1;... *

I need the output to contain only the lines that have the "*" pattern, and I also need to add to each of them the number of lines before and after the match, like this:

>last294258;size=1;... *nr=1
>last111510;size=1;... *nr=4
>last355423;size=1;... *nr=1

Corona688 · February 6, 2019, 11:11am

The 'after' of one is just the 'before' of the other, so:

$ awk '/[*]/ { print $0";nr=" N ; N=0 ; next } { N++ }' data

0       292nt, >last294258;size=1;... *;nr=1
0       292nt, >last111510;size=1;... *;nr=1
0       292nt, >last355423;size=1;... *;nr=4

$

pedro88 · February 6, 2019, 11:25am

Sorry, maybe I was not clear, but that does not work correctly on the whole big file.

For each line with * I need to add nr= followed by the number of lines belonging to that group (group = cluster),

so like this:

>last294258;size=1;... *nr=1
>last111510;size=1;... *nr=4
>last355423;size=1;... *nr=1

nr=4 is because the Cluster 1 subgroup is composed of 4 lines.
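To make the counting rule concrete, here is a minimal sketch (not taken from any reply in this thread) that only reports how many lines each cluster contains. The variable names data and counts are hypothetical, and the sample uses single spaces where the real file has tabs, which does not affect the count.

```shell
# Sample input from the first post (spaces instead of tabs; hypothetical variable name).
data='>Cluster 0
0 292nt, >last294258;size=1;... *
>Cluster 1
0 292nt, >last111510;size=1;... *
1 290nt, >last136280;size=1;... at -/98.62%
2 292nt, >last217336;size=1;... at +/99.66%
3 292nt, >last280937;size=1;... at -/99.32%
>Cluster 2
0 292nt, >last355423;size=1;... *'

counts=$(printf '%s\n' "$data" | awk '
/^>Cluster/ { if (name != "") print name ": " n   # report the cluster that just ended
              name = substr($0, 2); n = 0; next }
NF          { n++ }                               # count non-empty member lines
END         { if (name != "") print name ": " n } # report the last cluster
')
printf '%s\n' "$counts"
```

On this sample it reports 1, 4 and 1 lines for Cluster 0, 1 and 2, which are the nr= values asked for.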

vgersh99 · February 6, 2019, 11:59am

Something to start with, a bit verbose:
awk -f pedro.awk myFile
where pedro.awk is:

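BEGIN {
  FS="[>;]"   # split on ">" and ";": $2 is the id, $3 is size=N, $4 is "... *"
  OFS=";"
}

# print every "*" line collected for the cluster that just ended;
# i is a local variable, ln is the global line count of that cluster
function p(a, i)
{
   for(i in a)
     print ">" i "nr=" ln

}
/^>/ {p(out);ln=0;split("",out);next}      # cluster header: flush, then reset
/[*]/  {idx=$2 OFS $3 OFS $4; out[idx]}    # remember the text of a "*" line
{ln++}                                     # count every non-header line
END {
  if (ln) p(out)                           # flush the last cluster
}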

RudiC · February 6, 2019, 12:03pm

It is common practice in these forums to show what you've tried and where you got stuck when posting a problem. Try

awk '
/>Cluster/      {if (CNT) print CNT
                 CNT = 0
                 next
                }
                {CNT++
                }
/\*/            {printf "%snr=", $NF
                }
END             {print CNT
                }
' FS=, file
 >last294258;size=1;... *nr=1
 >last111510;size=1;... *nr=4
 >last355423;size=1;... *nr=1

vgersh99 · February 6, 2019, 12:47pm

Just a slight modification to RudiC's script for the case where a "block" has no line marked with *, as in Cluster 2 here:

>Cluster 0
0       292nt, >last294258;size=1;... *
>Cluster 1
0       292nt, >last111510;size=1;... *
1       290nt, >last136280;size=1;... at -/98.62%
2       292nt, >last217336;size=1;... at +/99.66%
3       292nt, >last280937;size=1;... at -/99.32%
>Cluster 2
0       292nt, >last355423;size=1;...

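BEGIN {
  FS=","
}
/>Cluster/      {if (flg) print CNT     # finish the pending "...nr=" line
                 CNT = flg = 0
                 next
                }
                {CNT++                  # count every line of the cluster
                }
/\*/            {printf "%snr=", $NF    # no newline: the count is appended later
                 flg++                  # this cluster has a "*" line
                }
END             {if (flg) print CNT
                }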
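
As a usage sketch, the modified script can be run on the sample above, in which Cluster 2 has no line marked with *. The variable names data and result are hypothetical, and the script is passed inline rather than from a file.

```shell
# Modified sample: the only line of Cluster 2 carries no "*".
data='>Cluster 0
0       292nt, >last294258;size=1;... *
>Cluster 1
0       292nt, >last111510;size=1;... *
1       290nt, >last136280;size=1;... at -/98.62%
2       292nt, >last217336;size=1;... at +/99.66%
3       292nt, >last280937;size=1;... at -/99.32%
>Cluster 2
0       292nt, >last355423;size=1;...'

result=$(printf '%s\n' "$data" | awk '
BEGIN      { FS = "," }
/>Cluster/ { if (flg) print CNT          # close the pending "...nr=" line
             CNT = flg = 0
             next }
           { CNT++ }                     # count every line of the cluster
/\*/       { printf "%snr=", $NF; flg++ }
END        { if (flg) print CNT }
')
printf '%s\n' "$result"
```

Only the Cluster 0 and Cluster 1 lines are printed (nr=1 and nr=4). Each keeps the leading space that follows the comma in the input, because the whole last comma-separated field is printed.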

MadeInGermany · February 6, 2019, 2:08pm

The following skips empty lines (where NF is 0) and saves the first line after ">Cluster", assuming that this is the line marked with *:

awk '
function prt(){ if (run==1) print (save1 "nr=" nr); else run=1 }
$1~/^>Cluster/ { prt(); nr=0; next }
(NF>0 && ++nr==1) { $1=$2=""; save1=$0 }
END { prt() }
' data

At each ">Cluster" line and at the END it calls prt(), which prints the collected values (except on its first invocation, when nothing has been collected yet).
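
The same deferred-print idea can be sketched without relying on the * line being the first one in its cluster. This is only an illustration under assumptions: the names emit, rep, n, data and result are made up here, the * is assumed to be the last non-blank character of the representative line, and clusters without such a line are silently dropped.

```shell
# Sample input from the first post (spaces instead of tabs; hypothetical variable name).
data='>Cluster 0
0 292nt, >last294258;size=1;... *
>Cluster 1
0 292nt, >last111510;size=1;... *
1 290nt, >last136280;size=1;... at -/98.62%
2 292nt, >last217336;size=1;... at +/99.66%
3 292nt, >last280937;size=1;... at -/99.32%
>Cluster 2
0 292nt, >last355423;size=1;... *'

result=$(printf '%s\n' "$data" | awk '
function emit() {                 # print the finished cluster if it had a "*" line
    if (rep != "") print rep "nr=" n
    rep = ""; n = 0
}
/^>Cluster/  { emit(); next }
NF == 0      { next }             # skip empty lines
             { n++ }              # count every member line
/[*][ \t]*$/ { rep = $0; sub(/^[^>]*/, "", rep) }   # keep the text from ">" onward
END          { emit() }
')
printf '%s\n' "$result"
```

Stripping everything before the first ">" removes the index and length columns, so the lines come out in the layout requested in the first post.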