pedro88
February 6, 2019, 10:33am
I need to filter a file that is structured like this:
>Cluster 0
0 292nt, >last294258;size=1;... *
>Cluster 1
0 292nt, >last111510;size=1;... *
1 290nt, >last136280;size=1;... at -/98.62%
2 292nt, >last217336;size=1;... at +/99.66%
3 292nt, >last280937;size=1;... at -/99.32%
>Cluster 2
0 292nt, >last355423;size=1;... *
I need the output to contain just the lines that contain the "*" pattern, and I also need to add to the output file the number of lines before and after the match, like this:
>last294258;size=1;... *nr=1
>last111510;size=1;... *nr=4
>last355423;size=1;... *nr=1
The 'after' of one is just the 'before' of the other, so:
$ awk '/[*]/ { print $0";nr=" N ; N=0 ; next } { N++ }' data
0 292nt, >last294258;size=1;... *;nr=1
0 292nt, >last111510;size=1;... *;nr=1
0 292nt, >last355423;size=1;... *;nr=4
$
pedro88
February 6, 2019, 11:25am
Sorry, maybe I was not clear, but that doesn't work correctly on the whole big file.
For each line with * I need to add nr= followed by the number of lines belonging to that group (group = cluster),
like this:
>last294258;size=1;... *nr=1
>last111510;size=1;... *nr=4
>last355423;size=1;... *nr=1
nr=4 because the Cluster 1 subgroup consists of 4 lines.
Something to start with, a bit verbose:
awk -f pedro.awk myFile
where pedro.awk is:
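BEGIN {
  FS="[>;]"   # split on ">" and ";"
  OFS=";"
}
# print every saved "*" entry with the line count of its cluster
function p(a,   i)
{
  for(i in a)
    print ">" i "nr=" ln
}
/^>/ {p(out);ln=0;split("",out);next}     # cluster header: flush, reset
/[*]/ {idx=$2 OFS $3 OFS $4; out[idx]}    # remember name;size;... *
{ln++}                                    # count the lines of the cluster
END {
  if (ln) p(out)
}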
RudiC
February 6, 2019, 12:03pm
It is common practice in these forums to show what you've tried and where you got stuck when posting a problem. Try
awk '
/>Cluster/ {if (CNT) print CNT
            CNT = 0
            next
           }
           {CNT++
           }
/\*/       {printf "%snr=", $NF
           }
END        {print CNT
           }
' FS=, file
>last294258;size=1;... *nr=1
>last111510;size=1;... *nr=4
>last355423;size=1;... *nr=1
Just a slight modification if a "block" does not have anything marked with *:
>Cluster 0
0 292nt, >last294258;size=1;... *
>Cluster 1
0 292nt, >last111510;size=1;... *
1 290nt, >last136280;size=1;... at -/98.62%
2 292nt, >last217336;size=1;... at +/99.66%
3 292nt, >last280937;size=1;... at -/99.32%
>Cluster 2
0 292nt, >last355423;size=1;...
BEGIN {
  FS=","
}
/>Cluster/ {if (flg) print CNT
            CNT=flg = 0
            next
           }
           {CNT++
           }
/\*/       {printf "%snr=", $NF;flg++
           }
END        {if (flg) print CNT
           }
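A quick way to check the flag logic is to run it on the modified sample above, where Cluster 2 has no line marked with *. The sketch below inlines the script and feeds the sample through a here-document; the variable name out is only for illustration and is not part of the post. Note that with FS="," the last field keeps the blank that follows the comma, so each printed line starts with a space.

```shell
# Run the flagged script on the sample; Cluster 2 has no "*" line.
out=$(awk '
BEGIN      { FS = "," }
/>Cluster/ { if (flg) print CNT      # finish the pending "...nr=" line
             CNT = flg = 0
             next
           }
           { CNT++ }
/\*/       { printf "%snr=", $NF; flg++ }
END        { if (flg) print CNT }
' <<'EOF'
>Cluster 0
0 292nt, >last294258;size=1;... *
>Cluster 1
0 292nt, >last111510;size=1;... *
1 290nt, >last136280;size=1;... at -/98.62%
2 292nt, >last217336;size=1;... at +/99.66%
3 292nt, >last280937;size=1;... at -/99.32%
>Cluster 2
0 292nt, >last355423;size=1;...
EOF
)
printf '%s\n' "$out"
```

This prints the nr=1 and nr=4 lines for Cluster 0 and Cluster 1 and nothing for Cluster 2, because flg is still 0 when END is reached.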
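The following skips empty lines (where NF is 0) and saves the 1st line after ">Cluster" (it does not test for the *, so it relies on the marked line coming first in each cluster, as in the sample):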
awk '
function prt(){ if (run==1) print (save1 "nr=" nr); else run=1 }
$1~/^>Cluster/ { prt(); nr=0; next }
(NF>0 && ++nr==1) { $1=$2=""; save1=$0 }
END { prt() }
' data
At each ">Cluster" and at the END it calls prt() that prints the collected values (but not at its first invocation).
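If the line marked with * is not guaranteed to be the first one in its cluster, the same "print at the next >Cluster and at END" idea can remember the marked line wherever it occurs. The following is only a sketch under that assumption: the function names cluster_sizes and emit are made up for illustration, and the field separator ", " is chosen so that the saved text starts at the ">" without a leading blank.

```shell
# Hypothetical helper: print each "*" line with the size of its cluster.
cluster_sizes() {
    awk -F', ' '
    # Emit the remembered marked line (if any) with the member count.
    function emit() { if (rep != "") print rep "nr=" cnt }
    /^>Cluster/ { emit(); rep = ""; cnt = 0; next }   # a new cluster starts
    NF          { cnt++ }                             # count non-empty lines
    /\*/        { rep = $NF }                         # text after ", "
    END         { emit() }
    ' "$@"
}
```

Called as cluster_sizes data on the sample from the first post, it should give the three requested lines (nr=1, nr=4, nr=1); clusters without a marked line are skipped because rep stays empty.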