Filter (by max length) only lines not matching regex

pathunkathunk · March 25, 2013, 2:28am

I have a large file containing many pairs of sequences and their headers; the header lines always begin with '>'.

I'm looking for help on how to retain only sequences (and their headers) below a certain length. So if max length was 10, output would be

I can filter by length, but I'm not sure how to exclude the header lines.

awk '{lines[NR] = $0} length($0) < 10 {print lines [NR-1]; print lines [NR]} ' file.name

RudiC · March 25, 2013, 2:59am

Do your data always come in pairs? If yes, your code is fine. Try also

$ while read HEAD; do read DATA; [ ${#DATA} -lt 10 ] && printf "%s\n%s\n" "$HEAD" "$DATA"; done < file
>gi|bcd| Species two
ATTTGATC
>gi|cdf| Species three
ATTTGATCT

mirni · March 25, 2013, 3:07am

Like this?

awk '/^[^>]/ && length($0)<10{print hdr"\n"$0}{hdr=$0}' input

pathunkathunk · March 25, 2013, 3:27am

Both of these work, thank you.

The problem with my original code was obscured by my choice of an example file with unrealistically short sequences. In reality my sequences are longer, and the problem with my code is that the length test is applied to every line, so it also captures header lines that are shorter than the cutoff, not just short sequences.
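
For anyone finding this thread later, here is a minimal sketch of the same idea with the header check made explicit, so header lines are never length-tested. The function name `filter_short` and its two arguments (cutoff, then file) are my own choices for illustration, not something from the posts above, and it assumes each '>' header is followed by exactly one sequence line.

```shell
# filter_short MAX FILE
# Hypothetical helper: print header/sequence pairs whose sequence is shorter than MAX.
# Assumes every header line (starting with '>') is followed by exactly one sequence line.
filter_short() {
    awk -v max="$1" '
        /^>/ { hdr = $0; next }                  # remember the header and skip the length test
        length($0) < max { print hdr; print }    # short sequence: print its header, then the sequence
    ' "$2"
}
```

It would be called as `filter_short 10 file.name`. Because header lines hit `next` before the length test, a short header can no longer be mistaken for a short sequence. Sequences wrapped over several lines would need to be joined first; this sketch does not handle that.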