awk regular expression

Hello,

I have big files that I want to filter based on the first column.

The first column should be one of these strings: chr2L || chr2R || chr3L || chr3R || chr4 || chrX
Something like chr2Lh, chrY, or chrM3L should not be accepted.

I used the following command:

 awk '{ if ($1=="chr2L" || $1=="chr2R" || $1=="chr3L" || $1=="chr3R" || $1== "chr4" || $1=="chrX") print $0 }' input | sort -k1,1 -k2,2n > output

Is there an easier way to do it? And is my version correct anyway?
BTW, I also want to sort on two columns, where the first is a string and the second is a number; you can see what I used in the command above. Please let me know whether that is correct too.

Thanks in advance.

You could do:

awk '$1 ~ /^chr[2-4]*[LRX]*$/' input | sort -k1,1 -k2,2n

EDIT: But if you want to match strictly only chr2L || chr2R || chr3L || chr3R || chr4 || chrX, then the above regexp is wrong.

It will match other combinations as well. I think what you did is correct.
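To make the over-matching concrete, here is a small check against made-up first-column values (the sample names and the function name loose_match are only for illustration):

```shell
# Wrap the loose regexp from above so it can be fed sample input.
loose_match() {
    awk '$1 ~ /^chr[2-4]*[LRX]*$/'
}

# chrY and chr2Lh are rejected as intended, but chr, chr22, chr4L and
# chr2LX also pass, because both bracket groups may repeat zero or more times.
printf '%s\n' chr2L chr4 chr chr22 chr4L chr2LX chrY chr2Lh | loose_match
```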

I was about to respond to this post to say exactly what you added in your edit. I agree completely.

Shorter isn't always better. That said, the original version can be simplified a little bit:

 awk '$1=="chr2L" || $1=="chr2R" || $1=="chr3L" || $1=="chr3R" || $1=="chr4" || $1=="chrX"' input | sort -k1,1 -k2,2n > output

Regards,
Alister

Thank you, Yoda and alister.

As you said, shorter is not always better! I was not sure whether what I used does the matching strictly or not. Now I am.
Once again, thanks to both of you.

Cheers!

Not sure if this will work with all awk versions, but you might want to try:

awk '$1 ~ /^(chr2L|chr2R|chr3L|chr3R|chr4|chrX)$/' input | sort -k1,1 -k2,2n

It's an ORed regex constant, anchored with ^ and $ so that it must match the whole of $1.

Factoring out the common string chr :

awk '$1 ~ /^chr(2L|2R|3L|3R|4|X)$/' input | sort -k1,1 -k2,2n
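On the portability worry: POSIX awk uses extended regular expressions, which include grouping and alternation, so this should work in any POSIX-conforming awk. A quick sanity check of the filter together with the sort, on made-up two-column data (the values and the function name strict_filter are only for illustration):

```shell
# Strict filter on column 1, then sort by column 1 as a string
# and by column 2 numerically.
strict_filter() {
    awk '$1 ~ /^chr(2L|2R|3L|3R|4|X)$/' | sort -k1,1 -k2,2n
}

# chr2Lh, chrM3L and chrY are dropped; within chr2L, 3 sorts before 20
# because of the n (numeric) modifier on the second key.
printf '%s\n' 'chrX 300' 'chr2L 20' 'chr2Lh 5' 'chr2L 3' 'chrM3L 1' 'chr4 7' 'chrY 9' | strict_filter
```

Without the n on the second key, 20 would sort before 3, since the comparison would be character by character.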

Try to avoid regexes if you can, to improve efficiency. Also, if you happen to know, to a reasonable extent, how frequently each of those strings occurs, you could use simple string comparisons with the short-circuit logical OR ( || ) operator, testing the most frequent strings first so that most lines are accepted after one or two comparisons.
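Another regex-free option, which is a different technique from the ordered || chain, is a lookup table tested with awk's in operator. This is only a sketch: the names lookup_filter, names and ok are made up for illustration, and whether it is actually faster than the regex depends on the awk implementation and the data, so it is worth timing on a real file.

```shell
lookup_filter() {
    awk 'BEGIN {
             # Build a set of accepted names; the array names are arbitrary.
             n = split("chr2L chr2R chr3L chr3R chr4 chrX", names, " ")
             for (i = 1; i <= n; i++) ok[names[i]] = 1
         }
         # Print the line only if column 1 is exactly one of the accepted names.
         ($1 in ok)'
}

# Same pipeline shape as before: filter, then sort on both columns.
printf '%s\n' 'chr2L 1' 'chr2Lh 2' 'chrY 3' 'chrX 4' 'chrM3L 5' | lookup_filter | sort -k1,1 -k2,2n
```

Adding or removing an accepted name then only means editing the string passed to split.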
