Grepping only if condition matches

anushree.a · July 21, 2016, 3:18am

Dear Friends,

I have a flat file which is as follows

$ cat sample
123,456,1,1,1,1
sdfas,345,1,1,1,1
dfgd,234,2,3,4,1
ggffgr,234,4,3,2,1
jkhu,354.1,1,1,1
$

I want to output only those lines that have '1' in fields 3 through 5.

So I want the following output:

123,456,1,1,1,1
sdfas,345,1,1,1,1
jkhu,354.1,1,1,1

Kindly guide.
Anu.

zaxxon · July 21, 2016, 4:16am

What have you tried so far?

anushree.a · July 21, 2016, 5:16am

I haven't tried anything, as the only way I know is to use an "if" statement, which unfortunately I do not want to use.
Hence I am seeking guidance from those who know grep well.

Akshay_Hegde · July 21, 2016, 6:12am

Please search the forum before posting; this kind of question has been asked several times on the forum with different data.

You may try awk; it's easy:

[akshay@localhost tmp]$ cat sample
123,456,1,1,1,1
sdfas,345,1,1,1,1
dfgd,234,2,3,4,1
ggffgr,234,4,3,2,1
jkhu,354.1,1,1,1

[akshay@localhost tmp]$ awk -F, '$3 == 1 &&  $4 == 1 && $5 == 1' sample
123,456,1,1,1,1
sdfas,345,1,1,1,1
jkhu,354.1,1,1,1

[akshay@localhost tmp]$ awk  -F, '{j=1;for(i=3; i<=5; i++)j*=$i==1}j' sample
123,456,1,1,1,1
sdfas,345,1,1,1,1
jkhu,354.1,1,1,1
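
The second command is quite terse: it multiplies a flag by the result of each comparison, so the flag stays 1 only if every field in the range equals 1. A spelled-out sketch of the same idea follows, assuming the sample data from the thread. The names sample_data, result, and ok are only illustrative, and the data is fed through a pipe instead of a file.

```shell
# Sample data from the thread, held in a variable instead of a file.
sample_data='123,456,1,1,1,1
sdfas,345,1,1,1,1
dfgd,234,2,3,4,1
ggffgr,234,4,3,2,1
jkhu,354.1,1,1,1'

# Start each line with ok=1 and clear it if any of fields 3..5 is not 1.
# The bare pattern "ok" at the end prints the line when ok is non-zero.
result=$(printf '%s\n' "$sample_data" |
  awk -F, '{ ok = 1; for (i = 3; i <= 5; i++) if ($i != 1) ok = 0 } ok')

printf '%s\n' "$result"
```

Changing the loop bounds is all that is needed to check a different range of fields.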

rbatte1 · July 21, 2016, 7:35am

So if your file has several fields that you can create an expression for, then that should do it.

If the separator is , then an ignored field is .*, meaning zero or more ( * ) of any character ( . ) followed by the field separator ( , )

So, to count from the beginning of the line, your expression starts as ^.*,.*, to signify start of record ( ^ ) and then ignore two fields. You can then tag on 1,1,1, to specify your requirements, and the rest doesn't matter whether it matches or not.

I think you can end up with:-

egrep "^.*,.*,1,1,1," input_file

From your sample input, I get one less line because the one starting jkhu does not have the correct field separator between fields 2 & 3.

I hope that this helps,
Robin
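
As a quick check of the expression above, here is a small sketch run against the sample data from the first post. It assumes a grep that supports -E; the variable names are illustrative, and the data is piped in instead of being read from input_file.

```shell
# Sample data from the first post.
sample_data='123,456,1,1,1,1
sdfas,345,1,1,1,1
dfgd,234,2,3,4,1
ggffgr,234,4,3,2,1
jkhu,354.1,1,1,1'

# Two ignored fields, then 1,1,1 followed by a comma.
result=$(printf '%s\n' "$sample_data" | grep -E '^.*,.*,1,1,1,')

# The jkhu line is not printed: it has only four commas,
# so nothing can satisfy the comma required after the fifth field.
printf '%s\n' "$result"
```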

rdrtx1 · July 21, 2016, 9:51am

awk -F, '$3==$4==$5==1' sample

anushree.a · July 22, 2016, 1:18am

Thank you friends for the help which was much needed. Special thanks to Mr. rbatte1 for taking extra efforts for step by step guiding.
Thank you.

Don_Cragun · July 22, 2016, 4:09am

The command awk -F, '$3==$4==$5==1' sample might work on some systems, but it certainly is not portable.

The standards state that there is no associativity for the == operator and some versions of awk produce the syntax error:

awk: syntax error at source line 1
 context is
	 >>> $3==$4== <<< 
awk: bailing out at source line 1

If we rewrite the expression as:

awk -F, '$3==($4==($5==1))'

then there are lots of cases where that expression will evaluate to 1 even if not all three of those fields are set to 1. For example, the above command will print any of the following lines:

a,b,1,1,1
a,b,1,0,X for any X other than 1
a,b,0,1,X for any X other than 1
a,b,0,W,X for any W other than 0 or 1 for any X

Of course, it could also be rewritten as:

awk -F, '((($3==$4)==$5)==1)'

which would print any of the following lines:

a,b,1,1,1
a,b,X,X,1 for any X
a,b,X,Y,0 for any X that is not Y
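
To make the difference concrete, here is a small sketch that runs both parenthesized groupings, and the explicit && form suggested earlier in the thread, over a few hand-picked lines. The test lines and variable names are made up for illustration; each line instantiates one of the cases listed above.

```shell
# Hypothetical test lines, one per case discussed above.
lines='a,b,1,1,1
a,b,1,0,7
a,b,0,1,7
a,b,0,5,9
a,b,3,3,1
a,b,3,4,0'

# Right-to-left grouping: also prints lines 2, 3 and 4.
right_grouped=$(printf '%s\n' "$lines" | awk -F, '$3==($4==($5==1))')

# Left-to-right grouping: also prints lines 5 and 6.
left_grouped=$(printf '%s\n' "$lines" | awk -F, '((($3==$4)==$5)==1)')

# Explicit tests joined with &&: prints only the line with 1 in all three fields.
explicit=$(printf '%s\n' "$lines" | awk -F, '$3==1 && $4==1 && $5==1')

printf 'right-grouped:\n%s\n\nleft-grouped:\n%s\n\nexplicit:\n%s\n' \
  "$right_grouped" "$left_grouped" "$explicit"
```

Only the explicit form selects exactly the intended line, which is why chaining == is best avoided here regardless of how a particular awk parses it.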

Don_Cragun · July 22, 2016, 4:51am

rbatte1:

So if your file has several fields that you can create an expression for, then that should do it.

If the separator is , then an ignored field is .*, meaning zero or more ( * ) of any character ( . ) followed by the field separator ( , )

So, to count from the beginning of the line, your expression starts as ^.*,.*, to signify start of record ( ^ ) and then ignore two fields. You can then tag on 1,1,1, to specify your requirements, and the rest doesn't matter whether it matches or not.

I think you can end up with:-
egrep "^.*,.*,1,1,1," input_file
From your sample input, I get one less line because the one starting jkhu does not have the correct field separator between fields 2 & 3.

I hope that this helps,
Robin

Note that grep will work as well as egrep (or the preferred syntax grep -E ) for the RE being used in this thread.

Note also that the RE suggested works correctly only if there are exactly 6 fields (separated by 5 commas) on each input line. Since BREs and EREs use a greedy match, the RE .*, can match more than one field if there are more than 5 commas on a line. For example, that egrep command will also print the lines:

a,b,c,1,1,1,2
a,b,0,0,1,1,1,2
1,2,1,2,1,2,1,2,1,2,1,1,1,2

in addition to lines with 1 in fields 3, 4, and 5 that only have 6 fields.
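
The greedy behaviour can be seen with a short sketch that counts matches. It uses the three lines above plus one hypothetical six-field line that really does have 1 in fields 3, 4, and 5, and compares the RE against a field-by-field awk test. The variable names are illustrative only.

```shell
# The three lines above, plus one line that genuinely qualifies.
lines='a,b,c,1,1,1,2
a,b,0,0,1,1,1,2
1,2,1,2,1,2,1,2,1,2,1,1,1,2
a,b,1,1,1,2'

# Count the lines the greedy RE accepts: .* is free to swallow extra commas.
greedy_count=$(printf '%s\n' "$lines" | grep -E -c '^.*,.*,1,1,1,')

# Count the lines where fields 3, 4 and 5 are actually 1.
field_count=$(printf '%s\n' "$lines" | awk -F, '$3==1 && $4==1 && $5==1 { n++ } END { print n + 0 }')

printf 'greedy RE: %s lines, field test: %s line(s)\n' "$greedy_count" "$field_count"
```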

To make it work correctly on a line containing six commas (i.e. 7 fields), you would need to change the RE to:

.*,.*,1,1,1,.*,

and you would need to add an additional .*, to the end of that RE for each additional field in your input file.

Alternatively, we could use an RE that only matches non-comma characters in each of the first two fields:

grep '^[^,]*,[^,]*,1,1,1,' input_file

which will only print lines with 1 in fields 3, 4, and 5 as long as there are at least six fields on each line. ( [^,]*, is an RE that matches zero or more occurrences ( * ) of any character that is not a comma ( [^,] ), followed by a comma ( , ). And the leading ^ in the entire RE anchors the match to the start of the line.)
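
A brief sketch of that RE in action is shown below, on a few lines taken from this thread. The second command is an assumed variant, not something proposed above: it uses an ERE alternation so that a line ending right after field 5 (such as the five-field jkhu line from the original sample) is accepted as well.

```shell
# Two of the over-matched lines from above, one valid line, and the jkhu sample line.
lines='a,b,c,1,1,1,2
a,b,0,0,1,1,1,2
a,b,1,1,1,2
jkhu,354.1,1,1,1'

# Non-comma fields only: the first two fields cannot swallow extra commas.
strict=$(printf '%s\n' "$lines" | grep '^[^,]*,[^,]*,1,1,1,')

# Assumed variant: allow either a comma or end of line after field 5.
relaxed=$(printf '%s\n' "$lines" | grep -E '^[^,]*,[^,]*,1,1,1(,|$)')

printf 'strict:\n%s\n\nrelaxed:\n%s\n' "$strict" "$relaxed"
```

Whether the relaxed form is wanted depends on whether lines with only five fields should count as valid input.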