Print rows, having pattern in specific column...

admax · October 4, 2009, 1:19pm

Hello all,
I have a pattern file somewhat like this:

cd003
cd005
cd007
cd008

and an input file like this:

abc cd001 cd002 zca
bca cd002 cd003 cza
cba cd003 cd004 zca
bac cd004 cd005 zac
cba cd005 cd006 acz
acb cd006 cd007 caz
cab cd007 cd008 azc

I want to print all the rows that have one of the patterns in column 3.
Is it possible to read the patterns from a file in awk?
How can I do it in shell scripting (without using awk)?

THANKS IN ADVANCE

Scott · October 4, 2009, 1:24pm

Hi.

Something like:

awk 'NR == FNR { A[$1] = $1; next } A[$3]' pattern_file input_file

Scrutinizer · October 4, 2009, 2:30pm

Hi scottn,

Very nice code. The only thing I can't figure out is why you need the next statement in this case.

awk 'NR == FNR { A[$1] = $1 } A[$3]' pattern_file input_file

seems to produce the same result.

S.

Scott · October 4, 2009, 2:44pm

The idea is to read the pattern file completely before doing anything else.

But you're right, in this case A[$3] would never be set as there is no field three in the pattern file. But if that were to change, using next means you wouldn't have to worry about it later!

patrick87 · October 5, 2009, 8:31pm

Hi,

Can you explain how the two parts of NR == FNR { A[$1] = $1; next } A[$3] relate to each other?
I am quite confused about why you use A[$X] to print the rows that have the pattern in a specific column.
Thanks for your reply.

jp2542a · October 5, 2009, 9:12pm

The quick answer:

First realize that awk arrays are associative: they are indexed by string values, not by numeric positions. While an array index may look like a number, that is not how awk sees it.

The first clause sets up an array (A) indexed by values from the pattern file.
Indexes to A are cd003, cd005, etc... so A["cd003"] is a valid entry in the A array.
NR is the total number of records awk has read so far. NR is 1 for the very first record.
FNR is the number of records read from the current file. FNR starts again at 1 for the first record of each new file.
Both are incremented when a record is read.

So, if NR is equal to FNR, then we are reading from the first file (pattern file) since the record counts are the same.
If NR is not equal to FNR, then we are reading from a subsequent file (i.e. data file).

The A[$3] (where $3 is the third field from the data file) says: if the entry exists (e.g. A["cd003"]), then do the default action (print the line), else ignore that line.

This clause is not executed on the pattern file because the next statement says "skip all following code. read next line, and start processing clauses from the top". The 'next' statement adds to the robustness of the code.
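
The NR/FNR behaviour described above can be watched directly. The following is only a minimal sketch: it assumes the sample data from the first post is recreated in a scratch directory (the tmpdir and trace variable names are made up for illustration), and it uses a shortened input file.

```shell
# Recreate part of the sample data from the first post in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' cd003 cd005 cd007 cd008 > "$tmpdir/pattern_file"
printf '%s\n' 'abc cd001 cd002 zca' 'bca cd002 cd003 cza' > "$tmpdir/input_file"

# Print NR and FNR for every record. The two counters agree only while
# awk is still reading the first file named on the command line.
trace=$(cd "$tmpdir" && awk '
    { print FILENAME ": NR=" NR " FNR=" FNR, (NR == FNR ? "first file" : "later file") }
' pattern_file input_file)

printf '%s\n' "$trace"
rm -rf "$tmpdir"
```

The four pattern_file records show equal counters; on the first record of input_file, FNR drops back to 1 while NR keeps counting, so NR == FNR is false from then on.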

patrick87 · October 6, 2009, 12:32am

Hi,

Thanks a lot for your explanation.
It is excellent.
Besides that, can you tell me the meaning of
A[$1] = $1
Thanks for your help again.

jp2542a · October 6, 2009, 12:37am

The A[$1] = $1 statement simply makes A[$1] exist by giving it a value. Assigning the value to the same index is just a convenience.

patrick87 · October 6, 2009, 12:45am

I tried the commands below:
awk 'NR == FNR { A[$1] = $2; next } A[$3]'
awk 'NR == FNR { A[$1] = $3; next } A[$3]'
awk 'NR == FNR { A[$1] = $4; next } A[$3]'
All of them fail to produce my desired output.
So I'm interested in why it needs to be set as A[$1] = $1.

Thanks a lot.
I'm a new user of awk ^^
But I like awk

jp2542a · October 6, 2009, 12:54am

Scottn's script works fine based on your spec. What exactly are you trying to do?

Changing the A[$1] rvalue will not change the program's execution. If what you posted represents the command line you put to the shell, then you forgot to specify the pattern and data files.

patrick87 · October 6, 2009, 1:10am

Hi,
Sorry, I left out some details.
As you said, "If what you posted represents the command line you put to the shell, then you forgot to specify the pattern and data files".
How do I specify the pattern and data files?

This is what I ran:

[patrick@home]$ awk 'NR == FNR { A[$1] = $1; next } A[$3]' pattern_file data_file
bca cd002 cd003 cza
bac cd004 cd005 zac
acb cd006 cd007 caz
cab cd007 cd008 azc
[patrick@home]$ awk 'NR == FNR { A[$1] = $2; next } A[$3]' pattern_file data_file
[patrick@home]$ awk 'NR == FNR { A[$1] = $3; next } A[$3]' pattern_file data_file
[patrick@home]$ awk 'NR == FNR { A[$1] = $4; next } A[$3]' pattern_file data_file
[patrick@home]$ awk 'NR == FNR { A[$1] = $5; next } A[$3]' pattern_file data_file

If I change A[$1] = $1 to use $2, $3, $4 or $5, I don't get any output.
Thanks a lot for your guidance and sharing.

jp2542a · October 6, 2009, 1:58am

pattern_file and data_file are just placeholders for the actual paths to your files.

For instance if /export/home/user/real.txt is the path to the file with the patterns and /export/home/user/thedata.txt is the path to the data file, then the command line would be:

awk 'NR == FNR { A[$1] = $1; next } A[$3]' /export/home/user/real.txt /export/home/user/thedata.txt

And I just noticed something. $2..$n don't exist in the pattern file, because it has only one column. So A[$1] is assigned an empty string, which awk treats as false, and the A[$3] test never succeeds.
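
To see why the A[$1] = $2 variant prints nothing, the two commands can be run side by side. This is a minimal sketch under the assumption that the pattern file has a single column, as in the first post; the tmpdir, with_first and with_second names are made up for illustration, and the input file is shortened.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' cd003 cd005 cd007 cd008 > "$tmpdir/pattern_file"
printf '%s\n' 'abc cd001 cd002 zca' 'bca cd002 cd003 cza' 'cba cd003 cd004 zca' > "$tmpdir/input_file"

# Value copied from $1: the stored value is a non-empty string,
# so A[$3] is true for rows whose third field is a known key.
with_first=$(awk 'NR == FNR { A[$1] = $1; next } A[$3]' "$tmpdir/pattern_file" "$tmpdir/input_file")

# Value copied from $2: the pattern file has no second field, so every
# stored value is the empty string and A[$3] is always false.
with_second=$(awk 'NR == FNR { A[$1] = $2; next } A[$3]' "$tmpdir/pattern_file" "$tmpdir/input_file")

printf 'with $1: %s\n' "$with_first"
printf 'with $2: %s\n' "${with_second:-<nothing>}"
rm -rf "$tmpdir"
```

The keys are identical in both runs; only the stored values differ, and the bare A[$3] pattern looks at the value.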

patrick87 · October 6, 2009, 2:10am

Hi,
Thanks for your explanation.
Actually I am just confused about "A[$1] = $1".
I am not sure what to put for $1 and what it refers to.

Scrutinizer · October 6, 2009, 2:16am

A[$1] = $1 is used to set up the array of patterns. It is set when NR is equal to FNR, in other words while the first file, i.e. the pattern file, is being read. It means: set the element of the associative array A whose key is $1 to the value of $1, $1 being the first field of your pattern file. So with the OP's provided inputs it gets filled like so:

A[cd003]=cd003
A[cd005]=cd005
A[cd007]=cd007
A[cd008]=cd008

So once the pattern file is done, awk starts reading the input file. FNR is no longer equal to NR, so it will just execute the part "A[$3]" for each line, which means: print the current line if field 3 exists as a key in the array.

The script is only testing the existence of the array elements, not using its contents. So IMO the use of $1 is a tiny bit superfluous. I think we could also just set it to 1 instead:

awk 'NR == FNR { A[$1]=1; next } A[$3]' pattern_file input_file

or, since the pattern file does not have a third column:

awk 'NR == FNR { A[$1]=1 } A[$3]' pattern_file input_file
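
As a quick check, the three variants discussed so far can be compared on the sample data from the first post. This is only a sketch; the tmpdir and out_* variable names are made up for illustration.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' cd003 cd005 cd007 cd008 > "$tmpdir/pattern_file"
printf '%s\n' \
    'abc cd001 cd002 zca' 'bca cd002 cd003 cza' 'cba cd003 cd004 zca' \
    'bac cd004 cd005 zac' 'cba cd005 cd006 acz' 'acb cd006 cd007 caz' \
    'cab cd007 cd008 azc' > "$tmpdir/input_file"
cd "$tmpdir" || exit 1

# Original form: store $1 as the value and skip the rest for the pattern file.
out_orig=$(awk 'NR == FNR { A[$1] = $1; next } A[$3]' pattern_file input_file)
# Store the constant 1 instead of $1.
out_one=$(awk 'NR == FNR { A[$1] = 1; next } A[$3]' pattern_file input_file)
# Same, without next: pattern-file lines have an empty $3, so A[$3] is false there.
out_nonext=$(awk 'NR == FNR { A[$1] = 1 } A[$3]' pattern_file input_file)

printf '%s\n' "$out_orig"
[ "$out_orig" = "$out_one" ] && [ "$out_one" = "$out_nonext" ] && echo "all three agree"
cd / && rm -rf "$tmpdir"
```

On this data all three print the same four rows. That only holds as long as neither file contains blank lines and the pattern file keeps its single column.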

jp2542a · October 6, 2009, 2:27am

As Scrutinizer says, what is assigned to A[$1] is unimportant. It just has to be a value. Why are you changing the A[$1] assignment? What are you trying to do?

BTW, the 'next' statement has a good side effect: it prevents the A[$3] clause from being executed for the pattern file. If a novice decides to modify the script, it will prevent some undesired behavior. And awk doesn't have to do useless work processing a clause that isn't useful for the pattern file.

patrick87 · October 6, 2009, 2:31am

Hi,

Thanks.
I fully understand all the code now :)
Can I ask one more thing?
When I use A[$x]=$y, must x and y be the same number?

Besides that, if I use A[$x]=1, the result contains some empty lines for the rows that do not match the pattern file:
awk 'NR == FNR { A[$1]=1 } A[$3]' pattern_file input_file
x
x
bca cd002 cd003 cza
bac cd004 cd005 zac
acb cd006 cd007 caz
cab cd007 cd008 azc
x
x

x represents an empty line.
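
The thread does not say where these empty lines come from. One possible cause, offered here only as an assumption, is that both files contain blank lines: a blank pattern line stores A[""]=1, and every blank input line then has an empty $3 and matches it. The sketch below reproduces that situation with made-up files (tmpdir, n_one and n_first are illustrative names).

```shell
tmpdir=$(mktemp -d)
# Hypothetical files: a pattern file with a trailing blank line and an
# input file with blank lines around the data. This is an assumption,
# not something shown in the thread.
printf '%s\n' cd003 '' > "$tmpdir/pattern_file"
printf '%s\n' '' 'bca cd002 cd003 cza' '' > "$tmpdir/input_file"

# Constant value 1: the blank pattern line sets A[""] = 1, so blank
# input lines (empty $3) test true and are printed as empty lines.
n_one=$(awk 'NR == FNR { A[$1] = 1; next } A[$3]' "$tmpdir/pattern_file" "$tmpdir/input_file" | wc -l | tr -d ' ')

# Value $1: the blank pattern line sets A[""] = "", which is false,
# so blank input lines are not printed.
n_first=$(awk 'NR == FNR { A[$1] = $1; next } A[$3]' "$tmpdir/pattern_file" "$tmpdir/input_file" | wc -l | tr -d ' ')

echo "lines printed with =1: $n_one, with =\$1: $n_first"
rm -rf "$tmpdir"
```

If that is what happened, removing the blank lines from the pattern file should make the empty output lines go away.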

Thanks jp2542a,
I fully understand the code now :)
Thanks a lot for all of your explanations ;)

radoulov · October 6, 2009, 2:34am

It could be undefined as well, in that case one should check if the key exists:

$1 in A

Scrutinizer · October 6, 2009, 2:41am

Hi Radoulov, you mean $3 in A, no?

jp2542a · October 6, 2009, 2:49am

While the A[$3] construct works where I've tried it, I think the ($3 in A) is much better...

radoulov · October 6, 2009, 2:51am

Yes, sorry.
It should be the field in the second non-empty input file that you want to match:

awk 'NR == FNR { A[$1]; next } $3 in A' pattern_file input_file
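
The difference between the two tests can be shown on the thread's sample data. In this minimal sketch (variable names are made up, the input file is shortened) the keys are created with a bare A[$1], so their values are empty. The key counts are computed with a for-in loop in an END block.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' cd003 cd005 cd007 cd008 > "$tmpdir/pattern_file"
printf '%s\n' 'abc cd001 cd002 zca' 'bca cd002 cd003 cza' 'cba cd003 cd004 zca' > "$tmpdir/input_file"
cd "$tmpdir" || exit 1

# A[$3] tests the stored value, which is empty here, so nothing is printed.
by_value=$(awk 'NR == FNR { A[$1]; next } A[$3]' pattern_file input_file)
# $3 in A tests only whether the key exists, so the matching row is printed.
by_key=$(awk 'NR == FNR { A[$1]; next } $3 in A' pattern_file input_file)

# Referencing A[$3] also creates the element as a side effect; "in" does not.
keys_after_value=$(awk 'NR == FNR { A[$1]; next } A[$3] { } END { n = 0; for (k in A) n++; print n }' pattern_file input_file)
keys_after_in=$(awk 'NR == FNR { A[$1]; next } $3 in A { } END { n = 0; for (k in A) n++; print n }' pattern_file input_file)

printf 'A[$3] prints: %s\n' "${by_value:-<nothing>}"
printf '$3 in A prints: %s\n' "$by_key"
echo "keys after A[\$3]: $keys_after_value, keys after \$3 in A: $keys_after_in"
cd / && rm -rf "$tmpdir"
```

With A[$3] the array grows by one entry for every distinct unmatched third field, which can matter for large input files. With $3 in A the array stays at the four keys loaded from the pattern file.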