Extracting specific lines from data file

palex · August 3, 2012, 7:24pm

Hello,
Is there a quick awk one-liner for this extraction?

file1

49389 text55
52211 text66

file2

59302 text1
49389 text2
85939 text3
52211 text4
13948 text5

Desired output

49389 text2
52211 text4

Thanks!!

migurus · August 3, 2012, 8:12pm

try this:

 
awk '{if(NR==FNR){a[$1]=$2}else{if($1 in a){print $1,a[$1]}}}'  file2 file1

Don_Cragun · August 3, 2012, 8:16pm

I think you want something like:

while read code data
do
	grep -- "$code" file2
done < file1

This will work with sh, bash, ksh, and most other shells. (However, it won't work with csh.)

alister · August 3, 2012, 8:34pm

A good awk solution is a much better approach than the shell loop.

AWK can handle this without having to read file2 more than once.

Your grep approach is treating the contents of file1 as a list of regular expressions when it should be treated as a list of literal text. While it doesn't seem to be a problem with the sample data, if the real data contains regular expression metacharacters, there will be problems. This can be avoided if fixed-string matching is used (-F).

The grep approach will match text at any location in the line, not just the first field. Also, it doesn't require that the match consist of an entire field; a substring match will trigger a false positive. Attempting to work around this by wrapping "$code" with anchors and delimiters won't work if -F is used, because the anchors would then be matched as literal characters.
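
To make the substring problem concrete, here is a small sketch. The file names and the data are made up for illustration (they are not the OP's files): the key 493 is deliberately a substring of 49389 and also appears in the second field of another line.

```shell
# Hypothetical data in a scratch directory; not the OP's real files.
dir=$(mktemp -d)
printf '%s\n' '493 textA' > "$dir/keys"
printf '%s\n' '49389 text2' '493 text9' '70000 x493' > "$dir/data"

# grep -F treats the key as literal text, but still matches it anywhere in the line.
grep_out=$(while read -r code rest; do grep -F -- "$code" "$dir/data"; done < "$dir/keys")

# awk compares the key against the whole first field only.
awk_out=$(awk 'NR==FNR {a[$1]; next} $1 in a' "$dir/keys" "$dir/data")

printf 'grep -F:\n%s\n\nawk:\n%s\n' "$grep_out" "$awk_out"
rm -rf "$dir"
```

With this data the grep loop returns all three lines, while awk returns only the line whose first field is exactly 493.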

migurus's awk approach is a good one, but the implementation isn't as elegant and idiomatic as it could be. I would suggest:

awk 'NR==FNR {a[$1]; next} $1 in a' file1 file2
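
For anyone unfamiliar with the idiom, the same one-liner can be spelled out with comments. This is only a sketch: it recreates the sample files from the first post in a temporary directory, and the array name keys and the variable names are my own choices.

```shell
# Recreate the sample files from the first post in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' '49389 text55' '52211 text66' > "$dir/file1"
printf '%s\n' '59302 text1' '49389 text2' '85939 text3' '52211 text4' '13948 text5' > "$dir/file2"

result=$(awk '
    # NR equals FNR only while the first operand (file1) is being read.
    NR == FNR {
        keys[$1]    # referencing the element creates it; no value is needed
        next        # skip the rule below for file1 lines
    }
    # For file2 lines: a pattern with no action prints the whole line.
    $1 in keys
' "$dir/file1" "$dir/file2")

printf '%s\n' "$result"
rm -rf "$dir"
```

Only the first fields of file1 are held in memory, and file2 is read once, line by line.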

Regards,
Alister

Don_Cragun · August 3, 2012, 9:28pm

I agree that using awk is much better than using the shell while loop as long as file2 isn't huge. And the shell solution won't work correctly if anything in file1's first field contains regular expression metacharacters. A common problem with the questions we get on this forum is that they give trivial examples of input and expected output without stating anything about the actual sizes of the datasets to be processed or the actual specifications for the contents of the fields being processed. (I started using UNIX in the early '70s on a PDP-11 and a 3B20. There wasn't enough room in the user's address space to build an array in awk for a file of the size you might see when processing customer records for a telco.)

---------- Post updated at 06:28 PM ---------- Previous update was at 06:02 PM ----------

Note also that the awk script provided by migurus will print only the last matching entry from file2 when more than one line in file2 has a first field that matches the first field of a line in file1.

The awk script provided by Alister doesn't have this problem.
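
A quick way to see the difference, as a sketch: the files below are the sample data from the first post plus one extra file2 line (49389 text6) that I have invented to create a repeated key.

```shell
dir=$(mktemp -d)
printf '%s\n' '49389 text55' '52211 text66' > "$dir/file1"
# Sample file2 from the thread plus one hypothetical line with a repeated key.
printf '%s\n' '59302 text1' '49389 text2' '85939 text3' '52211 text4' '13948 text5' '49389 text6' > "$dir/file2"

# migurus: one stored value per key, so the later file2 line overwrites the earlier one.
last_only=$(awk '{if(NR==FNR){a[$1]=$2}else{if($1 in a){print $1,a[$1]}}}' "$dir/file2" "$dir/file1")

# alister: every file2 line whose first field is a key in file1 is printed.
every_match=$(awk 'NR==FNR {a[$1]; next} $1 in a' "$dir/file1" "$dir/file2")

printf 'migurus:\n%s\n\nalister:\n%s\n' "$last_only" "$every_match"
rm -rf "$dir"
```

Note that the two scripts also differ in ordering: the first prints in file1 order, the second in file2 order.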

RudiC · August 4, 2012, 6:40am

On the other hand, alister's code will suppress duplicates in file1, should they occur. It is up to the OP to decide which behaviour is desired.
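
If repeated keys in file1 should produce repeated output, one possible variant is to count the keys instead of merely recording them. This is a sketch under that assumption; the third file1 line (49389 text77) is invented to create the duplicate, and the array name n is arbitrary.

```shell
dir=$(mktemp -d)
# Sample file1 from the thread plus one hypothetical line repeating a key.
printf '%s\n' '49389 text55' '52211 text66' '49389 text77' > "$dir/file1"
printf '%s\n' '59302 text1' '49389 text2' '85939 text3' '52211 text4' '13948 text5' > "$dir/file2"

# alister: a key listed twice in file1 still yields each file2 line once.
once=$(awk 'NR==FNR {a[$1]; next} $1 in a' "$dir/file1" "$dir/file2")

# Counting variant: print each matching file2 line once per occurrence of its key in file1.
per_key=$(awk 'NR==FNR {n[$1]++; next} $1 in n {for (i = n[$1]; i > 0; i--) print}' "$dir/file1" "$dir/file2")

printf 'once:\n%s\n\nper occurrence:\n%s\n' "$once" "$per_key"
rm -rf "$dir"
```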