Search data file2 for entries in data file1

palex · March 27, 2017, 9:20pm

Hello,
I have two data files:

file1:

9780205646999   28.31   20      Cengage
9780205647040   51.94   20      Cengage
9780205660568   49.11   20      Cengage
9780205696758   51.75   20      Cengage
9780205727643   41.63   20      Cengage

file2:

9780020080954   9780020080954    2       23.3729         11      100     a       n
9780205647040   9780205647040    2       23.3729         11      100     a       n
9780023001406   9780023001406    2       46.4325         11      100     a       n
9780205727643   9780205727643    2       299.378         11      100     a       n

I am looking to search file2 for every entry in column 1 of file1. If the entry is found in file2, the output should be the entire corresponding line of file 1 with the entry in column 4 of file2 appended as such:

9780205647040   51.94   20      Cengage	23.3729
9780205727643   41.63   20      Cengage	299.378

I know how to do this with grep -f, but it is extremely cumbersome. Thanks very much!

Don_Cragun · March 27, 2017, 9:42pm

It seems like this should be doable with a fairly simple awk script. Please show us the code using grep -f that is working for you.

What operating system and shell are you using?

palex · March 27, 2017, 10:00pm

I'm using XQuartz on a Mac OS (El Capitan) with a bash shell. What I think would work would be something like:

awk '{print $1}' file1 > temp
grep -f temp file2

but there may be up to 10k lines in file1, so this would take a very long time. Additionally, I don't know how to append the column from file2.

Thanks again!

Don_Cragun · March 27, 2017, 10:31pm

Try something like:

awk '
FNR == NR {
	v[$1] = $4
	next
}
$1 in v {
	print $0, v[$1]
}' OFS="\t" file2 file1

I don't know about XQuartz , but it works fine in a Terminal window.

palex · March 29, 2017, 11:14pm

One followup question... If I wanted to include an additional column from file2 to the output, how would I modify that code? For example, including column 7 as such:

9780205647040   51.94   20      Cengage	23.3729  a
9780205727643   41.63   20      Cengage	299.378  a

Thanks!

Don_Cragun · March 29, 2017, 11:49pm

palex:

One followup question... If I wanted to include an additional column from file2 to the output, how would I modify that code? For example, including column 7 as such:
9780205647040   51.94   20      Cengage	23.3729  a
9780205727643   41.63   20      Cengage	299.378  a
Thanks!

Given the code shown in post #4, what do you think would need to be changed to get the output you want now?

Can't you just add one line to gather the additional field from file2 and change the print command to output the additional field you gathered from file2 when you find a matching record in file1?

palex · March 29, 2017, 11:56pm

I did play around with the code, but I couldn't figure out the syntax of the array (I looked on several pages also). Something like this came to mind, but is clearly wrong:

awk -F"\t" '
FNR == NR {
        v[$1,$2] = $4,$7
        next
}
$1 in v {
        print $0, v[$1], v[$2]
}' OFS="\t" file2 file1

Don_Cragun · March 30, 2017, 12:40am

You're close. You can use two arrays instead of just one... Try:

awk '
FNR == NR {
	v[$1] = $4
	v2[$1] = $7
	next
}
$1 in v {
	print $0, v[$1], v2[$1]
}' OFS="\t" file2 file1

Or, since both saved fields from file2 are being printed next to each other in the output, you can do it with one array indexed just by the contents of field 1:

awk '
FNR == NR {
	v[$1] = $4 OFS $7
	next
}
$1 in v {
	print $0, v[$1]
}' OFS="\t" file2 file1

In either case, the index (indices) on the array (arrays) has (have) to just be the field that has a common value in both input files.