awk to look up values in File 2 from File 1, & printingNth field of File1 based value of File2 $2

samonl · March 2, 2018, 12:33pm

I have two files which are the output of a multiple choice vocab test (60 separate questions) from 104 people (there are some missing responses) and the question list. I have the item list in one file (File1)

Item,Stimulus,Choice1,Choice2,Choice3,Choice4,Correct
3100004,HATCHET,tomahawk,pagoda,prosecution,soot,1
3100011,MORSEL,bite,host,marmalade,extortion,1
3100012,PASTIME,interest,miracle,position,homesick,1
3100016,GIST,essence,contempt,policy,fig,1

etc
The actual data is in another (File2)

Item,Response,TrialOrderRelative,ItemOrder,Accuracy,ID,RT
3100004,1,184,1,1,90915,2243.46
3100004,1,702,1,1,1207MY,3355.45
3100004,3,1031,1,0,120908,5832.49
3100004,1,1405,1,1,123321,2577.53
3100004,4,1963,1,0,17440293,6490.75
3100004,2,2326,1,0,200198,19480.67
3100012,4,2722,1,0,202017,2073.00

etc
All I want to do is print the answer chosen for each item in the File 2. This requires that I a) match $1 of File 2 to File 1, and b) then take the value of $2 in File 2, and print the $2+2 th field of File 1 for the matching record. I have got as far as working out that I need to start with reading both the files in as arrays at the same time but got really lost reading the array section in a 30 year old copy of Aho book from undergraduate days. I would like to learn how to do this but do not understand how to use the $2 to look up a field in File 1. I know how to do the a) part of this but that's not really helping. The output should look like this:

3100004,1,184,1,1,90915,2243.46,tomahawk
3100004,1,702,1,1,1207MY,3355.45,tomahawk
3100004,3,1031,1,0,120908,5832.49,prosecution
3100004,1,1405,1,1,123321,2577.53,tomahawk
3100004,4,1963,1,0,17440293,6490.75,soot
3100004,2,2326,1,0,200198,19480.67,pagoda
3100012,4,2722,1,0,202017,2073.00,homesick

I would be grateful for all suggestions including pointers to helpful worked examples so I can solve this kind of thing for myself. This post also represents huge frustration with attempts to do this kind of thing with Excel and its lookup functions (it hung and broke). Thank you.

RudiC · March 2, 2018, 1:05pm

Read file1 into a reference multidimensional array; standard procedure and thus copious examples (for onedimensional arrays) in these forums. Then, print file2's respective lines adding the array element referenced by field 2; also numerous examples found in here.

abdulbadii · March 4, 2018, 10:42am

  awk -F, 'NR==FNR { for (i=3;i<=NF;i++) arr[$1,i] = ","$i; next } { print $0 arr[$1,($2+2)] }' file1 file2

Chubler_XL · March 4, 2018, 3:59pm

If you want to avoid output of the heading line add FNR<2{next} to the front of abdulbadii's program:

awk -F, 'FNR<2{next} NR==FNR { for (i=3;i<=NF;i++) arr[$1,i] = ","$i; next } { print $0 arr[$1,($2+2)] }' file1 file2

samonl · March 4, 2018, 4:11pm

Thank you both. It works, but I still don't understand where the $0 comes from in this bit

{ print $0 arr[$1, ($2+2)] }

I understand (just) how the transpose examples work in the various manual pages but as soon as there is a second file I get really really confused.

Thank you also for the suggestion about suppressing the headers - but I want the headers. You have all been really helpful. Thank you.

Chubler_XL · March 4, 2018, 4:37pm

$0 is the current row, in this case it is a row from any file except the first file, because the previous command texts NR == FNR and for every row of the first file the global row number (NR) will equal the File row number (FNR) and within this test next is called so the rest of the program is never reached.

To keep the headers in the output you need something like this:

awk -F, '
NR==FNR { for (i=3;i<=NF;i++) arr[$1,i] = ","$i; next } 
FNR==1 { print $0 ",Choice ; next }
{ print $0 arr[$1,($2+2)] }' file1 file2

explanation:
row 1: capture file #1 fields into arr[] and skip rest of program for file #1
row 2: if this is row number 1 of this file print row plus ",Choice"
row 3: otherwise print row plus arr[] value for current item and response