Create file based on data from two other files

cdubu2 · March 3, 2017, 2:32pm

I have looked through several threads regarding merging files with awk and attempted using join however have been unsuccessful likely as I do not fully understand awk.
What I am attempting is to take a csv file which could be between 1 and 15,000 lines with 5 colums and another csv file that will between 1 and 200 lines and 2 colums and create a new file based on a match of field. The match pair is $5 in csv1 and $1 in csv 2.

csv1

SABBKRDHS	   00:02:DE:55:DB:74	SA3000	TS	TR011 
SABBNLRJL	   00:02:DE:62:2C:A2	SA3000	TS	TR002 
SABBPKRLF	   00:02:DE:66:D0:18	SA3000	TS	TR003 
SABBQFHXM	   00:02:DE:6A:A0:80	SA3000	TS	TR003 
SABBPKWXD	   00:02:DE:66:DE:26	SA3000	1	TR006B
SABBNLTLX	   00:02:DE:62:33:46	SA3000	TS	TR009 
SABBPLKBH	   00:02:DE:66:FA:9C	SA3000	TS	TR010 
SABBQLJVG	   00:02:DE:6B:DB:E0	SA3000	TS	TR010 
SABBNSGSR	   00:02:DE:63:88:AC	SA3000	TS	TR011 
SABBQLKKH	   00:02:DE:6B:DD:9A	SA3000	1	TR013 
SABBNLNJX	   00:02:DE:62:23:56	SA3000	TS	TR015 
SABBPNPNF	   00:02:DE:67:85:A8	SA3000	TS	TR023 
SABBNLZTF	   00:02:DE:62:40:C0	SA3000	TS	TR026

csv2

TR002 	qpska-1
TR003 	qpska-2
TR006B	qpska-3
TR009 	qpska-4
TR010 	qpska-5
TR011 	qpska-6
TR013 	qpska-7
TR015 	qpska-8
TR023 	qpska-9
TR026 	qpska-10
TR101 	qpska-11
TR102 	qpska-12
TR103 	qpska-13
TR104 	qpska-14

With the desired output to be csv3

   00:02:DE:62:2C:A2	qpska-1
   00:02:DE:66:D0:18	qpska-2
   00:02:DE:6A:A0:80	qpska-2
   00:02:DE:66:DE:26	qpska-3
   00:02:DE:62:33:46	qpska-4
   00:02:DE:66:FA:9C	qpska-5
   00:02:DE:6B:DB:E0	qpska-5
   00:02:DE:63:88:AC	qpska-6
   00:02:DE:6B:DD:9A	qpska-7
   00:02:DE:62:23:56	qpska-8
   00:02:DE:67:85:A8	qpska-9
   00:02:DE:62:40:C0	qpska-10
   00:02:DE:6C:16:C6	qpska-11
   00:02:DE:67:1F:FA	qpska-12
   00:02:DE:6B:E7:D8	qpska-12
   00:02:DE:61:C7:8E	qpska-13
   00:02:DE:6A:7F:26	qpska-13
   00:02:DE:62:17:CA	qpska-14

I would really like to understand how to format the awk command.

Thanks,

RudiC · March 3, 2017, 3:27pm

Unfortunately, you didn't specify what to do if lines don't find a match - suppress? Print error message or default?
Howsoever, try

awk 'NR==FNR {T[$5] = $2; next} {print T[$1], $2}' OFS="\t" csv[12]
00:02:DE:62:2C:A2    qpska-1
00:02:DE:6A:A0:80    qpska-2
00:02:DE:66:DE:26    qpska-3
00:02:DE:62:33:46    qpska-4
00:02:DE:6B:DB:E0    qpska-5
00:02:DE:63:88:AC    qpska-6
00:02:DE:6B:DD:9A    qpska-7
00:02:DE:62:23:56    qpska-8
00:02:DE:67:85:A8    qpska-9
00:02:DE:62:40:C0    qpska-10
    qpska-11
    qpska-12
    qpska-13
    qpska-14

Scrutinizer · March 4, 2017, 7:04am

Given the fact that fields in the last column of csv1 occur more than ones, I think it makes more sense that the join should be the other way around (the other way around it takes the last occurrence in column 5)

awk 'NR==FNR {A[$1]=$2; next} {print $2, A[$5]}' OFS="\t" csv2 csv1

output:

00:02:DE:55:DB:74	qpska-6
00:02:DE:62:2C:A2	qpska-1
00:02:DE:66:D0:18	qpska-2
00:02:DE:6A:A0:80	qpska-2
00:02:DE:66:DE:26	qpska-3
00:02:DE:62:33:46	qpska-4
00:02:DE:66:FA:9C	qpska-5
00:02:DE:6B:DB:E0	qpska-5
00:02:DE:63:88:AC	qpska-6
00:02:DE:6B:DD:9A	qpska-7
00:02:DE:62:23:56	qpska-8
00:02:DE:67:85:A8	qpska-9
00:02:DE:62:40:C0	qpska-10

Eventhough, I cannot figure out why:

00:02:DE:55:DB:74	qpska-6

is not present in the output sample in post #1

cdubu2 · March 6, 2017, 8:58am

Thank you both for reply's and bare with me I want to make sure I understand so I don't have ask again.

 
{A[$1]=$2; next}

Is putting column 1 of file 2 in an array and make it variable $2?
next stop parsing file 2.
{print $2, A[$5]}
Then print variable $2 and value of column 5 (a[$5] when it has a match from array$1 file 1 to a value in file 2?

RudiC · March 6, 2017, 9:21am

Not quite.

A[$1]=$2 : Create an element for array A indexed by $1 (the first field in the line) and assign $2 (the second field)'s contents.

next stop processing THIS actual line in file2; read in and process next line until done with file2

print $2, A[$5] When processing the second file (file1), print its second field, and use fifth field as an index into array A, and, after OFS, print that value or an empty string if no element exists.