Grep Line with Matching Fields

Below is the scenario. Help is appreciated.

File1: ( 500,000 lines ) : Three fields comma delimited : Not sorted

1234FAA,435612,88975
1224FAB,12345,212356

File2: ( 4,000,000 lines ) : Six fields comma delimited (Last 3 fields should match the 3 fields of File1) : Not sorted

0123456abcd,12345,abcdef,1234FAA,435612,88975
0123456wxyz,11234,lmnopq,1224FAB,12345,212356

I need to grab all six fields of a file2 record whenever its last 3 fields match the 3 fields of a file1 record.

I wrote a small script, but it seems like it might take days to complete :slight_smile:

#!/bin/ksh

while read record
do
  cat file2 | grep "$record" >> final.list
done < file1

Can someone help me with a faster solution?

Thanks in advance.

egrep -f file1 file2
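
A note on this one-liner: `egrep -f` treats every line of file1 as a regular expression, which may be part of why it is slow with 500,000 patterns. Because the keys here are plain text, a fixed-string search is worth trying. The following is only a sketch: the sample records and the scratch directory are made up for illustration, and `mktemp -d` is assumed to be available. Note that a fixed string matches anywhere in the line, so a key could also match in the middle of a longer record; comparing whole fields avoids that.

```shell
# Hypothetical sample data, written to a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' '1234FAA,435612,88975' '1224FAB,12345,212356' > "$tmp/file1"
printf '%s\n' \
  '0123456abcd,12345,abcdef,1234FAA,435612,88975' \
  '0123456wxyz,11234,lmnopq,9999ZZZ,1,2' > "$tmp/file2"

# -F: treat each pattern as a fixed string, not a regular expression.
# -f: read the patterns, one per line, from file1.
result=$(grep -F -f "$tmp/file1" "$tmp/file2")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Only the first file2 record is printed, since it is the only one containing a file1 key.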

nawk -F',' -f hem.awk file1 file2

hem.awk:

FNR==NR { f1[$0]; next}
( $(NF-2) FS $(NF-1) FS $(NF) ) in f1

vgersh99,

My Linux box does not have nawk.
It has awk and gawk.

:frowning:

'gawk' is your new friend.

vgersh99,

The following ran for about 45-50 minutes and completed, but the output file was empty.
Could you please give me more insight on what is happening below?

Thanks

gawk -F',' -f hem.awk file1 file2 > final.list

hem.awk:
FNR==NR { f1[$0]; next}
( $(NF-2) FS $(NF-1) FS $(NF) ) in file1

ShellLife,

The egrep command ran for over an hour, so I killed it.
I kicked it off again to see how long it runs.

Thanks

hemangjani,
based on your sample input files and the proposed awk script, the output came out as expected. I've also modified the 'file2' file to add records that do not match any record in file1, and the result was as expected.

I believe your actual files are not the same as the ones you've quoted above: there might be inconsistent spaces/tabs between the fields and/or some other anomalies you're not paying attention to that result in the 'empty' output.

I'd suggest copy/pasting part of the content of file1 and file2 here using the vB Codes so that the formatting does not 'get lost in translation' [pun intended].
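
One quick way to check for this kind of anomaly is to count the lines that contain a space, a tab, or a carriage return. This is a rough sketch, not a full validation: the sample records are invented for illustration, and `mktemp -d` is assumed to be available.

```shell
tmp=$(mktemp -d)
# Hypothetical data: the second record ends with a carriage return,
# the third has a space after the first comma.
printf '1000FAAA,100706,11446\n1000FAAA,11554,10214\r\n1000FAAA, 160237,8409\n' > "$tmp/file1"

# Count the lines that contain a space, tab, or carriage return.
suspect=$(awk '/[ \t\r]/ { n++ } END { print n + 0 }' "$tmp/file1")
echo "$suspect"
rm -rf "$tmp"
```

A non-zero count on either real file would mean that an exact string comparison of the keys can fail even though the records look identical on screen.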

vgersh99,

The sample data might not have a match of file1 in file2.

file1:

1000FAAA,100706,11446
1000FAAA,1067050,12239
1000FAAA,1081989,24010
1000FAAA,111273,26019
1000FAAA,1130922,13608
1000FAAA,11554,10214
1000FAAA,1424483,11564
1000FAAA,160237,8409
1000FAAA,255794,29390
1000FAAA,264869,8663

file2:

000000AR8894426,13443,AR8894,1FAAA,0,11233
000000AR8967426,13443,AR8967,1FAAA,11233,14800
000000AR8993426,13443,AR8993,1FAAA,26033,8750
000000AR9012426,13443,AR9012,1FAAA,34783,8000
000000AR9067426,13443,AR9067,1FAAA,42783,11576
000000AR9203426,13443,AR9203,1FAAA,54359,9957
000000AR9570426,13443,AR9570,1FAAA,64316,9228
000000AW6599426,13443,AW6599,1FAAA,73544,10703
000000AW6609426,13443,AW6609,1FAAA,84247,19952
000000AW6617426,13443,AW6617,1FAAA,104199,8632

hemangjani's code does not match vgersh99's code: the last line of your hem.awk tests "in file1", whereas the original tests "in f1".

Given your initial description and those 2 sample files, there are no matches.
What would your expected output be?

Post a set of files with the expected output.

vgersh99,

The previous sample was the last 10 lines from both files.
For the sample below, I made sure that there are a few matches.


file1:

1000FAAA,100706,11446
1000FAAA,1067050,12239
1000FAAA,1081989,24010
1000FAAA,111273,26019
1000FAAA,1130922,13608
1000FAAA,11554,10214
1000FAAA,1424483,11564
1000FAAA,160237,8409
1000FAAA,255794,29390
1000FAAA,264869,8663

file2:

000000E8F900496,13492,E8F900,1000FAAA,100706,11446
0000001768X0496,13492,1768X0,1000FAAA,1067050,12239
000000354923496,13492,354923,1000FAAA,1081989,24010
00000029R018496,13492,29R018,1000FAAA,1424483,11564
0000000Y495R496,13492,0Y495R,1000FAAA,160237,8409
0000003W5R34496,13492,3W5R34,1000FAAA,255794,29390
0000002AA859496,13492,2AA859,1000FAAA,264869,8663
000000E70311496,13492,E70311,1000FAAA,309934,30127
000000R99462496,13492,R99462,1000FAAA,394836,8279
000000063EW7496,13492,063EW7,1000FAAA,421058,10314
0000000960FF496,13492,0960FF,1000FAAA,437530,11282
000000351795496,13492,351795,1000FAAA,513053,22251

Output:

000000E8F900496,13492,E8F900,1000FAAA,100706,11446
0000001768X0496,13492,1768X0,1000FAAA,1067050,12239
000000354923496,13492,354923,1000FAAA,1081989,24010
00000029R018496,13492,29R018,1000FAAA,1424483,11564
0000000Y495R496,13492,0Y495R,1000FAAA,160237,8409
0000003W5R34496,13492,3W5R34,1000FAAA,255794,29390
0000002AA859496,13492,2AA859,1000FAAA,264869,8663

Try this:
while IFS= read -r line
do
  grep "$line" file2 >> file3
done < file1

Given the sample files above and an awk solution, I get the exact desired output as posted.

vgersh99,

:b: Worked like a charm. Thanks a lot. It took only 3 minutes to rip through.

Yesterday I made a mistake in my hem.awk file. Kahuna pointed out that I was using file1 instead of f1; I missed it because I did not understand the code.

I would appreciate it if you could briefly explain what exactly is happening in the code. It would also help me with future tasks involving huge data sets.

Thanks a lot.

:slight_smile:
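
For reference, here is a commented version of hem.awk, inlined into a shell command, showing what the two rules appear to do. Treat it as a sketch: the sample records are a small subset of the data posted above, the scratch directory is made up for illustration, and `mktemp -d` is assumed to be available.

```shell
tmp=$(mktemp -d)
printf '%s\n' '1000FAAA,100706,11446' '1000FAAA,160237,8409' > "$tmp/file1"
printf '%s\n' \
  '000000E8F900496,13492,E8F900,1000FAAA,100706,11446' \
  '000000E70311496,13492,E70311,1000FAAA,309934,30127' \
  '0000000Y495R496,13492,0Y495R,1000FAAA,160237,8409' > "$tmp/file2"

matches=$(awk -F',' '
  # FNR restarts at 1 for each input file while NR keeps counting, so
  # FNR == NR is true only while the first file (file1) is being read.
  # Store each whole file1 line as a key of the array f1, then skip
  # straight to the next input line.
  FNR == NR { f1[$0]; next }

  # Only file2 lines reach this rule. Join the last three fields with the
  # field separator and print the line if that string is a key of f1.
  ($(NF-2) FS $(NF-1) FS $NF) in f1
' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$matches"
rm -rf "$tmp"
```

Each file is read only once, and every file2 record costs a single array lookup, instead of one full scan of file2 per file1 record as in the original loop. It also explains the empty output from the earlier run: "file1" inside the awk program is just an unset variable, which awk treats as an empty array, so the "in" test never succeeds.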