Need help for faster file read and grep in big files

reldb · June 8, 2018, 1:44pm

I have a very large input file, inputFile1.txt, which contains a list of mobile numbers.

inputFile1.txt

... around 1 million records

I have another large file, inputFile2.txt, which contains log details:
inputFile2.txt

afjhjdhfkjdhfkd df h8983 3434343 | 3483 | myout1 | 9uohksdf
afjhjdhfkjdhfkd df h8983 3434343 | 3483 | myout2 | 9uohksdf
afjhjdhfkjdhfkd df h8983 0970978| 3483 | myout3 | 9uohksdf

I have a third large file, inputFile3.txt, which also contains log details:

afjhjdhfkjdhfkd df h8983 myout1  | 3iroi2 | FinalOut1 | 3243
afjhjdhfkjdhfkd df h8983 myout2  | 3iroi2 | FinalOut2 | 3243
afjhjdhfkjdhfkd df h8983 myout2  | 3iroi2 | FinalOut3 | 3243

Basically, I need to take each line from inputFile1.txt, search for it in inputFile2.txt to extract the matching values (myout1 and myout2 here), and then look those values up in inputFile3.txt to get the final values (FinalOut1, FinalOut2, FinalOut3).

The output should look like this:

3434343 myout1 FinalOut1 
3434343 myout2 FinalOut2 
3434343 myout2 FinalOut3

I was doing this in a shell script using the grep command, but it takes forever: more than 10-20 hours.
Is there a better and faster way to handle it?

Thanks in advance

Corona688 · June 8, 2018, 3:33pm

Guessing you're running grep once per record, if it's taking hours. How about:

$ awk 'LFN != FILENAME { LFN = FILENAME ; FILENUM++ }
FILENUM==1 { A[$1] ; next }
FILENUM==2 { if($4 in A)        S1[$6] = $4 ; next }
FILENUM==3 { if($4 in S1) print S1[$4], $4, $6 }' \
        FS="[ |]+" inputFile1.txt inputFile2.txt inputFile3.txt

3434343 myout1 FinalOut1
3434343 myout2 FinalOut2
3434343 myout2 FinalOut3

$

One command.

If your real data's any different from what you posted it may need fine tuning.
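
A quick way to check where the fields land before fine tuning is to print each field with its index, using the same separator. This is only a sketch: the sample line is copied from the inputFile2.txt excerpt above, and the variable names `line` and `fields` are made up for illustration.

```shell
# Sample line taken from the inputFile2.txt excerpt in this thread.
line='afjhjdhfkjdhfkd df h8983 3434343 | 3483 | myout1 | 9uohksdf'

# Split on runs of spaces and pipes (the same FS as above) and
# print every field as index=value, one per line.
fields=$(printf '%s\n' "$line" |
  awk -F'[ |]+' '{ for (i = 1; i <= NF; i++) printf "%d=%s\n", i, $i }')

printf '%s\n' "$fields"
```

If the number and the lookup value do not show up as fields 4 and 6 on the real data, the field references in the script would need adjusting to match.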

RudiC · June 8, 2018, 3:58pm

Try also (tackling it from the other end)

awk '
FNR == 1        {FILE++
                }
FILE < 3        {FIN[FILE,$4] = FIN[FILE,$4] $8 FS
                }
FILE == 3       {n = split(FIN[2,$1], T1)
                 for (i=1; i<=n; i++)   {m = split(FIN[1,T1[i]], T2)
                                         for (j=1; j<=m; j++) print $1, T1[i], T2[j]
                                        }
                }
' file3 file2 file1
3434343 myout1 FinalOut1
3434343 myout2 FinalOut2
3434343 myout2 FinalOut3
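
The script above collects several values under one key by appending them to a string and splitting that string later. A minimal sketch of just that step, assuming simple two-column input; the array name `list` and the variable `pairs` are illustrative and not part of the script:

```shell
# Append every second column to a space-separated list keyed by the
# first column, then split the list for one key back into its items.
pairs=$(printf '%s\n' 'myout1 FinalOut1' 'myout2 FinalOut2' 'myout2 FinalOut3' |
  awk '{ list[$1] = list[$1] $2 FS }
       END {
         n = split(list["myout2"], T)   # default split on whitespace
         for (j = 1; j <= n; j++) print j, T[j]
       }')

printf '%s\n' "$pairs"
```

This relies on the values containing no whitespace themselves, which holds for the sample data in this thread.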

Scrutinizer · June 9, 2018, 4:45am

Another version, which treats the pipe together with its surrounding spaces as the field separator and takes into account potential variability in field 1 by using its last subfield:

awk '
  FNR==1{
    fn++
  }
  fn==1 {
    A[$1]
    next
  }
  {
    n=split($1, F, " ")
    i=F[n]
  } 
  fn==2 {
    if(i in A)
      B[$3]=i
  }
  fn==3 {
    if(i in B)
      print B[i], i, $3
  }
' file1 FS=' *[|] *' file2 file3
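
To try this kind of lookup on a small scale before running it on the real files, something like the following could be used. It is a sketch under assumptions: the contents of the first file were not posted, so it is assumed to hold one number per line, and the scratch directory, the `key` variable, and `result` are made up for the test.

```shell
# Scratch directory holding the sample rows quoted in this thread.
dir=$(mktemp -d)
printf '%s\n' 3434343 > "$dir/file1"   # assumed format: one number per line
cat > "$dir/file2" <<'EOF'
afjhjdhfkjdhfkd df h8983 3434343 | 3483 | myout1 | 9uohksdf
afjhjdhfkjdhfkd df h8983 3434343 | 3483 | myout2 | 9uohksdf
afjhjdhfkjdhfkd df h8983 0970978| 3483 | myout3 | 9uohksdf
EOF
cat > "$dir/file3" <<'EOF'
afjhjdhfkjdhfkd df h8983 myout1  | 3iroi2 | FinalOut1 | 3243
afjhjdhfkjdhfkd df h8983 myout2  | 3iroi2 | FinalOut2 | 3243
afjhjdhfkjdhfkd df h8983 myout2  | 3iroi2 | FinalOut3 | 3243
EOF

result=$(awk '
  FNR == 1 { fn++ }                       # count which file we are in
  fn == 1  { A[$1]; next }                # remember every number
           { n = split($1, F, " "); key = F[n] }   # last word of field 1
  fn == 2 && (key in A) { B[$3] = key }   # lookup value -> number
  fn == 3 && (key in B) { print B[key], key, $3 }
' "$dir/file1" FS=' *[|] *' "$dir/file2" "$dir/file3")

printf '%s\n' "$result"
rm -rf "$dir"
```

Each file is read once and every lookup is an array membership test, which is what avoids rescanning the log files for each number.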