reldb
June 8, 2018, 1:44pm
1
I have a very large input file, inputFile1.txt, which contains a list of mobile numbers:
inputFile1.txt
3434343
3434323
0970978
85233
... around 1 million records
I have another large file, inputFile2.txt, which contains log details:
inputFile2.txt
afjhjdhfkjdhfkd df h8983 3434343 | 3483 | myout1 | 9uohksdf
afjhjdhfkjdhfkd df h8983 3434343 | 3483 | myout2 | 9uohksdf
afjhjdhfkjdhfkd df h8983 0970978| 3483 | myout3 | 9uohksdf
I have a third large file, inputFile3.txt, which also contains log details:
afjhjdhfkjdhfkd df h8983 myout1 | 3iroi2 | FinalOut1 | 3243
afjhjdhfkjdhfkd df h8983 myout2 | 3iroi2 | FinalOut2 | 3243
afjhjdhfkjdhfkd df h8983 myout2 | 3iroi2 | FinalOut3 | 3243
Basically, I need to take each number from inputFile1.txt, search for it in inputFile2.txt to extract the matching values (myout1 and myout2 here), and then look those values up in inputFile3.txt to get the final values (FinalOut1 / FinalOut2 / FinalOut3).
The output should be:
3434343 myout1 FinalOut1
3434343 myout2 FinalOut2
3434343 myout2 FinalOut3
I was doing this in a shell script using the grep command, but it takes forever: more than 10-20 hours.
Is there a better and faster way to handle it?
Thanks in advance
Guessing you're running grep once per record, if it's taking hours. How about:
$ awk 'LFN != FILENAME { LFN = FILENAME ; FILENUM++ }
FILENUM==1 { A[$1] ; next }
FILENUM==2 { if($4 in A) S1[$6] = $4 ; next }
FILENUM==3 { if($4 in S1) print S1[$4], $4, $6 }' \
FS="[ |]+" inputFile1.txt inputFile2.txt inputFile3.txt
3434343 myout1 FinalOut1
3434343 myout2 FinalOut2
3434343 myout2 FinalOut3
$
One command.
If your real data's any different from what you posted it may need fine tuning.
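Before fine tuning, it can help to see how awk numbers the fields of one real log line under the same FS, so you can confirm that $4 and $6 are the columns you want. A minimal sketch; show_fields is a hypothetical helper name, not something from the command above:

```shell
# Hypothetical helper: print each field of the first input line with its
# field number, using the same separator as the one-liner (FS="[ |]+").
show_fields() {
    awk -F '[ |]+' 'NR == 1 { for (i = 1; i <= NF; i++) print i ": " $i; exit }'
}

# Sample line copied from the posted inputFile2.txt.
printf '%s\n' 'afjhjdhfkjdhfkd df h8983 3434343 | 3483 | myout1 | 9uohksdf' | show_fields
```

For the sample line this lists seven fields, with the number in field 4 and myout1 in field 6. If the real logs show them elsewhere, adjust the $4 and $6 references accordingly.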
RudiC
June 8, 2018, 3:58pm
3
Try also (tackling it from the other end)
awk '
FNR == 1        {FILE++
                }
FILE < 3        {FIN[FILE,$4] = FIN[FILE,$4] $8 FS
                }
FILE == 3       {n = split(FIN[2,$1], T1)
                 for (i=1; i<=n; i++)   {m = split(FIN[1,T1[i]], T2)
                                         for (j=1; j<=m; j++) print $1, T1[i], T2[j]
                                        }
                }
' file3 file2 file1
3434343 myout1 FinalOut1
3434343 myout2 FinalOut2
3434343 myout2 FinalOut3
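Note that this relies on default whitespace splitting, so $4 and $8 only line up when every | has a space on both sides. The third sample line of inputFile2.txt ("0970978| 3483 ...") has no space before the first |, which makes $4 come out as "0970978|". One possible workaround, as a sketch only, is to pad every | with spaces before reading the fields; pad_pipes is a hypothetical name used for the demonstration:

```shell
# Hypothetical demo: pad each "|" with spaces so that whitespace splitting
# yields the same field numbers whether or not the log had spaces there.
# gsub() on $0 makes awk re-split the record into fields.
pad_pipes() {
    awk '{ gsub(/[|]/, " | "); print $4, $8 }'
}

# Third sample line from the posted inputFile2.txt (no space before first "|").
printf '%s\n' 'afjhjdhfkjdhfkd df h8983 0970978| 3483 | myout3 | 9uohksdf' | pad_pipes
```

This prints "0970978 myout3". The same gsub() call could go at the top of the FILE < 3 block if the real logs are irregular in this way.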
Another version, which uses | with any surrounding spaces as the field separator and allows for variability in field 1 by using only its last space-separated subfield:
awk '
FNR==1{
   fn++
}
fn==1 {
   A[$1]
   next
}
{
   n=split($1, F, " ")
   i=F[n]
}
fn==2 {
   if(i in A)
      B[$3]=i
}
fn==3 {
   if(i in B)
      print B[i], i, $3
}
' file1 FS=' *[|] *' file2 file3
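To see what the separator and the last-subfield step do to a single log line, here is a small sketch on its own; last_subfield is a hypothetical helper name, and the input is the third sample line from inputFile2.txt, which has no space before the first |:

```shell
# Hypothetical demo: split on "|" with optional surrounding spaces, then
# take the last space-separated word of field 1 as the lookup key.
last_subfield() {
    awk -F ' *[|] *' '{ n = split($1, F, " "); print F[n], $3 }'
}

printf '%s\n' 'afjhjdhfkjdhfkd df h8983 0970978| 3483 | myout3 | 9uohksdf' | last_subfield
```

This prints "0970978 myout3": field 1 is the whole prefix "afjhjdhfkjdhfkd df h8983 0970978", and only its last word is used, so the number of words before the key does not matter.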