Faster search needed

daytripper1021 · July 13, 2012, 4:52am

Hope you guys out there can help.

I have 2 files as below:

file 1:

111,222,333,444,555,666
777,888,999,000,111,222
111,222,333,444,555,888

file 2:

666,AAA
222,BBB
888,CCC

I want to get the 6th column from file 1 (example, 666) and check in file 2 for the value in the 2nd column (AAA). Then print the file2 value (AAA) at the end of file1. Results should be as below:

result:

111,222,333,444,555,666,AAA
777,888,999,000,111,222,BBB
111,222,333,444,555,888,CCC

I already have a code for this but I found it to be slow (file1 has about a million lines while file2 has about 20,000 lines). I think there should be a faster way of doing this. See below for the code I did:

for line in `cat file2`
do
  cellid=`echo $line|awk -F"," {'print $6'}`
  area=`nawk -F"," -v cellid=$cellid '{if($1==cellid) print $2}' file2`
  echo "$line,$area"  >> result.txt
done

Hope you can help.

Thanks in advance!

elixir_sinari · July 13, 2012, 4:59am

nawk -F, 'FNR==NR{a[$1]=$2;next}{$(NF+1)=a[$NF]}1' OFS="," file2 file1

I haven't tested the performance though...

daytripper1021 · July 13, 2012, 5:04am

Thanks for the quick reply! Btw, I think the code could get faster if, once a match has been found, it ends the search for that value and goes immediately to the next. Would you know how to add this in?

Klashxx · July 13, 2012, 5:16am

this should be pretty fast :

sort  -t, -k 6,6  -o file1 file1

sort  -t, -k 1,1  -o file2 file2

join -t, -o 1.1,1.2,1.3,1.4,1.5,1.6,2.2 -j1 6 -j2 1 file1 file2 |sort -t, -k 7,7

daytripper1021 · July 13, 2012, 7:11am

the join command had no output.

elixir_sinari · July 13, 2012, 7:44am

Have you tried it?

Klashxx · July 13, 2012, 7:47am

Works fine in a HP-UX box:

# cat file1
111,222,333,444,555,666
777,888,999,000,111,222
111,222,333,444,555,888
# cat file2
666,AAA
222,BBB
888,CCC
# sort  -t, -k 6,6  -o file1 file1
# sort  -t, -k 1,1  -o file2 file2
#cat file1 file2
777,888,999,000,111,222
111,222,333,444,555,666
111,222,333,444,555,888
222,BBB
666,AAA
888,CCC
# join -t, -o 1.1,1.2,1.3,1.4,1.5,1.6,2.2 -j1 6 -j2 1 file1 file2
777,888,999,000,111,222,BBB
111,222,333,444,555,666,AAA
111,222,333,444,555,888,CCC

Marcos_ksh · July 13, 2012, 8:58am

# cat file1.txt
111,222,333,444,555,666
777,888,999,000,111,222
111,222,333,444,555,888
# cat file2.txt
666,AAA
222,BBB
888,CCC
# sOldIFS=${IFS}
# IFS=','
# while read s11 s12 s13 s14 s15 s16 ; do
>   while read s21 s22 ; do
>     if [[ ${s16} = ${s21} ]] ; then
>       echo "${s11},${s12},${s13},${s14},${s15},${s16},${s22}"
>       break
>     fi
>   done <file2.txt
> done <file1.txt
111,222,333,444,555,666,AAA
777,888,999,000,111,222,BBB
111,222,333,444,555,888,CCC
# IFS=${sOldIFS}
#

raj_saini20 · July 13, 2012, 9:11am

awk -F, 'FNR==NR{a[$1]=$2;next}(a[$6]){print $0"," a[$6]}'  file2 file1

daytripper1021 · July 17, 2012, 5:58am

klashxx:

Works fine in a HP-UX box:

# cat file1
111,222,333,444,555,666
777,888,999,000,111,222
111,222,333,444,555,888
# cat file2
666,AAA
222,BBB
888,CCC
# sort  -t, -k 6,6  -o file1 file1
# sort  -t, -k 1,1  -o file2 file2
#cat file1 file2
777,888,999,000,111,222
111,222,333,444,555,666
111,222,333,444,555,888
222,BBB
666,AAA
888,CCC
# join -t, -o 1.1,1.2,1.3,1.4,1.5,1.6,2.2 -j1 6 -j2 1 file1 file2
777,888,999,000,111,222,BBB
111,222,333,444,555,666,AAA
111,222,333,444,555,888,CCC

Oops. Sorry I made a mistake. It works in Ubuntu Linux!

Thanks!