Very Challenging Problem. Please read fully.

Hi,

This is the Third thread i'm putting here for the same problem. :frowning:

Actually, i'm trying a script like this.. but its taking a long time.. about 3 days to complete fully..


#!/bin/ksh

if [ $# != 2 ] 
then
exit 1
fi

while read i
do
    while read j
    do
	field7=`echo $j|cut -d "|" -f7`
	field13=`echo $j|cut -d "|" -f13`
	field14=`echo $j|cut -d "|" -f14`
        if [ "${i}" == "${field7}|${field13}|${field14}" ]
	then
	print "$j"
	break 
	fi
    done < $2
done < $1

The first file being

1TVAO|OVEPT|VO
1TVAO|OVPDM|VO
6NFXE|17CLP|DH
6NFXE|NRZO4|EQ
6NFXE|SMOSA|EQ
ACA15|11X1W|DX
ACA15|1LN88|DX
ACA15|1LNSK|DX
ACA15|1LNVX|DX
ACA15|1LNVX|FD
ACA15|1ZOAA|DX
ACA15|NRLAF|DX
ACA15|NRZCN|DX
ACA15|NRZFC|DX
ACA15|NRZX8|DX
ACA15|O41AC|DX
ACA17|1LN88|DX
ACA17|NRZX8|DX
ACA1E|11X2W|DX
ACA1E|1LN88|DH

The second file being...

1TVAO|S3WS0306|45101000|4513000|AJGJ|CB10|1TVAO|S3WS033306|4513101000|4513201000|AJBFGJ|CB10|OVEPT|VO|430300|430300|430300|009|IC    |Z
1TVAO|S3WS0306|45101000|451000|AJFGJ|CB10|1TVAO|S3WS033306|4513101000|4513201000|AJBFGJ|CB10|OVPDM|VO|430300|430300|430300|009|IC    |Z
6NFXE|S3SN0201|41101000|451101000|B7HT|CB10|6NFXE|S3SN021401|4511101000|4511201000|B7BXHT|CB10|17CLP|DH|******|6670NI|410402|011|LQ    |Z
AGRJE|NA|NA|NA|NA|NA|6NFXE|S3021401|4511101000|4511201000|B7BXHT|CB10|NRZO4|EQ|402100|6670DC|410402|001|EQ|Z|U|Y|VT
6NFXE|S3SN0201|41101000|45111000|BXHT|CB10|6NFXE|S3SN021401|4511101000|4511201000|B7BXHT|CB10|SMOSA|EQ|******|6670NI|410402|016|EQ    |Z
ACA15|S3BW1120|41101000|4511000|AEHZ|CB10|ACA15|S3BW100120|4511101000|4511201000|AEBDHZ|CB10|11X1W|DX|410312|410312|410312|011|LQ    |Z
ACA15|S3BW1120|41101000|45112000|AEZ|CB10|ACA15|S3BW100120|4511101000|4511201000|AEBDHZ|CB10|1LN88|DX|410312|410312|410312|A14|IOC   |Z
ARCXE|NA|NA|NA|NA|NA|A5|S3BW100120|4511101000|4511201000|AEBDHZ|CB10|1LN88|DX|410312|420100|420100|A14|IOC   |Z
ACA15|NA|NA|NA|NA|NA|A15|NA|NA|NA|NA|NA|1LNSK|DX|410312|410312|410312|A14|TC    |Z
ACA15|NA|NA|NA|NA|NA|A15|NA|NA|NA|NA|NA|1LNVX|DX|410312|410312|410312|009|IOC   |Z
ALBBE|S3BW1118|41101000|4511201000|KSBL|CB20|ACA15|S3BW100120|4511101000|4511201000|KPASBL|CB20|1LNVX|FD|410312|410210|410210|A14|IOC|Z|N|Y|IS
ACA15|NA|NA|NA|NA|NA|ACA15|NA|NA|NA|NA|NA|1ZOAA|DX|410312|410312|410312|011|LQ|Z|A|Y|IS
ACA15|NA|NA|NA|NA|NA|ACA15|NA|NA|NA|NA|NA|NRLAF|DX|410312|410312|410312|A15|EQ    |Z
ACA15|S3BW1120|41101000|4511201000|AEHZ|CB10|ACA15|S3BW100120|4511101000|4511201000|AEBDHZ|CB10|NRZCN|DX|410312|410312|410312|009|NQ    |Z
ACA15|NA|NA|NA|NA|NA|ACA15|NA|NA|NA|NA|NA|NRZFC|DX|410312|410312|410312|009|NQ    |Z
ACA15|S3BW1120|41101000|4511201000|AEHZ|CB10|ACA15|S3BW100120|4511101000|4511201000|AEBDHZ|CB10|NRZX8|DX|410312|410312|410312|A14|NQ    |Z
ACA15|S3BW1120|41101000|4511201000|AEDHZ|CB10|ACA15|S3BW100120|4511101000|4511201000|AEBDHZ|CB10|O41AC|DX|410312|410312|410312|009|NQ-AC|Z|N|Y|IS
ACA17|S3BW1120|42111000|4512201000|AEDHZ|CB10|ACA17|S3BW100120|4512111000|4512201000|AEBDHZ|CB10|1LN88|DX|410325|410312|410312|A14|IOC   |Z
ACA17|S3BW1120|42111000|4512201000|AHZ|CB10|ACA17|S3BW100120|4512111000|4512201000|AEBDHZ|CB10|NRZX8|DX|410325|410312|410312|009|NQ    |Z
ACA1E|S3BW1120|41101000|4511201000|ADHZ|CB10|ACA1E|S3BW100120|4511101000|4511201000|AEBDHZ|CB10|11X2W|DX|410312|410312|410312|011|LQ    |Z

The expected result is..

1TVAO|S3WS033306|4513101000|4513201000|AJBFGJ|CB10|1TVAO|S3WS033306|4513101000|4513201000|AJBFGJ|CB10|OVEPT|VO|430300|430300|430300|009|IC    |Z
1TVAO|S3WS033306|4513101000|4513201000|AJBFGJ|CB10|1TVAO|S3WS033306|4513101000|4513201000|AJBFGJ|CB10|OVPDM|VO|430300|430300|430300|009|IC    |Z
6NFXE|S3SN021401|4511101000|4511201000|B7BXHT|CB10|6NFXE|S3SN021401|4511101000|4511201000|B7BXHT|CB10|17CLP|DH|******|6670NI|410402|011|LQ    |Z
AGRJE|NA|NA|NA|NA|NA|6NFXE|S3SN021401|4511101000|4511201000|B7BXHT|CB10|NRZO4|EQ|402100|6670DC|410402|001|EQ|Z|U|Y|VT
6NFXE|S3SN021401|4511101000|4511201000|B7BXHT|CB10|6NFXE|S3SN021401|4511101000|4511201000|B7BXHT|CB10|SMOSA|EQ|******|6670NI|410402|016|EQ    |Z
ACA15|S3BW100120|4511101000|4511201000|AEBDHZ|CB10|ACA15|S3BW100120|4511101000|4511201000|AEBDHZ|CB10|11X1W|DX|410312|410312|410312|011|LQ    |Z
ACA15|S3BW100120|4511101000|4511201000|AEBDHZ|CB10|ACA15|S3BW100120|4511101000|4511201000|AEBDHZ|CB10|1LN88|DX|410312|410312|410312|A14|IOC   |Z
ACA15|NA|NA|NA|NA|NA|ACA15|NA|NA|NA|NA|NA|1LNSK|DX|410312|410312|410312|A14|TC    |Z
ACA15|NA|NA|NA|NA|NA|ACA15|NA|NA|NA|NA|NA|1LNVX|DX|410312|410312|410312|009|IOC   |Z
ALBBE|S3BW1118|451000|45111000|KPASBL|CB20|ACA15|S3BW100120|4511101000|4511201000|KPASBL|CB20|1LNVX|FD|410312|410210|410210|A14|IOC|Z|N|Y|IS
ACA15|NA|NA|NA|NA|NA|ACA15|NA|NA|NA|NA|NA|1ZOAA|DX|410312|410312|410312|011|LQ|Z|A|Y|IS
ACA15|NA|NA|NA|NA|NA|ACA15|NA|NA|NA|NA|NA|NRLAF|DX|410312|410312|410312|A15|EQ    |Z
ACA15|S3BW1120|45111000|45112000|AEZ|CB10|ACA15|S3BW100120|4511101000|4511201000|AEBDHZ|CB10|NRZCN|DX|410312|410312|410312|009|NQ    |Z
ACA15|NA|NA|NA|NA|NA|ACA15|NA|NA|NA|NA|NA|NRZFC|DX|410312|410312|410312|009|NQ    |Z
ACA15|S0120|4101000|4511000|AEHZ|CB10|ACA15|S3BW100120|4511101000|4511201000|AEBDHZ|CB10|NRZX8|DX|410312|410312|410312|A14|NQ    |Z
ACA15|S30120|41101000|45112000|AEHZ|CB10|ACA15|S3BW100120|4511101000|4511201000|AEBDHZ|CB10|O41AC|DX|410312|410312|410312|009|NQ-AC|Z|N|Y|IS
ACA17|S3100120|4111000|45122000|AEHZ|CB10|ACA17|S3BW100120|4512111000|4512201000|AEBDHZ|CB10|1LN88|DX|410325|410312|410312|A14|IOC   |Z
ACA17|S3BW1120|411000|4512201000|AEBDHZ|CB10|ACA17|S3BW100120|4512111000|4512201000|AEBDHZ|CB10|NRZX8|DX|410325|410312|410312|009|NQ    |Z
ACA1E|S3BW1120|4101000|4511201000|AEBDHZ|CB10|ACA1E|S3BW100120|4511101000|4511201000|AEBDHZ|CB10|11X2W|DX|410312|410312|410312|011|LQ    |Z
ACA1E|S3BW20|4111000|4511201000|AEBDHZ|CB10|ACA1E|S3BW100120|4511101000|4511201000|AEBDHZ|CB10|1LN88|DH|410312|410312|410312|A14|IOC   |Z

The above script was posted in this forum by fpmurphy. Thanks for that fpmuphy.. but this script is taking about 3 days to scan 48000 records in first file to 77000 records in second file..

The earlier threads which i had posted and contains some more description about the problem are...

Can we use some kind of counter, so that the script starts scanning the next record from the second file from where it stopped.
What i'm trying to say is that, here, i believe its taking more time because every time it fetches a record from first file it scans the second file from the beginning. But the second file is sorted with respect to 7th, 13th and 14th column.

Thanks,
RRVARMA

Would it only be a one-to-one relationship between those key fields?

Could the 2nd file be made into a new temp file where
3keyfields"~"full_data_record
then sorted

perhaps a grep or other command
if successful then cut -d"~" -f2 >>new_file

Hi joeyg,

Thanks for the feed back.. but i'm not able to get you.. :frowning:

Here both the files are sorted.

Thanks,
RRVARMA

Just as an aside, when I process the first two files with that script, I don't get the output in third script. I get:

1TVAO|S3WS0306|45101000|4513000|AJGJ|CB10|1TVAO|S3WS033306|4513101000|4513201000|AJBFGJ|CB10|OVEPT|VO|430300|430300|430300|009|IC    |Z
1TVAO|S3WS0306|45101000|451000|AJFGJ|CB10|1TVAO|S3WS033306|4513101000|4513201000|AJBFGJ|CB10|OVPDM|VO|430300|430300|430300|009|IC    |Z
6NFXE|S3SN0201|41101000|451101000|B7HT|CB10|6NFXE|S3SN021401|4511101000|4511201000|B7BXHT|CB10|17CLP|DH|******|6670NI|410402|011|LQ    |Z
AGRJE|NA|NA|NA|NA|NA|6NFXE|S3021401|4511101000|4511201000|B7BXHT|CB10|NRZO4|EQ|402100|6670DC|410402|001|EQ|Z|U|Y|VT
6NFXE|S3SN0201|41101000|45111000|BXHT|CB10|6NFXE|S3SN021401|4511101000|4511201000|B7BXHT|CB10|SMOSA|EQ|******|6670NI|410402|016|EQ    |Z
ACA15|S3BW1120|41101000|4511000|AEHZ|CB10|ACA15|S3BW100120|4511101000|4511201000|AEBDHZ|CB10|11X1W|DX|410312|410312|410312|011|LQ    |Z
ACA15|S3BW1120|41101000|45112000|AEZ|CB10|ACA15|S3BW100120|4511101000|4511201000|AEBDHZ|CB10|1LN88|DX|410312|410312|410312|A14|IOC   |Z
ALBBE|S3BW1118|41101000|4511201000|KSBL|CB20|ACA15|S3BW100120|4511101000|4511201000|KPASBL|CB20|1LNVX|FD|410312|410210|410210|A14|IOC|Z|N|Y|IS
ACA15|NA|NA|NA|NA|NA|ACA15|NA|NA|NA|NA|NA|1ZOAA|DX|410312|410312|410312|011|LQ|Z|A|Y|IS
ACA15|NA|NA|NA|NA|NA|ACA15|NA|NA|NA|NA|NA|NRLAF|DX|410312|410312|410312|A15|EQ    |Z
ACA15|S3BW1120|41101000|4511201000|AEHZ|CB10|ACA15|S3BW100120|4511101000|4511201000|AEBDHZ|CB10|NRZCN|DX|410312|410312|410312|009|NQ    |Z
ACA15|NA|NA|NA|NA|NA|ACA15|NA|NA|NA|NA|NA|NRZFC|DX|410312|410312|410312|009|NQ    |Z
ACA15|S3BW1120|41101000|4511201000|AEHZ|CB10|ACA15|S3BW100120|4511101000|4511201000|AEBDHZ|CB10|NRZX8|DX|410312|410312|410312|A14|NQ    |Z
ACA15|S3BW1120|41101000|4511201000|AEDHZ|CB10|ACA15|S3BW100120|4511101000|4511201000|AEBDHZ|CB10|O41AC|DX|410312|410312|410312|009|NQ-AC|Z|N|Y|IS
ACA17|S3BW1120|42111000|4512201000|AEDHZ|CB10|ACA17|S3BW100120|4512111000|4512201000|AEBDHZ|CB10|1LN88|DX|410325|410312|410312|A14|IOC   |Z
ACA17|S3BW1120|42111000|4512201000|AHZ|CB10|ACA17|S3BW100120|4512111000|4512201000|AEBDHZ|CB10|NRZX8|DX|410325|410312|410312|009|NQ    |Z
ACA1E|S3BW1120|41101000|4511201000|ADHZ|CB10|ACA1E|S3BW100120|4511101000|4511201000|AEBDHZ|CB10|11X2W|DX|410312|410312|410312|011|LQ    |Z

That script is inefficient, but is it even doing what you want it do? We want to get the algorithm right.

You can give this a try but it uses a significant amount of memory if you have big files:

awk 'BEGIN{FS=OFS="|"}
NR==FNR{a[$0]=1;next}
a[$7FS$13FS$14]{print}' file1 file2

If you get errors use nawk, gawk or /usr/xpg4/bin/awk on Solaris.

Regards

Franklin's script runs really fast. If memory is an issue, though, try the following:

#!/bin/ksh

if [ $# != 2 ] 
then
exit 1
fi

IFS="|"

while read i
do
    set -A array $i

    while read j
    do
        field7=${array[6]}
        field13=${array[12]}
        field14=${array[13]}

        if [ "${j}" == "${field7}|${field13}|${field14}" ]
	then
	echo "$i"
	break 
	fi
    done < $1
done < $2

If I understand your problem, you're doing a global search and match, output conditional on that match. The reason it takes so long is that you are going through your big file for each line in your control. By the above script, you are processing "J" times for each "I". If you start by sorting your "J" records by the 7, 13 and 14 fields, then you can "ladder walk" the "I" records and the "J" records, thus only going through each list once. Your script will be done almost immediately. The "i" records must, of course, also be sorted the same way.

@RRVARMA
Your script give me
real 0m2.997s
user 0m0.980s
sys 0m1.624s

Replace

	field7=`echo $j|cut -d "|" -f7`
	field13=`echo $j|cut -d "|" -f13`
	field14=`echo $j|cut -d "|" -f14`
        if [ "${i}" == "${field7}|${field13}|${field14}" ]

by:

 if [ "${i}" == "$(echo $j | awk 'BEGIN{FS=OFS="|"}{print $7, $13, $14}')" ]

and you will save %43 procesing time using the same resources.
real 0m1.683s
user 0m0.612s
sys 0m0.888s

@cooldude solution demands more memory but is faster.
real 0m0.211s
user 0m0.036s
sys 0m0.004s

@Franklin52 awk solution is the best, but uses a significant amount of memory
real 0m0.005s
user 0m0.000s
sys 0m0.004s

If you really want to speed things up, store the contents of the first file in an array. Then read the records of the second file one by one and compare the seventh, 13th, and 14th records against the array. You can perform the binary search on the array since the array is already sorted. The problem with fpmurphy's solution was that he was constantly invoking cut and reading the file again and again. In my solution, I used ksh's functionality to split the fields in the file, but I am still reading the file again and again. I'm betting the solution I'm proposing will be even faster than Franklin52's.

you're lucky you have not had your thread closed for cross posting if this is he third thread you have started for the same topic

Since this thread seems to be the longest of the three that he had started, I am keeping this open and closing the others.
RRVARMA, please adhere to the rules. No double posting or bumping up the posts.
Here are the other threads.

Hi,

Thanks cooldude, but your script is fetching not only the unique records but also the other records as well.. I'm confused.. :confused:

But its quite faster than the other scripts which i tried.

Thanks vino, for closing other two threads.. Sorry that i didn't know how to close those threads.

I recommend Franklin's solution. But here's another version:

#!/bin/ksh

if [ $# != 2 ] 
then
exit 1
fi

while read i
do
    while read j
    do
	set -A array $i
	field7=${array[6]}
	field13=${array[12]}
	field14=${array[13]}
        if [ "${i}" == "${field7}|${field13}|${field14}" ]
	then
	print "$j"
	break 
	fi
    done < $2
done < $1

It is the same as the script you posted in your first post, except that I am not using cut to extract fields.