Hi,
I have a script that, basically, has two input files of this type:
file1
key1=value1_1_1
key2=value1_2_1
key4=value1_4_1
...
file2
key2=value2_2_1
key2=value2_2_2
key3=value2_3_1
key4=value2_4_1
...
Each file is about 10k lines long.
The keys are strings that don't contain whitespace; the values are classic text strings, without the "=" symbol.
The purpose of the script is to get from file2 the value of each key that appears in both file1 and file2.
The first part of the script sorts file1 and file2 (so that the subsequent merge is O(n) rather than O(n^2)) [one could argue about this sort... but that's not the point right now, since it's not the bottleneck]
Then, basically, I read the two (sorted) files line by line: if the current lines have the same key, I save the value to my output; otherwise, I get the next line from the file whose current key is smaller.
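Concretely, the merge loop is shaped roughly like this (a simplified bash sketch with tiny inline sample files, assuming clean key=value lines for now):

```shell
#!/bin/bash
cd "$(mktemp -d)" || exit 1

# Tiny stand-ins for the real (already sorted) inputs.
printf 'key1=a\nkey2=b\nkey4=c\n' > file1_sorted
printf 'key2=x\nkey2=y\nkey3=z\nkey4=w\n' > file2_sorted

exec 3<file1_sorted 4<file2_sorted
IFS='=' read -r k1 v1 <&3; s1=$?
IFS='=' read -r k2 v2 <&4; s2=$?
while [ "$s1" -eq 0 ] && [ "$s2" -eq 0 ]; do
  if [ "$k1" = "$k2" ]; then
    printf '%s=%s\n' "$k2" "$v2"          # key present in both: keep file2's value
    IFS='=' read -r k2 v2 <&4; s2=$?      # file2 may repeat a key, so only advance file2
  elif [[ $k1 < $k2 ]]; then
    IFS='=' read -r k1 v1 <&3; s1=$?      # advance the file with the smaller key
  else
    IFS='=' read -r k2 v2 <&4; s2=$?
  fi
done > merged

cat merged
```

On the sample above this yields key2=x, key2=y, key4=w, one per line.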
The problem here is extracting the keys. After running the script once, I noticed the files are generated with random whitespace around the keys (before the key, or between the key and the "=" symbol). I can't change the generator, so I had to change the script.
I tried three variations of it:
A - sed on the line:
lineFile1=`awk "NR==${currLine1}" file1_sorted`
keyFile1=`echo "$lineFile1" | sed -e 's/\s*\(\S*\)\s*=.*/\1/'`
This sed keeps the run of non-whitespace characters to the left of the equals sign.
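Run in isolation it behaves like this (GNU sed, since \s and \S are GNU extensions for [[:space:]] and [^[:space:]]):

```shell
# Stray blanks around the key and before '=' are discarded; only the key survives.
printf '  key2 = value\n' | sed -e 's/\s*\(\S*\)\s*=.*/\1/'
# prints: key2
```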
As you might imagine, that took an awful lot of time.
real 0m1.030s
user 0m0.996s
sys 0m0.028s
This is clearly not acceptable, since I have to do this operation for each of the ~20k lines.
So I tried option B:
B - using cut on each line
lineFile1=`awk "NR==${currLine1}" file1_sorted`
keyFile1=`echo "$lineFile1" | cut -d '=' -f 1 | sed -e 's/\s//g'`
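In isolation, this pipeline does the same job in two steps:

```shell
# cut keeps everything left of the first '=', sed then strips the blanks
printf '  key2 = value\n' | cut -d '=' -f 1 | sed -e 's/\s//g'
# prints: key2
```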
That wasn't that much better...
real 0m0.659s
user 0m0.632s
sys 0m0.028s
Still not acceptable.
So I browsed this forum a bit, read somewhere that "cat foo | bar" isn't recommended, and changed the code a little.
I didn't need that lineFile1 there, so there was no point in retrieving it.
I added
cut -d '=' -f 1 file1_sorted > file1_keys_sorted
before my calls, and I'm now using
keyFile1=`awk "NR==${currLine1}" file1_keys_sorted`
to get the key.
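End to end, on a toy input (using the same hypothetical file names as above), that gives:

```shell
cd "$(mktemp -d)" || exit 1
printf 'key1=a\nkey2=b\nkey4=c\n' > file1_sorted

cut -d '=' -f 1 file1_sorted > file1_keys_sorted   # done once, outside the loop

currLine1=2
keyFile1=$(awk "NR==${currLine1}" file1_keys_sorted)
echo "$keyFile1"
# prints: key2
```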
This is way better:
real 0m0.043s
user 0m0.032s
sys 0m0.008s
The problem is ... it's still taking too much time. From my logs, I'm processing approximately 20 lines per second, so one loop iteration takes ~0.050 sec (this includes the awk I run on the file1_sorted file to get the output). That works out to ~17 min for a 20k-line input.
Is there some way to speed this up? (Clearly, the bottleneck is this line-fetching step.)
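For what it's worth, I suspect part of the cost is that every `awk "NR==..."` call forks a new process and rescans the file from line 1, so the loop is effectively quadratic in total work. Would a single awk pass over both files, hashing file1's keys, be the sane way to do this? A rough sketch of what I mean (assumes values never contain '=', as stated above; sample files are made up here):

```shell
# One pass over each file: remember file1's keys, then print matching
# values from file2.  Whitespace noise around the keys is stripped in awk.
cd "$(mktemp -d)" || exit 1
printf 'key1=a\n key2 =b\nkey4=c\n' > file1
printf 'key2= x\nkey2=y\nkey3=z\n key4=w\n' > file2

awk -F'=' '
  NR==FNR { k=$1; gsub(/[ \t]/, "", k); seen[k]=1; next }   # file1: collect keys
  {
    k=$1; gsub(/[ \t]/, "", k)                              # file2: trim the key
    if (k in seen) { v=$2; sub(/^[ \t]+/, "", v); print v } # key in both: emit value
  }
' file1 file2 > matches

cat matches
# prints: x, y, w (one per line)
```

This would also skip the sort entirely (awk arrays are hashes), at the cost of holding file1's ~10k keys in memory.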
Thanks!
PS: For some reason, the process never uses more than ~8% of my CPU. Are some of the commands I'm using inherently slow? (echo, perhaps?)