cut, sed, awk too slow to retrieve line - other options?

Hi,
I have a script that, basically, has two input files of this type:

file1
key1=value1_1_1
key2=value1_2_1
key4=value1_4_1
...

file2
key2=value2_2_1
key2=value2_2_2
key3=value2_3_1
key4=value2_4_1
...

My files are about 10k lines each.
The keys are strings that don't contain whitespace; the values are ordinary text strings that never contain the "=" symbol.

The purpose of the script is to get from file2 the value of each key that appears in both file1 and file2.

The first part of the script sorts file1 and file2 (in order to get O(n) complexity rather than O(n^2)) [one could argue about this sort, but that's not the point right now, since it's not the bottleneck].

Then, basically, I read each line of the (sorted) files, check whether they have the same key, and if they do, save the value to my output. Otherwise, I advance to the next line in the file whose current key is the smaller one.
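In shell terms the loop is roughly this (a sketch, not my actual script; getKey stands for whichever key-extraction variant I'm testing below, and numLines1/numLines2/output are made-up names):

currLine1=1
currLine2=1
while [ "$currLine1" -le "$numLines1" ] && [ "$currLine2" -le "$numLines2" ]; do
    keyFile1=$(getKey file1_sorted "$currLine1")
    keyFile2=$(getKey file2_sorted "$currLine2")
    if [ "$keyFile1" = "$keyFile2" ]; then
        # matching keys: save the value line from file2 and advance it
        awk "NR==${currLine2}" file2_sorted >> output
        currLine2=$((currLine2 + 1))
    elif [ "$keyFile1" \< "$keyFile2" ]; then
        # file1 has the smaller key: advance it
        currLine1=$((currLine1 + 1))
    else
        currLine2=$((currLine2 + 1))
    fi
done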

The problem here is getting the keys. After running the script once, I noticed the files were generated with random whitespace around the "=" symbol (before or after the key). I can't change the generator, so I had to change the script.

I tried three variations of it:

A - sed on the line:

lineFile1=`awk "NR==${currLine1}" file1_sorted`
keyFile1=`echo $lineFile1 | sed -e 's/\s*\(\S*\)\s*=.*/\1/g'`

This sed captures the non-whitespace characters to the left of the equals sign.

As you might imagine, that took an awful lot of time.

real    0m1.030s
user    0m0.996s
sys     0m0.028s

This is clearly not acceptable, since I have to do the operation over 20k lines (at ~1 second per line, that would be almost six hours).

So I tried option B:
B - using cut on each line

lineFile1=`awk "NR==${currLine1}" file1_sorted`
keyFile1=`echo $lineFile1 | cut -d '=' -f 1 | sed -e 's/\s//g'`

That wasn't that much better...

real    0m0.659s
user    0m0.632s
sys     0m0.028s

Still not acceptable.

So I browsed this forum a bit, read somewhere that "cat foo | bar" wasn't recommended, and changed the code a little.

I didn't need that lineFile1 there, so there was no point in retrieving it.
I added

cut -d '=' -f 1 file1_sorted > file1_keys_sorted

before my calls, and I'm now using

keyFile1=`awk "NR==${currLine1}" file1_keys_sorted`

to get the key.

This is way better:

real    0m0.043s
user    0m0.032s
sys     0m0.008s

The problem is ... it's still taking too much time. From my logs, I'm processing approximately 20 lines per second, which means one loop takes ~0.050 sec (this includes the awk I'm running on the file1_sorted file to get the output). That adds up to roughly 17 minutes for a 20k-line input.

Is there some way of speeding up that process? (Clearly, the bottleneck is this retrieve-the-line step.)

Thanks!

PS: For some reason, the process is using at most 8% of my CPU. Are some commands just slow? (echo, perhaps?)

awk -F= 'NR==FNR{a[$1]=$2;next}{if(a[$1]) print $2;}' file1 file2
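(NR==FNR is the usual awk idiom for "still reading the first file": FNR restarts at 1 for each input file while NR keeps counting across files, so the two are only equal on the first one. The first block therefore loads file1's key/value pairs into the array a; the second block runs on file2 and prints the value whenever the key was already seen in file1.)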

Output for the above file1 and file2 (values of key2 and key4):

value2_2_1
value2_2_2
value2_4_1

Is this the expected output? If not, please post the expected one.


awk supports associative arrays. Also, when you run tests, you should use a reasonably large dataset, not just 100 lines. The reason is that what you get from the time command does not reflect your algorithm so much as it reflects creating a process, opening files, etc.
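For instance, looking up a key in an awk array is a single hash operation, with no re-reading of the file. A toy illustration of the mechanism (not the solution itself):

awk -F'=' '{ count[$1]++ }                            # one array lookup per line
           END { for (k in count) print k, count[k] }' file1

This prints every distinct first field with its number of occurrences, in one pass over the file.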

For keys common to file1 and file2 (try this on a big file):

awk -F'=' 'FILENAME=="file1" { arr[$1]=$2; next }
           FILENAME=="file2" { if ($1 in arr) print $1, arr[$1], $2 }' file1 file2 | sort

The expected output would be:

key2=value2_2_1
key2=value2_2_2
key4=value2_4_1

I'm not an awk expert (clearly not :) ), but this is close.
There's just the part where a line in one of the input files can be "key1 =value2_1_2" (with that whitespace), or "\tkey2 =value2_2_2", etc., which doesn't match the pattern here.

@Jim McNamara
I'm running my tests on my 10k-line files :)
However, to get the timings above, I ran time on the exact command, not on the whole process.

awk 'NR==FNR{idx=index($1,"=");if(idx) $1=substr($1,1,idx-1);a[$1]++;next;}{b=$0;idx=index($1,"=");if(idx) $1=substr($1,1,idx-1);if(a[$1]) print b;}' file1 file2
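The index()/substr() pair is what handles the whitespace: when the "=" is glued to the key ("key=value", "key= value"), $1 contains it and the key is cut off just before it; when the "=" stands alone ("key = value"), index() returns 0 and $1 is already the bare key, because awk's default field splitting swallows the surrounding whitespace. The untouched line is saved in b so it can be printed as-is.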

This should be able to handle all cases like

key=value
key =value
key= value
key = value
     key = value
...

Hi, try this,

Modified from Anurag's code:

awk -F"[ =\t]" 'NR==FNR{a[$1]=$2;next}a[$1] || a[$2] { print}'  file1 file2

I just ran it and ... wow ... That was mind-blowing!

Now, the sort is the bottleneck :) But that's OK.

Thanks a lot !

awk -F= 'NR==FNR{a[$1]=$2;next} a[$1] ' file1 file2
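(Here a[$1] is used as a bare pattern: when it is true, i.e. the key was stored from file1, awk applies its default action, print $0. Note that unlike the first one-liner this prints file2's whole line, key and all, rather than just the value.)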