awk to parse huge files

Hello All,

I have a situation as below:

(1) Read a source file (a single file with 1.2 million rows in it).
(2) Read the destination files one by one and replace a few fields in each with the corresponding matching fields from the source file.

I tried as below (please note I am not posting the complete code, just pseudo-code):

awk -F"|" 'NR==FNR { array[$1]=$2;next } {gsub('fields in dest file',array[field positions in dest file]),$0 } 
source_file dest_files*.dat  

The flaw in the above code is that the row gets printed whether or not there is a matching string, and the performance is also poor.

Any suggestions would be appreciated.

Regards,
Ravi

When you need to process a huge amount of data, it is advisable to use what has been designed for exactly that kind of task: a database :wink:

Without the complete code, and without any of the actual data it's working on, we cannot possibly tell you

1) why it's slow
2) why it's not working.

If you post the complete code, and some of the data you're working from, we might be able to

1) speed it up
2) make it work.

...but we can only do wild guessing right now.
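
As one such wild guess: if the layout is roughly what your pseudocode suggests, building the lookup table from the first file and only substituting when the key is actually present avoids both rewriting every row and running gsub() against every record. In this sketch the source file maps field 1 to field 2 and the destination key sits in field 3; those positions are assumptions, so adjust them to your real layout:

awk -F"|" -v OFS="|" '
    NR == FNR { map[$1] = $2; next }   # first file: build key -> value table
    $3 in map { $3 = map[$3] }         # replace only when the key has a match
    { print }                          # print every row; move this print into
                                       # the block above to drop unmatched rows
' source_file dest_files*.dat

Direct field assignment also keeps awk from re-scanning the whole line with a regular expression for every record, which is usually where the time goes with gsub().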

Hi All,

Thanks for the reply.

I got the issue resolved myself and forgot to update here.

The logic I used is:

awk -F"|"  'BEGIN{ read the source file and store in array} { for each record in dest file search and replace it with source data ( stored in array ) and save it to a file }' source_file dest_files*.dat

Please post the actual code for the solution; the pseudocode for either isn't useful, and it would be nice for this thread to have some point for future readers.
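
For anyone who finds this thread later, here is a rough guess at what that outline might look like as runnable awk. The field positions (source field 1 mapped to field 2, destination key in field 3) and the per-file ".out" naming are assumptions, since the actual code was never posted; note that when the source file is read with getline in BEGIN, it is not also passed in the file list:

awk -F"|" -v OFS="|" -v src="source_file" '
    BEGIN {
        # load the whole source file into an array before any dest file is read
        while ((getline line < src) > 0) {
            split(line, f)             # splits on FS ("|") into f[1], f[2], ...
            map[f[1]] = f[2]
        }
        close(src)
    }
    {
        # replace the (assumed) key field when a match exists, then write the
        # row out to a per-destination-file output
        if ($3 in map) $3 = map[$3]
        print > (FILENAME ".out")
    }
' dest_files*.dat

With a large number of destination files the redirected outputs stay open, so you may need to close() each one as FILENAME changes to avoid hitting the open-file limit.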