Need to optimize the loop to load in parallel

Hi,

I'm trying to load a huge number of records into the database. The load itself succeeded, but it took a long time because the record count is so large.

Here's what I have:

  1. Create a database table (t) with 2 columns - not an issue.

  2. Create a script to load a huge amount of data - here I would like to use parallel execution for a faster load.

perl -e 'for ($i=0; $i<=4294968230; $i++) { printf "%d,%d,\n", $i,$i*10; }' | vsql -p $PORT -c "COPY t FROM STDIN  DELIMITER ',' direct";

The main command that does the load for me is

perl -e 'for ($i=0; $i<=4294968230; $i++) { printf "%d,%d,\n", $i,$i*10; }'

I would like my load to be faster, either by running this in parallel or by using some other technique.

Please suggest the best possible way to optimize it.

Thanks,
Mannu

So, you first have to consider why your data load is slow. Each record being written needs several things to happen before it is actually committed in the database. You will be building a very large stream of data to shovel through a pipe, so this may slow you down a little. There are global server resources to consider too, so I will deal with those next:

Available memory
If there is little free memory, your server might start paging/swapping when you drive your import. This is costly to your elapsed time, because disk access is slow compared to real memory.

Disk access time
Local disks can have various performance issues: contention for the actual media, the controllers, etc. There is also the issue that writes to local disks (not SAN or RAID protected) are likely to be slower, because the OS has to wait for the data to really be written to the physical media. A SAN or RAID controller usually has the benefit of a write cache, so the physical write is done asynchronously and the OS can carry on.

You can also cause contention if you are reading a large file and the updates are being written to the same physical (local) disk or, of course, other processing is running.
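A hedged way to watch these resources during a scaled-down trial run might look like this (tool names vary between UNIX flavours; `df` and `vmstat` are common but not universal, hence the fallback):

```shell
# Check free space on the filesystem taking the writes,
# then sample memory/paging activity while a trial load runs.
df -k .                      # free space where you are working
vmstat 5 3 2>/dev/null \
  || echo "vmstat not available here; try sar, iostat, or top instead"
```

High swap-in/swap-out activity in the vmstat output while the load runs would point at the memory pressure described above.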

--
When considering the database, you have to consider the operations it will have to perform based on how you have it operating.

Appropriate Memory allocations
There are probably several settings for this, but an important one is that there is very likely a fixed/real/pinned amount of memory that can never be swapped out. Setting it too low can give your database performance problems as it wrestles with the limitation; setting it too high can cause memory shortages for everything else on your server and slow the whole OS down (usually exhibited as paging).

If there are no resource issues, the DB will still need to:

  • read/accept the record
  • validate the input matches the target table
  • check for constraints
  • initiate the disk write to the log(s) to allow either transaction rollback or redo/recovery
  • disk write to the table
  • disk write to related indexes
  • disk write to related tables (triggers, etc.)

There's lots for a database to do to create 4,294,968,230 records, so yes, it will take a while. Why so many records anyway?
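To put a rough number on that, the generated CSV stream alone is enormous. A back-of-envelope estimate (the ~22 bytes per row is my guess at the average line length, since rows grow from "0,0," up to "4294968230,42949682300,"):

```shell
# Rough size of the CSV stream pushed through the pipe.
# Needs 64-bit shell arithmetic (bash/ksh on a 64-bit system is fine).
ROWS=4294968230
AVG_BYTES=22                     # assumed average line length, including newline
BYTES=$(( ROWS * AVG_BYTES ))
echo "approx $(( BYTES / 1024 / 1024 / 1024 )) GiB of CSV"
```

That is getting on for 90 GiB of text flowing through one pipe, before the database does any work at all.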

I'm not sure if you can give the database hints on loading data the way you can when parallelising SELECT statements, but splitting your data and forcing in several parallel loads may not make any improvement either.
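If you do want to try it, a sketch of splitting the range into parallel chunks might look like this (it assumes the vsql/COPY command, table t, and $PORT from the original post; the job count of 8 is an arbitrary starting point, not a recommendation):

```shell
# Split 0..TOTAL into JOBS contiguous chunks and show each range.
TOTAL=4294968230
JOBS=8
CHUNK=$(( TOTAL / JOBS + 1 ))
j=0
while [ "$j" -lt "$JOBS" ]; do
  START=$(( j * CHUNK ))
  END=$(( START + CHUNK - 1 ))
  [ "$END" -gt "$TOTAL" ] && END=$TOTAL
  echo "chunk $j: $START..$END"
  # The real load would run each chunk in the background, e.g.:
  # perl -e "for (\$i=$START; \$i<=$END; \$i++) { printf \"%d,%d,\n\", \$i, \$i*10 }" \
  #   | vsql -p "$PORT" -c "COPY t FROM STDIN DELIMITER ',' DIRECT" &
  j=$(( j + 1 ))
done
# wait   # then wait for all background loads to finish
```

Whether this helps depends entirely on where the bottleneck is: if the disk or the transaction log is already saturated, eight writers will simply queue behind one another.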

I hope this gives you something to consider,
Robin

I am going to guess: your code is an example but not what you are really loading.

If it really is what you are loading, then it appears to be a total waste of resources:
you could write a simple database procedure or trigger to take a number and multiply it by 10.

There are database optimizations you can easily make. For example:
does the table have any indexes? If yes, turn them off during the load and rebuild them afterwards.

You will have to check resources during a trial run - rbatte explained that really well.

Since you do not mention which UNIX and which database you are using, we cannot help you measure resource usage.

If you want that help, tell us about your system: OS, shell, and database (Oracle, Sybase, MySQL, etc.).
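The basics can be gathered like this (uname is standard everywhere; the database version command differs per product, so that part is a guess based on the vsql client in the original post):

```shell
uname -a                 # OS, kernel version, architecture
echo "shell: $SHELL"     # your login shell
# Database version - the command depends on the product, e.g.:
# vsql --version         # if it is Vertica, as the vsql client suggests
```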

It is up to you to help us to help you.
