I have big files (some are >300GB!) in which some patterns need to be substituted, for example, runs of spaces changed into a single tab. I used this one-liner:
sed '1,18s/ \{1,\}/\t/g' infile_big.sam > outfile_big.sam
but it seems very slow, as the job is still running after 24 hours! In this example, only the first 18 rows need to be changed, and the rest is left untouched.
Is there any better way to do the job quickly? I'm using GNU bash, version 4.4.12(1)-release (x86_64-pc-linux-gnu) on Linux 4.9.0-4-amd64 #1 SMP Debian 4.9.65-3+deb9u1 (2017-12-23) x86_64 GNU/Linux.
Thanks a lot!
I'm afraid ANY approach will have to copy >300GB, even if only 18 lines are to be modified, which will take its time. Possible solutions might be using editors like ed (search these forums), or applying some dirty tricks like
I forgot to mention: increasing dd's block size to several MB will speed up the copy process dramatically, but don't go too high. And I think the TMP file is not necessary; you can write to the output file directly. So it would read like
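A minimal sketch of that idea, assuming GNU sed and GNU dd (iflag=skip_bytes and status=none need a reasonably recent coreutils). It is a reconstruction under those assumptions, not necessarily the exact commands meant in this post. The file names, the small demo input and the 4 MB block size are illustrative; point $in and $out at the real files.

```shell
# Demo input: 18 space-separated header lines plus 3 body lines.
work=$(mktemp -d)
in=$work/infile.sam
out=$work/outfile.sam
{ for i in $(seq 1 18); do printf '@HD  line   %s\n' "$i"; done
  printf 'body  one\nbody  two\nbody  three\n'; } > "$in"

n=18    # number of leading lines to edit

# 1) Rewrite only the first n lines; "${n}q" makes sed stop reading there.
sed "1,${n}s/ \{1,\}/\t/g; ${n}q" "$in" > "$out"

# 2) Byte offset at which line n+1 starts in the input.
offset=$(head -n "$n" "$in" | wc -c)

# 3) Append the untouched remainder using large blocks.
#    iflag=skip_bytes makes skip= count bytes instead of blocks (GNU dd).
dd if="$in" of="$out" bs=4M iflag=skip_bytes skip="$offset" \
   oflag=append conv=notrunc status=none
```

If computing the byte offset feels fragile, tail -n +$((n + 1)) "$in" >> "$out" appends the same remainder by line number instead. Either way the whole file still has to be read and written once.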
Another thing to consider is the resources you have, so memory, disk devices and contention by other applications. If you run out of memory you may end up swapping/paging real memory to disk which is time consuming to write and to (later on) read back in.
For the disk, is it local disk or an attached SAN? I fear it might be an NFS or Samba mounted share which will be slow because another server is doing the real IO and shovelling it across the network.
If it is not NFS or Samba, the question remains whether it is local disk or SAN. Local simple disks (no RAID controller) will require writes to be committed before returning control to the program. You might find a high %SYS time (the sy column) in something like vmstat 3, ignoring the first line, which shows statistics since boot.
Local disk also may have IO contention for the physical devices.
Local hardware RAID disk or SAN-provided disk LUNs (hopefully fibre attached), on the other hand, should give better performance because they usually come with a large cache, so reads are anticipated and writes go to the cache memory first (and are committed to real disk later), which returns control to the CPU sooner.
Can you tell us more about the resources you have?
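Most of these questions can be answered from the shell. A rough sketch, assuming GNU coreutils and the procps tools (free, vmstat) are installed; the directory argument is a placeholder for wherever the big file lives.

```shell
dir=${1:-.}    # directory holding the big file; "." is only a placeholder

# Filesystem type behind the directory (for example nfs, xfs, ext2/ext3).
fstype=$(stat -f -c %T "$dir")
echo "filesystem type: $fstype"

# Device or remote export that backs it, plus free space.
df -hT "$dir"

# Memory and swap in use, then CPU and I/O activity every 3 seconds.
# The first vmstat line is the average since boot and can be ignored.
if command -v free >/dev/null 2>&1; then free -h; fi
if command -v vmstat >/dev/null 2>&1; then vmstat 3 3; fi
```

Running vmstat in a second terminal while the sed job is active shows whether the time goes into swapping (si/so), system time (sy) or waiting for I/O (wa).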
Thanks Robin & RudiC!
1) The storage disk is an NFS-mounted EMC Isilon, but I am not quite sure about the hardware configuration. My admin told me the network bandwidth is only 1 Gb/s. Probably this is one of the reasons.
2) Tiny bug: when I tried sed '1,18s/,\{1,\}/\t/g; 19q' I got an extra line (line 19). It should be sed '1,18s/,\{1,\}/\t/g; 18q' .
Thanks a lot again!
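The off-by-one in point 2 is easy to check on a small file: q prints the current line before quitting, so Nq emits exactly N lines. A quick sketch, using a throwaway 20-line file as a stand-in for the real data:

```shell
# 20 numbered lines stand in for the real file.
demo=$(mktemp)
seq 1 20 > "$demo"

# q prints the current line and then quits, so Nq outputs N lines.
with_19q=$(sed '19q' "$demo" | wc -l)
with_18q=$(sed '18q' "$demo" | wc -l)
echo "19q -> $with_19q lines, 18q -> $with_18q lines"

rm -f "$demo"
```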
To explain the exact problem: You can't insert data into the middle of a file, just overwrite it from that point forwards. So you can't edit without completely overwriting everything after that point, unless your data is made of fixed-size records.
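The flip side is that a change which keeps the byte count identical can be written over the start of the file without touching the rest. A sketch of that special case, assuming every separator in the header is exactly one space, so that a space-to-tab translation preserves the length. Note that this is not the same as the sed command above, which collapses runs of spaces and therefore shortens the lines. The file name, its contents and the header length n are made up for the demo.

```shell
work=$(mktemp -d)
f=$work/data.sam
{ printf '@SQ SN:chr1 LN:1000\n@PG ID:demo PN:demo\n'
  printf 'read1 keeps its spaces\n'; } > "$f"

n=2    # header lines to patch
size_before=$(wc -c < "$f")

# Translate the header into a small side file; tr maps byte to byte,
# so the header keeps exactly the same length.
head -n "$n" "$f" | tr ' ' '\t' > "$work/header"

# Overwrite the start of the file in place; conv=notrunc stops dd from
# truncating everything after the bytes it writes.
dd if="$work/header" of="$f" conv=notrunc status=none

size_after=$(wc -c < "$f")
echo "size before: $size_before, after: $size_after"
```

Only the few header bytes are written, so the cost no longer depends on the file size. Since this edits the original file directly, it is worth trying on a copy of the first few lines before running it against a 300GB file.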