How to quickly substitute a pattern within a certain range of a huge file?

I have big files (some are >300 GB!) that need certain patterns substituted, for example changing runs of multiple spaces into a tab. I used this one-liner:

sed '1,18s/ \{1,\}/\t/g' infile_big.sam > outfile_big.sam

but it seems very slow: the job is still running after 24 hours! In this example only the first 18 lines need to be changed; the rest should stay untouched.
Is there any better way to do the job quickly? I'm using GNU bash, version 4.4.12(1)-release (x86_64-pc-linux-gnu) on Linux 4.9.0-4-amd64 #1 SMP Debian 4.9.65-3+deb9u1 (2017-12-23) x86_64 GNU/Linux.
Thanks a lot!

I'm afraid ANY approach will have to copy the full >300 GB, even if only 18 lines are to be modified, and that will take its time. Possible solutions might be editors like ed (search these forums), or dirty tricks like

sed '1,18s/ \{1,\}/\t/g; 19q' infile_big.sam > outfile_big.sam
dd if=infile_big.sam of=outfile_big.sam skip=$(stat -c%s outfile_big.sam) iflag=skip_bytes oflag=append conv=notrunc
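
The idea: sed quits right after writing the edited header, so it reads almost nothing, and dd then appends the untouched remainder of the input starting at the byte offset where the sed output ends. Note this silently assumes the substitution does not change the byte length of the edited lines; otherwise the computed skip no longer lands at the start of the remainder. A quick sanity check on a toy file (seq, cmp, and the test file names are only for the demonstration; the substitution here is deliberately length-preserving):

seq 100 > test_in.txt                                # toy input, 100 lines
sed '1,18s/0/X/g; 19q' test_in.txt > test_out.txt    # length-preserving edit
dd if=test_in.txt of=test_out.txt skip=$(stat -c%s test_out.txt) iflag=skip_bytes oflag=append conv=notrunc
cmp <(sed '1,18s/0/X/g' test_in.txt) test_out.txt && echo OK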

Still, don't expect too much ...

Thanks!

This is one of the things I wanted to confirm. I tried vim, but it was a pain to open and close a >300 GB file.

I forgot to mention: increasing dd's block size to several MB will speed up the copy process dramatically, but don't go too high. And I think the TMP file is not necessary; you can use the output file immediately. So it would read like

sed '1,18s/ \{1,\}/\t/g; 19q' infile_big.sam > outfile_big.sam
dd if=infile_big.sam of=outfile_big.sam skip=$(stat -c%s outfile_big.sam) bs=2M iflag=skip_bytes oflag=append conv=notrunc
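
If you want to find a reasonable block size empirically before committing to the big copy, a rough read-throughput test along these lines may help (the loop and the numfmt usage are just a sketch; each run reads about 1 GiB, and the page cache will flatter repeated runs):

for bs in 64K 1M 2M 8M; do
    printf 'bs=%s: ' "$bs"
    dd if=infile_big.sam of=/dev/null bs="$bs" count=$(( (1<<30) / $(numfmt --from=iec "$bs") )) 2>&1 | tail -n 1
done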

Another thing to consider is the resources you have: memory, disk devices, and contention from other applications. If you run out of memory you may end up swapping/paging real memory to disk, which is time-consuming to write and (later on) to read back in.

For the disk, is it a local disk or an attached SAN? I fear it might be an NFS- or Samba-mounted share, which will be slow because another server is doing the real IO and shovelling it across the network.

If it is not NFS or Samba, the question remains whether it is local disk or SAN. Simple local disks (no RAID controller) require writes to be committed before control returns to the program. You might see a high %SYS time in something like vmstat 3 (ignore the first line, which shows statistics since boot).
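
For example (vmstat is part of the standard procps tools; the column hints are the usual ones):

vmstat 3 5    # five samples, three seconds apart; ignore the first line
# si/so : swapping in/out        -> memory pressure
# wa    : CPU time waiting on IO -> disk bottleneck
# sy    : kernel/system time     -> the high %SYS mentioned above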

Local disks may also suffer IO contention on the physical devices.

Local hardware RAID or SAN-provided disk LUNs (hopefully fibre-attached), on the other hand, should give better performance because they usually come with a large cache: IO reads are anticipated, and writes go to disk-cache memory (and are committed to real disk later), so control returns to the CPU sooner.

Can you tell us more about the resources you have?

Kind regards,
Robin


In the end, you know perl has it :)
https://perldoc.perl.org/Tie/File.html
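
A minimal sketch of what that could look like, run from the shell (the pattern and the 18-line range are taken from the thread; error handling is kept to a minimum):

# CAUTION: this edits infile_big.sam in place - try it on a copy first
perl -MTie::File -e '
    tie my @lines, "Tie::File", shift or die "tie: $!";
    s/ +/\t/g for @lines[0 .. 17];    # edit only the first 18 lines
    untie @lines;                     # flush pending writes
' infile_big.sam

Be warned, though: since the substitution shortens those lines, Tie::File still has to rewrite everything after them, so it will not dodge the 300 GB copy either.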

If you give it a shot, be sure to come back with results on the 300 GB file.

Regards
Peasant.

Thanks Robin & RudiC!
1) The storage is an NFS-mounted EMC Isilon, but I am not quite sure of the hardware configuration. My admin told me the network is only 1 Gb, which is probably one of the reasons.
2) Tiny bug: when I tried sed '1,18s/ \{1,\}/\t/g; 19q' I got an extra line (line 19); it should be sed '1,18s/ \{1,\}/\t/g; 18q' .
Thanks a lot again!

To explain the exact problem: You can't insert data into the middle of a file, just overwrite it from that point forwards. So you can't edit without completely overwriting everything after that point, unless your data is made of fixed-size records.
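
That also points at the one shortcut that can work here: if the edited header happens to come out byte-for-byte the same length as the original, you can overwrite just those bytes in place and leave the remaining ~300 GB alone. A sketch, assuming a hypothetical header.new that holds the edited first 18 lines (this patches infile_big.sam directly, so test on a copy):

orig=$(head -n 18 infile_big.sam | wc -c)   # byte length of the original header
new=$(stat -c%s header.new)
if [ "$new" -eq "$orig" ]; then
    # conv=notrunc overwrites the leading bytes without truncating the rest
    dd if=header.new of=infile_big.sam conv=notrunc
else
    echo "header length changed ($orig -> $new bytes): full rewrite needed" >&2
fi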


So NFS is almost bound to be slow. Could you run the update on the server that exports the NFS share to you? That should be a fair bit quicker.

Can you share the other resources (and competing load) on the server?

Robin