Optimised way to search & replace a value on one line in a very huge file (file size is 24 GB).

Hi Experts,

I had to edit a particular value in the header line of a very huge file, so I wanted to search & replace that value in a file which was 24 GB in size. I managed to do it, but it took a long time to complete. Can anyone please tell me how we can do it in an optimised way?

Thanks in advance.
Manish

Steps which I followed:

  1. head -1 original_file > temp
  2. sed -n '2,$p' original_file >> temp
  3. mv temp original_file

AFAIK, in the general case, no. But if your new first line has the same number of bytes (or is shorter and you can pad it with spaces), then I believe you can do it quickly with low-level programming: open the file for read/write, read some bytes (512, for example) into a buffer, change them, rewind, and write the buffer back.
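
For example, a minimal sketch of that idea in Perl (original_file, OLDVALUE and NEWVALUE are placeholders here; the two values must be exactly the same length, and the value must sit within the first 512 bytes):

perl -e '
    open F, "+<", "original_file" or die $!;  # open for reading AND writing, no truncation
    read F, $buf, 512;                        # read the first 512 bytes into a buffer
    $buf =~ s/OLDVALUE/NEWVALUE/;             # same-length substitution only!
    seek F, 0, 0;                             # rewind to the start of the file
    print F $buf;                             # write the buffer back in place
'

If the replacement were a different length, everything after it inside the buffer would shift and corrupt the file, which is why the lengths have to match.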

Thanks for your suggestion, Yazu.

Just now I got the below command from my friend, which works well and takes 11 minutes to process such a huge file.

Can someone please tell me if it can be optimised further?

perl -i -pe 's/OLD/New/ if $. == 1' original_file

There is no fundamental operation for inserting or deleting data in the middle of a file; you have to rewrite everything from the edit point to the end. For an edit to the first line, that means rewriting the entire file.

A 24-gigabyte file in 11 minutes works out to about 37 megabytes per second (24 × 1024 MB ÷ 660 s ≈ 37 MB/s), which is actually a pretty impressive transfer rate! You've probably maxed out your disk or bus speed already, so changing the program won't help significantly. It might help to write the output to a different disk than the one you're reading from.
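
For example, a sketch of that last idea (the /disk1 and /disk2 mount points are hypothetical), reusing the one-liner from above but without -i:

perl -pe 's/OLD/New/ if $. == 1' /disk1/original_file > /disk2/original_file.new
mv /disk2/original_file.new /disk1/original_file

Note that the mv back across filesystems is itself a full copy, so this doubles the total I/O; the gain is that each pass reads from one disk while writing to the other.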

If you could use yazu's suggestion of always keeping the string the same length, so the data afterwards doesn't need to be rewritten, that would let the edit happen in a fraction of a second...

Thanks Corona688...!!

While doing this we are replacing exactly 8 characters, like 20110901 to 20110902. And we were monitoring the performance of the server, which was very good; it didn't swap out on memory. Still it took so much time, right? I think 11 minutes on Linux is still a lot. Please correct me if I am wrong.

How quickly did awk/nawk do it with sub/gsub? Just curious.
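
For reference, such a test would look something like this (a sketch; awk has no in-place editing, so like the perl -i version it rewrites the whole file):

awk 'NR == 1 { sub(/OLD/, "New") } { print }' original_file > temp && mv temp original_file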

Could you show us the first few lines of the file, and the data you wish replaced? If the data is always the same length and always in the same place, you can use dd to write it in...

---------- Post updated at 11:42 AM ---------- Previous update was at 11:37 AM ----------

An example:

$ cat textdata
This is line 1
This is line 2
This is the data I want replaced >>11111111<<
This is another line
etc etc until end of file.
$ printf "%s" 22222222 | dd conv=notrunc of=textdata seek=65 bs=1
$ cat textdata
This is line 1
This is line 2
This is the data I want replaced >>22222222<<
This is another line
etc etc until end of file.

The 'bs=1' tells dd to work on a block size of 1 byte, which lets us seek exactly 65 bytes into the file with seek=65. The conv=notrunc is important: it tells dd not to truncate the file, but to just overwrite the data that's already there.
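
If you want to double-check the bytes at that offset before or after the overwrite, dd can read the same region back (the 2>/dev/null just hides dd's transfer statistics):

$ dd if=textdata bs=1 skip=65 count=8 2>/dev/null
22222222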

---------- Post updated at 12:06 PM ---------- Previous update was at 11:42 AM ----------

Another method needing BASH 3.0 or newer:

#!/bin/bash

exec 5<hugedata    # FD 5: read-only, used to scan for the target line
exec 6<>hugedata   # FD 6: read/write on the same file, used to overwrite in place

# Read lines one at a time from both file descriptors.
# When we find the line we want in FD 5, FD 6 will still be at the
# previous line, allowing us to overwrite the line with it.
# IFS= and -r stop read from stripping whitespace or backslashes,
# which would otherwise change the line length
while IFS= read -r -u 5 LINE
do
        # Match strings like >>12345678<< anywhere in the line
        # save it in BASH_REMATCH in three segments:  ...>>, 11111111, <<...
        if [[ $LINE =~ ^(.*\>\>)([0-9]+)(\<\<.*)$ ]]
        then
                NEWLINE="${BASH_REMATCH[1]}22222222${BASH_REMATCH[3]}"

                if [ "${#NEWLINE}" -ne "${#LINE}" ]
                then
                        echo "Error, new line would be different length"
                        exit 1
                fi

                # Overwrite the line with a line of same length
                echo "${NEWLINE}" >&6
                exec 6>&-       # close the read/write FD, flushing the write
                exec 5<&-       # close the read-only FD

                echo "Found and replaced ${BASH_REMATCH[2]} with 22222222" >&2
                exit 0
        else
                IFS= read -r -u 6 LINE  # No match: advance FD 6 so it stays at the line FD 5 just read
        fi
done <&5

echo "Warning, didn't find any data to replace" >&2
exit 1
$ cat hugedata
This is line 1
This is line 2
This is the data I want replaced >>11111111<<
This is another line
etc etc until end of file.
$ ./datarep2.sh
Found and replaced 11111111 with 22222222
$ cat hugedata
This is line 1
This is line 2
This is the data I want replaced >>22222222<<
This is another line
etc etc until end of file.
$

Both methods are able to edit early lines in the file, as long as their length doesn't change, without having to read or write any of the data after them.

The dd version would be more reliable and portable if you always know where the data to replace is.

---------- Post updated at 12:27 PM ---------- Previous update was at 12:06 PM ----------

Another thing you could do is just keep the header permanently separate from the huge file. When you need to feed it into something, use sed or awk or whatever to get the modified header, and cat out the rest of the file. (One of the rare useful uses of cat.)

( sed 's/orig/replacement/' < header ; cat restoffile ) | programusinghugefile

As yazu has pointed out, if you are replacing the same number of bytes (or you can pad), you can use lower-level programming, or do it with dd:

echo NEW | dd of=your-file bs=1 seek=offset-of-OLD count=3 conv=notrunc

You may need to adjust the options to dd if you are not using GNU dd. Be careful with this command: the original file is changed directly.
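
Putting that together for the exact case in this thread, a sketch (-b and -o are GNU grep options, and this assumes 20110901 appears exactly once in the header line):

# grep -b prints the byte offset of each match and -o prints the match alone;
# since the value is on the first line, its offset there is its offset in the file
OFFSET=$(head -1 original_file | grep -bo '20110901' | cut -d: -f1)

# Overwrite exactly those 8 bytes in place; printf '%s' adds no trailing newline
printf '%s' 20110902 | dd of=original_file bs=1 seek="$OFFSET" count=8 conv=notrunc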