I have a text file: oldfile.txt
It contains a lot of text that I want to run multiple find-and-replace operations on.
I have a control file: controlfile.txt
In the control file is a two-column list of the "find" and the "replace" strings,
i.e.
This That
Yay Nay
Ying Yang
etc.
I would like to run this control file against oldfile.txt so that all the substitutions are made within it. I could use sed -f and run it as 10,000 sed commands (though I don't know whether it will die on me or not).
I'm hoping, however, that there's a program to do this a bit more elegantly.
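One low-tech route is to generate the sed script from the control file itself. A sketch with toy data standing in for the real files (and assuming the find/replace strings contain no whitespace or sed metacharacters such as `/`, `.`, or `&`):

```shell
# Toy stand-ins for the real controlfile.txt and oldfile.txt
printf 'This That\nYay Nay\nYing Yang\n' > controlfile.txt
printf 'This is Yay for Ying\n' > oldfile.txt

# Turn each "find replace" pair into an s/find/replace/g command
awk '{ printf "s/%s/%s/g\n", $1, $2 }' controlfile.txt > replace.sed

# One sed invocation applies every substitution in a single pass
sed -f replace.sed oldfile.txt > newfile.txt
```

With the toy data above, newfile.txt comes out as "That is Nay for Yang". If the find strings can contain sed-special characters, they would need escaping in the awk step.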
Setting up a sed command file is relatively painless, so why not just run a sample of, say, 100 commands and test it against the xml file?
See how long it runs, and multiply that by 100 to estimate 10,000 entries. If the answer seems doable, then be lazy.
This procedure will read the xml file once and pass each line against the 10,000 table entries.
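The sizing test above could be sketched like this, with 100 generated dummy commands standing in for the first 100 real control entries:

```shell
# 100 dummy s/find/replace/ commands as a stand-in for the real table
seq 1 100 | awk '{ printf "s/word%d/term%d/g\n", $1, $1 }' > sample.sed

# A stand-in for the big file; in practice this is the real 100 MB xml
printf 'word1 word50 word99\n' > oldfile.txt

# Time one pass, then multiply by 100 to estimate the 10,000-entry run
time sed -f sample.sed oldfile.txt > /dev/null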
The alternative is to pass the xml file against the table first, using grep to see whether the xml file even contains each find string, and reducing the table to only the entries actually found.
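That grep pre-filter could look like the following sketch (toy data; assumes one-word fixed-string finds and a grep that supports -o, as GNU grep does):

```shell
# Toy stand-ins for the real control file and xml file
printf 'This That\nYay Nay\nUnused Gone\n' > controlfile.txt
printf 'This is Yay\n' > oldfile.txt

# Extract the "find" column, then list which strings really occur
cut -d' ' -f1 controlfile.txt > finds.txt
grep -F -o -f finds.txt oldfile.txt | sort -u > present.txt

# Keep only the control entries whose find string was actually seen
awk 'NR==FNR { seen[$0]; next } $1 in seen' present.txt controlfile.txt > reduced.txt
```

Here reduced.txt keeps the "This" and "Yay" entries and drops "Unused", so the eventual sed run only carries patterns that can match.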
The file is actually fairly big, almost 100 MB, so I'm just concerned that running 10,000+ find-and-replaces over a 100 MB file will be a formula for crashing. But I'll give it a try.
Are you replacing the tags or the data?
If you are changing the tags, it might be simpler to parse the xml file, create a database, then re-create the xml file using the replacement tags.
Replacing the data is somewhat trickier if the replacement word is a portion of the data field.
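One guard against those partial-word hits, if GNU sed is available, is to anchor the pattern on word boundaries; a sketch with hypothetical strings:

```shell
# "Thistle" contains "This", but \b word boundaries keep it intact
printf 'Thistle This\n' > sample.txt
sed 's/\bThis\b/That/g' sample.txt
```

This prints "Thistle That": the bare word is replaced, the longer word containing it is left alone. (\b is a GNU sed extension; BSD sed spells the boundaries [[:<:]] and [[:>:]].)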
Some assumptions:
If the 100 MB file contains 1 to 2 million records, and the control file is 10,000 lines at 13 characters per line, a simple sed routine will process about 130 GB of pattern data (13 characters x 10,000 entries x 1 million records). Whether this is all disk I/O or memory will depend upon how well sed uses memory.
If there is only one data field per line, create two temporary data files, one containing the tags and the other the data. Add line numbers to both files.
Sort the data file into data-value sequence, and sort the control file on its "find" field. Then write a merge program to read the sorted data-only file, replace each matching field, and write a new temporary file with the new data (keeping the original line numbers).
Sort the new temporary data file back into line-number sequence, and merge it with the temporary tag file to produce a new xml file.
The total data processed this way should be less than 1 GB.
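The merge step in that plan could lean on sort and join rather than a custom program. A toy sketch, assuming the data has already been split into a "lineno value" file and that "-" never occurs as a replacement value:

```shell
# Toy stand-ins: numbered data values, and the find/replace table
printf '1 Yay\n2 Keep\n3 This\n' > data.txt
printf 'This That\nYay Nay\n' > control.txt

# Both sides must be sorted on the join field (the data value)
sort -k2,2 data.txt > data.sorted
sort -k1,1 control.txt > control.sorted

# Matched lines take the replacement; -a 1 keeps unmatched lines,
# with -e filling the missing replacement with "-"
join -1 2 -2 1 -a 1 -e '-' -o 1.1,2.2,1.2 data.sorted control.sorted |
awk '{ print $1, ($2 == "-" ? $3 : $2) }' |
sort -n > data.new
```

data.new comes back in line-number order ("1 Nay", "2 Keep", "3 That"), ready to be merged with the tag file. Since sort works externally on disk, this keeps memory flat even for the million-record case.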