I have a text file: oldfile.txt
It contains a lot of text that I want to run multiple find-and-replace operations on.
I have a control file: controlfile.txt
In the control file is a two-column list of the "find" and the "replace" strings,
i.e.
This That
Yay Nay
Ying Yang
etc.
I would like to run this control file against oldfile.txt so that all the substitutions are made within it. I could use sed -f and run it as 10,000 sed commands (though I don't know whether it will die on me or not).
I'm hoping, however, that there's a program to do this a bit more elegantly.
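One low-tech route is to generate the sed script from the control file itself. A sketch with toy data standing in for the real files (and assuming the find/replace strings contain no whitespace or sed metacharacters such as `/`, `.`, or `&`):

```shell
# Toy stand-ins for the real controlfile.txt and oldfile.txt
printf 'This That\nYay Nay\nYing Yang\n' > controlfile.txt
printf 'This is Yay for Ying\n' > oldfile.txt

# Turn each "find replace" pair into an s/find/replace/g command
awk '{ printf "s/%s/%s/g\n", $1, $2 }' controlfile.txt > replace.sed

# One sed invocation applies every substitution in a single pass
sed -f replace.sed oldfile.txt > newfile.txt
```

With the toy data above, newfile.txt comes out as "That is Nay for Yang". If the find strings can contain sed-special characters, they would need escaping in the awk step.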
Setting up a sed command file is relatively painless, so why not just run a sample of, say, 100 commands and test it against the xml file?
See how long it runs, and multiply that by 100 to estimate 10,000 entries. If the answer seems doable, then be lazy.
This procedure will read the xml file once and pass each line against the 10,000 table entries.
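The sizing test above could be sketched like this, with 100 generated dummy commands standing in for the first 100 real control entries:

```shell
# 100 dummy s/find/replace/ commands as a stand-in for the real table
seq 1 100 | awk '{ printf "s/word%d/term%d/g\n", $1, $1 }' > sample.sed

# A stand-in for the big file; in practice this is the real 100 MB xml
printf 'word1 word50 word99\n' > oldfile.txt

# Time one pass, then multiply by 100 to estimate the 10,000-entry run
time sed -f sample.sed oldfile.txt > /dev/null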
The alternative is to pass the xml file against the table first, using grep to see whether the xml file even contains each find string, and reducing the table to only the entries actually found.
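That grep pre-filter could look like the following sketch (toy data; assumes one-word fixed-string finds and a grep that supports -o, as GNU grep does):

```shell
# Toy stand-ins for the real control file and xml file
printf 'This That\nYay Nay\nUnused Gone\n' > controlfile.txt
printf 'This is Yay\n' > oldfile.txt

# Extract the "find" column, then list which strings really occur
cut -d' ' -f1 controlfile.txt > finds.txt
grep -F -o -f finds.txt oldfile.txt | sort -u > present.txt

# Keep only the control entries whose find string was actually seen
awk 'NR==FNR { seen[$0]; next } $1 in seen' present.txt controlfile.txt > reduced.txt
```

Here reduced.txt keeps the "This" and "Yay" entries and drops "Unused", so the eventual sed run only carries patterns that can match.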
The file is actually fairly big, almost 100 MB, so I'm just concerned that running 10,000+ find-and-replaces over a 100 MB file will be a formula for crashing. But I'll give it a try.
Are you replacing the tags or the data?
If you are changing the tags, it might be simpler to parse the xml file, create a database, then re-create the xml file using the replacement tags.
Replacing the data is somewhat trickier if the replacement word is a portion of the data field.
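One guard against those partial-word hits, if GNU sed is available, is to anchor the pattern on word boundaries; a sketch with hypothetical strings:

```shell
# "Thistle" contains "This", but \b word boundaries keep it intact
printf 'Thistle This\n' > sample.txt
sed 's/\bThis\b/That/g' sample.txt
```

This prints "Thistle That": the bare word is replaced, the longer word containing it is left alone. (\b is a GNU sed extension; BSD sed spells the boundaries [[:<:]] and [[:>:]].)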
Some assumptions:
If the 100 MB file contains 1 to 2 million records, and the control file is 10,000 lines at 13 characters per line, a simple sed routine will process about 130 GB of pattern data (13 characters x 10,000 entries x 1 million records). Whether this is all disk I/O or memory will depend upon how well sed uses memory.
If there is only one data field per line, create two temporary data files, one containing the tags and the other the data. Add line numbers to both files.
Sort the data file into data-value sequence, and sort the control file on its "find" field. Then write a merge program to read the sorted data-only file, replace each matching field, and write a new temporary file with the new data (keeping the original line numbers).
Sort the new temporary data file back into line-number sequence, and merge it with the temporary tag file to produce a new xml file.
The total data processed this way should be less than 1 GB.
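The merge step in that plan could lean on sort and join rather than a custom program. A toy sketch, assuming the data has already been split into a "lineno value" file and that "-" never occurs as a replacement value:

```shell
# Toy stand-ins: numbered data values, and the find/replace table
printf '1 Yay\n2 Keep\n3 This\n' > data.txt
printf 'This That\nYay Nay\n' > control.txt

# Both sides must be sorted on the join field (the data value)
sort -k2,2 data.txt > data.sorted
sort -k1,1 control.txt > control.sorted

# Matched lines take the replacement; -a 1 keeps unmatched lines,
# with -e filling the missing replacement with "-"
join -1 2 -2 1 -a 1 -e '-' -o 1.1,2.2,1.2 data.sorted control.sorted |
awk '{ print $1, ($2 == "-" ? $3 : $2) }' |
sort -n > data.new
```

data.new comes back in line-number order ("1 Nay", "2 Keep", "3 That"), ready to be merged with the tag file. Since sort works externally on disk, this keeps memory flat even for the million-record case.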