deleting multiple records from a huge file at one time

I have a very big file, about 5 GB in size, with roughly 50 million records in it. I have to delete records based on record numbers that I know from outside, without opening the file. The record numbers are quite random, like 5000678, 7890005, etc.

Can somebody let me know how I can remove records based on the record number all at one time, rather than one at a time, please?

What do you mean by "without opening the file"? You cannot delete records without opening the file.

Is this file a structured file or a database file? Are the records in the file sorted or random? You seem to indicate that the records are random, but I am unclear whether you are referring to the file or to the list of records to be deleted.

You need to provide more precise information if you want somebody to help you.

The reason I said I want to delete without opening the file is that the file is too large to open. It is a regular ASCII file with data in it. I just need to delete some records from it in one pass, without having to run the deletion once for every record I want to remove.

Hi.

See post #6 in http://www.unix.com/shell-programming-scripting/48089-pass-variable-sed-p-loop.html#post302154933 -- I think you should be able to adapt that procedure to create a sed script that will delete specific lines in a single pass over the file. That post used sed to print ("p") lines, but a delete ("d") is a similar operation. You would also need to omit the "-n" option on the final execution of sed.

It still will not be cheap -- every program that processes a file will "open" the file in the sense that it tells the system it will be dealing with that file's content. The program will need to read every line in order to write a new copy minus the lines you delete. Afterwards, you can rename the new file to the old name.
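For example, here is a rough sketch of what that might look like, assuming (hypothetically) that the line numbers to delete are in a file called delete.list, one number per line, and that the data file is called bigfile.txt:

    # Turn each line number into a sed delete command, e.g. "5000678" -> "5000678d"
    sed 's/$/d/' delete.list > del.sed

    # One pass over the big file; sed prints every line except the deleted ones
    sed -f del.sed bigfile.txt > bigfile.new

    # Replace the original once you are happy with the result
    mv bigfile.new bigfile.txt

If the list of numbers is very large, a sed script with that many commands can get slow; loading the numbers into an awk array (as suggested later in this thread) is an alternative single-pass approach.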

I suggest you try the procedure on small sample files first ... cheers, drl

Dsarvan,
Split and process -- we routinely push files of a few GB through awk this way:
1. Split the file into pieces based on an approximate number of lines.
2. Process the pieces in parallel (if your server has decent RAM and CPUs).
Remember not to reuse the same name if you use any temporary files; one way is to add a random number or the process id to the name. A sketch follows below.
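Here is a rough sketch of that idea. It assumes (hypothetically) the big file is bigfile.txt, the line numbers to delete are in delete.list, GNU split is available (for the -d option), and a chunk size of 10 million lines; adjust names and sizes to your setup:

    #!/bin/sh
    chunk=10000000                            # lines per piece (example value)

    split -l "$chunk" -d bigfile.txt chunk.   # 1. split into chunk.00, chunk.01, ...

    i=0
    for f in chunk.*                          # 2. process the pieces in parallel
    do
        off=$((i * chunk))                    # line-number offset of this piece
        # Each worker skips the lines whose global line number appears in
        # delete.list, writing to a temp file tagged with the process id ($$)
        # so the names never clash.
        awk -v off="$off" 'NR==FNR { del[$1]; next }
                           !((FNR + off) in del)' delete.list "$f" \
            > "tmp.$$.$(printf '%03d' "$i")" &
        i=$((i + 1))
    done
    wait                                      # wait for all workers to finish

    cat tmp.$$.* > bigfile.new                # reassemble the pieces in order
    rm -f chunk.* tmp.$$.*

Keep in mind the job is mostly disk I/O, so the parallel version only pays off if the storage can keep several readers busy; otherwise a single awk pass over the whole file with the same array lookup (and no offset) is just as effective.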

Thank you very much, drl. The post you gave me helped me.