Kindly, I need your help. I made a script to print specific lines from a huge file of about 3 million lines. The output of the script will be about 700,000 lines. The problem is that the script is too slow: it kept working for 5 days and the output was only 200,000 lines!
The script is very simple:
for i in `cat file`; do    # "file" contains the line numbers to be printed
sed "${i}q;d" file1 >> file2    # file1 is the huge file (3 million lines); file2 is the output file (about 700,000 lines)
done
So please, could anyone tell me how I can decrease the processing time of that script, and why it is taking so long?
From my experience, processing a pile of small(er) files takes much less time than working with a single "fatty". Thus, you may want to split up both source files into chunks, then process them (possibly in parallel) and finally concatenate the results.
I tried not to use cat in the for loop and used head -<line no.> | tail -1 instead. It also worked but is still too slow. I can't print a specific range because the lines are not known in advance: I run a script to get the line numbers, which differ from file to file.
I really don't know what the problem with it is!?
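For reference, a minimal sketch of the head/tail variant described above. The temporary directory and the four sample records are made up for illustration and stand in for the real "file" and "file1":

```shell
# Hypothetical demo data standing in for the real files
tmp=$(mktemp -d)
printf 'alpha\nbeta\ngamma\ndelta\n' > "$tmp/file1"   # the "huge" file
printf '3\n1\n' > "$tmp/file"                         # wanted line numbers

: > "$tmp/file2"                                      # start with an empty output file
while read -r n; do
    # head reads the first n lines of file1, tail keeps only the last of them
    head -n "$n" "$tmp/file1" | tail -n 1 >> "$tmp/file2"
done < "$tmp/file"

cat "$tmp/file2"
```

Like the sed version, this starts reading file1 from the top again for every requested line number, so replacing sed with head and tail does not reduce the amount of reading.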
---------- Post updated at 09:11 PM ---------- Previous update was at 09:09 PM ----------
Dr House, how many lines per file do you think would be good when splitting the fatty file?
As this is very machine-dependent, I'd start with a ten-percent split, process one (!) file, time this, then either process the remaining nine splits - or split, process and time the "guinea pig" again (thus going down to e.g. five-percent splits). You get the idea ...
This statement is ridiculous and has no basis in fact whatsoever.
Back on topic.
My impression is that the O/P is reading "file" and searching through "file1" once for every line in "file" to produce the output in "file2".
It would appear that "file" contains 700,000 line numbers and that "file1" contains 3,000,000 records.
Therefore the number of reads is:
700,000 times 3,000,000 = 2,100,000,000,000
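The figure can be checked with shell arithmetic. This is only a sketch: the variable names are made up, and it assumes a shell with 64-bit integer arithmetic.

```shell
wanted=700000     # line numbers listed in "file" (figure from this thread)
records=3000000   # records in "file1" (figure from this thread)

# One full scan of file1 per wanted line number
worst_case=$(( wanted * records ))
echo "$worst_case"
```

Strictly speaking this is an upper bound: sed 'Nq;d' quits once it reaches line N, so each lookup reads N records rather than the whole file. The total still grows as the number of lookups multiplied by the size of the file.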
We are clearly on a powerful computer or it would have got nowhere in five days.
To my mind the issue is how to do ONE PASS through "file1" and select the record numbers contained in "file".
We need the following facts from the O/P.
1) Is "file" in numerical order? Is each record unique? Are there leading zeros in the record numbers? Is there a delimiter?
2) Does the record layout of "file1" include the record number? If so, where exactly in the record? Is there a delimiter?
3) Is there a Database and database language available which would make this task easier?
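A minimal sketch of the one-pass idea, under assumptions the O/P has not yet confirmed: "file" holds one line number per line, and "record number" simply means the position of the line in "file1". The sample data and the temporary directory are made up for illustration.

```shell
# Hypothetical demo data standing in for the real files
tmp=$(mktemp -d)
printf 'alpha\nbeta\ngamma\ndelta\n' > "$tmp/file1"   # the "huge" file
printf '3\n1\n3\n' > "$tmp/file"                      # wanted line numbers, unsorted, with a duplicate

# While reading the first file (NR == FNR), remember each wanted line number.
# $1 + 0 forces a numeric key, so a value such as 007 would match line 7.
# While reading the second file, print the lines whose number was remembered.
awk 'NR == FNR { want[$1 + 0]; next } FNR in want' "$tmp/file" "$tmp/file1" > "$tmp/file2"

cat "$tmp/file2"
```

Each file is read exactly once. Note that the output comes out in "file1" order rather than in the order of "file", and a line number listed twice is printed only once. If the original order or the duplicates matter, this sketch would need changing.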
Thanks a lot for your replies.
Here are the answers to your questions:
1) Is "file" in numerical order? Is each record unique? NO. Are there leading zeros in the record numbers? Is there a delimiter? NO.
2) Does the record layout of "file1" include the record number? YES. If so, where exactly in the record? Is there a delimiter? They are in one column; you can consider Enter (the newline) to be the delimiter.
3) Is there a Database and database language available which would make this task easier? No, I'm just trying to reformat the data for a specific application.
---------- Post updated at 12:59 AM ---------- Previous update was at 12:58 AM ----------
I will try it and report back.
Thanks a lot.
---------- Post updated at 01:01 AM ---------- Previous update was at 12:59 AM ----------
It is just lines,
and sed is used to print the lines whose numbers are saved in a file.
As suggested in post #6, can we see a sample portion of "file" and "file1", making it clear which field is the record number?
Please confirm that "file" can contain duplicate record numbers. If so, this is one thing that needs cleaning up first.
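If duplicates are confirmed, one possible cleanup is sketched below. It assumes "file" holds one line number per line; the sample numbers, the temporary directory and the name file.sorted are made up for illustration.

```shell
# Hypothetical demo data standing in for the real "file"
tmp=$(mktemp -d)
printf '30\n7\n30\n120\n7\n' > "$tmp/file"

# Count how many line numbers occur more than once
dups=$(sort -n "$tmp/file" | uniq -d | wc -l)
echo "duplicated line numbers: $dups"

# Sort numerically and keep a single copy of each line number
sort -n -u "$tmp/file" > "$tmp/file.sorted"
cat "$tmp/file.sorted"
```

The cleaned list is also in numerical order, which answers the first part of question 1 at the same time.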