A problem with large files

Hello all,

Kindly, I need your help. I wrote a script to print specific lines from a huge file of about 3 million lines; the output of the script will be about 700,000 lines. The problem is that the script is too slow: it has been running for 5 days and the output so far is only 200,000 lines!

The script is very simple:

for i in `cat file`                 # "file" contains the line numbers to be printed from the huge file
do
    sed "${i}q;d" file1 >> file2    # file1 is the huge file (3 million lines); file2 is the output, which will be about 700,000 lines
done

So please, could anyone tell me how I can decrease the processing time of this script, and why it is taking all that time?

Thanks in advance.

Do you know the from and to line numbers of the part of the file which you want to print?

Because cat file will be really slow if the file has 3 million records; please avoid that.

From my experience, running a pile of smaller files takes much less time than working with a single "fatty" :wink: Thus, you may want to split both source files into chunks, then process them (possibly in parallel), and finally concatenate the results.

I tried not to use cat in the for loop and used head -<line no> | tail -1 instead; it also worked, but it is still too slow. I can't print a specific range because the line numbers are not known in advance; I run a script to get the line numbers, and they differ from file to file.
I really don't know what the problem is!

---------- Post updated at 09:11 PM ---------- Previous update was at 09:09 PM ----------

dr house, how many lines per file do you think would be good when splitting the fatty file?

As this is very machine-dependent, I'd start with a ten-percent split, process one (!) file, time this, then either process the remaining nine splits, or split, process and time the "guinea pig" again (thus going down to, e.g., five-percent splits). You get the idea ...
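For a 3,000,000-line file, a ten-percent split could look like this (a sketch using the standard split utility; the part_ prefix is just an example name). Note that the line numbers in "file" would then have to be adjusted to be relative to each chunk:

split -l 300000 file1 part_    # produces part_aa .. part_aj, 300,000 lines each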

Can you provide a couple of sample lines from "file"?
The sed command simply prints the appropriate line, right?

awk 'FNR==NR{n[$0];next}FNR in n'  file file1 > file2
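To see what this one-liner does, here is a minimal sketch with made-up sample data: the first pass stores the wanted line numbers as keys of the array n, and the second pass prints every line of file1 whose line number (FNR) is one of those keys.

printf '2\n5\n' > file              # the line numbers to keep
printf 'a\nb\nc\nd\ne\n' > file1    # a five-line stand-in for the huge file
awk 'FNR==NR{n[$0];next}FNR in n' file file1
# prints:
# b
# e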

Or, since 2.3 million lines will have to be deleted, this may be necessary:

awk 'NR==1{getline x<f}NR==x{print;getline x<f}' f=file file1 > file2

This assumes that "file" is sorted numerically; otherwise, run sort -n on it first.
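For example (assuming "file" holds one line number per line; sort -o lets the output file be the same as the input file):

sort -n file -o file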

@vidyahar85

This statement is ridiculous and has no basis in fact whatsoever.

Back on topic.

My impression is that the O/P is reading "file" and searching through "file1" once for every line in "file" to produce the output in "file2".
It would appear that "file" contains 700,000 line numbers and that "file1" contains 3,000,000 records.
Therefore the number of reads is up to:
700,000 times 3,000,000 = 2,100,000,000,000
We are clearly on a powerful computer or it would have got nowhere in five days.

To my mind the issue is how to do ONE PASS through "file1" and select the record numbers contained in "file".
We need the following facts from the O/P.
1) Is "file" in numerical order. Is each record unique? Are there leading zeros in the record numbers. Is there a delimiter?
2) Does the record layout of "file1" include the record number? If so, where exactly in the record? Is there a delimiter?
3) Is there a Database and database language available which would make this task easier?

Thanks a lot for your replies.
Here are the answers to your questions:
1) Is "file" in numerical order. Is each record unique? NO Are there leading zeros in the record numbers. Is there a delimiter? NO

2) Does the record layout of "file1" include the record number? YES. If so, where exactly in the record? Is there a delimiter? They are in one column; you can consider Enter (the newline) the delimiter.
3) Is there a database and database language available which would make this task easier? No, I'm just trying to reformat it for a specific application.

---------- Post updated at 12:59 AM ---------- Previous update was at 12:58 AM ----------

I will try it and get back to you.
Thanks a lot.

---------- Post updated at 01:01 AM ---------- Previous update was at 12:59 AM ----------

It is just lines,
and the sed is used to print the line numbers saved in a file.

As suggested in post #6, can we see a sample portion of "file" and "file1", making it clear which field is the record number?
Please confirm whether "file" can contain duplicate record numbers. If so, it needs cleaning up first.

It didn't work: syntax error. Could you please advise?

---------- Post updated at 08:33 PM ---------- Previous update was at 08:32 PM ----------

It didn't work either: syntax error. Could you please explain and advise? I really need your help.

Are you on Solaris? If so, use nawk or /usr/xpg4/bin/awk instead of the silly awk that is the default.
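For example, the earlier one-liner stays exactly the same; only the interpreter changes:

nawk 'NR==1{getline x<f}NR==x{print;getline x<f}' f=file file1 > file2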

Yeah, I'm on Solaris. I will check and get back to you. Thanks a lot, but could you please explain the command to me?

awk 'NR==1{getline x<f}NR==x{print;getline x<f}' f=file file1 > file2
  • We are reading all lines in "file1"
  • If we are reading line 1 from "file1" then read the variable x from f (which is set to "file")
  • If we are on line x in file1 then print the line and read the next variable x from "file"
  • Repeat until we reach the end of file1
  • Output to file2
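A tiny walk-through with made-up data (two wanted line numbers, a four-line stand-in for "file1"):

printf '2\n4\n' > file
printf 'a\nb\nc\nd\n' > file1
nawk 'NR==1{getline x<f}NR==x{print;getline x<f}' f=file file1
# prints:
# b
# d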

It worked, but it only printed the first line number in "file", which is line no. 16 in file1. Why is that? As far as I know, nawk must go through the whole file!

---------- Post updated at 02:43 PM ---------- Previous update was at 09:20 AM ----------

It worked with nawk! I can't describe how I can thank you. Thanks a billion :slight_smile:

---------- Post updated at 02:44 PM ---------- Previous update was at 02:43 PM ----------

Really, thanks a lot for your help :slight_smile: