A problem with large files

Hello all,

Kindly, I need your help. I wrote a script to print specific lines from a huge file of about 3 million lines; the output of the script will be about 700,000 lines. The problem is that the script is too slow: it has been running for 5 days and the output so far is only 200,000 lines!

The script is very simple:

for i in `cat file`                 # "file" contains the line numbers to be printed from the huge file
do
    sed "${i}q;d" file1 >> file2    # file1 is the huge file (3 million lines); file2 is the output, which will be about 700,000 lines
done

So please, could anyone tell me how I can decrease the processing time of this script, and why it is taking all that time?

Thanks in advance.

Do you know the from and to line numbers of the part of the file which you want to print?

Because cat file will be really slow if the file has 3 million records; please avoid that.

From my experience, running a pile of smaller files takes much less time than working with a single "fatty" :wink: Thus, you may want to split both source files into chunks, then process them (possibly in parallel), and finally concatenate the results.

I tried not to use cat in the for loop and used head -<line no> | tail -1 instead; it also worked, but it is still too slow. I can't print a specific range because the line numbers are not known in advance; I run a script to get the line numbers, and they differ from file to file.
I really don't know what the problem is!

---------- Post updated at 09:11 PM ---------- Previous update was at 09:09 PM ----------

dr house, how many lines per file do you think would be good when splitting the fatty file?

As this is very machine-dependent, I'd start with a ten-percent split, process one (!) file, time this, then either process the remaining nine splits, or split, process and time the "guinea pig" again (thus going down to, e.g., five-percent splits). You get the idea ...
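For a 3,000,000-line file, a ten-percent split could look like this (a sketch using the standard split utility; the part_ prefix is just an example name). Note that the line numbers in "file" would then have to be adjusted to be relative to each chunk:

split -l 300000 file1 part_    # produces part_aa .. part_aj, 300,000 lines each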

Can you provide a couple of sample lines from "file"?
The sed command simply prints the appropriate line, right?

awk 'FNR==NR{n[$0];next}FNR in n'  file file1 > file2
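To see what this one-liner does, here is a minimal sketch with made-up sample data: the first pass stores the wanted line numbers as keys of the array n, and the second pass prints every line of file1 whose line number (FNR) is one of those keys.

printf '2\n5\n' > file              # the line numbers to keep
printf 'a\nb\nc\nd\ne\n' > file1    # a five-line stand-in for the huge file
awk 'FNR==NR{n[$0];next}FNR in n' file file1
# prints:
# b
# e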

Or, since 2.3 million lines will have to be deleted, this may be necessary:

awk 'NR==1{getline x<f}NR==x{print;getline x<f}' f=file file1 > file2

This assumes that "file" is sorted numerically; otherwise, run sort -n on it first.
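For example (assuming "file" holds one line number per line; sort -o lets the output file be the same as the input file):

sort -n file -o file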

@vidyahar85

This statement is ridiculous and has no basis in fact whatsoever.

Back on topic.

My impression is that the O/P is reading "file" and searching through "file1" once for every line in "file" to produce the output in "file2".
It would appear that "file" contains 700,000 line numbers and that "file1" contains 3,000,000 records.
Therefore the number of reads is up to:
700,000 times 3,000,000 = 2,100,000,000,000
We are clearly on a powerful computer or it would have got nowhere in five days.

To my mind the issue is how to do ONE PASS through "file1" and select the record numbers contained in "file".
We need the following facts from the O/P.
1) Is "file" in numerical order. Is each record unique? Are there leading zeros in the record numbers. Is there a delimiter?
2) Does the record layout of "file1" include the record number? If so, where exactly in the record? Is there a delimiter?
3) Is there a Database and database language available which would make this task easier?

Thanks a lot for your replies.
Here are the answers to your questions:
1) Is "file" in numerical order. Is each record unique? NO Are there leading zeros in the record numbers. Is there a delimiter? NO

2) Does the record layout of "file1" include the record number? YES. If so, where exactly in the record? Is there a delimiter? They are in one column; you can consider Enter (the newline) the delimiter.
3) Is there a database and database language available which would make this task easier? No, I'm just trying to reformat it for a specific application.

---------- Post updated at 12:59 AM ---------- Previous update was at 12:58 AM ----------

I will try it and get back to you.
Thanks a lot.

---------- Post updated at 01:01 AM ---------- Previous update was at 12:59 AM ----------

It is just lines,
and the sed is used to print the line numbers saved in a file.

As suggested in post #6, can we see a sample portion of "file" and "file1", making it clear which field is the record number?
Please confirm whether "file" can contain duplicate record numbers. If so, it needs cleaning up first.

It didn't work: syntax error. Could you please advise?

---------- Post updated at 08:33 PM ---------- Previous update was at 08:32 PM ----------

It didn't work either: syntax error. Could you please explain and advise? I really need your help.

Are you on Solaris? If so, use nawk or /usr/xpg4/bin/awk instead of the silly awk that is the default.
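For example, the earlier one-liner stays exactly the same; only the interpreter changes:

nawk 'NR==1{getline x<f}NR==x{print;getline x<f}' f=file file1 > file2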

Yeah, I'm on Solaris. I will check and get back to you. Thanks a lot, but could you please explain the command to me?

awk 'NR==1{getline x<f}NR==x{print;getline x<f}' f=file file1 > file2
  • We are reading all lines in "file1"
  • If we are reading line 1 from "file1" then read the variable x from f (which is set to "file")
  • If we are on line x in file1 then print the line and read the next variable x from "file"
  • Repeat until we reach the end of file1
  • Output to file2
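A tiny walk-through with made-up data (two wanted line numbers, a four-line stand-in for "file1"):

printf '2\n4\n' > file
printf 'a\nb\nc\nd\n' > file1
nawk 'NR==1{getline x<f}NR==x{print;getline x<f}' f=file file1
# prints:
# b
# d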

It worked, but it only printed the first line number in "file", which is line no. 16 in file1. Why is that? As far as I know, nawk must go through the whole file!

---------- Post updated at 02:43 PM ---------- Previous update was at 09:20 AM ----------

It worked with nawk! I can't describe how I can thank you. Thanks a billion :slight_smile:

---------- Post updated at 02:44 PM ---------- Previous update was at 02:43 PM ----------

Really, thanks a lot for your help :slight_smile: