A script that processes a sample of a file

Hi all, I need some help with how to process just a sample from a large .txt file.

I have a large file with many lines (say above 200,000 lines), and I need a script that processes just a sample of it, say 10,000, but a random sample (taking rows from the top to the bottom).

Could someone help? I would also be happy if you could enable me to give the sample size as an input - for example, if I need 20,000 instead of 10,000, to give the 20,000 as input.

Thanks a lot!

awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01) print $0}' file

Should give you ca 1% of the file (the .01 is the probability that each non-empty line is printed).
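To make the sample size an input, as the question asks, one possible two-pass sketch is to count the non-empty lines first and then sample with probability n/total. The variable names n and total are illustrative, not part of the original one-liner:

```shell
# Hypothetical sketch: pass the desired sample size in as an awk variable.
n=10000                          # desired sample size (could come from $1)
total=$(grep -c . file)          # count of non-empty lines
# Print each non-empty line with probability n/total.
awk -v n="$n" -v t="$total" 'BEGIN {srand()} !/^$/ {if (rand() <= n/t) print}' file
```

Because each line is an independent random draw, the output will only be approximately n lines, not exactly n.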


Could you please let me know how you know what fraction it will give? For example, if I know the number of rows beforehand, how should I play with the

(rand() <= .01)

parameter, which I think does the trick?

Thank you very much!

I found this using Google. It seems that if you change the parameter you get more or less data.

According to the standards, awk's rand() function returns a pseudo-random number x such that 0 <= x < 1. If we assume that the pseudo-random number is uniformly distributed (which is not required by the standards), then (rand() <= .01) should be true approximately 1% of the time. For 10%, the test would be (rand() <= .1) (again, assuming rand() produces a uniformly distributed set of values).

Note that when using "random" numbers, there is certainly no guarantee that using (rand() <= 10000/200000) as the test in Jotne's script will print exactly 10,000 lines from a 200,000-line file.
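If exactly n lines are required, one alternative sketch (assuming GNU coreutils is available) numbers the lines, draws exactly n of them at random with shuf, and re-sorts by line number to keep the original top-to-bottom order:

```shell
# Sketch assuming GNU coreutils (nl, shuf, sort, cut):
# number every line, pick exactly n at random, restore the
# original order, then strip the line numbers again.
n=10000
nl -ba file | shuf -n "$n" | sort -n | cut -f2-
```

Unlike the probabilistic awk test, this always yields exactly n lines (provided the file has at least n).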