gzcat number of records

Hey guys,

  I want to do something quite simple but I just can't, no matter what I try. I have a large file and I usually just run:
gzcat test.gz | nohup /test/this-script-does-things-to-the-records.pl > /testdir/tmp_test.txt

But now I need to do it only for the first 100k records. I sure hope you won't be too hard on a newbie like me :slight_smile:

gzcat test.gz | awk 'FNR<100001' | nohup /test/this-script-does-things-to-the-records.pl > /testdir/tmp_test.txt

One way.

That could be wasteful if the file is much larger than 100k lines, since it would still read the file in its entirety. Downstream, the pipeline won't see EOF until awk eventually exits.

I would suggest

head -n100000

If awk is preferred, perhaps

awk 'FNR==100001 {exit} 1'
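A quick sanity check that both filters produce the same first 100k lines (a sketch using a throwaway sample file under /tmp; `gzcat` is spelled `gzip -dc` on systems without it, and the awk variant needs the trailing `1` as a default-print rule or it emits nothing):

```shell
# build a small gzipped sample, then compare head vs. early-exit awk
seq 1 200000 | gzip > /tmp/sample.gz
gzip -dc /tmp/sample.gz | head -n 100000 > /tmp/first100k_head.txt
gzip -dc /tmp/sample.gz | awk 'FNR==100001 {exit} 1' > /tmp/first100k_awk.txt
cmp /tmp/first100k_head.txt /tmp/first100k_awk.txt && echo "identical"
# prints "identical"
```

Either one can sit in front of the perl script in the original pipeline; the point is that both stop consuming input after line 100000 instead of draining the whole stream.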

Regards,
Alister

Thank you very much! head -n100000 works like a charm.

---------- Post updated at 03:06 PM ---------- Previous update was at 09:13 AM ----------

ok new problem :slight_smile:

I decided that I would like to perform an action on the lines between 100k and 200k (for example). The easy way I tried,

head -n100 | tail -n200

did not work, and I'm guessing that tail reads its whole input first, so that would be a bad idea (the file is really big). The next thing I'm thinking of is using sed,

something like

sed -n '100,200 p' /filelocation/file | grep "string im searching for" 

but it's kinda slow when the grep is added. Any help would be greatly appreciated.

sed -n '100000,200000{p;200000q;}' file

This will quit after printing the 200000th line.
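The head/tail route from the question can also be made to work for the same range (a sketch, using the thread's example path): `head` caps the read at line 200000, so `tail` never waits on the rest of the file, and `tail -n +N` starts output *at* line N rather than printing the last N lines.

```shell
# print lines 100000..200000 of the big file, reading only the
# first 200000 lines; tail -n +100000 starts output at line 100000
head -n 200000 /filelocation/file | tail -n +100000
```

For instance, `seq 1 300000 | head -n 200000 | tail -n +100000 | head -n 1` prints 100000, and the range covers 100001 lines inclusive.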


Thanks, I played a bit with this and here's the part that bugs me: for lines up to, say, the first million it works just fine, even above that, but when I tried to print a single line deep in the file

sed -n '177998637,177998638{p;177998638q;}' /testdir/testfile

it took a lot of time (actually I didn't even wait for the result). Is that normal behaviour?

Yes. That command reads your file sequentially up to the 177998637th line, then prints that line and the next one, and then quits. So you see, to print just those 2 lines, the command still has to read the first 177.99 million lines (I have to say, a huge huge file).

For an improvement in the time taken, you could replace that sed command with awk:

awk 'NR==177998637||NR==177998638;NR==177998638{exit}' /testdir/testfile
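Another variant that may be worth timing on a file this size (an assumption to benchmark, not a guarantee): `tail -n +N` skips the leading lines without awk's per-record processing, and `head` stops the pipeline after the two wanted lines.

```shell
# print just lines 177998637 and 177998638: tail skips the prefix,
# head exits after two lines, which terminates the pipeline
tail -n +177998637 /testdir/testfile | head -n 2
```

All of these still scan every byte up to the target line; only a fixed record length (or a prebuilt line index) would let you seek directly.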

Yeah, the file is pretty big (6 GB). If I got this right, awk will still read all the lines up to the one I need, but it's better for performance?

Thanks again : )