Reading ALL BUT the first and last line of a huge file

Hi.

Pardon me if I'm posting a duplicate thread, but...
I have a text file with over 150 million records; the file size is in the range of MB (close to a GB).
The requirement is to read ALL the lines except the FIRST LINE, which is the file header, and the LAST LINE, which is its trailer record.

What is the OPTIMUM way to do it?
I'm aware that the sed solution would take a significantly long time to process such a huge file, hence I'm not opting for it.

Please advise.

Thanks.

Warm Regards,
Kumarjit.

You're not really giving much information, but you could always start by keeping a count of the records being processed and throwing out the first and the last.

wc -l < inputfile | read totalrecs

will give you the total number of records in the file. So, ...

 
recordCount=0
wc -l < filename | read totalrecs
cat filename | while read rec
do
  ((recordCount+=1))
  if [[ $recordCount == 1 ]] ; then
     continue ;
  fi
  if [[ $recordCount == $totalrecs ]] ; then
     break;
  fi
# ... your other processing goes here
done
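
One caveat with the loop above: wc -l file | read var only keeps the variable if the last stage of the pipeline runs in the current shell, which ksh (and zsh) do but bash does not. A variant that works in either shell, reads the file through a redirect instead of cat, and keeps each record intact (a sketch, assuming ksh93 or bash):

recordCount=0
totalrecs=$(wc -l < filename)      # command substitution survives in any shell
while IFS= read -r rec             # IFS= and -r keep whitespace and backslashes intact
do
  ((recordCount+=1))
  if (( recordCount == 1 )) ; then           # skip the header
     continue
  fi
  if (( recordCount == totalrecs )) ; then   # stop before the trailer
     break
  fi
# ... your other processing goes here
done < filename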
 

try also:

wc -l < infile | read l ; awk 'NR>1 && NR<l' l=$l infile > newfile
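
If you want to avoid two passes over such a big file (one for wc and one for awk), a single-pass sketch with any POSIX awk buffers the previous line: printing lags one record behind reading and only starts after the header, so the trailer is left in the buffer and never printed:

awk 'NR > 2 { print prev } { prev = $0 }' infile > newfile

This is the same idea as the NR>2 commands timed below.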

I'm not sure why the sed solution (which one, BTW?) should take significantly longer than the other ones posted. Would you mind posting some comparisons?


Tested with a 2 GB file (excluding writing to an output file, which should take similar time for all approaches):

$ time sed '1d;$d' greptestin1 > /dev/null

real	0m29.835s
user	0m29.186s
sys	0m0.591s
$ time awk 'NR>2{print p}{p=$0}' greptestin1 > /dev/null    # BSD awk

real	1m44.183s
user	1m43.627s
sys	0m0.481s
$ time mawk 'NR>2{print p}{p=$0}' greptestin1 > /dev/null

real	0m14.982s
user	0m14.463s
sys	0m0.498s
$ time gawk 'NR>2{print p}{p=$0}' greptestin1 > /dev/null

real	0m24.682s
user	0m24.210s
sys	0m0.414s
$ time gawk4 'NR>2{print p}{p=$0}' greptestin1 > /dev/null

real	0m27.621s
user	0m27.173s
sys	0m0.419s

If the shell reads (and discards) the first line, then the filter that follows has one condition less to check:

time { read header; sed '$d'; } < greptestin1 > /dev/null
time { read header; perl -pe '{exit if eof}'; } < greptestin1 > /dev/null

Then I get these results:

$ time { read header; sed '$d'; } < greptestin1 > /dev/null

real	0m31.812s
user	0m30.796s
sys	0m0.658s

$ time { read header; perl -pe '{exit if eof}'; } < greptestin1 > /dev/null

real	0m20.205s
user	0m19.719s
sys	0m0.472s

$ time perl -ne 'print unless ($.==1 || eof)' greptestin1 > /dev/null

real	0m20.225s
user	0m19.600s
sys	0m0.490s

tail -n +2 YOUR_LARGE_FILE
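
On its own, tail -n +2 only strips the header. To also drop the trailer, one sketch (keeping the same placeholder filename) is to pipe it into the sed shown earlier:

tail -n +2 YOUR_LARGE_FILE | sed '$d' > newfile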

The sed solution is fastest; I doubt you'll be able to beat it by much (the difference is virtually immeasurable).

Maybe it would help if we understood what you want to do with the data in the middle?
