Friends,
I have a text file with 700,000 rows.
Once I load this file into our database via our custom process, it logs the row numbers of the rejected rows.
How do I delete rows from a Large text file based on the Row Number?
Thanks,
Prashant
sed '10d' filename > newfilename
deletes line #10.
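If the load log gives you many row numbers, one approach (file names here are just examples, not from your process) is to turn the list into a sed script and delete all the rows in a single pass:

```shell
# Turn a list of rejected row numbers (one per line) into a sed
# delete script, then strip those rows in one pass over the data.
# 'rejected.txt' and 'data.txt' are example names.
printf '3\n5\n' > rejected.txt
printf 'a\nb\nc\nd\ne\n' > data.txt
sed 's/$/d/' rejected.txt > del.sed     # "3" becomes the sed command "3d"
sed -f del.sed data.txt > cleaned.txt   # rows 3 and 5 are gone
```

With 700,000 rows this is still one sequential pass, so it scales fine; the awk one-liners in this thread do the same job without the temporary script file.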
Assuming that deletelistfile is a file listing the row numbers to delete:
awk 'NR==FNR{a[$1]=1;next}{if(!a[FNR])print $0}' deletelistfile largetextfile > resultfile
or better yet:
awk 'NR==FNR{a[$1]++;next} !(FNR in a)}' deletelistfile largetextfile > resultfile
Yes, I forgot that print $0 is the default action.
Can you tell whether !(FNR in a) is any better than !a[FNR] performance-wise?
Actually... I read somewhere (c.l.a most likely) that the 'lookup' is generally slower than the direct 'value fetch and comparison', but I don't quite remember for which version of awk.
Personally, I'm just more accustomed to thinking in terms of 'array membership' rather than 'value fetch' - just a matter of taste, I reckon.
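Beyond taste, there is one concrete difference that is easy to verify in any POSIX awk: merely referencing a[k] creates the element if it was absent, while (k in a) never does, so with many misses the two forms also differ in memory use:

```shell
awk 'BEGIN {
  if (a[1]) print "never printed"     # false, but a[1] now exists (value "")
  if (2 in a) print "never printed"   # false, and a[2] is NOT created
  n = 0
  for (k in a) n++
  print n                             # 1: only a[1] was created
}'
```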
I am getting a syntax error when I run
awk 'NR==FNR{a[$1]++;next} !(FNR in a)}' deletelistfile largetextfile > resultfile
Thanks,
Prashant
sorry - fat fingers:
awk 'NR==FNR{a[$1]++;next} !(FNR in a)' deletelistfile largetextfile > resultfile
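A quick sanity check of the corrected one-liner on tiny throwaway files (the contents here are just examples):

```shell
printf '2\n4\n' > deletelistfile                # rows to drop
printf 'r1\nr2\nr3\nr4\nr5\n' > largetextfile   # 5 sample rows
awk 'NR==FNR{a[$1]++;next} !(FNR in a)' deletelistfile largetextfile > resultfile
# resultfile now holds r1, r3 and r5
```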
Made some tests with an array of 9999 numbers.
GNU Awk 3.1.5, Copyright (C) 1989, 1991-2005 Free Software Foundation.
When k is in a, (k in a) is about as quick as a[k].
When k is not in a, (k in a) is quicker: it takes about 85% of the time of a[k].
mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan
a[k] is quicker: it takes about 88% of the time of (k in a), regardless of whether k is in a or not.
Overall, mawk takes only about 50% of the time of GNU awk.
On the other hand I have been able to consistently crash mawk with a program easily handled by GNU awk.
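Before trusting any timing numbers it's worth confirming the two variants really are equivalent. A quick check, using a 9999-line input like the tests above (file names are examples):

```shell
seq 1 9999 > nums.txt                  # 9999 numbered rows
printf '7\n4242\n' > dels.txt          # two rows to delete
awk 'NR==FNR{a[$1]++;next} !(FNR in a)' dels.txt nums.txt > out1
awk 'NR==FNR{a[$1]++;next} !a[FNR]'    dels.txt nums.txt > out2
cmp -s out1 out2 && echo identical     # same output; only speed differs
```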