Friends,
I have a text file with 700,000 rows.
Once I load this file into our database via our custom process, it logs the row numbers of the rejected rows.
How do I delete rows from a Large text file based on the Row Number?
Thanks,
Prashant
sed '10d' filename > newfilename
deletes line #10.
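If the load log gives you many row numbers, one approach (file names here are just examples, not from your process) is to turn the list into a sed script and delete all the rows in a single pass:

```shell
# Turn a list of rejected row numbers (one per line) into a sed
# delete script, then strip those rows in one pass over the data.
# 'rejected.txt' and 'data.txt' are example names.
printf '3\n5\n' > rejected.txt
printf 'a\nb\nc\nd\ne\n' > data.txt
sed 's/$/d/' rejected.txt > del.sed     # "3" becomes the sed command "3d"
sed -f del.sed data.txt > cleaned.txt   # rows 3 and 5 are gone
```

With 700,000 rows this is still one sequential pass, so it scales fine; the awk one-liners in this thread do the same job without the temporary script file.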
Assuming that deletelistfile is a file listing the row numbers to delete:
awk 'NR==FNR{a[$1]=1;next}{if(!a[FNR])print $0}' deletelistfile largetextfile > resultfile
or better yet:
awk 'NR==FNR{a[$1]++;next} !(FNR in a)}' deletelistfile largetextfile > resultfile
Yes, I forgot that print $0 is the default action.
Can you tell whether !(FNR in a) is any better than !a[FNR] performance-wise?
Actually... I read somewhere (c.l.a most likely) that the 'lookup' is generally slower than the direct 'value fetch and comparison', but I don't quite remember for which version of awk.
Personally, I'm just more accustomed to thinking in terms of 'array membership' rather than 'value fetch' - just a matter of taste, I reckon.
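Beyond taste, there is one concrete difference that is easy to verify in any POSIX awk: merely referencing a[k] creates the element if it was absent, while (k in a) never does, so with many misses the two forms also differ in memory use:

```shell
awk 'BEGIN {
  if (a[1]) print "never printed"     # false, but a[1] now exists (value "")
  if (2 in a) print "never printed"   # false, and a[2] is NOT created
  n = 0
  for (k in a) n++
  print n                             # 1: only a[1] was created
}'
```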
I am getting a syntax error when I run
awk 'NR==FNR{a[$1]++;next} !(FNR in a)}' deletelistfile largetextfile > resultfile
Thanks,
Prashant
sorry - fat fingers:
awk 'NR==FNR{a[$1]++;next} !(FNR in a)' deletelistfile largetextfile > resultfile
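A quick sanity check of the corrected one-liner on tiny throwaway files (the contents here are just examples):

```shell
printf '2\n4\n' > deletelistfile                # rows to drop
printf 'r1\nr2\nr3\nr4\nr5\n' > largetextfile   # 5 sample rows
awk 'NR==FNR{a[$1]++;next} !(FNR in a)' deletelistfile largetextfile > resultfile
# resultfile now holds r1, r3 and r5
```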
Made some tests with an array of 9999 numbers.
GNU Awk 3.1.5, Copyright (C) 1989, 1991-2005 Free Software Foundation.
When k is in a, (k in a) is about as quick as a[k].
When k is not in a, (k in a) is quicker: it takes about 85% of the time of a[k].
mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan
a[k] is quicker: it takes about 88% of the time of (k in a), regardless of whether k is in a or not.
Overall, mawk takes only about 50% of the time of GNU awk.
On the other hand I have been able to consistently crash mawk with a program easily handled by GNU awk.
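Before trusting any timing numbers it's worth confirming the two variants really are equivalent. A quick check, using a 9999-line input like the tests above (file names are examples):

```shell
seq 1 9999 > nums.txt                  # 9999 numbered rows
printf '7\n4242\n' > dels.txt          # two rows to delete
awk 'NR==FNR{a[$1]++;next} !(FNR in a)' dels.txt nums.txt > out1
awk 'NR==FNR{a[$1]++;next} !a[FNR]'    dels.txt nums.txt > out2
cmp -s out1 out2 && echo identical     # same output; only speed differs
```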