awk to ignore multiple rows based on a condition

All,
I have a text file (Inputfile.csv) with millions of rows and 100 columns. A sample of two of those columns is shown below.

Key,Check
A,1
A,2
A,
A,4
B,0
B,1
B,2
B,3
B,4
....
million rows.

My requirement is to delete all rows for any Key that has at least one blank cell in the Check column.

Outputfile.csv

Key,Check
B,0
B,1
B,2
B,3
B,4

Currently I am using the following code

awk -F, '$2==""' Inputfile.csv | awk -F, '{print $1}' | uniq > list_of_keys_to_ignore.txt
for each in `cat list_of_keys_to_ignore.txt`; do grep -v "^$each," Inputfile.csv; done > Outputfile.csv

But this script takes a lot of time (especially the grep -v loop), as I have millions of rows and hundreds of columns.

Please suggest a faster alternative to my above code.

Thanks and Regards
Sidda

How about

sort -t, -k2 file | awk -F, '$2 == "" {T[$1]} !($1 in T)'

Hi ks_reddy,
Assuming that the Check values in your input are all numeric values, I note that RudiC's code will sort the header from your input file to the end of your output file. And by using the 2nd field as the primary sort key, the output will be grouped by (alphanumeric; not numeric) Check values while your input seems to be grouped by Key values.
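To illustrate that point, here is a quick run of RudiC's pipeline on the sample rows from the first post (LC_ALL=C is assumed so the collation order is predictable; the file name `file` is taken from RudiC's post):

```shell
# Build the sample input from the first post.
printf 'Key,Check\nA,1\nA,2\nA,\nA,4\nB,0\nB,1\nB,2\nB,3\nB,4\n' > file

# Sorting on the 2nd field puts the blank-Check rows first, so each
# offending Key lands in T before any of its other rows are seen.
LC_ALL=C sort -t, -k2 file | awk -F, '$2 == "" {T[$1]} !($1 in T)'
# B,0
# B,1
# B,2
# B,3
# B,4
# Key,Check   <- header sorts to the end; rows are grouped by Check
```

The empty Check field sorts before "0", which is exactly why the trick works, but it also moves the "Key,Check" header to the end and interleaves the Keys by Check value, as Don describes.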

Does your real input have all lines for each distinct Key value grouped together?

Do you want the header line in the output file? If so, does the header need to be kept as the first line in the output?

Does the order of other lines in the output matter? If so, does the input order need to be maintained in the output? Or is a different sort order required (and, if so, what order)?

Approximately how many distinct Key values are there in your real input? Approximately how many of those Key values will need to be removed?

try also:

awk -F, 'NR==FNR {if ($2 !~ /./) a[$1]=1; next;} ! a[$1] ' inputfile.csv inputfile.csv > output.csv
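For anyone reading along, here is rdrtx1's two-pass approach run on the sample data from the first post (file names from the first post are assumed):

```shell
# Build the sample input from the first post.
printf 'Key,Check\nA,1\nA,2\nA,\nA,4\nB,0\nB,1\nB,2\nB,3\nB,4\n' > Inputfile.csv

# Pass 1 (NR==FNR): note every Key whose Check field is blank.
# Pass 2: print only lines whose Key was never noted; the header and
# the original row order are preserved.
awk -F, 'NR==FNR {if ($2 !~ /./) a[$1]=1; next;} ! a[$1]' Inputfile.csv Inputfile.csv > Outputfile.csv

cat Outputfile.csv
# Key,Check
# B,0
# B,1
# B,2
# B,3
# B,4
```

The file is read twice, but no sort is needed, and the header stays on the first line because its Check field ("Check") is non-blank, so "Key" is never added to the array.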

Hi rdrtx1,
Your code works very well. Thank you so much.


Hello Don,
As mentioned already, the code suggested by rdrtx1 works well, as my output requires the header to be kept in place and the original row order to be maintained.


Hi Rudi,
I cannot sort my whole input data, so I followed the code suggested by rdrtx1, and it works perfectly.
For a speed comparison: your code took 90 seconds on my sample data, while rdrtx1's code took 21 seconds.