Help: counting delimiters in a huge file and splitting data into 2 files

lv99 · February 16, 2011, 11:07pm

I'm new to Linux scripting and not sure how to filter out bad records from huge flat files (over 1.3GB each). The delimiter is a semicolon ';'.

Here is the sample of 5 lines in the file:

Name1;phone1;address1;city1;state1;zipcode1
Name2;phone2;address2;city2;state2;zipcode2;comment
Name3;phone3;address3;city3;state3;zipcode3
Name4;phone4;address4;city4;state4;zipcode4
Name5;phone5;address5

I need a script to read each line and count the number of ; on each line

If delimiter counts = 5 Then
Write that line to goodfile1
Else
Write bad line to rejectedfile1.

The result of two output files should look like this

goodfile1 has:

Name1;phone1;address1;city1;state1;zipcode1
Name3;phone3;address3;city3;state3;zipcode3
Name4;phone4;address4;city4;state4;zipcode4

rejectedfile1 has:

Name2;phone2;address2;city2;state2;zipcode2;comment
Name5;phone5;address5

Thanks

yinyuemi · February 16, 2011, 11:34pm

awk '{if(gsub(";",";")==5) {print >"goodfile1"} else {print >"rejectedfile1"}}' file

malcomex999 · February 17, 2011, 1:17am

Or...

 
awk -F";" 'NF==6{print >"goodfile" ;next}{print >"rejected"}' infile

kurumi · February 17, 2011, 4:20am

 $ ruby -ne '$_.count(";")==5 && print ' file >> good

rdcwayx · February 17, 2011, 5:51am

awk -F \; '{print>(NF==6?"goodfile1":"rejectedfile1")}' infile
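The answers so far count in two different ways: gsub() returns the number of substitutions it made, so replacing ";" with itself yields the delimiter count, while -F";" splits on the delimiter, so a line with 5 semicolons has NF equal to 6. A minimal sketch that runs both against the five sample lines from the first post and checks that they agree (sample.txt and the good_*/bad_* file names are made up for illustration):

```shell
# Work in a scratch directory so nothing real is overwritten.
cd "$(mktemp -d)"

# The five sample lines from the first post (sample.txt is a hypothetical name).
cat > sample.txt <<'EOF'
Name1;phone1;address1;city1;state1;zipcode1
Name2;phone2;address2;city2;state2;zipcode2;comment
Name3;phone3;address3;city3;state3;zipcode3
Name4;phone4;address4;city4;state4;zipcode4
Name5;phone5;address5
EOF

# Approach 1: gsub() returns how many ";" it replaced, i.e. the delimiter count.
awk '{ if (gsub(/;/, ";") == 5) print > "good_gsub"; else print > "bad_gsub" }' sample.txt

# Approach 2: split on ";" and test the field count (delimiters + 1).
awk -F';' '{ print > (NF == 6 ? "good_nf" : "bad_nf") }' sample.txt

# Both approaches should produce identical output files.
cmp good_gsub good_nf && cmp bad_gsub bad_nf && echo "both approaches agree"
cat good_nf
```

For a record with N delimiters, the gsub test compares against N and the NF test against N+1.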

lv99 · February 17, 2011, 12:17pm

I tried the command below, since each record is 4000 bytes and has 290 ; delimiters. I need to filter out bad records where the delimiter count does not match.

awk -F \; '{print>(NF==291?"goodfile1":"rejectedfile1")}' infile
and got this error

awk: syntax error near line 1
awk: bailing out near line 1

I also tried this

awk -F";" 'NF==291{print >"goodfile" ;next}{print >"rejected"}' infile

and got a different error

awk: record `00000036200800;20080...' too long

It seems awk has a limit on record length. Google directed me to a simple change to the command, and nawk worked well. Thank you everyone!

nawk -F";" 'NF==291{print >"goodfile" ;next}{print >"rejected"}' infile

Corona688 · February 17, 2011, 1:28pm

I suspect your system isn't Linux, because Linux generally has [gn]awk, only gawk, and nothing but gawk. Glad you got it working.
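On the earlier "syntax error near line 1": that is likely because a pre-POSIX awk does not understand the ?: conditional expression, which arrived with the "new" awk. A sketch of the same split written with a plain if/else, which older awks should parse, is below. It is shown with the 6-field sample data; sample.txt is a made-up name, and 6 would become 291 for the real file. This only sidesteps the syntax error, not the "record too long" limit.

```shell
# Work in a scratch directory so nothing real is overwritten.
cd "$(mktemp -d)"

# The five sample lines from the first post (sample.txt is a hypothetical name).
cat > sample.txt <<'EOF'
Name1;phone1;address1;city1;state1;zipcode1
Name2;phone2;address2;city2;state2;zipcode2;comment
Name3;phone3;address3;city3;state3;zipcode3
Name4;phone4;address4;city4;state4;zipcode4
Name5;phone5;address5
EOF

# Same logic as the ?: one-liner, but using if/else instead of the
# conditional expression. For the real data, replace 6 with 291.
awk -F';' '{
    if (NF == 6)
        print > "goodfile1"
    else
        print > "rejectedfile1"
}' sample.txt

# Show how many lines landed in each file.
wc -l goodfile1 rejectedfile1
```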

lv99 · March 1, 2011, 2:32pm

It's SunOS. Today I found out that nawk also gave bad results. Many records with the correct count of 291 were rejected, and the rejected file showed that they had only 198 semicolons in them. Those rejected records were also truncated in the middle: the original record showed more data in the row than the rejected row. Why?
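One way to check whether the input itself or the awk implementation is at fault is to count the semicolons per line without ever handing awk a long record. A rough sketch, assuming the local tr accepts the '\n' escape: tr deletes everything except ";" and newline, so each line shrinks to just its delimiters (290 characters at most for this data), and awk then only tallies the lengths. It is shown here on the five sample lines; sample.txt and counts.txt are made-up names.

```shell
# Work in a scratch directory so nothing real is overwritten.
cd "$(mktemp -d)"

# The five sample lines from the first post (sample.txt is a hypothetical name).
cat > sample.txt <<'EOF'
Name1;phone1;address1;city1;state1;zipcode1
Name2;phone2;address2;city2;state2;zipcode2;comment
Name3;phone3;address3;city3;state3;zipcode3
Name4;phone4;address4;city4;state4;zipcode4
Name5;phone5;address5
EOF

# Keep only ";" and newline, then build a histogram:
# each output row is "<semicolons per line> <number of lines>".
tr -cd ';\n' < sample.txt |
    awk '{ n[length($0)]++ } END { for (k in n) print k, n[k] }' |
    sort -n > counts.txt

cat counts.txt
```

If the histogram for the real file shows lines with 198 semicolons, that would suggest the input really contains short lines (an embedded newline inside a record, for example). If every line shows 290, the truncation is more likely happening inside nawk.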