Identify lines with wrong format in a file and fix

Gurus,

I have a data file which has a certain number of columns say 101. It has one description column which contains foreign characters and due to this some times, those special characters are translated to new line character and resulting in failing the process.

I am using the following awk command to indentify the number of columns

cat File name | awk -F "|" '{print NF}' | sort -u

If the file has foreign characters then it will display some thing like this if I run the above command.

101
102

I have written a small code to compare the counts of previous line with the next line and displays the line no. I have not done extensive unix scripting but the performance is not good as it takes for ever.

#!/usr/bin/ksh
filename=$1
count=0
prev_count=0
lineno=1
while IFS= read line
do
count=`echo $line| awk -F "|" '{print NF}'`
if [ $count -ne $prev_count ]
then
echo $prev_count
echo $count
echo $line_no
echo "previous count is not matching"
fi
prev_count=$count
line_no=`expr $line_no + 1`
done < $filename

Is there a better way to identify the issue record and replace the foreign character with space or remove it all together.

Here is the issue record:

bad:

98671|20150731|C||CNY||||||||||||2015|07||PT 9867100000258887|00099PPD-CD DDJ�|�~�|20150731|20150731|||||||||164737||0|||PT||CCID|||WKN|||||CNY|D|00001.000000|+000000000000877.50|+000000000000877.50|164737||||||61|680006003000211008187000||||WKNPT150731A000099|OTH|||||||||||EXT|164737|CN||68|1008187000||0600300021|||||CNY||0600300021||||||||||||

good:

00001|20150731|C||CNY||||||||||||2015|07||FB 0000100000000001|FB - GA1890|20150731|20150731|||||||||164737||0|||FB||CCID|||WKN|||||CNY|C|00001.000000|+000000000791640.00|+000000000791640.00|164737||||||62|580088999000011008400000||||WKNFB1507315800001|OTH|||||||||||EXT|164737|CN||58|1008400000||8899900001|||||CNY||8899900001||||||||||||

00001|20150731|C||CNY||||||||||||2015|07||AT 0000100000000002||20150731|20150731|||||||||164737||0|||AT||CCID|||CAT|||||CNY|C|00001.000000|+000000000002000.00|+000000000002000.00|164737||||||62|680002888554511008010000||||CATEX1507310813829|OTH|||||||||||EXT|164737|CN||68|1008010000||0288855451|||||CNY||0288855451||||||||||||

When the above bad record is read by abinitio etl tool, it gets new line character and the record gets scrambled and will shifted to next column and process fails.

Any help or quick command will be helpful. Thanks.

Why not do the entire loop in awk ? Try

awk '{print NR, NF}' FS="|" file

to find the field count of the lines with line-No.s.
The interpretation of non- ASCII characters is locale- dependent. This

LC_ALL=C sed -r s/[$'\200'-$'\377'']+/ /g' file

would do as you requested.

---------- Post updated at 22:06 ---------- Previous update was at 22:01 ----------

Actually, I can't see those non-ASCII chars modify the field count. Try also

awk '1; {print NR, NF, gsub ("[\200-\377]+", " ")} 1' FS="|" file

to compare the lines before - after the replacement

RudiC,

Thanks for your suggestion. After searching for other threads in this forum, I have modified the code as follows. It identifies the problem records which have greater than 101 columns.

 awk -F "|" 'NF > 101 {print NR,$0}' <filename>

But to fix it is still a manual step. I am getting the output as shown below.

313301 98671|20150929|C||CNY||||||||||||2015|09||PT  9867100000303106|00283PPD-CDDDJ�|�~��|20150929|20150929|||||||||164737||0|||PT||CCID|||WKN|||||CNY|D|00001.000000|+000000000000610.00|+000000000000610.00|164737||||||61|680006003000211008187000||||WKNPT150929A000283|OTH|||||||||||EXT|164737|CN||68|1008187000||0600300021|||||CNY||0600300021||||||||||||

Is there a way to identify the record which is modified to

|�~�� 

in advance and fix it.

I don't understand your request. The red pipe char is the normal ASCII 0X7C used as a field separator in your file. Do you imply that one is an artefact? Did you try the gsub from post#1?

1 Like

RudiC,

I tried your command and it works. It removes the foreign characters.

 
 < 313301 98671|20150929|C||CNY||||||||||||2015|09||PT  9867100000303106|00283PPD-CDDDJ�|�~��|20150929|20150929|||||||||164737||0|||PT||CCID|||WKN|||||CNY|D|00001.000000|+000000000000610.00|+000000000000610.00|164737||||||61|680006003000211008187000||||WKNPT150929A000283|OTH|||||||||||EXT|164737|CN||68|1008187000||0600300021|||||CNY||0600300021||||||||||||

 > 313301 98671|20150929|C||CNY||||||||||||2015|09||PT  9867100000303106|00283PPD-CDDDJ | ~ |20150929|20150929|||||||||164737||0|||PT||CCID|||WKN|||||CNY|D|00001.000000|+000000000000610.00|+000000000000610.00|164737||||||61|680006003000211008187000||||WKNPT150929A000283|OTH|||||||||||EXT|164737|CN||68|1008187000||0600300021|||||CNY||0600300021||||||||||||