Search and Replace Extended ASCII Characters

ysvsr1 · October 30, 2014, 11:10pm

We are getting extended ASCII characters in the input file, and my requirement is to search for them and replace each one with a space. I am using the following command:

LANG=C sed -e 's/[\x80-\xFF]/ /g'

It does a good job, but in some cases it replaces a single extended character with two spaces. My input file is a fixed-length file, so the record length increases by 1 or 2 characters, depending on the number of extended characters in the line.

What is the best way to replace each extended character with only one space (preferably with a sed command)?

jim_mcnamara · October 30, 2014, 11:29pm

What OS are you on, and what is the system-wide default locale setting?

ysvsr1 · October 31, 2014, 12:04am

uname -a

Linux xxx.com 2.6.32-279.22.1.el6.x86_64 #1 SMP Wed Feb 6 03:10:46 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

Locale

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

RudiC · October 31, 2014, 5:11am

Please post the output of od -tx1c for a few meaningful lines of your input file.
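
One way to pick out "meaningful" lines is to dump only the records that contain a byte above 0x7F. This is just a sketch: it assumes GNU sed (for the \xHH escapes, as in the original command), and dump_high_byte_lines is a made-up helper name, not an existing tool.

```shell
# Hypothetical helper: od dump of the first few lines that contain at
# least one byte in the range 0x80-0xFF. Assumes GNU sed.
dump_high_byte_lines() {
    # $1 = input file, $2 = number of matching lines to dump (default 3)
    # LC_ALL=C makes sed treat the data as single bytes.
    LC_ALL=C sed -n '/[\x80-\xFF]/p' "$1" | head -n "${2:-3}" | od -tx1c
}
```

Usage would be something like dump_high_byte_lines yourfile 2, where yourfile stands in for the real input file.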

jim_mcnamara · October 31, 2014, 9:04am

What we are getting at: your choice of the C locale does not "work" with the file.
So, is the file from an external source like a vendor or is it corrupted?

Your sed replaces one byte with one byte, so if a line has 100 bytes of input, the output of your sed will be 100 bytes of data, not 101. So something is going on with the data in the file.

This uses cat as UUOC to simplify the example. You know the fixed record length of your file. For this example assume it is 100.

 
fsize=$(cat yourfile | wc -c)
echo $((  $fsize % 100   ))

This should produce an answer of zero, meaning all records are the same, correct size. Try it to make sure the file is not corrupt and that we are not barking up the wrong tree.
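
The modulo check only says whether something is off, not where. A possible follow-up, assuming the records are newline-terminated, is to list the lines whose byte length differs from the expected one. The helper name bad_length_lines and its arguments are invented for this sketch.

```shell
# Hypothetical helper: print "line_number byte_length" for every line
# whose length in bytes (newline not counted) differs from the expected
# record length.
bad_length_lines() {
    # $1 = input file, $2 = expected bytes per line without the newline
    # LC_ALL=C makes awk count bytes rather than multi-byte characters.
    LC_ALL=C awk -v want="$2" 'length($0) != want { print NR, length($0) }' "$1"
}
```

For a file with 100-byte records including the newline, the call would be bad_length_lines yourfile 99.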

ysvsr1 · October 31, 2014, 11:16am

We are getting binary data from an external vendor. It is then processed by a C program and the output is good, but occasionally we get these extended characters, and only in about 1 record out of a million. So I can safely say that the file is not corrupted.

This is how the bad characters look in the vi editor:

�

The octal equivalent of the above three characters:

357 277 275

Jim, I ran your command and the output is 2.

shamrock · October 31, 2014, 11:21am

Can't you fix these extended ascii chars in your C program and avoid this post-processing...

ysvsr1 · October 31, 2014, 11:29am

RudiC, Sample output

0000000  41  36  31  31  34  30  39  32  39  30  30  30  30  30  30  30
          A   6   1   1   4   0   9   2   9   0   0   0   0   0   0   0
0000020  30  30  34  33  30  30  30  31  30  30  35  30  38  32  37  36
          0   0   4   3   0   0   0   1   0   0   5   0   8   2   7   6
0000040  31  30  32  30  31  34  2d  30  39  2d  32  38  31  36  3a  34
          1   0   2   0   1   4   -   0   9   -   2   8   1   6   :   4
0000060  32  3a  31  34  31  30  30  33  37  34  39  30  31  30  30  35
          2   :   1   4   1   0   0   3   7   4   9   0   1   0   0   5
0000100  30  38  32  37  36  31  30  31  30  20  20  20  20  20  20  20
          0   8   2   7   6   1   0   1   0
0000120  20  20  20  20  20  20  20  20  20  20  20  20  20  34  20  20
                                                              4
0000140  20  20  20  20  20  20  20  20  20  20  20  20  20  20  20  20

*
0000200  20  20  20  20  20  20  3f  23  42  32  4e  30  39  32  38  31
                                  ?   #   B   2   N   0   9   2   8   1
0000220  34  30  52  42  51  30  32  36  44  4d  6a  4d  33  4e  6a  59
          4   0   R   B   Q   0   2   6   D   M   j   M   3   N   j   Y
0000240  31  4d  6a  63  79  41  44  4a  4b  76  4d  57  39  65  61  53
          1   M   j   c   y   A   D   J   K   v   M   W   9   e   a   S
0000260  74  37  65  71  50  7a  46  76  37  5a  59  73  52  6d  6a  61
          t   7   e   q   P   z   F   v   7   Z   Y   s   R   m   j   a
0000300  42  36  45  44  52  61  31  6c  78  4b  33  77  49  30  67  61
          B   6   E   D   R   a   1   l   x   K   3   w   I   0   g   a
0000320  76  55  79  7a  76  69  31  54  59  72  47  34  39  32  38  6a
          v   U   y   z   v   i   1   T   Y   r   G   4   9   2   8   j
0000340  71  74  47  6d  35  30  41  3d  3d  4d  54  4d  77  4f  54  63
          q   t   G   m   5   0   A   =   =   M   T   M   w   O   T   c
0000360  77  4f  54  67  32  4e  77  44  33  31  57  4d  6b  56  6c  32
          w   O   T   g   2   N   w   D   3   1   W   M   k   V   l   2
0000400  52  39  65  43  7a  7a  4e  51  71  43  54  33  51  4a  6e  62
          R   9   e   C   z   z   N   Q   q   C   T   3   Q   J   n   b
0000420  69  79  6a  73  33  4a  70  65  74  67  46  31  56  71  5a  43
          i   y   j   s   3   J   p   e   t   g   F   1   V   q   Z   C
0000440  73  38  77  3d  3d  35  31  32  31  30  37  0a
          s   8   w   =   =   5   1   2   1   0   7  \n
0000454

0000000  41  36  31  31  34  30  39  32  39  30  30  30  30  30  30  30
          A   6   1   1   4   0   9   2   9   0   0   0   0   0   0   0
0000020  30  33  32  35  30  30  30  31  30  30  35  35  32  31  31  31
          0   3   2   5   0   0   0   1   0   0   5   5   2   1   1   1
0000040  36  32  32  30  31  34  2d  30  39  2d  32  38  31  34  3a  30
          6   2   2   0   1   4   -   0   9   -   2   8   1   4   :   0
0000060  30  3a  32  30  31  30  38  32  31  35  36  30  31  30  30  35
          0   :   2   0   1   0   8   2   1   5   6   0   1   0   0   5
0000100  35  32  31  31  31  36  32  31  30  20  20  20  20  20  20  20
          5   2   1   1   1   6   2   1   0
0000120  20  20  20  20  20  20  20  20  20  20  20  20  20  34  20  20
                                                              4
0000140  20  20  20  20  20  20  20  20  20  20  20  20  20  20  20  20

*
0000200  20  20  20  20  20  20  3f  23  42  32  4e  30  39  32  38  31
                                  ?   #   B   2   N   0   9   2   8   1
0000220  34  30  52  41  53  31  39  30  44  4d  6a  4d  33  4e  6a  59
          4   0   R   A   S   1   9   0   D   M   j   M   3   N   j   Y
0000240  31  4d  6a  63  79  41  45  65  51  52  70  58  46  68  6b  37
          1   M   j   c   y   A   E   e   Q   R   p   X   F   h   k   7
0000260  41  74  38  6f  56  37  4b  46  56  66  48  41  37  66  70  6a
          A   t   8   o   V   7   K   F   V   f   H   A   7   f   p   j
0000300  4f  6b  78  32  73  4e  7a  65  37  79  63  37  4b  5a  59  43
          O   k   x   2   s   N   z   e   7   y   c   7   K   Z   Y   C
0000320  70  78  51  59  4c  73  47  5a  36  79  72  65  50  34  42  67
          p   x   Q   Y   L   s   G   Z   6   y   r   e   P   4   B   g
0000340  73  68  35  4c  4c  37  41  3d  3d  4d  54  4d  77  4f  54  63
          s   h   5   L   L   7   A   =   =   M   T   M   w   O   T   c
0000360  77  4f  54  67  32  4e  77  43  50  4a  56  46  45  7a  64  35
          w   O   T   g   2   N   w   C   P   J   V   F   E   z   d   5
0000400  61  5a  5a  50  77  64  58  2b  51  75  44  71  6a  7a  79  34
          a   Z   Z   P   w   d   X   +   Q   u   D   q   j   z   y   4
0000420  77  35  4e  77  69  39  2b  2b  6b  35  79  77  30  62  5a  45
          w   5   N   w   i   9   +   +   k   5   y   w   0   b   Z   E
0000440  45  53  77  3d  3d  35  31  32  31  30  37  0a
          E   S   w   =   =   5   1   2   1   0   7  \n
0000454

---------- Post updated at 10:29 AM ---------- Previous update was at 10:26 AM ----------

This C program was developed some 20 years ago, and it is so complex that it would take a lot of time to make the code changes, test them, and deploy them. Our project went live this week, and I am looking for a quick, temporary solution for now.

RudiC · October 31, 2014, 11:43am

Where in your last post are the bytes under discussion?

The octal sequence 357 277 275 (hex: EF BF BD) is the three-byte UTF-8 encoding of U+FFFD, the Unicode replacement character (see Wikipedia), which decoders substitute for bytes they cannot interpret.

Looks like it is a leftover from a recent (incorrect) character set conversion?
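
If the stray bytes really are UTF-8 sequences like EF BF BD, then a byte-wise substitution such as s/[\x80-\xFF]/ /g turns one displayed character into two or three spaces. A minimal sketch of an alternative, assuming GNU sed and assuming that one space per character is what the fixed-length layout needs, is to match a lead byte together with its continuation bytes. blank_high_chars is a hypothetical name.

```shell
# Hypothetical filter: replace each multi-byte sequence (a lead byte
# 0xC0-0xFF plus any continuation bytes 0x80-0xBF that follow it) with
# ONE space, then replace any leftover stray continuation byte with a
# space as well. Assumes GNU sed for the \xHH escapes.
blank_high_chars() {
    LC_ALL=C sed -e 's/[\xC0-\xFF][\x80-\xBF]*/ /g' \
                 -e 's/[\x80-\xBF]/ /g'
}
```

It reads stdin and writes stdout, e.g. blank_high_chars < yourfile > cleaned. One caveat: if the data is really a single-byte encoding such as Latin-1, two adjacent high bytes could be merged into one space, so this only fits when the junk is UTF-8.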

jim_mcnamara · October 31, 2014, 11:45am

Aaagh. 2? That means that your assumption about fixed length is not quite right.

Or - there are several flavors of records, like HEADER, DATA, and TRAILER, and the HEADER and DATA records have an extra byte.

Or - the file layout is broken.
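
A tally of line lengths may help tell these cases apart: one dominant length plus a handful of outliers points at damaged records, while two or three common lengths suggest several record types. A sketch, assuming newline-terminated records; length_histogram is an invented helper name.

```shell
# Hypothetical helper: print "byte_length count" for each distinct line
# length in the file, sorted by length.
length_histogram() {
    # $1 = input file; LC_ALL=C makes awk count bytes, not characters.
    LC_ALL=C awk '{ n[length($0)]++ }
                  END { for (len in n) print len, n[len] }' "$1" |
        sort -n
}
```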

Your sed cannot ever fix something that is already broken. I do not get how this was pushed into production with a data flaw like that. It should have broken things in earlier testing. Assuming testing went well, I would look to see that everything that was pushed and tested as good matches exactly what is in PROD.

BTW - junk like this usually originates in C code where somebody does something that overwrites a trailing NUL, or never puts one there to start with. Example: using memcpy rather than strcpy. It starts with the questionable practice of not initializing C strings.

The junk comes from what was on the stack earlier.

Why do I say all this? I do not know for sure, but I believe you are going to have to run your C code in a debugger, locate the problem, and fix it.

This has now gotten past a trivial sed one-liner. Or anything we can fix by remote control for you. Maybe someone else here has a better idea. I hope.

RudiC · October 31, 2014, 11:54am

Maybe the LANG=C setting is not the best? What locale do the files come from?

shamrock · October 31, 2014, 11:59am

Why don't you post your C code here if it ain't too long and maybe some forumite can locate the problem and suggest a fix...

jim_mcnamara · October 31, 2014, 12:02pm

RudiC - I think it is just bad C code leaving stack detritus in a string variable.