We are getting extended ASCII characters in the input file, and my requirement is to find them and replace each one with a space. I am using the following command:
LANG=C sed -e 's/[\x80-\xFF]/ /g'
It does a good job, but in some cases it replaces an extended character with two spaces. My input file is a fixed-length file, and because of this the record length increases by 1 or 2 characters, depending on the number of extended characters in the line.
What is the best way to replace each extended character with only one space?
(Preferably a sed command.)
What we are getting at: your choice of the C locale does not "work" with the file.
So, is the file from an external source like a vendor or is it corrupted?
Because if a line has 100 bytes of data, whatever characters those bytes represent in a given locale, the output of your sed in the C locale will be 100 bytes of data, not 101. So something is going on with the data in the file.
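To illustrate the byte-count point, here is a minimal sketch with made-up sample records. It assumes GNU sed, since the `\x80-\xFF` range inside a bracket expression is a GNU extension. It sets LC_ALL=C rather than LANG=C because LC_ALL overrides any LC_CTYPE already set in the environment. The second half shows one possible reason for seeing two spaces, which nothing in this thread confirms: a character that displays as a single glyph in a UTF-8 terminal can occupy two bytes, and each byte is replaced separately.

```shell
# Hypothetical 10-byte record: 9 data bytes plus a newline, one byte is 0xE9.
# printf '\351' writes the single byte 0xE9 (octal escape).
rec=$(mktemp)
printf 'ABCD\351FGHI\n' > "$rec"

# Byte count before and after the replacement (tr strips wc's padding).
before=$(wc -c < "$rec" | tr -d ' ')
after=$(LC_ALL=C sed 's/[\x80-\xFF]/ /g' "$rec" | wc -c | tr -d ' ')
echo "bytes before: $before, after: $after"

# A two-byte UTF-8 sequence (0xC3 0xA9, one glyph in a UTF-8 terminal)
# becomes two spaces, yet the byte count of the line is still unchanged.
two=$(printf 'A\303\251B\n' | LC_ALL=C sed 's/[\x80-\xFF]/ /g')
echo "[$two]"

rm -f "$rec"
```

With GNU sed the two byte counts should be equal. If the record length is measured in characters under a UTF-8 locale instead of in bytes, a two-byte character turning into two spaces would look like growth of one character.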
To check, count the records whose length differs from the fixed record length; the example uses cat (a UUOC) to keep it simple. You know the fixed record length of your file. For this example assume it is 100.
This should produce the answer of zero, meaning all records are the same, correct size. Try it to make sure the file is not corrupt and that we are not barking up the wrong tree.
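A check of the kind described above might look like the following sketch. It is an assumption about the shape of the check, not the exact command from the post; the sample file, the variable names, and the choice of awk are all illustrative.

```shell
# Hypothetical sample file: two good 100-byte records and one 101-byte record.
file=$(mktemp)
printf '%-100s\n' 'A611409290' >  "$file"
printf '%-101s\n' 'A611409291' >> "$file"
printf '%-100s\n' 'A611409292' >> "$file"

reclen=100   # assumed fixed record length, not counting the newline

# cat is a deliberate UUOC; awk counts records whose byte length is not reclen.
bad=$(cat "$file" | LC_ALL=C awk -v len="$reclen" 'length($0) != len { n++ } END { print n + 0 }')
echo "records with unexpected length: $bad"

rm -f "$file"
```

On the sample file this reports one odd-sized record. On a real file, point `file` at your data and set `reclen` to your record length; a result of zero means every record has the expected size.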
We are getting binary data from an external vendor. It is then processed by a C program and the output is good, but occasionally we get these extended characters, and only in about 1 record out of a million. So I can safely say that the file is not corrupted.
This is how the bad characters look in the vi editor:
�
The od dump of the above three characters (offsets are in octal, byte values in hex):
0000000 41 36 31 31 34 30 39 32 39 30 30 30 30 30 30 30
A 6 1 1 4 0 9 2 9 0 0 0 0 0 0 0
0000020 30 30 34 33 30 30 30 31 30 30 35 30 38 32 37 36
0 0 4 3 0 0 0 1 0 0 5 0 8 2 7 6
0000040 31 30 32 30 31 34 2d 30 39 2d 32 38 31 36 3a 34
1 0 2 0 1 4 - 0 9 - 2 8 1 6 : 4
0000060 32 3a 31 34 31 30 30 33 37 34 39 30 31 30 30 35
2 : 1 4 1 0 0 3 7 4 9 0 1 0 0 5
0000100 30 38 32 37 36 31 30 31 30 20 20 20 20 20 20 20
0 8 2 7 6 1 0 1 0
0000120 20 20 20 20 20 20 20 20 20 20 20 20 20 34 20 20
4
0000140 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
*
0000200 20 20 20 20 20 20 3f 23 42 32 4e 30 39 32 38 31
? # B 2 N 0 9 2 8 1
0000220 34 30 52 42 51 30 32 36 44 4d 6a 4d 33 4e 6a 59
4 0 R B Q 0 2 6 D M j M 3 N j Y
0000240 31 4d 6a 63 79 41 44 4a 4b 76 4d 57 39 65 61 53
1 M j c y A D J K v M W 9 e a S
0000260 74 37 65 71 50 7a 46 76 37 5a 59 73 52 6d 6a 61
t 7 e q P z F v 7 Z Y s R m j a
0000300 42 36 45 44 52 61 31 6c 78 4b 33 77 49 30 67 61
B 6 E D R a 1 l x K 3 w I 0 g a
0000320 76 55 79 7a 76 69 31 54 59 72 47 34 39 32 38 6a
v U y z v i 1 T Y r G 4 9 2 8 j
0000340 71 74 47 6d 35 30 41 3d 3d 4d 54 4d 77 4f 54 63
q t G m 5 0 A = = M T M w O T c
0000360 77 4f 54 67 32 4e 77 44 33 31 57 4d 6b 56 6c 32
w O T g 2 N w D 3 1 W M k V l 2
0000400 52 39 65 43 7a 7a 4e 51 71 43 54 33 51 4a 6e 62
R 9 e C z z N Q q C T 3 Q J n b
0000420 69 79 6a 73 33 4a 70 65 74 67 46 31 56 71 5a 43
i y j s 3 J p e t g F 1 V q Z C
0000440 73 38 77 3d 3d 35 31 32 31 30 37 0a
s 8 w = = 5 1 2 1 0 7 \n
0000454
0000000 41 36 31 31 34 30 39 32 39 30 30 30 30 30 30 30
A 6 1 1 4 0 9 2 9 0 0 0 0 0 0 0
0000020 30 33 32 35 30 30 30 31 30 30 35 35 32 31 31 31
0 3 2 5 0 0 0 1 0 0 5 5 2 1 1 1
0000040 36 32 32 30 31 34 2d 30 39 2d 32 38 31 34 3a 30
6 2 2 0 1 4 - 0 9 - 2 8 1 4 : 0
0000060 30 3a 32 30 31 30 38 32 31 35 36 30 31 30 30 35
0 : 2 0 1 0 8 2 1 5 6 0 1 0 0 5
0000100 35 32 31 31 31 36 32 31 30 20 20 20 20 20 20 20
5 2 1 1 1 6 2 1 0
0000120 20 20 20 20 20 20 20 20 20 20 20 20 20 34 20 20
4
0000140 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
*
0000200 20 20 20 20 20 20 3f 23 42 32 4e 30 39 32 38 31
? # B 2 N 0 9 2 8 1
0000220 34 30 52 41 53 31 39 30 44 4d 6a 4d 33 4e 6a 59
4 0 R A S 1 9 0 D M j M 3 N j Y
0000240 31 4d 6a 63 79 41 45 65 51 52 70 58 46 68 6b 37
1 M j c y A E e Q R p X F h k 7
0000260 41 74 38 6f 56 37 4b 46 56 66 48 41 37 66 70 6a
A t 8 o V 7 K F V f H A 7 f p j
0000300 4f 6b 78 32 73 4e 7a 65 37 79 63 37 4b 5a 59 43
O k x 2 s N z e 7 y c 7 K Z Y C
0000320 70 78 51 59 4c 73 47 5a 36 79 72 65 50 34 42 67
p x Q Y L s G Z 6 y r e P 4 B g
0000340 73 68 35 4c 4c 37 41 3d 3d 4d 54 4d 77 4f 54 63
s h 5 L L 7 A = = M T M w O T c
0000360 77 4f 54 67 32 4e 77 43 50 4a 56 46 45 7a 64 35
w O T g 2 N w C P J V F E z d 5
0000400 61 5a 5a 50 77 64 58 2b 51 75 44 71 6a 7a 79 34
a Z Z P w d X + Q u D q j z y 4
0000420 77 35 4e 77 69 39 2b 2b 6b 35 79 77 30 62 5a 45
w 5 N w i 9 + + k 5 y w 0 b Z E
0000440 45 53 77 3d 3d 35 31 32 31 30 37 0a
E S w = = 5 1 2 1 0 7 \n
0000454
This C program was developed some 20 years ago and it is so complex that it would take a lot of time to make the code changes, test them, and deploy them. Our project went live this week and I am looking for a quick, temporary solution for now.
Aaagh. 2? That means that your assumption about fixed length is not quite right.
Or - there are several flavors of records like HEADER DATA TRAILER and HEADER and DATA have an extra byte.
Or - the file layout is broken.
Your sed cannot ever fix something that is already broken. I do not get how this was pushed into production with a data flaw like that. It should have broken things in earlier testing. Assuming testing went well, I would look to see that everything that was pushed and tested as good matches exactly what is in PROD.
BTW - junk like this usually originates in C code where somebody does something that overwrites a trailing NUL, or where none was put there to start with. Example: memcpy rather than strcpy. It starts with the questionable practice of not initializing C strings.
The junk comes from what was on the stack earlier.
Why do I say all this? I do not know for sure, but I believe you are going to have to run your C code in a debugger, locate the problem, and fix it.
This has now gotten past a trivial sed one-liner, or anything we can fix by remote control for you. Maybe someone else here has a better idea. I hope.