Replace Special Character With Next Present Byte

dineshnak · August 20, 2014, 4:07am

Hi,

I need to find a special character, read the next two bytes after it as a hexadecimal number, convert that number to decimal, and then repeat the byte that follows that many times in place of the whole sequence.

E.g.

Input: 302619�1A? 
Output: 302619(3 spaces for �1A)??????????????????????????

Thanks,
Dines

vbe · August 20, 2014, 4:25am

Yes?

What have you done so far?

Skrynesaver · August 20, 2014, 4:34am

$ perl -Mutf8 -E '$value="302619�1A?";$value=~s/[^A-Za-z\s\d]([\da-fA-F]{2})(.)/sprintf("   %s", $2 x hex($1))/e; say $value;' -
302619   ??????????????????????????

dineshnak · August 20, 2014, 5:23am

Hi,

The code above failed with the following error:
"Unrecognized switch: -E (-h will show valid options)."
Could you guide us on changing the code so that it runs on SunOS?

Thanks,
Dines

Skrynesaver · August 20, 2014, 5:35am

You're running an older version of Perl (the -E switch and say need Perl 5.10 or later), so try the following:

$ perl -Mutf8 -e '$value="302619�1A?";$value=~s/[^A-Za-z\s\d]([\da-fA-F]{2})(.)/sprintf("   %s", $2 x hex($1))/e; print "$value\n";' -
302619   ??????????????????????????

dineshnak · August 20, 2014, 5:53am

Hi,

Thanks for your reply.
Could you please explain how to handle this in a file where the pattern occurs multiple times? For example:

IBM513AMMOD�07 ibmyx66mcp00�06 302619�1A 00005014072605331600�0A 980�32 201407260533160�14

Thanks,
Dines

rbatte1 · August 20, 2014, 5:53am

Welcome dineshnak,
I don't really see the question clearly, so I have a few questions to pose in response first:-

Is this homework/assignment? There are specific forums for these.
What have you tried so far?
What output/errors do you get?
What OS and version are you using?
What are your preferred tools? (C, shell, perl, awk, etc.)
What logical process have you considered? (to help steer us to follow what you are trying to achieve)

Most importantly, What have you tried so far?

There are probably many ways to achieve most tasks, so giving us an idea of your style and thoughts will help us guide you to an answer most suitable to you so you can adjust it to suit your needs in future.

We're all here to learn and getting the relevant information will help us all.

Regards,
Robin

dineshnak · August 20, 2014, 6:08am

Hi Robin,
We receive a fixed-length file in which the data has been compressed: a special character followed by a two-digit hexadecimal value (for example "�07" or "�1A?") appears between the ordinary characters. We need to read the special character together with the two hexadecimal digits after it, convert the hexadecimal value to decimal, and write the byte that follows (a symbol or a space) that many times in its place. We tried awk and sed but made no progress. We need a sample script, or some guidance, that runs on SunOS so we can take it further.

Thanks,
Dines

rbatte1 · August 20, 2014, 6:28am

Is this packed-decimal data from a mainframe perhaps? It would probably be easier to generate the file as truly plain text at the source before transferring it. If you are using FTP, make sure the transfer is forced to be an ASCII transfer.

From your example in the first post, I think what you want is to read the 1A? to mean 'please insert 26 (decimal) question marks', and to do the same any other time we hit the special character, in the same line or any other line.

It makes it all a bit complex, hence why I suggest you generate a fully expanded file at the source. If it won't fit, or the transfer takes too long, then there are commercial compression tools that are available for pretty much any platform combination. We changed one transfer from 23 hours to 4 by using one, but there will be others out there.

What is your source system?

Robin

dineshnak · August 20, 2014, 6:35am

Hi Robin,

The fill characters that follow the hexadecimal values in the file are fixed: either "?" or a space. The file was transferred by FTP from a Windows machine. Please provide some information on how to handle this situation with a UNIX script.

Thanks,
Dines

rbatte1 · August 20, 2014, 6:38am

Is this a Windows compressed file, a Winzip file or something else?

It may be possible to expand this with gunzip or similar utilities if these are available to you, but we'd need to know how it is generated in the first place.

Robin

Skrynesaver · August 20, 2014, 6:54am

The substitution regex above matches any character which is not a letter, number or space (as defined by the current locale) followed by 2 characters that could be interpreted as hexadecimal, followed by any character and replaces them with the 3 spaces followed by the character repeated n times.

The fact that Perl allows executable code in the substitution block means we can do things like this not available in sed or awk as a single substitution.

The e flag marks the substitution block for evaluation; the g flag would allow the substitution to be applied globally rather than to just the first match.

man perlre for more details on Perl regex
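
To illustrate the g flag, here is a minimal sketch that applies the same substitution to every occurrence on every line of input. The helper name expand_all is made up for this example, and the bytes \303\272 are only a stand-in for the special character. Because a file is read as raw bytes here (no -Mutf8 decoding), a "+" is added after the character class so that a multi-byte special character is swallowed whole; that tweak is an assumption, not part of the original one-liner.

```shell
# expand_all is a hypothetical helper name: it reads lines on stdin (or from
# named files) and applies the substitution to every occurrence on each line.
expand_all() {
    perl -pe 's/[^A-Za-z\s\d]+([\da-fA-F]{2})(.)/sprintf("   %s", $2 x hex($1))/ge' "$@"
}

# \303\272 is a stand-in for the special character, written as octal escapes.
printf '302619\303\2721A? 980\303\27203 end\n' | expand_all
# prints 302619, three spaces, 26 "?", " 980", six spaces, then "end"
```

Note that this character class treats any punctuation followed by two hex digits as a marker, so real data containing something like "-12x" would also be expanded. Matching the exact marker bytes would be safer once they are known.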

dineshnak · August 20, 2014, 6:54am

Hi Robin,

The file was transferred by FTP from a Windows system without any file-level compression; only the data inside it is compressed. We need to expand the data by finding each special-character sequence in the file (for example "�07" or "�1A?"), converting the hexadecimal value to decimal, and padding with the byte that follows it (a space or "?") that many times.

Sample Data:
Input Data:

ABCD172 2 B10001�0E F�08 DineshG�14 KumarNakka�0E �3C?IN�14?EFGH340

Output Data:

ABCD172 2 B10001   (14 spaces) DineshG   (14 spaces) KumarNakka   (14 spaces)    ??????(60-?symbol)IN   ??????????????EFGH340

Thanks,
Dines

rbatte1 · August 20, 2014, 7:04am

From your input of:-

ABCD172 2 B10001�0E F�08 DineshG�14 KumarNakka�0E �3C?IN�14?EFGH340

.....I would expect this:-

ABCD172 2 B10001              F        DineshG                    KumarNakka              ????????????????????????????????????????????????????????????IN????????????????????EFGH340

.... which does not match your description of the expected output.

I'd still like to pursue how this is generated in the first place and see if there is a simpler alternative.

Robin

dineshnak · August 20, 2014, 7:15am

Hi Robin,

I missed the data in between because of a typo. Your expected output is correct. Could you please explain the logic you used to process the data and get that result? The file is generated by a VB script.

Thanks,
Dines

rbatte1 · August 20, 2014, 7:32am

It was read by eye and converted by brain to hopefully confirm that we are working on the correct input to output relationship, hence why CODE tags are so important. I have done no coding so far.

Is the special character always the same? Can you confirm what character it is with od? If you can cut down a copy of the file to just contain the character, then use:-

od -x input_file

and paste the output here in CODE tags, then that would help.

Is this a large file that needs the powerful processing of awk, sed or perl, or would a simpler but slower loop in a shell be acceptable? If the files are small but there are many that you want to process in a loop, calling awk etc. for each one can sometimes work out slower.

Robin

dineshnak · August 20, 2014, 8:07am

Hi,

Please find below the output of the above command. The file size would be greater than 1 MB.

0000000 314d 4847 3139 3620 3720 4942 4d35 3133
0000020 414d 4d4f 44e2 9692 3037 2069 626d 7978
0000040 3636 6d63 7030 30e2 9692 3036 2033 3032
0000060 3631 39e2 9692 3141 2030 3030 3035 3031
0000100 3430 3732 3630 3533 3331 3630 30e2 9692
0000120 3041 2039 3830 e296 9233 3220 3230 3134
0000140 3037 3236 3035 3333 3136 30e2 9692 3134
0000160 2032 5452 4732 3132 2037 2039 38e2 9692
0000200 3039 2033 5020 5041 5554 4f20 464d 4742
0000220 4e20 2049 424d 3531 3341 4d4d 4f41 4d45
0000240 5249 4341 4e20 4d4f 4445 520a 4e20 494e
0000260 5355 5241 4e43 4533 3032 3631 39e2 9692
0000300 3430 2039 3738 3731 3430 3732 3539 3738
0000320 3720 2020 2031 3430 3732 3420 4e42 53e2
0000340 9692 3139 2041 3230 3134 3037 3235 3230
0000360 3134 3037 3234 200a
0000370

Thanks,
Dines

rbatte1 · August 20, 2014, 8:28am

Huh?

I was expecting to see just a single line with the character code for � and a terminating 0a.

I don't see a match for the rest of your sample either. I was looking for hex strings like this:-

0000000 4142 4344 3137 3220 3220 4231 3030 3031

The next character would be the � that we're interested in. I don't fancy working out byte-for-byte what there is and what's which.
The above is from your first line up to the �

Can you clarify which input you have used for this? If you have to sanitise the input, please provide the matching output for the sanitised version.

Thanks,
Robin

dineshnak · August 20, 2014, 9:05am

Hi,

Please find the output below.

0000000 4142 4344 3137 3220 3220 4231 3030 3031
0000020 c3ba 3045 2046 c3ba 3038 2044 696e 6573
0000040 6847 c3ba 3134 204b 756d 6172 4e61 6b6b
0000060 61c3 ba30 4520 c3ba 3343 3f49 4ec3 ba31
0000100 343f 4546 4748 3334 300a
0000112

Thanks,
Dines

---------- Post updated at 08:05 AM ---------- Previous update was at 07:41 AM ----------

Hi Robin,

Could you please provide some information on the above.

Thanks,
Dines

rbatte1 · August 20, 2014, 9:09am

Right, so the byte pair c3 ba (hex) is what we're after. Now that we have much of the information we need, it's time for a think.......

Robin