Print byte position of extended ascii character

rosebud123 · July 14, 2018, 1:18pm

Hello,

I am on AIX.

When I encounter extended ASCII characters and special characters in a file, I need to print the byte position, the actual character and the line number.

Is there a simple command that can give me the above result?

Thanks in advance

RudiC · July 14, 2018, 1:27pm

Welcome to the forum.

Please get into the habit of providing decent context info for your problem.

It is always helpful to phrase a request carefully and in detail, and to support it with system info like OS and shell, related environment (variables, options), preferred tools, adequate (representative) sample input and desired output data, and the logic connecting the two, including your own attempts at a solution and, if any exist, system (error) messages verbatim, to avoid ambiguities and keep people from guessing.

Please specify what you mean by "extended ascii characters" and "special characters". You're not talking of code sets like UTF-8 but more of code pages, I presume?

rosebud123 · July 14, 2018, 1:40pm

Thanks RudiC.

I am talking about the extended ASCII set; please see the extended ASCII table at the link below.

Sorry if I am using the CODE tag in the wrong place. I am not allowed to post the link, so I used the code tag.

ascii-table.com/ascii-extended-pc-list.php

RudiC · July 14, 2018, 2:43pm

OK.
This is about a third of the required answers.

wisecracker · July 14, 2018, 3:22pm

I am making an assumption that you are looking at ISO-8859-1, that is, 8-bit extensions where the extended characters are NOT like Code Page 437.

You could try something like 'hexdump' to read the byte at a given 'subscript' position and pick out the 8-bit values of 128 and above:
VALUE=$( hexdump -n1 -s$SUBSCRIPT -v -e '1/1 "%u"' /full/path/to/file )
Here VALUE is a decimal number from 0 to 255 and SUBSCRIPT is the byte offset to read (hexdump's -s counts from the start of the file, starting at 0). With the correct filtering this could even give the positions of all characters greater than 127.
This might help with your requirements.
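
A minimal sketch of that filtering, assuming a POSIX `od` and `awk` are available. `od` is used here in place of `hexdump` so that the whole file is dumped in one pass instead of one byte per call. The function name `high_bytes` is made up for illustration; it prints the 0-based file offset and the decimal value of every byte above 127.

```shell
#!/bin/sh
# high_bytes FILE  (hypothetical helper name)
# Print "offset value" for every byte greater than 127.
# od -An drops the address column, -v keeps repeated lines from being
# collapsed, -tu1 prints each byte as an unsigned decimal number.
high_bytes() {
    od -An -v -tu1 "$1" |
    awk 'BEGIN { off = 0 }
         { for (i = 1; i <= NF; i++) { if ($i + 0 > 127) print off, $i; off++ } }'
}
```

Called as `high_bytes /full/path/to/file`, it reports file offsets, not positions within a line.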

RudiC · July 14, 2018, 5:52pm

Howsoever, try

while IFS= read -r T
  do    ((CNT++))
        for ((i=0; i<${#T}; i++))
          do    LC_ALL=C TMP=$(printf "%d\n" "'${T:i:1}")
                [ "$TMP" -gt 127 ] && printf "%d %c %d\n" "$i" "${T:i:1}" "$CNT"
          done
  done <file

rosebud123 · July 14, 2018, 7:02pm

Here is the source file and the desired output file

Source File

MDQ �D201803132018031400
MDQ "�201707112018071100
MDQ =�201605202018052000
MDQ "�201605202018052000
QDX ��201705012018050200
MDQ �a�201708102018081000
QDU �c-201708092018080900

Desired Output

BytePosition    Special_Character    Line_Number    Special_Character_Count    Total_Count_of_Special_Characters_In_file
5                                    1              1                          1
6               �                    1              1                          1
5                                    2              1                          1
6               �                    2              1                          1
5                                    3              1                          2
6               �                    6              1                          1

The Total_Count_of_Special_Characters_In_file column value is the total number of occurrences of the character in the input file.

In the above example, the character at byte position 5 appeared twice, once in record number 2 and once in record number 3, so for this special character Total_Count_of_Special_Characters_In_file is '2'.

Thanks in advance!
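
A rough sketch of a report along those lines, built on `od` and `awk` only. Several things here are assumptions rather than stated requirements: a "special" byte is taken to be anything outside printable ASCII (32 to 126) other than the newline, positions are 1-based within the line, the byte is printed as a hex value because control bytes do not display, and the Special_Character_Count column is left out because its meaning is not spelled out above. The name `special_report` is hypothetical. It counts bytes, not characters, so a multibyte character shows up as several rows.

```shell
#!/bin/sh
# special_report FILE  (hypothetical helper name)
# Output columns, tab separated:
#   byte position in line (1-based), byte value in hex, line number,
#   total occurrences of that byte value in the whole file.
special_report() {
    od -An -v -tu1 "$1" |
    awk '
    BEGIN { line = 1; pos = 0; n = 0 }
    {
        for (i = 1; i <= NF; i++) {
            b = $i + 0
            if (b == 10) { line++; pos = 0; continue }   # newline ends the record
            pos++
            if (b < 32 || b > 126) {                     # assumed meaning of "special"
                n++
                rec_pos[n] = pos; rec_byte[n] = b; rec_line[n] = line
                total[b]++
            }
        }
    }
    END {
        # Totals are only known after the whole file is read, so print last.
        for (k = 1; k <= n; k++)
            printf "%d\t0x%02x\t%d\t%d\n", rec_pos[k], rec_byte[k], rec_line[k], total[rec_byte[k]]
    }'
}
```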

jim_mcnamara · July 14, 2018, 8:34pm

What locale are you using? UTF-8, Unicode?

Without more information we can't help very much.

rosebud123 · July 14, 2018, 8:41pm

Yes UTF-8

wisecracker · July 15, 2018, 1:19am

Well, your BYTE position is much more complex than that!
You meant CHARACTER, BUT take a look at your file snippet:

#!/bin/bash
text='MDQ �D201803132018031400
MDQ "�201707112018071100
MDQ =�201605202018052000
MDQ "�201605202018052000
QDX ��201705012018050200
MDQ �201708102018081000
QDU �c-201708092018080900'
hexdump -C <<< "$text"

With the results (OSX 10.13.5, default bash terminal):

Last login: Sun Jul 15 05:28:32 on ttys000
AMIGA:amiga~> cd Desktop/Code/Shell
AMIGA:amiga~/Desktop/Code/Shell> ./byte_or_char.sh
00000000  4d 44 51 20 03 c5 b8 44  32 30 31 38 30 33 31 33  |MDQ ...D20180313|
00000010  32 30 31 38 30 33 31 34  30 30 0a 4d 44 51 20 02  |2018031400.MDQ .|
00000020  22 c3 a3 32 30 31 37 30  37 31 31 32 30 31 38 30  |"..2017071120180|
00000030  37 31 31 30 30 0a 4d 44  51 20 02 3d c3 bf 32 30  |71100.MDQ .=..20|
00000040  31 36 30 35 32 30 32 30  31 38 30 35 32 30 30 30  |1605202018052000|
00000050  0a 4d 44 51 20 02 22 c3  84 32 30 31 36 30 35 32  |.MDQ ."..2016052|
00000060  30 32 30 31 38 30 35 32  30 30 30 0a 51 44 58 20  |02018052000.QDX |
00000070  03 c3 bb c3 81 32 30 31  37 30 35 30 31 32 30 31  |.....20170501201|
00000080  38 30 35 30 32 30 30 0a  4d 44 51 20 c3 ac 07 c2  |8050200.MDQ ....|
00000090  a9 32 30 31 37 30 38 31  30 32 30 31 38 30 38 31  |.201708102018081|
000000a0  30 30 30 0a 51 44 55 20  c3 ac 63 2d 32 30 31 37  |000.QDU ..c-2017|
000000b0  30 38 30 39 32 30 31 38  30 38 30 39 30 30 0a     |08092018080900.|
000000bf
AMIGA:amiga~/Desktop/Code/Shell> _

As you can see, there are multi-byte sequences and low byte values too, for example '[0x]03' and '[0x]02'. '[0x]0a' is the newline, so that can be ignored here.
This is not straightforward, as we have no idea what these low-value bytes do. Are they hidden characters?
Sometimes the extended part takes 2 bytes and sometimes more ( 03 c3 bb c3 81 ), with those added strange low byte values that were unknown to us all until I looked.
As I pointed out before, 'hexdump' (or 'od' or 'xxd') is your initial friend here.
This combination is particularly hard to interpret: c3 ac 07 c2 a9. What does the '[0x]07' do here?
Much more information is needed before we can proceed, assuming there is a solution.
CHARACTERS and HIDDEN characters are not the same as BYTES, as you have now discovered.
And finally, the bizarre thing is that your last line does NOT have a low byte value, so what is the requirement for those bytes, given that they ARE technically ASCII characters, albeit control ones?
EDIT:
I have just noticed this 02 3d. Are these 2 real ASCII characters, or one _imaginary_ and one real?

I have a sneaking suspicion that the BYTES following the spaces should be 4-BYTE pointers of some description AND have become corrupted by those _EXTENDED_ characters! Hence the varying number of bytes before the numerical ?DATE? value.
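
If the data is assumed to be UTF-8 (which has not been confirmed for this file), a dump like the one above can be read mechanically: bytes 0xC2 to 0xDF start a 2-byte character, 0xE0 to 0xEF a 3-byte one, 0xF0 to 0xF4 a 4-byte one, and 0x80 to 0xBF are continuation bytes. The sketch below applies just those rules and nothing stricter (no overlong or surrogate checks); `utf8_scan` is a made-up name.

```shell
#!/bin/sh
# utf8_scan FILE  (hypothetical helper name)
# Output columns, tab separated: 0-based offset, length in bytes, description.
# Reports multibyte characters, control bytes other than newline, and
# bytes that do not fit the UTF-8 pattern. Plain printable ASCII is skipped.
utf8_scan() {
    od -An -v -tu1 "$1" |
    awk '
    BEGIN { off = 0; need = 0 }
    {
        for (i = 1; i <= NF; i++) {
            b = $i + 0
            if (need > 0 && b >= 128 && b <= 191) {
                # continuation byte: append its low 6 bits to the code point
                cp = cp * 64 + (b - 128)
                need--
                if (need == 0) { printf "%d\t%d\tU+%04X\n", start, len, cp }
            } else {
                if (need > 0) { printf "%d\t-\tincomplete sequence\n", start; need = 0 }
                if (b >= 194 && b <= 223)      { need = 1; cp = b - 192 }
                else if (b >= 224 && b <= 239) { need = 2; cp = b - 224 }
                else if (b >= 240 && b <= 244) { need = 3; cp = b - 240 }
                else if (b >= 128)             { printf "%d\t-\tstray byte 0x%02x\n", off, b }
                else if (b < 32 && b != 10)    { printf "%d\t1\tcontrol 0x%02x\n", off, b }
                if (need > 0) { start = off; len = need + 1 }
            }
            off++
        }
    }
    END { if (need > 0) { printf "%d\t-\tincomplete sequence\n", start } }'
}
```

Fed the five bytes c3 ac 07 c2 a9 on their own, this reports U+00EC, then control 0x07, then U+00A9, which is consistent with the 07 being a separate control byte sitting between two 2-byte characters.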

rosebud123 · July 15, 2018, 11:14am

My input file contains a combination of ASCII, extended ASCII, unprintable and double-byte characters.

The idea is to find these extended ASCII, unprintable and double-byte characters and provide an output file in the required format.

Is there a way we can do the converse operation, where all good characters are replaced with some constant value and the problem characters are left as is? From there we can do another operation to get the desired output.

Please advise
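
One way the converse operation could look, as a sketch only: `tr` in the C locale replaces every printable ASCII byte (space through tilde) with a dot and passes newlines, control bytes and bytes above 127 through unchanged. The choice of the dot as the constant and the name `mask_good` are assumptions. The `[.*]` form is the POSIX way to repeat the replacement character across the whole range.

```shell
#!/bin/sh
# mask_good FILE  (hypothetical helper name)
# Replace each printable ASCII byte (octal 040 to 176) with '.',
# leaving newlines and all other bytes as they are.
mask_good() {
    LC_ALL=C tr '\040-\176' '[.*]' < "$1"
}
```

Line lengths and byte positions are preserved, so the masked output can be passed to a second step that only has to look for bytes other than the dot and the newline.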

Don_Cragun · July 15, 2018, 1:21pm

By definition, a text file cannot contain NUL bytes.

If the file you're reading contains pointers or other binary values, you need to really understand the format of the data you are processing and use tools appropriate to your task. Without understanding the format of the data you're reading, all bets are off. Note that the format includes not only knowing where there are binary values in your data (if there are any), but also knowing what codeset is being used to encode characters in your file. (For example, there is obviously a big difference between extended ASCII characters encoded in ISO 8859-1 and extended ASCII characters encoded in UTF-8.)

RudiC · July 15, 2018, 2:02pm

Did you consider / adapt post#6?

rosebud123 · July 15, 2018, 8:46pm

RudiC.

I tried but for some strange reason I am receiving a syntax error

0403-057 Syntax error at line 4 : `(' is not expected.

I am on AIX.

Thank you

RudiC · July 16, 2018, 11:00am

What's your shell?
And please show the entire error message including context. What is "line 4"?

jim_mcnamara · July 16, 2018, 11:50am

Let's try another way to get you to understand the problem.

What Don said was correct and very polite. Here is what you have to do.

A single-byte codeset (ISO 8859-1, for example) means all characters have one byte, 256 possibilities ranging from 0 to 255.
So if we read byte by byte, each read produces a character we can check. This is how computers work. Which can be annoying.

Now if there are wide characters, say 2 bytes wide (UTF-8 encodes every character above 127 as 2 to 4 bytes), and we do not know where they live on a line of single-byte characters, we cannot tell them apart from their single-byte neighbors. It takes 2 bytes to create one character. Bottom line: if we think the byte we read is a single-byte character, but it is really part of a multibyte one, we cannot tell the difference.

In order to do what you want:

1. We have to know ahead of time where multibyte characters live, if they exist.

2. If what you are seeing as a problem is really just single-byte "high ASCII" characters, then any single byte value that is > 127 is a problem and should be reported.

3. If there are embedded NUL (ASCII 0) characters, then we have to read the file in a completely different manner.

Got it? We need information to help. So please help us to help you.
Want a correct answer? Then provide us with choice 1, or choice 2, or choice 3.
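
A quick way to narrow down which of the three cases applies is to count the NUL bytes and the bytes above 127. This is only a sketch: the function name `classify_bytes` and the wording of its messages are invented here, and byte counts alone cannot separate choice 1 from choice 2, because that depends on the codeset.

```shell
#!/bin/sh
# classify_bytes FILE  (hypothetical helper name)
# tr -cd keeps only the listed bytes; wc -c then counts them.
# The $(( )) wrapper strips the padding some wc versions print.
classify_bytes() {
    nul=$(( $(LC_ALL=C tr -cd '\000' < "$1" | wc -c) ))
    high=$(( $(LC_ALL=C tr -cd '\200-\377' < "$1" | wc -c) ))
    if [ "$nul" -gt 0 ]; then
        echo "choice 3: $nul NUL byte(s)"
    elif [ "$high" -gt 0 ]; then
        echo "choice 1 or 2: $high byte(s) above 127"
    else
        echo "plain ASCII"
    fi
}
```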

wisecracker · July 16, 2018, 12:53pm

To demonstrate Jim's 3rd point and Don's similar point:
Longhand, OSX 10.13.5, default bash terminal running ksh:

Last login: Mon Jul 16 17:38:24 on ttys000
AMIGA:amiga~> ksh
AMIGA:uw> 
AMIGA:uw> text=$'abcd\007efgh'
AMIGA:uw> printf "%b\n" "${text}"
abcdefgh
AMIGA:uw> # A sound should be generated!
AMIGA:uw> 
AMIGA:uw> echo "${#text}"
9
AMIGA:uw> text=$'abcd\000efgh'
AMIGA:uw> printf "%b\n" "${text}"
abcd
AMIGA:uw> 
AMIGA:uw> # HUH? where are the other characters?
AMIGA:uw> 
AMIGA:uw> echo "${#text}"
4
AMIGA:uw> hexdump -C <<< "${text}"
00000000  61 62 63 64 0a                                    |abcd.|
00000005
AMIGA:uw> # Forever lost due to NULL!
AMIGA:uw> exit
AMIGA:amiga~> _

You now see why we are mentioning these subtle details...
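
For contrast, a small sketch of the same data kept out of shell variables. When the bytes go straight through a pipe into `od`, the NUL survives and shows up as a 0 in the dump. The helper name `dump_bytes` is made up for this example.

```shell
#!/bin/sh
# dump_bytes  (hypothetical helper name)
# Read standard input and print every byte as a decimal number.
# The data stays in the pipe and never passes through a shell variable,
# so a NUL byte cannot cut it short.
dump_bytes() {
    od -An -v -tu1 | awk '{ $1 = $1; print }'   # $1 = $1 squeezes the spacing
}

printf 'abcd\000efgh\n' | dump_bytes
```

All ten bytes come out, with the NUL printed as 0 between 100 (d) and 101 (e).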

rosebud123 · July 16, 2018, 1:44pm

All,

Truly appreciate your inputs in solving the issue.

At this point I cannot confirm the source encoding, as the file passes through various locations before it reaches me.

I am assuming that it is UTF-8, but the sample data suggests that it is NOT.

Can we go with choice number 2 from post #16?

Thank you

jim_mcnamara · July 16, 2018, 11:10pm

RudiC's post #6 in this thread does exactly what choice #2 does for you. It finds ASCII values > 127.

Use the bash shell; his example will not work in all shells. Put a shebang as the very first line of your shell script so that it invokes bash. I don't even know if you have bash available as a shell or not...

#!/bin/bash

This will cause an immediate error message if you do not have bash. Some other shells may work okay, but since that is still secret we can't help.

rosebud123 · July 17, 2018, 4:53pm

KSH : Version M-11/16/88

---------- Post updated at 03:53 PM ---------- Previous update was at 03:35 PM ----------

Am I missing anything here? I am still receiving an error.

Version M-11/16/88f

 admin@t1g(/opt/test)$ sh -x test127.sh
  test127.sh[2]: 0403-057 Syntax error at line 4 : `(' is not expected.

Here is my script.

#!/bin/bash
while read T
  do    ((CNT++))
        for ((i=0; i<${#T}; i++))
          do    LC_ALL=C TMP=$(printf "%d\n" "'"${T:i:1})
                [ $TMP -gt 127 ] && printf "%d %c %d\n" $i ${T:i:1} $CNT
          done
  done <Test.TXT

Please advise
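
One thing worth checking, offered as a reading of the transcript above and not a confirmed diagnosis: `sh -x test127.sh` names the interpreter on the command line, so the `#!/bin/bash` line is treated as a comment and the script is parsed by `sh` (apparently ksh88 here, going by the version string), which does not accept the `for ((...))` loop on line 4. Running `bash -x test127.sh`, or making the file executable and running `./test127.sh`, would use bash if it is installed. A small guard like the following, with the hypothetical name `require_bash`, makes that kind of mix-up fail with a clear message.

```shell
#!/bin/bash
# require_bash  (hypothetical helper name)
# Succeed only when the script is being interpreted by bash.
# BASH_VERSION is set by bash itself and is normally empty or unset
# in sh and ksh.
require_bash() {
    if [ -z "${BASH_VERSION:-}" ]; then
        echo "this script uses bash syntax; run it with: bash $0" >&2
        return 1
    fi
    return 0
}

# Typical use near the top of a script such as test127.sh:
# require_bash || exit 1
```

The guard has to come before any bash-only syntax, and it only helps for constructs the other shell trips over at run time; a shell that rejects the file while parsing it will still stop at the syntax error.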