Removing Extended ASCII using awk

tostay2003 · December 31, 2014, 11:57pm

Hi All,

I am trying to remove (SELECTIVE - passed as argument) Extended ASCII using Awk based on adhoc basis. Can you please let me know how to do it. I have to implement this using awk only.

Thanks & Regards

Don_Cragun · January 1, 2015, 12:37am

Is this another homework assignment?

What have you tried so far?

What do you mean by Extended ASCII? Are you trying to remove a single character? Are you trying to remove individually specified characters with each character specified as a separate argument? Are you trying to remove a string of characters? Are you trying to remove individual characters included in a single argument string?

What do you mean by Awk(sic) based on ad hoc basis?

tostay2003 · January 1, 2015, 3:22am

Hi Don

This is a part of script enhancement. The script would take ascii values as input arguments, generally Extended ASCII (i.e. ASCII values >=128 ) and remove them from input file.

Since the place within script that I need to modify is in awk script, I have to implement this within awk itself instead of any other commands such as tr or sed.

Don_Cragun · January 1, 2015, 4:38am

I asked 8 questions. You partially answered one of them (generically, but not specifically for this assignment).

Unless you convince us that this is not a homework assignment, show us that you have made an attempt at solving this, show us the part of your existing awk script that you're trying to modify, show us that you have some idea of what your input arguments need to look like, and provide us with some sample input and output for your script; this thread will be closed.

We are here to help you learn how to write code using the tools available on UNIX and Linux systems to perform various tasks. We are not here to act as your unpaid programming staff trying to guess at what you're trying to do, coaxing descriptions of the tasks that need to be performed out of you, and then designing and writing your code for you. And we most certainly are not here to do your homework assignments for you!

tostay2003 · January 1, 2015, 11:14pm

This is not a homework assignment. It is part of script which I am currently modifying. I am not well aware of awk. I can do the same using tr or sed. I want to know if there is any function in awk that can perform similar function. I was using sub/gsub function, but the manual contains how to replace a pattern. Here I am not looking for a specific pattern, but a match of ANY of the characters.

The script is on client secured network, which cannot be copied.

The input arguments would be range of ascii values and/or comma separated ascii values.

eg: 128-140, 145, 147

If any of the input ascii values appear in any of the lines of input file, then it has to be replaced with empty string.

suppose I have input as

testing_Š_testing

I need the output as

testing__testing

Don_Cragun · January 2, 2015, 1:12am

It appears that your strings are UTF-8, not extended ASCII. Furthermore, printing your strings through od shows that the byte values that you said you wanted to remove are not present in your input string or output string samples:

printf '%s' 'testing_Š_testing' | od -t cu1
printf '%s' 'testing__testing' | od -t cu1

shows us that the unsigned decimal byte values of the two bytes you want to remove are 197 and 160:

0000000    t   e   s   t   i   n   g   _   Š  **   _   t   e   s   t   i
          116 101 115 116 105 110 103  95 197 160  95 116 101 115 116 105
0000020    n   g                                                        
          110 103                                                        
0000022
printf '%s' 'testing__testing' | od -t cu1
0000000    t   e   s   t   i   n   g   _   _   t   e   s   t   i   n   g
          116 101 115 116 105 110 103  95  95 116 101 115 116 105 110 103
0000020

If you are working with UTF-8 input and want "extended ASCII" output (where you may be removing 1 or more bytes out of a multi-byte UTF-8 character, but might not be removing complete characters), you may end up with an unintelligible mess. If you want to remove a specific set of UTF-8 characters, that is easy to do. If you want to remove all non-(7-bit)ASCII characters, that is easy to do on some systems (depending on how well your version of awk handles locales and multi-byte characters).

What OS (including version) and shell are you using?

What Locale are you using when you run this script?

Is it OK to just remove all bytes from your input stream that have the high order bit set? If not, is there a specific list of UTF-8 characters you want to remove? If not, and you really want to remove individual bytes from strings containing multi-byte characters, this may be hard to do in some versions of awk.

You said you know how to do what you want using sed. Show us the sed substitute command that does what you want and we can show you how to easily change that into an awk sub() or gsub() function call.

tostay2003 · January 2, 2015, 3:28am

Hi Don,

I want to remove any character specified as argument (decimal ascii value).

eg. For values 128-140, 145, 147

I am trying to implement below code

tr -d '\145\147\128-\140' < InputFileName > OutputFileName

OR

cat InputFileName  | sed -e 's/\d145//g' -e 's/\d147//g'  -e s'/\d128-\d140//g' > OutputFileName

I am making these changes using korn shell (Version AJM 93t+ 2010-06) on Linux OS (2.6.18)

Don_Cragun · January 2, 2015, 5:13am

When I run the command:

printf '%s\n' 'testing_Š_testing' 'testing__testing'|tr -d '\145\147\128-\140'

I get the output:

tstinŠtstintstintstin$

(Note that the $ at the end of the output is my shell's prompt.) The arguments you are giving to tr are treated as octal values (not decimal): \145 is the character e; \147 is the character g; \128 is treated as \12 (the newline character) followed by the character 8; and 8-\140 in ASCII removes the characters 8 and 9, the punctuation characters : ; < = > ? and @, all upper-case alphabetic characters, and the [, \, ], ^, _, and ` characters.

And the command:

printf '%s\n' 'testing_Š_testing' 'testing__testing'|sed -e 's/\d145//g' -e 's/\d147//g'  -e s'/\d128-\d140//g'

Produces the output:

testing_Š_testing
testing__testing
$

because, as I said before, the two-byte character Š in UTF-8 is made up of bytes with the decimal values 197 and 160 (neither of which is in your list of byte values to be deleted by the sed command). (Note also that while \dx (where x is a one-, two-, or three-digit decimal number) works on some systems, it is an extension to the standards and, on many systems, will give you a syntax error or delete the characters d, 0, 1, 4, 5, 7, and 8.)

Please show us the output you get when you run the commands above!

I repeat:
What OS (including version) and shell are you using?

What Locale are you using when you run your script?
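A minimal sketch of the octal point above, assuming a POSIX shell and a tr that accepts octal escapes and ranges in the C locale. The helper name dec_to_tr_set is hypothetical (it is not part of the original script); it converts a decimal list such as 128-140, 145, 147 into the octal set that tr expects:

```shell
# dec_to_tr_set SPEC
# Hypothetical helper: convert comma-separated decimal byte values and
# ranges into an octal set for tr, e.g. "128-140,145,147" -> \200-\214\221\223
# The body runs in a subshell so the IFS change does not leak out.
dec_to_tr_set() (
    IFS=', '
    tr_set=
    for item in $1; do
        case $item in
            *-*) # a range: convert both ends
                 tr_set=$tr_set$(printf '\\%03o-\\%03o' "${item%-*}" "${item#*-}") ;;
            *)   # a single value
                 tr_set=$tr_set$(printf '\\%03o' "$item") ;;
        esac
    done
    printf '%s\n' "$tr_set"
)

dec_to_tr_set '128-140, 145, 147'    # prints \200-\214\221\223
```

It could then be used as LC_ALL=C tr -d "$(dec_to_tr_set '128-140, 145, 147')" < InputFileName > OutputFileName. Note that this deletes individual bytes, so it only makes sense for single-byte encoded input.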

tostay2003 · January 2, 2015, 7:14pm

Hi Don,

I am making these changes using korn shell (Version AJM 93t+ 2010-06) on Linux OS (2.6.18)

I will check and provide you the results on monday when I have the system in front of me

Thanks & Regards

jim_mcnamara · January 2, 2015, 7:47pm

The example you gave does not match what you say you want to remove. The character is made of TWO bytes, not one. Please post the output of

locale
echo $LANG

tostay2003 · January 5, 2015, 5:16am

echo $LANG produces the below result

echo $LANG

en_US.UTF-8

Shell and OS

korn shell (Version AJM 93t+ 2010-06) on Linux OS (2.6.18)

Apologies, I am unable to copy and paste results to and from the client network.

The "Š" was a character that I picked to show how I wanted to remove such characters. It's good that I learned something new: that there are multi-byte characters as well.

I just noticed that I was not removing the characters properly as Don mentioned even with tr command.

PS: I am manually typing the results

printf "testing_\x80\x81\x82\x88_testing" > test.txt
cat -v test.txt | tr -d '[\d128-\d130]' | tr -d '[\d136]'

is resulting in

testing----testing

We did not notice it until now, that the results were incorrect.

Can you please help to remove such UTF-8 characters using awk and tr as well

RudiC · January 5, 2015, 5:57am

In your last example, check the output of cat -v first; then you know where the dashes come from. And, not all systems/commands accept the \dnnn sequences. Try

cat test.txt | tr -d '\200-\202\210'
testing__testing

In UTF-8 (and other UTFs), characters above the ASCII range are never encoded as a single byte. They come as sequences of two or more bytes. So you could

1. delete ALL chars above ASCII
2. explicitly list the chars to be removed
3. use iconv or recode to convert to e.g. "extended ASCII" (of which several char sets exist) and then remove those unwanted chars.

Example for option 2:

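FN=$1                     # first argument: the file to process
shift
TBD="$*"                  # remaining arguments: the characters to delete
TBD=${TBD// /\|}          # join them with | (ERE alternation)
sed -r "s/$TBD//g" "$FN"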

running this on your first testfile:

./remscript testfile � � �
testing__testing
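Since the original question asks for awk, option 2 could also be sketched there. This assumes the characters to delete are passed as one space-separated argument; the function name strip_chars is hypothetical. It removes each character as a literal string with index() and substr(), so no regular expression is built and the same code works whether awk counts bytes or characters:

```shell
# strip_chars 'CHAR CHAR ...' [FILE...]
# Hypothetical helper: delete every occurrence of each listed character
# (single- or multi-byte) from the input, treating each one as a literal string.
strip_chars() {
    chars=$1
    shift
    awk -v chars="$chars" '
    BEGIN { n = split(chars, c, " ") }      # one array entry per character
    {
        for (i = 1; i <= n; i++)
            # cut out the match and rescan until none is left on this line
            while ((p = index($0, c[i])) > 0)
                $0 = substr($0, 1, p - 1) substr($0, p + length(c[i]))
        print
    }' "$@"
}
```

For example, strip_chars 'Š' testfile should print testing__testing for the first test file. One caveat: awk processes backslash escapes in -v assignments, so a backslash in the list would have to be doubled.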

Don_Cragun · January 5, 2015, 2:10pm

If you just want to get rid of non-ASCII characters (rather than a particular list of single- and/or multi-byte UTF-8 characters), the following awk and tr commands should work:

LANG=C awk '{gsub(/[\200-\377]/, "")}1' input_file > output_file   
LANG=C tr -d '\200-\377' < input_file > output_file

as evidenced by these examples:

$ printf '%s\n' 'testing_Š_testing' 'testing__testing'| LANG=C awk '{gsub(/[\200-\377]/, "")}1'|od -c
0000000    t   e   s   t   i   n   g   _   _   t   e   s   t   i   n   g
0000020   \n   t   e   s   t   i   n   g   _   _   t   e   s   t   i   n
0000040    g  \n                                                        
0000042
$ printf '%s\n' 'testing_Š_testing' 'testing__testing'| LANG=C tr -d '\200-\377'|od -c
0000000    t   e   s   t   i   n   g   _   _   t   e   s   t   i   n   g
0000020   \n   t   e   s   t   i   n   g   _   _   t   e   s   t   i   n
0000040    g  \n                                                        
0000042
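For a selective list of decimal byte values rather than everything above 127, the same idea could be sketched in awk as follows. This is only a sketch under stated assumptions: the function name strip_bytes and the SPEC format (comma-separated decimal values and ranges, as in 128-140, 145, 147) are illustrative, and LC_ALL=C is used so that awk treats every byte as one character. It works on individual bytes, so it can split a multi-byte UTF-8 character if only part of its byte sequence is listed.

```shell
# strip_bytes SPEC [FILE...]
# Hypothetical helper: delete the bytes named in SPEC (decimal values and
# ranges, comma separated) from the input, one byte at a time.
strip_bytes() {
    spec=$1
    shift
    LC_ALL=C awk -v spec="$spec" '
    BEGIN {
        gsub(/[ \t]/, "", spec)                 # tolerate "128-140, 145, 147"
        n = split(spec, part, ",")
        for (i = 1; i <= n; i++) {
            lo = hi = part[i]                   # single value by default
            if (split(part[i], r, "-") == 2) { lo = r[1]; hi = r[2] }
            for (b = lo + 0; b <= hi + 0; b++)
                drop[sprintf("%c", b)] = 1      # one byte per value in the C locale
        }
    }
    {
        out = ""
        len = length($0)
        for (i = 1; i <= len; i++) {
            ch = substr($0, i, 1)
            if (!(ch in drop)) out = out ch     # keep bytes that are not listed
        }
        print out
    }' "$@"
}
```

A call such as strip_bytes '128-140, 145, 147' input_file > output_file would then mirror the commands above. Comparing one byte at a time avoids building a bracket expression, so values such as 92 (backslash) or 93 (]) need no escaping.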

tostay2003 · April 20, 2015, 10:59am

Hi,

Sorry for digging up the old thread. Please let me know if I have to open another thread.

Can you please explain why you used the number 200 instead of decimal 128?

I want to remove selected characters, which includes multi bytes.

I am making these changes using korn shell (Version AJM 93t+ 2010-06) on Linux OS (2.6.18)
LANG=en_US.UTF-8

Scrutinizer · April 20, 2015, 11:56am

Hi, 200 octal is 128 decimal:

$ printf "%d\n" "0200" "0377"
128
255
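The conversion works in both directions with the shell's printf. A short sketch, assuming a POSIX printf: %o prints a decimal value in octal, and a leading 0 on an argument makes %d read it as octal. The values 0305 and 0240 are the octal forms of the bytes 197 and 160 seen earlier in the thread:

```shell
# decimal -> octal, for writing escapes such as \200 in tr or awk
printf '%o\n' 128 255      # prints 200 and 377
# octal -> decimal (a leading 0 marks the argument as octal)
printf '%d\n' 0305 0240    # prints 197 and 160
```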