Hi All,
I am trying to remove (SELECTIVE - passed as argument) Extended ASCII using Awk based on adhoc basis. Can you please let me know how to do it. I have to implement this using awk only.
Thanks & Regards
Is this another homework assignment?
What have you tried so far?
What do you mean by Extended ASCII? Are you trying to remove a single character? Are you trying to remove individually specified characters with each character specified as a separate argument? Are you trying to remove a string of characters? Are you trying to remove individual characters included in a single argument string?
What do you mean by Awk(sic) based on ad hoc basis?
Hi Don
This is a part of script enhancement. The script would take ascii values as input arguments, generally Extended ASCII (i.e. ASCII values >=128 ) and remove them from input file.
Since the place within script that I need to modify is in awk script, I have to implement this within awk itself instead of any other commands such as tr or sed.
I asked 8 questions. You partially answered one of them (generically, but not specifically for this assignment).
Unless you convince us that this is not a homework assignment, show us that you have made an attempt at solving this, show us the part of your existing awk script that you're trying to modify, show us that you have some idea of what your input arguments need to look like, and provide us with some sample input and output for your script; this thread will be closed.
We are here to help you learn how to write code using the tools available on UNIX and Linux systems to perform various tasks. We are not here to act as your unpaid programming staff trying to guess at what you're trying to do, coaxing descriptions of the tasks that need to be performed out of you, and then designing and writing your code for you. And we most certainly are not here to do your homework assignments for you!
This is not a homework assignment. It is part of script which I am currently modifying. I am not well aware of awk. I can do the same using tr or sed. I want to know if there is any function in awk that can perform similar function. I was using sub/gsub function, but the manual contains how to replace a pattern. Here I am not looking for a specific pattern, but a match of ANY of the characters.
The script is on client secured network, which cannot be copied.
The input arguments would be range of ascii values and/or comma separated ascii values.
eg: 128-140, 145, 147
If any of the input ascii values appear in any of the lines of input file, then it has to be replaced with empty string.
suppose I have input as
testing_Š_testing
I need the output as
testing__testing
It appears that your strings are UTF-8; not extended ASCII. Furthermore, printing your strings through od shows that the byte values that you said you wanted to remove are not present in your input string or output string samples:
printf '%s' 'testing_Š_testing' | od -t cu1
printf '%s' 'testing__testing' | od -t cu1
shows us that the unsigned decimal byte values of the two bytes you want to remove are 197 and 160:
printf '%s' 'testing_Š_testing' | od -t cu1
0000000    t   e   s   t   i   n   g   _   Š  **   _   t   e   s   t   i
         116 101 115 116 105 110 103  95 197 160  95 116 101 115 116 105
0000020    n   g
         110 103
0000022
printf '%s' 'testing__testing' | od -t cu1
0000000    t   e   s   t   i   n   g   _   _   t   e   s   t   i   n   g
         116 101 115 116 105 110 103  95  95 116 101 115 116 105 110 103
0000020
If you are working with UTF-8 input and want "extended ASCII" output (where you may be removing 1 or more bytes out of a multi-byte UTF-8 character, but might not be removing complete characters), you may end up with an unintelligible mess. If you want to remove a specific set of UTF-8 characters, that is easy to do. If you want to remove all non-(7-bit)ASCII characters, that is easy to do on some systems (depending on how well your version of awk handles locales and multi-byte characters).
What OS (including version) and shell are you using?
What Locale are you using when you run this script?
Is it OK to just remove all bytes from your input stream that have the high order bit set? If not, is there a specific list of UTF-8 characters you want to remove? If not, and you really want to remove individual bytes from strings containing multi-byte characters, this may be hard to do in some versions of awk.
You said you know how to do what you want using sed. Show us the sed substitute command that does what you want and we can show you how to easily change that into an awk sub() or gsub() function call.
Hi Don,
I want to remove any character specified as argument (decimal ascii value).
eg. For values 128-140, 145, 147
I am trying to implement below code
tr -d '\145\147\128-\140' < InputFileName > OutputFileName
OR
cat InputFileName | sed -e 's/\d145//g' -e 's/\d147//g' -e s'/\d128-\d140//g' > OutputFileName
I am making these changes using korn shell (Version AJM 93t+ 2010-06) on Linux OS (2.6.18)
When I run the command:
printf '%s\n' 'testing_Š_testing' 'testing__testing'|tr -d '\145\147\128-\140'
I get the output:
tstinŠtstintstintstin$
(Note that the $ at the end of the output is my shell's prompt.) The arguments you are giving to tr are treated as octal values (not decimal): \145 is the character e; \147 is the character g; \128 is treated as \12 (the newline character) followed by the character 8; and 8-\140 in ASCII removes the characters 8, 9, :, ;, <, =, >, ?, @, all upper-case alphabetic characters, and the [, \, ], ^, _, and ` characters.
And the command:
printf '%s\n' 'testing_Š_testing' 'testing__testing'|sed -e 's/\d145//g' -e 's/\d147//g' -e s'/\d128-\d140//g'
produces the output:
testing_Š_testing
testing__testing
$
because, as I said before, the two-byte character Š in UTF-8 is made up of bytes with the decimal values 197 and 160 (neither of which are in your list of byte values to be deleted by the sed command). (Note also that while \dx (where x is a one, two, or three digit decimal number) works on some systems, it is an extension to the standards and, on many systems, will give you a syntax error or delete the characters d, 0, 1, 4, 5, 7, and 8.)
Please show us the output you get when you run the commands above!
I repeat:
What OS (including version) and shell are you using?
What Locale are you using when you run your script?
Hi Don,
I am making these changes using korn shell (Version AJM 93t+ 2010-06) on Linux OS (2.6.18)
I will check and provide you the results on Monday when I have the system in front of me.
Thanks & Regards
The example you gave does not match what you say you want to remove. The character is made of TWO bytes, not one. Please post the output of
locale
echo $LANG
echo $LANG produces the below result
echo $LANG
en_US.UTF-8
Shell and OS
korn shell (Version AJM 93t+ 2010-06) on Linux OS (2.6.18)
Apologies, I am unable to copy and paste results to and from the client network.
The "Š" was a character that I picked to show how I wanted to remove such characters. It's good that I learned something new: that there are multi-byte characters as well.
I just noticed that I was not removing the characters properly as Don mentioned even with tr command.
PS: I am manually typing the results
printf "testing_\x80\x81\x82\x88_testing" > test.txt
cat -v test.txt | tr -d '[\d128-\d130]' | tr -d '[\d136]'
is resulting in
testing----testing
We did not notice it until now, that the results were incorrect.
Can you please help to remove such UTF-8 characters using awk and tr as well
In your last example, check the output of cat -v first; then you know where the dashes come from. And, not all systems/commands accept the \dnnn sequences. Try
cat test.txt | tr -d '\200-\202\210'
testing__testing
In UTF-8 (and other UTFs), single-byte chars above the ASCII range don't exist. They come as pairs or even longer groups of bytes. So you could
Example for option 2:
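FN=$1                  # first argument: the file to process
shift
TBD=$*                 # remaining arguments: the characters to delete
TBD=${TBD// /\|}       # join them into an alternation: a|b|c
sed -r "s/$TBD//g" "$FN"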
running this on your first testfile:
./remscript testfile � � �
testing__testing
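Since the original requirement was awk only, here is a rough awk counterpart of the sed script above. It is only a sketch: the function name remchars is made up for illustration, and it assumes the characters to delete are passed as separate arguments after the file name. It removes each argument as a literal string with index() and substr(), so arguments such as . or * need no regex escaping.

```shell
# remchars FILE CHAR...  (hypothetical helper name)
# Delete every occurrence of each listed character (single- or multi-byte).
remchars() {
    file=$1
    shift
    awk '
    BEGIN {
        # Everything after the file name is a character to delete.
        # Blank the ARGV slots so awk does not try to open them as files.
        for (i = 2; i < ARGC; i++) {
            if (ARGV[i] != "") del[++n] = ARGV[i]
            ARGV[i] = ""
        }
    }
    {
        for (k = 1; k <= n; k++) {
            # Literal (non-regex) removal of del[k] from the current line.
            while ((p = index($0, del[k])) > 0)
                $0 = substr($0, 1, p - 1) substr($0, p + length(del[k]))
        }
        print
    }' "$file" "$@"
}
```

Called as remchars testfile Š it should print the lines of testfile with every Š removed. Because index(), substr() and length() all count in the same unit within one awk run (bytes or characters, depending on locale and awk version), the multi-byte sequence is removed as a whole either way.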
If you just want to get rid of non-ASCII characters (rather than a particular list of single- and/or multi-byte UTF-8 characters), the following awk and tr commands should work:
LANG=C awk '{gsub(/[\200-\377]/, "")}1' input_file > output_file
LANG=C tr -d '\200-\377' < input_file > output_file
as evidenced by these examples:
$ printf '%s\n' 'testing_Š_testing' 'testing__testing'| LANG=C awk '{gsub(/[\200-\377]/, "")}1'|od -c
0000000   t   e   s   t   i   n   g   _   _   t   e   s   t   i   n   g
0000020  \n   t   e   s   t   i   n   g   _   _   t   e   s   t   i   n
0000040   g  \n
0000042
$ printf '%s\n' 'testing_Š_testing' 'testing__testing'| LANG=C tr -d '\200-\377'|od -c
0000000   t   e   s   t   i   n   g   _   _   t   e   s   t   i   n   g
0000020  \n   t   e   s   t   i   n   g   _   _   t   e   s   t   i   n
0000040   g  \n
0000042
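For the selective case asked about earlier in the thread (decimal values and ranges such as 128-140, 145, 147), something along these lines might work. Treat it as a sketch, not a tested drop-in: the function name strip_bytes and the single comma-separated argument are assumptions, and it forces the C locale so that awk works on bytes. It compares bytes one at a time against a lookup table instead of building a regular expression, which avoids having to escape values like 92 (\) or 93 (]).

```shell
# strip_bytes SPEC  (hypothetical helper name)
# Delete from stdin every byte whose decimal value appears in SPEC,
# where SPEC is a comma-separated list of values and LO-HI ranges.
strip_bytes() {
    LC_ALL=C awk -v spec="$1" '
    BEGIN {
        gsub(/[ \t]/, "", spec)            # tolerate "128-140, 145"
        n = split(spec, part, ",")
        for (i = 1; i <= n; i++) {
            lo = hi = part[i]
            if (split(part[i], r, "-") == 2) { lo = r[1]; hi = r[2] }
            for (c = lo + 0; c <= hi + 0; c++)
                del[sprintf("%c", c)] = 1  # the byte with value c
        }
    }
    {
        out = ""
        len = length($0)
        for (i = 1; i <= len; i++) {
            ch = substr($0, i, 1)          # one byte in the C locale
            if (!(ch in del)) out = out ch
        }
        print out
    }'
}
```

Usage would look like strip_bytes '128-140,145,147' < InputFileName > OutputFileName. Note that newline (10) cannot be removed this way because awk consumes it as the record separator, and that deleting only some bytes of a multi-byte UTF-8 character leaves an invalid sequence behind.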
Hi,
Sorry for digging up the old thread. Please let me know if I have to open another thread.
Can you please let me know how you got the number 200 instead of decimal 128?
I want to remove selected characters, including multi-byte characters.
I am making these changes using korn shell (Version AJM 93t+ 2010-06) on Linux OS (2.6.18)
LANG=en_US.UTF-8
Hi, 200 octal is 128 decimal:
$ printf "%d\n" "0200" "0377"
128
255
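Going the other way, printf with %o turns decimal values into the octal escapes that tr expects. The helper below is a sketch only; the name dec_to_tr_set and the argument format (each argument a single value or a LO-HI range) are made up for illustration.

```shell
# dec_to_tr_set VALUE|LO-HI ...  (hypothetical helper name)
# Print the tr set string, in \ooo octal escapes, for decimal byte values.
dec_to_tr_set() {
    tr_set=""
    for item in "$@"; do
        case $item in
            # A range: convert both ends and keep the dash between them.
            *-*) tr_set="$tr_set$(printf '\\%03o-\\%03o' "${item%-*}" "${item#*-}")" ;;
            # A single value.
            *)   tr_set="$tr_set$(printf '\\%03o' "$item")" ;;
        esac
    done
    printf '%s\n' "$tr_set"
}
```

With this, dec_to_tr_set 128-140 145 147 prints \200-\214\221\223, and the result can be handed to tr directly:

LC_ALL=C tr -d "$(dec_to_tr_set 128-140 145 147)" < InputFileName > OutputFileName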