Easy VI Question (I hope)

Hi,
I've FTPed some text files from Windows to my Linux workstation. I'm finding that the double quotes (") have been replaced with control characters <93> and <94>, and apostrophes (') have been replaced with what looks like control character <92>.

I have attempted the following substitution commands with no joy:

:1,$ s/<92>/'/g
and
:1,$ s/\<92>/'/g

Any Thoughts?

Thanking You,

Larry Moon

Once the files were on your Linux box, did you run the dos2unix utility on them?
How did you transfer them (FTP in ASCII mode?)

Please post a couple of lines of sample data (blotting anything confidential with X's) after processing with this "sed" command which is designed to make control characters and special characters visible:

sed -n l filename.txt

When you look at the file on a Windows computer with say notepad, do the double quotes appear to be slanted?

Same question as vbe. Did you use ASCII mode FTP?

Hi Larry,

In order to see what is there, a useful tip here would be to try the following.

cat -v -t filename | grep root

This should show any non-printable characters. Could you post the output if you get the chance?

In addition, this command would replace the characters in one go (if they were just ASCII characters).

:1,$s/<9[3,4]>/"/g

Regards

Dave

Still wish to see a sample of the "sed" output. My sed command does not fix the file, it just makes it so we can see the octal code for the rogue characters ... or perhaps see that each is actually a 4-character string?
I notice that "gull04" is thinking along the same lines.

Gull04,

Below is the output on the target text from the command you provided...... Really interesting:

accounts are typically described as M-^SrootM-^T or M-^SadministratorM-^T for various types of 

I attempted your substitution command, with the following output:

E486: Pattern not found: <9[3,4]>

Still I don't know that I've ever seen this manner of character representation. (i.e. M-^S and M-^T).

Are you able to make sense of it?
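
For what it's worth, the `cat -v` notation can be decoded by hand: `^S` is Ctrl-S (decimal 19, hex 13), and the `M-` prefix means the high bit (decimal 128, hex 80) is also set. That would make `M-^S` hex 93 and `M-^T` hex 94. A minimal sketch of that arithmetic, using only `printf` and shell arithmetic (the function name `decode_meta_ctrl` is made up for illustration):

```shell
# Decode a cat -v style "M-^X" token into its byte value.
# decode_meta_ctrl is a hypothetical helper, not a standard utility.
decode_meta_ctrl() {
    letter=$1                          # the letter after "M-^", e.g. S
    code=$(printf '%d' "'$letter")     # ASCII code of the letter (S = 83)
    ctrl=$((code - 64))                # ^S = 83 - 64 = 19 (hex 13)
    byte=$((ctrl + 128))               # M- sets the high bit: 19 + 128 = 147
    printf 'hex %x octal %o\n' "$byte" "$byte"
}

decode_meta_ctrl S    # M-^S -> hex 93 octal 223
decode_meta_ctrl T    # M-^T -> hex 94 octal 224
```

The same arithmetic applied to `R` gives hex 92, which matches the apostrophe replacement mentioned in post #1.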

Thanking You,

Larry Moon

---------- Post updated at 04:56 PM ---------- Previous update was at 04:51 PM ----------

Sorry Methyl,

The file is HUGE. So when I ran your code the target output scrolled off the screen. My lacking observation skills didn't help either. Gull04's grep on "root" focused me.

---------- Post updated at 04:58 PM ---------- Previous update was at 04:56 PM ----------

I hope I don't get a slap on the wrist from vbe for now misusing the code tags he put me on to. :expressionless:

Hi Larry,

These control characters that you are seeing are different from most that I have seen before. However, if you can make a copy of the file to work on, I'll try and help you out here.

In a copy of the file, could you try the following for me? I'm going to type some of the instructions in without the code tags, as they might cause a problem in displaying the correct information.

The first part of the command is as follows;

:1,$s/

Just as in a normal substitution in vi. Then type the following key sequence: hold down the 'ctrl' key and press 'v', then, still holding down the 'ctrl' key, press 'shift+s'.

This should be followed by the normal end of the statement, like this:

/"/g

Could you give that a test and let me know how you get on - on a copy of the file please.

Regards

Dave

Okay, let's see the same line in a different way:

sed -n l filename.txt | grep "root"

Unfortunately the output from "cat -v" is fiddly to interpret, but this does tell us that these are single bytes.
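
Another way to confirm that each rogue character is a single byte is `od`, which prints one octal value per byte. A small sketch, assuming the bytes are octal 223 and 224 (the sample text is invented, not taken from Larry's file):

```shell
# Build a short sample containing bytes 0223 and 0224 with printf.
sample_line() {
    printf 'as \223root\224\n'
}

# One three-digit octal value per byte; 223 and 224 each show up once.
sample_line | od -An -t o1

# Count the bytes: 3 ("as ") + 1 + 4 ("root") + 1 + 1 (newline) = 10
sample_line | wc -c
```

On the real file the same pipeline can be fed from `grep root filename.txt` to keep the output short.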

Just out of interest, what Microsoft program wrote this file?

Managed to generate what I think are these characters:
Note that I don't think that they are ctrl/s and ctrl/t.
Btw: If you type ctrl/V ctrl/S and your terminal hangs, just type ctrl/Q (it's the old xon/xoff stop start code sequence).

# What they are not:
echo "\0023\0024" | sed -n l
echo "\0023\0024" | cat -v

\023\024$
^S^T
# What they probably are:
echo "\0223\0224" | sed -n l
echo "\0223\0224" | cat -v

\223\224$
M-^SM-^T
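
Whether `echo` interprets `\0223` depends on the shell and its options, so the demo above may print the backslash sequence literally on some systems. `printf` interprets three-digit octal escapes in its format string, so a sketch like the following should be less dependent on the shell in use. The `LC_ALL=C` setting is an assumption, added so that sed treats the input as plain bytes; the function name is made up for illustration.

```shell
# Emit bytes 0223 and 0224 followed by a newline, without relying on echo.
rogue_quotes() {
    printf '\223\224\n'
}

rogue_quotes | LC_ALL=C sed -n l    # shows the bytes as \223\224$
rogue_quotes | cat -v               # GNU cat shows M-^SM-^T
```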

Have delved through some very old notes on Vi and some more recent notes on Vim, seems that you used to be able to do this using the substitute command like this;

:1,$s/\\0223/'/g

Or in sed

sed -e s/\\0223/"/g <oldfile> > newfile

However I'm unable to test this for you, the post that this replaced was incorrect.

Regards

Dave

Sorry gull04, I'm not into the outer fringes of "vi" and if this is a "large" file we may find that "vi" can't cope.
Other readers on this board may be able to advise.

I'm slowly building-up to a unix "tr" command to translate the characters but will need to know what version of "echo" we have so we can generate each character.
The "awk" and "sed" experts will no doubt be poised.

Will need to know:
Can this be sorted out on the Windows platform?
What Operating System and version are you running and what Shell do you use?
How big is the file?
What is the octal code of the funny characters and what character should they be?

Gents,

I am incredibly thankful for all of the approaches you're providing. I'm going to have a quick go with the benefit of your suggestions and will attempt to come back with a consolidated appreciation.

---------- Post updated at 09:31 AM ---------- Previous update was at 09:18 AM ----------

Attempted to run this with the following results:

[lmoon@oc4562845586 tools]$ sed -e s/\\0223/" /g SP800-53a_HOLD1 > SP800-53a_HOLD1a
> 
> ^C
[lmoon@oc4562845586 tools]$ 

It put me into an interactive mode (for which I knew not what to enter, so I Ctrl-C'd out).
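
The `>` prompt above is most likely the shell rather than sed: the bare `"` in the command opens a quoted string that is never closed, so bash waits for more input. Quoting the whole sed script avoids that. A sketch, assuming the target bytes are octal 223 and 224, and building the pattern with `printf` instead of relying on a sed escape such as `\\0223` (the function name `fix_double_quotes` is made up for illustration):

```shell
# fix_double_quotes is a hypothetical helper: read stdin, write stdout,
# replacing bytes 0223 and 0224 with a plain double quote.
fix_double_quotes() {
    open=$(printf '\223')     # the raw byte, produced by printf
    close=$(printf '\224')
    # LC_ALL=C makes sed treat the input as single bytes.
    LC_ALL=C sed -e "s/[$open$close]/\"/g"
}

# Demo on an invented sample line:
printf 'described as \223root\224 or \223administrator\224\n' | fix_double_quotes
```

On the real file this would be run as `fix_double_quotes < SP800-53a_HOLD1 > SP800-53a_HOLD1a`, ideally on a copy first.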

---------- Post updated at 09:32 AM ---------- Previous update was at 09:31 AM ----------

Oh, the Windows program these were generated with was UltraEdit.

---------- Post updated at 09:55 AM ---------- Previous update was at 09:32 AM ----------

methyl,

Please find below answers to your questions:

$ uname -r
2.6.32-220.4.2.el6.x86_64

$ ls -al | grep SP800-53a_HOLD
-rwxrwxrwx.  1 lmoon lmoon   404089 Mar 12 15:51 SP800-53a_HOLD1
-rwxrwxrwx.  1 lmoon lmoon   395269 Jan  5 15:25 SP800-53a_HOLD1.bak

$ echo $BASH_VERSION
4.1.2(1)-release

Interesting that the sed command you sent earlier (that I ^C ed out of) did generate another file with a small delta in size. I did review it and found the target characters to still be there. Still it looks as though something was replaced.

I'm happy to send you the offending document if you're interested. There is no sensitive information in it.

Thanking You,

Larry

---------- Post updated at 09:59 AM ---------- Previous update was at 09:55 AM ----------

Oh,

I'm not able to discern the octal code for those characters. Mind you, the ASCII table I looked at only had 128 characters, with the highest octal code being '177'.

---------- Post updated at 10:04 AM ---------- Previous update was at 09:59 AM ----------

dubdubdub dot asciitable dot com Not allowed to post URLs until I've engaged five posts :slight_smile: In good faith, I'll get there.

We can use the unix "tr" command to change the three funny characters you mentioned in post #1:

# Demo
echo "\0222\0223\0224" | tr '\222\223\224' "'"'""'

'""
# For example
cat original_file | tr '\222\223\224' "'"'""'  > new_file

Ps. I now realise that 92, 93 and 94 are the hexadecimal interpretation of each character.
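
The hex and octal spellings refer to the same bytes, which a quick `printf` conversion can confirm. A minimal sketch (the function name `hex_to_octal` is made up for illustration):

```shell
# Convert a hex byte value (without the 0x prefix) to octal.
hex_to_octal() {
    printf '%o\n' "0x$1"
}

# The three codes from post #1:
for hex in 92 93 94; do
    printf 'hex %s = octal %s\n' "$hex" "$(hex_to_octal "$hex")"
done
```

All three values are above hex 7F, which is why they do not appear in a 128-character ASCII table.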

Methyl, Gull04, and vbe,

Below is some perl code that I've run on each line of the evaluated file to mitigate for those bloody control characters:

s/[^!-~\s]//g;

To me this is like skinning the cat from the inside out, so I've already given myself the "performance lecture". Still, I would be very interested if any of you foresee any functional issue that may surface by doing this.

By way of example below is a small perl script that demonstrates the behavior:

#!/usr/bin/perl -X


$str = "Please give Bob the account of root so he can be more responsible.";

#########################################################################################
# In the pattern below, !-~ is a range which matches all characters between ! and ~.
# These are the first and last printable, non-space characters in the 7-bit ASCII
# table (decimal 33 for ! and 126 for ~). As this range does not include whitespace,
# \s is separately included. \s is a shorthand for a whole character class that
# matches any whitespace character, including space, tab, newline and carriage
# return. The leading ^ negates the class, so everything else is deleted.
# For strings assigned a value it may take this form: $str =~ s/[^!-~\s]//g;
#########################################################################################

$str =~ s/[^!-~\s]//g;


print "$str\n";

Again, thanks to each of you for engaging this with me.

Cheers,

Larry Moon

Sorry, I don't read Perl but I get the idea.
If all you want to do is strip out certain characters then that is quite different from correcting them.

They are characters from the Extended ASCII character set, not Control characters.
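
To make the difference between stripping and correcting concrete, here is a small sketch using only `tr`. It assumes the three bytes are octal 222, 223 and 224; the sample text and the function names are invented for illustration.

```shell
# Invented sample line containing all three rogue bytes.
sample() {
    printf 'Bob\222s \223root\224 account\n'
}

# Stripping: delete the three bytes, so the apostrophe and quotes are lost.
strip_rogue() {
    LC_ALL=C tr -d '\222\223\224'
}

# Correcting: map each byte to its plain ASCII equivalent.
correct_rogue() {
    LC_ALL=C tr '\222\223\224' "'\"\""
}

sample | strip_rogue      # Bobs root account
sample | correct_rogue    # Bob's "root" account
```

Stripping is simpler but changes the text (for example "Bob's" becomes "Bobs"), whereas the translation keeps the punctuation.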