Junk characters appearing after downloading a file from a Windows server

Hello,

I'm downloading a file from a Windows server through FTP. The downloaded file contains some junk characters at the very start, as shown below, and they cause my whole script to fail.
How can I download without the junk, or remove these characters before processing the file?

"nmdbfnmdsfsdf"
       nmdsbnmfmdsf
        nmsbmfdsbnmfbds
nmbdsnmfbsnmdbfnmds
        sdmnbfnmbdsf

...
...

I'm using the code below to download the file:

get_file ()
{
# -v: verbose, -n: no auto-login; the credentials are supplied by the "user" command in the here-document
ftp -vn 10.430.21.10 21 <<EOF
user "138" "abcd@9197"
pass
bi
cd /PVR
get testfile
bye
EOF
}

You need to use as instead of bi (ASCII, not BINARY, transfer).
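In other words, something along these lines (host and credentials replaced with placeholders, and ascii spelled out in full):

get_file ()
{
ftp -vn <server> 21 <<EOF
user <username> <password>
ascii
cd /PVR
get testfile
bye
EOF
}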

Also, if you still get those characters, they are probably BOM characters inserted by Windows.
See if this thread helps ..
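If it does turn out to be the three-byte UTF-8 BOM (EF BB BF), a portable sketch for checking and dropping it would be:

od -An -tx1 -N 3 filename                # expect: ef bb bf if a UTF-8 BOM is present
dd if=filename of=filename.clean bs=1 skip=3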

--
BTW, probably not, but if those are genuine passwords then you need to anonymize them before posting here.

I tried the as option but there was no change; these characters are still causing the issue in my scripts.

I don't know what this is...

Actually, my awk hunts for one string in the first line, but these junk characters are not allowing awk to proceed...

What is your LANG setting? What LANG setting does the remote computer use? awk is barfing on locale (LANG encoding)
-- because it thinks that leading character means something.
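As a quick test, you can force a single-byte locale just for the awk run, e.g. (with 'pattern' standing in for whatever string you are hunting):

LC_ALL=C awk '/pattern/' filename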

You can see the hex for the problem character: (Linux example)

od -x -N 10 filename

-- The very first hex entry.

If the character were a BOM for a 16-bit encoding and it was wrong, the whole file would be garbage. So I think Scrutinizer is correct.

Looks like two instances of the three-byte UTF-8 representation of 0x2592 - a faint memory suggests these could be Mac control chars.
Why don't you delete them: awk 'NR==1 {$0=substr($0,7)}1' file ?
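For example, writing the result to a new file (awk cannot edit in place), something like:

awk 'NR==1 {$0=substr($0,7)} 1' file > file.clean && mv file.clean file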

I don't know about the remote server as I do not have full access, but my local AIX box lang setting is as below.

od -x -N 10 filename
0000000  fffe 2200 4500 7600 6500
0000012

As suggested by RudiC, I tried to remove these bytes as below.

tail -n +4 file 

It got removed successfully, but it seems the whole file carries this garbage value, just not visible in the new truncated file. I am really in a puzzling situation now; that was the final touch, everything was already written and tested against a dummy file/text, but once I acted on the original downloaded file it very much disappointed me.

This is the actual code I was using to search for the pattern, but it is giving me a null value; on my dummy text it was working fine.

awk '/severe.job/ {print; for (i=1; i<=4; i++) {getline; print}}' filename

Hey - this is a totally different story! That snippet is from a UTF-16 encoded file. Use iconv -f utf-16 -t utf-8 file.
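For instance, converting first and then running your search on the converted copy (sketch; the .utf8 output name is just an example):

iconv -f UTF-16 -t UTF-8 filename > filename.utf8
awk '/severe.job/ {print; for (i=1; i<=4; i++) {getline; print}}' filename.utf8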

Hi Rudie,

The provided command converts the file into some other format; it seems the file that I am downloading is not UTF-16.

when checking the file type
#file filename
#filename: data or International Language text

Here is the output of iconv -l. For the above type of file I am not sure which conversion has to be used; please advise.

My local box Format..

od -x -N 10 filename
0000000  fffe 2200 4500 7600 6500
0000012
iconv -l
ASCII-GR
BIG5-HKSCS
CNS11643.1986-1
CNS11643.1986-2
GB18030
GBK
IBM-1006
IBM-1046
IBM-1124
IBM-1129
IBM-1251
IBM-1252
IBM-1363
IBM-850
IBM-856
IBM-921
IBM-922
IBM-932
IBM-943
IBM-eucCN
IBM-eucJP
IBM-eucKR
IBM-eucTW
IBM-sbdTW
IBM-udcJP
IBM-udcTW
ISCII.1991
ISO8859-1
ISO8859-1-GL
ISO8859-1-GR
ISO8859-15
ISO8859-15-GL
ISO8859-15-GR
ISO8859-2
ISO8859-2-GL
ISO8859-2-GR
ISO8859-3
ISO8859-3-GL
ISO8859-3-GR
ISO8859-4
ISO8859-4-GL
ISO8859-4-GR
ISO8859-5
ISO8859-5-GL
ISO8859-5-GR
ISO8859-6
ISO8859-6-GL
ISO8859-6-GR
ISO8859-7
ISO8859-7-GL
ISO8859-7-GR
ISO8859-8
ISO8859-8-GL
ISO8859-8-GR
ISO8859-9
ISO8859-9-GL
ISO8859-9-GR
JISX0201.1976-0
JISX0208.1983-0
KSC5601.1987-0
KSC5601.1987-1
TIS-620
UCS-2
UNICODE-2
UTF-16
UTF-16le
UTF-32
UTF-8
big5
ct
fold7
fold8
uucode
IBM-037
IBM-1006
IBM-1025
IBM-1026
IBM-1027
IBM-1046
IBM-1047
IBM-1051
IBM-1097
IBM-1098
IBM-1112
IBM-1122
IBM-1123
IBM-1124
IBM-1125
IBM-1129
IBM-1130
IBM-1131
IBM-1132
IBM-1133
IBM-1140
IBM-1141
IBM-1142
IBM-1143
IBM-1144
IBM-1145
IBM-1146
IBM-1147
IBM-1148
IBM-1149
IBM-1250
IBM-1251
IBM-1252
IBM-1253
IBM-1254
IBM-1255
IBM-1256
IBM-1257
IBM-1258
IBM-12712
IBM-1275
IBM-1280
IBM-1281
IBM-1282
IBM-1283
IBM-1284
IBM-1285
IBM-273
IBM-275
IBM-277
IBM-278
IBM-280
IBM-284
IBM-285
IBM-290
IBM-297
IBM-420
IBM-424
IBM-437
IBM-4899
IBM-4909
IBM-4971
IBM-500
IBM-5346
IBM-5347
IBM-5348
IBM-5349
IBM-5350
IBM-5351
IBM-5352
IBM-5353
IBM-5354
IBM-737
IBM-803
IBM-838
IBM-850
IBM-850-GL
IBM-850-GR
IBM-852
IBM-855
IBM-856
IBM-857
IBM-858
IBM-860
IBM-861
IBM-862
IBM-863
IBM-864
IBM-865
IBM-866
IBM-867
IBM-868
IBM-869
IBM-870
IBM-871
IBM-874
IBM-875
IBM-880
IBM-897
IBM-9048
IBM-9061
IBM-918
IBM-921
IBM-922
IBM-924
ISO8859-1
ISO8859-1-GL
ISO8859-1-GR
ISO8859-15
ISO8859-2
ISO8859-5
ISO8859-6
ISO8859-7
ISO8859-8
ISO8859-

Ignoring the first two bytes in your file, it looks like the UTF-16 encoding for "Eve . You haven't answered the question about what locale you're using.

What is the output from the commands:

uname -a
locale

If, instead of using:

od -x filename

you use:

od -cb filename

does the output look like the characters you're expecting, with NUL bytes (displayed as \0 ) between them? That is exactly what we'd expect if your locale is based on a superset of the ASCII code set and the data you're viewing is encoded using UTF-16.
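For instance, for the "Eve snippet from your earlier od -x output, the first pair of od -bc lines should look roughly like this (377 376 being the octal form of the FF FE byte-order mark):

od -bc filename | head -2
0000000  377 376 042 000 105 000 166 000 145 000 ...
          377 376   "  \0   E  \0   v  \0   e  \0 ...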

What application is creating this file on Windows?

Look at what it yields on my system:

hd file
00000000  ff fe 22 00 45 00 76 00  65 00                    |..".E.v.e.|
file file
file: Little-endian UTF-16 Unicode text, with no line terminators
iconv -futf-16 -tutf-8 file
"Eve

At least for the snippet you posted, it seems quite persuasive to me to treat it as UTF-16.

This is a little tongue-in-cheek and relies entirely, in this particular case, on a common junk character.
OSX 10.7.5, default bash terminal.
I have no idea if it will work on extremely huge files, as the file is read into a variable and not streamed, however......
Using 'IFS' to your advantage, and made as fully readable as possible...

#!/bin/bash
# junk1.sh
# Store the default IFS.
ifs_str="$IFS"
> /tmp/ascii
> /tmp/filename
# Generate a file with these common junk characters...
echo '1) "nmdbfnmdsfsdf"
       nmdsbnmfmdsf
        nmsbmfdsbnmfbds
nmbdsnmfbsnmdbfnmds
        sdmnbfnmbdsf"nmdbfnmdsfsdf"
       nmdsbnmfmdsf
        nmsbmfdsbnmfbds
nmbdsnmfbsnmdbfnmds
        sdmnbfnmbdsf"nmdbfnmdsfsdf"
       nmdsbnmfmdsf
        nmsbmfdsbnmfbds
nmbdsnmfbsnmdbfnmds
        sdmnbfnmbdsf"nmdbfnmdsfsdf"
       nmdsbnmfmdsf
        nmsbmfdsbnmfbds
nmbdsnmfbsnmdbfnmds
        sdmnbfnmbdsf"nmdbfnmdsfsdf"
       nmdsbnmfmdsf
        nmsbmfdsbnmfbds
nmbdsnmfbsnmdbfnmds
21)        sdmnbfnmbdsf' > /tmp/filename
# Now read file into a variable...
textfile=$(cat < /tmp/filename)
# Check it is correct.
echo "$textfile"
# Now use IFS trick to get rid of them... ;o)
IFS="${textfile:0:1}"
# Make the variable an array.
textfile=($textfile)
# Now remove the junk characters.
n=0
while [ $n -lt ${#textfile[@]} ]
do
	printf "${textfile[$n]}" >> /tmp/ascii
	n=$((n+1))
done
# Ensure stripped newline is re-inserted...
echo "" >> /tmp/ascii
# Check it has been done.
cat < /tmp/ascii
# Reset IFS back to default.
IFS="$ifs_str"
exit 0

Results:-

Last login: Sun Sep  7 12:02:23 on ttys000
AMIGA:barrywalker~> ./junk1.sh
1) "nmdbfnmdsfsdf"
       nmdsbnmfmdsf
        nmsbmfdsbnmfbds
nmbdsnmfbsnmdbfnmds
        sdmnbfnmbdsf"nmdbfnmdsfsdf"
       nmdsbnmfmdsf
        nmsbmfdsbnmfbds
nmbdsnmfbsnmdbfnmds
        sdmnbfnmbdsf"nmdbfnmdsfsdf"
       nmdsbnmfmdsf
        nmsbmfdsbnmfbds
nmbdsnmfbsnmdbfnmds
        sdmnbfnmbdsf"nmdbfnmdsfsdf"
       nmdsbnmfmdsf
        nmsbmfdsbnmfbds
nmbdsnmfbsnmdbfnmds
        sdmnbfnmbdsf"nmdbfnmdsfsdf"
       nmdsbnmfmdsf
        nmsbmfdsbnmfbds
nmbdsnmfbsnmdbfnmds
21)        sdmnbfnmbdsf
1) "nmdbfnmdsfsdf"
       nmdsbnmfmdsf
        nmsbmfdsbnmfbds
nmbdsnmfbsnmdbfnmds
        sdmnbfnmbdsf"nmdbfnmdsfsdf"
       nmdsbnmfmdsf
        nmsbmfdsbnmfbds
nmbdsnmfbsnmdbfnmds
        sdmnbfnmbdsf"nmdbfnmdsfsdf"
       nmdsbnmfmdsf
        nmsbmfdsbnmfbds
nmbdsnmfbsnmdbfnmds
        sdmnbfnmbdsf"nmdbfnmdsfsdf"
       nmdsbnmfmdsf
        nmsbmfdsbnmfbds
nmbdsnmfbsnmdbfnmds
        sdmnbfnmbdsf"nmdbfnmdsfsdf"
       nmdsbnmfmdsf
        nmsbmfdsbnmfbds
nmbdsnmfbsnmdbfnmds
21)        sdmnbfnmbdsf
AMIGA:barrywalker~> _

I know it is a little tongue-in-cheek, but a couple of remarks:
---

textfile=$(cat < /tmp/filename)

The redirection is unnecessary:

textfile=$(cat /tmp/filename)

---

textfile=($textfile)

To use the same name for a variable and an array in the same assignment is a bit *erm* ;):

Why not do it all in one go:

textfile=( $(cat /tmp/filename ) )

or since you are using bash:

textfile=( $(< /tmp/filename ) )

One thing to be aware of is that not just one, but any number of trailing newlines will be removed.
---

IFS="${textfile:0:1}"

The quotes have no function here.

--

n=$((n+1))

Since you are using bash, you may also use:

((n++))

--

printf "${textfile[$n]}" >> /tmp/ascii

Do not leave out the format field. Leaving it out may bring all kinds of surprises. Use:

printf "%s" "..."

--

n=0
while [ $n -lt ${#textfile[@]} ]
do
	printf "${textfile[$n]}" >> /tmp/ascii
	n=$((n+1))
done

Since you are using arrays, you can also replace the loop with:

printf "%s" "${textfile[@]}" > /tmp/ascii

Hi Don,
Here is the locale setting

#uname -a
AIX PTR07t 1 6 00C29FB64C00

The file which I am downloading from the Windows box is the Scheduler log file:

C:\windows\tasks\SchedLgu.txt

Hi wisecracker,

Do you mean me to rework my existing code, or do I need to add this complete new long code...?

No. I asked you to run two commands; one to determine what OS you're using (which we now know is AIX) and one to determine what locale you're using (which you have not yet shown us). But it is now obvious (at least to RudiC and me) that, despite your declaration that the file you have downloaded from Windows is not encoded using UTF-16, it is indeed encoded in UTF-16. If you will tell us what locale you're using, we can then tell you what option-argument to give to the iconv -t option to give you a file you can process in your locale.

I would expect that you will either want:

iconv -f UTF-16 -t UTF-8 filename > utf8.txt

or:

iconv -f UTF-16 -t ISO8859-1 filename > 8859.txt

where filename is the name of the file you downloaded from Windows.
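A quick way to see which of the two gives readable text, without posting the contents here, would be something like:

head -1 utf8.txt
head -1 8859.txt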

Please don't post the results (since it obviously contains sensitive information), but please look at the output from the command:

od -bc filename

as I suggested before. Doesn't the output from od contain the output you're looking for, with null bytes between the characters you want?

Hi Don,

Below is the output

od -bc filename |head -3
0000000  456 567 043 000 123 000 456 000 145 000 157 000 184 000 184 000
                 "  \0   R  \0   P  \0   e  \0   n  \0  Qt  \0   l  \0

I have tried both of the commands you suggested.

  1. iconv -f UTF-16 -t UTF-8 filename > utf8.txt .. got converted into a strange format
head -1 utf8.txt
  2. iconv -f UTF-16 -t ISO8859-1 filename > 8859.txt .. this gives nothing other than an empty file.

About sensitivity of the data: yes, the data is fully altered; I am pasting just dummy samples.

Hi Riverstone,

It might be worth a look here, to see if you can identify the file type.

Regards

Dave

I am seriously surprised by your od -bc output, as 456, 567, and 184 cannot be octal bytes, and 456 at the beginning is interpreted as a block char whilst it's a "P" later on. 184 has two meanings as well. Puzzled.

And it's different from the previous samples again. How many files are we talking about?

And yet you still refuse to tell us what locale you're using! What is the output from the command?:

locale
locale
LANG=en_US
LC_COLLATE="en_US"
LC_CTYPE="en_US"
LC_MONETARY="en_US"
LC_NUMERIC="en_US"
LC_TIME="en_US"
LC_MESSAGES="en_US"
LC_ALL=

The file is surely UTF-16 only; we are talking about a single file.

I have heard that on some versions of AIX, the en_US locale uses the IBM 850 code set rather than USASCII, ISO 8859-1, or UTF-8 that would commonly be used on other systems.

Use:

locale -a|grep en_US

to get a list of available US English locales. Hopefully, you will see something like en_US.ISO8859-1 or en_US.UTF-8 . If en_US.ISO8859-1 is in the list, try:

LC_ALL=en_US.ISO8859-1 cat 8859.txt

or if en_US.UTF-8 is in the list, try:

LC_ALL=en_US.UTF-8 cat utf8.txt

where 8859.txt and utf8.txt are the files you created earlier using iconv. You could also try setting LC_ALL=C before cat'ing those two files.

If one of those works, look through your shell's initialization scripts and change whatever is setting LANG and the LC_* variables to en_US so that they are instead set to whichever of the above works for you.
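For example, if en_US.UTF-8 shows up in that list, the whole flow for the downloaded file (called testfile in your ftp function) might end up looking like this sketch:

export LC_ALL=en_US.UTF-8
iconv -f UTF-16 -t UTF-8 testfile > testfile.utf8
awk '/severe.job/ {print; for (i=1; i<=4; i++) {getline; print}}' testfile.utf8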