I'm downloading a file from a Windows server through FTP. The downloaded file contains some junk characters at the very start of the file, as below, and this causes my whole script to fail.
How can I download it without the junk, or how can I remove these characters before processing it?
What is your LANG setting? What LANG setting does the remote computer use? awk is barfing on the locale (LANG encoding)
-- because it thinks that leading character means something.
You can see the hex for the problem character: (Linux example)
od -x -N 10 filename
-- The very first hex entry.
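To make that check concrete, here is a minimal sketch that builds a small sample file and classifies its first bytes. The path /tmp/bom_demo.txt and its contents are made up for illustration; the byte patterns tested are the standard UTF-16 and UTF-8 byte order marks. It only recognises those three patterns, so treat the result as a hint rather than a diagnosis.

```shell
#!/bin/sh
# Hypothetical sample: "Eve" in UTF-16 little-endian, preceded by the BOM FF FE.
demo=/tmp/bom_demo.txt
printf '\377\376E\000v\000e\000' > "$demo"

# Dump the first 3 bytes as hex, one byte per column, without offsets,
# then squeeze out the whitespace so the result is easy to match.
first=$(od -An -tx1 -N3 "$demo" | tr -d ' \n')

case "$first" in
    fffe*)   guess="UTF-16 little-endian BOM" ;;
    feff*)   guess="UTF-16 big-endian BOM" ;;
    efbbbf*) guess="UTF-8 BOM" ;;
    *)       guess="no BOM recognised" ;;
esac
echo "$first: $guess"
```

Run against the real download, replace the sample file with your own file name and drop the printf line.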
If the character were a BOM for a 16-bit encoding and it were wrong, the whole file would be garbage. So, I think Scrutinizer is correct.
Looks like the UTF-8 three-byte representation of 0x2592, twice over. A faint memory suggests that these could be Mac control chars.
Why don't you delete them?
awk 'NR==1 {$0=substr($0,7)}1' file
As suggested by RudiC, I tried to remove these bytes as below.
tail -n +4 file
They were removed successfully, but it seems that the whole file contains this garbage, even though it is not visible in the new truncated file. I am really puzzled now. This was meant to be the final touch: everything was written and tested against a dummy file/text, and once I ran it on the original downloaded file, the result was very disappointing.
This is the actual code I was using to search for the pattern. It gives me a null value, although on my dummy text it was working fine.
Ignoring the 1st two bytes in your file, it looks like the UTF-16 encoding for "Eve . You haven't answered the question about what locale you're using.
What is the output from the commands:
uname -a
locale
If, instead of using:
od -x filename
you use:
od -cb filename
does the output look like the characters you're expecting with NUL bytes (displayed as \0 ) between them? That is exactly what we'd expect if your locale is based on a superset of the ASCII code set and the data you're viewing is encoded using UTF-16.
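As a rough illustration of what to look for, the sketch below writes a tiny UTF-16 sample, dumps it with od -c, and counts the NUL bytes. The path /tmp/utf16_demo.txt and the sample text are assumptions for the demo only. For mostly-ASCII text stored as UTF-16, close to half of the bytes are NUL, which is a heuristic and not proof of the encoding.

```shell
#!/bin/sh
# Hypothetical sample: BOM (FF FE) followed by "Eve" in UTF-16 little-endian.
sample=/tmp/utf16_demo.txt
printf '\377\376E\000v\000e\000' > "$sample"

# Character dump: each letter should be followed by \0.
od -c "$sample"

# Count all bytes, then count the bytes left after deleting every NUL.
# LC_ALL=C keeps tr working byte-by-byte regardless of the current locale.
total=$(wc -c < "$sample" | tr -d ' ')
nonnul=$(LC_ALL=C tr -d '\000' < "$sample" | wc -c | tr -d ' ')
nuls=$((total - nonnul))
echo "total=$total nul=$nuls"
```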
What application is creating this file on Windows?
This is a little tongue-in-cheek and relies entirely in this particular case on a common junk character.
OSX 10.7.5, default bash terminal.
I have no idea if it will work on extremely huge files as it is read into variable and not streamed, however......
Using 'IFS' to your advantage and made as fully readable as possible...
#!/bin/bash
# jink1.sh
# Store the default IFS.
ifs_str="$IFS"
> /tmp/ascii
> /tmp/filename
# Generate a file with these common junk characters...
echo '1) "nmdbfnmdsfsdf"
nmdsbnmfmdsf
nmsbmfdsbnmfbds
nmbdsnmfbsnmdbfnmds
sdmnbfnmbdsf"nmdbfnmdsfsdf"
nmdsbnmfmdsf
nmsbmfdsbnmfbds
nmbdsnmfbsnmdbfnmds
sdmnbfnmbdsf"nmdbfnmdsfsdf"
nmdsbnmfmdsf
nmsbmfdsbnmfbds
nmbdsnmfbsnmdbfnmds
sdmnbfnmbdsf"nmdbfnmdsfsdf"
nmdsbnmfmdsf
nmsbmfdsbnmfbds
nmbdsnmfbsnmdbfnmds
sdmnbfnmbdsf"nmdbfnmdsfsdf"
nmdsbnmfmdsf
nmsbmfdsbnmfbds
nmbdsnmfbsnmdbfnmds
21) sdmnbfnmbdsf' > /tmp/filename
# Now read file into a variable...
textfile=$(cat < /tmp/filename)
# Check it is correct.
echo "$textfile"
# Now use IFS trick to get rid of them... ;o)
IFS="${textfile:0:1}"
# Make the variable an array.
textfile=($textfile)
# Now remove the junk characters.
n=0
while [ $n -lt ${#textfile[@]} ]
do
printf '%s' "${textfile[$n]}" >> /tmp/ascii
n=$((n+1))
done
# Ensure stripped newline is re-inserted...
echo "" >> /tmp/ascii
# Check it has been done.
cat < /tmp/ascii
# Reset IFS back to default.
IFS="$ifs_str"
exit 0
No. I asked you to run two commands: one to determine what OS you're using (which we now know is AIX) and one to determine what locale you're using (which you have not yet shown us). But it is now obvious (at least to RudiC and me) that, despite your declaration that the file you downloaded from Windows is not encoded using UTF-16, it is indeed encoded in UTF-16. If you tell us what locale you're using, we can then tell you what option-argument to give to the iconv -t option to get a file you can process in your locale.
I would expect that you will either want:
iconv -f UTF-16 -t UTF-8 filename > utf8.txt
or:
iconv -f UTF-16 -t ISO8859-1 filename > 8859.txt
where filename is the name of the file you downloaded from Windows.
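For anyone who wants to try the conversion on harmless data first, here is a small sketch. The two /tmp paths and the sample text are invented for the demo. It also assumes that your iconv accepts the names UTF-16 and UTF-8 and honours the byte order mark; code set names differ between systems, so check iconv -l if the call fails.

```shell
#!/bin/sh
# Hypothetical input and output paths used only for this demo.
src=/tmp/win_demo_utf16.txt
out=/tmp/win_demo_utf8.txt

# "Eve" plus a newline, encoded as UTF-16 little-endian with a leading BOM.
printf '\377\376E\000v\000e\000\n\000' > "$src"

# Convert; the redirect target is only worth reading if iconv succeeded.
if iconv -f UTF-16 -t UTF-8 "$src" > "$out"; then
    cat "$out"
else
    echo "iconv failed; run 'iconv -l' to see the code set names it knows" >&2
fi
```

Converting into a new file, rather than in place, keeps the original download available if the chosen target code set turns out to be wrong.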
Please don't post the results (since it obviously contains sensitive information), but please look at the output from the command:
od -bc filename
as I suggested before. Doesn't the output from od contain the output you're looking for with null bytes between the characters you want?
I am seriously surprised by your od -bc output, as 456, 567, and 184 cannot be octal bytes, and 456 in the beginning is interpreted as a block char, whilst it's a "P" later on. 184 has two meanings as well. Puzzled.
And, it's different from previous samples again. How many files are we talking of?
I have heard that on some versions of AIX, the en_US locale uses the IBM 850 code set rather than USASCII, ISO 8859-1, or UTF-8 that would commonly be used on other systems.
Use:
locale -a|grep en_US
to get a list of available US English locales. Hopefully, you will see something like en_US.ISO8859-1 or en_US.UTF-8 . If en_US.ISO8859-1 is in the list, try:
LC_ALL=en_US.ISO8859-1 cat 8859.txt
or if en_US.UTF-8 is in the list, try:
LC_ALL=en_US.UTF-8 cat utf8.txt
where 8859.txt and utf8.txt are the files you created earlier using iconv . You could also try setting LC_ALL=C before cat'ing those two files.
If one of those works, look through your shell's initialization scripts and change whatever sets LANG and the LC_* variables to en_US so that it sets them to whichever of the above works for you.
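The locale check can also be scripted. The following is only a sketch: pick_locale is a made-up helper name, and the preference order (including the en_US.utf8 spelling that some Linux systems report) is an assumption, not something taken from the AIX documentation. It reads a locale list on standard input and prints the first preferred name it finds.

```shell
#!/bin/sh
# pick_locale (hypothetical helper): read a list of locale names on stdin
# and print the first one from our preference list that appears in it.
pick_locale() {
    list=$(cat)
    for want in en_US.UTF-8 en_US.utf8 en_US.ISO8859-1 C; do
        # -x matches whole lines only, -F treats the name as a fixed string.
        if printf '%s\n' "$list" | grep -qxF "$want"; then
            echo "$want"
            return 0
        fi
    done
    return 1
}

# Feed it the locales installed on this machine.
chosen=$(locale -a | pick_locale) && echo "would use LC_ALL=$chosen"
```

The name it prints is the one you would then put in place of en_US in your initialization scripts.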