Check if file is EBCDIC or ASCII format

I have a requirement where I need to check whether a file is EBCDIC or ASCII and, depending on the format, retrieve information from that file.

My files are:

file1.txt --> this is an EBCDIC file
file2.txt --> this is an ASCII file

I tried the code below:

file=file1.txt
type="`file $file`"

This is the output I get if the file is EBCDIC:

file1.txt: data

And if the file is ASCII:

file2.txt: ASCII text, with no line terminators

Now I need to do some operations if the file is EBCDIC and others if the file is ASCII:

type="`file $file`"
if grep -q data "$type"; then
echo "ebcdic"
else
echo "ascii"
fi

output:

grep: 010020001_S-FOR-Sort-SYEXC_20180109_062320_0100.x937: data: No such file or directory
ascii

Firstly, it's not searching for the word data correctly in the $type variable, and it's showing something like "No such file or directory". Can anyone guide me on where I am going wrong, and also on what the best approach for this requirement would be?
TIA

Hi,

I think I can see where you might have gone a little bit wrong here. I assume that when you run:

grep -q data "$type"

you're attempting to see if the string "data" occurs in the contents of the variable "$type", which will indeed be correctly populated with the output of the file command. However, that's not what grep does. Any argument that follows the pattern is treated by grep as the name of a file to search, not as text to search in.

So what's going wrong here is that grep is trying to search for the string "data" in a file on disk in the current directory called (e.g. in the case of an ASCII file) "test: ASCII text", which does not exist.

If you want to check the contents of a variable rather than a file, there are a few ways you could do that. Probably one of the easiest changes for you to make in your script would be to do a test like this instead:

if echo "$type" | grep -q "data"; then

This would cause grep to treat standard input as its file, and we are piping the output of the echo "$type" command into it. So this would ultimately check whether the variable "$type" contains the string "data", which I think is what you want to do.
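
To put the pieces together, here is a minimal sketch of the whole test. The function name classify is made up for illustration, and it relies on the assumption (taken from the output in the question) that file reports "data" for the EBCDIC file on that system; other versions of file may print something else. The filename prefix is stripped first so that a filename which happens to contain "data" doesn't cause a false match.

```shell
#!/bin/sh
# classify DESCRIPTION: hypothetical helper, not part of any standard tool.
# DESCRIPTION is one line of file(1) output, e.g. "file1.txt: data".
classify() {
    desc=$1
    # Drop everything up to the first ": " (the filename part), then
    # feed the remainder to grep on standard input.
    if printf '%s\n' "${desc#*: }" | grep -q 'data'; then
        echo "ebcdic"
    else
        echo "ascii"
    fi
}

classify "file1.txt: data"
classify "file2.txt: ASCII text, with no line terminators"
```

In the real script you would call it as classify "$(file "$file")".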

Hope this helps.

grep takes a file for input, not the contents of a shell variable. To analyse a shell variable, you can use the test (or [ ) or case command, or - in more recent shells - the [[ compound command. For example (please be aware that my file command outputs different results for ebcdic):

case ${type#*: } in
        Non-ISO*)       echo "file type: ebcdic" ;;
        ASCII*)         echo "file type: ASCII" ;;
        *)              echo "file type not recognised." ;;
esac

or

[[ "$type" =~ Non-ISO ]] && echo ebcdic || echo ASCII

It would help to know the source of this data: is it mostly letters and numbers, or are there a lot of special characters? If very few bytes are higher than x'7f' (in ASCII-based files those would be things like Greek letters, graphics, or math symbols), you can assume it is ASCII. In EBCDIC, alphabetic and numeric characters are all higher than that. Also, a space in ASCII is x'20' while an EBCDIC space is x'40', which is the ASCII code for the at sign "@". Perhaps compare the number of at signs to the file size?
This table, "Ascii Table - ASCII character codes and html, octal, hex and decimal chart conversion", is a good place to start.
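
A rough sketch of the high-byte part of that idea in plain sh, using tr and wc to count bytes. The function name guess_encoding and the two thresholds (more than half the bytes above x'7f' means EBCDIC, fewer than one in ten means ASCII) are assumptions picked for illustration, not tested rules, so tune them against your own files.

```shell
#!/bin/sh
# guess_encoding FILE: hypothetical helper that prints ebcdic, ascii or unknown.
# The thresholds are illustrative assumptions only.
guess_encoding() {
    f=$1
    # $(( ... )) also strips the padding some wc implementations print.
    total=$(( $(wc -c < "$f") ))
    # Delete every byte in the range 0x00-0x7f and count what remains.
    high=$(( $(LC_ALL=C tr -d '\000-\177' < "$f" | wc -c) ))
    if [ "$total" -eq 0 ]; then
        echo "unknown"
    elif [ $(( high * 2 )) -gt "$total" ]; then
        echo "ebcdic"      # most bytes are above x'7f'
    elif [ $(( high * 10 )) -lt "$total" ]; then
        echo "ascii"       # hardly any bytes are above x'7f'
    else
        echo "unknown"
    fi
}

# Usage: guess_encoding file1.txt
```

The at-sign count can be obtained the same way with LC_ALL=C tr -cd '@' < "$f" | wc -c and compared with the number of x'20' bytes if you want a second opinion.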

Also, by EBCDIC do you happen to mean "packed decimal"?

Packed decimal has little to do with EBCDIC or ASCII. Each byte contains two decimal digits, except the last byte, which holds one digit followed by a sign nibble: x'C' for positive, x'D' for negative, or x'F' for unsigned.
In COBOL, a signed five-digit field would be declared as

77  FIELD-NAME   PIC S9(5) COMP-3.

and would occupy three bytes. In hex, the number -12345 would be x'12 34 5D'.
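
To make the layout concrete, here is a small sketch that decodes a packed field once you have its bytes as a hex string (for example from od or xxd). The function name unpack_comp3 is made up for illustration, and it only handles the three sign nibbles mentioned above.

```shell
#!/bin/sh
# unpack_comp3 HEX: hypothetical helper that decodes a packed-decimal field
# given as hex digits, e.g. "12 34 5D". Digits are printed as stored, so
# leading zeros are kept.
unpack_comp3() {
    # Remove spaces and normalise the hex letters to upper case.
    hex=$(printf '%s' "$1" | tr -d ' ' | tr 'a-f' 'A-F')
    digits=${hex%?}          # every nibble except the last
    sign=${hex#"$digits"}    # the last nibble carries the sign
    case $digits in
        ''|*[!0-9]*) echo "not a packed decimal: $1" >&2; return 1 ;;
    esac
    case $sign in
        D)   printf '%s\n' "-$digits" ;;   # negative
        C|F) printf '%s\n' "$digits" ;;    # positive or unsigned
        *)   echo "unknown sign nibble: $sign" >&2; return 1 ;;
    esac
}

unpack_comp3 "12 34 5D"    # prints -12345
```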

I know that. People asking about EBCDIC here usually don't, though, and almost always are looking into packed decimal instead. Time will tell.

True. EBCDIC and ASCII only describe the hex code used to represent a character; for example, the letter B is x'42' in ASCII and x'c2' in EBCDIC. They don't apply to packed decimal, binary, or floating point fields. Negative 12345 as a 16-bit two's complement binary value would be x'cfc7' regardless of how the alphanumeric fields are coded in the record where it resides.
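
If you want to check those values yourself, both can be reproduced from the shell. This is only a quick sketch; twos16 is a made-up helper name, and it assumes a shell whose arithmetic accepts hex constants (bash, ksh and dash do).

```shell
#!/bin/sh
# twos16 N: hypothetical helper that prints N as a 16-bit two's complement
# value in hex by masking off everything above the low 16 bits.
twos16() {
    printf '%04x\n' $(( $1 & 0xFFFF ))
}

twos16 -12345               # prints cfc7

# The byte the shell writes for the letter B on an ASCII system:
printf 'B' | od -An -tx1    # shows 42
```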