File encoding

palex · May 25, 2024, 2:10pm

Hello,
I am having difficulty editing/processing a data file due to its format.

bash-3.2$ file 20240501cengagelearning_titles.TXT 
20240501cengagelearning_titles.TXT: data

The cat/head/tail commands all show the proper data, but I cannot open the file in vi. I feel that I need to use iconv, but I don't know how to identify the encoding.

Thanks.

vbe · May 25, 2024, 2:59pm

May I ask you what you are trying to do?
The file command says it's a data file, OK, but a data file is not automatically a text file...

palex · May 25, 2024, 3:05pm

I'm trying to run scripts on the file, but they won't work until I format/decode the file. The scripts worked previously, but the company changed the way they encode the file.

munkeHoller · May 25, 2024, 3:10pm

@palex ,
Go back to whoever encoded and ask them what/why they’ve changed. I imagine they’ll also have tools to decode it if they’re encoding it

palex · May 25, 2024, 3:13pm

I cannot get straight answers from them. I need to figure this out on my end.

munkeHoller · May 25, 2024, 3:23pm

Can you share the output from your head/tail cmds ?

Any messages when you try to open using vim/vi ?

munkeHoller · May 25, 2024, 3:30pm

What, and these are colleagues from the same company?

palex · May 25, 2024, 4:00pm

bash-3.2$ head 20240501cengagelearning_titles.TXT 
ÿþ"ISBN"|"ISBN-13"|"Full Title"|"Previous ISBN"|"Edition"|"Copyright"|"Pub Date"|"Status Code"|"Discount Code"|"Page Count"|"Binding"|"US List Price"|"Author"|"Publisher"|"Imprint"|"Description"|"TOC"|"Major Description"|"Minor Description"|"Primary BISAC"|"Secondary BISAC"|"Tertiary BISAC"|"Media Type Description"|"Carton Qty"|"Unit Weight"|"Next Edition ISBN"|"Next Edition Pub Date"|"MTO Flag"|
"0534264964"|"9780534264963"|"Distance EducationA Systems View"||"001"|"1996"|"11/3/1995 12:00:00 AM"|"RRA"|"A"|"304"|"HB"|"66.95"|"Michael G. Moore Greg Kearsley"|"Cengage Learning"|"Cengage Learning"|"The only comprehensive and current book on the subject of distance education, this book utilizes a systems approach to organize and justify material and includes information on the fundamental issues of distance education as well as the theory, research, and practice."|"1. Fundamentals of Distance Education.  2. The Historical Context of Distance Education.  3. The Scope of Distance Education.  4. Research on Effectiveness.  5. Technologies and Media.  6. Course Design and Development.  7. Teaching and Tutoring.  8. The Distance Education Student.  9. Administration, Management, and Policy.  10. The Theoretical Basis for Distance Education.  11. International Perspectives.  12. The Transformation of Education.  Appendices.  Glossary.  Bibliography.  Index."|"Education"|"Education"|"BUSINESS & ECONOMICS / Education"|"EDUCATION / General"|"EDUCATION / Teaching Methods & Materials / General"|"Bound Book"|"26"|"1.3000"|"9780534506889"|"1/1/0001 12:00:00 AM"|"N"|

bash-3.2$ vi 20240501cengagelearning_titles.TXT 

ÿþ"^@I^@S^@B^@N^@"^@|^@"^@I^@S^@B^@N^@-^@1^@3^@"^@|^@"^@F^@u^@l^@l^@ ^@T^@i^@t^@l^@e^@"^@|^@"^@P^@r^@e^@v^@i^@o^@u^@s^@ ^@I^@S^@B^@N^@"^@|^@"^@E^@d^@i^@t^@i^@o^@n^@"^@|^@"^@C^@o^@p^@y^@r^@i^@g^@h^@t^@"^@|^@"^@P^@u^@b^@ ^@D^@a^@t^@e^@"^@|^@"^@S^@t^@a^@t^@u^@s^@ ^@C^@o^@d^@e^@"^@|^@"^@D^@i^@s^@c^@o^@u^@n^@t^@ ^@C^@o^@d^@e^@"^@|^@"^@P^@a^@g^@e^@ ^@C^@o^@u^@n^@t^@"^@|^@"^@B^@i^@n^@d^@i^@n^@g^@"^@|^@"^@U^@S^@ ^@L^@i^@s^@t^@ ^@P^@r^@i^@c^@e^@"^@|^@"^@A^@u^@t^@h^@o^@r^@"^@|^@"^@P^@u^@b^@l^@i^@s^@h^@e^@r^@"^@|^@"^@I^@m^@p^@r^@i^@n^@t^@"^@|^@"^@D^@e^@s^@c^@r^@i^@p^@t^@i^@o^@n^@"^@|^@"^@T^@O^@C^@"^@|^@"^@M^@a^@j^@o^@r^@ ^@D^@e^@s^@c^@r^@i^@p^@t^@i^@o^@n^@"^@|^@"^@M^@i^@n^@o^@r^@ ^@D^@e^@s^@c^@r^@i^@p^@t^@i^@o^@n^@"^@|

munkeHoller · May 25, 2024, 4:10pm

Fundamentally, what is the content of this file supposed to be (based on whatever you are working on, and on your prior experience with the same or similar data files)?

palex · May 25, 2024, 4:18pm

It's just a large data file with multiple fields per line. I need to awk out select fields.

munkeHoller · May 25, 2024, 5:46pm

@palex , that's a totally inadequate 'description' of the contents; you could apply that to anything.

I would have expected something like 'Cengage Learning catalog data' (that's my own guess based on the sample you've provided).

Perhaps sharing the script(s) you've previously written would help us.

bendingrodriguez · May 25, 2024, 5:57pm

Hi @palex,

the first two chars of your file look like binary, i.e. non-ASCII. Try to strip them via

$ cut -d '"' -f2-  infile > outfile

and then edit outfile. Does that work? Note that the first " will be stripped, too.

palex · May 25, 2024, 6:19pm

I did try this... same problem unfortunately.

munkeHoller · May 25, 2024, 6:34pm

@palex , show some workings please. Stating that something doesn't work is useless unless accompanied by supporting evidence; that's all basic investigative stuff.

given the sample you provided, what would you expect the output to be ...

supply existing code - we then have something concrete to use to give feedback on,
otherwise we're trying to solve a problem that we don't have any real details on.

we are suffering from the same malaise you've alluded to .... ie you are not giving any real help or clear description ....

munkeHoller · May 25, 2024, 7:02pm

a basic dump

awk -f junk.awk junk.txt 
ÿþ"ISBN" ["0534264964"]
"ISBN-13" ["9780534264963"]
"Full Title" ["Distance EducationA Systems View"]
"Previous ISBN" []
"Edition" ["001"]
"Copyright" ["1996"]
"Pub Date" ["11/3/1995 12:00:00 AM"]
"Status Code" ["RRA"]
"Discount Code" ["A"]
"Page Count" ["304"]
"Binding" ["HB"]
"US List Price" ["66.95"]
"Author" ["Michael G. Moore Greg Kearsley"]
"Publisher" ["Cengage Learning"]
"Imprint" ["Cengage Learning"]
"Description" ["The only comprehensive and current book on the subject of distance education, this book utilizes a systems approach to organize and justify material and includes information on the fundamental issues of distance education as well as the theory, research, and practice."]
"TOC" ["1. Fundamentals of Distance Education.  2. The Historical Context of Distance Education.  3. The Scope of Distance Education.  4. Research on Effectiveness.  5. Technologies and Media.  6. Course Design and Development.  7. Teaching and Tutoring.  8. The Distance Education Student.  9. Administration, Management, and Policy.  10. The Theoretical Basis for Distance Education.  11. International Perspectives.  12. The Transformation of Education.  Appendices.  Glossary.  Bibliography.  Index."]
"Major Description" ["Education"]
"Minor Description" ["Education"]
"Primary BISAC" ["BUSINESS & ECONOMICS / Education"]
"Secondary BISAC" ["EDUCATION / General"]
"Tertiary BISAC" ["EDUCATION / Teaching Methods & Materials / General"]
"Media Type Description" ["Bound Book"]
"Carton Qty" ["26"]
"Unit Weight" ["1.3000"]
"Next Edition ISBN" ["9780534506889"]
"Next Edition Pub Date" ["1/1/0001 12:00:00 AM"]
"MTO Flag" ["N"]

stripping non-printing chars is also relatively simple
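
The junk.awk script itself was not posted, so the following is only a guess at how a header/value dump like the one above could be produced. The function name dump_fields and the two-line sample are made up for illustration; the only things taken from the thread are the pipe delimiter and the trailing "|" on each line.

```shell
# Hypothetical reconstruction: pair each header name with the matching
# field of every data row. Not the original junk.awk, which was not shown.
dump_fields() {
    awk -F'|' '
        # Header row: every line ends with "|", so the last field is empty.
        NR == 1 { n = NF - 1; for (i = 1; i <= n; i++) name[i] = $i; next }
        # Data rows: print "header [value]" for each real field.
        { for (i = 1; i <= n; i++) printf "%s [%s]\n", name[i], $i }
    ' "$1"
}

# Made-up sample in the same layout as the posted data.
sample=$(mktemp)
printf '%s\n' '"ISBN"|"Edition"|' '"0534264964"|"001"|' > "$sample"
dump_fields "$sample"
rm -f "$sample"
```

This assumes the input is already plain single-byte text; on the raw file the NUL bytes would still be sitting between the characters.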

Paul_Pedant · May 25, 2024, 10:50pm

Each of those ^@ things is an ASCII NUL character, and they alternate with single readable characters. So each character seems to be taking two bytes.

The terminal just ignores these NULs (as timing padding). vi makes them visible using ^ to mark a control character.

I am 90% sure that this data is now being written in Windows Wide (16-bit) character format by your "friends" in the organisation. Key-words here are UTF-16 and wchar_t. I suspect the first two characters are a 16-bit marker for the wide character style.
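
One way to check that suspicion is to look at the raw bytes rather than at what the terminal or vi chooses to display. This is a minimal sketch: the function names first_bytes and guess_bom are made up here, and the sample file is built by hand to mimic the start of the posted data.

```shell
# Print the first N bytes of a file as hex, separated by single spaces.
first_bytes() {
    od -An -tx1 -N "${2:-8}" "$1" | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# Guess the UTF-16 byte order from a leading byte order mark, if present.
guess_bom() {
    case "$(first_bytes "$1" 2)" in
        "ff fe") echo "UTF-16LE" ;;
        "fe ff") echo "UTF-16BE" ;;
        *)       echo "no UTF-16 BOM" ;;
    esac
}

# Hand-made sample: marker bytes, then '"', 'I', 'S', each followed by NUL.
sample=$(mktemp)
printf '\377\376"\000I\000S\000' > "$sample"
first_bytes "$sample" 8
guess_bom "$sample"
rm -f "$sample"
```

If every second byte in the dump is 00 and the file starts with ff fe or fe ff, that is strong evidence of UTF-16 text rather than arbitrary binary data.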

munkeHoller · May 25, 2024, 11:40pm

@Paul_Pedant , good shout (again).

All occurrences of the literal ^@ replaced by NUL (ASCII 0), then a simple test:

file vi.junk 
vi.junk: data

cat vi.junk | tr -d '\000'
ÿþ"ISBN"|"ISBN-13"|"Full Title"|"Previous ISBN"|"Edition"|"Copyright"|"Pub Date"|"Status Code"|"Discount Code"|"Page Count"|"Binding"|"US List Price"|"Author"|"Publisher"|"Imprint"|"Description"|"TOC"|"Major Description"|"Minor Description"|

palex · May 26, 2024, 12:05am

Hey everyone,
Thanks for your help, and thanks Paul for seeing the interleaving. I did find a solution that works:

tr < inputfile -d '\000' > outputfile

Paul_Pedant · May 26, 2024, 8:10am

I discovered that the user-level output uses the first two bytes to indicate whether the UTF-16 is Big-Endian (0xFE 0xFF) or Little-Endian (0xFF 0xFE) -- a problem that UTF-8 sidesteps very neatly.

Those are immediately converted to UTF-8 (!) as 0xC3 0xBE 0xC3 0xBF (Big-Endian) or 0xC3 0xBF 0xC3 0xBE (Little-Endian). 0xC3 0xBF is Unicode LATIN SMALL LETTER Y WITH DIAERESIS (ÿ) and 0xC3 0xBE is LATIN SMALL LETTER THORN (þ), which is why your vi starts out with ÿþ.

So your encoding is UTF-16LE (aka UTF16LE) as found with iconv -l | grep 'UTF.*16'

You might find that setting a locale to one of those names (temporarily, and in a subshell) helps you send the data out in a more readable form.
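
Given the UTF-16LE diagnosis, a conversion with iconv is worth considering alongside the tr approach. Deleting NULs only works while every character is plain ASCII, and it leaves the two marker bytes at the start of the file. The sketch below assumes the file starts with a byte order mark, as the posted sample does; the function name utf16_to_utf8 and the sample data are made up for illustration.

```shell
# Convert a BOM-prefixed UTF-16 file to UTF-8 on stdout. Using "UTF-16"
# (without LE/BE) lets iconv read the byte order mark and consume it.
utf16_to_utf8() {
    iconv -f UTF-16 -t UTF-8 "$1"
}

# Hand-made UTF-16LE sample: BOM, then "ISBN"|"T"| and a newline.
sample=$(mktemp)
printf '\377\376"\000I\000S\000B\000N\000"\000|\000"\000T\000"\000|\000\n\000' > "$sample"

# After conversion the usual field extraction works again.
utf16_to_utf8 "$sample" | awk -F'|' '{ print $1 }'   # prints "ISBN"
rm -f "$sample"
```

If the file also uses Windows line endings (not visible in the output posted here, so this is an assumption), piping the converted text through tr -d '\r' before awk may be needed as well.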