Hello,
I am having difficulty editing/processing a data file due to its format.
bash-3.2$ file 20240501cengagelearning_titles.TXT
20240501cengagelearning_titles.TXT: data
The cat/head/tail commands all show the proper data, but I cannot edit the file in vi. I think I need to use iconv, but I don't know how to identify the encoding.
I'm trying to run scripts on the file, but they won't work until I convert/decode it. The scripts worked previously, but the company changed the way it encodes the file.
bash-3.2$ head 20240501cengagelearning_titles.TXT
ÿþ"ISBN"|"ISBN-13"|"Full Title"|"Previous ISBN"|"Edition"|"Copyright"|"Pub Date"|"Status Code"|"Discount Code"|"Page Count"|"Binding"|"US List Price"|"Author"|"Publisher"|"Imprint"|"Description"|"TOC"|"Major Description"|"Minor Description"|"Primary BISAC"|"Secondary BISAC"|"Tertiary BISAC"|"Media Type Description"|"Carton Qty"|"Unit Weight"|"Next Edition ISBN"|"Next Edition Pub Date"|"MTO Flag"|
"0534264964"|"9780534264963"|"Distance EducationA Systems View"||"001"|"1996"|"11/3/1995 12:00:00 AM"|"RRA"|"A"|"304"|"HB"|"66.95"|"Michael G. Moore Greg Kearsley"|"Cengage Learning"|"Cengage Learning"|"The only comprehensive and current book on the subject of distance education, this book utilizes a systems approach to organize and justify material and includes information on the fundamental issues of distance education as well as the theory, research, and practice."|"1. Fundamentals of Distance Education. 2. The Historical Context of Distance Education. 3. The Scope of Distance Education. 4. Research on Effectiveness. 5. Technologies and Media. 6. Course Design and Development. 7. Teaching and Tutoring. 8. The Distance Education Student. 9. Administration, Management, and Policy. 10. The Theoretical Basis for Distance Education. 11. International Perspectives. 12. The Transformation of Education. Appendices. Glossary. Bibliography. Index."|"Education"|"Education"|"BUSINESS & ECONOMICS / Education"|"EDUCATION / General"|"EDUCATION / Teaching Methods & Materials / General"|"Bound Book"|"26"|"1.3000"|"9780534506889"|"1/1/0001 12:00:00 AM"|"N"|
bash-3.2$ vi 20240501cengagelearning_titles.TXT
ÿþ"^@I^@S^@B^@N^@"^@|^@"^@I^@S^@B^@N^@-^@1^@3^@"^@|^@"^@F^@u^@l^@l^@ ^@T^@i^@t^@l^@e^@"^@|^@"^@P^@r^@e^@v^@i^@o^@u^@s^@ ^@I^@S^@B^@N^@"^@|^@"^@E^@d^@i^@t^@i^@o^@n^@"^@|^@"^@C^@o^@p^@y^@r^@i^@g^@h^@t^@"^@|^@"^@P^@u^@b^@ ^@D^@a^@t^@e^@"^@|^@"^@S^@t^@a^@t^@u^@s^@ ^@C^@o^@d^@e^@"^@|^@"^@D^@i^@s^@c^@o^@u^@n^@t^@ ^@C^@o^@d^@e^@"^@|^@"^@P^@a^@g^@e^@ ^@C^@o^@u^@n^@t^@"^@|^@"^@B^@i^@n^@d^@i^@n^@g^@"^@|^@"^@U^@S^@ ^@L^@i^@s^@t^@ ^@P^@r^@i^@c^@e^@"^@|^@"^@A^@u^@t^@h^@o^@r^@"^@|^@"^@P^@u^@b^@l^@i^@s^@h^@e^@r^@"^@|^@"^@I^@m^@p^@r^@i^@n^@t^@"^@|^@"^@D^@e^@s^@c^@r^@i^@p^@t^@i^@o^@n^@"^@|^@"^@T^@O^@C^@"^@|^@"^@M^@a^@j^@o^@r^@ ^@D^@e^@s^@c^@r^@i^@p^@t^@i^@o^@n^@"^@|^@"^@M^@i^@n^@o^@r^@ ^@D^@e^@s^@c^@r^@i^@p^@t^@i^@o^@n^@"^@|
Fundamentally, what is the content of this file supposed to be (based on what you are working on and your prior experience with the same or similar data files)?
@palex, please show your workings. Stating that something doesn't work is useless unless accompanied by supporting evidence. This is all basic investigative stuff.
Given the sample you provided, what would you expect the output to be?
Supply your existing code, so that we have something concrete to give feedback on.
Otherwise we're trying to solve a problem that we don't have any real details on.
We are suffering from the same malaise you've alluded to, i.e. you are not giving any real help or a clear description.
awk -f junk.awk junk.txt
ÿþ"ISBN" ["0534264964"]
"ISBN-13" ["9780534264963"]
"Full Title" ["Distance EducationA Systems View"]
"Previous ISBN" []
"Edition" ["001"]
"Copyright" ["1996"]
"Pub Date" ["11/3/1995 12:00:00 AM"]
"Status Code" ["RRA"]
"Discount Code" ["A"]
"Page Count" ["304"]
"Binding" ["HB"]
"US List Price" ["66.95"]
"Author" ["Michael G. Moore Greg Kearsley"]
"Publisher" ["Cengage Learning"]
"Imprint" ["Cengage Learning"]
"Description" ["The only comprehensive and current book on the subject of distance education, this book utilizes a systems approach to organize and justify material and includes information on the fundamental issues of distance education as well as the theory, research, and practice."]
"TOC" ["1. Fundamentals of Distance Education. 2. The Historical Context of Distance Education. 3. The Scope of Distance Education. 4. Research on Effectiveness. 5. Technologies and Media. 6. Course Design and Development. 7. Teaching and Tutoring. 8. The Distance Education Student. 9. Administration, Management, and Policy. 10. The Theoretical Basis for Distance Education. 11. International Perspectives. 12. The Transformation of Education. Appendices. Glossary. Bibliography. Index."]
"Major Description" ["Education"]
"Minor Description" ["Education"]
"Primary BISAC" ["BUSINESS & ECONOMICS / Education"]
"Secondary BISAC" ["EDUCATION / General"]
"Tertiary BISAC" ["EDUCATION / Teaching Methods & Materials / General"]
"Media Type Description" ["Bound Book"]
"Carton Qty" ["26"]
"Unit Weight" ["1.3000"]
"Next Edition ISBN" ["9780534506889"]
"Next Edition Pub Date" ["1/1/0001 12:00:00 AM"]
"MTO Flag" ["N"]
Stripping non-printing characters is also relatively simple.
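As a rough sketch of that stripping approach, tr can delete the NUL bytes and the two leading marker bytes. This is only safe under the assumption that every character in the file is plain ASCII; anything else would be mangled, so a real conversion is the more robust route. The sample data and the variable names below are made up for illustration.

```shell
# Hypothetical sample: the bytes FF FE, then "ISBN"| and a newline,
# each character followed by a NUL byte (as seen in the vi output).
sample=$(mktemp)
printf '\377\376"\000I\000S\000B\000N\000"\000|\000\n\000' > "$sample"

# Delete every NUL, 0xFF and 0xFE byte. LC_ALL=C makes tr work on raw bytes.
# ASSUMPTION: the text is pure ASCII, so those bytes carry no information.
stripped=$(LC_ALL=C tr -d '\000\377\376' < "$sample")
echo "$stripped"

rm -f "$sample"
```

On the real file the same tr command would read from 20240501cengagelearning_titles.TXT and write to a new file of your choosing.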
Each of those ^@ things is an ASCII NUL character, and they alternate with single readable characters. So each character seems to be taking two bytes.
The terminal just ignores these NULs (as timing padding). vi makes them visible using ^ to mark a control character.
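One way to check the two-bytes-per-character theory without an editor is to dump the raw bytes and count the NULs. The following is a minimal sketch on a made-up sample, not on your actual file; the file contents and variable names are assumptions for illustration.

```shell
# Hypothetical sample: two marker bytes, then "ISBN" and a newline,
# each character followed by a NUL byte.
sample=$(mktemp)
printf '\377\376I\000S\000B\000N\000\n\000' > "$sample"

# Byte-by-byte dump; the NULs appear between the readable characters.
od -c "$sample"

# Count all bytes, then count only the NUL bytes.
total=$(wc -c < "$sample" | tr -d ' ')
nuls=$(LC_ALL=C tr -cd '\000' < "$sample" | wc -c | tr -d ' ')
echo "total bytes: $total, NUL bytes: $nuls"

rm -f "$sample"
```

Here that gives 12 bytes with 5 NULs: after the 2-byte marker, each of the 5 characters occupies two bytes. A similar ratio on the real file would support the 16-bit theory.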
I am 90% sure that this data is now being written in Windows wide (16-bit) character format by your "friends" in the organisation. The keywords here are UTF-16 and wchar_t. I suspect the first two characters are a 16-bit marker for the wide character style.
I discovered that the first two bytes (the byte order mark, or BOM) are used to indicate whether the UTF-16 is Big-Endian (0xFE 0xFF) or Little-Endian (0xFF 0xFE), a problem that UTF-8 sidesteps very neatly.
Those are immediately converted to UTF-8 (!) as 0xC3 0xBE 0xC3 0xBF (big-endian) or 0xC3 0xBF 0xC3 0xBE (little-endian), which encode LATIN SMALL LETTER THORN (þ, U+00FE) and LATIN SMALL LETTER Y WITH DIAERESIS (ÿ, U+00FF). That is why your vi starts out with ÿþ: the bytes 0xFF 0xFE, in that order.
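The marker bytes can be read directly with od rather than inferred from what vi displays. This is a small sketch on a made-up sample; the variable names are hypothetical, and it only distinguishes the two UTF-16 markers discussed here.

```shell
# Hypothetical sample: marker FF FE followed by "A|B" and a newline in 16-bit form.
sample=$(mktemp)
printf '\377\376A\000|\000B\000\n\000' > "$sample"

# Dump the first two bytes as hex and squeeze out the whitespace.
bom=$(od -An -tx1 -N 2 "$sample" | tr -d ' \n')

# Classify the marker. Note: a file starting FF FE 00 00 would be UTF-32LE,
# which this sketch does not check for.
case "$bom" in
    fffe) encoding=UTF-16LE ;;
    feff) encoding=UTF-16BE ;;
    *)    encoding=unknown ;;
esac
echo "first bytes: $bom -> $encoding"

rm -f "$sample"
```

Running the od line against 20240501cengagelearning_titles.TXT should show ff fe if the reading of ÿþ above is right.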
So your encoding is UTF-16LE (aka UTF16LE) as found with iconv -l | grep 'UTF.*16'
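With the encoding identified, the conversion itself might look like the sketch below. It uses -f UTF-16 so that iconv reads the marker to pick the byte order and drops it; with -f UTF-16LE the marker may be carried into the output as a character. The tr step assumes Windows-style CR LF line endings, which the sample in this thread does not confirm. The sample data and variable names are made up.

```shell
# Hypothetical sample: marker FF FE plus one header line in UTF-16LE,
# ending in CR LF (ASSUMPTION: the real file comes from Windows).
sample=$(mktemp)
converted=$(mktemp)
{
    printf '\377\376'
    printf '"ISBN"|"ISBN-13"|\r\n' | iconv -f UTF-8 -t UTF-16LE
} > "$sample"

# Decode UTF-16 (byte order taken from the marker), write UTF-8,
# and strip carriage returns so line-based tools see clean lines.
iconv -f UTF-16 -t UTF-8 "$sample" | tr -d '\r' > "$converted"

first_line=$(head -n 1 "$converted")
echo "$first_line"

rm -f "$sample" "$converted"
```

For the real data the input would be 20240501cengagelearning_titles.TXT, redirected to a new file, after which vi and the existing scripts should see ordinary single-byte text.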
You might find that setting a locale to one of those names (temporarily, and in a subshell) helps you send the data out in a more readable form.