I have a large file encoded in Unicode that I need to convert to CSV. In general, I know how to do this by regular expression substitutions using sed or Perl, but one problem I am having is that I need to put a quotation mark at the end of each line to protect the last field. The usual regex substitution ...
s/$/"/
... works fine for 7-bit ASCII text, but when I run this on my Unicode text file, the double quotation mark appears at the BEGINNING of the FOLLOWING line, not at the end of the line on which it's supposed to appear.
The file came from a Windows system, but piping through dos2unix doesn't seem to make any difference. I've tried the Encode module ("use Encode;") with several different encodings, but I get the same result. Perhaps I'm doing something wrong. Does anyone know of a special library function intended for this purpose, a Perl pragma, etc., that would accomplish this easily? This should be a trivial problem.
Thanks in advance for any suggestions.
Tom
---------- Post updated at 10:51 PM ---------- Previous update was at 10:49 AM ----------
As I mentioned, the regex ...
s/$/"/
... puts the `"' at the beginning of the following line, and piping through dos2unix doesn't matter one way or the other. Using `\n' instead of `$' doesn't make any difference. However, I discovered an interesting fact: The regex ...
s/\r/"/ # use `\r' instead of `$' or `\n'
... gives the expected result, and it does so whether or not the input is piped through dos2unix! I also found that a C program using a wchar_t declaration behaves similarly. That is, a character that appears as if it should be output BEFORE the EOL actually appears after it if it is matched as '\n', but if it is matched as '\r' then it is output as expected.
My immediate problem is solved; however, I wonder whether this shows a bug in Perl or in its regular expression engine ...
This seems to be a clear case where Unicode text is handled differently than non-Unicode text.
Any opinions?
Tom