Regular expression / regex substitution on Unicode text

thomas.hedden · February 2, 2010, 10:51pm

I have a large file encoded in Unicode that I need to convert to CSV. In general, I know how to do this by regular expression substitutions using sed or Perl, but one problem I am having is that I need to put a quotation mark at the end of each line to protect the last field. The usual regex substitution ...
s/$/"/
... works fine for 7-bit ASCII text, but when I run this on my Unicode text file, the double quotation mark appears at the BEGINNING of the FOLLOWING line, not at the end of the line on which it's supposed to appear.
The file came from a Windows system, but piping through dos2unix doesn't seem to make any difference. I've tried the "use Encode;" pragma with several different encodings, but I get the same result. Perhaps I'm doing something wrong. Does anyone know of a special library function intended for this purpose, a Perl pragma, etc., that would accomplish this easily? This should be a trivial problem.
Thanks in advance for any suggestions.
Tom

---------- Post updated at 10:51 PM ---------- Previous update was at 10:49 AM ----------

As I mentioned, the regex ...
s/$/"/
... puts the `"' at the beginning of the following line, and piping through dos2unix doesn't matter one way or the other. Using `\n' instead of `$' doesn't make any difference. However, I discovered an interesting fact: The regex ...
s/\r/"/ # use `\r' instead of `$' or `\n'
... gives the expected result, and does so whether or not I pipe through dos2unix! I also found that a C program using a wchar_t declaration behaves similarly. That is, a character that should be output BEFORE the EOL actually appears after it if the EOL is matched as '\n', but if it is matched as '\r' then the character is output as expected.
My immediate problem is solved; however, I wonder whether this shows a bug in Perl or in its regular expression engine ...
This seems to be a clear case where Unicode text is handled differently than non-Unicode text.
Any opinions?
Tom
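
A minimal sketch of one possible explanation for the behaviour described above, assuming the lines end in CRLF (the thread never shows the data, so this is a guess). It is written in Python rather than Perl, and the sample record and function names are hypothetical. The anchor `$` matches just before the final "\n", which is after the "\r", so the quote lands between the two; a viewer that treats a bare "\r" as a line break would then show the quote at the start of the next line.

```python
import re


def quote_at_dollar(line: str) -> str:
    # Comparable to s/$/"/ applied to one line: `$` matches just before
    # the final "\n", which is AFTER any "\r" that precedes it.
    return re.sub(r'$', '"', line, count=1)


def quote_replacing_cr(line: str) -> str:
    # Comparable to s/\r/"/: the quote takes the place of the carriage return.
    return line.replace('\r', '"', 1)


line = 'a,b\r\n'  # hypothetical CRLF-terminated record
print(repr(quote_at_dollar(line)))     # 'a,b\r"\n'
print(repr(quote_replacing_cr(line)))  # 'a,b"\n'
```

If this assumption holds, replacing "\r" looks correct simply because the quote ends up where the carriage return was, directly before the "\n".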

adderek · February 7, 2010, 9:55am

Hello,

Please post some example data and a detailed test case for the issue you are experiencing.
Don't explain - that is just a waste of time (your time and the reader's time). Show it.

Unicode data? You need Unicode tools...
Are you trying to turn multiple lines into a single long line? Although it seems strange, it might be what you need...
Try applying the regexp to the file as if it were one large block - not a "set of lines" - it might help you.
Keep in mind that your solution might be very slow.
Also keep in mind the old saying: "You had a problem and decided to solve it by using regexps? So now you have two problems."
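
The "Unicode tools" and "large block" suggestions above can be sketched as follows. This is only an illustration in Python: it assumes the file is UTF-16 with a byte order mark and CRLF line endings, and that UTF-8 output is acceptable. None of this is confirmed in the thread, and quote_line_ends is a hypothetical helper name.

```python
import re


def quote_line_ends(raw: bytes, encoding: str = 'utf-16') -> bytes:
    # Decode first so the regex sees characters, not individual bytes.
    # Assumption: 'utf-16' input; the codec uses the BOM to pick byte order.
    text = raw.decode(encoding)
    # Normalize Windows line endings so no stray "\r" is left behind.
    text = text.replace('\r\n', '\n')
    # Treat the whole file as one block and insert a quote before every
    # line break. A final line with no trailing "\n" is left unquoted.
    text = re.sub(r'(?=\n)', '"', text)
    # Assumption: UTF-8 is the desired output encoding.
    return text.encode('utf-8')


sample = 'a,b\r\nc,d\r\n'.encode('utf-16')  # hypothetical input with BOM
print(quote_line_ends(sample))  # b'a,b"\nc,d"\n'
```

Reading the whole file into memory keeps the sketch short; for a very large file the same decode, normalize, and substitute steps could be applied line by line instead.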