Tip: Remove newlines in quoted fields in a csv

A couple of times I had to remove newlines in quoted fields in a csv.
(Once because the Excel import failed in certain locales. Was a bug of course.)
I did not find a robust script solution on the Internet.
So I created the following script with embedded sed code:

#!/bin/sh
# Remove newlines in quoted fields in a csv input
sed '
# Save input in hold buffer
  h
  :Q1
# Substitute an opening "
  s/"//
# If success jump to :Q2
  tQ2
# Restore input; jump to the end (next cycle)
  x; b
  :Q2
# Substitute a closing "
  s/"//
# If success jump to :Q1
  tQ1
# Restore input; jump to the end if last line (having unbalanced quotes)
  x; $b
# Append the next line with a separating space
  N; s/\n/ /
# Save input in hold buffer; jump to :Q1
  h; bQ1
' "$@"

If it finds an opening quote ( ") it will join the next line, until it will find the matching closing quote. In other words, the number of quotes must be even, otherwise the next lines will be joined until it becomes even.
Currently the effective change is s/\n/ / i.e. sed substitutes a newline character by a space character.
It works with all delimiters: comma, semicolon, tab, ...
Quoted quotes ( "" ) are kept unchanged.

Example input

"f1 ""quoted""",f2unquoted,"f3 b
roken",f4unquoted,"f5 on
three
lines"
f1unquoted,f2unquoted,"f3 b
roken",f4unquoted,"f5"

is changed to

"f1 ""quoted""",f2unquoted,"f3 b roken",f4unquoted,"f5 on three lines"
f1unquoted,f2unquoted,"f3 b roken",f4unquoted,"f5"

It prints to stdout, so output should be redirected to a new(!) file.
Of course you can change to sed -i for an inplace-change of the given files.

EDIT: on StackOverflow I found a shorter solution that seems to work as well. Here translated to a Posix sed

#!/bin/sh
sed '
  :a
  /^[^"]*\("[^"]*"[^"]*\)*"[^"]*$/{
    $!N
    s/\n/ /
    ta
  }
' "$@"
7 Likes

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.