Duplicate rows in CSV files based on values

vbhonde11 · April 15, 2011, 4:48pm

I am new to this forum and this is my first post.

I am looking at an old post with exactly the same title. I cannot paste the URL because I do not have 5 posts yet.

My requirement is exactly the opposite.

I want to get rid of duplicate rows and append the values of the columns that differ in those rows.

Input

abc, first line, value1
def, second line, value2
def, second line, value3
ghi, third line, value4

Output

abc, first line, value1
def, second line, "value2,value3"
ghi, third line, value4

DGPickett · April 15, 2011, 4:55pm

sort and use sed, with two lines in the buffer, to fold them together.
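
A minimal sketch of that sort-and-fold idea, assuming exactly three columns, that the first two columns form the key, and that no field contains an embedded comma or quote. The function name fold_sorted is hypothetical. sed keeps two lines in the pattern space (N), joins them when the key repeats, and otherwise prints the first line and carries on with the second (P, D):

```shell
#!/bin/sh
# fold_sorted (hypothetical name): read CSV rows on stdin, sort them, then
# fold consecutive rows that share their first two columns into one row.
# Assumptions: three columns per row, no embedded commas or quotes.
# First s command: the buffered row is already quoted, so extend the quoted list.
# Second s command: first merge for this key, so start a quoted list.
# After each successful merge, "ta" loops back to pull in the next line.
fold_sorted() {
  LC_ALL=C sort | sed \
    -e ':a' \
    -e '$!N' \
    -e 's/^\(\([^,]*,[^,]*,\) *\)"\(.*\)"\n\2 *\(.*\)$/\1"\3,\4"/' \
    -e 'ta' \
    -e 's/^\(\([^,]*,[^,]*,\) *\)\(.*\)\n\2 *\(.*\)$/\1"\3,\4"/' \
    -e 'ta' \
    -e 'P' \
    -e 'D'
}

# Example run on the sample rows from the first post.
printf '%s\n' \
  'abc, first line, value1' \
  'def, second line, value2' \
  'def, second line, value3' \
  'ghi, third line, value4' | fold_sorted
```

On the sample input this should fold the two def rows into def, second line, "value2,value3" and leave the other rows unchanged. Rows with extra columns or quoted fields would need a different pattern.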

vbhonde11 · April 15, 2011, 5:07pm

Thanks a lot for your quick reply. I appreciate your help, but I am new to scripting. Could you please add some sample code? I can modify it to fit my requirement.

Shell_Life · April 15, 2011, 6:05pm

See if this works for you:

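#!/usr/bin/ksh
# Merge the third column of consecutive rows whose first two columns match.
# Each row is only compared with the previous one, so Inp_File must already
# be grouped (for example, sorted) on those two columns.
typeset -i mCnt=0
mPFld1="First_Time"   # sentinel: no row has been read yet
mPFld2=""
IFS=','
while read mFld1 mFld2 mValue
do
  # Key changed: print the group collected so far, then start a new one.
  if [[ "${mFld1}" != "${mPFld1}" || "${mFld2}" != "${mPFld2}" ]]; then
    if [[ "${mPFld1}" != "First_Time" ]]; then
      # 'COMMA' stands in for a literal comma: with IFS=',' the unquoted
      # expansions would otherwise be split on real commas.
      if [[ ${mCnt} -gt 1 ]]; then
        echo ${mPFld1}'COMMA'${mPFld2}'COMMA"'${mOutValue}'"'
      else
        echo ${mPFld1}'COMMA'${mPFld2}'COMMA'${mOutValue}
      fi
    fi
    mOutValue=''
    mCnt=0
  fi
  # Append the current value to the list for this key.
  if [[ "${mOutValue}" = "" ]]; then
    mOutValue=${mValue}
  else
    mOutValue=${mOutValue}'COMMA'${mValue}
  fi
  mPFld1=${mFld1}
  mPFld2=${mFld2}
  mCnt=${mCnt}+1   # evaluated arithmetically because of typeset -i
done < Inp_File
# Print the last group, which the loop never gets to flush.
if [[ "${mPFld1}" != "First_Time" ]]; then
  if [[ ${mCnt} -gt 1 ]]; then
    echo ${mPFld1}'COMMA'${mPFld2}'COMMA"'${mOutValue}'"'
  else
    echo ${mPFld1}'COMMA'${mPFld2}'COMMA'${mOutValue}
  fi
fi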

Run it as follows:

$ my_script > Out_File
$ sed 's/COMMA/,/g' Out_File

vbhonde11 · April 15, 2011, 6:32pm

Thanks a lot for your reply. I really appreciate your quick help.
The output is as follows:

 
abc, first line, value1
def, second line," value2, value3"
ghi, third line, value4
,,

I am sorry, but I have one more question. Suppose the file has 3 more columns:

Input

 
abc, first line, value1, col1,col2,col3
def, second line, value2, col4,col5,col6
def, second line, value3, col4,col5,col6
ghi, third line, value4, col7,col8,col9

output

 
abc, first line, value1, col1,col2,col3
def, second line," value2, value3",col4,col5,col6
ghi, third line, value4,  col7,col8,col9

Will there be a major change in this code? I am trying it now. I am also trying to get rid of those extra commas on the last line of the output file.
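
A stray ,, line like the one in the output above is what the ksh script would print if Inp_File contains an empty line, for example a trailing blank line: the empty row is read as a new key with three empty fields. A hedged sketch of one way around it, assuming blank lines carry no data, is to drop them before the merge step. The function name drop_blank_lines and the output file name are hypothetical:

```shell
#!/bin/sh
# drop_blank_lines (hypothetical name): delete lines that are empty or
# contain only whitespace. Reads the named files, or stdin if none are given.
drop_blank_lines() {
  sed '/^[[:space:]]*$/d' "$@"
}

# Example use (file names are illustrative):
#   drop_blank_lines Inp_File > Inp_File.clean
```

The merge script can then read the cleaned file instead of the original.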

yinyuemi · April 15, 2011, 7:02pm

echo "abc, first line, value1, col1,col2,col3
def, second line, value2, col4,col5,col6
def, second line, value3, col4,col5,col6
ghi, third line, value4, col7,col8,col9" |sed -n -r '1h;{2,$H;x;s/(.*), "?(.*), ([^\n]*)\n\1, (.*), \3/\1 "\2, \4",\3/;h};${s/", /, /g;p}'
abc, first line, value1, col1,col2,col3
def, second line "value2, value3",col4,col5,col6
ghi, third line, value4, col7,col8,col9

---------- Post updated at 06:02 PM ---------- Previous update was at 05:52 PM ----------

awk would be much more controllable

echo "abc, first line, value1, col1,col2,col3
def, second line, value2, col4,col5,col6
def, second line, value3, col4,col5,col6
ghi, third line, value4, col7,col8,col9" |awk '{sub(",","",$4);x=$1 FS $2 FS $3 FS $5;a[x]=a[x]?a[x] FS $4:$4}END{for(i in a) {split(i,b,FS);print b[1],b[2],b[3],a[i]~FS?"\""a[i]"\",":a[i]",",b[4]}}'
abc, first line, value1, col1,col2,col3
def, second line, "value2 value3", col4,col5,col6
ghi, third line, value4, col7,col8,col9
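
A more readable sketch of the same awk idea, splitting on commas instead of whitespace and keeping rows in first-seen order (for (i in a) does not guarantee any order). It assumes comma-separated input with at least three columns and no quoted or embedded commas. The function name merge_col3 is hypothetical, and spacing around the merged value is normalised rather than copied from the input:

```shell
#!/bin/sh
# merge_col3 (hypothetical name): merge the third column of rows whose other
# columns are all identical, keeping first-seen order.
# Assumptions: comma-separated input, no quoted or embedded commas, and at
# least three columns per row.
merge_col3() {
  awk '
    BEGIN { FS = OFS = "," }
    /^[ \t]*$/ { next }                      # ignore blank lines
    {
      val = $3
      gsub(/^[ \t]+|[ \t]+$/, "", val)       # trim the value
      $3 = SUBSEP                            # placeholder marks where the value goes
      key = $0                               # row rebuilt with the placeholder
      if (key in vals) {
        vals[key] = vals[key] "," val
        merged[key] = 1
      } else {
        vals[key] = val
        order[++n] = key
      }
    }
    END {
      for (i = 1; i <= n; i++) {
        key = order[i]
        v = vals[key]
        if (key in merged) v = "\"" v "\""   # quote only merged values
        split(key, part, SUBSEP)
        print part[1] " " v part[2]
      }
    }
  ' "$@"
}

# Example run on the six-column sample.
printf '%s\n' \
  'abc, first line, value1, col1,col2,col3' \
  'def, second line, value2, col4,col5,col6' \
  'def, second line, value3, col4,col5,col6' \
  'ghi, third line, value4, col7,col8,col9' | merge_col3
```

Because the key is the whole row with the third column blanked out, the same function works for the three-column input from the first post.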

DGPickett · April 18, 2011, 4:52pm

Is this CSV code robust against quoted commas?

Shell_Life · April 19, 2011, 11:32am

DGPickett,

The code does not handle quoted commas.

It works for what was specified in the original post.

We have to remember that the solutions here are designed to solve
the original post description and stated rules.

An experienced IT professional can always propose “what-if” cases
that may cause code to break. Consider the following situations:

Input:

field1, field2, “last1, first1”
field1, field2, “last2, first2”
fieldA, fieldB, last1
fieldA, fieldB, first1
fieldA, fieldB, last2
fieldA, fieldB, first2

According to the post rules, the output would be:

field1, field2, “last1, first1, last2, first2”
fieldA, fieldB, “last1, first1, last2, first2”

There could be quoted comma in the first two fields:

"Last1, First1", "Last2, First2", Value

Also, how would quotation marks inside quoted fields work?

Input:

field1, field2, “His Answer: \“Yes, I do.\””

or:

field1, field2, “His Answer: ““Yes, I do.”””

As you can see, there are several other instances of “what-ifs” where the solution could become
very complex and thus not practical for the purpose of solving the original post where the member
does not have such cases.
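
For readers who do have such cases, here is a minimal sketch of quote-aware field splitting in awk. It assumes plain ASCII double quotes, with a doubled quote ("") standing for a literal quote inside a quoted field. It does not handle the backslash-escaped form or fields that span several lines. The names csv_fields and csv_split are hypothetical:

```shell
#!/bin/sh
# csv_fields (hypothetical name): print each field of every input row as
# "index:[value]", treating commas inside double quotes as data.
# Assumptions: ASCII double quotes, "" as the escape for a literal quote,
# one record per line.
csv_fields() {
  awk '
    # Split line into out[1..n], walking it one character at a time.
    function csv_split(line, out,    n, i, c, inq, cur) {
      n = 0; inq = 0; cur = ""
      for (i = 1; i <= length(line); i++) {
        c = substr(line, i, 1)
        if (inq) {
          if (c != "\"") cur = cur c
          else if (substr(line, i + 1, 1) == "\"") { cur = cur "\""; i++ }
          else inq = 0                       # closing quote
        }
        else if (c == "\"") inq = 1          # opening quote
        else if (c == ",") { out[++n] = cur; cur = "" }
        else cur = cur c
      }
      out[++n] = cur
      return n
    }
    {
      n = csv_split($0, f)
      for (i = 1; i <= n; i++) {
        gsub(/^ +| +$/, "", f[i])            # trim surrounding spaces
        printf "%d:[%s]\n", i, f[i]
      }
    }
  ' "$@"
}

# Example: quoted commas in the first two fields.
printf '%s\n' '"Last1, First1", "Last2, First2", Value' | csv_fields
```

A merge script could call a splitter like this in place of a bare comma split, at the cost of re-quoting fields on output.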

Corona688 · April 19, 2011, 11:40am

Such as someone editing a code solution in Microsoft Word, adding surplus linebreaks and converting ordinary quotes into special “ ” quotes.

Shell_Life · April 19, 2011, 11:50am

I am glad to see that some people really pay attention to what is written.

Cheers.

DGPickett · April 19, 2011, 3:05pm

Well, quoted commas are implied whenever one says CSV, though they are often neglected, so I did not drag anything new in. I am into robust solutions, but there may be no commas in the data here . . . yet.

There are JDBC and ODBC tools that can handle CSV files like database tables: filtering, combining and sorting them as you please to produce new CSV files. I used to correspond with a nice Chinese guy, Dawei, at HXTT, adding features to their flat file database tool. They have trial versions, and they work fine on the command line with xigole jisql. I also used JStels.

HXTT Text JDBC Drivers and CSV JDBC Drivers

Jisql - a Java based interactive SQL application

StelsCSV JDBC CSV Driver Documentation