Merge 2 CSV files using sed

Help in writing a script using sed which updates fileOne with the contents from fileTwo

Example:
Contents of fileOne

1,111111
2,897823
3,235473
4,222222

Contents of fileTwo

1,111111,A,1,2
4,222222,A,2,2
5,374632,A,3,2
6,374654,A,4,2

Final File should be:

1,111111,A,1,2
2,897823
3,235473
4,222222,A,2,2
5,374632,A,3,2
6,374654,A,4,2

Thanks in advance...

Best done with awk:

 
awk -F"," 'NR==FNR {a[$1]=$0} {a[$1]=$0;} END { for(i=1;i in a;i++) print a}' file1 file2

Thanks for the reply... Could you please point to any tutorial which explains about the funtionalities used in your command... like array 'a[$1]', 'in a',...

you can try sed but awk is more usefull if subject is condition.

# cat fileOne
1,111111
2,897823
3,235473
4,222222
# cat fileTwo
1,111111,A,1,2
4,222222,A,2,2
5,374632,A,3,2
6,374654,A,4,2
# ./justdoit fileOne fileTwo
1,111111,A,1,2
2,897823
3,235473
4,222222,A,2,2
5,374632,A,3,2
6,374654,A,4,2
# cat justdoit
## justdoit ##
#!/bin/bash
for i in $(sed 's/^\([0-9]\),.*/\1/' $1)
 do
   [ $(sed -n "/^$i,/p" $2|grep "$i,") ] || (sed "$i i $(sed -n "/^$i,/p" $1) " $2 >$2tmp ; mv $2tmp $2 )
 done ; more $2

regards
ygemici

Thanks,

But in my example i need to compare only the second column, I am posting the original contents of my files

file1:

"20","9900000007"
"3","9900000001"
"3","9900000002"
"4","9900000003"

file2

"20","9900000007","1","A","2"
"3","9900000001","1","A","2"
"13","9900000004","0","A","2"
"17","9900000005","1","A","2"

outPutFile:

"20","9900000007","1","A","2"
"3","9900000001","1","A","2"
"3","9900000002"
"4","9900000003"
"13","9900000004","0","A","2"
"17","9900000005","1","A","2"

I have written this script, which has 2 problem

#!/bin/bash
while read updateline
do
   updateField=`echo $updateline | awk -F "," '{print $2}'`
   while read initialline
   do
      initialField=`echo $initialline | awk -F "," '{print $2}'`
      if [ "$initialField" == "$updateField" ]; then
         sed -e "s/$initialline/$updateline/" <file1 
      fi
   done < file1 
done < file2 >tmpUpdateFile

output of this script:

"20","9900000007","1","A","2"
"3","9900000001"
"3","9900000002"
"4","9900000003"
"20","9900000007"
"3","9900000001","1","A","2"
"3","9900000002"
"4","9900000003"

This is not the expected result, problem are,

  1. During each loop, the content is getting appended
  2. In my script, in the else part how to append the variable $updateline.

Thanks....

awk -F"," '{
if(NR == FNR){
  _[$1]=$2
}
else{
  if(exists _[$1]){
    _[$1]="---"
  }
  print $0
  }
}
END{
	for(i in _){
		if(_ != "---" && _ != "")
			print i","_
	}
}' a b | sort -t"," -n

One row from file1 is missing using this logic

output got:

"20","9900000007","1","A","2"
"3","9900000001","1","A","2"
"13","9900000004","0","A","2"
"17","9900000005","1","A","2"
"4","9900000003"

output expected:

"20","9900000007","1","A","2"
"3","9900000001","1","A","2"
"13","9900000004","0","A","2"
"17","9900000005","1","A","2"
"4","9900000003"
"3","9900000002"

Based on your example in post 4 and assuming that you really mean the second column and only the second column

$ join -a 1 -a 2 -t, -j 2 file1 file2 | awk -F, '{ if (NF==6) print $2,$1,$4,$5,$6; else print $2,$1,$3,$4,$5}'
"20" "9900000007" "1" "A" "2"
"3" "9900000001" "1" "A" "2"
"3" "9900000002"
"4" "9900000003"
"13" "9900000004" "0" "A" "2"
"17" "9900000005" "1" "A" "2"

:b:

I try to write a script for compare columns issue :wink:
if wanted more can customized more further (for exa i can add "-F" (field separator so we can change other chars instead of comma) parameter. in this script field separator (column separator) is comma for your examples)

lets we look your examples ..

lets run script for your files.

# cat file1
"20","9900000007"
"3","9900000001"
"3","9900000002"
"4","9900000003"
# cat file2
"20","9900000007","1","A","2"
"3","9900000001","1","A","2"
"13","9900000004","0","A","2"
"17","9900000005","1","A","2"

as far i mean , you want second column compare on files then run the script for compare 2 column.our compare column parameter is 'c'
our command is like below and results..

# ./justdoit c=2
['file1'] contains non-digit chars or columns!
['file2'] contains non-digit chars or columns!
--> Showing compare results..
==================================
"20","9900000007","1","A","2"
"3","9900000001","1","A","2"
"3","9900000002"
"4","9900000003"
"13","9900000004","0","A","2"
"17","9900000005","1","A","2"
=================================

and we can use the script for another compare examples..

below file has same columns then the script will ask add or not.
and if we use as 2 column for compare process then script

in file1 has only one column with "9" line and script fails for compare non-first columns choice..also file1 has second column with non-number (uncomparable)!! .. and also it has multiple lines then the script ask this issue for add or not these..

# cat file1
"20","9900000007"
"3","9900000001"
"3","9900000002"
"4","9900000003"
"4","9900000003","AAAAAAA"
"9"
"4","9900000003","AAAAAAA"
"11","NOTNUMBER","BBBBBB","CCCCCC"
# cat file2
"20","9900000007","1","A","2"
"3","9900000001","1","A","2"
"13","9900000004","0","A","2"
"17","9900000005","1","A","2"
"17","9900000005","1","A","2"
"17","9900000005","1","A","2","CCCCCC"
# ./justdoit c=2
['file1'] contains non-digit chars or columns!
['file2'] contains non-digit chars or columns!
['file1'] file, ['"11","NOTNUMBER","BBBBBB","CCCCCC"'] line, ['2'] numbered ['NOTNUMBER'] column, has contain non-digit chars!!
sed: -e expression #1, char 31: invalid reference \1 on `s' command's RHS
Check your values.. file ['file1'], line ['"9"'], compare column ['2']
"SED will give error!! And the result is partially false!!!"
sed: -e expression #1, char 31: invalid reference \1 on `s' command's RHS
sed: -e expression #1, char 31: invalid reference \1 on `s' command's RHS
Multiple lines have same number of columns for compare column ['9900000003'] !!
"4","9900000003","AAAAAAA"
"4","9900000003","AAAAAAA"
Multiple lines add to results [y/n]? y
--> Showing compare results..
==================================
"4","9900000003","AAAAAAA"
"4","9900000003","AAAAAAA"
"20","9900000007","1","A","2"
"3","9900000001","1","A","2"
"3","9900000002"
"17","9900000005","1","A","2","CCCCCC"
"13","9900000004","0","A","2"
=================================

of course examples are reproduced.
and maybe happens false results for unforeseeable issues that i forgot variations..

#!/bin/bash
## justdoit##
## SED column comparison ##
 
maxcolfind()
{
local col='"[^ ]*"'
for file in file1 file2 ; do
maxcol=0 ; z=0 ; newcol=$col
while [[ $(sed -n "/^$newcol/p" $file) ]] ; do
((maxcol++))
newcol="$newcol,$col" ; newcol=$(echo $newcol|sed 's/ ,/,/g')
done
filec[z]=$maxcol ; ((z++))
done
if [[ ${filec[z-1]} -ge ${filec[z]} ]] ; then maxcol=${filec[z-1]} ; else maxcol=${filec[z]} ; fi
}
maxcolfind ""
 
nondigitsfind()
{
for file in file1 file2 ; do
if [[ $(sed -n '/[^0-9]/p' $file) ]] ; then echo "['$file'] contains non-digit chars or columns!" ; fi
done
}
nondigitsfind ""
 
maxcoltmp=$maxcol; countcompare=$maxcol; countcomparereset=$maxcol; listresults=$maxcol
compcol=$( echo "$1" | sed "s/c=\([1-$maxcol]\)/\1/" )
col='"[^ ]*"' ;maxfmt=() ; charforrhs='"[A-Za-z0-9][A-Za-z0-9]*"' ; colsed='\"\[A-Za-z0-9]\[A-Za-z0-9]\*\",'
 
if [ ! $(echo "$compcol" | sed -n "/^[1-$maxcol]$/p") ] ; then
echo "You entered an invalid value for compare column!!
You have max $maxcol --> $col <-- string!!
Usage $0 c=[1-$maxcol] (c--> Compare Which Column)"
exit 1
fi
 
while [ $(( maxcol -= 1 )) -ge 0 ]
 do
  maxfmt=(${maxfmt[@]} $charforrhs)
 done
fmtarr=( $(echo ${maxfmt[@]} | sed 's/\" /",/g') )
x=0 ; compcolarr[x]=${fmtarr[@]} ; colsedtmp=$colsed
while [ $(( maxcoltmp -= 1 )) -ge 1 ]
do
 ((x++)) ; compcolarr[x]=$(echo ${fmtarr[@]}|sed "s/$colsed//")
 colsed=$colsed$colsedtmp
done
 
>newout
x=0;count=0 ; first=ok
for j in file1 file2
do
while [ $(( countcompare -= 1 )) -ge 0 ] ;
 do
 if [[ $(sed -n "/^${compcolarr[x]}$/p" $j) ]] ; then
 for i in $(sed "" $j)
  do
   if [[ $(echo $i|sed -n "/^${compcolarr[x]}$/p") ]] ; then
      findcompcol=$(echo ${compcolarr[x]}|sed 's/\[A-Za-z0-9]\[A-Za-z0-9]\*/\\(&\\)/'$compcol' ' )
    (echo $i|sed "s/$findcompcol/\1/") 2>&1 >/dev/null ## test columns for compare values according to file content
      if [ $? != 0 ] ; then
         echo "Check your values.. file ['$j'], line ['$i'], compare column ['$compcol'] "
         echo "\"SED will give error!! And the result is partially false!!!\" "
      fi
    if [[ ! $(echo $i|sed "s/$findcompcol/\1/"|sed -n '/[^0-9]/p') ]] ; then
      okline="$i" ; isstring=no
    else
      falsecompcol=$(echo $i|sed "s/$findcompcol/\1/")
      echo "['$j'] file, ['$i'] line, ['$((compcol))'] numbered ['$falsecompcol'] column, has contain non-digit chars!!"
      isstring=yes ;
    fi
     if [ ! $isstring = "yes" ] ; then
     comparr=$(echo $okline|sed "s/$findcompcol/\1/" )
     isadded=no ;
     fi
   fi
 
if [ ! $isadded = "yes" ] && [ ! $isstring = "yes" ] ; then
    if [ "$first" = "ok" ] ; then
      addedarr=(${addedarr[@]} $comparr ) ; first=notok ; isadded=yes
    else
     for k in ${addedarr[@]}
      do
        if [[ $comparr = $k ]] ; then
         isadded=yes ;break
        fi
      done
        if [ ! "$isadded" = "yes" ] ; then
          addedarr=(${addedarr[@]} $comparr )
        fi
    fi
fi
 done
 fi
 ((x++)) ; ((count++))
 done
countcompare=$countcomparereset ;x=0; count=0
done
 
l=0 ; listreset=$listresults
for i in ${addedarr[@]}
 do
   for x in "$(sed -n "/$i/p" file1 file2)"
    do
     while [ $(( listresults -= 1 )) -ge 0 ] ;
     do
     if [[ $(echo "$x"|sed -n "/^${compcolarr[l]}$/p") ]] ; then
       addthis=$(echo "$x"|sed -n "/^${compcolarr[l]}$/p")
       if [[ $(echo "$addthis"|sed -n "/$i/p" | sed = | sed -e '$!N;s/\n/ /g' -ne '$s/^\([0-9]\).*/\1/p' ) != 1 ]] ; then
         echo "Multiple lines have same number of columns for compare column ['$i'] !!"
         echo "$addthis"
         read -p "Multiple lines add to results [y/n]? " ch
         if [ $ch = 'y' ] ; then
          echo "$addthis" >>newout
         fi
       else
          echo "$addthis" >>newout
       fi
         break
     fi
     ((l++))
     done
    done
     listresults=$listreset ; l=0
 done
 
echo "--> Showing compare results..
==================================
$(more newout)
================================="

regards
ygemici