Merging multiple lines into single line based on one column

raju2016 · December 8, 2016, 12:08pm

I Want to merge multiple lines based on the 1st field and keep into single record.

SRC File:

AAA_POC_DB.TAB1
AAA_POC_DB.TAB2
AAA_POC_DB.TAB3
AAA_POC_DB.TAB4
BBB_POC_DB.TAB1
BBB_POC_DB.TAB2
CCC_POC_DB.TAB6

OUTPUT
-----------------

'AAA_POC_DB','TAB1','TAB2','TAB3','TAB4'
'BBB_POC_DB','TAB1','TAB2'
'CCC_POC_DB','TAB6'

RavinderSingh13 · December 8, 2016, 2:03pm

Hello raju2016,

Could you please try following and let me know if this helps you.

awk -F"." -vs1="'" 'FNR==NR{A[$1]=A[$1]?A[$1] OFS s1 $NF s1:s1 $NF s1;next} A[$1]{print s1 $1 s1 OFS A[$1];delete A[$1]}' OFS=,   Input_file  Input_file

Output will be as follows.

'AAA_POC_DB','TAB1','TAB2','TAB3','TAB4'
'BBB_POC_DB','TAB1','TAB2'
'CCC_POC_DB','TAB6'

Thanks,
R. Singh

RudiC · December 8, 2016, 2:25pm

Try also

awk -F. -vFMT="'%s'" '$1 != LAST {printf QRS FMT, $1; QRS = ORS; LAST = $1} {printf "," FMT, $2} END {printf ORS}' file
'AAA_POC_DB','TAB1','TAB2','TAB3','TAB4'
'BBB_POC_DB','TAB1','TAB2'
'CCC_POC_DB','TAB6'

raju2016 · December 8, 2016, 2:34pm

Thanks RudiC and ravinder....its working

greet_sed · December 8, 2016, 5:37pm

Hi,

RudiC & Ravinder solutions are

Here is another try

while IFS=. read a b;
do
if [ "$x" != "$a" ]; then 
x=$a
printf "\n'%s','%s'" "$x" "$b"
else 
printf ",'%s'" "$b"
fi
done < file

This gives following output for given input in post #1 ( one empty line in the beginning !) .

#!/bin/bash

while IFS=. read a b;
do
if [ "$x" == "$a" ]; then 
printf ",'%s'" "$b"
else 
x=$a
printf "'%s','%s'" "$x" "$b"
fi
done < file | sed -e 's/'"'"''"'"'/'"'"'\n'"'"'/g'

Gives the desired output as per post #1.

MadeInGermany · December 9, 2016, 4:35am

While the sed hack can be generalized for Unix

sed -e 's/'\'\''/'\''\
'\''/g'

it is actually not needed:

x=""; while IFS=. read a b
do
  if [ "$x" != "$a" ]; then 
    printf "${x:+\n}'%s','%s'" "$a" "$b"
    x=$a
  else 
    printf ",'%s'" "$b"
  fi
done < file

raju2016 · December 16, 2016, 2:31am

Hi Ravinder/Rudy,

Could you please explain what exactly this AWK is doing

Scrutinizer · December 16, 2016, 3:20am

And the inevitable:

awk -F. '{A[$1]=A[$1] q "," q $2} END{for(i in A) print q i A q}' q=\' file

If the output order is not important (otherwise pipe through sort sort ) and may be helpful in particular if the input is not grouped.

RudiC · December 16, 2016, 5:03am

awk -F. -vFMT="'%s'" '                          # split line at ".", use FMT variable for printing as single
                                                # quote handling is difficult within the awk script
$1 != LAST      {printf QRS FMT, $1             # on changing first field (LAST is the empty string in the first record
                                                # so pattern is TRUE) print it after a temporary record separator (which
                                                # is empty, too, on the 1. record)
                 QRS = ORS                      # now set temp sep to line feed for all remaining records
                 LAST = $1                      # keep track of the first field for the upcoming pattern checksy
                }
                {printf "," FMT, $2             # now keep adding the quoted second fields until first field changes
                }
END             {printf ORS                     # final line feed 
                }
' file

ronaldxs · December 16, 2016, 2:02pm

Just wondering if last printf END {printf ORS} might not be slightly improved as END {printf QRS} to suppress empty line for empty input.

RavinderSingh13 · December 17, 2016, 3:52am

Hello raju2016,

Could you please go through the following and let me know if this helps, also it is not the code in running form, it is only expanded for explanation purposes.

awk                                            #### Starting awk here.
-F"."                                          #### Mentioning -F as .(DOT) here so e could make custom delimiters in awk as per our requirement so making DOT as a delimiter.
-vs1="'"                                       #### defining a variable named s1 whose value is '. So in awk we could define variables by using -v variable_name="value".
'FNR==NR                                       #### mentioning here FNR==NR condition. Where FNR and NR both are awk's in-built keywords and tell us the number of lines in any 
                                                    Input_file but a major difference between FNR and NR is FNR's value will be RESET each time a new Input_file is being read 
                                                    and NR's value will keep on increase till the last Input_file being read(as in awk we could mention mutiple files as an Input).
{A[$1]=A[$1]?A[$1] OFS s1 $NF s1:s1 $NF s1;    #### So creating here an array named A whose index is $1(first field) and whose value is s1 $NF s1 if that element's index is NOT registered in array A.
                                               #### Where s1 as mentioned before is a variable with value of ' and $NF denoted the last field of a line which is being read. So if a value is find in array A
                                                    which is already registered(mentioned in code by ?) then A's that specific index's value will be A[$1] OFS s1 $NF s1, which means appending the current value
                                                    into the previous A's value of current index(which is $1 from each line).
next}                                          #### Now using next which is a in-built keyword of awk. So by mentioning this we are skipping all further statements now.
A[$1]{                                         #### A[$1], so this is a condition which will be executed when 2nd time Input_file is being read and it make sure if any first field is present in array A then execute following statements.
print s1 $1 s1 OFS A[$1];                      #### printing here s1 $1 s1 OFS A[$1], where s1 kis a variable mentioned above, $1 is first field of line and OFS is Output field separator(awk's in-built keyword, whose default value is a space) then A[$1], which will print array A's value whose index is $1 of current line.
delete A[$1]                                   #### deleting the array A's element whose index is $1 so that we will not execute already executed $1 as we are reading Input_file twice.
}' OFS=,   Input_file  Input_file              #### Mentioning the value of OFS as a comma(,) which is awk's in-built keyword denotes as Ouptut field separator, mentioning Input_file 2 times too here.

Thanks,
R. Singh