ksh script to create a generic csv file from different source formats

Hi all,

I have a requirement to create a "superset" file out of a number of different sources with some different and some same columns.

We intend to have a manually updateable SuperSetCols.csv which would look like

"ColA","ColB","ColC","ColD","ColE","ColF","ColG"

so someday we may add ColH ColI etc We will only ever add to the end of the list as new columns are required.

Our sources look like this:

File0.csv
"ColA","ColB","ColC"
"Dat001","Dat002","Dat003"
"Dat011","Dat012","Dat013"
 
File1.csv
"ColA","ColB","ColD"
"Dat101","Dat102","Dat104"
"Dat111","Dat112","Dat114"
 
File2.csv
"ColD","ColE","ColF","ColG"
"Dat204","Dat205","Dat206","Dat207"
"Dat214","Dat215","Dat216","Dat217"

Our intention is to basically fill the gaps where necessary. So File0 needs columns D-G, File1 C & E-G and File2 A-C. The column names are expected to match identically and again remain in order as per the superset.

We would like to output the files to a 'Super' directory as follows:

Super/File0.csv
"ColA","ColB","ColC","ColD","ColE","ColF","ColG"
"Dat001","Dat002","Dat003","","","",""
"Dat011","Dat012","Dat013","","","",""
 
Super/File1.csv
"ColA","ColB","ColC","ColD","ColE","ColF","ColG"
"Dat101","Dat102","","Dat104","","",""
"Dat111","Dat112","","Dat114","","",""
 
Super/File2.csv
"ColA","ColB","ColC","ColD","ColE","ColF","ColG"
"","","","Dat204","Dat205","Dat206","Dat207"
"","","","Dat214","Dat215","Dat216","Dat217"

So that a different process will be able to concatenate them when it is ready and read them as one common file structure:

"ColA","ColB","ColC","ColD","ColE","ColF","ColG"
"Dat001","Dat002","Dat003","","","",""
"Dat011","Dat012","Dat013","","","",""
"Dat101","Dat102","","Dat104","","",""
"Dat111","Dat112","","Dat114","","",""
"","","","Dat204","Dat205","Dat206","Dat207"
"","","","Dat214","Dat215","Dat216","Dat217"

Many thanks in advance to all who reply, I would greatly appreciate anyone's help on this!

Lee

First, you'll have to make a list of column names. It can be done like

head -n1 file*.csv | tr ',' '\n' | sort -u

For the following, please wait...
But i think perl will do the best for your needs, while it handles map arrays.

Generate title file:

awk 'FNR==1' File*.csv |tr "," "\n" |sort -u > title.txt

"ColA"
"ColB"
"ColC"
"ColD"
"ColE"
"ColF"
"ColG"

create the subfolder Super

mkdir Super

Get the result (but each file has comma in the end of each line).

awk -F , '
NR==FNR {title[FNR]=$1;b[$1]=FNR;n=FNR;next} 
FNR==1 {   
           for (i=1;i<=n;i++) 
              { 
                  printf title FS > "Super/" FILENAME
                  sign=0
               }
           printf "\n" > "Super/" FILENAME
           for (i=1;i<=NF;i++)
              sign[b[$i]]=1
           next
       }
{    
     split($0,x,",")
     for (i=1;i<=n;i++)
         { 
           printf (sign=="1")? x[++j] FS:"\"\"" FS > "Super/" FILENAME
         }
     printf "\n" > "Super/" FILENAME
     j=0
}
' title.txt *.csv

$ cat Super/File2.csv
"ColA","ColB","ColC","ColD","ColE","ColF","ColG",
"","","","Dat204","Dat205","Dat206","Dat207",
"","","","Dat214","Dat215","Dat216","Dat217",

If you need clean the end comma from each line

for file in Super/File*.csv
do
  sed 's/,$//' $file > tmp
  mv tmp $file
done

if your sed support -i option:

for file in Super/File*.csv
do
  sed -i 's/,$//' $file > tmp
done

---------- Post updated at 12:28 AM ---------- Previous update was at 12:22 AM ----------

awk 'NR==1||FNR>1' Super/*.csv
1 Like

Many thanks rdcwayx.

Have tested your code quite thoroughly this morning and it seems to work perfectly for our needs.

Much appreciated!