Merge Multiple Files and Transpose

Looking to join three files and then transpose some columns from multiple rows into a single row.

File Info:

FIELD TERMINATED BY '^'
ENCLOSED BY '~'
LINE TERMINATED BY '\r\n'

FIRST FILE (FOOD_DES.txt)

~01001~^~0100~^~Butter, salted~^~BUTTER,WITH SALT~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87
~01002~^~0100~^~Butter, whipped, with salt~^~BUTTER,WHIPPED,WITH SALT~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87
~01003~^~0100~^~Butter oil, anhydrous~^~BUTTER OIL,ANHYDROUS~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87
~01004~^~0100~^~Cheese, blue~^~CHEESE,BLUE~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87
~01005~^~0100~^~Cheese, brick~^~CHEESE,BRICK~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87
~01006~^~0100~^~Cheese, brie~^~CHEESE,BRIE~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87
~01007~^~0100~^~Cheese, camembert~^~CHEESE,CAMEMBERT~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87
~01008~^~0100~^~Cheese, caraway~^~CHEESE,CARAWAY~^~~^~~^~~^~~^0^~~^6.38^4.27^8.79^3.87
~01009~^~0100~^~Cheese, cheddar~^~CHEESE,CHEDDAR~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87
~01010~^~0100~^~Cheese, cheshire~^~CHEESE,CHESHIRE~^~~^~~^~~^~~^0^~~^6.38^4.27^8.79^3.87

Matches to Second File by first field of each

SECOND FILE (NUT_DATA.txt)

~01001~^~203~^0.85^16^0.074^~1~^~~^~~^~~^^^^^^^~~^
~01001~^~204~^81.11^580^0.065^~1~^~~^~~^~~^^^^^^^~~^
~01001~^~205~^0.06^0^^~4~^~NC~^~~^~~^^^^^^^~~^
~01001~^~207~^2.11^35^0.054^~1~^~~^~~^~~^^^^^^^~~^
~01002~^~203~^0.85^16^0.074^~1~^~~^~~^~~^^^^^^^~~^
~01002~^~204~^81.11^580^0.065^~1~^~~^~~^~~^^^^^^^~~^
~01002~^~205~^0.06^0^^~4~^~NC~^~~^~~^^^^^^^~~^
~01002~^~207~^2.11^35^0.054^~1~^~~^~~^~~^^^^^^^~~^
~01003~^~203~^0.28^1^^~1~^~~^~~^~~^^^^^^^~~^
~01003~^~204~^99.48^1^^~1~^~~^~~^~~^^^^^^^~~^
~01003~^~205~^0.00^0^^~1~^~~^~~^~~^^^^^^^~~^
~01003~^~207~^0.00^1^^~1~^~~^~~^~~^^^^^^^~~^
~01004~^~203~^21.40^19^0.339^~1~^~~^~~^~~^^^^^^^~~^
~01004~^~204~^28.74^20^0.399^~1~^~~^~~^~~^^^^^^^~~^
~01004~^~205~^2.34^0^^~4~^~NC~^~~^~~^^^^^^^~~^
~01004~^~207~^5.11^15^0.138^~1~^~~^~~^~~^^^^^^^~~^
~01005~^~203~^23.24^14^0.607^~1~^~~^~~^~~^^^^^^^~~^
~01005~^~204~^29.68^16^0.319^~1~^~~^~~^~~^^^^^^^~~^
~01005~^~205~^2.79^0^^~4~^~NC~^~~^~~^^^^^^^~~^
~01005~^~207~^3.18^14^0.178^~1~^~~^~~^~~^^^^^^^~~^
~01006~^~203~^20.75^7^0.210^~1~^~~^~~^~~^^^^^^^~~^
~01006~^~204~^27.68^4^1.841^~1~^~~^~~^~~^^^^^^^~~^
~01006~^~205~^0.45^0^^~4~^~NC~^~~^~~^^^^^^^~~^
~01006~^~207~^2.70^1^^~1~^~~^~~^~~^^^^^^^~~^
~01007~^~203~^19.80^12^0.367^~1~^~~^~~^~~^^^^^^^~~^
~01007~^~204~^24.26^13^0.608^~1~^~~^~~^~~^^^^^^^~~^
~01007~^~205~^0.46^0^^~4~^~NC~^~~^~~^^^^^^^~~^
~01007~^~207~^3.68^8^0.135^~1~^~~^~~^~~^^^^^^^~~^
~01008~^~203~^25.18^4^0.189^~1~^~~^~~^~~^^^^^^^~~^
~01008~^~204~^29.20^4^1.389^~1~^~~^~~^~~^^^^^^^~~^
~01008~^~205~^3.06^0^^~4~^~NC~^~~^~~^^^^^^^~~^
~01008~^~207~^3.28^4^0.155^~1~^~~^~~^~~^^^^^^^~~^
~01009~^~203~^24.90^42^0.275^~1~^~~^~~^~~^^^^^^^~~^
~01009~^~204~^33.14^106^0.206^~1~^~~^~~^~~^^^^^^^~~^
~01009~^~205~^1.28^0^^~4~^~NC~^~~^~~^^^^^^^~~^
~01009~^~207~^3.93^79^0.146^~1~^~~^~~^~~^^^^^^^~~^
~01010~^~203~^23.37^12^^~1~^~~^~~^~~^^^^^^^~~^
~01010~^~204~^30.60^4^^~1~^~~^~~^~~^^^^^^^~~^
~01010~^~205~^4.78^0^^~4~^~NC~^~~^~~^^^^^^^~~^
~01010~^~207~^3.60^3^0.101^~1~^~~^~~^~~^^^^^^^~~^

Matches to Third File by Second Field of Second File to First Field of Third File

THIRD FILE (NUTR_DEF.txt)

~203~^~g~^~PROCNT~^~Protein~^~2~^~600~
~204~^~g~^~FAT~^~Total lipid (fat)~^~2~^~800~
~205~^~g~^~CHOCDF~^~Carbohydrate, by difference~^~2~^~1100~
~207~^~g~^~ASH~^~Ash~^~2~^~1000~

DESIRED OUTPUT (only going to put together the first two as an example)

1^~01001~^~0100~^~Butter, salted~^0.85^~g~^81.11^~g~^0.06^~g~^2.11^~g~
2^~01002~^~0100~^~Butter, whipped, with salt~^0.85^~g~^81.11^~g~^0.06^~g~^2.11^~g~

EXPLANATION OF DESIRED OUTPUT

counter^~first column, first file~^~second column, first file~^~third column, first file~^~third column, second file~^~second column, third file~

Hope you have fun with this one...

---------- Post updated at 01:21 PM ---------- Previous update was at 01:11 PM ----------

I have this data in a MySQL database. Ideally I'd like to have this as a view, but I'm not very good with SQL and I have no idea how to do the transpose portion of this merge.

Anyway if you're interested in working this out in sql I will include a script to create the tables and import the data.


CREATE DATABASE NDB IF NOT EXISTS;

DROP TABLE IF EXISTS FOOD_DES, NUT_DATA, NUTR_DEF;

CREATE TABLE FOOD_DES
 (
  NDB_No      INT,
  FdGrp_Cd      VARCHAR(8),
  Long_Desc     VARCHAR(400),
  Shrt_Desc     VARCHAR(120),
  ComName     VARCHAR(200),
  ManufacName     VARCHAR(130),
  Survey      VARCHAR(2),
  Ref_Desc      VARCHAR(270),
  Refuse      INTEGER,
  SciName     VARCHAR(130),
  N_Factor      FLOAT,
  Pro_Factor      FLOAT,
  Fat_Factor      FLOAT,
  CHO_Factor      DOUBLE,
    INDEX (NDB_No)
);

CREATE TABLE NUT_DATA
 (
  NDB_No      INT,
  Nutr_No     INT,
  Nutr_Val      DOUBLE,
  Num_Data_Pts      INTEGER,
  Std_Error     DOUBLE,
  Src_Cd      VARCHAR(4),
  Deriv_Cd      VARCHAR(8),
  Ref_NDB_No      INT,
  Add_Nutr_Mark     VARCHAR(2),
  Num_Studies     INTEGER,
  Min     DOUBLE,
  Max     DOUBLE,
  DF      DOUBLE,
  Low_EB      DOUBLE,
  Up_EB     DOUBLE,
  Stat_Cmt      VARCHAR(20),
    INDEX (NDB_No)
);


CREATE TABLE NUTR_DEF
 (
  Nutr_No     INT,
  Units     VARCHAR(14),
  Tagname     VARCHAR(40),
  NutrDesc      VARCHAR(120),
  Num_Dec     VARCHAR(2),
  SR_Order      FLOAT,
     INDEX (Nutr_No)
);

LOAD DATA INFILE 'FOOD_DES.txt' INTO TABLE FOOD_DES FIELDS TERMINATED BY '^' ENCLOSED BY '~' LINES TERMINATED BY '\r\n';

LOAD DATA INFILE 'NUT_DATA.txt' INTO TABLE NUT_DATA FIELDS TERMINATED BY '^' ENCLOSED BY '~' LINES TERMINATED BY '\r\n';

LOAD DATA INFILE 'NUTR_DEF.txt' INTO TABLE NUTR_DEF FIELDS TERMINATED BY '^' ENCLOSED BY '~' LINES TERMINATED BY '\r\n';

Hello, mkastin:

f=file being processed
nunit=unit used for this component of nutritional data (i assume 'g' is for grams)
ndata=nutritional data

The order of the file arguments at the end of the command is critical.

awk '
  BEGIN {FS=OFS="^"}
  FNR==1 {++f}
  f==1 {nunit[$1]=$2}
  f==2 {ndata[$1]=ndata[$1] FS $3 FS nunit[$2]}
  f==3 {print FNR,$1,$2,$3 ndata[$1]}' \
NUTR_DEF.txt NUT_DATA.txt FOOD_DES.txt

Test run using your data:

$ awk '
>   BEGIN {FS=OFS="^"}
>   FNR==1 {++f}
>   f==1 {nunit[$1]=$2}
>   f==2 {ndata[$1]=ndata[$1] FS $3 FS nunit[$2]}
>   f==3 {print FNR,$1,$2,$3 ndata[$1]}' \
> NUTR_DEF.txt NUT_DATA.txt FOOD_DES.txt
1^~01001~^~0100~^~Butter, salted~^0.85^~g~^81.11^~g~^0.06^~g~^2.11^~g~
2^~01002~^~0100~^~Butter, whipped, with salt~^0.85^~g~^81.11^~g~^0.06^~g~^2.11^~g~
3^~01003~^~0100~^~Butter oil, anhydrous~^0.28^~g~^99.48^~g~^0.00^~g~^0.00^~g~
4^~01004~^~0100~^~Cheese, blue~^21.40^~g~^28.74^~g~^2.34^~g~^5.11^~g~
5^~01005~^~0100~^~Cheese, brick~^23.24^~g~^29.68^~g~^2.79^~g~^3.18^~g~
6^~01006~^~0100~^~Cheese, brie~^20.75^~g~^27.68^~g~^0.45^~g~^2.70^~g~
7^~01007~^~0100~^~Cheese, camembert~^19.80^~g~^24.26^~g~^0.46^~g~^3.68^~g~
8^~01008~^~0100~^~Cheese, caraway~^25.18^~g~^29.20^~g~^3.06^~g~^3.28^~g~
9^~01009~^~0100~^~Cheese, cheddar~^24.90^~g~^33.14^~g~^1.28^~g~^3.93^~g~
10^~01010~^~0100~^~Cheese, cheshire~^23.37^~g~^30.60^~g~^4.78^~g~^3.60^~g~

Regards,
Alister

1 Like

Many thanks Alister.

Your guess about the units is correct.

These files are actually excerpts for the USDA Nutrient Database available here:

http://www.nal.usda.gov/fnic/foodcomp/search/