mapping of values in shell scripting

vsachan · September 19, 2011, 11:41am

sample content of file1:

SSTY1         2145228348       652011011715140100000002419005432092074                 008801726143662             VDZX01                                  MIO2                                           008801726143662
SSRTY         2145228349       652011011715163400000000095006543493559                 030729961474195             RTYUI9                                  MMO2                                           030729961474195
       SSI09  2145228354       652011011715124900000003539000447438811497              08860700789                                     IIUYT4                  IMO2                                       041088860700789
SSRTY         2145228355       652011011715152600000001176008860004296                 041289796822734             KIUTY6                                  MMO2                                           041289796822734
SSRTY  SSRTY  2145228357       652011011715154000000001033009711806239                 027279810510494             ITYIP4              HUIPO5              MMR2MMO2                                       027279810510494
       SSRTY  2145228363       652011011715161500000000286001137000707                 025409450590072                                 LRTER4                  MMO2                                       025409450590072
SSRTY         2145228366       652011011715155800000000456001171220000                 030518679540203             OOUTY6                                  FMO2                                           030518679540203
       SSRTY  2145228367       652011011715142700000002162009833314457                 030099569768624                                 NPAXM2                  MMO2                                       030099569768624
SSRTY  SSRTY  2145228368       652011011715160000000000436009811279923                 031129311382325             RTYUI9              VYVXG5              MUR2MUO2                                       031129311382325
SSRTY  SSRTY  2145228368       652011011715160000000000436009811279923                 031129311382325             IUERW3              VYVXG5              MUR2MUO2                                       031129311382325

cat file1 | awk -f $HOME_NEW/abc.awk | sed -e 's/[\t ]//g;/^$/d;' | sed 's/,,/, ,/g' | sed 's/,,/, ,/g' > $CSVFileName

abc.awk

# Convert each fixed-width record into an 18-field CSV line.
{
    # Field 6: duration in seconds (hours*3600 + minutes*60 + seconds + hundredths/100).
    secs = substr($0,50,4)*3600 + substr($0,54,2)*60 + substr($0,56,2) + substr($0,58,2)/100
    fmt = "%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s\n"
    if (substr($0,87,1) ~ /^[TOMYC]$/)        # column 87 holds T, O, M, Y or C: flag "I"
        printf fmt, substr($0,34,8), substr($0,42,6), "", substr($0,60,28), substr($0,88,28), secs, "", "", substr($0,1,7), substr($0,87,20), "", "", "", "I", "", "", "", ""
    else if (substr($0,123,1) ~ /^[TOMYC]$/)  # column 123 holds T, O, M, Y or C: flag "O"
        printf fmt, substr($0,34,8), substr($0,42,6), "", substr($0,60,28), substr($0,88,28), secs, "", "", substr($0,8,7), substr($0,123,20), "", "", "", "O", "", "", "", ""
}
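
The sixth output field is the only arithmetic in abc.awk: the ten digits at columns 50-59 are read as four digits of hours, two of minutes, two of seconds and two of hundredths, then combined into seconds. A minimal sketch of that step in isolation follows. The function name duration_secs is made up for illustration, and the two sample values were chosen to reproduce the 9.5 and 233.9 seen in the sample output.

```shell
#!/bin/sh
# duration_secs: hypothetical helper (not part of the original scripts).
# Takes 10 digits laid out as HHHHMMSScc and prints the duration in seconds,
# using the same arithmetic abc.awk applies to columns 50-59 of a record.
duration_secs() {
    echo "$1" | awk '{
        print substr($0,1,4)*3600 + substr($0,5,2)*60 + substr($0,7,2) + substr($0,9,2)/100
    }'
}

duration_secs 0000000950    # 0 h, 0 min,  9 s, 50 hundredths -> 9.5
duration_secs 0000035390    # 0 h, 3 min, 53 s, 90 hundredths -> 233.9
```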

The above command worked successfully and generated this output:

20110117,151634, ,09899493559,030729961474195,9.5, , ,SSRTY,RTYUI9, , , ,I, , , ,
20110117,151249, ,00447438811497,08860700789,233.9, , ,SSI09,ITYIP4, , , ,O, , , ,
20110117,151540, ,09711806239,027279810510494,63.3, , ,SSRTY,ITYIP4, , , ,I, , , ,
20110117,151615, ,01137000707,025409450590072,28.6, , ,SSRTY,LRTER4, , , ,O, , , ,
20110117,151600, ,09811279923,031129311382325,43.6, , ,SSRTY,RTYUI9, , , ,I, , , ,
20110117,151600, ,09811279923,031129311382325,43.6, , ,SSRTY,IUERW3, , , ,I, , , ,

*****************************************************************************************************************************
My requirement now is to take the 9th and 10th fields of each record produced from file1, look them up in another file that lists the same two IDs, collect the continent name from that file, and put it in the output file. I am able to do this, but the lookup file has around 10,000 records. For fewer records (around 50) it works fine.
For example:

id1           id2           Continent name
-----           ------          --------------
SSRTY  RTYUI9  RIGHT1
SSI09  ITYIP4  AUSTRALIA
SSRTY  ITYIP4  ASIA
SSRTY  LRTER4  NORTHUS
SSRTY  RTYUI6  KOREA
SSRTY  IUERW3  ANTARCTICA
.
.
10,000 records are there
.
.

My output should be:

20110117,151634, ,09899493559,030729961474195,9.5, , ,SSRTY,RTYUI9, , , ,I, , , ,RIGHT1,
20110117,151249, ,00447438811497,08860700789,233.9, , ,SSI09,ITYIP4, , , ,O, , , , ,
20110117,151540, ,09711806239,027279810510494,63.3, , ,SSRTY,ITYIP4, , , ,I, , , ,ASIA,
20110117,151615, ,01137000707,025409450590072,28.6, , ,SSRTY,LRTER4, , , ,O, , , ,NORTHUS,
20110117,151600, ,09811279923,031129311382325,43.6, , ,SSRTY,RTYUI9, , , ,I, , , ,RIGHT1,
20110117,151600, ,09811279923,031129311382325,43.6, , ,SSRTY,IUERW3, , , ,I, , , ,ANTARCTICA,

I was using the following approach, but it gave me the error below:

cat $filename | awk -f $HOME_NEW/abc.awk | sed -e 's/[\t ]//g;/^$/d;' |  awk -F"," -f $HOME_NEW/AWK_GENERATE_SCRIPT.awk | awk 'NR > 1 { gsub( / ,$/, "," ); print }' | sed 's/,,/, ,/g' | sed 's/,,/, ,/g' > $CSVFileName

AWK_GENERATE_SCRIPT.awk:

{
if ( $9 =="SSRTY" && $10 == "RTYUI9") CONTINENT_NAME="RIGHT1";
else if ( $9 =="SSI09" && $10 == "ITYIP4") CONTINENT_NAME="AUSTRALIA";
else if ( $9 =="SSRTY" && $10 == "ITYIP4") CONTINENT_NAME="ASIA";
else if ( $9 =="SSRTY" && $10 == "LRTER4") CONTINENT_NAME="NORTHUS";
.
.
.
.
10,000 DIFFERENT ID1 AND ID2
.
.
else CONTINENT_NAME="";
print $0 CONTINENT_NAME,",";
}

Error:

+  YACC Stack Overflow The source line is 49.
 The error context is
                else if ( $9 =="IIYTE" && $10 == >>>  "FRTYH4" <<< ) CONTINENT="NORTH KOREA";
        awk: 0602-541 There are 19 missing ) characters.
++

Please help if you can solve it.

Shell_Life · September 19, 2011, 1:08pm

You can create a sed file with all 10,000 combinations as:

s/.*,SSRTY,RTYUI9,.*/&RIGHT1,/
s/.*,SSI09,ITYIP4,.*/&AUSTRALIA,/
s/.*,SSRTY,ITYIP4,.*/&ASIA,/
s/.*,SSRTY,LRTER4,.*/&NORTHUS,/

Then run as:

sed -f sed_file Inp_File

If you encounter any problem, keep breaking up the sed file until you can manage to run it, i.e. 2 files of 5,000 records, then 4 files of 2,500, 10 files of 1,000, etc.
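
The sed file does not have to be typed by hand; it can be generated from the ID table. Below is a minimal sketch, assuming the table is a whitespace-separated file with id1, id2 and the name on each line. The file names lookup.txt and sed_file and the prefix sed_part_ are placeholders, and names containing "/" or "&" would need escaping, which this sketch does not do.

```shell
#!/bin/sh
# Work in a scratch directory so the sketch does not touch real files.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Hypothetical lookup table: id1, id2, then the name (which may contain spaces).
cat > lookup.txt <<'EOF'
SSRTY  RTYUI9  RIGHT1
SSI09  ITYIP4  AUSTRALIA
IIYTE  FRTYH4  NORTH KOREA
EOF

# Emit one sed substitution per table row.
awk 'NF >= 3 {
    name = $3
    for (i = 4; i <= NF; i++) name = name " " $i   # rejoin multi-word names
    printf "s/.*,%s,%s,.*/&%s,/\n", $1, $2, name
}' lookup.txt > sed_file

# Optionally break the result into pieces of 500 commands each.
split -l 500 sed_file sed_part_

cat sed_file
```

Each piece can then be run with sed -f as shown above.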

vsachan · September 20, 2011, 3:22am

Thanks for the solution, Shell_Life.

But the performance is very poor. I am using the script below. Please check whether I can improve the performance.

script.sh:
--------

#!/bin/sh
set -x
file_name=$1;
sed -f 111.txt $file_name > temp.txt
sed -f 222.txt temp.txt > temp1.txt
sed -f 333.txt temp1.txt > temp2.txt
sed -f 444.txt temp2.txt > temp3.txt
sed -f 555.txt temp3.txt > temp4.txt
sed -f 666.txt temp4.txt > temp5.txt
sed -f 777.txt temp5.txt > temp6.txt
sed -f 888.txt temp6.txt > temp7.txt
sed -f 999.txt temp7.txt > temp8.txt
sed -f AAA.txt temp8.txt > temp9.txt
sed -f BBB.txt temp9.txt > temp10.txt
sed -f CCC.txt temp10.txt > temp11.txt
 
rm -f temp.txt temp1.txt temp2.txt temp3.txt temp4.txt temp5.txt temp6.txt temp7.txt temp8.txt temp9.txt temp10.txt
 
mv -f temp11.txt $file_name

*******************************************************

Here 111.txt, 222.txt, 333.txt and so on contain the substitute commands, 500 per file. I pass the file name as the first parameter to the script; that file contains around 40,000 records.

One more concern: I have to put a space and then a comma if no substitution is made in the record. Please compare the third line of the two outputs below.

For example, if no match is found, the output should be:

20110117,151615, ,01137000707,025409450590072,28.6, , ,SSRTY,LRTER4, , , ,O, , , ,NORTHUS,
20110117,151600, ,09811279923,031129311382325,43.6, , ,SSRTY,RTYUI9, , , ,I, , , ,RIGHT1,
20110117,151600, ,09811279923,031129311382325,43.6, , ,SSRTY,IUERW3, , , ,I, , , , ,

But currently the output is:

20110117,151615, ,01137000707,025409450590072,28.6, , ,SSRTY,LRTER4, , , ,O, , , ,NORTHUS,
20110117,151600, ,09811279923,031129311382325,43.6, , ,SSRTY,RTYUI9, , , ,I, , , ,RIGHT1,
20110117,151600, ,09811279923,031129311382325,43.6, , ,SSRTY,IUERW3, , , ,I, , , ,

Please help me improve the performance.
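
For the missing " ," on unmatched records, one option is a final pass that counts fields: in the sample output a record has 18 comma-separated fields before the lookup and 19 once a name has been appended. The sketch below relies on that assumption and would miscount if a name itself contained a comma. The function name pad_unmatched is made up here.

```shell
#!/bin/sh
# pad_unmatched: hypothetical final filter (not part of the original script).
# Records that still have 18 comma-separated fields received no name,
# so " ," is appended to them; all other records pass through unchanged.
pad_unmatched() {
    awk -F',' 'NF == 18 { $0 = $0 " ," } { print }'
}

printf '%s\n' \
  '20110117,151600, ,09811279923,031129311382325,43.6, , ,SSRTY,RTYUI9, , , ,I, , , ,RIGHT1,' \
  '20110117,151600, ,09811279923,031129311382325,43.6, , ,SSRTY,IUERW3, , , ,I, , , ,' |
pad_unmatched
```

It could be appended as the last stage after the sed passes.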

Shell_Life · September 20, 2011, 3:35pm

Here is an entirely different approach:

First create a file with all 10,000 combinations as follows:

,,,,,,,,SSRTY,RTYUI9,RIGHT1
,,,,,,,,SSI09,ITYIP4,AUSTRALIA
,,,,,,,,SSRTY,ITYIP4,ASIA
,,,,,,,,SSRTY,LRTER4,NORTHUS

Note that there are 8 (eight) commas before the first data field.

Then run the following script:

#!/usr/bin/ksh
sort -A -t',' -k9,10 -k1 Original_File Combination_File > Mixed_File
IFS=','
while read mF1 mF2 mF3 mF4 mF5 mF6 mF7 mF8 mF9 mF10 mF11 mF12 mF13 mF14 mF15 mF16 mF17; do
  if [[ "${mF1}" = "" ]]; then
    mSave9=${mF9}
    mSave10=${mF10}
    mSave11=${mF11}
    continue
  fi
  if [[ "${mF9}" = "${mSave9}" && "${mF10}" = "${mSave10}" ]]; then
    echo "${mF1},${mF2},${mF3},${mF4},${mF5},${mF6},${mF7},${mF8},${mF9},${mF10},${mF11},${mF12},${mF13},${mF14},${mF15},${mF16},${mF17},${mSave11},"
  else
    echo "${mF1},${mF2},${mF3},${mF4},${mF5},${mF6},${mF7},${mF8},${mF9},${mF10},${mF11},${mF12},${mF13},${mF14},${mF15},${mF16},${mF17},,"
  fi
done < Mixed_File
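
For comparison, the same lookup can be sketched with an awk associative array, which reads the ID table once and needs neither a sort nor one generated rule per ID pair. This is a different technique from the script above, not a rewrite of it. It assumes a whitespace-separated table (id1, id2, name) and the comma-separated records shown earlier in the thread; the function name add_names and the file names lookup.txt and records.csv are placeholders.

```shell
#!/bin/sh
# add_names: hypothetical helper. usage: add_names LOOKUP_FILE RECORDS_FILE
# Pass 1 (lookup file): remember each name under the key "id1,id2".
# Pass 2 (records): append the name for fields 9 and 10, or a single
# space when the pair is not in the table, followed by a comma.
add_names() {
    awk '
    NR == FNR {
        name = $3
        for (i = 4; i <= NF; i++) name = name " " $i   # multi-word names
        map[$1 "," $2] = name
        next
    }
    {
        split($0, f, ",")
        key = f[9] "," f[10]
        print $0 ((key in map) ? map[key] : " ") ","
    }' "$1" "$2"
}

# Small demonstration in a scratch directory.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
cat > lookup.txt <<'EOF'
SSRTY  RTYUI9  RIGHT1
SSRTY  LRTER4  NORTHUS
EOF
cat > records.csv <<'EOF'
20110117,151615, ,01137000707,025409450590072,28.6, , ,SSRTY,LRTER4, , , ,O, , , ,
20110117,151600, ,09811279923,031129311382325,43.6, , ,SSRTY,IUERW3, , , ,I, , , ,
EOF
add_names lookup.txt records.csv
```

The lookup file must be non-empty for the NR == FNR test to separate the two passes correctly.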