awk record matching

SkySmart · February 24, 2017, 5:55pm

ok. so i have a list of country names which have been abbreviated. we'll call this list A

i have another list which says what country each abbreviated name means. we'll call this list B.

so example of the content of list B:

#delimited by tabs
#ABBR COUNTRY     COUNTRY ABBR
so      Somalia ST. KITTS AND NEVIS     KN
sr      Suriname        ST. LUCIA       LC
st      Sao Tome And Principe   ST. PIERRE AND MIQUELON PM
su      Soviet Union    ST. VINCENT & THE GRENADINES    VC
sv      El Salvador     SUDAN   SD
sy      Syrian Arab Republic    SURINAME        SR
sz      Swaziland       SVALBARD AND JAN MAYEN  SJ
tc      Turks And Caicos Islands        SWAZILAND       SZ
td      Chad    SWEDEN  SE
tf      French Southern Territories     SWITZERLAND     CH
tg      Togo    SYRIAN ARAB REPUBLIC    SY
th      Thailand        TAIWAN  TW
tj      Tajikistan      TAJIKISTAN      TJ
tk      Tokelau TANZANIA, UNITED REPUBLIC OF    TZ
ti      East Timor (new code)   THAILAND        TH
tm      Turkmenistan    TOGO    TG
tn      Tunisia TOKELAU TK
to      Tonga   TONGA   TO
tp      East Timor (old code)   TRINIDAD AND TOBAGO     TT
tr      Turkey  TUNISIA TN
tt      Trinidad And Tobago     TURKEY  TR
tv      Tuvalu  TURKMENISTAN    TM
tw      Taiwan  TURKS AND CAICOS ISLANDS        TC
tz      Tanzania, United Republic Of    TUVALU  TV
ua      Ukraine UGANDA  UG
ug      Uganda  UKRAINE UA
uk      United Kingdom  UNITED A

how can i get awk to tell me the full country name if i supply the country's abbreviation? also note, some abbreviated countries (left column) are in lowercase, and others are in CAPs (right column).

The above list of countries is stored in a variable. so i intend to do something like this:

#!/bin/sh
SupName=${1}
echo "${ListBContent}" | awk -v SupName="${SupName}" '(($1 ~ SupName) || ($3 ~ SupName)) {print $0}'

My question is, is there a more efficient, better way to do this?

jim_mcnamara · February 24, 2017, 6:28pm

You probably had to read the file to begin with, in order to place it in a variable. So,

awk '{ awk program} ' filename

is probably the most efficient form.

The use of pattern matching is okay, except that $3 in your example is uppercase in the file and not in the code.

I'm not sure what else you are after. If your list is huge and the country abbreviation is always lowercase and in column 1, then looking for country aa with /^aa/ is more efficient, assuming every available country code appears in column 1 of the input file. Or $1=="aa" is also fast, since there are only two characters in the search string. Note the double equals sign: a single = would assign to $1 instead of comparing.
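As a rough sketch of the exact-comparison idea, the script below lowercases both sides so the lowercase and uppercase columns are handled alike. The function name lookup and the three-row sample list are made up for illustration; the four-column, tab-delimited layout is assumed from the first post.

```shell
#!/bin/sh
# Hypothetical sample in the thread's layout: abbr, country, COUNTRY, ABBR (tab-delimited).
ListBContent=$(printf 'tt\tTrinidad And Tobago\tTURKEY\tTR\ntr\tTurkey\tTUNISIA\tTN\ntn\tTunisia\tTOKELAU\tTK')

# lookup CODE: print the country whose abbreviation equals CODE exactly, ignoring case.
lookup() {
  printf '%s\n' "$ListBContent" | awk -F '\t' -v code="$1" '
    BEGIN { code = tolower(code) }
    tolower($1) == code { print $2 }   # left pair: abbreviation, then name
    tolower($4) == code { print $3 }   # right pair: name, then abbreviation
  '
}

lookup TR
```

Because == compares whole strings, a partial code such as t matches nothing, whereas $1 ~ "t" would match every row in this sample. A code that appears in both halves of the list (TR here) is printed once per half.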

I would just use grep and skip awk

match=$( grep -iE '(^aa|[[:space:]]aa)' filename )
[ $? -eq 0 ] && echo "$match"
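One caveat with grep is that it sees the whole line, so a short code can also hit letters inside a country name. A possible way around that, sketched below, is to require the code to fill an entire tab-delimited field. The function name find_code and the two sample rows are hypothetical, and the code is assumed to contain no regex metacharacters.

```shell
#!/bin/sh
# A literal tab character, since \t is not portable inside grep patterns.
tab=$(printf '\t')

# Hypothetical sample rows in the thread's four-column, tab-delimited layout.
list=$(printf 'to\tTonga\tTONGA\tTO\nst\tSao Tome And Principe\tST. PIERRE AND MIQUELON\tPM')

# find_code CODE: print rows where CODE fills an entire tab-delimited field.
find_code() {
  printf '%s\n' "$list" | grep -iE "(^|${tab})$1(${tab}|\$)"
}

find_code to
```

With this sample, an unanchored grep -i to would also return the second row because of "Tome"; the anchored pattern returns only the Tonga row.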

Scrutinizer · February 24, 2017, 10:25pm

Try:

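awk '
  NR==FNR {                       # first file (listB): build the lookup table
    n=split($3,F," ")
    c=x
    for(i=1; i<=n; i++) c=c toupper(substr(F[i],1,1)) tolower(substr(F[i],2)) " "
    A[$1]=$2                      # lowercase code -> name as written
    A[tolower($4)]=c              # uppercase code -> name converted to title case
    next
  }
  {                               # second file (listA): print code and name
    print $1, A[tolower($1)]
  }
' FS='\t' OFS='\t' listB listA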
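For a quick trial run, here is a self-contained sketch of the same two-file idea. The sample rows, the temporary directory, and the function name expand are made up for illustration, and the separator handling is a small variation that avoids a trailing space after the title-cased name.

```shell
#!/bin/sh
# Hypothetical sample files; the names listA and listB follow the command above.
dir=$(mktemp -d)
printf 'tt\tTrinidad And Tobago\tTURKS AND CAICOS ISLANDS\tTC\ntr\tTurkey\tTUNISIA\tTN\n' > "$dir/listB"
printf 'TR\ntc\n' > "$dir/listA"

# expand LISTB LISTA: print each code from LISTA followed by its country name.
expand() {
  awk '
    NR==FNR {                               # first file: build the table
      n = split($3, F, " ")
      c = ""
      for (i = 1; i <= n; i++) {
        sep = (i > 1 ? " " : "")
        c = c sep toupper(substr(F[i], 1, 1)) tolower(substr(F[i], 2))
      }
      A[$1] = $2                            # lowercase code -> name as written
      A[tolower($4)] = c                    # uppercase code -> title-cased name
      next
    }
    { print $1, A[tolower($1)] }            # second file: look each code up
  ' FS='\t' OFS='\t' "$1" "$2"
}

result=$(expand "$dir/listB" "$dir/listA")
printf '%s\n' "$result"
rm -r "$dir"
```

Each code from listA is printed as given, followed by a tab and the name found under its lowercased form, so TR and tc resolve regardless of case.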