Lookup subfields from 3 tables and insert

ritakadm · December 29, 2014, 12:30am

Hello masters,

Please help on the following.
I have a tab delimited file with subfields space delimited.

1	a b	x y	hhghd ghgf
2	v t	f g	gdgdgdg hghg

I have 3 lookup table files tab delimited, for fields 2,3 and 4 respectively

Lookup2

a	10
b	20
v	30
t	40

Lookup3

x	50
y	60
f	70
g	80

Lookup4

hhghd	90
ghgf	100
gdgdgdg	110
hghg	120

I want to find the lookup values for cols 2,3 and 4 from the corresponding files and then insert a column next to the looked up subfield value, in the same space delimited format.

1	a b	10 20	x y	50 60	hhghd ghgf	90 100
2	v t	30 40	f g	70 80	gdgdgdg hghg	110 120

Please note that this is a simplified example, in the original file the columns to be looked up are not next to each other. The are $2, $74 and $126 if that information matters.

pravin27 · December 29, 2014, 1:05am

 awk -F"\t" '{if (FILENAME != "master.txt" ) { a[$1]=$2;next}}
{split ($2,P," "); split($3,Q," "); split ($4,R," "); print $1"\t"$2"\t"a[P[1]],a[P[2]]"\t"$3"\t"a[Q[1]],a[Q[2]]"\t"$4,a[R[1]],a[R[2]]}' lookup* master.txt

ritakadm · December 29, 2014, 1:14am

thank you Pravin, I need to make it a little more flexible, there maybe be any number of subfields and not always 2..i`m sorry , i should have indicated this in the input.

pravin27 · December 29, 2014, 1:30am

Hi Ritakdm,

Could this help you ?

awk -F"\t" '{if (FILENAME != "master.txt" ) { a[$1]=$2;next}}
{n2=split ($2,P," "); n3=split($3,Q," "); n4=split ($4,R," "); printf "\n"$1FS$2FS;
for (i=1;i<=n2;i++) { printf n2==i? a[P]: a[P]" "} printf FS$3FS ; for (j=1;j<=n3;j++) { printf n3==j?a[P[j]]:a[P[j]]" "}  printf FS$4FS ; for (k=1;k<=n4;k++) { printf n4==k?a[P[k]]:a[P[k]]" "}} END {print ""}'  lookup* master.txt

ritakadm · December 29, 2014, 1:31am

thank you, i will do the necessary testing and get back to you

Don_Cragun · December 29, 2014, 2:48am

You haven't said what OS you're using, but when I tried pravin27's code on OS X, I get a syntax error (probably from missing format arguments for the printf statements).

The following seems to do what you want with any number of fields you want to process (just update the fields[] array initialized in the BEGIN clause AND be sure that you have one lookup file for each field to be updated, one or more subfields in each field being processed, and every subfield appearing in the appropriate Lookupxxx file:

awk '
BEGIN { FS = OFS = "\t"
        fields[++nf] = 2
        fields[++nf] = 3
        fields[++nf] = 4
}
FNR == 1 {
        f++
}
f <= nf {l[f, $1] = $2
        next
}
{       for(i = nf; i > 0; i--) {
                n = split($fields, sf, " ")
                af = OFS
                for(j = 1; j <= n; j++)
                        af = af l[i, sf[j]] ((j < n) ? " " : "")
                $fields = $fields af
        }
}
1' Lookup2 Lookup3 Lookup4 file

If you don't set elements in the fields[] array in increasing numeric order and provide Lookupxxx files in the same order as the order of the elements in the fields[] array, it will not work correctly. If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk .

With your sample Lookup * files and file containing:

1	a b	x y	hhghd ghgf
2	v t	f g	gdgdgdg hghg
3	a b v t	f	gdgdgdg

it produces the output:

1	a b	10 20	x y	50 60	hhghd ghgf	90 100
2	v t	30 40	f g	70 80	gdgdgdg hghg	110 120
3	a b v t	10 20 30 40	f	70	gdgdgdg	110

RavinderSingh13 · December 29, 2014, 4:49am

Hello Don,

Thanks a lot for the great command, I have no hesitation to say that, one day I want to be like you in techincal knowledge. You are great(can't say below than that). I am lucky to be here and see/follow/understand your posts.

There is one very famous quote in a Bollywood movie named Don itself :).

Thanks,
R. Singh

ritakadm · December 29, 2014, 10:03am

Don, would you please explain your code.. I plan to build on this in near future. thank you

RavinderSingh13 · December 29, 2014, 10:42am

Hello ritakadm,

Following may help you.

awk '
BEGIN { FS = OFS = "\t"                                                        # Setting Field seperator and Output field seprator as tab
        fields[++nf] = 2                                                           # Creating an array named fields whose index is nf (which has values 1, 2 up to 3) and value is 2, 3 and 4
        fields[++nf] = 3                                                           # Assigning 2nd element as value 3 whose index is 2
        fields[++nf] = 4                                                           # Assigning 3rd element to array fields whose index is 3
}
FNR == 1 {                                                                          # This condition will be TRUE whenever it reads a new file
        f++                                                                             # increasing variable f's value by 1
}
f <= nf {l[f, $1] = $2                                                            # Checking condition here where variable f is less than and equal to variable nf's value, making array named l whose index is variable f and $1, setting it's value to $2, eg. l(1, 1)=a b
        next                                                                           # skipping other actions
}
{       for(i = nf; i > 0; i--) {                                                  # intilizing for loop from variable nf's value to 0
                n = split($fields, sf, " ")                                     # creating an array named sf whose seperator is space and taking it's count to variable n 
                af = OFS                                                             # setting a variable named af's value to OFS(Output field seprator)
                for(j = 1; j <= n; j++)                                          # intilizing a for loop whcih will run till variable j's value is equal to n's value
                        af = af l[i, sf[j]] ((j < n) ? " " : "")                  # setting value of af here eg. (af = af l(3, sf(1)) " ") which is actually [af=af 10 ] then (af = af l(3, sf(2)) " ") which is actually [af=af 10 20]
                $fields = $fields af                                        # Setting individual field's value in line by fields array 
        }
}         
1' Lookup2 Lookup3 Lookup4 file                                             # 1 means allow defualt action which is print here, filenames are given as input

Hello Don,

I just tried to explain code, kindly do correct me if I have missed something or mentoined anything wrong here.

Thanks,
R. Singh

ritakadm · December 29, 2014, 10:50am

f <= nf {l[f, $1] = $2                                                            # Checking condition here where variable f is less than and equal to variable nf's value, making array named l whose index is variable f and $1, setting it's value to $2, eg. l(1, 1)=a b
        next                                                                           # skipping other actions

Thank you R. Singh
Why use the second column? Can any of $2, $3 and $4 be used here?

vgersh99 · December 29, 2014, 10:56am

one could simplify the creation of the 'fields' array:

nf=split("2 3 4", fields, " ")

RavinderSingh13 · December 29, 2014, 10:58am

Hello ritakadm,

This f <= nf {l[f, $1] = $2 condition will be only true while reading first 3 files and first 3 files are Lookup files which we we need their 2nd values as per your requirement. When it comes 4th file the main file this condition will not be true and there is the time when we are formating fields with their new values which we got from other files. You can just go step by step it will be very clear to you I am pretty sure about it, let us know if you have any queries on same.

Thanks,
R. Singh

Don_Cragun · December 29, 2014, 11:31am

Hi Ravinder,
I think you understand what it is doing. Here is a different way to comment the code that takes more lines to describe, but fits in an 80 column screen:

awk '
BEGIN {	# Set input and output field separators to a tab.
	FS = OFS = "\t"
	# Initialize the table of fields to process (fields[]) and increment the
	# number of Lookup files to expect (nf).
	fields[++nf] = 2	# The 1st Lookup file translates field #2.
	fields[++nf] = 3	# The next Lookup file translates field #3.
	fields[++nf] = 4	# The next Lookup file translates field #4.
}
# If this is the 1st line in a file...
FNR == 1 {
	# increment the input file number.
	f++
}
# If we are looking at a line from a Lookup file...
f <= nf {
	# set the lookup table (l[]) entry for this file (f) and input value
	# ($1) to the corresponding subfield data to be added ($2).
	l[f, $1] = $2
	# Skip to next input line and do not perform the following actions...
	next
}
# If we get to this point, we have read and accumulated data from all of the
# Lookup files and are now looking at a line from our data file.
{	# Process each field in our table of fields (starting with the last
	# field first).  Note that when we add a tab to a field, the field
	# numbers of following fields will change.
	for(i = nf; i > 0; i--) {
		# Split the selected field into subfields (sf[]) and get the
		# number of subfields to process (n).
		n = split($fields, sf, " ")
		# Set the initial contents of the added field data to the output
		# field separator.
		af = OFS
		# For each subfield in this field, append the lookup table value
		# for the entry in the Lookup file corresponding to this field
		# and subfield value, and follow each converted subfield value
		# with a space (for the 1st n-1 subfields) or by nothing (for
		# the lsat subfield).
		for(j = 1; j <= n; j++)
			af = af l[i, sf[j]] ((j < n) ? " " : "")
		# Append the accumulated looked up data to the current field
		# contents.
		$fields = $fields af
	}
}
1	# Print the updated contents of the current data file line.
' Lookup2 Lookup3 Lookup4 file	# Specify the Lookup files and the data file to
				# be processed.