awk statement help

brettcasper · November 13, 2017, 5:44pm

There has to be a way to do this with awk, or maybe I'm just focusing on the wrong tool and making this harder than it needs to be.

I'm trying to do a field lookup/join between two files at a very large scale, but the output format has to change dramatically. I have an input file to look fields up from, and I essentially need a field join with a one-to-many relationship between each key and the values found for it. For each result, I need to write out a block of text based on the results found.

If results are found, the data is pulled together. If multiple results are found in the data file, the output needs to be organized somewhat like the example below. If no results are found, it just uses the field values from the original file. Once the fields have been determined, they have to be written out on separate lines, in a very different layout from the input.

Example:
1) File1.txt (file to process)

Site~Location~date~person
1~15~2017-01-01~me
2~28~2016-05-01~owner
3~68~2015-01-28~supervisor
4~69~2012-10-15~extra

2) File2.txt (file with data to pull in...join field2 from file1 to field1 from file2)

Location~Overriding Sites
15~12
15~13
15~14
15~10
28~99
68~100

3) Output text to write out (the site value from file1 is dropped if results are found, and the overriding site values are used instead; a location can have multiple site IDs):

Begin Record
Location: 15
Site Id: 12
Site Id: 13
Site Id: 14
Site Id: 10
Date: 2017-01-01
Contact: me
End Record

Begin Record
Location: 28
Site Id: 99
Date: 2016-05-01
Contact: owner
End Record

Begin Record
Location: 68
Site Id: 100
Date: 2015-01-28
Contact: supervisor
End Record

Begin Record
Location: 69
Site Id: 4
Date: 2012-10-15
Contact: extra
End Record

I've looked at this a few different ways, and I'm getting myself turned around. Can you help?

Yoda · November 13, 2017, 6:22pm

Here is an awk approach:-

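awk -F'~' '
        # First file (File1.txt): remember site, date and contact per location.
        NR == FNR {
                if ( FNR > 1 )
                        A_F1[$2] = $1 FS $3 FS $4
                next
        }
        # Second file (File2.txt): collect every overriding site per location.
        FNR > 1 {
                A_F2[$1] = ( ( $1 in A_F2 ) ? A_F2[$1] FS $2 : $2 )
        }
        END {
                # Note: for-in visits the locations in an unspecified order.
                for ( k in A_F1 )
                {
                        n = split ( A_F1[k], T1 )
                        print "Begin Record"
                        print "Location: " k

                        if ( k in A_F2 )
                        {
                                m = split ( A_F2[k], T2 )
                                for ( i = 1; i <= m; i++ )
                                        print "Site Id: " T2[i]
                        }
                        else
                                # No override found: fall back to the site from File1.txt.
                                print "Site Id: " T1[1]

                        print "Date: " T1[2]
                        print "Contact: " T1[3]
                        printf "End Record\n\n"
                }
        }
' File1.txt File2.txt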

Don_Cragun · November 13, 2017, 7:04pm

Hi brettcasper,
Welcome to the UNIX & Linux Forums. When starting a thread here it always helps if you tell us what operating system and shell you're using so we know what capabilities your system has.

In addition to what Yoda suggested, you might also try the following. By reversing the order in which the files are processed, it can process records from File1.txt one record at a time instead of having to store the entire contents of both files in memory.

awk -F'~' '
FNR == 1 {
	next
}
NR == FNR {
	site[$1] = site[$1] "Site Id: " $2 "\n"
	next
}
{	printf("Begin Record\nLocation: %s\n%sDate: %s\nContact: %s\nEnd Record\n\n",
	    $2, ($2 in site) ? site[$2] : "Site Id: " $1 "\n", $3, $4)
}' File2.txt File1.txt
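
To try the command above against the sample data from the first post, something like the following sketch may help. It recreates both files in a scratch directory and captures the result in a shell variable. The scratch directory, the use of mktemp, and the variable names tmp and out are assumptions made for this illustration and are not part of the original command.

```shell
#!/bin/sh
# Recreate the sample files from the question in a scratch directory.
tmp=$(mktemp -d) || exit 1

cat > "$tmp/File1.txt" <<'EOF'
Site~Location~date~person
1~15~2017-01-01~me
2~28~2016-05-01~owner
3~68~2015-01-28~supervisor
4~69~2012-10-15~extra
EOF

cat > "$tmp/File2.txt" <<'EOF'
Location~Overriding Sites
15~12
15~13
15~14
15~10
28~99
68~100
EOF

# Same awk program as above: the lookup file (File2.txt) is read first,
# then File1.txt is processed one record at a time.
out=$(awk -F'~' '
FNR == 1 {
	next
}
NR == FNR {
	site[$1] = site[$1] "Site Id: " $2 "\n"
	next
}
{	printf("Begin Record\nLocation: %s\n%sDate: %s\nContact: %s\nEnd Record\n\n",
	    $2, ($2 in site) ? site[$2] : "Site Id: " $1 "\n", $3, $4)
}' "$tmp/File2.txt" "$tmp/File1.txt")
rm -rf "$tmp"

printf '%s\n' "$out"
```

With this data the output should contain four records in File1.txt order, and location 69, which has no entry in File2.txt, keeps its own site value of 4.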

If you're running this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
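
If the same script has to run on several systems, the choice of awk can be made at run time. The following is only a sketch: the AWK variable name is an assumption introduced here, and it assumes that uname -s reports SunOS on the affected systems.

```shell
#!/bin/sh
# Use plain awk by default; on SunOS prefer /usr/xpg4/bin/awk, else nawk.
AWK=awk
if [ "$(uname -s)" = "SunOS" ]; then
        if [ -x /usr/xpg4/bin/awk ]; then
                AWK=/usr/xpg4/bin/awk
        else
                AWK=nawk
        fi
fi

# Quick check that the chosen awk splits on "~" as expected.
printf '1~15~2017-01-01~me\n' | "$AWK" -F'~' '{ print "Location: " $2 }'
```

The awk commands in this thread can then be invoked as "$AWK" instead of awk.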

brettcasper · November 15, 2017, 1:03am

You guys rock. I was doing this within a Cygwin bash shell and within a bash shell on AIX. I was close to what Yoda was doing, and with his example I can now see where my code was starting to go wrong. Following Don's suggestion, I focused on testing his version, and it worked like a charm. Thanks for the help.