awk to update file based on partial match in field1 and exact match in field2

I am trying to create a cron job that runs at startup and checks a list.txt file to see whether there is a later version of a database, using database.txt as the source. The matching lines are written to output.

$1 of database.txt will appear in list.txt as a partial match. $2 of database.txt will also be in list.txt.

If the output file and database.txt match, then report "all are current", but if one or more lines differ between the two files, then report "newer version of line available".

So, using the first line in database.txt as an example, refGene is a partial match to the hg19_refGene* names in list.txt. The $2 value is the same in both files. There may be multiple matching lines, as in this case, but the dates will always match.

The awk below seems to find the partial match, but that is as far as I get. Thank you :).

database.txt (always two fields separated by a space; the first field contains the name and the second field is the date)

refGene 20151211
clinvar 20170215
popfreq_all 20150413
dbnsfp 20170123
spidex 20150827

list.txt (file can be variable in length, but the name is a partial match in $1 and the date is in $2; file is tab-delimited)

hg19_clinvar_20130905.txt.gz	20140527	415781
hg19_clinvar_20130905.txt.idx.gz	20140527	73218
hg19_clinvar_20131105.txt.gz	20140527	580838
hg19_clinvar_20131105.txt.idx.gz	20140527	167090
hg19_clinvar_20140211.txt.gz	20140527	694067
hg19_clinvar_20140211.txt.idx.gz	20140527	181049
hg19_clinvar_20140303.txt.gz	20140527	773948
hg19_clinvar_20140303.txt.idx.gz	20140527	182842
hg19_clinvar_20140702.txt.gz	20140712	1111503
hg19_clinvar_20140702.txt.idx.gz	20140712	367271
hg19_clinvar_20140902.txt.gz	20140911	1503198
hg19_clinvar_20140902.txt.idx.gz	20140911	389069
hg19_clinvar_20140929.txt.gz	20141002	1521398
hg19_clinvar_20140929.txt.idx.gz	20141002	389735
hg19_clinvar_20150330.txt.gz	20150413	1988285
hg19_clinvar_20150330.txt.idx.gz	20150413	426235
hg19_clinvar_20150629.txt.gz	20150724	2211904
hg19_clinvar_20150629.txt.idx.gz	20150724	428773
hg19_clinvar_20151201.txt.gz	20160303	1978309
hg19_clinvar_20151201.txt.idx.gz	20160303	188549
hg19_clinvar_20160302.txt.gz	20160303	2070491
hg19_clinvar_20160302.txt.idx.gz	20160303	195824
hg19_clinvar_20161128.txt.gz	20161205	2762808
hg19_clinvar_20161128.txt.idx.gz	20161205	239561
hg19_clinvar_20170130.txt.gz	20170215	4756134
hg19_clinvar_20170130.txt.idx.gz	20170215	312735
hg19_dbnsfp30a.txt.gz	20151015	2916074880
hg19_dbnsfp30a.txt.idx.gz	20151015	4981998
hg19_dbnsfp31a_interpro.txt.gz	20151223	147102844
hg19_dbnsfp31a_interpro.txt.idx.gz	20151223	2445036
hg19_dbnsfp33a.txt.gz	20170123	3610182452
hg19_dbnsfp33a.txt.idx.gz	20170123	5034641
hg19_popfreq_all_20150413.txt.gz	20150413	1059027804
hg19_popfreq_all_20150413.txt.idx.gz	20150413	212518299
hg19_refGeneMrna.fa.gz	20151211	41379833
hg19_refGene.txt.gz	20151211	5304233
hg19_refGeneVersion.txt.gz	20151211	131417
hg19_spidex.zip	20150827	2991981619

desired output

refGene 20151211
clinvar 20170215
popfreq_all 20150413
dbnsfp 20170123
spidex 20150827

awk used to generate list.txt

awk 'FNR==NR{a[$1]; next} {for (i in a) if (index($0, i)) print}' database hg19_avdblist.txt > list
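awk '
        # First file (database.txt): remember the date for each name
        NR == FNR {
                A[$1] = $2
                next
        }
        # Second file (list.txt): test every database name against this line
        {
                for ( k in A )
                {
                        # k is used as a regex, so it matches anywhere in $1;
                        # the date in $2 must equal the database date exactly
                        if ( $1 ~ k && $2 == A[k] )
                                F[k] = $2
                }
        }
        # Print each matched name once; for-in order is unspecified,
        # so the lines may not follow the order of database.txt
        END {
                for ( k in F )
                        print k, F[k]
        }
' database.txt list.txt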
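
If you also want the "current" / "newer version available" report described in the question, a rough sketch could look like the one below. It is only one way to do it and rests on a few assumptions that are not in the original post: "newer" is taken to mean that list.txt has a matching name with a later date in $2, the status wording is made up, and the two small sample files (including the deliberately old spidex date) are invented so that the script runs on its own. It keeps the names in the order of database.txt and uses index() so that names are compared as literal substrings rather than as regexes.

```shell
#!/bin/sh
# Sketch only: sample data and status messages are illustrative assumptions.
tmp=$(mktemp -d)

# Hypothetical database.txt (space-separated); spidex is given an old date on purpose
printf 'refGene 20151211\nclinvar 20170215\nspidex 20150101\n' > "$tmp/database.txt"

# Hypothetical excerpt of list.txt (tab-separated)
printf 'hg19_clinvar_20170130.txt.gz\t20170215\t4756134\n'  > "$tmp/list.txt"
printf 'hg19_refGene.txt.gz\t20151211\t5304233\n'          >> "$tmp/list.txt"
printf 'hg19_spidex.zip\t20150827\t2991981619\n'           >> "$tmp/list.txt"

result=$(awk '
        NR == FNR {                     # first file: database.txt
                date[$1] = $2           # name -> date currently recorded
                order[++n] = $1         # keep input order for the report
                next
        }
        {                               # second file: list.txt
                for (i = 1; i <= n; i++) {
                        k = order[i]
                        # literal substring match; track the latest date seen for k
                        if (index($1, k) && $2 > latest[k])
                                latest[k] = $2
                }
        }
        END {
                for (i = 1; i <= n; i++) {
                        k = order[i]
                        if (latest[k] == "")
                                status = "not found in list"
                        else if (latest[k] > date[k])
                                status = "newer version available (" latest[k] ")"
                        else
                                status = "current"
                        print k, date[k], status
                }
        }
' "$tmp/database.txt" "$tmp/list.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

With your real files you would drop the sample-data lines and pass database.txt and list.txt directly. The comparison relies on the dates being fixed-width YYYYMMDD values, as they are in your examples.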

Thank you very much :).