Field matching in two data files

palex · April 7, 2017, 11:38pm

Hello,
I would like to output every line from file2 whose 11th field appears as the first field of a line in file1, with the matching second field from file1 appended, like this:

file1:

2222 0.35
4444 0.25
5555 0.75

file2:

col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 1111
col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 2222
col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 3333
col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 4444
col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 5555

Desired output:

col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 2222 0.35
col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 4444 0.25
col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 5555 0.75

Thanks so much!

RavinderSingh13 · April 8, 2017, 12:29am

Hello palex,

Could you please try the following and let me know if it helps?

awk 'FNR==NR{A[$1]=$0;next} ($NF in A){print $0,A[$NF]}'  Input_file1   Input_file2

Thanks,
R. Singh

palex · April 8, 2017, 5:49pm

Column 11 of file2 was duplicated in the output, but this works for me. Thank you so much!

RudiC · April 9, 2017, 3:29am

Try, then,

awk 'FNR==NR{A[$1]=$2;next} ($NF in A){print $0,A[$NF]}' file1   file2
col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 2222 0.35
col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 4444 0.25
col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 5555 0.75
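
For anyone reading along, here is a minimal, self-contained sketch of the same two-file idiom. The function name match_fields and the temporary sample files are made up for illustration and are not from the posts above. It tests $11 explicitly instead of $NF, on the assumption that the key is always the 11th field, and it stores only $2 from file1, which is why the key no longer shows up twice. One caveat: the FNR==NR test assumes the first file is not empty.

```shell
# Hypothetical helper (not from the thread): append the file1 value to
# each file2 line whose 11th field matches a file1 key.
match_fields() {
    # $1 = lookup file (key, value), $2 = data file (key in field 11)
    awk '
        FNR == NR { val[$1] = $2; next }   # first file: remember value per key
        $11 in val { print $0, val[$11] }  # second file: print matches plus value
    ' "$1" "$2"
}

# Rebuild the sample data from the first post in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' '2222 0.35' '4444 0.25' '5555 0.75' > "$tmp/file1"
for k in 1111 2222 3333 4444 5555; do
    echo "col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 $k"
done > "$tmp/file2"

match_fields "$tmp/file1" "$tmp/file2"
rm -rf "$tmp"
```

With the sample data this prints the three lines of the desired output. Neither file needs to be sorted, and the lines come out in file2 order.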

drl · April 9, 2017, 7:54am

Hi.

If the input files are sorted on the fields to be matched, then one can use:

join [options] <files>

like this:

#!/usr/bin/env bash

# @(#) s1       Demonstrate blending matched-field files, join.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f "$C" ] && "$C" join pass-fail

E=expected-output.txt

# Remove old results file.
rm -f f1

pl " Input data files data1, data2:"
head data[12]

pl " Expected output:"
cat "$E"

# output all of the lines from file2 whose 11th field is present
# in the first field in file1
pl " Results:"
format="2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8,2.9,2.10,1.1,1.2"
join -t " " -1 1 -2 11 -o "$format" data1 data2 |
tee f1

pl " Verify results if possible:"
C=$HOME/bin/pass-fail
[ -f "$C" ] && "$C" || ( pe; pe " Results cannot be verified." ) >&2

exit 0

producing:

$ ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution        : Debian 8.7 (jessie) 
bash GNU bash 4.3.30
join (GNU coreutils) 8.23
pass-fail (local) 1.9

-----
 Input data files data1, data2:
==> data1 <==
2222 0.35
4444 0.25
5555 0.75

==> data2 <==
col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 1111
col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 2222
col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 3333
col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 4444
col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 5555

-----
 Expected output:
col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 2222 0.35
col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 4444 0.25
col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 5555 0.75

-----
 Results:
col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 2222 0.35
col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 4444 0.25
col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 5555 0.75

-----
 Verify results if possible:

-----
 Comparison of 3 created lines with 3 lines of desired results:
 Succeeded -- files (computed) f1 and (standard) expected-output.txt have same content.

See man join and experiment.

Best wishes ... cheers, drl
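
A follow-up note on the sorting requirement: join expects both inputs to be sorted on the join field. If that cannot be guaranteed, one possible approach is to sort copies first and join those. The sketch below assumes single-space separators, as in the sample data. The name join_unsorted is hypothetical, and LC_ALL=C is set so that sort and join agree on the ordering.

```shell
# Hypothetical helper (not from the thread): sort both inputs on the
# join field, then join them. Output is ordered by key, not by the
# original order of the data file.
join_unsorted() {
    # $1 = lookup file (key in field 1), $2 = data file (key in field 11)
    tmpdir=$(mktemp -d) || return 1
    LC_ALL=C sort -t ' ' -k1,1   "$1" > "$tmpdir/a"
    LC_ALL=C sort -t ' ' -k11,11 "$2" > "$tmpdir/b"
    # Print fields 1-11 of the data file, then field 2 of the lookup file.
    LC_ALL=C join -t ' ' -1 1 -2 11 \
        -o 2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8,2.9,2.10,2.11,1.2 \
        "$tmpdir/a" "$tmpdir/b"
    rm -rf "$tmpdir"
}

# Demo with deliberately shuffled copies of the sample data.
d=$(mktemp -d)
printf '%s\n' '5555 0.75' '2222 0.35' '4444 0.25' > "$d/file1"
for k in 4444 1111 5555 3333 2222; do
    echo "col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 $k"
done > "$d/file2"

join_unsorted "$d/file1" "$d/file2"
rm -rf "$d"
```

The demo prints the same three lines as the desired output, in key order (2222, 4444, 5555). If the original order of file2 matters, the awk approach earlier in the thread is probably the simpler choice.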