Text match in two files

cmccabe · February 7, 2015, 10:09am

I am trying to match the text in file1 against file2, printing what matches to a new file (match.txt) and what does not match to another (missing.txt).

 awk -F'|' 'NR==FNR{c[$1$2]++;next};c[$1$2] > 0' flugent.txt IDT.txt > match.txt

Thank you :).

Walter_Misar · February 7, 2015, 10:52am

Something like the following might do for an overview:

sed 's/,[ ]*/\n/g' flugent.txt |comm - IDT.txt

Or

sed 's/,[ ]*/\n/g' flugent.txt |comm -12 - IDT.txt

for the matching list.
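One caveat with this route: comm expects both inputs to be sorted in the same collation order. Here is a minimal sketch that does not assume the files are already sorted. The sample contents, the scratch directory and the *.sorted file names are made up for illustration, and tr is swapped in for the sed call above to split on commas.

```shell
#!/bin/sh
# Hypothetical sample data; the real flugent.txt and IDT.txt are not shown in the thread.
workdir=$(mktemp -d)
printf 'ZPBP, A2M,AAAS\n' > "$workdir/flugent.txt"
printf 'A4GNT\nZPBP\nA2M\n' > "$workdir/IDT.txt"

# Turn the comma list into one name per line, drop spaces, then sort.
tr ',' '\n' < "$workdir/flugent.txt" | tr -d ' ' | LC_ALL=C sort > "$workdir/flugent.sorted"
LC_ALL=C sort "$workdir/IDT.txt" > "$workdir/IDT.sorted"

# -12 suppresses columns 1 and 2, leaving the lines common to both files.
LC_ALL=C comm -12 "$workdir/flugent.sorted" "$workdir/IDT.sorted" > "$workdir/match.txt"
# -13 suppresses columns 1 and 3, leaving the lines found only in IDT.txt.
LC_ALL=C comm -13 "$workdir/flugent.sorted" "$workdir/IDT.sorted" > "$workdir/missing.txt"

cat "$workdir/match.txt" "$workdir/missing.txt"
```

Pinning LC_ALL=C for both sort and comm keeps the two tools agreeing on the ordering.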

MadeInGermany · February 7, 2015, 11:04am

Why -F'|'? Your first input file is separated only by commas.
With awk, it is best to use RS (the record separator) to split it into lines.

awk 'NR==FNR{c[$0]; next} ($0 in c)' RS="," flugent.txt RS="\n" IDT.txt > match.txt
awk 'NR==FNR{c[$0]; next} !($0 in c)' RS="," flugent.txt RS="\n" IDT.txt > missing.txt
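To see what switching RS between the two files actually does, here is a small sketch that prints every record awk reads, wrapped in brackets. The sample contents and the scratch directory are hypothetical; only the file names come from the thread. The brackets make it visible if a newline stays attached to the last comma-separated record.

```shell
#!/bin/sh
# Hypothetical sample data in a scratch directory (not the poster's real files).
workdir=$(mktemp -d)
printf 'A2M,A4GALT,AAAS\n' > "$workdir/flugent.txt"
printf 'A2M\nA4GNT\n' > "$workdir/IDT.txt"

# RS="," applies while flugent.txt is read; RS="\n" is assigned just
# before IDT.txt is opened. Each record is printed as file:recordnumber:[text].
records=$(cd "$workdir" && awk '{ printf "%s:%d:[%s]\n", FILENAME, FNR, $0 }' \
    RS="," flugent.txt RS="\n" IDT.txt)

printf '%s\n' "$records"
```

If the closing bracket of the last flugent.txt record lands on a line of its own, that record still carries the file's final newline and will not compare equal to the same name in IDT.txt.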

cmccabe · February 7, 2015, 1:32pm

Thank you

RudiC · February 7, 2015, 2:09pm

Building on MadeInGermany's proposal, try:

awk 'NR==FNR{c[$0]; next} ($0 in c) {print >"match.txt"; next}    {print > "missing.txt"' RS="," flugent.txt RS="\n" IDT.txt

cmccabe · February 7, 2015, 2:15pm

I get the below error:

 awk 'NR==FNR{c[$0]; next} ($0 in c) {print >"match.txt"; next}    {print > "missing.txt"' RS="," flugent.txt RS="\n" IDT.txt
awk: cmd. line:1: NR==FNR{c[$0]; next} ($0 in c) {print >"match.txt"; next}    {print > "missing.txt"
awk: cmd. line:1:                                                                                    ^ unexpected newline or end of string

Thank you :).

---------- Post updated at 01:15 PM ---------- Previous update was at 01:14 PM ----------

Never mind, I forgot that I had removed some files (IDT.txt and flugent.txt).

drl · February 7, 2015, 2:19pm

Hi.

A minor variation that reads the files just once:

#!/usr/bin/env bash

# @(#) s1	Demonstrate special case of matching to separate files, awk.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f "$C" ] && "$C" pll specimen awk

pl " Data files data[12]:"
pll data1
pe
specimen 3 data2

pl " Results:"
awk 'NR==FNR{c[$0]; next} { if ($0 in c) {print > "match.txt"} else {print} } ' RS="," data1 RS="\n" data2 > missing.txt

wc match.txt missing.txt
pe
specimen 3 match.txt missing.txt

exit 0

producing:

$ ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian 5.0.8 (lenny, workstation) 
bash GNU bash 3.2.39
pll (local) 1.24
specimen (local) 1.17
awk GNU Awk 3.1.5

-----
 Data files data[12]:
 (Longest line: 28129; fit into lines of length 78)
         1         2         3     ...9        2810        2811        2812
12345678901234567890123456789012345...3456789012345678901234567890123456789
A2M,A4GALT,A4GNT,AAAS,AADAC,AADACL2...F750,ZNF75D,ZNF804A,ZNF81,ZNHIT6,ZPBP

Edges: 3:0:3 of 4631 lines in file "data2"
A2M
A4GALT
A4GNT
   ---
ZNF81
ZPBP
ZPBP2

-----
 Results:
 3863  3863 23201 match.txt
  768   768  4837 missing.txt
 4631  4631 28038 total

Edges: 3:0:3 of 3863 lines in file "match.txt"
A2M
A4GALT
A4GNT
   ---
ZNF750
ZNF75D
ZNF81

Edges: 3:0:3 of 768 lines in file "missing.txt"
AAGAB
ABL1
ACAD10
   ---
ZNF80
ZPBP
ZPBP2

Best wishes ... cheers, drl
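One detail in the output above: ZPBP is the last name shown in data1, yet it ends up in missing.txt. A likely cause (an assumption, since data1 itself is not reproduced here) is that with RS="," the final record keeps the file's trailing newline, so it is stored as "ZPBP" plus a newline. A minimal sketch of a single-pass split that strips that newline first, using made-up sample data and hypothetical awk variable names hit and miss:

```shell
#!/bin/sh
# Hypothetical sample data standing in for data1 and data2.
workdir=$(mktemp -d)
printf 'A2M,A4GALT,ZPBP\n' > "$workdir/data1"
printf 'A2M\nAAGAB\nZPBP\nZPBP2\n' > "$workdir/data2"

# Pass 1 (RS=","): store each name, minus any newline glued to the last one.
# Pass 2 (RS="\n"): known names go to the "hit" file, the rest to "miss".
awk -v hit="$workdir/match.txt" -v miss="$workdir/missing.txt" '
    NR==FNR   { sub(/\n$/, ""); c[$0]; next }
    ($0 in c) { print > hit; next }
              { print > miss }
' RS="," "$workdir/data1" RS="\n" "$workdir/data2"

wc -l "$workdir/match.txt" "$workdir/missing.txt"
```

With this sample data the last name of the comma-separated file is counted as a match, and the two line counts add up to the number of lines in data2.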

RavinderSingh13 · February 7, 2015, 2:19pm

Hello cmccabe,

Adding the missing closing brace to RudiC's code, as follows, should make it work:

awk 'NR==FNR{c[$0]; next} ($0 in c) {print >"match.txt"; next}    {print > "missing.txt"}' RS="," flugent.txt RS="\n" IDT.txt

Thanks,
R. Singh

cmccabe · February 7, 2015, 2:30pm

Thank you :).