Compare 2 files and extract the data that is not present in the other file, with a condition

rajniman · February 3, 2012, 12:26pm

I have 2 files whose data are as follows:

fileA

00    lieferungen   
00    attractiop
01    done
02    forness
03    rasp
04    alwaysisng
04    funny
05     done1

fileB

alwayssng
dkhf
fdgdfg
dfgdg
sdjkgkdfjg
funny
rasp
done1

I want to compare the 2nd-column values of fileA with the rows of fileB. If a value is not present in fileB, then write that value to another file. But only those values of fileA whose 1st column is greater than or equal to 03 are compared with fileB.

E.g. the 1st value picked up from fileA is "lieferungen", whose 1st column is 00, so it will not be compared.
The next value, like "funny", has a 1st column of 04, which is greater than 3, so it is compared with fileB.
If it is not present in fileB, it is written to the third file.

output

fileC
03    rasp
04    funny
05    done1

I have a huge database of more than 80000 records. Please help if it can be done using awk.

ahamed101 · February 3, 2012, 12:30pm

awk 'NR==FNR{a[$0]++;next}a[$2] && $1>=3' fileB fileA

--ahamed

rajniman · February 6, 2012, 7:11am

Hi ahamed,
Thanks for your reply.
But how do I write the answer to a third file?

kalpeer · February 6, 2012, 7:27am

This also works

grep -f fileB fileA | awk ' $1 >= 3' > fileC

rajniman · February 7, 2012, 9:23am

Thanks, kalpeer, for replying.

I have one more file:
fileD

funny       mou1
funny       mou2
raspe       mou4
damn       mou1

This file is tab-separated (\t).
I want to look up each first-column value of this file in fileB, and then check whether the second-column value is also present in the matching line.

For example, suppose fileB contains the lines

funnymou1:20112
funnymou2:34470
raspmou3:nhdhv
raspmou4:38748

The first entry in the 1st column of fileD is "funny".
So "funny" is searched for in fileB. If it is found, the second column, i.e. mou1, is checked in that same line: if it is present, then OK; if it is not present, the fileD line is written to another file, fileE. The second column of fileD is checked only in lines where the first column was found. If the first-column value occurs in more than one line of fileB, all of those lines are compared with the second column of fileD: if it is found, the line is not written; if it is not found, it is written to fileE.
So for
raspe mou4: as raspe is not present in fileB, it will be written to fileE as output.

So
funny mou1
is present, so it will not be written to fileE.
Next,
funny mou2 is also present in fileB, so it is not written.
Now
raspe mou4
damn mou1 will be written to fileE, as they are not present in fileB.

o/p of fileE

funny   mou1
funny   mou2

Can it be done using awk? I have more than 80000 records.
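A possible awk sketch for the fileD check described above, assuming fileD really is tab-separated and that "present" means a plain substring match. It follows the prose description (pairs with no matching fileB line go to fileE); the sample fileE listed above shows the matched pairs instead, and testing `found` rather than `!found` would print those. The names `line`, `n` and `found` are only illustrative. Run it in an empty scratch directory, since it overwrites fileB, fileD and fileE.

```shell
# Sample data from the post; fileD is written with real tabs.
printf 'funny\tmou1\nfunny\tmou2\nraspe\tmou4\ndamn\tmou1\n' > fileD
cat > fileB <<'EOF'
funnymou1:20112
funnymou2:34470
raspmou3:nhdhv
raspmou4:38748
EOF

awk -F '\t' '
    # First file (fileB): NR==FNR holds only here; store every line in order.
    NR == FNR { line[++n] = $0; next }

    # Second file (fileD): look for a fileB line that contains the
    # 1st column and, in that same line, the 2nd column as well.
    {
        found = 0
        for (i = 1; i <= n; i++)
            if (index(line[i], $1) && index(line[i], $2)) { found = 1; break }
        if (!found) print      # no such line: the pair goes to fileE
    }
' fileB fileD > fileE

cat fileE
```

With the sample data this writes the raspe and damn lines to fileE. Every fileD line may scan all of fileB, so with 80000 records on both sides it could take a while.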

---------- Post updated at 09:22 AM ---------- Previous update was at 08:52 AM ----------

Hi kalpeer,

The code
grep -f fileB fileA | awk ' $1 >= 3' > fileC
is not working,
because a fileA value may be present inside a line of fileB (somewhere in the middle of that line, not as the whole line), but not vice versa.
That means a whole line of fileB cannot be present in fileA.
If fileB contains
2011890done1
3235235funny

and fileA contains
03 done
04 funny

then the code isn't working.
Please help!

ctsgnb · February 7, 2012, 9:41am

You can redirect the output to a third file for example fileC:

awk 'NR==FNR{a[$0]++;next}a[$2] && $1>=3' fileB fileA >fileC

awk 'NR==FNR{a[$0];next}($2 in a)&& $1>=3' fileB fileA >fileC

If your file really is <tab>-separated (which is not the case for the fileA you initially posted):

awk -F"\t" 'NR==FNR{a[$0];next}($2 in a)&& $1>=3' fileB fileA >fileC
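For reference, a spelled-out sketch of the same exact-match lookup, using the sample data from the first post. It assumes whitespace-separated fields, and the array name `seen` is only illustrative. Run it in an empty scratch directory, since it overwrites fileA, fileB and fileC.

```shell
# Sample data from the first post.
cat > fileA <<'EOF'
00 lieferungen
00 attractiop
01 done
02 forness
03 rasp
04 alwaysisng
04 funny
05 done1
EOF
printf '%s\n' alwayssng dkhf fdgdfg dfgdg sdjkgkdfjg funny rasp done1 > fileB

awk '
    # First file (fileB): NR==FNR is true only while reading it.
    # Remember every whole line as an array key.
    NR == FNR { seen[$0]; next }

    # Second file (fileA): print rows whose 1st column is >= 3 and
    # whose 2nd column is exactly equal to some line of fileB.
    $1 >= 3 && ($2 in seen)
' fileB fileA > fileC

cat fileC
```

The `($2 in seen)` test only looks the key up, whereas a bare `seen[$2]` reference also creates an empty entry for every value it checks, so the `in` form keeps the array from growing on large inputs.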

rajniman · February 8, 2012, 8:33am

Hi ctsgnb,

It's not working: ($2 in a) looks for the whole pattern as a complete line of fileB, but that pattern can be present in the middle of a line of fileB.

My files are as follows:

fileA, separated by tab (\t)
00 lieferungen
00 attractiop
01 done
02 forness
03 rasp
04 alwaysisng
04 funny
05 done1

fileB
funnymou120112
funnymou234470
raspmou3nhdhv
raspmou438748

So all those records whose 1st column is greater than or equal to 3 and whose value is not present in fileB are to be redirected to a third file.
E.g. in the above file, the four records
03 rasp
04 alwaysisng
04 funny
05 done1
are greater than or equal to 3. So rasp is compared against each line of fileB. As rasp is present in a line of fileB, it is not redirected to the output file. If rasp were not present in any line of fileB, it would be redirected to the output file. So the output file for the above will look like

o/p
04 alwaysisng
05 done1

Please help.

ctsgnb · February 8, 2012, 9:35am

nawk -F"\t" 'NR==FNR{a[$0];next}$1>=3{f=0;for(i in a)if(index(i,$2))f=1;if(!f)print}' fileB fileA >fileC
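A commented sketch of the substring check, run against the sample files from the previous post. It assumes a tab-separated fileA and uses index(), so the fileA value is matched as a literal substring rather than as a regular expression. The names `line`, `n` and `found` are only illustrative, and plain `awk` is used here where the post uses `nawk`. Run it in an empty scratch directory, since it overwrites fileA, fileB and fileC.

```shell
# Sample data from the post; fileA is written with real tabs.
printf '00\tlieferungen\n00\tattractiop\n01\tdone\n02\tforness\n' > fileA
printf '03\trasp\n04\talwaysisng\n04\tfunny\n05\tdone1\n' >> fileA
cat > fileB <<'EOF'
funnymou120112
funnymou234470
raspmou3nhdhv
raspmou438748
EOF

awk -F '\t' '
    # First file (fileB): store every line in order.
    NR == FNR { line[++n] = $0; next }

    # Second file (fileA): only rows whose 1st column is >= 3.
    $1 >= 3 {
        found = 0
        # Stop at the first fileB line that contains the value.
        for (i = 1; i <= n; i++)
            if (index(line[i], $2)) { found = 1; break }
        if (!found) print      # value occurs nowhere in fileB
    }
' fileB fileA > fileC

cat fileC
```

On the sample data this leaves the alwaysisng and done1 rows in fileC, which matches the output asked for. Each qualifying fileA row can scan all of fileB, so the run time grows with the product of the two file sizes.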