compare two files

fredao · January 27, 2007, 11:28am

I have file1 and file2:

file1:

11 xxx kksd ...
22 kkk kdsglg...
33 sss kdfjdksa...
44 kdsf dskjfkas ...
hh kdkf kdkkd..
jg dkf dfkdk ...
...

file2:

jg
22
hh
...

I need to check each line of file1: if its first field is in file2, I keep the line; if not, the whole line is discarded. The result file will be:

jg dkf dfkdk ...
22 kkk kdsglg...
hh kdkf kdkkd..
...

Please tell me how I can do this. Thanks!

aigles · January 27, 2007, 12:23pm

A possible solution:

awk 'NR==FNR { keys[$1]++ ; next } $1 in keys' file2 file1

Jean-Pierre.
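
For readers new to the NR==FNR idiom, here is a minimal sketch of how the one-liner above behaves. The sample data is modelled on the question (the "..." placeholders are left out), and the scratch directory and the variable names workdir and kept are only illustrative:

```shell
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Sample data modelled on the question.
printf '%s\n' '11 xxx kksd' '22 kkk kdsglg' '33 sss kdfjdksa' \
    '44 kdsf dskjfkas' 'hh kdkf kdkkd' 'jg dkf dfkdk' > file1
printf '%s\n' 'jg' '22' 'hh' > file2

# While awk reads the first file (file2), NR equals FNR: remember the key, skip the rest.
# While it reads the second file (file1), NR is larger than FNR, so only the
# pattern "$1 in keys" is evaluated; a pattern without an action prints the line.
kept=$(awk 'NR == FNR { keys[$1]; next } $1 in keys' file2 file1)
printf '%s\n' "$kept"
```

The lines come out in file1's order (22, hh, jg), not in the order the keys are listed in file2.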

fredao · January 27, 2007, 1:21pm

It works for the question above. However, my real problem is more complicated: file1 is actually an XML file like this:

...
<object
type="user"
id="000039BF228B"
encryptedPassword=""
maxConnections=""
>
<checkListAttributes>
</checkListAttributes>
</object>

...
<object
type="user"
id="0000E2801BFD"
encryptedPassword=""
>
<checkListAttributes>
</checkListAttributes>
</object>
...

and file2 is a list of ids, such as:

...
000039BF228B
0000E2801BFD
...

I want to delete all the blocks whose id is not in file2 and keep those whose id is in file2. I think we can change RS (the record separator) to </object>, but I do not know how to do the whole job. Would you help again?

radoulov · January 27, 2007, 2:44pm

With GNU awk!

patt="$(printf "id=\"%s\"|" $(<file2))"
awk '$0 ~ patt{print $0RS}' RS="</object>" patt="${patt%|}" file1

fredao · January 27, 2007, 2:59pm

How do I enter these two lines as a command? Can you make it clear?

radoulov · January 27, 2007, 3:07pm

I'm not sure that I understand the question, but:

$ cat file1
<object
type="user"
id="000039BF228B"
encryptedPassword=""
maxConnections=""
>
<checkListAttributes>
</checkListAttributes>
</object>
<object
type="user"
id="0000E2801BFD_NOO"
encryptedPassword=""
>
<checkListAttributes>
</checkListAttributes>
</object>
<object
type="user"
id="0000E2801BFD"
encryptedPassword=""
>
<checkListAttributes>
</checkListAttributes>
</object>

$ cat file2
000039BF228B
0000E2801BFD

$ patt="$(printf "id=\"%s\"|" $(<file2))"
$ awk '$0 ~ patt{print $0RS}' RS="</object>" patt="${patt%|}" file1
<object
type="user"
id="000039BF228B"
encryptedPassword=""
maxConnections=""
>
<checkListAttributes>
</checkListAttributes>
</object>

<object
type="user"
id="0000E2801BFD"
encryptedPassword=""
>
<checkListAttributes>
</checkListAttributes>
</object>

fredao · January 27, 2007, 3:13pm

I think there is a misunderstanding, as I only want to keep the blocks whose id has a match in the second file. If I have a block as:

<object
type="user"
id="999999999999"
encryptedPassword=""
>
<checkListAttributes>
</checkListAttributes>
</object>

and 999999999999 is not in the second file, the whole block should be discarded. But after I run your code, it is still there. Any idea?

radoulov · January 27, 2007, 3:26pm

I said GNU awk,
are you using GNU awk?
It's hard to troubleshoot, unless I can see the entire file1 and file2 content.
Could you also post the output from this command:

patt="$(printf "id=\"%s\"|" $(<file2))" ; echo "${patt%|}"

$ awk --version| head -2
GNU Awk 3.1.5
Copyright (C) 1989, 1991-2005 Free Software Foundation.
$ cat file1
<object
type="user"
id="0000E2801BFD"
encryptedPassword=""
>
<checkListAttributes>
</checkListAttributes>
</object>
<object
type="user"
id="999999999999"
encryptedPassword=""
>
<checkListAttributes>
</checkListAttributes>
</object>

$ cat file2
000039BF228B
0000E2801BFD

$ awk '$0 ~ patt{print $0RS}' RS="</object>" patt="${patt%|}" file1
<object
type="user"
id="0000E2801BFD"
encryptedPassword=""
>
<checkListAttributes>
</checkListAttributes>
</object>

fredao · January 27, 2007, 3:33pm

Sorry, I had entered the command incorrectly. It works, though it consumes a lot of computational power. Thanks!
Would you explain your code?

radoulov · January 27, 2007, 3:42pm

Yep,
it's "ugly" and "buggy" code (think what happens if your file2 is big :)).
I'm not able to write good code in 2 minutes.
The first command generates the pattern: a list of id="..." alternatives joined with "or" ("|").
The second tests every record in file1 (with RS="</object>" as the record separator) against it.
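
To make the first step concrete, here is a small sketch that builds the pattern and prints it. It uses $(cat file2) instead of $(<file2) so that it also runs in a plain POSIX sh. The file named sample and the variable matched are made up for this illustration, and the match is done line by line, so it does not depend on GNU awk's regex RS:

```shell
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' '000039BF228B' '0000E2801BFD' > file2

# printf reuses its format for every argument, so each id from file2 becomes
# id="<value>" followed by "|". The command substitution is left unquoted on
# purpose: the shell splits it into one argument per id.
patt=$(printf 'id="%s"|' $(cat file2))
printf 'raw:     %s\n' "$patt"

# ${patt%|} strips the trailing "|". Left in place, it would create an empty
# alternative, which can match every record (or be rejected, depending on the awk).
patt=${patt%|}
printf 'trimmed: %s\n' "$patt"

# The same dynamic-regex test as in the thread, applied to two sample lines.
printf '%s\n' 'id="0000E2801BFD"' 'id="999999999999"' > sample
matched=$(awk '$0 ~ patt' patt="$patt" sample)
printf 'matched: %s\n' "$matched"
```

The pattern grows by one alternative per line of file2, which is why this approach gets slow (and may hit regex or command-line length limits) when file2 is big.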

fredao · January 27, 2007, 9:36pm

nawk 'NR==FNR { keys[$1]++;next }; RS="</object>"; $3 in keys' file2 file1

fredao · January 28, 2007, 12:53am

Is patt a shell variable or a gawk variable? The first command seems to have nothing to do with awk. Is there any link that explains this syntax?

radoulov · January 28, 2007, 5:22am

patt is a shell variable here:

patt="$(printf "id=\"%s\"|" $(<file2))"

... and becomes an awk variable here:

awk '$0 ~ patt{print $0RS}' RS="</object>" patt="${patt%|}" file1
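
A short sketch of the hand-over, under the assumption that any POSIX awk is available. A name=value argument placed among the file names is assigned when awk reaches it, that is after BEGIN has run but before the following file is read. The -v option assigns before BEGIN. The names word, w and input are made up for the example:

```shell
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf 'one line\n' > input

word="hello"   # an ordinary shell variable

# Assignment operand: w is empty in BEGIN and set once awk starts reading input.
operand=$(awk 'BEGIN { print "BEGIN=[" w "]" } { print "main=[" w "]" }' w="$word" input)
printf '%s\n' "$operand"

# -v option: w is already set when BEGIN runs.
option=$(awk -v w="$word" 'BEGIN { print "BEGIN=[" w "]" } { print "main=[" w "]" }' input)
printf '%s\n' "$option"
```

In both forms the shell expands "$word" first; awk only ever sees the resulting text.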

radoulov · January 28, 2007, 5:38am

And, of course, (given your input format) with GNU grep:

grep -B2 -A5 -f file2 file1

fredao · January 28, 2007, 12:39pm

What does this mean?

radoulov · January 29, 2007, 5:11am

It means this (it only works when every block has the same fixed layout, here two lines before the id line and five after it):

$ cat file1
<object
type="user"
id="0000E2801BFA"
encryptedPassword=""
>
<checkListAttributes>
</checkListAttributes>
</object>
<object
type="user"
id="999999999999"
encryptedPassword=""
>
<checkListAttributes>
</checkListAttributes>
</object>
<object
type="user"
id="0000E2801BFB"
encryptedPassword=""
>
<checkListAttributes>
</checkListAttributes>
</object>

$ cat file2
0000E2801BFA
0000E2801BFB

$ grep --version|head -1
grep (GNU grep) 2.5.1

$ grep -B2 -A5 -f file2 file1
<object
type="user"
id="0000E2801BFA"
encryptedPassword=""
>
<checkListAttributes>
</checkListAttributes>
</object>
--
<object
type="user"
id="0000E2801BFB"
encryptedPassword=""
>
<checkListAttributes>
</checkListAttributes>
</object>

To get rid of the '--' notation:

grep -B2 -A5 -f file2 file1|grep -v '^--'

... or (with bash/ksh93):

grep -v '^--' <(grep -B2 -A5 -f file2 file1)

fredao · January 29, 2007, 8:37am

My real situation is more complicated, as a block can be 10 or 11 lines long; namely, it should always be -B2, but sometimes -A7 and sometimes -A8. As I said, the blocks always begin with "<object" and end with "</object>". It is an XML-style file. Is there an easy way to change the code for this with grep?

radoulov · January 31, 2007, 6:39am

I don't know about grep, but after reading Jean-Pierre's (aigles) last post here
I thought that this might solve your problem:

GNU awk

awk 'NR==FNR {f2[$0];next}
substr($3,5,12) in f2{print $0RS}' file2 RS="</object>" file1

with your nawk

nawk 'NR==FNR {f2[$0];next}
substr($3,5,12) in f2{print "<"$0RS}' file2 RS="</object>" file1
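
If neither variant fits, a different technique is to leave RS alone and collect each block line by line. The sketch below is not the code from the posts above. It assumes, as in the samples, that every block starts with a line beginning with <object, ends with a line beginning with </object>, and carries its id on a line of its own beginning with id=". The names ids, block, keep, part and result are only illustrative. It should run in any POSIX awk and copes with blocks of varying length:

```shell
workdir=$(mktemp -d) && cd "$workdir" || exit 1

cat > file1 <<'EOF'
<object
type="user"
id="000039BF228B"
encryptedPassword=""
maxConnections=""
>
<checkListAttributes>
</checkListAttributes>
</object>
<object
type="user"
id="999999999999"
encryptedPassword=""
>
<checkListAttributes>
</checkListAttributes>
</object>
<object
type="user"
id="0000E2801BFD"
encryptedPassword=""
>
<checkListAttributes>
</checkListAttributes>
</object>
EOF
printf '%s\n' '000039BF228B' '0000E2801BFD' > file2

result=$(awk '
    NR == FNR { if (NF) ids[$1]; next }   # first file: remember every non-empty id
    /^<object/ { block = ""; keep = 0 }   # a new block starts: reset the state
    { block = block $0 "\n" }             # collect the current line
    /^id="/ {
        split($0, part, "\"")             # part[2] is the text between the quotes
        if (part[2] in ids) keep = 1      # hash lookup instead of a long regex
    }
    /^<\/object>/ {                       # block ends: print it only if the id was listed
        if (keep) printf "%s", block
        block = ""; keep = 0
    }
' file2 file1)
printf '%s\n' "$result"
```

Because each id is looked up in an array rather than matched against one long alternation, the cost per block should stay roughly constant however large file2 gets.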