fredao
January 27, 2007, 11:28am
1
I have file1 and file2:
file1:
11 xxx kksd ...
22 kkk kdsglg...
33 sss kdfjdksa...
44 kdsf dskjfkas ...
hh kdkf kdkkd..
jg dkf dfkdk ...
...
file2:
jg
22
hh
...
I need to check each line of file1: if field one is in file2, I keep the line; if not, the whole line is discarded. The result file will be:
jg dkf dfkdk ...
22 kkk kdsglg...
hh kdkf kdkkd..
...
please tell me how I can do this, thanks!
aigles
January 27, 2007, 12:23pm
2
A possible solution :
awk 'NR==FNR { keys[$1]++ ; next } $1 in keys' file2 file1
Jean-Pierre.
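For readers new to the NR==FNR idiom, here is a minimal annotated sketch of the same one-liner. The sample rows are made up for illustration (the question elides the real contents); the awk program itself is the one above, only spread over several lines.

```shell
# Hypothetical sample data standing in for the elided files in the question.
tmp=$(mktemp -d)
printf '%s\n' '11 xxx a' '22 kkk b' '33 sss c' 'hh kdkf d' 'jg dkf e' > "$tmp/file1"
printf '%s\n' 'jg' '22' 'hh' > "$tmp/file2"

# NR counts lines across all input files, FNR restarts at 1 for each file,
# so NR == FNR is true only while awk is reading the first file (file2).
result=$(awk '
    NR == FNR { keys[$1]++; next }   # first file: remember every key, skip the rest
    $1 in keys                       # second file: print lines whose field 1 is a key
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Note that the kept lines come out in file1 order, not in file2 order.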
fredao
January 27, 2007, 1:21pm
3
It works for the question above. However, my real problem is more complicated: file1 is actually an XML file like this:
...
<object
type="user"
id="000039BF228B"
encryptedPassword=""
maxConnections=""
>
<checkListAttributes>
</checkListAttributes>
</object>
...
<object
type="user"
id="0000E2801BFD"
encryptedPassword=""
>
<checkListAttributes>
</checkListAttributes>
</object>
...
and file2 is a list of ids, such as:
...
000039BF228B
0000E2801BFD
...
I want to delete all the blocks whose id is not in file2 and keep those whose id is in file2. I think we can change RS (the record separator) to </object>, but I do not know how to do the whole job. Would you help again?
With GNU awk!
patt="$(printf "id=\"%s\"|" $(<file2))"
awk '$0 ~ patt{print $0RS}' RS="</object>" patt="${patt%|}" file1
fredao
January 27, 2007, 2:59pm
5
How can I enter these two lines as a command? Can you make it clear?
I'm not sure that I understand the question, but:
$ cat file1
<object
type="user"
id="000039BF228B"
encryptedPassword=""
maxConnections=""
>
<checkListAttributes>
</checkListAttributes>
</object>
<object
type="user"
id="0000E2801BFD_NOO"
encryptedPassword=""
>
<checkListAttributes>
</checkListAttributes>
</object>
<object
type="user"
id="0000E2801BFD"
encryptedPassword=""
>
<checkListAttributes>
</checkListAttributes>
</object>
$ cat file2
000039BF228B
0000E2801BFD
$ patt="$(printf "id=\"%s\"|" $(<file2))"
$ awk '$0 ~ patt{print $0RS}' RS="</object>" patt="${patt%|}" file1
<object
type="user"
id="000039BF228B"
encryptedPassword=""
maxConnections=""
>
<checkListAttributes>
</checkListAttributes>
</object>
<object
type="user"
id="0000E2801BFD"
encryptedPassword=""
>
<checkListAttributes>
</checkListAttributes>
</object>
fredao
January 27, 2007, 3:13pm
7
I think there is a misunderstanding: I only want to keep the blocks whose id has a match in the second file. If I have a block such as:
<object
type="user"
id="999999999999"
encryptedPassword=""
>
<checkListAttributes>
</checkListAttributes>
</object>
and 999999999999 is not in the second file, the whole block should be discarded. But after I run your code, it is still there. Any idea?
I said GNU awk,
are you using GNU awk?
It's hard to troubleshoot, unless I can see the entire file1 and file2 content.
Could you also post the output from these commands:
patt="$(printf "id=\"%s\"|" $(<file2))" ; echo "${patt%|}"
$ awk --version| head -2
GNU Awk 3.1.5
Copyright (C) 1989, 1991-2005 Free Software Foundation.
$ cat file1
<object
type="user"
id="0000E2801BFD"
encryptedPassword=""
>
<checkListAttributes>
</checkListAttributes>
</object>
<object
type="user"
id="999999999999"
encryptedPassword=""
>
<checkListAttributes>
</checkListAttributes>
</object>
$ cat file2
000039BF228B
0000E2801BFD
$ awk '$0 ~ patt{print $0RS}' RS="</object>" patt="${patt%|}" file1
<object
type="user"
id="0000E2801BFD"
encryptedPassword=""
>
<checkListAttributes>
</checkListAttributes>
</object>
fredao
January 27, 2007, 3:33pm
9
Sorry, my mistake: I entered the command incorrectly. It works, though it consumes a lot of computational power. Thanks!
Would you explain your code?
Yep,
it's "ugly" and "buggy" code (think what happens if your file2 is big :)).
I'm not able to write good code in 2 minutes.
The first command generates your pattern list, joining the ids with "or" ("|").
The second tests every record in file1 (with RS="</object>" as the record separator) against it.
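To make the first step concrete, here is a small sketch that builds the pattern from the two ids shown earlier and prints it. The temporary file is only scaffolding for the demo, and $(cat file) is used as the portable spelling of the $(<file2) form above.

```shell
tmp=$(mktemp -d)
printf '%s\n' 000039BF228B 0000E2801BFD > "$tmp/file2"

# printf reuses its format string for every argument, so each id read from
# file2 is wrapped as id="..." and followed by a "|".
patt=$(printf 'id="%s"|' $(cat "$tmp/file2"))

# Strip the trailing "|" so the alternation does not end with an empty branch.
patt=${patt%|}

printf '%s\n' "$patt"
rm -rf "$tmp"
```

This prints id="000039BF228B"|id="0000E2801BFD". The pattern gains one branch per line of file2, and every record is tested against the whole alternation, which is why a big file2 makes it slow. The closing quote inside each branch is also what keeps 0000E2801BFD from matching id="0000E2801BFD_NOO".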
fredao
January 27, 2007, 9:36pm
11
nawk 'NR==FNR { keys[$1]++;next }; RS="</object>"; $3 in keys' file2 file1
fredao
January 28, 2007, 12:53am
12
Is patt a shell variable or a gawk variable? The first command seems to have nothing to do with awk. Is there any link for this syntax?
patt is a shell variable here:
patt="$(printf "id=\"%s\"|" $(<file2))"
... and becomes an awk variable here:
awk '$0 ~ patt{print $0RS}' RS="</object>" patt="${patt%|}" file1
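A minimal sketch of that hand-off, using a made-up two-line input file: awk treats a command-line operand of the form name=value as an assignment to an awk variable, performed when it reaches that operand (so after any BEGIN block), while the -v option assigns before the program starts. Both forms apply escape-sequence processing to the value, which does not matter for plain ids like these.

```shell
shell_patt='id="0000E2801BFD"'          # an ordinary shell variable

tmp=$(mktemp)
printf '%s\n' 'id="0000E2801BFD"' 'id="999999999999"' > "$tmp"

# Operand form: patt=... sits among the file names, as in the command above.
matched=$(awk '$0 ~ patt' patt="$shell_patt" "$tmp")

# Option form: -v sets the awk variable before the program runs.
matched_v=$(awk -v patt="$shell_patt" '$0 ~ patt' "$tmp")

printf '%s\n' "$matched" "$matched_v"
rm -f "$tmp"
```

Inside the single-quoted awk program, patt is always the awk variable; the shell never expands it there.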
And, of course, (given your input format) with GNU grep:
grep -B2 -A5 -f file2 file1
It works like this (only when every block has the same fixed line layout):
$ cat file1
<object
type="user"
id="0000E2801BFA"
encryptedPassword=""
>
<checkListAttributes>
</checkListAttributes>
</object>
<object
type="user"
id="999999999999"
encryptedPassword=""
>
<checkListAttributes>
</checkListAttributes>
</object>
<object
type="user"
id="0000E2801BFB"
encryptedPassword=""
>
<checkListAttributes>
</checkListAttributes>
</object>
$ cat file2
0000E2801BFA
0000E2801BFB
$ grep --version|head -1
grep (GNU grep) 2.5.1
$ grep -B2 -A5 -f file2 file1
<object
type="user"
id="0000E2801BFA"
encryptedPassword=""
>
<checkListAttributes>
</checkListAttributes>
</object>
--
<object
type="user"
id="0000E2801BFB"
encryptedPassword=""
>
<checkListAttributes>
</checkListAttributes>
</object>
To get rid of the '--' group separator lines:
grep -B2 -A5 -f file2 file1|grep -v '^--'
... or (with bash/ksh93):
grep -v '^--' <(grep -B2 -A5 -f file2 file1)
fredao
January 29, 2007, 8:37am
17
My real situation is more complicated: a block can be 10 or 11 lines long. Namely, it is always -B2, but sometimes -A7 and sometimes -A8. As I said, the blocks always begin with "<object" and end with "</object>"; it is an XML-style file. Is there an easy way to adapt the code for this with grep?
I don't know about grep, but reading Jean-Pierre's (aigles) last post here,
I thought that this may solve your problem.
With GNU awk:
awk 'NR==FNR {f2[$0];next}
substr($3,5,12) in f2{print $0RS}' file2 RS="</object>" file1
With your nawk:
nawk 'NR==FNR {f2[$0];next}
substr($3,5,12) in f2{print "<"$0RS}' file2 RS="</object>" file1
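If the id attribute is not always the third field, or the awk at hand does not accept a multi-character RS (POSIX leaves that case unspecified), a line-by-line variant is one possible alternative. The sketch below is not from the thread. It assumes every block starts on a line beginning with <object and ends on a line containing </object>, and that id="..." is the only attribute whose name ends in id. The sample blocks are made up, with the attribute order shuffled on purpose.

```shell
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
<object
type="user"
id="0000E2801BFA"
encryptedPassword=""
>
<checkListAttributes>
</checkListAttributes>
</object>
<object
type="user"
maxConnections=""
id="999999999999"
>
</object>
<object
id="0000E2801BFB"
type="user"
>
<checkListAttributes>
</checkListAttributes>
</object>
EOF
printf '%s\n' 0000E2801BFA 0000E2801BFB > "$tmp/file2"

kept=$(awk '
    NR == FNR { if (NF) ids[$1]; next }      # load the wanted ids from file2
    /^<object/ { block = ""; id = "" }       # a new block starts: reset the buffer
    { block = block $0 "\n" }                # buffer every line of file1
    match($0, /id="[^"]*"/) {                # found id="...": keep only the value
        id = substr($0, RSTART + 4, RLENGTH - 5)
    }
    /<\/object>/ {                           # block is complete: print it or drop it
        if (id in ids) printf "%s", block
        block = ""
    }
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$kept"
rm -rf "$tmp"
```

Because the ids are looked up as exact array keys, the cost per block does not grow with the size of file2, and an id that merely contains a listed id as a substring is not kept.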