How to remove the values inside the html tags?

Hi,

I have a text file which contains this:

<a href="linux">Linux</a>
<a href="unix">Unix</a>
<a href="oracle">Oracle</a>
<a href="perl">Perl</a>

I'm trying to extract the text between these anchor tags and ignore everything else using grep. I managed to drop the tags, but I can't get rid of the "href" attribute and its value in my output. This is the command I used:

grep -oP '(?<=<a).*(?=</a)' file.txt

When I run this command, this is the output I get:

href="linux">Linux
href="unix">Unix
href="oracle">Oracle
href="perl">Perl

Any reason to insist on grep instead of sed or awk for parsing your input?
sed 's/\(.*>\)\(.*\)\(<.*\)/\2/g' file
or
awk -F"[<>]" '{print $3}' file
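For anyone wondering why it is $3: with both < and > acting as field separators, each line splits into five fields, and the link text lands in the third. A quick sketch to see the split (the temp file only holds two lines from the question, and the loop is there purely for illustration):

```shell
# Two sample lines copied from the question, written to a throwaway temp file.
tmp=$(mktemp)
printf '%s\n' '<a href="linux">Linux</a>' '<a href="unix">Unix</a>' > "$tmp"

# Show every field of the first line, numbered, so the split is visible.
fields=$(awk -F'[<>]' 'NR == 1 { for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }' "$tmp")
printf '%s\n' "$fields"

# Field 3 is the text between the opening and closing tag.
third=$(awk -F'[<>]' '{ print $3 }' "$tmp")
printf '%s\n' "$third"

rm -f "$tmp"
```

Fields 1 and 5 come out empty because each line starts and ends with a separator.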

No, not really. I just want to learn how to use grep better.

Would you use a butter knife to carve the turkey at dinner time?
grep is not the tool for what you want to learn.
What you want to learn is regular expressions, which, ironically, are not the best tool for parsing HTML either, other than in simple cases.
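To illustrate where a simple pattern stops being enough, take a made-up line with two links on it (this line is not from the original file). A greedy .* runs from the first > all the way to the last </a>, while [^<]+ stops at the next tag. This is only a sketch, and it assumes GNU grep built with -P support:

```shell
# Hypothetical input: two anchors on one line.
line='<a href="x">X</a> <a href="y">Y</a>'

# Greedy: .* swallows everything up to the final </a>.
greedy=$(printf '%s\n' "$line" | grep -oP '(?<=>).*(?=</a>)')
printf '%s\n' "$greedy"

# Bounded: [^<]+ cannot cross into the next tag, so each link text matches separately.
bounded=$(printf '%s\n' "$line" | grep -oP '(?<=>)[^<]+(?=</a>)')
printf '%s\n' "$bounded"
```

Even the bounded version gives up as soon as the link text contains nested tags, which is where a real HTML parser earns its keep.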

Any questions?

Hi KCApple,

The following awk solution may help you; it is quite simple too.

awk -F'[><]' '{print $3}' Input_file

Output will be as follows.

Linux
Unix
Oracle
Perl

EDIT: Just saw that shamrok has already given the solution above, so here is one more solution for the same problem.

awk '{gsub(/.*\">/,X,$0);gsub(/<.*/,Y,$0);print $0}' Input_file

Output will be as follows.

Linux
Unix
Oracle
Perl

Thanks,
R. Singh

Try

[akshay@nio tmp]$ cat file
<a href="linux">Linux</a>
<a href="unix">Unix</a>
<a href="oracle">Oracle</a>
<a href="perl">Perl</a>
[akshay@nio tmp]$ grep -oP '(?<=>).*(?=</a>)' file
Linux
Unix
Oracle
Perl

---------- Post updated at 12:30 PM ---------- Previous update was at 12:23 PM ----------

Some more (g)awk

$ awk 'match($0,/(<a.*>)(.*)(<\/a>)/,m){print m[2]}' file

---------- Post updated at 12:32 PM ---------- Previous update was at 12:30 PM ----------

Perl

$ perl -nle 'print $1 if /<a.*?>(.+)<\/a>/i' file

---------- Post updated at 12:34 PM ---------- Previous update was at 12:32 PM ----------

$ perl -lpe 's/<a.*?>(.+)<\/a>/$1/g;' file

Thanks a lot! This works like a charm.

[akshay@nio tmp]$ grep -oP '(?<=>).*(?=</a>)' file
Linux
Unix
Oracle
Perl

I didn't know that all I had to do was remove the "a" in the first tag, and here I was trying several combinations of regular expressions in that first tag. There's more for me to learn.
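A short sketch of what actually changed, for anyone who lands here later. The lookbehind (?<=<a) only hides the two characters "<a", so the match starts right after them and drags the href attribute along. Anchoring on ">" starts the match after the whole opening tag. Another option, assuming GNU grep with -P, is \K, which throws away everything matched so far and lets the pattern still mention the <a ...> tag explicitly:

```shell
line='<a href="linux">Linux</a>'

# Original attempt: the match begins right after "<a", attribute included.
first=$(printf '%s\n' "$line" | grep -oP '(?<=<a).*(?=</a)')
printf '%s\n' "$first"

# \K variant: match the opening tag, discard it, keep only the text before </a>.
text=$(printf '%s\n' "$line" | grep -oP '<a[^>]*>\K[^<]+(?=</a>)')
printf '%s\n' "$text"
```

The second command prints only the link text, and it does not match text that follows a ">" from some other tag.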