How to remove the values inside the html tags?

KCApple · October 15, 2014, 12:40am

Hi,

I have a text file that contains this:

<a href="linux">Linux</a>
<a href="unix">Unix</a>
<a href="oracle">Oracle</a>
<a href="perl">Perl</a>

I'm trying to extract the text between these anchor tags and ignore everything else, using grep. I managed to drop the tags, but I'm unable to remove the "href" attribute and its value from my output. This is the code I used:

grep -oP '(?<=<a).*(?=</a)' file.txt

When I run this code, this is the output I get:

href="linux">Linux
href="unix">Unix
href="oracle">Oracle
href="perl">Perl

shamrock · October 15, 2014, 12:53am

Any reason to insist on grep instead of sed or awk for parsing your input...
sed 's/\(.*>\)\(.*\)\(<.*\)/\2/g' file
or
awk -F"[<>]" '{print $3}' file
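
A note on the sed one-liner: the three \(...\) groups capture everything up to a ">", the link text, and everything from the next "<", and \2 keeps only the middle group. Since .* is greedy, it relies on each line holding exactly one anchor. Below is a slightly stricter sketch, assuming the link text contains no nested tags; the function name anchor_text and the sample lines are made up for illustration. If a line holds several anchors, only the last one is printed.

```shell
# Hypothetical helper: print the link text of an <a>...</a> on each line.
# Assumes the link text contains no nested tags.
anchor_text() {
    # -n together with the p flag prints only lines where the
    # substitution matched, so lines without an anchor are dropped.
    sed -n 's/.*<a[^>]*>\([^<]*\)<\/a>.*/\1/p' "$@"
}

printf '%s\n' '<a href="linux">Linux</a>' 'no link here' '<a href="unix">Unix</a>' | anchor_text
# Linux
# Unix
```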

KCApple · October 15, 2014, 12:57am

No, not really. I just want to learn how to use grep better.

Aia · October 15, 2014, 1:13am

Would you use a butter knife to carve the turkey at dinner time?
grep is not the tool for what you want to learn.
What you want to learn is regular expressions, which, ironically, are not the best tool for parsing HTML either, other than in simple instances.

Any questions?
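
To make the "simple instances" caveat concrete, here is a small sketch that reuses the awk one-liner from earlier in the thread. The function name third_field and the second sample line are made up for illustration.

```shell
# Same awk one-liner as earlier in the thread, wrapped in a function
# (the name is illustrative only).
third_field() {
    awk -F'[<>]' '{ print $3 }' "$@"
}

# Simple case from the question: field 3 is the link text.
printf '%s\n' '<a href="linux">Linux</a>' | third_field
# Linux

# Made-up case with nested markup: field 3 is now the empty string
# between "<a ...>" and "<b>", so the link text is lost.
printf '%s\n' '<a href="linux"><b>Linux</b></a>' | third_field
# (prints an empty line)
```

As soon as the markup varies like this, a real HTML parser is the safer choice.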

RavinderSingh13 · October 15, 2014, 1:20am

Hi KCApple,

The following awk solution may help; it is also very simple.

awk -F'[><]' '{print $3}' Input_file

Output will be as follows.

Linux
Unix
Oracle
Perl
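
In case it is unclear why the text ends up in $3: with both "<" and ">" as field separators, awk cuts the line at every angle bracket, including the very first one, which leaves an empty $1. A minimal sketch that prints every field of one sample line (the function name show_fields is made up for illustration):

```shell
# Print every field awk sees when "<" and ">" are both field separators.
show_fields() {
    awk -F'[<>]' '{ for (i = 1; i <= NF; i++) printf "$%d = [%s]\n", i, $i }'
}

printf '%s\n' '<a href="linux">Linux</a>' | show_fields
# $1 = []
# $2 = [a href="linux"]
# $3 = [Linux]
# $4 = [/a]
# $5 = []
```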

EDIT: I just saw that shamrock has already given the above solution, so here is one more.

awk '{gsub(/.*">/,""); gsub(/<.*/,""); print}' Input_file

Output will be as follows.

Linux
Unix
Oracle
Perl

Thanks,
R. Singh

Akshay_Hegde · October 15, 2014, 2:04am

Try

[akshay@nio tmp]$ cat file
<a href="linux">Linux</a>
<a href="unix">Unix</a>
<a href="oracle">Oracle</a>
<a href="perl">Perl</a>

[akshay@nio tmp]$ grep -oP '(?<=>).*(?=</a>)' file
Linux
Unix
Oracle
Perl

---------- Post updated at 12:30 PM ---------- Previous update was at 12:23 PM ----------

Some more (g)awk

$ awk 'match($0,/(<a.*>)(.*)(<\/a>)/,m){print m[2]}' file

---------- Post updated at 12:32 PM ---------- Previous update was at 12:30 PM ----------

Perl

$ perl -nle 'm/<a.*?>(.+)<\/a/ig; print $1' file

---------- Post updated at 12:34 PM ---------- Previous update was at 12:32 PM ----------

$ perl -lpe 's/<a.*?>(.+)<\/a>/$1/g;' file

KCApple · October 15, 2014, 2:16am

Thanks a lot! This works like a charm.

[akshay@nio tmp]$ grep -oP '(?<=>).*(?=</a>)' file
Linux
Unix
Oracle
Perl

I didn't know that all I had to do was remove the "a" from the first tag; I had been trying several combinations of regular expressions there. There's more for me to learn.
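
A short follow-up on why that change helps: (?<=<a) only requires the match to start right after the two characters "<a", so .* still picks up the href attribute. (?<=>) moves the starting point to just after a ">". The .* is still greedy, though, so the pattern depends on there being one anchor per line. The sketch below assumes a grep with -P support (GNU grep built with PCRE) and uses a made-up line with two anchors to compare it with a variant built on [^<]+.

```shell
# Made-up sample line containing two anchors.
line='<a href="linux">Linux</a> <a href="unix">Unix</a>'

# Greedy .* runs from the first ">" to the last "</a>" on the line.
printf '%s\n' "$line" | grep -oP '(?<=>).*(?=</a>)'
# Linux</a> <a href="unix">Unix

# [^<]+ cannot cross a "<", so each link text is matched separately.
printf '%s\n' "$line" | grep -oP '(?<=>)[^<]+(?=</a>)'
# Linux
# Unix
```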