How to get a tag's content with grep

visitor123 · February 17, 2012, 3:55pm

1) Is it possible to get a tag's content with grep -E? For example, the title: from the source text "<title>My page</title>", print "My page".

2) Which utility can I use from bash when I want a regex in this format?
(?<=title>).*(?=</title)

bartus11 · February 17, 2012, 4:03pm

Perl.

perl -nle 'print $& if /(?<=title>).*(?=<\/title)/' file

Corona688 · February 17, 2012, 4:11pm

grep will not work across lines, so HTML tags that cross multiple lines of data won't match. Neither will other line-based tools like sed.
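
A rough sketch of that limitation, assuming a grep that supports -o (GNU and BSD grep do; it is not required by POSIX). The input strings are made up. For a title on one line, grep -o plus a small sed to strip the tags is enough; the same title split over two lines is silently missed.

```shell
# -o prints only the part of the line that matched; sed then removes the tags.
one_line=$(printf '<title>My page</title>\n' |
  grep -o '<title>.*</title>' | sed 's/<[^>]*>//g')

# Same title split across two lines: no single line contains both tags,
# so grep finds nothing (|| true only keeps "set -e" scripts from aborting).
two_lines=$(printf '<title>My\npage</title>\n' |
  grep -o '<title>.*</title>' | sed 's/<[^>]*>//g' || true)

printf 'one line:  [%s]\n' "$one_line"
printf 'two lines: [%s]\n' "$two_lines"
```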

For a problem like this I'd use awk. It has powerful regexes like sed's and grep's, but is an actual programming language where you get to pick exactly what gets printed when, remember things with variables, etc.

$ echo -e "<title>stuff\na\nb\nc</title>" |
awk -v RS="<" '
        /^title>/ { sub(/^title>/, "", $0); P=1 }
        /^\/title>/ { P=0 }
        P'
stuff
a
b
c

$

visitor123 · February 17, 2012, 4:13pm

Nice. Do you think I could use it with gnuwin32? I just downloaded GnuWin perl and there are pcregrep.exe and pcretest.exe. I would like to run it on Win XP.

Corona688 · February 17, 2012, 4:17pm

You should run these things in a bash/ksh/zsh shell or what have you. Windows CMD has awful quoting problems -- quoting is more or less left as a problem for the utility itself, not something CMD does -- which means every utility seems to handle quoting slightly differently. Sometimes there's just no way to control when an argument gets split or passed raw.

That makes it extremely difficult to pass a regular expression into any program inside single quotes.

If you can install awk and bash in gnuwin32, I don't see why it wouldn't work.

bartus11 · February 17, 2012, 4:18pm

For situations like this:

perl -ln0e '$,="\n";print /(?<=<title>).*?(?=<\/title)/sg' file

Corona688 · February 17, 2012, 4:25pm

Appears to work.

What do the commandline options actually mean? 'man perl' helpfully tells me they're not documented in 'man perl' but doesn't say where they are documented...

bartus11 · February 17, 2012, 4:30pm

man perlrun

BTW my man perl said where to find those options (in the Reference Manual section):

           perlrun             Perl execution and options

visitor123 · February 17, 2012, 4:34pm

Now my problem is that I don't know how to feed the data from a file to the perl command. In the cmd interpreter I tried this:

for /f "delims=" %a in ('dir /b *.a') do (
pcretest.exe -ln0e '$,="\n";print /(?<=<title>).*?(?=<\/title)/sg' < "%a"
)

It tells me that < was not expected at this point. Is it OK to ask here, or should I rather ask in the DOS forum?

Edit: the test file a.a contains some HTML text, e.g.:
something.txt
<title>Hello title</title>
balbalaba

bartus11 · February 17, 2012, 4:41pm

I would strongly suggest that you install a Linux distribution (for example in VirtualBox) and do your pattern matching there.

visitor123 · February 17, 2012, 4:50pm

And what about grep? Can it do this for a single line? grep is my favourite tool, but I don't know whether it can filter out the rest of the line so that just "Hello title" remains.

Corona688 · February 17, 2012, 4:50pm

You're hitting the exact problem I just explained to you: quoting in CMD is a horrid botch. If you can install an actual shell to use in your system you'll have a better chance in it.