Match text and print/pipe only that text

amx401 · March 30, 2015, 8:29pm

I'm trying to pull an image source URL from an HTML source file. I'm new to regex and working in Bash. I've tried

grep -E 'http.*jpg' file

which highlights the text, but gives me 2 problems:

1) The results aren't standalone and can't be piped to another command. (I believe the output includes everything between line breaks.)

2) The match includes spaces. This is a problem when an anchor precedes the image, because the match then runs from href='http.....'><img src.....jpg

I've tried putting lookbehind in there

grep -E '(?<=src\=\[\'|\"])http.*jpg' file
grep -E '(?<!href..)http.*jpg' file
grep -E '(?<=src..)http.*jpg' file

I either get errors or nothing returned. I don't know if it's something simple or not, but any help would be appreciated. I'm not opposed to sed or awk, but my knowledge of them is basically a clean slate.

Also, this is *NOT* a homework assignment. It's me trying to learn by doing and hitting a wall repeatedly. Thanks for the help!!!

Don_Cragun · March 30, 2015, 8:46pm

Please show us a sample input file (including samples of the lines that are giving you problems with spaces) and the output you are trying to produce.

What operating system are you using? Some implementations of grep have a non-standard -o option that prints only the text matched by the search pattern, not the entire line containing the matched text.
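To illustrate the difference, here is a minimal sketch that assumes a grep with the non-standard -o option (GNU and BSD grep both have it). The sample line and the example.com URLs are made up for illustration and are not from the original poster's file.

```shell
# Hypothetical sample line: an anchor followed by an image.
line='<a href="http://example.com/page"><img src="http://example.com/a.jpg"></a>'

# Without -o, grep prints the whole line that contains a match.
whole=$(printf '%s\n' "$line" | grep 'http[^"]*jpg')
printf '%s\n' "$whole"

# With -o, grep prints only the matched text, one match per line.
only=$(printf '%s\n' "$line" | grep -o 'http[^"]*jpg')
printf '%s\n' "$only"
```

Because [^"]* cannot cross a double quote, the match cannot start in the href and run on into the src.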

amx401 · March 30, 2015, 9:32pm

I am using Ubuntu 14.04. Thanks! The -o option was a big part of what I was looking for. Thank you!

Currently, I'm getting

http://site.com/image/word/tag/funny"><img src="http://images.site.com/pic/19434-3201cd0ed412e26b2c06cf00a0803c64.jpg
http://images.site.com/uploaded_pics/thumbs/19434.jpg

Desired result is

http://images.site.com/pic/19434-3201cd0ed412e26b2c06cf00a0803c64.jpg

Whenever I use lookbehinds I get no results at all. ex:

grep -Eo '(?<=")http.*jpg' file
grep -Eo '(?<=\")http.*jpg' file
grep -Eo "(?<!\')http.*jpg' file

I was able to get the desired result with

grep -Eo 'http\S*\/pic\S*jpg' file

I am still not sure why the lookbehinds didn't work. Please let me know if you have any insight into what I'm doing wrong and thank you so much for the help!

Don_Cragun · March 30, 2015, 10:24pm

The system I use doesn't have lookbehinds, so I can't experiment and definitively say why your lookbehind attempts were failing. The \S is not standard in REs either. A standard ERE that matches a string starting with http:, containing /pic/, ending with jpg, and containing no spaces is:

http:[^ ]*/pic/[^ ]*jpg

This same string used as a BRE produces the same results, so I would just use:

grep -o 'http:[^ ]*/pic/[^ ]*jpg' file
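As a rough check, the sketch below recreates the two sample lines from this thread in a temporary file and runs the command above against it. It assumes a grep that supports -o. The sed step is only an illustration of piping the result onward and is not part of the original question.

```shell
# Recreate the two sample lines from the thread in a temporary file.
f=$(mktemp)
cat > "$f" <<'EOF'
http://site.com/image/word/tag/funny"><img src="http://images.site.com/pic/19434-3201cd0ed412e26b2c06cf00a0803c64.jpg
http://images.site.com/uploaded_pics/thumbs/19434.jpg
EOF

# -o prints each match on its own line, so the result can be piped onward.
urls=$(grep -o 'http:[^ ]*/pic/[^ ]*jpg' "$f")
printf '%s\n' "$urls"

# Illustrative pipe: strip everything up to the last slash to get the file name.
names=$(printf '%s\n' "$urls" | sed 's|.*/||')
printf '%s\n' "$names"

rm -f "$f"
```

The match cannot start at the first http: on line 1, because [^ ]* stops at the space in "<img src" before any /pic/ is reached.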

durden_tyler · March 30, 2015, 10:37pm

amx401:

...Whenever I use lookbehinds I get no results at all. ex:
grep -Eo '(?<=")http.*jpg' file
grep -Eo '(?<=\")http.*jpg' file
grep -Eo "(?<!\')http.*jpg' file
...I am still not sure why the lookbehinds didn't work. Please let me know if you have any insight into what I'm doing wrong ...

I have GNU grep, and the lookbehinds don't work with "-E" there either.
Most likely grep's "-E" option does not support lookarounds. Neither the man page nor the GNU Grep 2.21 manual
mentions anything about lookarounds.
The "-E" option for "Extended REs" gives special meaning to unescaped metacharacters like "+", "|", "{" etc., unlike BREs.

$
$ cat f31
http://site.com/image/word/tag/funny"><img src="http://images.site.com/pic/19434-3201cd0ed412e26b2c06cf00a0803c64.jpg
http://images.site.com/uploaded_pics/thumbs/19434.jpg
$
$ grep -E '(?<!")http://images' f31
$
$

If you have GNU grep, you could try the experimental "-P" option which provides support for Perl-compatible regular expressions:

$
$ grep -Po '(?<!")(http://images.*jpg)' f31
http://images.site.com/uploaded_pics/thumbs/19434.jpg
$
$
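The negative lookbehind above picks the URL that is not preceded by a double quote. For the result asked for earlier in the thread (the image in the src attribute), a positive lookbehind might look like the sketch below. It assumes a GNU grep built with PCRE support for "-P", and it assumes the attribute is always written as src=" with a double quote.

```shell
# Recreate the two sample lines in a temporary file.
f=$(mktemp)
printf '%s\n' \
  'http://site.com/image/word/tag/funny"><img src="http://images.site.com/pic/19434-3201cd0ed412e26b2c06cf00a0803c64.jpg' \
  'http://images.site.com/uploaded_pics/thumbs/19434.jpg' > "$f"

# (?<=src=") requires src=" immediately before the match without including it.
# [^"]* keeps the match inside the attribute value.
img=$(grep -Po '(?<=src=")http[^"]*\.jpg' "$f")
printf '%s\n' "$img"

rm -f "$f"
```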

Or you could use Perl:

$
$ perl -ne 'printf("Line = %d, Matched Text = %s\n", $., $1) if /(?<!")(http:\/\/images.*jpg)/' f31
Line = 2, Matched Text = http://images.site.com/uploaded_pics/thumbs/19434.jpg
$
$

amx401 · April 1, 2015, 4:34am

I was operating under the incorrect assumption that lookarounds were universal. Thank you both for your help and responses!