Grep expression between double quotes

glev2005 · December 17, 2009, 4:00pm

I need a quick expression to pull out all the data in a text file that looks like "http:// some random url etc". It should grab any string that begins with "http:// and ends with ". There are other double quotes in the file, but I only want the strings that start with "http:// and run to the closing quote, for each instance of that.

---------- Post updated at 04:00 PM ---------- Previous update was at 03:46 PM ----------

I should mention the double quotes are actually in the file, I wasn't adding them myself.

Scrutinizer · December 17, 2009, 4:10pm

How about:

grep -o '"http://[^"]*"' infile

glev2005 · December 18, 2009, 8:32am

Yes, perfect. Thank you!

---------- Post updated 12-18-09 at 08:32 AM ---------- Previous update was 12-17-09 at 07:30 PM ----------

Could someone please explain why it worked? I thought [^"] would mean to grab a string that begins with a double quote. I also see the double quotes enclosed in single quotes. I would love to hear a rundown of how it all works.

CRGreathouse · December 18, 2009, 9:24am

[^"] means "any character other than a quote", just like [^q9] means "any character other than q or 9". [^"]* means "zero or more characters that aren't quotes", and so [^"]*" means "any number of non-quotes followed by a quote".
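A minimal sketch of the explanation above, assuming a grep that supports -o (GNU grep does). The sample line and the variable names are made up for illustration:

```shell
# Sample line (made up): two quoted strings, only one of them is a URL.
line='say "hello" then "http://example.com/x" end'

# "http:// must follow the opening quote, and [^"]* stops at the next quote,
# so only the quoted URL is printed.
urls=$(printf '%s\n' "$line" | grep -o '"http://[^"]*"')
printf '%s\n' "$urls"
```

The single quotes around the pattern are for the shell: they pass the double quotes and brackets to grep unchanged.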

jim_mcnamara · December 18, 2009, 9:27am

You may want to google 'regular expression'. If you want to become proficient in Unix, regex, as it is called, is a very important tool. It has spilled over into Windows programming in the past few years as well.

aaiaz · December 20, 2009, 11:57am

You may use egrep or grep as well; the result will be the same.

egrep ""http//.*"" 
or
egrep "\"http//.*\"" 
or
egrep '"http//.*"'


Scrutinizer · December 20, 2009, 2:41pm

This is not correct. You may want to search for "greedy matching", and look up the -o option. Also your quoting alternatives will prohibit the regex from being properly evaluated.

bbala · December 22, 2009, 9:46am

This one works perfectly, I think:
grep -o '"http://.*"' input_file
Can anyone give a failing case?

Scrutinizer · December 22, 2009, 10:12am

No, that will only work if there is one address per line and there are no further " characters on that line. You may want to search for "greedy matching".

thegeek · December 22, 2009, 11:12am

Please give more explanation and some examples, for a better answer.

Scrutinizer · December 22, 2009, 11:20am

Sure:

$ echo '"http://a.b.c" blablabla "http://c.d.e"' |grep -o '"http://.*"'
"http://a.b.c" blablabla "http://c.d.e"

$ echo '"http://a.b.c" blablabla "http://c.d.e"' |grep -o '"http://[^"]*"'
"http://a.b.c"
"http://c.d.e"

$ echo '"http://a.b.c" blablabla ""' |grep -o '"http://.*"'
"http://a.b.c" blablabla ""

$ echo '"http://a.b.c" blablabla ""' |grep -o '"http://[^"]*"'
"http://a.b.c"

If you do not limit greedy matching, grep will try to find the longest match possible, hence the use of [^"] instead of . (dot).

aaiaz · December 23, 2009, 11:13am

Scrutinizer, you are wrong. See below.
I am using SunOS server2 5.10 Generic_118833-36 sun4u sparc SUNW,Netra-210

bash-3.00$ echo '"http://abc" blablabla "http://cvd"' | grep '"http://.*"'
Output: "http://abc" blablabla "http://cvd"

And when using the second way:

bash-3.00$ echo '"http://abc" blablabla "http://cvd"' | grep '"http://[^"]*"'

Output: "http://abc" blablabla "http://cvd"

Both ways are acting greedy.

Scrutinizer · December 23, 2009, 3:21pm

Aaiaz, that is a peculiar conclusion given that I posted my output above that proved my point.

Besides: you left out the -o option! Without it you always print the entire line.
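A small sketch of the difference, assuming a grep that supports -o (GNU grep does; the variable names are only for illustration):

```shell
line='"http://a.b.c" blablabla "http://c.d.e"'

# Without -o, grep prints the whole matching line, whatever part matched.
whole=$(printf '%s\n' "$line" | grep '"http://[^"]*"')

# With -o, only the matched parts are printed, one per line.
parts=$(printf '%s\n' "$line" | grep -o '"http://[^"]*"')

printf 'without -o: %s\n' "$whole"
printf 'with -o:\n%s\n' "$parts"
```

Without -o the two patterns cannot be told apart from the output, because both select the same line.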

aaiaz · December 24, 2009, 2:49am

The -o option is not supported by the grep command on my OS, Sun Solaris.

Scrutinizer · December 24, 2009, 3:22am

Hi aaiaz, if your grep does not support the -o option, then you have to either download a grep that does or use a sed/awk/shell script. It cannot be accomplished with a regular grep statement.

aaiaz · December 24, 2009, 4:31am

ok thanks man

---------- Post updated at 04:31 AM ---------- Previous update was at 04:30 AM ----------

But how can I achieve this using sed or awk? It is still not straightforward.

Scrutinizer · December 24, 2009, 5:21am

Hi aaiaz, it was a bit complicated, but I've come up with this as a grep -o replacement:

sed 's|"http://[^"]*"|&\n|g' infile | sed -n 's|.*\("http://[^"]*"\).*|\1|p'

-or-

pat='"http://[^"]*"'
sed "s|$pat|&\n|g" infile | sed -n "s|.*\($pat\).*|\1|p"
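A note on portability: as far as I know, \n in the replacement part of s is understood as a newline by GNU sed, but other seds may insert a literal n instead. The portable spelling is a backslash followed by a real newline. Below is a sketch of a variant built on that, which puts every match on a line of its own and then keeps only those lines. The function name extract_urls is made up for illustration:

```shell
# Print every "http://..." string found in the given files (or stdin), one per line.
extract_urls() {
  # Step 1: surround each match with newlines (backslash + real newline is the
  # portable way to put a newline into the replacement).
  # Step 2: keep only the lines that consist of exactly one match.
  sed 's|"http://[^"]*"|\
&\
|g' "$@" | grep '^"http://[^"]*"$'
}

out=$(printf '%s\n' '"http://a.b.c" blablabla "http://b.b.f"' | extract_urls)
printf '%s\n' "$out"
```

Because the g flag replaces every match on the line, no leftover piece can itself be a complete "http://..." string, so the final grep only lets the matches through.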

radoulov · December 24, 2009, 5:49am

Or use Perl:

perl -nle 'print $1 while m|("http://[^"]*")|g' infile

aaiaz · December 24, 2009, 5:56am

Scrutinizer, it is not working:

bash-3.00$ echo '"http://a.b.c" blablabla "http://b.b.f"' | sed 's|\"http://[^\"]*\"|&\n|' | sed -n 's|.*\("http://[^"]*"\).*|\1|p'

Output:
"http://b.b.f"

And radoulov, unfortunately I do not have Perl.

radoulov · December 24, 2009, 6:07am

Ha! A Solaris machine without a Perl interpreter?

Try with nawk:

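nawk -F\" '{
  # With " as the field separator, quoted strings become fields of their own.
  for (i = 1; i <= NF; i++)
    # Print each field that starts with http://, putting the quotes back.
    if ($i ~ /^http:\/\//) print FS $i FS
  }' infile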

All my examples assume the URLs do not span more than one line; otherwise you'll need something different.

---------- Post updated at 12:07 PM ---------- Previous update was at 11:58 AM ----------

Or:

nawk  '/^http:/&&$0=RS$0RS' RS=\" infile
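A spelled-out sketch of the same idea, assuming any POSIX awk (awk, nawk, gawk and mawk should all do). The function name extract_urls_awk is made up for illustration. Setting RS to a double quote makes awk split the input at every quote, so the text inside each pair of quotes arrives as a record of its own:

```shell
# Print every quoted string that starts with http://, one per line.
extract_urls_awk() {
  awk 'BEGIN { RS = "\"" }                  # a double quote ends each record
       /^http:\/\// { print "\"" $0 "\"" }  # put the quotes back around URL records
  ' "$@"
}

out_awk=$(printf '%s\n' '"http://a.b.c" blablabla "http://b.b.f"' | extract_urls_awk)
printf '%s\n' "$out_awk"
```

This assumes the quotes in the file are balanced. A stray quote directly in front of http:// would make the text after it print as if it were a quoted string.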