Parsing file data

silverdust · February 16, 2013, 6:37pm

Hey Guys,

I'm a novice at shell scripting and I need some help parsing file data.

Basically, I want to write a script that retrieves URLs.

Here is what I have so far.

#!/bin/bash

printf "Please enter start date (format: yyyy-mm-dd): "
read -r STARTDATE
printf "Please enter end date (format: yyyy-mm-dd): "
read -r ENDDATE
wget -O filename download_location

So this downloads a page from download_location and saves it as filename. I need to parse the downloaded page and retrieve the URLs. Below is a small snippet of the data I'm parsing:

<a title='http://149.47.192.185/b6caba9f46bef1d14f/w.php'<nobr><center>2013-02-16 21:56:52</center></nobr></td><td align='center'><b>0 / 2</b></td><td><a title='http://199.204.210.238/eda89353bf8789202d999ee8e832c/w.php'

Each URL comes right after "a title=" and is enclosed in single quotes ('url_here'). I want to grab the data enclosed in the quotes and discard the rest.

Thanks for your help, I'm really bad at this stuff.

Yoda · February 16, 2013, 6:50pm

awk -F'=' ' {
                for(i=1;i<=NF;i++) {
                        if($i ~ /a title/) {
                                url=$(i+1);
                                gsub(/'\''| .*/,x,url);
                                print url;
                        }
                }
}' filename

silverdust · February 16, 2013, 6:55pm

hi bipinajith,

Thanks for your quick response. I don't suppose you could explain this a bit to me so I could understand what's going on here? Sorry, I'm new to scripting.

Also:
What is NF?
What does the ~ represent?

Thank you.

Yoda · February 16, 2013, 7:02pm

Here is an explanation of the code:

awk -F'=' '                                     # Set = sign as field separator.
{
        for(i=1;i<=NF;i++)                      # for i <= NF ( NF is a special variable in awk and it means number of fields in the current record )
        {
                if($i ~ /a title/)              # if $i ~ /a title/ ( ~ operator matches a pattern or regex )
                {
                        url=$(i+1);             # Set variable: url = $(i+1) which is next field value
                        gsub(/'\''| .*/,x,url); # Remove single quotes and everything from the first blank onward (x is never set, so it is an empty string)
                        print url;              # Print value of variable: url
                }
        }
}' filename

Check the awk manual page for further reference:

man awk
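
To see NF and ~ on their own, here is a tiny sketch. The sample line and the helper name show_fields are made up for illustration; they are not part of the script above.

```shell
#!/bin/bash
# Hypothetical helper: print the field count, then every field that matches /title/.
show_fields() {
    awk -F'=' '{
        print "NF=" NF                          # number of fields on this line
        for (i = 1; i <= NF; i++)
            if ($i ~ /title/)                   # ~ tests field i against the regex
                print "field " i " matches: " $i
    }'
}

# Made-up input: splitting on = gives three fields, and only the first contains "title".
printf '%s\n' "a title='x' b='y'" | show_fields
```

With = as the separator, that line splits into three fields, so NF is 3 and only field 1 matches.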

silverdust · February 17, 2013, 12:11am

Thanks a lot!

---------- Post updated 02-17-13 at 01:11 AM ---------- Previous update was 02-16-13 at 08:04 PM ----------

Sorry, one more question for you.

Suppose that I have URLs whose parameters include a '='. Since the previous code splits fields on the equals sign (=), those URLs get cut off.

How could I fix this?

Yoda · February 17, 2013, 12:28am

Set the single quote ' as the field separator instead of the equals sign = and see if it works.

Replace: awk -F'=' with awk -F\'

Also replace the existing gsub call with sub(/ .*/,x,url);
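
Put together, the two changes might look like the sketch below. The wrapper name extract_urls and the sample line are assumptions for illustration only; the awk body is the earlier loop with the new separator, and "" is used in place of the unset variable x.

```shell
#!/bin/bash
# Hypothetical wrapper around the modified command; reads the file named in $1.
extract_urls() {
    awk -F\' '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /a title/) {
                url = $(i + 1)          # text between the next pair of single quotes
                sub(/ .*/, "", url)     # drop anything from the first blank onward
                print url
            }
        }
    }' "$1"
}

# Made-up sample line whose URLs carry "=" in their parameters.
sample=$(mktemp)
printf '%s\n' "<a title='http://example.com/w.php?id=1&x=2'<nobr>text</nobr><a title='http://example.com/v.php?k=v'" > "$sample"
extract_urls "$sample"
rm -f "$sample"
```

Because the fields are now split on the quote character, an = inside a URL no longer breaks it apart.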

silverdust · February 17, 2013, 12:31am

Works great, you taught me a lot. Thanks again.

RudiC · February 17, 2013, 2:35am

Try:

$ grep -o "http[^']*" file
http://149.47.192.185/b6caba9f46bef1d14f/w.php
http://199.204.210.238/eda89353bf8789202d999ee8e832c/w.php

It extracts everything from "http" up to, but excluding, the next single quote, so the trailing tags are removed as well. The -o option is not specified by POSIX, so not every grep accepts it.
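
If your grep has no -o, one possible fallback is to do the same extraction in awk with match(), which is part of standard awk. This is only a sketch; the function name urls_portable is made up here.

```shell
#!/bin/bash
# Hypothetical fallback for systems whose grep lacks -o; reads the file named in $1.
urls_portable() {
    awk -v q="'" '
    BEGIN { re = "http[^" q "]*" }      # same pattern as the grep call, built as a string
    {
        line = $0
        while (match(line, re)) {       # find the next match on the rest of the line
            print substr(line, RSTART, RLENGTH)
            line = substr(line, RSTART + RLENGTH)
        }
    }' "$1"
}

# Usage (file is the downloaded page):
#   urls_portable file
```

The quote character is passed in with -v so the awk program itself can stay inside shell single quotes.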

Or, if you need to refer to the "a title" precursor, try this, which builds on and simplifies bipinajith's proposal:

awk '{ for(i=1;i<NF;i++) if ($i ~ /a title/) print $(i+1) }' FS="'" file