I'm a novice at shell scripts and I need some help parsing file data.
Basically, I want to write a script that retrieves URLs.
Here is what I have so far.
#!/bin/bash
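# printf is used for the prompts: bash's builtin echo prints "\c" literally unless -e is given
printf 'Please enter start date (format: yyyy-mm-dd): '
read -r STARTDATE
printf 'Please enter end date (format: yyyy-mm-dd): '
read -r ENDDATE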
wget -O filename download_location
So this downloads a page from download_location and saves it as filename. I need to parse the downloaded page and retrieve the URLs. Below is a small snippet of the data I'm parsing:
Thanks for your quick response. I don't suppose you could explain this a bit to me so I could understand what's going on here? Sorry, I'm new to scripting.
awk -F'=' '                            # Use the = sign as the field separator.
{
    for (i = 1; i <= NF; i++)          # Loop over every field; NF is the awk built-in holding the number of fields in the current record.
    {
        if ($i ~ /a title/)            # The ~ operator tests whether the field matches the regex /a title/.
        {
            url = $(i+1)               # The URL starts in the next field, i.e. the text right after "a title=".
            gsub(/'\''| .*/, "", url)  # Delete single quotes and everything from the first blank onward.
            print url                  # Print the cleaned-up URL.
        }
    }
}' filename
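To see what the script does on a concrete line, here is a small test run. The sample line and the function name extract_urls are made up for illustration (the actual page data isn't shown in this thread), so treat it as a sketch of the idea rather than something tuned to your file:

```shell
#!/bin/bash
# Hypothetical sample input; the real page markup may differ.
sample="<li><a title='http://example.com/report-2013-01-01.pdf' class='doc'>Report</a></li>"

# Same logic as the awk script above, reading from stdin instead of a file.
extract_urls() {
    awk -F'=' '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /a title/) {
                url = $(i+1)               # text right after "a title="
                gsub(/'\''| .*/, "", url)  # drop quotes and anything after the first blank
                print url
            }
    }'
}

result=$(printf '%s\n' "$sample" | extract_urls)
echo "$result"   # prints http://example.com/report-2013-01-01.pdf
```

One thing to watch: because = is the field separator, a URL that itself contains an = sign (for example a query string) would be cut off at that point.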
It will remove the trailing tags as well, as it extracts from "http" up to but excluding the next single quote. Not all systems may accept that construct.
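The exact command this reply refers to isn't reproduced here. As a rough sketch of the same idea (grab everything from "http" up to, but not including, the next single quote), something like the following might do. It assumes a grep that supports the -o option, which GNU and BSD grep have but POSIX does not require; the sample line and the function name extract_http are invented for the example:

```shell
#!/bin/bash
# Hypothetical sample input; the real page markup may differ.
sample="<li><a title='http://example.com/report-2013-01-01.pdf' class='doc'>Report</a></li>"

# -o prints only the matched part of each line:
# "http" followed by any run of characters that are not a single quote.
extract_http() {
    grep -o "http[^']*"
}

result=$(printf '%s\n' "$sample" | extract_http)
echo "$result"   # prints http://example.com/report-2013-01-01.pdf
```

Unlike the awk version, this does not check for the "a title" prefix, so it would pick up any http string on the page.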
Or, if you need to refer to the "a title" precursor, try this, built on and simplifying bipinajith's proposal