Hi,
I have a url.txt file; I need to check each URL in it and grep some data from its page source.
url.txt
domain.com
domain2.com
domain3.com
.....
All of the sites' page sources contain these patterns:
"web=pattern1"
"net++pattern2"
"office**pattern3"
I need this output:
domain.com: pattern1,pattern2,pattern3
domain2.com: pattern1,pattern2,pattern3
If a pattern is missing, print "zero" in its place:
domain.com: pattern1,zero,pattern3
domain2.com: pattern1,pattern2,zero
sea
2
What have you tried so far?
I cannot get it to work for multiple URLs and multiple patterns. Thanks.
my code:
wget -q www.domain.com -O - | grep -o -E -m 1 '"web=([^"#]+)"' | cut -d'=' -f2
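The single-URL attempt above can be extended with a loop over url.txt and a small helper that falls back to "zero" when a marker is absent. This is a minimal sketch assuming bash; `extract` and `report` are hypothetical helper names, and it assumes each marker appears at most once per page:

```shell
#!/bin/bash
# extract RAW_MARKER ESCAPED_MARKER PAGE
# Prints the value captured between the marker and the closing quote,
# or "zero" if the marker is not found.
extract() {
    local m
    m=$(printf '%s\n' "$3" | grep -o -m 1 -E "\"$2[^\"]*\"" | head -n 1)
    if [ -n "$m" ]; then
        m=${m%\"}                          # drop the trailing quote
        printf '%s' "${m:$(( ${#1} + 1 ))}" # drop the leading quote + marker
    else
        printf 'zero'
    fi
}

# report DOMAIN: fetch the page once and print the requested line.
report() {
    local page
    page=$(wget -q "$1" -O -)
    printf '%s: %s,%s,%s\n' "$1" \
        "$(extract 'web='      'web='         "$page")" \
        "$(extract 'net++'     'net\+\+'      "$page")" \
        "$(extract 'office**'  'office\*\*'   "$page")"
}

# Process every domain in url.txt (network access required):
# while read -r d; do report "$d"; done < url.txt
```

The marker is passed twice because `+` and `*` must be escaped for `grep -E`, while the raw form gives the prefix length to strip.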
RudiC
4
Well, give this a try:
wget -i url.txt -O - |
awk '/<(link rel="canonical"|base) href/ {
        if (L++) {for (i=1; i<=3; i++)
                     {printf "%s%s", DL, P[i]?P[i]:"zero"; DL=","}
                  printf "\n"
        }
        delete P; DL=""
        gsub (/href="http:\/\/|\/"\/*>/, ""); printf "%s: ", $NF
     }
     match ($0, /"web=[^"]*"/)       {P[1]=substr($0,RSTART+5,RLENGTH-6)}
     match ($0, /"net\+\+[^"]*"/)    {P[2]=substr($0,RSTART+6,RLENGTH-7)}
     match ($0, /"office\*\*[^"]*"/) {P[3]=substr($0,RSTART+9,RLENGTH-10)}
     END {for (i=1; i<=3; i++)
             {printf "%s%s", DL, P[i]?P[i]:"zero"; DL=","}
          printf "\n"}
    '
and report back. Note that `+` and `*` are regex metacharacters, so they must be escaped in the patterns, and the `substr` offsets need `RSTART` because the marker will rarely start at column 1. Finding the domain in the downloaded HTML may be trickier than assumed above.
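The `match()`/`RSTART`/`RLENGTH` extraction at the heart of that script can be checked offline on a fabricated sample line (a sketch of the mechanism, not the full script):

```shell
# Offline check: one sample line with the "web=" and "office**" markers
# present and "net++" absent, expecting "alpha,zero,gamma".
printf '%s\n' 'junk "web=alpha" junk "office**gamma" junk' |
awk '{
    split("", P)                                        # clear the result array
    if (match($0, /"web=[^"]*"/))       P[1] = substr($0, RSTART+5, RLENGTH-6)
    if (match($0, /"net\+\+[^"]*"/))    P[2] = substr($0, RSTART+6, RLENGTH-7)
    if (match($0, /"office\*\*[^"]*"/)) P[3] = substr($0, RSTART+9, RLENGTH-10)
    for (i = 1; i <= 3; i++)
        printf "%s%s", (i > 1 ? "," : ""), (P[i] ? P[i] : "zero")
    print ""
}'
```

Each `substr` skips the opening quote plus the marker (hence the `RSTART` offset) and trims the closing quote (hence the `RLENGTH` reduction).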