I have a list of files in a directory 'dir'. Each file is an HTML file. I need to read each file, extract the string that starts with 'http', and write those strings to a new text file. How can I do this with shell scripting?
file1.html
<head>
<url>http://www.google.com</url>
</head>
file2.html
<head>
<url>http://www.yahoo.com</url>
</head>
text.txt
http://www.google.com
http://www.yahoo.com
ctsgnb
February 15, 2012, 4:13am
2
Assuming your file*.html files contain just what you mentioned in your example:
grep -ho "http:[^<]*" file*.html >>text.txt
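If some of the files might use https, or carry the URL inside an attribute instead of a <url> tag, a slightly broader pattern may help. This is only a sketch: the url_demo scratch directory and the extract_urls helper are names made up for illustration, and the sample files are recreated from the question so the snippet runs on its own. Note that -o and -h are supported by GNU and BSD grep but are not required by POSIX.

```shell
#!/bin/sh
# Recreate the two sample files from the question in a scratch directory
# (url_demo is a hypothetical name, used only for this example).
mkdir -p url_demo
printf '<head>\n<url>http://www.google.com</url>\n</head>\n' > url_demo/file1.html
printf '<head>\n<url>http://www.yahoo.com</url>\n</head>\n' > url_demo/file2.html

# extract_urls (hypothetical helper): print every http/https URL found in
# the given files, one per line.
#   -h  suppress the file name prefix
#   -o  print only the part of the line that matched
#   -E  extended regex, so "s?" makes the "s" optional
# The match stops at the first '<', double quote, or whitespace character.
extract_urls() {
    grep -hoE 'https?://[^<"[:space:]]+' "$@"
}

# Use ">" rather than ">>" so that rerunning does not duplicate entries.
extract_urls url_demo/file*.html > url_demo/text.txt
cat url_demo/text.txt
```

For the two sample files this should give the same two lines as the one-liner above; the difference only shows up on input that contains https links or quoted URLs.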
ygemici
February 15, 2012, 4:37am
3
You can try this:
# sed -n '/^<url>/s/<[^>]*>//gp' file*.html >>text.txt
# cat text.txt
http://www.google.com
http://www.yahoo.com
regards
ygemici
---------- Post updated at 11:37 AM ---------- Previous update was at 11:34 AM ----------
Maybe you can add "-h" to grep to suppress the filenames.
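One thing to watch with the sed command above: the /^<url>/ address only matches when <url> is the very first thing on the line, so indented tags would be skipped. A possible variant, assuming each URL sits on one line between <url> and </url>, captures just the URL instead. The sed_demo directory and the indented sample files are made up for this example.

```shell
#!/bin/sh
# Sample files with indented <url> lines (spaces in one, a tab in the other).
# sed_demo is a hypothetical scratch directory for this example.
mkdir -p sed_demo
printf '<head>\n  <url>http://www.google.com</url>\n</head>\n' > sed_demo/file1.html
printf '<head>\n\t<url>http://www.yahoo.com</url>\n</head>\n' > sed_demo/file2.html

# -n suppresses default output; the trailing "p" prints only lines where
# the substitution succeeded.
# \( ... \) captures the text starting with "http" up to the next '<',
# and \1 replaces the whole line with that captured text.
# '|' is used as the delimiter so the slashes in </url> need no escaping.
sed -n 's|.*<url>\(http[^<]*\)</url>.*|\1|p' sed_demo/file*.html > sed_demo/text.txt
cat sed_demo/text.txt
```

This uses only basic regular expressions, so it should behave the same in GNU and BSD sed.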
How do I remove the trailing &CS=3 from lines that end with it?
Ex:
http://www.google.com/test/&CS=3
http://www.google.com/sample/&CS=3
http://www.google.com/hello/&CS=3
text.txt
http://www.google.com/test/
http://www.google.com/sample/
http://www.google.com/hello/
ygemici
February 15, 2012, 5:02am
6
Just remove it:
# sed 's/&CS=3//' file1>text.txt
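Since the question asks about lines that end with &CS=3, it may be safer to anchor the pattern with $ so that an &CS=3 in the middle of a URL is left alone. A small sketch, where the strip_demo directory and the fourth input line are made up to show the difference:

```shell
#!/bin/sh
# strip_demo is a hypothetical scratch directory; the first three lines are
# the URLs from the question, the fourth is an invented counter-example
# with &CS=3 in the middle of the query string.
mkdir -p strip_demo
cat > strip_demo/file1 <<'EOF'
http://www.google.com/test/&CS=3
http://www.google.com/sample/&CS=3
http://www.google.com/hello/&CS=3
http://www.google.com/?x=1&CS=3&y=2
EOF

# "$" anchors the match to the end of the line, so only a trailing &CS=3
# is removed. On the pattern side of s/// the "&" is an ordinary character;
# it is only special in the replacement part.
sed 's/&CS=3$//' strip_demo/file1 > strip_demo/text.txt
cat strip_demo/text.txt
```

With the unanchored version the fourth line would become http://www.google.com/?x=1&y=2; with the anchor it passes through unchanged.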