URL extraction from JSON file

busyboy · March 6, 2013, 3:32am

I'm trying to extract URLs from a JSON file that was produced by exporting Mozilla Firefox bookmarks. It's a single-line file; awk reports the following record and field counts:

$ awk 'END { print NR }' bookmarks.json
1
$ awk 'END { print NF }' bookmarks.json
2706
$ awk -F, 'END { print NF }' bookmarks.json
4754

Using sed, I get only one occurrence and the rest are missed.

$ sed  's/.*"\(http:.*\)"/\1/' bookmarks.json
http://www.oracle.com/us/products/servers-storage/servers/blades/index.html","charset":"UTF-8}]}]}
$

An extract from the JSON file looks like this:

{"title":"","id":1,"dateAdded":1331548812311000,"lastModified":1331549028262000,"type":"text/x-moz-place-container","root":"placesRoot","children":[{"title":
"Bookmarks Menu","id":2,"parent":1,"dateAdded":1331548812311000,"lastModified":1342096853234000,"type":"text/x-moz-place-container","root":"bookmarksMenuFold
er","children":[{"title":"Recent Tags","id":925,"parent":2,"annos":[{"name":"Places/SmartBookmark","flags":0,"expires":4,"mimeType":null,"type":3,"value":"Re
centTags"}],"type":"text/x-moz-place","uri":"place:sort=14&type=6&maxResults=10&queryType=1"},{"index":1,"title":"Recently Bookmarked","id":924,"parent":2,"a
nnos":[{"name":"Places/SmartBookmark","flags":0,"expires":4,"mimeType":null,"type":3,"value":"RecentlyBookmarked"}],"type":"text/x-moz-place","uri":"place:fo
lder=BOOKMARKS_MENU&folder=UNFILED_BOOKMARKS&folder=TOOLBAR&sort=12&excludeQueries=1&maxResults=10&queryType=1"},{"index":2,"title":"","id":26,"parent":2,"da
teAdded":1243009025055489,"lastModified":1331549044829000,"annos":[{"name":"placesInternal/GUID","flags":0,"expires":4,"mimeType":null,"type":3,"value":"{445
36f3f-1d99-4e6d-8b77-d5e89c334d2d}2"}],"type":"text/x-moz-place-separator"},{"index":3,"title":"Get Bookmark Add-ons","id":27,"parent":2,"dateAdded":12430090
25055489,"lastModified":1331549044829000,"annos":[{"name":"placesInternal/GUID","flags":0,"expires":4,"mimeType":null,"type":3,"value":"{44536f3f-1d99-4e6d-8
b77-d5e89c334d2d}3"}],"type":"text/x-moz-place","uri":"https://en-us.add-ons.mozilla.com/en-US/firefox/bookmarks/"},{"index":4,"title":"","id":28,"parent":2,
"dateAdded":1243009025055489,"lastModified":1331549044829000,"annos":[{"name":"placesInternal/GUID","flags":0,"expires":4,"mimeType":null,"type":3,"value":"{
44536f3f-1d99-4e6d-8b77-d5e89c334d2d}4"}],"type":"text/x-moz-place-separator"},{"index":5,"title":"Mozilla Firefox","id":29,"parent":2,"dateAdded":1243009025
055489,"lastModified":1331549044845000,"annos":[{"name":"placesInternal/GUID","flags":0,"expires":4,"mimeType":null,"type":3,"value":"{44536f3f-1d99-4e6d-8b7
7-d5e89c334d2d}5"}],"type":"text/x-moz-place-container","children":[{"title":"Help and Tutorials","id":30,"parent":29,"dateAdded":1243009025055489,"lastModif
ied":1331549044845000,"annos":[{"name":"placesInternal/GUID","flags":0,"expires":4,"mimeType":null,"type":3,"value":"{44536f3f-1d99-4e6d-8b77-d5e89c334d2d}6"
}],"type":"text/x-moz-place","uri":"http://en-us.www.mozilla.com/en-US/firefox/help/"},{"index":1,"title":"Customize Firefox","id":31,"parent":29,"dateAdded"
:1243009025055489,"lastModified":1331549044845000,"annos":[{"name":"placesInternal/GUID","flags":0,"expires":4,"mimeType":null,"type":3,"value":"{44536f3f-1d
99-4e6d-8b77-d5e89c334d2d}7"}],"type":"text/x-moz-place","uri":"http://en-us.www.mozilla.com/en-US/firefox/customize/"},

Regards,

Yoda · March 6, 2013, 3:37am

awk -F'"' '{ for(i=1; i<=NF; i++) { if($i ~ /^http/) print $i } } ' bookmarks.json
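A minimal sketch of why this works, run against a made-up two-bookmark sample rather than the real export (the `sample` and `urls` variable names are only for illustration). Using the double quote as the field separator puts every JSON string into its own field, so each URL can be tested on its own:

```shell
# Hypothetical sample in the same shape as the Firefox export.
sample='{"children":[{"title":"Help","uri":"http://en-us.www.mozilla.com/en-US/firefox/help/"},{"title":"Add-ons","uri":"https://en-us.add-ons.mozilla.com/en-US/firefox/bookmarks/"}]}'

# Split on double quotes, then print every field that starts with "http".
# The same test also accepts fields that start with "https".
urls=$(printf '%s\n' "$sample" |
  awk -F'"' '{ for (i = 1; i <= NF; i++) if ($i ~ /^http/) print $i }')

printf '%s\n' "$urls"
```

This assumes no URL or title contains an escaped double quote; a real JSON parser would be safer for such input.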

hanson44 · March 6, 2013, 4:03am

Could this work:

 grep -o 'http:[^"]*' file

busyboy · March 6, 2013, 4:12am

Both are perfect. Thanks, buddies.

Yoda · March 6, 2013, 4:13am

Yes, that will work with GNU grep.

It also needs a minor modification to include secure http (https):

grep -oE 'https?:[^"]*' file
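A quick check on a hypothetical three-entry sample (the `sample`, `star` and `urls` names are made up for illustration). In a regular expression `*` repeats only the character before it, so `http*:` means "htt", any number of "p", then ":" and never reaches the "s" in https. Making the "s" optional with `s?` under `-E` is one way to cover both schemes:

```shell
# Hypothetical sample: one http URL, one https URL, one non-web "place:" URI.
sample='{"uri":"http://example.com/a"},{"uri":"https://example.com/b"},{"uri":"place:sort=14"}'

# "p*" only repeats the "p", so the https URL is not matched here.
star=$(printf '%s\n' "$sample" | grep -o 'http*:[^"]*')

# -o prints each match on its own line; -E makes "s?" an optional "s".
urls=$(printf '%s\n' "$sample" | grep -oE 'https?:[^"]*')

printf 'with http*:\n%s\n' "$star"
printf 'with https?:\n%s\n' "$urls"
```

The `-o` option is not part of POSIX grep, so this assumes a grep that supports it, such as GNU grep.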

busyboy · March 6, 2013, 4:14am

awk -F'"' '{ for(i=1; i<=NF; i++) { if($i ~ /^http/|| $i ~ /https/ ) print $i } } ' bookmarks

For both http and https.

Yoda · March 6, 2013, 4:16am

The regexp in if($i ~ /^http/) already covers both http & https, because /^http/ matches any field that starts with http.