Noob trying to improve

Hi everyone!

This is my first post here; I hope I'm not already violating any rules! I'd also like to apologize in advance, as this will definitely be a noob post... please have patience and faith!

Now that I have set the ground rules, my objective is to understand how to write a sort of script that can search and extract information from the web and put that information into a CSV file.

To be completely transparent, I already have a specific idea of the kind of information I'd like to get: the prices and details of some asset listings on the web.

I have checked the web and found out that a while loop with a bunch of curl and grep commands could do the trick. Is there anyone who can help me build this sort of "web-crawler"?

Thanks in advance to you all!

Ardzii

Welcome to the forum.

It is always beneficial to post the OS and shell version you are using, as well as the available tools and their versions (e.g. awk, sed, ...).

This is not a new request; it might be worthwhile to search these forums for similar problems to get a starting point for your particular solution. Any attempts/ideas/thoughts from your side? Do you have any preferences as to the tools to be deployed? Sample input and output data would help as well!


Hey RudiC!

Thanks for your quick reply!
Indeed, with so little information it is hard to help, right?
OK, let me be a bit more precise then.

First off, I'm using macOS Sierra V.10.12.2 and regarding the Shell version, I seem to be working on:

GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin16)
Copyright (C) 2007 Free Software Foundation, Inc.

For the tools I'm currently trying to use, I've got:

$ grep --version
grep (BSD grep) 2.5.1-FreeBSD
$ curl --version
curl 7.51.0 (x86_64-apple-darwin16.0) libcurl/7.51.0 SecureTransport zlib/1.2.8
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtsp smb smbs smtp smtps telnet tftp 
Features: AsynchDNS IPv6 Largefile GSS-API Kerberos SPNEGO NTLM NTLM_WB SSL libz UnixSockets 

I can't seem to find my "sed" version (the command "sed --version" apparently doesn't exist), but I'm using the "basic" one that comes with macOS.

For now, I was only able to get and "echo" a listing in my terminal (for 20 entries) by executing the following:

$ curl "https://www.dotmed.com/equipment/2/92/1209/all/offset/0/all?key=&limit=20&price_sort=descending&cond=all&continent_filter=0&zip=&distance=5&att_1=0&att_row_num=1&additionalkeywords=&country=ES" | egrep "href.*view more"

To tell you the truth, a friend helped me with the commands. I'm not even sure I understand the difference between egrep and grep, as I get the same results using "grep"... I mean, I've read the manual and I understand that egrep can handle more complex expressions than grep, but I'm not sure what an "extended regular expression" is...
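From what I gather from the manual, the difference would only show up with ERE operators like alternation (|) or +, which plain grep treats as literal characters. A little test I tried (made-up input):

printf 'cat\ndog\nbird\n' | grep 'cat|dog'      # BRE: no output, "|" is taken literally
printf 'cat\ndog\nbird\n' | egrep 'cat|dog'     # ERE: prints cat and dog

Since "href.*view more" only uses . and *, which work the same in both flavors, grep and egrep give identical results here, I suppose.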

I will definitely search the forum for more info, but what I was thinking was to set up a while loop so that I can get the price (and more details) of each listing that my command already provides.
I was thinking of first printing the output into an XML file, but I can't seem to find the command to write my grep output anywhere.

EDIT: I forgot, I also thought of using "sed" to extract only the info that I'm looking for.

Thanks again!

Thanks for the system details. I'm not familiar with macOS, and that bash version is somewhat aged, but you should get along.

So, now we have a text to work upon. What do you want to extract from it? The command above just lists several URLs, but no technical details nor prices.

Extending the grep regex with an alternation like

egrep "href.*view more|Asking Price"

gives you a link and a price

			<a href="/listing/bone-densitometer/ge/lunar-dpx/2299124"> view more </a>
			<p style="margin-top:11px"><span style="font-size:11px;">Asking Price:<br /></span>$20,000 USD

I'm not sure this would suffice to fulfill your needs; you might need to dive into the next level URLs, now.

While we could start a nice ping-pong conversation about every single detail needed and how to get at it, I firmly believe it would be better if you took a step back and rephrased the specification, giving the framework of what you really want, in which shape, where and how to get it, and your ideas about how to transform the input data into the output. We can then help to implement and/or improve that.


Hey there RudiC!

This is indeed a great step! So now I can get the make, model and price from the search directly. But you're right:

To take a step back then: my objective is to get, for each type of medical equipment on the website:

  • The make and model (which apparently appear in the http link)
  • the Asking price (if available)

but also some information that is detailed once we get into each listing:

  • Year of manufacture
  • Country
  • Publication date
  • Availability (if available)
  • Condition

And send all this information to a CSV or any kind of database file (even a spreadsheet).

All the best!

Ardzii

Please back these details up with input samples, explaining how and where to get them and how to identify the data you need. You don't expect people here to crawl through all those sites, do you?


Hey there RudiC!

Sorry for not answering earlier. (A note: the forum wouldn't let me post URLs until I had at least 5 posts, so I originally had to strip the "https" prefixes from my links.)
You're right, I don't expect people to crawl through the site. I'm not sure I understand what you mean, though.

To get the data, you need to generate a listing through this link:

https://www.dotmed.com/equipment/2/92/1209/all/offset/0/all?key=&limit=20&price_sort=descending&cond=all&continent_filter=0&zip=&distance=5&att_1=0&att_row_num=1&additionalkeywords=&country=ES

The URL is pretty easy to adapt, and I think I have adapted it to my current needs. This will get a densitometer equipment listing, and afterwards I can easily adapt the URL myself to get to the other equipment types (as the structure is the same across all of them).

A few comments on the link itself though:

&limit=20

Obviously limits the output to 20 items. I am using 20 right now so that the requests are fast and easy, but I will change it to 200 later to get many more listings.

&price_sort=descending

I'm mostly interested in listings where the price is mentioned, so I decided to sort by descending price so that I get the listings with prices first (more relevant to me).

&country=ES

I chose Spain as a filter, but it's not particularly important. I'd rather have EU listings first, which is why I chose Spain.
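Since these are all just query parameters, I figure I could build the URL from variables so each filter can be changed in one place (a rough sketch, with made-up variable names):

limit=20
sort=descending
country=ES
url="https://www.dotmed.com/equipment/2/92/1209/all/offset/0/all?key=&limit=${limit}&price_sort=${sort}&cond=all&continent_filter=0&zip=&distance=5&att_1=0&att_row_num=1&additionalkeywords=&country=${country}"
curl -s "$url"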

Now back to the command:
With "curl" I'm getting the listing (I could import that listing locally into an HTML file but since that's not the objective, I get right away with the grep command).
The grep then lists the links available for the listing I specified and that's it for now.

The next part:
The rest of the info I mentioned earlier is now located in each individual URL.
What I need to do now is, based on the previous "grep":

            <a href="/listing/bone-densitometer/ge/lunar-dpx/2299124"> view more </a>
            <a href="/listing/bone-densitometer/osteosys/dexxum-t/2299556"> view more </a>
            <a href="/listing/bone-densitometer/hologic/discovery-c/1184884"> view more </a>
            <a href="/listing/bone-densitometer/ge/prodigy/1184904"> view more </a>
            <a href="/listing/bone-densitometer/ge/lunar-idxa/2246457"> view more </a>
            <a href="/listing/bone-densitometer/ge/lunar-prodigy/1668884"> view more </a>
            <a href="/listing/bone-densitometer/hologic/qdr-4500-elite/1738541"> view more </a>
            <a href="/listing/bone-densitometer/hologic/discovery-c/1405820"> view more </a>
            <a href="/listing/bone-densitometer/alara/metriscan/653936"> view more </a>
            <a href="/listing/bone-densitometer/sunlight/omnisense-7000s/470081"> view more </a>
            <a href="/listing/bone-densitometer/hologic/delphi-c/99115"> view more </a>
            <a href="/listing/bone-densitometer/lunar/dpx-nt/2310470"> view more </a>
            <a href="/listing/bone-densitometer/hologic/qdr-4500/2219929"> view more </a>
            <a href="/listing/bone-densitometer/norland/excell/1184892"> view more </a>
            <a href="/listing/bone-densitometer/ge/lunar-dpx-duo/875678"> view more </a>
            <a href="/listing/bone-densitometer/ge/lunar-dpx-nt/2284643"> view more </a>
            <a href="/listing/bone-densitometer/hologic/discovery-qdr-10041/2257994"> view more </a>
            <a href="/listing/bone-densitometer/sunlight/mini-omni-por/2183339"> view more </a>
            <a href="/listing/bone-densitometer/ge/lunar-dpx-bravo/2225055"> view more </a>

for each link (for instance, starting with the first one in my list:

<a href="/listing/bone-densitometer/ge/lunar-dpx/2299124"> view more </a>

, go and get:

The price:

$ curl "https://www.dotmed.com/listing/bone-densitometer/osteosys/dexxum-t/2299556" | fgrep -e "id=\"price"
<ul><li class="left">Price:</li><li class="right" id="price"><span itemprop='price' content='19990.00'>$19,990.00 <span itemprop='currency'>USD</span> <a style='font-size: 5pt' href='#' title='Convert the Currency' onClick='javascript:window.open("/listings/currency.html?amount=19990.00&currency_from=USD", "listing", config="height=200,width=500,toolbar=no,menubar=no,scrollbars=yes,resizable=no,location=no,directories=no,status=yes"); return false;'>[convert]</a></span></li></ul>

The condition:

$ curl "https://www.dotmed.com/listing/bone-densitometer/osteosys/dexxum-t/2299556" | fgrep -e "id=\"condition"
<ul><li class="left">Condition:</li><li class="right" id="condition"><span itemprop='condition' content='new'>New</span></li></ul>

The date_updated:

$ curl "https://www.dotmed.com/listing/bone-densitometer/osteosys/dexxum-t/2299556" | fgrep -e "id=\"date_updated"
<ul><li class="left">Date updated:</li><li class="right" id="date_updated">December  09, 2016</li></ul>

Obviously, my objective is to generate a loop that will get me this info for each link in the listing, see how I can clean up the info, and send it to a CSV or similar file to store the information.
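In rough pseudo-shell, what I have in mind is something like this (just a sketch, untested; "listings.csv" is a made-up file name, and the fields would obviously still contain HTML that needs cleaning):

curl -s "https://www.dotmed.com/equipment/2/92/1209/all/offset/0/all?key=&limit=20&price_sort=descending&cond=all&continent_filter=0&zip=&distance=5&att_1=0&att_row_num=1&additionalkeywords=&country=ES" |
sed -n 's/.*<a href="\(\/listing[^"]*\)"> view more.*/\1/p' |
while read link
do
        page=$(curl -s "https://www.dotmed.com$link")           # fetch the individual listing
        price=$(echo "$page" | fgrep "id=\"price")              # pick out the fields of interest
        condition=$(echo "$page" | fgrep "id=\"condition")
        updated=$(echo "$page" | fgrep "id=\"date_updated")
        echo "$link;$price;$condition;$updated" >> listings.csv # one semicolon-separated line per listing
done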

I hope that some of this long post contains the info you were looking for? If not, I apologize; and if you could detail a little more what you need, that'd be great!

Thanks again and as usual!

Ardzii

This may serve as a starting point (file contains the web content downloaded before):

awk '/href.*view more/ {                                # keep only the "view more" link lines
                        sub (/^[^<]*<a href="/, "curl -s https://www.dotmed.com")
                        sub (/">.*$/, "")               # turn each link into a curl command
                        print}
' file | sh | awk '                                     # run the curl commands; filter each page
/<\/*title>/ ||
/id=\"price/ ||
/id=\"condition/ ||
/id=\"date_updated/     {gsub (/<[^>]*>/, _)            # strip all HTML tags
                         if (length) print              # print what is left, if anything
                        }
' 

 
Used GE Lunar DPX Bone Densitometer For Sale - DOTmed Listing #2299124: 
			Price:$20,000.00 USD [convert]
			Condition:Used - Excellent
			Date updated:December  18, 2016
 
New OSTEOSYS DEXXUM T Bone Densitometer For Sale - DOTmed Listing #2299556: 
			Price:$19,990.00 USD [convert]
			Condition:New
			Date updated:December  09, 2016
 
Used HOLOGIC DISCOVERY C Bone Densitometer For Sale - DOTmed Listing #1184884: 
			Price:$19,000.00 USD [convert]
			Condition:Used - Good
			Date updated:December  07, 2016
.
.
.

Hey RudiC!

Thanks for that! It looks really great! I have a lot of work ahead of me to understand your lines, though!

I'll let you know once I'm able to get the results as you did!

Best!

Ardzii

A wee bit improved, so that you can add the search words at the end as parameters, separated by pipe symbols:

curl -s "https://www.dotmed.com/equipment/2/92/1209/all/offset/0/all?key=&limit=20&price_sort=descending&cond=all&continent_filter=0&zip=&distance=5&att_1=0&att_row_num=1&additionalkeywords=&country=ES" |
awk '/href.*view more/ {sub (/^[^<]*<a href="/, "curl -s https://www.dotmed.com")
                        sub (/">.*$/, "")
                        print}
' |
sh |
awk '
match ($0, "id=\"(" IDS ")\"")  ||                      # any of the ids passed in via IDS ...
/<\/*title>/    {gsub (/<[^>]*>/, _)                    # ... or the page title; strip HTML tags
                 print
                }
' IDS="price|condition|date_updated" 

Used GE Lunar DPX Bone Densitometer For Sale - DOTmed Listing #2299124: 
			Price:$20,000.00 USD [convert]
			Condition:Used - Excellent
			Date updated:December  18, 2016
 
New OSTEOSYS DEXXUM T Bone Densitometer For Sale - DOTmed Listing #2299556: 
			Price:$19,990.00 USD [convert]
			Condition:New
			Date updated:December  09, 2016
 
Used HOLOGIC DISCOVERY C Bone Densitometer For Sale - DOTmed Listing #1184884: 
			Price:$19,000.00 USD [convert]
			Condition:Used - Good
			Date updated:December  07, 2016
.
.
.

OK! I got most of it...

In the first step, you get the listing:

curl -s "https://www.dotmed.com/equipment/2/92/1209/all/offset/0/all?key=&limit=5&price_sort=descending&cond=all&continent_filter=0&zip=&distance=5&att_1=0&att_row_num=1&additionalkeywords=&country=ES" | awk '/href.*view more/ {sub (/^[^<]*<a href="/, "curl -s https://www.dotmed.com")
                        sub (/">.*$/, "")
                        print}

In the second one, you created a variable IDS that looks for the price, condition and date_updated, and you print the results.

curl -s "https://www.dotmed.com/equipment/2/92/1209/all/offset/0/all?key=&limit=20&price_sort=descending&cond=all&continent_filter=0&zip=&distance=5&att_1=0&att_row_num=1&additionalkeywords=&country=ES" | awk '/href.*view more/ {sub (/^[^<]*<a href="/, "curl -s https://www.dotmed.com")                         sub (/">.*$/, "")                         print} ' | sh | awk ' match ($0, "id=\"(" IDS ")\"")  || /<\/*title>/    {gsub (/<[^>]*>/, _)                  print                 } ' IDS="price|condition|date_updated" 

I added a >> "/Users/myuser/Desktop/test.csv" after the print to get the output exported to a CSV file.
I've been looking around for the past hour now, and I can't seem to find out how to put each listing on one line, with a ";" dividing the "description" (or "title") from the price, condition and date_updated, instead of having 4 lines created per entry.

I know that something has to change between the "||" after the match and the print, but I have no idea where and how...
Could you help me once more?

Thanks as usual!!

Ardzii

If you'd accept a trailing comma (removing it would need additional measures), set the output record separator to a comma: ORS="," . As ALL the info would then come in one long line, we need to find out how to separate a single machine's data from the next. I used the beginning of an HTML doc for this. Try adding the following to your script:

.
.
.
/^<!DOCTYPE/    {printf RS
                }
END             {printf RS
                }
' IDS="price|condition|date_updated|in_stock" ORS="," 

Please be aware that any comma INSIDE the fields will lead to misinterpretation if the result is read elsewhere as comma-separated fields.
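For orientation, the assembled pipeline would then look roughly like this (a sketch only; the output file path is just an example):

curl -s "https://www.dotmed.com/equipment/2/92/1209/all/offset/0/all?key=&limit=20&price_sort=descending&cond=all&continent_filter=0&zip=&distance=5&att_1=0&att_row_num=1&additionalkeywords=&country=ES" |
awk '/href.*view more/ {sub (/^[^<]*<a href="/, "curl -s https://www.dotmed.com")
                        sub (/">.*$/, "")
                        print}
' |
sh |
awk '
match ($0, "id=\"(" IDS ")\"")  ||
/<\/*title>/    {gsub (/<[^>]*>/, _)
                 print
                }
/^<!DOCTYPE/    {printf RS
                }
END             {printf RS
                }
' IDS="price|condition|date_updated|in_stock" ORS="," > /Users/myuser/Desktop/test.csv

Here printf RS prints a newline (RS, the input record separator, defaults to a newline) at the start of each new page, while ORS="," makes every print end with a comma, so each machine's fields end up on one comma-separated line.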


Hey RudiC!

Thanks a lot for your follow-up! I'm sorry, but I think you're too advanced for me...
I have no idea how to combine bash commands with HTML, or where to insert the new code into my script.

For now, I have this:

curl -s "https://www.dotmed.com/equipment/2/92/1209/all/offset/0/all?key=&limit=20&price_sort=descending&cond=all&continent_filter=0&zip=&distance=5&att_1=0&att_row_num=1&additionalkeywords=&country=ES" | 
awk '
/href.*view more/ {sub (/^[^<]*<a href="/, "curl -s https://www.dotmed.com")
                         sub (/">.*$/, "")
                         print} ' | 
sh | 
awk ' match ($0, "id=\"(" IDS ")\"")  ||
 /<\/*title>/    {gsub (/<[^>]*>/, _)
                  print >> "/Users/MyUser/Desktop/test.txt"
                 } ' IDS="price|condition|date_updated" 

I already tried to replace this portion:

/<\/*title>/    {gsub (/<[^>]*>/, _)
                  print >> "/Users/MyUser/Desktop/test.txt"
                 } ' IDS="price|condition|date_updated" 

with your new code:

/^<!DOCTYPE/    {printf RS
                 }
END             {printf RS
                 } ' IDS="price|condition|date_updated|in_stock" ORS=","

But it yielded a blank "page" in my terminal. Plus, I'm not sure at all how to export to a file.
Again, I'm truly sorry to be such a burden, and I would totally understand if you weren't able to help me further!

Oh! And one last thing: having a comma is perfect, I'll try to deal with the "inner" commas afterwards. Using Excel, I can still fine-tune that pretty easily, I guess...

Thanks anyways and all the best,

Ardzii

Well, I said "add", not "replace". Add the lines after the print statement.

EDIT: And, yes, replace this line:

                 } ' IDS="price|condition|date_updated"

It worked like a charm, RudiC!
I mean, I'm certain you expected it to work...

I even replaced the "," with a ";", which helps me import the txt or csv into Excel!
Now I'll be working on building something to convert that looooonnnngggg line into columns!

Thanks a lot for your patience and your kind help!

Best!

Ardzii

EDIT: Oh yes, one quick word: I'm currently taking a Bash course on Udemy... hopefully I'll come back stronger next time!

What "looooonnnngggg line"? There should be admittedly somewhat lengthy lines with a title and max. four more fields if you specify four IDS (N fields for N IDS).

EDIT: But - hold on - I see you redirecting with print >> "/Users/MyUser/Desktop/test.txt"? If so, you need to redirect the printf RS as well! Or, redirect the entire output of the pipe.
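In other words, either redirect in both places inside the awk script:

/^<!DOCTYPE/    {printf RS >> "/Users/MyUser/Desktop/test.txt"
                }
END             {printf RS >> "/Users/MyUser/Desktop/test.txt"
                }

or - simpler - drop all the redirections inside awk and redirect the whole pipeline once at its very end (a sketch; the path is just your example from above):

' IDS="price|condition|date_updated" ORS="," > /Users/MyUser/Desktop/test.txt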

Hey RudiC!

It's been a while, I know, but as I said, I was busy learning bash.

I'm not saying I've got it all, I still have a long way to go...
I just wanted to post here what I've been able to do all on my own so far.
It will definitely seem barbaric to you, and less elegant than what you did earlier with the awk command, but as I'm not sure how to control that yet, I'm taking another road:

#!/bin/bash

#setting variable for the link construction. This will be the part that comes after the www.dotmed.com"$link" for the second curl
set link

#Setting the index for the while loop. The limit (the constant in the while condition) defines the amount of equipment to "crawl"
i=1

#Setting the offset variable that helps passing from one href to the next. This variable is used in the first curl link
offset=0

#Starting the loop for the crawl
while [ $i -lt 5 ]
do

#Getting the listing and assigning each listing to the variable "link"
        link=$(curl "https://www.dotmed.com/equipment/2/5/2693/all/offset/$offset/all?key=&limit=1&price_sort=descending&cond=all&continent_filter=0&zip=&distance=5&att_1=0&att_row_num=1&additionalkeywords=&country=ES" | egrep "href.*view more" | sed -n 's/.*href="\([^"]*\).*/\1/p')

#Getting information from each listing
        curl "https://www.dotmed.com$link" | fgrep -e "id=\"price"

#Resetting for next iteration
        unset link      
        (( i++ ))
        (( offset++ ))
done

The great thing is that I can run it on any Linux machine, plus with this script I'm getting into each listing to pull info from there...
Now I've got to learn more about sed and grep to extract the information I need automatically, and I'll be done.
Easy, right? Hopefully I will be able to do it soon.

If you have any comments on the script, please be my guest! I'm still trying to learn!

All the best!

Even if I am not RudiC: you are doing quite fine.

In (German) medicine there is a proverb: he who heals is right. In programming the same is true: as long as a program is doing what it is supposed to do, it is kinda hard to argue...

A few suggestions, though:

#setting variable for the link construction. 
local link=""

There is a difference between an unset variable and one that has a value of "" (empty string) or zero (for numbers). What you want is to declare the variable so that you can give some (meaningful) value to it, which is - if this is yet to be determined - an empty value. In bash the keywords to declare variables are "local" and "declare" (or even "typeset", perhaps in an effort to be compatible with the Korn shell).
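A quick sketch to make that difference visible, using the ${var+word} expansion (which only expands to word if the variable is set at all):

unset var
echo "${var+defined}"       # prints an empty line: var is unset
var=""
echo "${var+defined}"       # prints "defined": var is set, its value is empty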

local -i i=1
local -i offset=1

See above. As a suggestion: always give variables meaningful names. Once your script grows to some length and you juggle several indexes at the same time, you might want to have, e.g., a "fooidx" and a "baridx" instead of an "i" and a "j".

#Resetting for next iteration
        link=""      
        (( i++ ))
        (( offset++ ))

You don't want to unset (that is: the opposite of define) the variable, just clear its content. So, as in the declaration, you just assign an empty string instead of unsetting it.

As a suggestion: I always put commentary on the same line as the code it belongs to, and always at a fixed horizontal position. Hence, instead of your loop, I'd write:

#Starting the loop for the crawl
while [ $i -lt 5 ] ; do                     # crawling loop
                                            # getting the link
     link=$( curl "your-link-here" |\
             egrep "href.*view more" |\
             sed -n 's/.*href="\([^"]*\).*/\1/p' \
           )
                                            # extract link
     curl "https://www.dotmed.com$link" | fgrep -e "id=\"price"

        link=""                             # Reset for next iteration
        (( i++ ))
        (( offset++ ))
done

To my eyes this is easier to read, but again: whatever helps you, you should do. In the pipeline:

     link=$( curl "your-link-here" |\
             egrep "href.*view more" |\
             sed -n 's/.*href="\([^"]*\).*/\1/p' \
           )

You can do it all in sed, without an additional egrep:

     link=$( curl "your-link-here" |\
             sed -n '/href.*view more/ s/.*href="\([^"]*\).*/\1/p' \
           )

As a rule of thumb: grep/sed/awk | grep/sed/awk is always wrong, because it can all be done in the respective tool chosen.

I hope this helps and have (more) fun programming.

bakunin


Another comment.
"href.*view more" works the same with ERE and RE, so egrep can be replaced by grep.
And because sed takes an RE and does not yet have one, you can move it to sed

link=$(
  curl "your-link-here" |
  sed -n '/href.*view more/ s/.*href="\([^"]*\).*/\1/p'
)

If the lines break at certain logical points, one does not need a \ at the end.
--
Just seeing that bakunin got the sed trick in, too. Maybe my explanations add some value.


Hey Bakunin! Hi MadeinGermany!

Thanks a lot you guys for your interventions!

@Bakunin:
I knew about declaring the variable so that it can be used elsewhere; however, in the course I'm taking they didn't expand on the added value of it. But I will look KornShell up now, thanks to your intervention!
I see that you used local (which I had not seen yet) rather than declare (the one I knew about): any particular reason? Regarding the "-i", I guess it stands for integer?

I loved your comment regarding indexes! You're right, the clearer the better! Same for the comments... I'll give your approach a try and see how it works for me.

@Bakunin @MadeinGermany:
About sed|grep|awk, you must both be right! To tell you the truth, I haven't gotten to the grep and sed sections yet. I tried some grep lines I found on the web, and as for the sed: I literally copied and pasted it into my script from the web (someone was trying to get rid of the same thing).

Let me keep going with my Bash course and I'll come back with an updated, better version.

Thanks again to you both and have a good one!

EDIT: I just tried with the few changes you suggested, and for some reason my terminal returns "local: can only be used in a function" - and there is definitely no function in my script... I will try using declare, what do you think?
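From what I've read so far, it seems "local" is only valid inside a function body, while "declare" works at the top level of a script, too - something like this (my own little test):

#!/bin/bash

declare -i count=1                  # fine at script top level

myfunc () {
    local -i inner=2                # "local" only works inside a function
    echo $(( count + inner ))
}

myfunc                              # prints 3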