Noob trying to improve

There is more to declaring variables, just as there is more to programming than making it work correctly: well-written programs work correctly, and so do badly written ones (anything that does not work is not a program but a mess), but a well-written program is also easy to understand, and therefore easy to maintain and well documented (which goes hand in hand with "easy to understand"). The difference between "well written" and "badly written" does not always show immediately, but it usually does when you try to create version 10 from version 9. This is true for every programming language and every programming paradigm. Whatever technology you use: strive to write well-written programs.

Declaring variables has several benefits: first, you can assign sensible initial values, which lessens the risk of using a variable before it is ready to be used. Second, declaring the variables up front automatically builds a sort of "data dictionary": you describe to yourself, almost as a side effect, what you are using each variable for, what it is expected to hold, and so on. That way you build up a common point of reference, and once your script has grown to several hundred lines and you have merely used it for some months instead of rewriting it, you will be thankful to yourself for the explanations of your thoughts you left behind.

Programming is about organising your thoughts and the better you organise them the better the results are.

Yes, "-i" is for integer. Regarding the hiccup between "local" and "declare": probably "declare" is correct. I don't use bash much, so you can safely assume that to be my error, not yours.

I hope this helps.

bakunin

local means: the variable exists only within the current scope.
In Bash: within the current function.
Bash does not require variables to be declared, and it is loosely typed.
For example, you can add a 1 to the string "5" and get "51" or 6, depending on the operator.
With declare -i you can limit the misuse of variables. For example

i="a"; [ $i -lt 5 ] && echo "$i is less than 5"

gives an error (bash complains "integer expression expected"). With declare -i i the "a" is evaluated as an arithmetic expression and, as long as no variable named a is set, becomes 0, so there will be no error (but it might still cause a malfunction in the code that follows).
declare makes most sense for special variables like arrays.
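
A minimal sketch of the points above, assuming bash (it will not run in a plain POSIX sh). All variable and function names are made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical names throughout; only declare, local and -i come from the discussion.

s="5"
s+=1                  # plain variable: += appends text, giving "51"

declare -i n=5
n+=1                  # integer variable: += does arithmetic, giving 6

unset a               # make sure no variable named "a" exists
declare -i i="a"      # "a" is evaluated arithmetically; an unset name counts as 0

counter() {
    local -i hits=41  # exists only while counter() runs
    hits+=1
    total=$hits       # copy the result into a global variable
}
counter

echo "s=$s n=$n i=$i total=$total hits=${hits:-unset}"
```

After the function returns, hits is gone again, which is all that local promises.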


Hey again guys!

Thanks so much for your time and the detailed explanations! Great material right there (especially for a beginner!).

@Bakunin: Your comments were heard! Hahahaha! Next version of the script I'll post will, hopefully, be well organized (if not correct me! :D).

@MadeinGermany: The difference between the 2 is clear! Thanks a lot for that.

I'd like to come back to an earlier comment from Bakunin though:

I've been working on grep exclusively lately to try to find the correct syntax to extract the info right away, but I can't seem to find a way (other than using sed) to extract a specific portion of a line.
Take as an example, your correction of the link extraction:

Could you have done that with a grep instead of a sed? I'm trying to determine the limitations of each of these 3 commands (don't worry, you don't have to explain that to me here, I'll figure it out on my own! ;)).

Thanks as usual!

Best!

I don't think so, generally. grep is neither intended nor designed to replace or remove patterns or partial strings. Be aware that sed's pattern matching is just as powerful as grep's.
In your special case, a (not recommended) pipe of three greps could do the job:

echo "<a href=\"/listing/bone-densitometer/osteosys/dexxum-t/2299556\"> view more </a>" | grep '.*href="[^"]*"> view more.*$' | grep -o '"[^"]*"' | grep -o '[^"]*'
/listing/bone-densitometer/osteosys/dexxum-t/2299556

Hey RudiC!

Nice to read you again! I hope you're doing great.
OK then, gotcha! Obviously, I'm not so much trying to force the use of grep rather than understand the limitations between the 3 commands (grep | sed | awk).

I'm starting to understand how sed works however I'm still looking at baby versions of what you guys can do!
It might seem trivial to you but it's kind of a "bitchy" command :D. The syntax is not (in my opinion) easy to get at all... I'm actually kind of struggling on this one a little.

Let me see what I can get and do from there and I'll come back to you soon with an improved version of this damn script! :D ;)

Thanks again!

grep stands for 'g/re/p', a command of the ed line editor written in the same notation sed uses [where g is Global, re is Regular Expression and p is Print]


As vgersh99 already noted, "grep" comes from "g/re/p", which is a (schematised) editor command: it originated in ed, and sed uses the same /re/ and p notation. Let us see if i can help to improve your understanding:

grep is basically a line filter: you feed it a stream of input (or files) and it displays all the lines matching a certain pattern. Options can e.g. reverse this matching (so that effectively all lines NOT matching the pattern are displayed), you can count the found lines ("-c"), etc., but basically that is it: filtering out lines containing some text pattern from text.

grep is a good tool to find out about the existence of certain text and all that is related to this:

grep -c "pattern" /some/file - count the number of lines containing "pattern"
grep -v "pattern" /some/file - display all lines NOT containing "pattern"
if grep -q "pattern" /some/file ; then - grep -q makes grep not display any lines, matched or otherwise. But since grep will exit with "0" if it has found anything and with "1" if it hasn't this will execute the following code if "pattern" was anywhere in the file.
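
The three uses above can be tried out like this. The sample file and its contents are made up for illustration:

```shell
# Hypothetical sample file; the three grep options are the ones listed above.
tmp=$(mktemp)
printf '%s\n' 'first pattern here' 'nothing to see' 'another pattern' > "$tmp"

matches=$(grep -c "pattern" "$tmp")   # number of lines containing "pattern"
others=$(grep -v "pattern" "$tmp")    # lines NOT containing "pattern"

if grep -q "pattern" "$tmp"; then     # no output, only the exit status is used
    found=yes
else
    found=no
fi

echo "matches=$matches others=$others found=$found"
rm -f "$tmp"
```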

Now sed : sed is for "stream editor" and this is exactly what it is: a highly programmable text editor. You feed it some input text (from a file or a stream of data) but instead of just searching it (like grep) you can also manipulate and change it. If you want to filter out certain lines AND at the same time change the text found by some rules (like cutting out a certain part of the line, but also more complicated things) sed is the tool to turn to.

I am halfway through mustering the energy to write a sed -introduction, so watch out for "the most incomplete introduction to sed" (i only write "most incomplete" articles). To write it in a few sentences is just too complicated, i am sorry. What makes sed similar to grep is that it uses the same "regular expressions" to describe text patterns. So once you learn how to use grep and its powerful pattern-matching engine you can use this knowledge in sed too.

Lastly awk : awk stands for "Aho, Weinberger, Kernighan", its three primary authors. It is a full programming language that resembles C in places, and it sports a regular expression engine very similar to the one in sed (the sed variant is called UNIX BRE - basic regular expressions, the awk variant UNIX ERE - extended regular expressions). Again, its scope overlaps with sed, and for many problems here you will find a sed-solution along with an awk-solution.

awk has a built-in structure for the evaluation of data files: each awk-program consists of three parts: one that is executed before any input is read, one that is executed for every line of input and one that is executed after all input is processed. If you want to set up the program and draw some header, then process the input line by line, finally do some end-processing like drawing footers, sums and the like this is ideal.
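
A small sketch of that three-part structure, with a made-up input of "item price" pairs:

```shell
# Hypothetical input: one "item price" pair per line.
report=$(printf '%s\n' 'apples 3' 'pears 4' 'plums 5' |
awk 'BEGIN { print "ITEM PRICE" }        # runs once, before any input is read
           { print $1, $2; sum += $2 }   # runs for every input line
     END   { print "TOTAL", sum }')      # runs once, after the last line
echo "$report"
```

The header comes from the BEGIN part, the running sum is built up line by line, and the END part prints the total.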

OK, this is a quick and very incomplete overview of what the tools do. You will find that all three of them have their purpose. If you picture the UNIX toolbox as an orchestra waiting for a gifted conductor (you) to make them sound great, sed and awk especially are the exceptionally gifted soloists. You will find that they have their quirks (as all the great artists do), but working with them can be immensely rewarding, and five minutes of truly guiding them to their limit will compensate for the weeks of hard work in rehearsal.

I hope this helps.

bakunin


Hey Bakunin!

Very nice intro to the tools! Thanks!
In the end, by reading you guys, I determined that grep (even though it's still a cool tool) won't help me much with what I'm looking for.
Since there are quite a few examples and tutorials on using sed on the web, I started with that.
I now understand very basic concepts of the tool such as the "-n" and "-i"/"-i.bak" options or the p/s/d commands.
From that (very) basic understanding I tried to get a better grasp of the cool use you made of it earlier:

sed -n '/href.*view more/ s/.*href="\([^"]*\).*/\1/p'

What I got from your command there is that you first specify the line (since SED works with lines exclusively right?) in which you will be applying the substitution:

'/href.*view more/

And you then substitute the portion

*href="\

and

"

at the end with nothing (I'm not sure where is that "nothing" portion in the code you're substituting the text with?)
Then through the print command you keep the only portion that matters:

([^"]

with "^" meaning everything from the beginning to the

"

I'm guessing that the escape characters you use are meant to specify that the

"

is part of the expression to be considered, for instance in:

*href="\

.
If so why does the end of the expression

[^"]

have no escape character?

Finally the -n and the p-command are used to exclusively print the portion that has been edited.

I tried to "re-use" your command based on my understanding but it doesn't print anything...
I actually get a ">" on the next line, as if one of my

"

wasn't closed...

The txt I'm working on is a curl from the same website:

<div id="category_listing" itemscope itemtype="http://data-vocabulary.org/Product">
        
        <div id="category_bg">
        <div class="title">
            <h1 itemprop='name'>For Sale <span itemprop='brand'>HITACHI </span> <span itemprop='name'>AIRIS 1  Magnet</span></h1></div>
            <meta itemprop="category" content="Business & Industrial>Medical Medical Equipment" />
        <!-- end div title -->
                <div class="listing_num">LISTING #2229540</div>
           </div> 
        <div style='border-bottom: dotted 1px #666' class="clr"></div>
        <div id="category_listing_body">
            
<div id="list_detail">

and the command goes like:

sed -n '/*itemprop='\brand'\*span>/ s/.*brand'\([^<]*\).*/\1/p' result2.tx

The objective for now, would be to extract from the text above:

  • the brand
  • the name

Thanks as usual for your enlightenments!

Best!

[^"] is a character that is not a quote
[^"]* is any consecutive number of non-quote characters
\( \) does not mean a character but is a group mark, for later reference

s/.*href="\([^"]*\).*/\1/p

\1 is the reference. It becomes the string that matched within the \( \) . The leading and trailing .* ensure that the entire line is matched, i.e. is deleted and substituted by the back-reference.
\1 actually refers to the 1st \( \) ; \2 would refer to the 2nd...
The -n sed option suppresses the default print. The /p at the end of the substitution is a print if there was a match. So non-matching lines are not printed.
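
To see the effect of the leading and trailing .* in practice, here is a small sketch using the sample line from earlier in the thread (the variable names are only for illustration):

```shell
# Sample line taken from earlier in the thread.
line='<a href="/listing/bone-densitometer/osteosys/dexxum-t/2299556"> view more </a>'

# Whole line matched, so only the back-reference remains.
link=$(printf '%s\n' "$line" | sed -n '/href.*view more/ s/.*href="\([^"]*\).*/\1/p')
echo "$link"

# Without the trailing .* only the matched part is replaced; the rest of the line survives.
partial=$(printf '%s\n' "$line" | sed -n 's/.*href="\([^"]*\)/\1/p')
echo "$partial"
```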


OK! Thanks MadeInGermany!
This changes the deal quite a bit! But it gives me a better view of the substitution being made!

I got:

substitution command / text that is going to be substituted / substitution / print

Now what I'm not sure to grasp is how it manages to stop at the

? Is that thanks to the

thingy? Does the deal go like: Start at

up to the next quote character?

Also why are there

in the structure?
s/.*href="\([^"]*\).*/\1/p

Exactly. The first character that matches in the trailing .* is a quote.
As I said, the leading and trailing .* are needed to "match away" the entire line. Otherwise only the matching portion would be substituted.

---------- Post updated at 12:15 ---------- Previous update was at 11:44 ----------

Now to your second requirement. It can give even experienced people a headache.
In your example the ' is a problem for the shell, in which you call

sed -n '...'

There is no problem if you save the sed code in a separate file and run it with

sed -n -f sed-script result2.txt

And the contents of the sed-script

/itemprop='brand'/ s/.*'brand'>\([^<]*\).*/\1/p

You can add another match in a second line

/itemprop='name'/ s/.*'name'>\([^<]*\).*/\1/p

but it won't match if the first match was successful and the input line was substituted.
It is necessary to save and restore the line.

h
/itemprop='brand'/ s/.*'brand'>\([^<]*\).*/\1/p
g
/itemprop='name'/ s/.*'name'>\([^<]*\).*/\1/p
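
A sketch of how this could be tried out, assuming the four lines are saved in a temporary script file (the file handling and variable names are made up for illustration; the input is the h1 line from the sample):

```shell
# Hypothetical file names; the four script lines are the ones shown above.
script=$(mktemp)
cat > "$script" <<'EOF'
h
/itemprop='brand'/ s/.*'brand'>\([^<]*\).*/\1/p
g
/itemprop='name'/ s/.*'name'>\([^<]*\).*/\1/p
EOF

line="<h1 itemprop='name'>For Sale <span itemprop='brand'>HITACHI </span> <span itemprop='name'>AIRIS 1  Magnet</span></h1></div>"

# h saves the line, the first s prints the brand, g restores the line for the second s.
fields=$(printf '%s\n' "$line" | sed -n -f "$script")
printf '%s\n' "$fields"
rm -f "$script"
```

Because the script lives in a file (here written through a quoted here-document), the shell never touches the single quotes.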

Another aspect is greediness. The * wants to match as much as it can, and the leftmost * is the greediest.
That means /.*'brand'/ matches up to the rightmost 'brand' .
--
Last but not least, the shell method to print a ' within a ' ' string goes like this

 echo 'left'\''right'

Actually it is a concatenation of 'left' and 'right' with a \' in between.
For an embedded sed script it is enough to remember to replace each literal ' with '\'' .
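
Applied to the brand extraction, a sketch could look like this (variable names are made up; the second command is deliberately wrong to show what the shell does to unescaped quotes):

```shell
line="<h1 itemprop='name'>For Sale <span itemprop='brand'>HITACHI </span> <span itemprop='name'>AIRIS 1  Magnet</span></h1></div>"

# Each literal ' inside the single-quoted sed script is written as '\''
brand=$(printf '%s\n' "$line" | sed -n 's/.*itemprop='\''brand'\''>\([^<]*\).*/\1/p')

# Without the escaping the shell strips the inner quotes, so sed
# looks for itemprop=brand (no quotes) and finds nothing.
broken=$(printf '%s\n' "$line" | sed -n 's/.*itemprop='brand'>\([^<]*\).*/\1/p')

echo "brand=[$brand] broken=[$broken]"
```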


Very good! sed might be "love at third sight", but it is an immensely mighty tool.

Here is a very short introduction to my favourite topic:

Regular Expressions

Regular expressions are basically a way to describe text patterns. Because they describe text by (other) text they consist of two classes of characters: "characters" and "metacharacters". Characters are only stand-ins for themselves:

bla-foo

is a valid regular expression - one that matches the string "bla-foo" and nothing else. Now that is not very helpful in itself, but we could use this to use sed as a grep -equivalent. The following commands will do the same:

grep "bla-foo" /some/file
sed -n '/bla-foo/p' /some/file

Therefore there is another class of characters, so-called "metacharacters". These are expressions that either modify other characters (or groups thereof - it is possible to group) or match classes of characters. For modifiers we have two:

* - the character before may be repeated zero or more times
\{m,n\} - the character before has to be repeated between m and n times (m and n being integers)

Let us try an example: we look for the word "colour". The regexp for this (regexps are traditionally enclosed in slashes, which are not part of the regexp):

/colour/

Now suppose several people have written this text, some British, some American, so "colour" is sometimes written "color" and sometimes "colour". The regexp for this is:

/colou*r/

The asterisk makes the "u" optional (zero or more). The downside is that the hypothetical word "colouuuuur" would also be matched, but more on that later. Whenever you construct regular expressions you need to answer several questions:

  1. will it match the lines i want matched?
  2. will it miss lines i want to be matched too? (false negatives)
  3. will it match lines i don't want to be matched? (false positives)
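
A quick way to check these questions is to feed the regexp a handful of test words. This is only a sketch with a made-up word list; the second pattern uses the \{m,n\} modifier from above to allow at most one "u":

```shell
# Hypothetical word list, one word per line.
hits=$(printf '%s\n' color colour colouuuuur collar | grep 'colou*r')
printf '%s\n' "$hits"

# \{0,1\} allows zero or one "u", which gets rid of the false positive.
strict=$(printf '%s\n' color colour colouuuuur collar | grep 'colou\{0,1\}r')
printf '%s\n' "$strict"
```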

Now, suppose we would want to match not only "colour" and "color" but every word starting with a "c" and ending with an "r". For this we use another metacharacter:

. - matches any one character

This is the biggest problem for beginners, btw., especially when they come from DOS-derived systems or have only used "file-globs": "/some/file*" means any file named "file" and whatever trails it. In regexps "*" has a different meaning: it only repeats what stands before it. The regexp that comes closest to the glob "?" is ".", and you can use it in conjunction with "*" to match strings of unknown composition, like the glob "*" does:

/c.*r/

This will match: a "c", followed by any amount ("*") of any character(s) ("."), followed by an "r". Will this be a solution for our example?

Sorry: no. Yes it will match "color" and "colour" and "colonel-major" and "conquistador" but - because "any character" also includes a space - it will also produce matches for "chicken hawk breeder" and the like. We would need to limit our wildcard to only non-whitespace.

For this there is the "character-class" and its inverted counterpart:

[a1x] - will match exactly one occurrence of either "a", "1" or "x"
[^a1x] - negation - will match any character except "a", "1" or "x"

Note that for the "^" to act as negation it has to be the first character inside the brackets. [^^] is "anything other than a caret sign" and [X^] is "either "X" or a caret sign". There are also predefined classes: [[:upper:]] (all upper-case characters), [[:lower:]] (all lower-case characters) and so on. You can also specify ranges: [A-D1-3] (an upper-case character from "A" through "D" or a digit from 1 to 3, i.e. one of "A", "B", "C", "D", "1", "2" or "3").

With this we stand better chance of constructing our regexp:

/c[a-z]*r/

This is coming closer, but now "colonel-major" is not matched any more. It is a matter of definition whether hyphenated words should count as "one word", but suppose they do: you can surely change the regexp yourself now to include hyphenated words, no? Like this:

/c[a-z-]*r/

Alas, there is another sort of false positive we haven't touched upon yet. How about the word "escrow": would it be matched by our regexp or not? Unfortunately: yes. Its middle part consists of a "c" followed by zero or more lower-case characters or hyphens, followed by an "r": "escrow". We would now have to make sure that the "c" is indeed at the beginning of the word and the "r" is at its end. This is in fact quite complicated because naively adding leading and trailing blanks:

/ c[a-z-]*r /

Would help for words in the middle of the line, but fail for words at the line end or the line start. Furthermore it might happen that a word is followed by punctuation like here where "colour, intended as an example" would fail because we look for trailing blanks only.

But let us make it simple: suppose all our text consists of one word per line. We could use the line start and the line end as a sort-of "anchor" for our regexp then. Fortunately this is possible:

^ - beginning of line
$ - end of line

Notice that "^" has two different functions: inside the brackets it means negate the class and at the beginning of a regexp it means beginning of line. Now we finally can construct our regexp:

/^c[a-z-]*r$/

This means (you can already read that yourself, but let me prove that i can read it too, for the record): beginning of line, followed by a "c", followed by any number of any non-capitalized characters or hyphens, followed by a "r" and the line end.
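
Here is a sketch that runs the finished regexp against the words used in this post. The word list is only test data, and LC_ALL=C is an assumption added to keep [a-z] limited to plain ASCII lower case:

```shell
# Hypothetical word list; LC_ALL=C keeps [a-z] limited to ASCII lower case.
words=$(printf '%s\n' color colour colonel-major conquistador escrow 'chicken hawk breeder' Colour |
        LC_ALL=C grep '^c[a-z-]*r$')
printf '%s\n' "$words"
```

"escrow" fails on the ^ anchor, "chicken hawk breeder" on the blanks, and "Colour" on the upper-case "C".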

One last thing you need to know: regexps are greedy! That means: if there is more than one possible match, regexps will always take the longest possible one (non-greedy would be the shortest possible). For instance, consider the following regexp: /a.*b/ . Here is some text, with the matched part enclosed in square brackets:

this is [a blatant example of how greedy matches will be for b]eginners

If this regexp were used to change the matched text and you only wanted to match the "a b" at the beginning of the match, you would need to use a negated character class (the matched part is enclosed in square brackets):

/a[^b]*b/
this is [a b]latant example of how greedy matches will be for beginners
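
You can let sed show you what each regexp matches. In this sketch the substitution wraps the matched part in square brackets; the "&" in the replacement stands for "everything the regexp matched":

```shell
text='this is a blatant example of how greedy matches will be for beginners'

# Greedy: from the first "a" to the LAST "b" on the line.
greedy=$(printf '%s\n' "$text" | sed 's/a.*b/[&]/')
echo "$greedy"

# Negated class: from the first "a" to the NEXT "b".
short=$(printf '%s\n' "$text" | sed 's/a[^b]*b/[&]/')
echo "$short"
```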

I hope this helps.

bakunin


My god Bakunin, you're a master in sed!
Thank you so much for taking the time to write these lines! :)

OK then, let me try on your sed to see if I understood:

sed -n '/href.*view more/ s/.*href="\([^"]*\).*/\1/p'

I'll stay with my understanding of the first part of the command. You're actually not passing any command yet to sed. So what you're looking for in:

'/href.*view more/

is the line that matches "href [any kind of character in between]view more" or to put it another way:

SED, find me the line that has "href" + some string and "view more"in it.

you get that line:

<a href="/listing/magnet/ge/ramp-shim/2322185"> view more </a>

Now comes the good part:

s/.*href="\([^"]*\).*/\1/p'

Within that line, substitute: "[any kind of character before] href=" [following string omitting the possible " characters within the string] by [this same string without " characters that you just found] and print.

but how come the "> view more </a>" portion of the line was left out of the sed? because from what I understand you're including .* which still should include all the characters at the end of the line, shouldn't it?

Thanks as usual!

Best!

EDIT-----

I just tried:

sed -n '/href.*view more/ s/.*href="\(.*\)/\1/p'

and it gave me:

/listing/magnet/ge/ramp-shim/2322185"> view more </a>

So I guess that what's happening with your code is that when you tell sed to exclude the " it simply stops at it and does not go on with the rest of the line.

---------- Post updated at 06:08 PM ---------- Previous update was at 04:43 PM ----------

Hey MadeinGermany!

Bakunin's explanation helped me a lot to go through your answer but I still have a few questions:

Why is it a problem for the shell?
When I paste the previous "PHP" (I assume it's PHP) code into a txt (examplesed.txt) for testing the command:

sed -n '/itemprop='brand'/ s/.*'brand'>\([^<]*\).*/\1/p' examplesed.txt

it yields nothing. There's no output whatsoever...

On that one, I'm not sure to follow either... My objective is to integrate these commands within a loop. So I will have the first iteration and it'll write the output to a file, then the second command (with 'name' for instance) > echo to a file and go to the third iteration etc...
Wouldn't that work under that setting?

I guess that in my case, my problem child would be 'name' that has a first appearance at the beginning of the line.
But in that case couldn't I use 2 right before the 'p' (print command).
I learnt on the web that putting a 1 or a 2 before the p would yield the first or second appearance of the term I'm looking for... wouldn't that work?

All the best!

The /2 option does not work if the .* has already matched too much. For example

echo "name something name something" | sed -n 's/.*name/XXXX/p'
XXXX something
echo "name something name something" | sed -n 's/.*name/XXXX/2p'

There is no 2nd match.
But it does work without the .*

echo "name something name something" | sed -n 's/name/XXXX/p'
XXXX something name something
echo "name something name something" | sed -n 's/name/XXXX/2p'
name something XXXX something

The force flows strong in me, LOL!

Actually you came very close. What you didn't get was the part i left out in my little introduction, so here is part two:

Grouping
To combine several characters or metacharacters into a single expression which you can handle together there is grouping: it works like grouping in mathematical expressions:

(x+y+z) * 3 =

The * 3 affects all that is inside the brackets as a single entity. The same works for regular expressions, just that the brackets are "escaped" (you put a backslash in front of them, otherwise they would be simple characters) and you can do really cool things with it:

/\(aa\)*/

Because the asterisk now addresses what is inside the brackets this matches any even number of a's (zero, two, four, ...), but not an odd number. Try the following file:

xy
xay
xaay
xaaay
xaaaay
xaaaaay
xaaaaaay

and apply this sed-command to it. Watch the output:

sed -n '/x\(aa\)*y/p' /your/file

Always keep in mind, btw.: i told you regexps are greedy in nature ("greedy" is really the term for it. The opposite is "non-greedy [matching]". More often than not, if regexps do not do what you expect them to do, this is the problem - they are matching more than you expect them to match.) This means e.g. that /\(aa\)*/ on its own would also match a line with 3 a's - it would match the 2 a's and just ignore the third one -> false positives, i warned you!
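
A sketch of that experiment, with the test file created on the fly (the temporary file is only for illustration). The second command shows the unanchored group on its own: since "zero repetitions" is allowed too, it matches every single line:

```shell
file=$(mktemp)
printf '%s\n' xy xay xaay xaaay xaaaay xaaaaay xaaaaaay > "$file"

# x, then complete pairs of a, then y: only even counts (including zero) get through.
even=$(sed -n '/x\(aa\)*y/p' "$file")
printf '%s\n' "$even"

# Nothing pins the group down, and zero pairs is allowed, so every line matches.
all=$(sed -n '/\(aa\)*/p' "$file" | wc -l | tr -d ' ')
echo "$all lines matched"
rm -f "$file"
```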

Grouping also has another use: you can use it for so-called backreferences. Backreferences are parts of the matched line which you can use in a substitution command to put the matched part back into the substituted portion.

The most basic backreference is the & , but let us first examine the "s"-command of sed :

sed 's/<regexp1>/<regexp2>/'

This will scan the text (line by line) and try to match <regexp1> . Whenever it does, it substitutes <regexp2> for the matched part (despite the placeholder name, <regexp2> is plain replacement text, not a pattern), then the line is shipped to output.

"&" now can be used in <regexp2> to put there everything regexp1 has matched. Let's try something very simple: the regexp to match everything in a line is /^.*$/ . We want to output all the input but put => and <= around every line. Here it is:

sed 's/^.*$/=> & <=/' /some/file

Cool, no?

Another form of backreference is "\n" where "n" is a number: 1, 2, 3, ... It will signify the portion of the <regexp1> , which is surrounded by the first (second, third, ...) pair of brackets. Suppose the input file from above with the "xa*y"-lines. Suppose we would want to exchange the first and last characters (and suppose they weren't fixed "x"s and "y"s). Here it is:

sed 's/^\(.\)\(a*\)\(.\)$/\3\2\1/' /path/to/file

We use the grouping here only to fill our various backreferences: first, we split the input into three parts: ^\(.\) (beginning of line, followed by a single character), \(a*\) (any number of a's) and \(.\)$ (again a single character, followed by the line end). In the substitution part we put them together reversed, first the third part, then the second one (the a's), then the former first part.
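
Both kinds of backreference can be tried out on a few made-up lines (the input words are only test data):

```shell
# \1, \2, \3: swap the first and last character, keep the a's in the middle.
swapped=$(printf '%s\n' xy xaay 1aaa2 | sed 's/^\(.\)\(a*\)\(.\)$/\3\2\1/')
printf '%s\n' "$swapped"

# &: put the whole matched line back, with arrows around it.
wrapped=$(printf '%s\n' 'some line' | sed 's/^.*$/=> & <=/')
printf '%s\n' "$wrapped"
```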

Most of the original sed -script should be clear by now, but we need to establish a few more things for the last bits:

When you write a substitute-command like above it is implied that it should be applied to every line. In fact, sed works like this:

  • read the first/next line of input and put it into the so-called "pattern space"
  • apply the first command of the script to this pattern space, it might change it (or not)
  • apply the next command of the script to the changed pattern space, changing it further (or not)
  • and so on, until the last command. If sed was started without the "-n" option print the pattern space now to output
  • if this was not the last line of input, go to the start again and repeat
  • if it was the last line, exit.

Ranges
Coming back to the substitute-commands: in their simplest form they are applied to every line. Here is some input file:

old
= old1
== Start ==
= old2
old3
== End ==
old4
= old5

The following will change all the "old" strings to "NEW":

sed 's/old/NEW/' /path/to/file
NEW
= NEW1
== Start ==
= NEW2
NEW3
== End ==
NEW4
= NEW5

But we could limit this command to only take place on lines starting with a "=":

sed '/^=/ s/old/NEW/' /path/to/file
old
= NEW1
== Start ==
= NEW2
old3
== End ==
old4
= NEW5

The first regexp /^=/ works like an "if"-statement: if the line (or something in it) matches the expression, then the substitute-command is applied, otherwise not.

There is also another form, where you can define a range of lines where the following command(s) are applied:

sed '/^== Start.*$/,/^== End.*$/ s/old/NEW/' /path/to/file
old
= old1
== Start ==
= NEW2
NEW3
== End ==
old4
= old5

Instead of regexps you can also use line numbers. This will apply the substitute-command only on lines 1,2 and 3:

sed '1,3 s/old/NEW/' /path/to/file

Was that all? No! One last thing: modifiers. Per default a substitute-command only changes the FIRST occurrence of a pattern:

$ echo "old old old" | sed 's/old/NEW/'
NEW old old

If you add a number at the end, it selects which matching occurrence is changed. If you add a "g" (global), all occurrences are changed:

$ echo "old old old" | sed 's/old/NEW/'
NEW old old

$ echo "old old old" | sed 's/old/NEW/2'
old NEW old

$ echo "old old old" | sed 's/old/NEW/g'
NEW NEW NEW

Finally, there is one more modifier: "p". It prints the result of the substitution to the output. So far we have only had scripts consisting of only one command so that hasn't affected us but look above how sed works: what a command gets is basically what the command before has produced:

echo "white white white" | sed 's/white/blue/g
                                s/blue/green/g
                                s/green/red/g'
red red red

The second command would do nothing if it got the input text without the first command having processed it already, and the same goes for the third command. But suppose you want to have the intermediary steps displayed: you can use the p-modifier for that (note that for the last command no "p" is needed, because the pattern space is printed automatically at the end):

echo "white white white" | sed 's/white/blue/gp
                                s/blue/green/gp
                                s/green/red/g'
blue blue blue
green green green
red red red

The p-modifier comes in especially handy when you switch off the automatically implied printing at the end with the "-n" switch for sed : This way you do not need to filter out lines you do not want, you just explicitly print the ones you are interested in - the technique we used to filter out all the uninteresting lines in your text.

OK, was that all? No, not even close! sed is such a mighty tool i still am finding new ways to use it every day.

But - hey, in for a penny, in for a pound - here is a last one: you can use the ranges i talked about above and apply more than one command to them by using curly braces:

sed '/<regex1>/,/<regex2>/ {
                 s/<regex3>/<regex4>/
                 s/<regex5>/<regex6>/
                 s/<regex7>/<regex8>/
             }' /path/to/file

Now, the three substitutions will only be applied to a range of lines starting with <regex1> and ending with <regex2>. You can also negate/invert that:

sed '/<regex1>/,/<regex2>/ ! {
                 s/<regex3>/<regex4>/
                 s/<regex5>/<regex6>/
                 s/<regex7>/<regex8>/
             }' /path/to/file

This applies the three substitutions to all lines except for the range of lines starting with <regex1> and ending with <regex2>. Of course the same goes for the other forms of range specifications i showed you above.
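
Here is a concrete sketch of both forms, using the old/Start/End file from the "Ranges" part. The two substitutions inside the braces are made up for illustration:

```shell
file=$(mktemp)
printf '%s\n' 'old' '= old1' '== Start ==' '= old2' 'old3' '== End ==' 'old4' '= old5' > "$file"

# Both substitutions are applied only between the Start and End lines.
inside=$(sed '/^== Start/,/^== End/{
    s/old/NEW/
    s/^= /- /
}' "$file")
printf '%s\n' "$inside"

# With "!" the same two substitutions are applied everywhere EXCEPT in that range.
outside=$(sed '/^== Start/,/^== End/!{
    s/old/NEW/
    s/^= /- /
}' "$file")
printf '%s\n' "$outside"
rm -f "$file"
```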

I hope this helps.

bakunin


Hey Bakunin!

Thanks for the followup on your tuto! Again, I know it takes a lot of your time to write everything down so thank you very very much for that!

I tried out almost all of your explanations (except for the last multicommand part)!

The portion on sed greediness:

As far as I understand: the more precisely I describe to the command what I'm looking for, the better I will be able to extract what I really want.
As you said: I tried \(aa\)* alone on your text and indeed I got more things than I really wished for:

sed -n '/\(aa\)*/p' sedgroupingtest.txt 
xy
xay
xaay
xaaay
xaaaay
xaaaaay
xaaaaaay

However, I'm not sure I understand why both xy and xay were matched as well. From what I understood, \(aa\) looks for at least 2 "a"s in each line, doesn't it?
I also found a way to get what I was looking for, i.e. each line that has at least 2 "a"s:

ardzii@debian:~$ sed -n '/\(aa\).*/p' sedgroupingtest.txt 
xaay
xaaay
xaaaay
xaaaaay
xaaaaaay

The "selection" portion was particularly interesting:

If I read it correctly and with my sed knowledge now, it goes:

the portion of text that is located in between the lines that start with "== Start + anything else to the end of the line ($)" and "== End + anything else to the end of the line ($)"

Now, why doesn't my command work?
I've got a text file (that I personally called "examplesed.txt") which contains:

<div id="category_listing" itemscope itemtype="http://data-vocabulary.org/Product">
        
        <div id="category_bg">
        <div class="title">
            <h1 itemprop='name'>For Sale <span itemprop='brand'>HITACHI </span> <span itemprop='name'>AIRIS 1  Magnet</span></h1></div>
            <meta itemprop="category" content="Business & Industrial>Medical Medical Equipment" />
        <!-- end div title -->
                <div class="listing_num">LISTING #2229540</div>
           </div> 
        <div style='border-bottom: dotted 1px #666' class="clr"></div>
        <div id="category_listing_body">
            
<div id="list_detail">  

Now it seems that sed doesn't find for some reason the line I'm looking for:

> sed -n '/^<h1 itemprop='name'>For Sale.*$/p' examplesed.txt
> 

so obviously when I try to do:

sed -n '/^<h1 itemprop='name'>For Sale.*$/ s/^.*itemprop='brand'>\([^<]*\).*/\1/p' examplesed.txt

The same happens: ie. NOTHING Hahahaha!

Why doesn't sed find this line correctly?
I thought that maybe the command was treating the tabs in front of "<h1 itemprop='name'>For Sale" as a bunch of spaces, and therefore I tried:

sed -n '/.*<h1 itemprop='name'>For Sale.*/p' examplesed.txt

But still nothing...

Thanks for your much appreciated help yall!

Best!

ardzii

Your first pattern doesn't match for two reasons, of which you found and (roughly) eliminated one (congrats!): as the pattern is anchored at begin-of-line with the ^, you need to allow for the leading white space in front of the <h1 sub-pattern. While you matched any character with .*, which would also allow matches further towards the line end, an exact match with a character class like [[:blank:]]* should be preferred, as it matches spaces and <TAB>s only.
The other reason your pattern fails is quoting. The first sed parameter, the script, is enclosed in single quotes, so the quotes around 'name' unquote and requote the parameter, effectively removing the quotes from the string. Either allow for one wildcard character on each side (".name.") if you're sure no other patterns will match, or use double quotes (with possibly other side effects on the parameter) around the script including the pattern. Like:

sed -n "/^[[:blank:]]*<h1 itemprop='name'>For Sale.*$/p" file
            <h1 itemprop='name'>For Sale <span itemprop='brand'>HITACHI </span> <span itemprop='name'>AIRIS 1  Magnet</span></h1></div>
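
To see the quoting effect in isolation, here is a minimal sketch (the variable names line and brand are only for illustration; the sample line is copied from the posted file). printf shows what the shell actually hands over as the script, and the double-quoted form then extracts the brand:

```shell
# Single quotes inside a single-quoted string end and reopen the quoting,
# so the inner quotes never reach sed.
printf '%s\n' '/<h1 itemprop='name'>/p'      # prints: /<h1 itemprop=name>/p
printf '%s\n' "/<h1 itemprop='name'>/p"      # prints: /<h1 itemprop='name'>/p

# Sample line taken from the posted file (leading blanks included).
line="            <h1 itemprop='name'>For Sale <span itemprop='brand'>HITACHI </span> <span itemprop='name'>AIRIS 1  Magnet</span></h1></div>"

# Double-quoted script: the inner single quotes stay literal.
brand=$(printf '%s\n' "$line" |
    sed -n "/^[[:blank:]]*<h1 itemprop='name'>For Sale/ s/^.*itemprop='brand'>\([^<]*\).*/\1/p")

printf '[%s]\n' "$brand"                     # prints: [HITACHI ]
```

Note the trailing blank after HITACHI: it is part of the text between the tags, so the capture group keeps it.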

Great!! Thanks RudiC!!

Yes - and no. Yes, the better you define what you want, the better results you will get. No, this has nothing to do with greediness. Greediness is the fact that if there are several possible matches for a certain regexp, always the LONGEST POSSIBLE one will be used.

In a regexp like /xa*y/ the a* will match all a's there are, regardless of how many there are. This is sometimes a desired effect and sometimes not. Here is an example for when it is not desired. Consider this text:

<tag>bla foo</tag> <othertag>more text</othertag>
<newtag>happy text</newtag> <moretag>just to fill in</moretag>

The task is to remove all the tags and just leave the text. The end result is like this:

bla foo more text
happy text just to fill in

Let's see: a "tag" is basically: a "<", followed by text, followed by ">". Hold on, there is an optional "/" after the opening "<" for the ending tag, but that is it, yes? OK, this regexp will match that (the slash ("/") has to be escaped here, so that it is not confused with the "/" delimiting the regexp):

/<\/*.*>/

OK? Now let us try a simple sed-command. We will - for testing purposes - not delete the tags but overwrite them with "BLOB" to make sure we got everything right:

sed 's/<\/*.*>/BLOB/g' /path/to/file

That did really work well, did it? :wink:

Question: why was each of the two lines changed to a single "BLOB"? Answer: because of the greediness of regexps! What is the longest possible match for <\/*.*> in the first line?

The "<" matches the "<" at the beginning of the line.
The "\/*" matches nothing, but it is optional, so that doesn't matter.
The ".*" matches everything, up to the penultimate character of the line. This is the longest possible match and the problem.
And the ">" matches - again, longest possible - the last ">" in the line, which happens to be at the line's end.

Solution? Instead of ".", which matches everything, match only non-">" characters with a negated character-class:

sed 's/<\/*[^>]*>/BLOB/g' /path/to/file

Now, on encountering the first ">", the character-class "[^>]" (everything except ">") will not cover it, and therefore the longest possible match ends at the first ">", not the last one.
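
Both commands can be tried on the two sample lines without a file. A minimal sketch (the variable names text, greedy and fixed are only for illustration; the second command deletes the tags instead of writing "BLOB"):

```shell
# The two sample lines from above, held in a variable instead of a file.
text='<tag>bla foo</tag> <othertag>more text</othertag>
<newtag>happy text</newtag> <moretag>just to fill in</moretag>'

# Greedy version: ".*" runs to the last ">" of each line.
greedy=$(printf '%s\n' "$text" | sed 's/<\/*.*>/BLOB/g')

# Negated character class: each match stops at the first ">".
fixed=$(printf '%s\n' "$text" | sed 's/<\/*[^>]*>//g')

printf '%s\n' "$greedy"
# BLOB
# BLOB

printf '%s\n' "$fixed"
# bla foo more text
# happy text just to fill in
```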

No. As I said at the beginning, "*" means "zero or more of what is before". Before it is the group of two a's, hence the string "aa". This string, zero times, is? ;-))

In fact, the regexp would match absolutely everything, because it effectively matches the empty string.

If you want to match at least one instance of something, you write it two times and make one optional:

/x\(aa\)*y/            # any even number of a's, including 0
/xaa\(aa\)*y/          # any even number of a's, starting with 2
/xaa*y/                # any number of a's but at least one
/xa*y/                 # any number of a's, even none at all
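
The four patterns can be compared against the test lines from earlier in the thread. A minimal sketch, assuming the test file holds exactly the seven lines xy to xaaaaaay (the variable names are only for illustration):

```shell
# Test lines: "x", then 0 to 6 a's, then "y".
lines='xy
xay
xaay
xaaay
xaaaay
xaaaaay
xaaaaaay'

even=$(printf '%s\n' "$lines" | sed -n '/x\(aa\)*y/p')        # even count, 0 allowed
even2=$(printf '%s\n' "$lines" | sed -n '/xaa\(aa\)*y/p')     # even count, at least 2
one_plus=$(printf '%s\n' "$lines" | sed -n '/xaa*y/p')        # at least one a
any=$(printf '%s\n' "$lines" | sed -n '/xa*y/p')              # any count, even none

printf '%s\n' "$even"
# xy
# xaay
# xaaaay
# xaaaaaay

printf '%s\n' "$even2"
# xaay
# xaaaay
# xaaaaaay
```

The third pattern prints every line except xy, and the fourth prints all seven.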

Yes, but the reason why this worked is not what you probably believe it to be: you search for 2 a's in a row (grouped, but you could leave out the grouping here, it serves no purpose), followed by any number ("*") of any character ("."). You could have left out the .* and gotten the same result.
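
Both points can be checked directly. A minimal sketch, again assuming the seven test lines (variable names are only for illustration): the bare \(aa\)* prints every line because it also matches the empty string, and the trailing .* changes nothing compared to a plain /aa/.

```shell
lines='xy
xay
xaay
xaaay
xaaaay
xaaaaay
xaaaaaay'

# "\(aa\)*" matches zero repetitions too, so no line is filtered out.
all=$(printf '%s\n' "$lines" | sed -n '/\(aa\)*/p')

# With and without the trailing ".*": the selected lines are identical.
with_dotstar=$(printf '%s\n' "$lines" | sed -n '/\(aa\).*/p')
plain=$(printf '%s\n' "$lines" | sed -n '/aa/p')

[ "$all" = "$lines" ] && echo "bare \\(aa\\)* selects every line"
[ "$with_dotstar" = "$plain" ] && echo "the .* makes no difference"
```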

I hope this helps.

bakunin

PS: if you are discouraged now and think "I'll never get that damn thing into my head" - don't be! It took all of us weeks and months to bend our brains hard enough to finally get them around thinking in sed terms. That you don't get it in days is, in fact, expected. Just keep trying and you will soon be able to finish my little tutorial for the next newbie for me.


Hey guys!

I wanted to let you know that I was able to finish up my script. It gives less information than I really need, but I'm amazed at what I was able to do all by myself. :smiley:
Let's be realistic though: I couldn't have done it without you!:b::b::b: I learned so much this is crazy... from 0 to not the best but something at least! :cool::cool::cool:

So I guess: THANK YOU for your patience, your support and all the time you invested in showing me the way!!!

Obviously you can use the script. All you have to do is get a proper link: So that you can get an example you can use:
https://www.dotmed.com/equipment/2/26/2974/all
with no more than 116 scroll (figure as of 8.2.17) and, as it says in the script, you will need to create a dir called "DotMedListings" in:
~/

I guess that you will find it sort of messy but it works for now and it's a good basis! :p:p
I'm definitely open to your comments and suggestions, as you can imagine! (for instance my progress is not very friendly) :b::wink:

here goes the script:

#!/bin/bash
#
#
#
# For this script to work, you will first need to create a DotMedListings dir in your /home/XXXX/ directory.
#
#
#

declare link=""                #Will store the link for each iteration
declare linkStart=""            #Defines the type of equipment to crawl. To be found in Find Listings For Sale or Wanted On DOTmed.com
declare brand=""            #output
declare price=""            #output
declare currency=""            #output
declare condition=""            #output
declare dateListing=""            #output
declare country=""            #output
declare title=""            #output
declare description=""            #output
declare equipment=""            #output
declare yom=""                #output
declare -i totalCrawl=1            #Variable to define the scope of the crawl (total number of listing to crawl)
declare fileNameBase=""            #Used for the name of the output file via curl: Corresponds to the name of the equipment
declare fileName=""            #Definitive name of the output file: dateCrawl + fileNameBase
declare dateCrawl=$(date +"%d-%m-%y")    #Date of the crawl used for the name
declare -i offset=0            #Base iteration of the offset. Gets +1 after each iteration
declare -i firstIndex=1            #index for the while - Gets +1 after each iteration but starts on 1 instead of 0 (for the offset).
declare nameToHome=$(cd ~ ; pwd | sed -n 's/.home.\([^\/]*\).*/\1/p')    #name for the path of the file to search if already created

echo
echo 
echo "************* Give the link to the equipment type, from https:// to /all included (last '/' excluded): *************"
read linkStart
echo
echo "************* Now, the total number of listings for the equipment: *************"
read totalCrawl
echo 
echo

#
# Naming the output file
#

fileNameBase=$(curl -s "$linkStart/offset/$offset/all?key=&limit=1&price_sort=descending&cond=all&continent_filter=0&zip=&distance=5&att_1=0&att_row_num=1&additionalkeywords=&country=ES" | sed -n '/^[[:blank:]]*<li itemscope itemtype="http:..data-vocabulary.org.Breadcrumb"><span itemprop="title">.*$/ s/^.*itemprop="title">\([^<]*\).*$/\1/p')

fileName=$dateCrawl"-"$fileNameBase".csv"

#
# Looking if it already exists
#

if test -f "/home/$nameToHome/DotMedListings/$fileName"
then
    echo
    echo "************* WARNING ************* WARNING *************"
    echo "************* You already crawled that today! *************"
    echo "************* Delete the file or try another *************"
    echo "************* WARNING ************* WARNING *************"
    echo
#
# If not, starting the script
#

else

    echo
    echo
    echo
    echo "************* You will find your result in ~/DotMedListings/$fileName *************"
    echo
    echo
    echo

    echo "brand;equipment;title;description;price;currency;condition;dateListing;country;YoM" >> ~/DotMedListings/"$fileName"    #defining each category for the crawl.


    while [ $firstIndex -le $totalCrawl ]        #Starting the crawling loop
    do

        awk -v t1="$firstIndex" -v t2="$totalCrawl" 'BEGIN{print (t1/t2) * 100}'        # Prints the percentage advancement instead of having the curl info.

        link=$(curl -s "$linkStart/offset/$offset/all?key=&limit=1&price_sort=descending&cond=all&continent_filter=0&zip=&distance=5&att_1=0&att_row_num=1&additionalkeywords=&country=ES" | sed -n '/href.*view more/ s/.*href="\([^"]*\).*/\1/p')    # Get the corresponding listing. If it's the first iteration then it will get the first listing for the equipment.

        curl -s "https://www.dotmed.com$link" -o ~/curl"$totalCrawl".xml        #Saves one curl for the first listing to avoid various curls for the same listing

#
# Getting the info out of the curl
#

        brand=$(sed -n "/^[[:blank:]]*<h1 itemprop='name'.*/ s/^.*itemprop='brand'>\([^<]*\).*/\1/p" ~/curl"$totalCrawl".xml)
        equipment=$(sed -n '/^[[:blank:]]*<meta property="og:url".*$/ s/.*"https:\/\/www.dotmed.com\/listing\/\([^\/]*\).*/\1/p' ~/curl"$totalCrawl".xml)
        price=$(sed -n "/^[[:blank:]]*<ul><li class=.left.>Price.*$/ s/^.*amount=\([^&]*\).*/\1/p" ~/curl"$totalCrawl".xml)
        currency=$(sed -n '/^[[:blank:]]*<ul><li class=.left.>Price.*$/ s/^.*currency_from=\([^"]*\).*/\1/p' ~/curl"$totalCrawl".xml)
        condition=$(sed -n '/^[[:blank:]]*<ul><li class="left">Condition:.*$/ s/^.*content=.used.>\([^<]*\).*/\1/p' ~/curl"$totalCrawl".xml)
        dateListing=$(sed -n '/^[[:blank:]]*<ul><li class="left">Date updated.*$/ s/^.*id="date_updated">\([^<]*\).*/\1/p' ~/curl"$totalCrawl".xml)
        country=$(sed -n "/^[[:blank:]]*<p class=.nation.>.*$/ s/^.*'This listing comes from \([^']*\).*/\1/p" ~/curl"$totalCrawl".xml)
        title=$(sed -n '/^[[:blank:]]*<meta property="og:title".*$/ s/.*content="\([^-]*\).*/\1/p' ~/curl"$totalCrawl".xml)
        description=$(sed -n '/^[[:blank:]]*<meta property="og:description".*$/ s/.*content="\([^-]*\).*/\1/p' ~/curl"$totalCrawl".xml)
        yom=$(sed -n '/^.*Specifications: Year of Manufacture.*$/ s/^.*Specifications: Year of Manufacture,\([^,]*\).*/\1/p' ~/curl"$totalCrawl".xml)

# 
# Sending the info to the output file
#


        echo $brand";"$equipment";"$title";"$description";"$price";"$currency";"$condition";"$dateListing";"$country";"$yom >> ~/DotMedListings/"$fileName"

        rm ~/curl"$totalCrawl".xml    # Deleting the curl file to leave space for the next iteration. FYI, I named the curl file after the number of crawls to be done so that the script can be launched several times simultaneously to crawl various equipment types at a time.

#
# Resetting for the next iteration.
#

        link=""    
        brand=""
        price=""
        currency=""
        condition=""
        dateListing=""
        country=""
        title=""    
        description=""
        equipment=""
        yom=""

        (( firstIndex++ ))
        (( offset++ ))
    done

    echo
    echo
    echo 
    echo "************* Done! Again, you will find the result in ~/DotMedListings/$fileName *************"
    echo
    echo
    echo
fi

Thanks again to you guys!

All the best!