Extract text starting and ending with specific patterns from one line of an HTML file to a text file

Hi
This is my first post and I'm just a beginner. So please be nice to me.

I have a couple of HTML files in which a string beginning with "http://www.site.com" and ending with "/resource[number].dat" appears on line 241 of each file. How do I extract it to a new text file?
I have tried:

sed -n 241,241p sample.htm >>store.txt

But it results in the complete line getting extracted.

Please help.

Try:

sed -n '241s|.*\(http://www.site.com.*resource\[[0-9]*\]\).*|\1|p' file

This assumes that the numbers are enclosed in square brackets. If not, you can leave out \[ and \].
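
For illustration, here is the same substitution run against a made-up file. The filler lines, the sample URL and the temporary file name are all assumptions, not the poster's data; only the sed expression is taken from above.

```shell
# Build a hypothetical 241-line file; only line 241 holds the sample URL.
tmp=$(mktemp /tmp/sample.XXXXXX)
i=1
while [ "$i" -le 240 ]; do
    echo "filler $i"
    i=$((i + 1))
done > "$tmp"
echo 'junk before http://www.site.com/a/b/resource[12].dat junk after' >> "$tmp"

# On line 241 only, replace the whole line with the captured group and print it.
result=$(sed -n '241s|.*\(http://www.site.com.*resource\[[0-9]*\]\).*|\1|p' "$tmp")
echo "$result"
rm -f "$tmp"
```

Note that the capture stops at the closing bracket, so the ".dat" suffix is dropped; adding \.dat inside the group should keep it.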

That did not work. Maybe my problem needs more description. There's a lot of text before and after the pattern. And the pattern itself repeats on other lines. I need to limit the processing to the 241st line and extract text starting and ending with the specified patterns. The 241st line always begins with <div class="video_container" data-errors="/videos/1149/view_errors" data-fallback-file="http://www.site.com

Can you post a sample of the 241st line?

<div class="video_container" data-errors="/videos/1149/view_errors" data-fallback-file="http://s3.amazonaws.com/learnmath-app.production/0bbbe7abce1523ee6b4e82b115d0b6fa67331779-video-1149/web.mp4" data-file="http://dxdo2x6i0oxgk.cloudfront.net/0bbbe7abce1523ee6b4e82b115d0b6fa67331779-video-1149/web.mp4" data-poster="http://dxdo2x6i0oxgk.cloudfront.net/0bbbe7abce1523ee6b4e82b115d0b6fa67331779-video-1149/web-default-33.png" data-views="/videos/1149/views"><p id="video_tag_29315">Loading the player...</p><div class="captionbox"><div class="captionbox-body"><span class="phrase" data-begin="1.84" data-end="9.04">

There's more text, actually, but it's too long. This is the relevant part. The rest is just formatted HTML text.

It's all relevant if you expect us to process and extract from it. We can write you a thousand programs that don't work given incomplete input. "line" 241 might not even be "line" 241 depending on how you are viewing the HTML.
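
A rough way to check which line really holds the attribute, rather than trusting how a viewer displays the page, is to let grep report the line number. The three-line file below is a hypothetical stand-in for the real page.

```shell
# Hypothetical three-line page; the attribute of interest sits on line 2.
tmp=$(mktemp /tmp/page.XXXXXX)
printf '%s\n' \
    '<html>' \
    '<div data-fallback-file="http://example.com/web.mp4">' \
    '</html>' > "$tmp"

# grep -n prefixes each matching line with "N:"; cut keeps only the number.
lineno=$(grep -n 'data-fallback-file=' "$tmp" | cut -d: -f1)
echo "$lineno"
rm -f "$tmp"
```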

It would also help to know your system. uname -a if you don't know.

<div class="video_container" data-errors="/videos/1149/view_errors" data-fallback-file="http://s3.amazonaws.com/learnmath-app.production/0bbbe7abce1523ee6b4e82b115d0b6fa67331779-video-1149/web.mp4" data-file="http://dxdo2x6i0oxgk.cloudfront.net/0bbbe7abce1523ee6b4e82b115d0b6fa67331779-video-1149/web.mp4" data-poster="http://dxdo2x6i0oxgk.cloudfront.net/0bbbe7abce1523ee6b4e82b115d0b6fa67331779-video-1149/web-default-33.png" data-views="/videos/1149/views"><p id="video_tag_29315">Loading the player...</p><div class="captionbox"><div class="captionbox-body"><span class="phrase" data-begin="1.84" data-end="9.04">Arithmetic and fractions, positive and negative numbers. In this video, we will discuss how</span> <span class="phrase" data-begin="7.28" data-end="11.8">to add and subtract positive and negative numbers. Now this a very basic topic.</span> <span class="phrase" data-begin="11.8" data-end="17.45">Again, if you are proficient in this topic, do not feel compelled to watch this entire video.</span> <br/><br/><span class="phrase" data-begin="18.73" data-end="25.67">This is a video designed to get people comfortable with this topic if it is unfamiliar or they have some uncertainty with the topic.</span> <span class="phrase" data-begin="27.34" data-end="32.43">So wherever you are starting, I will assume that you are proficient with two basic cases, how to add</span> <span class="phrase" data-begin="32.43" data-end="39.53">two positive integers or how to subtract two positive integers when they are in the form (bigger) minus (smaller).</span> <br/><br/><span class="phrase" data-begin="39.53" data-end="48.87">First thing I'll say is if you need practice with this, practice every day. It's very important to be proficient in two digit addition and subtraction.</span> <span class="phrase" data-begin="48.87" data-end="57.86">Be able to do that as mental math that will make the test much smoother. 
Now the good news is, if you can do these two things, you can do anything else.</span> <span class="phrase" data-begin="57.86" data-end="66.71">This entire topic is very easy if you know these two things. There are many ways to discuss this material.</span> <br/><br/><span class="phrase" data-begin="66.71" data-end="71.38">Let's begin with subtraction. Some mathematicians would say subtraction doesn't really exist.</span> <span class="phrase" data-begin="71.38" data-end="77.28999999999999">What does that mean? Well, subtraction of any number can be rewritten as</span> <span class="phrase" data-begin="77.28999999999999" data-end="86.11">the addition of a number of the opposite sign. And so, some mathematician would say that this addition is actually the true form.</span> <br/><br/><span class="phrase" data-begin="86.11" data-end="90.2">So, let's make sure we understand this. Subtraction of any number can be rewritten as</span> <span class="phrase" data-begin="90.2" data-end="96.16">the addition of a number of the opposite sign. 
Here are four different instances of subtraction.</span> <span class="phrase" data-begin="96.16" data-end="104.47999999999999">We have a positive minus a positive, a negative minus a positive, a positive minus a negative, and a negative minus a negative.</span> <br/><br/><span class="phrase" data-begin="104.47999999999999" data-end="111.74000000000001">In all four cases we could rewrite that subtraction as addition of a number of the opposite sign.</span> <span class="phrase" data-begin="111.74000000000001" data-end="116.27000000000001">In the cases where we are subtracting a positive that's the same as adding a negative</span> <span class="phrase" data-begin="116.27000000000001" data-end="121.98">in the cases where we are subtracting a negative that's the same as adding a positive.</span> <br/><br/><span class="phrase" data-begin="121.98" data-end="127.06">Well, notice that we get some simplification, but it's not a simplification in every case.</span> <span class="phrase" data-begin="127.06" data-end="133.5">For example in the first one, it looks like we are better off where we started, we are better off without changing it to addition.</span> <span class="phrase" data-begin="133.5" data-end="140.96">In the third one, it looks like we clearly made things better off by changing it to the addition of two positive numbers.</span> <br/><br/><span class="phrase" data-begin="140.96" data-end="147.5">In that fourth one, notice that now its addition, its commutative so we can switch the order around</span> <span class="phrase" data-begin="147.5" data-end="153.19">and when we switch the order around, we can rewrite it as subtraction and that is much easier.</span> <span class="phrase" data-begin="153.19" data-end="162.66">So sometimes this is really an important move for simplification but not always. 
You don't always have to rewrite subtraction as addition but it</span> <span class="phrase" data-begin="162.66" data-end="166.06">can be a very good simplifying trick to have up your sleeve.</span> <br/><br/><span class="phrase" data-begin="168.67000000000002" data-end="173.535">Notice in particular for the case positive minus negative, this trick will always simplify it.</span> <span class="phrase" data-begin="173.535" data-end="179.62">It will always become positive plus positive, which is one of those fundamental things that I assume you know how to do already.</span> <span class="phrase" data-begin="181.96" data-end="189.29">Now let's look at that tricky double negative case, which could appear in the form negative minus positive, or in the form negative plus negative.</span> <br/><br/><span class="phrase" data-begin="191.17" data-end="196.789">The big idea is we can always factor out a negative sign. Now what does this mean exactly?</span> <span class="phrase" data-begin="198.19" data-end="201.75">Let's look at that first one. Negative 46 minus 37.</span> <span class="phrase" data-begin="201.75" data-end="208.21">We can factor out a negative sign, if we factor out a negative sign everything inside becomes positive.</span> <br/><br/><span class="phrase" data-begin="208.21" data-end="213.25">So, it just becomes 46 plus 37. Addition of two positive numbers.</span> <span class="phrase" data-begin="213.25" data-end="220.39">So you perform that addition and then just stick a negative in front of the sum. You might want to try these others on the page.</span> <span class="phrase" data-begin="220.39" data-end="224.55">Pause the video here. Try these others and then you can compare your answers to mine.</span> <br/><br/><span class="phrase" data-begin="227.11" data-end="237.25">Here are the answers. 
One other case folks find tricky is the case small positive minus</span> <span class="phrase" data-begin="237.25" data-end="247.02">big positive, which also shows up as small positive plus big negative. Here, the big idea is factoring out</span> <span class="phrase" data-begin="247.02" data-end="257.94">a negative sign reverses the order of subtraction. So what is this mean exactly, suppose I have 23 minus 64 by factor out a negative</span> <span class="phrase" data-begin="257.94" data-end="263.231">sign then what I get is subtraction in the reverse order, 64 minus 23.</span> <br/><br/><span class="phrase" data-begin="263.231" data-end="268.11">Well, now that's bigger minus smaller that we can do that's one of the fundamental skills.</span> <span class="phrase" data-begin="268.11" data-end="279.3">So do that subtraction, and then just stick a negative sign in front of it. Let's look at another one, 26 minus 63, factor out the negative, and</span> <span class="phrase" data-begin="279.3" data-end="289.92">we get a negative in front of the reversed order of subtraction, 63 minus 26. Perform the subtraction and stick a negative sign in front of it.</span> <br/><br/><span class="phrase" data-begin="289.92" data-end="295.03">Here's some more. You might wanna pause the video here and practice these on your own.</span> <span class="phrase" data-begin="297.73" data-end="303.59">Here are the answers I get. Very important that you are able to do things like this, and it's very</span> <span class="phrase" data-begin="303.59" data-end="312.56">good practice for mental math to be able to do this in your head. 
These ideas allow you to change any addition or subtraction to either</span> <span class="phrase" data-begin="312.56" data-end="316.96">the sum of two positives or the difference of larger minus smaller.</span> <br/><br/><span class="phrase" data-begin="316.96" data-end="323.63">Here, I just discussed integers for simplicity, but all these same ideas would also be true for positive and negative decimals and fractions.</span> <span class="phrase" data-begin="325.08" data-end="333.82">The core skills are addition of two positives or larger positive minus smaller negative, smaller positive.</span> <span class="phrase" data-begin="335.18" data-end="341.25">From here if we're doing positive minus negative, we can change that to positive plus positive.</span> <br/><br/><span class="phrase" data-begin="341.25" data-end="349.36">If we have the double negative case, we can factor out a negative sign. And whenever we have smaller minus bigger,</span> <span class="phrase" data-begin="349.36" data-end="354.18">we can factor out the negative sign, reverse the order of subtraction and then what's inside</span> <span class="phrase" data-begin="354.18" data-end="356.12">bigger minus smaller that's something we can do.</span> <br/><br/></div><div class="btn-group"><a class="btn btn-small">Show Transcript</a></div></div></div>

That's the complete line.

And the rest of the HTML? Attach it as a file perhaps.

The rest of the file is attached here.
Btw, I'm using OS X 10.8

'resource' is only found in the HTML twice as <a href="/resources"> . What did you actually want?

Sorry! :frowning:
I'm new to Unix and I didn't know such things mattered. I was just using "/resources" as an example.
Actually, what I want is just to extract "http://s3.amazonaws.com/learnmath-app.production/0bbbe7abce1523ee6b4e82b115d0b6fa67331779-video-1149/web.mp4"
and the equivalent URL in the other HTML files.

So you want the fallback-file, not the data-file? OK.

$ cat xml.awk

BEGIN {
        FS=">"; OFS=">";
        RS="<"; ORS="<"
}

# These should be special variables for match() but aren't.
function rbefore(STR)   { return(substr(STR, N, RSTART-1)); }# before match
function rmid(STR)      { return(substr(STR, RSTART, 1)); }  # First char match
function rall(STR)      { return(substr(STR, RSTART, RLENGTH)); }# Entire match
function rafter(STR)    { return(substr(STR, RSTART+RLENGTH)); }# after match

function aquote(OUT, A, PFIX, TA) { # Turns Q SUBSEP R into A[PFIX":"Q]=R
        if(OUT)
        {
                if(PFIX) PFIX=PFIX":"
                split(OUT, TA, SUBSEP);
                A[toupper(PFIX) toupper(TA[1])]=TA[2];
        }

        return("");
}

# Intended to be less stupid about quoted text in XML/HTML.
# Splits a='b' c='d' e='f' into A[PFIX":"a]=b, A[PFIX":"c]=d, etc.
function qsplit(STR, A, PFIX, X, OUT) {
        while(STR && match(STR, /([ \n\t]+)|[\x27\x22=]/))
        {
                OUT = OUT rbefore(STR);
                RMID=rmid(STR);

                if((RMID == "'") || (RMID == "\""))     # Quote characters
                {
                        if(!Q)          Q=RMID;         # Begin quote section
                        else if(Q == RMID)      Q="";   # End quote section
                        else                    OUT = OUT RMID; # Quoted quote
                } else if(RMID == "=") {
                        if(Q)   OUT=OUT RMID; else OUT=OUT SUBSEP;
                } else if((RMID=="\r")||(RMID=="\n")||(RMID=="\t")||(RMID==" ")) {
                        if(Q)   OUT = OUT rall(STR); # Literal quoted whitespace
                        else    OUT = aquote(OUT, A, PFIX); # Unquoted WS, next block
                }
                STR=rafter(STR); # Strip off the text we've processed already.
        }

        aquote(OUT STR, A, PFIX); # Process any text we haven't already.
}


{ SPEC=0 ; TAG="" }

NR==1 {
        if(ORS == RS) print;
        next } # The first "line" is blank when RS=<

/^[!?]/ { SPEC=1    }   # XML specification junk

# Handle open-tags
match($0, /^[^\/ \r\n\t>]+/) {
        TAG=substr(toupper($1), RSTART, RLENGTH);
        if(!SPEC)
        {
                TAGS=TAG "%" TAGS;
                DEP++;
                LTAGS=TAGS
        }

        for(X in ARGS) delete ARGS[X];

        qsplit(rafter($1), ARGS);
}

# Handle close-tags
(!SPEC) && /^[\/]/ {
        sub(/^\//, "", $1);
        LTAGS=TAGS
        sub("^.*" toupper($1) "%", "", TAGS);
        $1="/"$1
        DEP=split(TAGS, TA, "%")-1;
        if(DEP < 0) DEP=0;
}

$ awk -f xml.awk -e '"DATA-FALLBACK-FILE" in ARGS { printf("%s\n", ARGS["DATA-FALLBACK-FILE"]); }' ORS="" in.html

http://s3.amazonaws.com/learnmath-app.production/0bbbe7abce1523ee6b4e82b115d0b6fa67331779-video-1149/web.mp4

$

Thanks a lot. That solved my problem :slight_smile:

While the last method by Corona688 worked, I found a simpler way to do the same thing:
sed -n 's/.*data-fallback-file="\(.*\)data-file="http/\1/p' <in.html | cut -d'"' -f1 >>hh.txt

Glad it works for you. I usually avoid commands that depend on the HTML having line breaks in specific places; very minor changes to the page will cause the code to stop working.
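
As a minimal sketch of that point, assuming only that the attribute value is double-quoted, joining the lines before matching removes the dependence on where the line breaks fall. The sample page and its URLs below are hypothetical.

```shell
# Hypothetical page where the tag is split across several lines.
tmp=$(mktemp /tmp/page.XXXXXX)
printf '%s\n' \
    '<div class="video_container"' \
    '  data-fallback-file="http://example.com/video-1/web.mp4"' \
    '  data-file="http://cdn.example.com/video-1/web.mp4">' > "$tmp"

# Turn every newline into a space, then capture the quoted attribute value.
url=$(tr '\n' ' ' < "$tmp" | sed -n 's/.*data-fallback-file="\([^"]*\)".*/\1/p')
echo "$url"
rm -f "$tmp"
```

Because the leading .* is greedy, this returns only the last data-fallback-file value if a page has more than one; a real parser is still the safer choice for anything less regular.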