To get the id value dynamically

Kumar786 · March 6, 2019, 5:36am

Hi,

I have an .html file and need to get the id of the entry with the latest "Last modified" date in that file.

Expected output: 1456

Thanks in advance

Here is the .html file

<html>
<head><title>Index</title>
</head>
<body>
<h1> BI EI Team</h1>
<pre>ID   Last modified      Size</pre><hr/>
<pre>
<a href="1456/">1456/</a>  01-Mar-2019 15:49    108MB
<a href="4561/">4561/</a>  28-Feb-2019 11:08    121MB
</pre>
<hr/>
</body>
 </html>

RudiC · March 6, 2019, 5:44am

Welcome to the forum.

Please get into the habit of providing decent context for your problem.

It is always helpful to phrase a request carefully and in detail, and to support it with system info such as OS and shell, the related environment (variables, directory structures, options), preferred tools, adequate (representative) sample input and desired output data, the logic connecting the two (including your own attempts at a solution), and, if any, system (error) messages verbatim. This avoids ambiguities and keeps people from guessing.

So - what have you tried so far?
Do you have GNU date available, or another version that accepts the date to operate on as a parameter?

Kumar786 · March 6, 2019, 6:02am

Thanks RudiC,

I tried the command below, but I am not getting the exact value.

cat file.html |grep "<a href=" | head -2 |awk -F ">" '{print $2}

output : 1456/</a

RudiC · March 6, 2019, 6:22am

Your command pipeline is missing a closing single quote; with that fixed, the output is the two data lines from your input file:

1456/</a
4561/</a

Please be aware that none of the usual *nix text tools is well suited to handling *ml or similar data; any solution based on them will be fragile.

What is the essential criterion to identify the lines to extract from your html file? I used the field count (NF > 4) in this approach:

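awk -F"[- :/]*" '
        {gsub (/<[^>]*>/, "")           # strip the HTML tags; $0 is re-split afterwards
        }
!NF     {next                           # skip lines that are now empty
        }
NF > 4  {print $4, $3, $2, $5, $6, $1   # year month day hour minute id
        }
' file | LC_ALL=C sort -nr -k1,1 -k2,2Mr -k3 | head -1 | cut -d" " -f6
1456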

It relies on sort (GNU coreutils) and its M key option to order abbreviated month names.
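
If the M ordering of sort is not available, the same idea can be done in awk alone by turning each date into a sortable key and keeping the largest one. This is only a minimal sketch, assuming the listing always uses the DD-Mon-YYYY HH:MM layout from the sample; sample.html, latest_id and the array names are made up for illustration:

```shell
# Recreate the sample listing from the first post (illustrative file name).
cat > sample.html <<'EOF'
<html>
<head><title>Index</title>
</head>
<body>
<h1> BI EI Team</h1>
<pre>ID   Last modified      Size</pre><hr/>
<pre>
<a href="1456/">1456/</a>  01-Mar-2019 15:49    108MB
<a href="4561/">4561/</a>  28-Feb-2019 11:08    121MB
</pre>
<hr/>
</body>
 </html>
EOF

latest_id=$(awk '
BEGIN {
    # map abbreviated month names to 1..12
    n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", name, " ")
    for (i = 1; i <= n; i++) mon[name[i]] = i
}
/<a href="[0-9]+\/">/ {
    line = $0
    gsub(/<[^>]*>/, "", line)            # drop the tags
    split(line, f, "[- :/]+")            # id day month year hour minute size
    # build YYYYMMDDHHMM; fixed width, so string comparison orders by time
    key = sprintf("%04d%02d%02d%02d%02d", f[4], mon[f[3]], f[2], f[5], f[6])
    if (key > best) { best = key; id = f[1] }
}
END { print id }
' sample.html)

echo "$latest_id"
```

This reads the file once and needs neither sort nor GNU date, at the cost of hard-coding the English month abbreviations.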

Kumar786 · March 6, 2019, 6:23am

Hi,

I got the result on my own. Is there a better way to do this?

cat file.html | grep "<a href=" | head -2 |awk -F ">" '{print $2}'| awk -F "/" '{print $1}'

Thx

joker · March 6, 2019, 6:59am

If you really want the id of the newest match, take RudiC's snippet. If you can assume that the first match is the newest, you can make it simpler with this:

awk 'match($0,/<a href="([0-9]+)/,res) {print res[1];exit}' data.html

If someone wants to use a parser for more robust operation, one can use this as a starting point:

xmlstarlet sel -t -v "/html/body/pre" data.html  | awk '/^\s*[0-9]+\// {print $1,$2}'


# output 

1456/ 01-Mar-2019
4561/ 28-Feb-2019

Note: I assume only GNU awk supports the three-argument match(subject, pattern, array). Other awk variants only support the two-argument match(subject, pattern), so the awk command above causes a syntax error there.
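
For awk variants without the three-argument form, the two-argument match() still sets RSTART and RLENGTH, and substr() can cut the id out of the matched text. A minimal sketch, assuming again that the first link is the newest and that ids are purely numeric; sample.html and first_id are illustrative names:

```shell
# Two data lines from the sample listing (illustrative file name).
cat > sample.html <<'EOF'
<pre>
<a href="1456/">1456/</a>  01-Mar-2019 15:49    108MB
<a href="4561/">4561/</a>  28-Feb-2019 11:08    121MB
</pre>
EOF

first_id=$(awk '
match($0, /<a href="[0-9]+/) {
    # the match begins with the 9 characters <a href=" ; the rest is the id
    print substr($0, RSTART + 9, RLENGTH - 9)
    exit
}' sample.html)

echo "$first_id"
```

The offset 9 is tied to the exact prefix in the pattern, so it has to change if the pattern does.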