parsing a webpage - perl lwp

jacobs.smith · September 28, 2012, 8:42am

I need help with the text parsing task below. Any help is highly appreciated.

<tr valign="top"><td nowrap>Source name</td>
<td style="text-align: justify">Sample Name<br></td>

I want Sample Name from above.

In the same file, I have to search for another pattern like this

<td><a href="http://www.unix.com/sra?term=SRX12345">SRX12345</a></td>

I want SRX12345 from this pattern.

Now, my final output will be

SRX12345 Sample Name

---------- Post updated 09-28-12 at 08:42 AM ---------- Previous update was 09-27-12 at 03:03 PM ----------

Hi Friends,

I worked on this task and got this far.

Could someone please enhance it?

For the first search, I used this

cat input | awk -F'>' '/nowrap>Source/ {getline; print $2}'

The output was

Sample Name<br

For the second pattern, I wrote this

cat input | awk '/^<td><a href=/' | grep -o 'http://www.unix.com/sra?term=[^"]*' | awk -F'=' '{print $2}'

and the output was

SRX12345

But, I would like to join both of them together and the expected final output is

Sample Name SRX12345

I can't use join or similar tools because I run the two commands as separate instances and the order of the results changes. I have more than 500 such patterns to search for, and I want each pair side by side in two columns.

pamu · September 28, 2012, 8:56am

try something like this...

awk -F "[<>]" '/nowrap>Source/ {getline; printf "%s ", $(NF-4)}
/<td><a href=/{print $(NF-4)}'  file
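
If the two cells can appear in either order within a record, one option is to hold each value until its partner shows up and only then print the pair. The following is a minimal sketch, not a tested solution for the real pages: the function name extract_pairs is made up for illustration, and it assumes the markup looks exactly like the two samples above (the value on the line right after the "Source name" cell, and the accession as the link text). It prints the accession first, tab-separated, to match the first requested output.

```shell
# Hypothetical helper (name is illustrative, not from the thread).
# Assumes the HTML layout shown in the samples above.
extract_pairs() {
  awk -F'[<>]' '
    /nowrap>Source name/ {            # label cell; the value is on the next line
      if ((getline) > 0) name = $3    # text between the first ">" and the next "<"
    }
    /<td><a href=/ { id = $5 }        # link text of the accession cell
    name != "" && id != "" {          # both halves seen: emit one record
      printf "%s\t%s\n", id, name
      name = id = ""                  # reset for the next pair
    }
  ' "$@"
}

# Usage (file name is a placeholder):
#   extract_pairs page.html
```

Because everything happens in a single awk pass, the sample name and accession stay paired without needing join afterwards. If one page can hold several sample names before any accession, the buffering logic would need to change.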