Perl: get all the hyperlinks from an xlsx sheet (hyperlinks not visible in the Excel sheet directly)

Hi folks,

I have a requirement in Perl to print all the hyperlinks from a spreadsheet (xlsx).
The spreadsheet contains a few lines of hyperlinked data (pic attached).

P.S. The hyperlink is behind the data and is not visible in the Excel sheet directly.

Now, using a Perl script, I need to copy the hyperlinks into a separate Excel sheet.

I have browsed the CPAN modules but haven't found one that suits my requirement.

Could you please help me with this?

An xlsx file is a zip archive of many files, so point a zip tool at it. Many versions of Windows Explorer will open it if you rename it to whatever.zip. Filter the archive listing down to usable file types; you do not want to text-filter any images. Most of the internal files are XML, which is text. I see vml and rels files labeled as XML by PKZIP. Once you look inside, you will see the patterns around the URLs, and almost any text tool can extract them: awk, sed, Perl. You can get the list of internal files of interest from an unzip listing, then tell unzip to extract those files to stdout and pipe that through a filter that pulls out the URLs.
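
As a rough sketch of the "filter the listing" step: the helper below keeps only the text-like member names from a listing on stdin. The function name xlsx_text_members and the file name book.xlsx are made up for illustration, and the extension list (xml, rels, vml) is an assumption based on the file types mentioned above, so adjust it to what your workbook actually contains.

```shell
# Hypothetical helper: read archive member names on stdin, one per line,
# and keep only the text-like ones (extension list is an assumption).
xlsx_text_members() {
    grep -E '\.(xml|rels|vml)$'
}

# Possible usage, with book.xlsx as a placeholder file name.
# Info-ZIP's "unzip -Z1" prints bare member names, one per line:
#   unzip -Z1 book.xlsx | xlsx_text_members
```

Images and other binary members (png, jpeg, bin) fall out of the list, so nothing downstream tries to text-filter them.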

Many thanks...

Could you please explain the steps to follow, or let me know which modules are required for this task?

Thanks in advance...

Well, I would start with man unzip and find out how to get a file listing of the xlsx on stdout, so I can pick out which members are text-like, usually XML. Then I can use unzip to extract each of those files to stdout, where I can use sed to find and strip out the URLs I want. First look at the output in pg or the like and find a URL you know you want. There may be many URLs on a line, so you need to separate them onto different lines and dispose of non-URL lines and line fragments. Something like:

unzip <list_options> xxx.xlsx | pg
 
unzip <list_options> xxx.xlsx | egrep <patterns_you_like> | pg
 
unzip <list_options> xxx.xlsx | egrep <patterns_you_like> | xargs <run_only_if_input_opts> unzip <unzip_to_stdout_options> xxx.xlsx | pg
 
unzip <list_options> xxx.xlsx | egrep <patterns_you_like> | xargs <run_only_if_input_opts> unzip <unzip_to_stdout_options> xxx.xlsx | sed '<script_to_delete_separate_trim_URLs>' | pg
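
Here is one possible way to fill in the sed end of that pipeline, as a sketch rather than a finished tool. It assumes the links were added with Excel's Insert Hyperlink, in which case the URLs sit in the worksheet .rels members as Target="..." attributes on Relationship elements whose Type ends in relationships/hyperlink. The function name hyperlink_targets and the file name book.xlsx are made up for illustration. Links written as HYPERLINK() formulas live in the sheet XML instead and are not covered here.

```shell
# Hypothetical filter: read .rels XML on stdin, print one hyperlink URL per line.
# Assumes no literal '>' inside attribute values.
hyperlink_targets() {
    tr '>' '\n' |                               # put each XML element on its own line
    grep 'relationships/hyperlink"' |           # keep hyperlink relationships only
    sed -n 's/.*Target="\([^"]*\)".*/\1/p' |    # print just the Target value
    sed 's/&amp;/\&/g'                          # undo XML escaping of '&'
}

# Possible usage, with book.xlsx as a placeholder file name.
# "unzip -p" writes the matching members to stdout:
#   unzip -p book.xlsx 'xl/worksheets/_rels/*.rels' | hyperlink_targets
```

Splitting on '>' first handles the "many URLs on a line" problem, since these XML files are usually written as one long line.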

If you want to stay in Perl, there are unzip APIs: IO::Uncompress::Unzip - Read zip files/buffers (Perldoc Browser)

And there are direct XLSX access APIs: Spreadsheet::XLSX - Perl extension for reading MS Excel 2007 files (metacpan.org)