Perl: batch replace a portion of text in files

Hi all,

What I would like to achieve is to batch-change the code below in every PDF in a given directory (each PDF is uncompressed so that it can be easily edited).
An example of the JavaScript code:

if (this.hostContainer) { try { this.hostContainer.postMessage(['newPage', 'pp_216', 15259]); } catch(e) { console.println(e); } };

The portion pp_216 can vary; the number 216 is just an example. After pp_ there are two possibilities. The first is an ordinary (Arabic) number, e.g. 1, 2, 216, etc. In this case, I would like to subtract 16 from this number. For example:

this.zoomType = zoomtype.pref;
this.pageNum = 200;
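
The Arabic case could be sketched in Perl with a substitution whose replacement is evaluated as code (the /e modifier). This is only a sketch: $offset and $shifted are made-up names, and it renumbers pp_216 in place rather than producing the this.pageNum form shown above.

```perl
use strict;
use warnings;

# Sample line taken from the question; $offset is an assumed name for the shift.
my $line   = q{this.hostContainer.postMessage(['newPage', 'pp_216', 15259]);};
my $offset = 16;

# (?<=pp_) is a lookbehind: it requires "pp_" before the digits without consuming it.
# /e evaluates the replacement as Perl code, /g handles every occurrence on the line.
(my $shifted = $line) =~ s/(?<=pp_)(\d+)/$1 - $offset/ge;

print "$shifted\n";    # ... 'pp_200' ...
```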

The second option is a Roman numeral:

if (this.hostContainer) { try { this.hostContainer.postMessage(['newPage', 'pp_v', 15259]); } catch(e) { console.println(e); } };

In this case I would like to have the Roman numeral converted to Arabic and then subtract 2 from it.

For example:
this.zoomType = zoomtype.pref;
this.pageNum = 3;
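
The Roman case could look roughly like this. The helper roman_to_arabic is a hypothetical name written for this sketch (it is not a library function), and it does not reject ill-formed numerals such as "iiii". As above, the number is renumbered in place inside pp_.

```perl
use strict;
use warnings;

# Value of each Roman digit.
my %value = (I => 1, V => 5, X => 10, L => 50, C => 100, D => 500, M => 1000);

# Hypothetical helper: convert a Roman numeral (either case) to an integer.
sub roman_to_arabic {
    my ($roman) = @_;
    my @digits = map { $value{$_} } split //, uc $roman;
    my $total  = 0;
    for my $i (0 .. $#digits) {
        # A smaller digit in front of a larger one is subtracted (IV = 4, XC = 90).
        if ($i < $#digits && $digits[$i] < $digits[$i + 1]) {
            $total -= $digits[$i];
        } else {
            $total += $digits[$i];
        }
    }
    return $total;
}

my $line = q{this.hostContainer.postMessage(['newPage', 'pp_v', 15259]);};

# \b keeps the match from stopping in the middle of a longer word.
(my $shifted = $line) =~ s/pp_([IVXLCDMivxlcdm]+)\b/'pp_' . (roman_to_arabic($1) - 2)/ge;

print "$shifted\n";    # 'pp_v' becomes 'pp_3'
```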

At first I tried to use bash, but it does not seem to allow what I am looking for. Perl supports Roman numerals [no link because I do not have 5 points] and regex. As for regex, before I was told that it would not be a good idea to use it, I had come up with this piece of code:

use warnings;
use strict;

our @array = `find -P $path -type f -name \'*.pdf\'`;
foreach my $p (@array){
    open(my $source, $p) or die "Cannot open a file";
    while(my $line = <$source>){
        if($line =~ /(?<=pp_)\d+(?:\'\d+)?/){

but it is possibly buggy and supports only Arabic (0-9) numbers.

All the code above is also in the attached .txt since I cannot set the correct formatting. In the attachment you can also find a sample 'unpacked' PDF with two entries edited by me (lines 6242 and 6246) to show what I am looking for.

I'm unsure whether an uncompressed PDF document is plain text. If so, you could use awk for this problem:

awk '
BEGIN {
    rn="MDCLXVI" # Roman numerals desc order
    split("1000 500 100 50 10 5 1",v) # value for each numeral
}
function roman_val(s,val,c,d,p,q) {
    if(s !~ "^[" rn tolower(rn) "]+$") return 0;
    c = split(toupper(s),d,"")
    val = v[index(rn,d[c])]
    while (--c) {
      p = index(rn,d[c])
      q = index(rn,d[c+1])
      val += (p>q)? -v[p] : v[p]
    }
    return val
}
/p_[0-9]+/ {
  x=$0
  n=""  # reset the output buffer for every line
  while(match(x, "p_[0-9]+")) {
      pg=substr(x,RSTART+2,RLENGTH-2)-16
      n=n substr(x,1,RSTART-1) "p_" pg
      x=substr(x,RSTART+RLENGTH)
  }
  $0= n x
}

$0 ~ "p_[" rn tolower(rn) "]+" {
  while(match($0, "p_[" rn tolower(rn) "]+")) {
      pg=roman_val(substr($0,RSTART+2,RLENGTH-2))-2
      x=substr($0,1,RSTART-1) "p_" pg substr($0,RSTART+RLENGTH)
      $0=x
  }
} 1' your_uncompressed.pdf

I think you are right. An unpacked PDF file looks like a plain text file, but it is probably still a binary file. However, the tool I used to unpack it (pdftk) claims that the unpacked file can be edited with a simple text editor.

When I run your script, it shows some binary characters and crashes. I also thought of using awk, but as far as I know it works only if a file is plain text.

Could you have a look at the PDF I attached in my first post?
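
On the plain-text worry: as far as I know, Perl can read such a file line by line in raw mode and leave every byte it does not explicitly edit alone. A small sketch, using a temporary stand-in file instead of a real PDF (the file contents here are made up for the demonstration):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build a small stand-in file: a text line, all 256 byte values, another text line.
my $binary = join '', map { chr } 0 .. 255;
my ($fh, $name) = tempfile(UNLINK => 1);
binmode $fh;
print {$fh} "pp_216\n", $binary, "\npp_20\n";
close $fh or die "close failed: $!";

# Read it back in raw mode, edit only the pp_ numbers, and collect the result.
open my $in, '<:raw', $name or die "Cannot open $name: $!";
my $out = '';
while (my $line = <$in>) {
    $line =~ s/pp_(\d+)/'pp_' . ($1 - 16)/ge;
    $out .= $line;
}
close $in;

# Every byte outside the edited numbers should be unchanged.
print $out eq "pp_200\n" . $binary . "\npp_4\n" ? "bytes preserved\n" : "bytes changed\n";
```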

How about this Perl solution using the Roman CPAN module:

use warnings;
use strict;

use Roman;

our @array = `find -P . -type f -name \'*.pdf\'`;

foreach my $p (@array){

    chomp($p);
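    # Three-argument open keeps odd file names from being read as a mode; $! gives the reason on failure.
    open(my $source, '<', $p) or die "Cannot open file $p: $!";
    open(my $dest, '>', $p . ".new") or die "Cannot open output file $p.new: $!";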
    binmode($source);
    binmode($dest);

    while(my $line = <$source>){
        while (my ($page) = $line =~ /pp_(\d+)/) {
            my $newpage = $page-16;
            $line =~ s/pp_$page/zUNIQz_$newpage/;
        }
        while (my ($rdigit) = $line =~ /pp_([MDCLXVI]+)/i) {
           my $newpage = arabic($rdigit)-2;
           $line =~ s/pp_$rdigit/zUNIQz_$newpage/;
        }
        $line =~ s/zUNIQz_/pp_/g;
        print $dest $line;
    }
    close($source);
    close($dest);
    rename "$p" => "$p.bak" or
          die "can't rename $p to $p.bak";
    rename "$p.new" => "$p" or
          die "can't rename $p.new to $p";
}
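
If the zUNIQz_ placeholder feels awkward, the two loops could probably be folded into one substitution that captures either kind of number. This is only a sketch: roman_value is a hypothetical stand-in for the module's arabic() so that the snippet runs without CPAN, renumber is a made-up name, and the \b at the end is my assumption so that something like pp_cover is left alone.

```perl
use strict;
use warnings;

my %value = (I => 1, V => 5, X => 10, L => 50, C => 100, D => 500, M => 1000);

# Hypothetical stand-in for the Roman module's arabic(); no validation is done.
sub roman_value {
    my ($roman) = @_;
    my ($total, $prev) = (0, 0);
    # Walk right to left: a digit smaller than its right neighbour is subtracted.
    for my $digit (map { $value{$_} } reverse split //, uc $roman) {
        $total += $digit < $prev ? -$digit : $digit;
        $prev = $digit;
    }
    return $total;
}

sub renumber {
    my ($line) = @_;
    # Group 1 holds Arabic digits, group 2 a Roman numeral; only one is defined.
    $line =~ s{pp_(?:(\d+)|([IVXLCDMivxlcdm]+))\b}
              {'pp_' . (defined $1 ? $1 - 16 : roman_value($2) - 2)}ge;
    return $line;
}

print renumber("['newPage', 'pp_216', 15259]"), "\n";      # pp_200
print renumber("['newPage', 'pp_v', 15259]"), "\n";        # pp_3
print renumber("['newPage', 'pp_cover', 15259]"), "\n";    # unchanged
```

Inside the while loop of the script above, the body would then shrink to a single print $dest renumber($line); call.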