Delete block of text in one file based on list in another file

Hi all

I currently use the following in shell.

#!/bin/sh

while read LINE
do
  # -00 puts perl in paragraph mode (blank-line separated records);
  # setting $/ inside the -n loop would only take effect after the first read
  perl -00 -i -ne "print if !m'Using archive: ${LINE}'ms;" "datafile"
done < "listfile"

NOTE the single-quote delimiters in the match expression. It's highly likely that 'LINE' will contain characters perl would otherwise try to interpolate, the '@' for example... See the sample data.

I would like to reduce the overhead of the multiple perl invocations and do both loops in a single perl call from within the shell.

So, inspired by this thread: Removing Lines if value exist in first file

And this bit of code from that thread:

my (@a, %exclude);
my $file = shift;
open(EXCLUDE_LIST, "< $file") or die;
chomp( @a=<EXCLUDE_LIST> );
close(EXCLUDE_LIST);
@exclude{@a}=@a;

while (<>) {
    print unless exists $exclude{ (split(/,/))[3] };
}
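
As far as I can tell, the magic is in the hash slice: @exclude{@a}=@a loads every line of the exclude file as a hash key, so each record needs only a single exists lookup instead of an inner loop over the whole list. A minimal illustration of what I think is going on (my own toy example, not from that thread):

my @a = ('foo', 'bar');
my %exclude;
@exclude{@a} = @a;   # same as: $exclude{'foo'} = 'foo'; $exclude{'bar'} = 'bar';

# one O(1) lookup instead of an inner loop over the exclude list
print "skip\n" if exists $exclude{'foo'};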

I have been attempting to hack it into submission without success!

HELP!

I like the idea of the hash, however that is way above my head, and after many hours of poring over this site and the perl man pages I have yet to come close to figuring out how to use it!

If I understand the above, the use of the hash would limit things to a single pass over the datafile, even with multiple values to match!
Correct or incorrect??

And if needed, the following are the sample datafile, listfile and target results.

Sample datafile: (first line is blank, last line is not)

Backup started: Sat Aug 22 05:15:00 EDT 2009, MyBackup v3.0.8
 Using archive: /mnt/Raid/test/Backup_20090822@051500.tbz
 Removed archive: /mnt/Raid/test/Backup_20090820@051500.tbz
Backup completed: 293,437,440 bytes in 131 seconds at 05:17:11 EDT

Backup started: Sun Aug 23 05:15:00 EDT 2009, MyBackup v3.0.8
 Using archive: /mnt/Raid/test/Backup_20090823@051500.tbz
 Removed archive: /mnt/Raid/test/Backup_20090821@051500.tbz
Backup completed: 224,477,184 bytes in 100 seconds at 05:16:40 EDT

Backup started: Mon Aug 24 05:15:00 EDT 2009, MyBackup v3.1.0
 Using archive: /mnt/Raid/test/Backup_20090824@051500.tbz
 Removed archive: /mnt/Raid/test/Backup_20090822@051500.tbz
Backup completed: 224,307,734 bytes in 99 seconds at 05:16:39 EDT

Backup started: Tue Aug 25 05:15:00 EDT 2009, MyBackup v3.1.0
 Using archive: /mnt/Raid/test/Backup_20090825@051500.tbz
 Removed archive: /mnt/Raid/test/Backup_20090823@051500.tbz
Backup completed: 237,993,204 bytes in 104 seconds at 05:16:44 EDT

Sample listfile: (no blank lines)

/mnt/Raid/test/Backup_20090823@051500.tbz
/mnt/Raid/test/Backup_20090825@051500.tbz

Target Results: (first line is blank, last line is not)

Backup started: Sat Aug 22 05:15:00 EDT 2009, MyBackup v3.0.8
 Using archive: /mnt/Raid/test/Backup_20090822@051500.tbz
 Removed archive: /mnt/Raid/test/Backup_20090820@051500.tbz
Backup completed: 293,437,440 bytes in 131 seconds at 05:17:11 EDT

Backup started: Mon Aug 24 05:15:00 EDT 2009, MyBackup v3.1.0
 Using archive: /mnt/Raid/test/Backup_20090824@051500.tbz
 Removed archive: /mnt/Raid/test/Backup_20090822@051500.tbz
Backup completed: 224,307,734 bytes in 99 seconds at 05:16:39 EDT

Thanks

-Enjoy
fh : )_~

The following implementation may help you out...

#!/usr/bin/perl
use strict;
use warnings;

# load the exclude list (file "t1") into a hash for O(1) lookups
my %hash;
open my $in, '<', 't1' or die "t1: $!";
while (<$in>)  {
	chomp;
	$hash{$_} = 1;
}
close $in;

# collect each blank-line separated record, then keep or drop it whole
my @array;
sub flush_record {
	return unless @array;
	# field 3 of the " Using archive: ..." line is the path
	my $key = (split / /, $array[1])[3];
	chomp $key;
	print "\n", @array unless exists $hash{$key};
	@array = ();
}
while (<>)  {
	if ( $_ !~ /^$/ )  {
		push @array, $_;
		next;
	}
	flush_record();
}
flush_record();    # flush the last record; the datafile does not end with a blank line
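
Assuming the exclude list is in t1 (the name hard-coded above) and the script is saved as, say, filter.pl (a name made up here), it would run as:

perl filter.pl datafile > datafile.new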

Or in awk, this seems to produce the desired output with the given sample files:

awk 'FNR>=NR{a[" Using archive: "$0]=1;next}{RS=ORS="\n\n";FS="\n"}!a[$2]{print}' listfile datafile

I don't much like the {RS=ORS="\n\n";FS="\n"} block executing on every line of the second file, but I don't know how to avoid it. The variable assignments can't go in a BEGIN block, as the first file would then not be parsed correctly. Any idea?

Edit
The only thing I could think of is to flag the variable assignment so that once it has been done, it is skipped on subsequent lines:

awk 'FNR>=NR{a[" Using archive: "$0]=1;next}!f{RS=ORS="\n\n";FS="\n";f=1}!a[$2]{print}' listfile datafile
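
Another option, sketched here and only tested mentally against the sample data: read listfile with getline in a BEGIN block, so the separators can be set to paragraph mode before the first record of datafile is ever read. Note that RS="" also swallows datafile's leading blank line, so the output is not byte-identical to the target above:

awk 'BEGIN {
  # slurp the exclude list ourselves, so the main loop only sees datafile
  while ((getline line < "listfile") > 0) a[" Using archive: " line] = 1
  close("listfile")
  RS = ""; FS = "\n"; ORS = "\n\n"   # paragraph mode from the very first record
}
!a[$2] { print }' datafile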

Hi all,

Thanks for the responses ...

I have accomplished this with the following methods; however, I have gone a step further with a third method...

Hopefully they help the next one in need!
The 3rd one is pretty specific to my needs.

Method 1, based on Removing Lines if value exist in first file post #4 by Azhrei. Thanks, Azhrei!

perl -i~ -e '  # -i~ for in-place editing with tilde backup file
  use strict;
  use warnings;
  my @a;
  my %excludehash;
  my $file = shift;
  open(my $excludelist, "<", $file) or die "$file: $!";
  chomp( @a = <$excludelist> );
  close($excludelist);
  @excludehash{@a} = @a;
  {
    local($/) = "";  # paragraph mode: one blank-line separated record per read
    while (<>) {
      # YOURKEY is a placeholder; for the sample data it would be " Using archive".
      # Capturing into $key avoids acting on a stale $1 when the match fails.
      my ($key) = m/YOURKEY:\s+(.*)$/m;
      print unless defined $key && exists $excludehash{ $key };
    }
  }' "excludefile" "datafile"

Method 2, my own brew.

perl -i~ -e '  # -i~ for in-place editing with tilde backup file
  use strict;
  use warnings;
  my %excludehash;
  my $file = shift;
  open(my $excludelist, "<", $file) or die "$file: $!";
  while(<$excludelist>) {
    chomp;
    next if /^$/;    # skip blank lines in the exclude file
    $excludehash{ $_ } = $_;
  }
  close($excludelist);
  {
    local($/) = "";  # paragraph mode
    while (<>) {
      # the && short-circuits, so $1 is only used when the match succeeds
      next if ( m/^YOURKEY:\s+(.*)$/m && $excludehash{ $1 } );
      print
    }
  }' "excludefile" "datafile"

Just for giggles I created a dummy datafile of ~26M with 80,743 records, each consisting of at least 7 and up to 30 lines of text. After generating an excludefile of 20,271 records to be removed... I ran both methods across it.

The speed is freaking incredible!
I didn't time them accurately, however each finishes in less than 15 seconds! I was/am blown away by that!
Especially on my FBSD7.1R PIII-866!

Now, armed with that education, I did the following!

What is in production for my needs...
Take a look at the sample data above and you will notice that Archive Maintenance removes old archives and logs them as "Removed archive: ..."... There is the EXCLUDELIST!!

The following code reads the log file in the first while loop, adding all the "Removed archive:" values to a hash (removehash).
Then it seeks the file pointer back to the beginning of the log, and in the second while loop walks through the records again, matching the "Using archive:" value of each record against the removehash elements... If there is a match, the record is skipped!

It even handles multiple 'Removed archive:' entries per record...
And it is incredibly fast.

perl -i~ -e '  # -i~ for in-place editing with tilde backup file
    my %removehash;
    {
      local($/) = "";  # paragraph mode
      # pass 1: collect every "Removed archive:" path into the hash
      while(<>) {
        while (m/^ Removed archive:\s+(.*)$/mg) {
          $removehash{ $1 } = $1;
        }
        # bare eof is true at the end of the current file; bailing out here
        # leaves ARGV open so it can be rewound below
        last if (eof);
      }
      seek(ARGV, 0, 0);  # rewind the log for a second pass
      # pass 2: print only records whose "Using archive:" is not in the hash
      while (<>) {
        my ($key) = m/^ Using archive:\s+(.*)$/m;
        print unless defined $key && exists $removehash{ $key };
      }
    }' "logfile"

-Enjoy
fh : )_~