Filter or remove duplicate block of text without distinguishing marks or fields

samask · October 11, 2011, 9:23am

Hello,

Although I have found similar questions, I could not find advice that
could help with our problem.

The issue:

We have several hundred text files containing repeated blocks of text
(I guess they were prepared like that back then to optimize
printing).

The blocks of text are not regular, i.e. it is difficult to identify
awk fields in them.

The only useful tidbit seems to be the $newpage tag.

Example:


[block 1] The Branchial or Visceral Arches and Pharyngeal Pouches. -In
the lateral walls of the anterior part of the fore-gut five pharyngeal
pouches appear (Fig. 42).

$newpage

[block 1] The Branchial or Visceral Arches and Pharyngeal Pouches. -In
the lateral walls of the anterior part of the fore-gut five pharyngeal
pouches appear (Fig. 42).

$newpage

[block 2] Each of the upper four pouches is prolonged into a dorsal and
a ventral diverticulum.

Over these pouches corresponding indentations of the ectoderm occur,
forming what are known as the branchial or outer pharyngeal grooves.

$newpage

[block 2] Each of the upper four pouches is prolonged into a dorsal and
a ventral diverticulum.

Over these pouches corresponding indentations of the ectoderm occur,
forming what are known as the branchial or outer pharyngeal grooves.

$newpage

[block 3] The intervening mesoderm is pressed aside and the ectoderm
comes for a time into contact with the entodermal lining of the
fore-gut, and the two layers unite along the floors of the grooves to
form thin closing membranes between the fore-gut and the exterior.

Later the mesoderm again penetrates between the entoderm and the
ectoderm. In gill-bearing animals the closing membranes disappear, and
the grooves become complete clefts, the gill-clefts, opening from the
pharynx on to the exterior; perforation, however, does not occur in
birds or mammals.

$newpage

[block 3] The intervening mesoderm is pressed aside and the ectoderm
comes for a time into contact with the entodermal lining of the
fore-gut, and the two layers unite along the floors of the grooves to
form thin closing membranes between the fore-gut and the exterior.

Later the mesoderm again penetrates between the entoderm and the
ectoderm. In gill-bearing animals the closing membranes disappear, and
the grooves become complete clefts, the gill-clefts, opening from the
pharynx on to the exterior; perforation, however, does not occur in
birds or mammals.

$newpage

[block 4] The grooves separate a series of rounded bars or arches, the
branchial or visceral arches, in which thickening of the mesoderm takes
place (Figs. 40 and 41).

The dorsal ends of these arches are attached to the sides of the head,
while the ventral extremities ultimately meet in the middle line of the
neck.

$newpage

[block 4] The grooves separate a series of rounded bars or arches, the
branchial or visceral arches, in which thickening of the mesoderm takes
place (Figs. 40 and 41).

The dorsal ends of these arches are attached to the sides of the head,
while the ventral extremities ultimately meet in the middle line of the
neck.

$newpage

[block 5] In all, six arches make their appearance, but of these only
the first four are visible externally.

The first arch is named the mandibular, and the second the hyoid; the
others have no distinctive names.

In each arch a cartilaginous bar, consisting of right and left halves,
is developed, and with each of these there is one of the primitive
aortic arches.

$newpage

[block 5] In all, six arches make their appearance, but of these only
the first four are visible externally.

The first arch is named the mandibular, and the second the hyoid; the
others have no distinctive names.

In each arch a cartilaginous bar, consisting of right and left halves,
is developed, and with each of these there is one of the primitive
aortic arches.

Note that:

- The block id in square brackets is mine: I added it to clarify the
  example, but it is not present in the files.
- Not all blocks of text are separated by the same number of
  newlines.
- If a block of text is duplicated, the copy follows right after the
  first instance, i.e. a copy never appears anywhere other than
  directly after the original.
- We do not need to keep the $newpage tag.

Is there any script I could use to automatically delete a duplicated
block of text, so that, taking the example above as the source, we get:


[block 1] The Branchial or Visceral Arches and Pharyngeal Pouches. -In
the lateral walls of the anterior part of the fore-gut five pharyngeal
pouches appear (Fig. 42).

[block 2] Each of the upper four pouches is prolonged into a dorsal and
a ventral diverticulum.

Over these pouches corresponding indentations of the ectoderm occur,
forming what are known as the branchial or outer pharyngeal grooves.

[block 3] The intervening mesoderm is pressed aside and the ectoderm
comes for a time into contact with the entodermal lining of the
fore-gut, and the two layers unite along the floors of the grooves to
form thin closing membranes between the fore-gut and the exterior.

Later the mesoderm again penetrates between the entoderm and the
ectoderm. In gill-bearing animals the closing membranes disappear, and
the grooves become complete clefts, the gill-clefts, opening from the
pharynx on to the exterior; perforation, however, does not occur in
birds or mammals.

[block 4] The grooves separate a series of rounded bars or arches, the
branchial or visceral arches, in which thickening of the mesoderm takes
place (Figs. 40 and 41).

The dorsal ends of these arches are attached to the sides of the head,
while the ventral extremities ultimately meet in the middle line of the
neck.

[block 5] In all, six arches make their appearance, but of these only
the first four are visible externally.

The first arch is named the mandibular, and the second the hyoid; the
others have no distinctive names.

In each arch a cartilaginous bar, consisting of right and left halves,
is developed, and with each of these there is one of the primitive
aortic arches.

Thank you for any help or indication on how to solve this problem.

radoulov · October 11, 2011, 11:22am

awk 'END {
  for (i = 0; ++i <= idx;)
    printf "%s\n", p[i]
  }
/\$newpage/ {
    t[r]++ || p[++idx] = r
    r = x; next
    }
{
  r = r ? r RS $0 : $0  
  }' infile

Edit: The above code will not print the last paragraph if it's not a duplicate.
This version should handle that case correctly:

awk 'END {
  for (i = 0; ++i <= idx;)
    printf "%s\n", p[i]
  if (p[i - 1] != r)
    print r  
  }
/\$newpage/ {
    t[r]++ || p[++idx] = r
    r = x; next
    }
{
  r = r ? r RS $0 : $0  
  }' infile

samask · October 11, 2011, 11:56am

Wow, thank you so much Radoulov!

That AWK code is just beautiful, and it works perfectly.

The only minor issue is that not all the blocks of text are separated by the same number of new lines.

Sometimes $newpage is preceded (or followed) by a different number of newlines. In those cases, the code does not delete the duplicate block.

But I can clean up the texts beforehand with some regex.

I will study your code to improve my tiny awk skills.

Thank you so much once again.

radoulov · October 11, 2011, 12:05pm

This should handle multiple trailing newlines (the multiple leading newlines should be already OK):

awk 'END {
  for (i = 0; ++i <= idx;)
    printf "%s\n", p[i]
  if (p[i - 1] != r)
    print r  
  }
/\$newpage/ {
    sub(/\n\n*$/, "\n", r)
    t[r]++ || p[++idx] = r
    r = x; next
    }
{
  r = r ? r RS $0 : $0  
  }' infile

Let me know how it goes!

samask · October 11, 2011, 12:21pm

I simplified a test case, with different numbers of newlines:


[block 4] The grooves separate a series of rounded bars or arches, the
branchial or visceral arches, in which thickening of the mesoderm takes
place (Figs. 40 and 41).


$newpage

[block 4] The grooves separate a series of rounded bars or arches, the
branchial or visceral arches, in which thickening of the mesoderm takes
place (Figs. 40 and 41).
$newpage

[block 5] In all, six arches make their appearance, but of these only
the first four are visible externally.



$newpage

[block 5] In all, six arches make their appearance, but of these only
the first four are visible externally.

With that test case, I get:

[block 4] The grooves separate a series of rounded bars or arches, the
branchial or visceral arches, in which thickening of the mesoderm takes
place (Figs. 40 and 41).

[block 4] The grooves separate a series of rounded bars or arches, the
branchial or visceral arches, in which thickening of the mesoderm takes
place (Figs. 40 and 41).
[block 5] In all, six arches make their appearance, but of these only
the first four are visible externally.

[block 5] In all, six arches make their appearance, but of these only
the first four are visible externally.

radoulov · October 11, 2011, 12:27pm

OK, try this:

awk 'END {
  for (i = 0; ++i <= idx;)
    printf "%s\n\n", p[i]
  if (p[i - 1] != r)
    print r  
  }
/\$newpage/ {
    sub(/\n\n*$/, x, r)
    t[r]++ || p[++idx] = r
    r = x; next
    }
{
  r = r ? r RS $0 : $0  
  }' infile

samask · October 11, 2011, 12:33pm

Excellent! It works flawlessly now.

I felt bad that I bothered you to tweak the code, but now I am happy. This way, looking at how you have improved it, I can learn even more.

Thank you so much for your valuable advice.

PS: If it is all right, I added a rating to this thread, but it should really be a rating to your nice code, more than the thread itself.

radoulov · October 11, 2011, 12:35pm

It's OK, you're welcome!
More (difficult) questions, more fun for us!

binlib · October 11, 2011, 7:28pm

Yes, we like challenges. If you have gawk, you can do:

gawk '_ != (_ = $0)' RS='\n*\\$newpage\n*|\n$' ORS='\n\n' infile

genehunter · October 11, 2011, 7:51pm

Dear Radoulov,
Is it possible to explain the solution? People like me who love to use awk could learn from real-life examples like the one the OP posted. We will never be as good as you are, but we could at least understand a tiny bit at a time.

radoulov · October 12, 2011, 3:21am

@binlib,
nice one!

samask · October 12, 2011, 4:40am

@binlib,

I do have gawk, and I can confirm that your code also works perfectly.

Thank you!

radoulov · October 12, 2011, 4:51am

Sure,
I'll try.

The code is:

awk 'END {
  for (i = 0; ++i <= idx;)
    printf "%s\n\n", p[i]
  if (p[i - 1] != r)
    print r  
  }
/\$newpage/ {
    sub(/\n\n*$/, x, r)
    t[r]++ || p[++idx] = r
    r = x; next
    }
{
  r = r ? r RS $0 : $0  
  }' infile

We have 3 rules (3 pattern/action pairs):

pattern { action }

In an awk rule, either the pattern or the action can be omitted, but not both.

One:

END {
  ...
  }

The pattern is the END special pattern.
The action is executed once the pattern matches.

Two:

/\$newpage/ {
  ...
  }

The pattern matches the regular expression between the //,
in this case it's rather simple: the literal string $newpage.

Three:

{
  ...
  }

Here the pattern is omitted, so (by default) the action is performed
for every record read. This one will be executed first (if the first input line
doesn't contain the pattern $newpage).

The END rule/block will be executed once all the input has been read. Don't be confused
that it appears first: you can place it in the middle if you wish,
and that won't change the semantics. By the way, the old awk (/bin/awk on Solaris,
for example) doesn't like misplaced BEGIN/END blocks:

$ awk 'END{ print "end" } NR < 3 { print "zero"; next } { exit }' </dev/random
awk: syntax error near line 1
awk: bailing out near line 1

The new one works fine:

$ nawk 'END{ print "end" } NR < 3 { print "zero"; next } { exit }' </dev/random
zero
zero
end

As I said, most likely (given the input provided by @samask),
the first action to be executed will be the following:

r = r ? r RS $0 : $0

This is an assignment: we're assigning a value to the variable r
(r stands for record in my head; you could name it differently if you wish).
On the right side of the assignment statement I'm using the ternary operator;
its syntax could be described like this:

expression ? return_this_if_true : return_this_otherwise

If r already contains
some value (actually it's: if r is different than null string or 0, more on this later), append a newline
(the current Record Separator - RS) and the current record ($0) to its value, otherwise assign the value
of the current record ($0).
In other words, build a long string concatenating all the records.
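
To see that accumulation in isolation, here is a minimal sketch. The function name join_records and the three-line input are made up for illustration; only the r = r ? r RS $0 : $0 statement comes from the code above.

```shell
# Hypothetical helper: glue all input lines into one string, the same way
# the third rule builds r. RS is the default record separator (a newline).
join_records() {
  awk '{ r = r ? r RS $0 : $0 }     # while r is empty (false), r becomes $0
       END { print "[" r "]" }'     # brackets make the string boundaries visible
}

printf 'one\ntwo\nthree\n' | join_records
```

As a side effect, blank lines at the start of a block are dropped: as long as r is empty the test is false, so r is simply set to the (empty) current line again.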

While building the string named r, awk reaches a record matching the pattern $newpage and executes
the actions associated with that pattern:

sub(/\n\n*$/, x, r)
t[r]++ || p[++idx] = r
r = x; next

@samask said that trailing newlines should be ignored when comparing
the text paragraphs. At this point, given the first input provided, r has
the following value:

[block 1] The Branchial or Visceral Arches and Pharyngeal Pouches. -In
the lateral walls of the anterior part of the fore-gut five pharyngeal
pouches appear (Fig. 42).

The first thing to do is to get rid of the trailing newlines in the paragraph:

sub(/\n\n*$/, x, r)

Substitute (sub) one or more newlines at the end of the string r (\n\n*$) with x.
x is an uninitialized variable, thus its value is null (or 0, depending on the usage). You could use "" here,
if you find it more readable.
So here the trailing newlines are removed from the value of r.
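
A small sketch of that substitution on its own, assuming a made-up string with three trailing newlines (the function name strip_trailing is hypothetical):

```shell
# Hypothetical helper: show a string after its trailing newlines are
# removed with sub(). x is never assigned, so it acts as the empty string.
strip_trailing() {
  awk 'BEGIN {
    r = "some text\n\n\n"
    n = sub(/\n\n*$/, x, r)        # n is the number of substitutions made (0 or 1)
    printf "%d [%s]\n", n, r
  }'
}

strip_trailing
```

This should print 1 [some text]: the whole run of newlines is matched once and replaced with nothing.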

t[r]++ || p[++idx] = r

The arrays in awk are associative (indexed by strings) and sparse.
The order in which the elements appear when scanning an array
is pseudo-random (GNU awk, mawk and maybe TAWK support extensions to deal with this issue,
but most commercial Unix awk implementations don't provide such extensions).
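
These array properties can be checked with a short sketch like the following. The function name array_demo and the values are made up for illustration.

```shell
# Hypothetical helper illustrating string indexes and sparseness.
array_demo() {
  awk 'BEGIN {
    a[1] = "one"; a["1"] = "uno"   # same element: the number 1 becomes the string "1"
    a[1000] = "far away"           # sparse: nothing is created for 2..999
    for (k in a) n++               # the order of k is not defined, so only count
    print n, a[1], ((500 in a) ? "500 set" : "500 unset")
  }'
}

array_demo
```

The array ends up with two elements only, and the in test does not create the element it asks about.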

So I decided to use two arrays: t and p.
The first one - t - is used to identify the unique paragraphs, because
associative arrays guarantee uniqueness (the values get overwritten).
Note that the OP said that repeated paragraphs are always grouped together,
but this code will handle non-consecutive duplicates as well.

t[r]++ is a common awk idiom, it works like this:

Consider the following values:

zsh-4.3.12[t]% print -l {1..5} {2..7}
1
2
3
4
5
2
3
4
5
6
7

Some values are unique (1, 6, 7), others have duplicates (2-5).
This is what I need:

zsh-4.3.12[t]% print -l {1..5} {2..7} | awk '{ print $1, "=>", t[$1]++ }'
1 => 0
2 => 0
3 => 0
4 => 0
5 => 0
2 => 1
3 => 1
4 => 1
5 => 1
6 => 0
7 => 0

Thus the expression t[r]++ returns 0 only the first time a value is seen.
So the logic is:

t[r]++ || ...

|| is the logical OR operator. We need it because we want to perform an action when the expression evaluates to false
(in awk, as far as boolean logic is concerned, an expression is false
when its computed value is the null string "" (when used as a string) or 0 (when used as a number); everything else is true).
So when we see a paragraph (r) for the first time, t[r]++ returns 0 and we create a new element in the array p (p for paragraphs);
this time we use numeric indexes (even if they get converted to strings anyway).

p[++idx] = r

The first paragraph is in p[1], the second in p[2] etc.
After that we need to reset the value of r and execute the next statement
in order to make the record containing the pattern $newpage invisible to the
following statement r = r ? ... .
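
The two-array idiom can be tried on the same numbers used earlier. This is only a sketch: the function name first_seen is hypothetical, $0 stands in for r, and the assignment is parenthesized to make the grouping explicit.

```shell
# Hypothetical helper: keep the first occurrence of every line, in input order.
first_seen() {
  awk '{ t[$0]++ || (p[++idx] = $0) }    # right side runs only when t[$0]++ is 0
       END { for (i = 1; i <= idx; i++) print p[i] }'
}

printf '%s\n' 1 2 3 4 5 2 3 4 5 6 7 | first_seen
```

The output is the numbers 1 to 7, one per line: t decides whether a value is new, and p remembers the order in which the new values arrived.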

END {
  for (i = 0; ++i <= idx;)
    printf "%s\n\n", p[i]
  if (p[i - 1] != r)
    print r  
  }

At the end we just dump the content of the array containing the paragraphs in order.
The last if checks whether we already printed the last paragraph (this is needed because we fill the array p in the action
of the $newpage rule, so a final paragraph that is not followed by $newpage never gets stored).
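
As an alternative to that final check, one could route the last paragraph through the same code path as all the others. The following is a sketch of such a variant, not the code posted above: the names dedupe_blocks, store and seen are made up, and it assumes an awk with user-defined functions (any POSIX awk).

```shell
# Hypothetical variant: one function stores a block, and END calls it too.
dedupe_blocks() {
  awk '
    # Trim trailing newlines from r, store it if it is new, then reset r.
    function store() {
      sub(/\n+$/, "", r)
      if (r != "" && !(r in seen)) { seen[r] = 1; p[++idx] = r }
      r = ""
    }
    /\$newpage/ { store(); next }         # a tag closes the current block
    { r = (r != "") ? r RS $0 : $0 }      # otherwise keep building the block
    END {
      store()                             # the last block has no closing tag
      for (i = 1; i <= idx; i++) printf "%s\n\n", p[i]
    }
  ' "$@"
}

printf 'A\nB\n\n\n$newpage\n\nA\nB\n$newpage\n\nC\n\n\n\n$newpage\n\nC\n' | dedupe_blocks
```

With the sample input (modelled on the test case with uneven newlines), each of the two distinct blocks is printed once, followed by an empty line.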

@binlib provided a GNU awk solution. He's using an extremely powerful gawk feature (even Perl doesn't have this one,
at least not as a command line option, it could be simulated, of course) - a regular expression as record separator.
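
A more verbose sketch of the same idea is shown below. It assumes that the awk in use accepts a regular expression as RS, as gawk and mawk do; this is an extension, so other implementations may treat RS differently. The function name dedupe_regex_rs and the variable prev are made up, and unlike the two-array code this version only removes consecutive duplicates.

```shell
# Hypothetical helper: let the record separator swallow the tag and the
# newlines around it, so that every record is one whole block.
dedupe_regex_rs() {
  awk -v RS='\n*[$]newpage\n*' '
    { sub(/\n+$/, "") }                        # the last record may keep the final newline(s)
    $0 != "" && $0 != prev { print $0 "\n" }   # print a block unless it repeats the previous one
    { prev = $0 }
  ' "$@"
}

printf 'a\n\n\n$newpage\n\na\n$newpage\nb\n' | dedupe_regex_rs
```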

Hope this helps.

samask · October 12, 2011, 5:04am

@radoulov,

That is an *awesome* explanation.

I am so grateful for the code and the lesson.

Thank you!