How to remove duplicate sentences/strings in Perl?

Hi,

I have two strings like this in an array:

For example:

@a=("Brain aging is associated with a progressive imbalance between intracellular concentration of Reactive Oxygen Species","Brain aging is associated with a progressive imbalance between intracellular concentration of Reactive Oxygen Species");

These are actually duplicate sentences.

I want to remove the duplicate strings from the array, and I have many duplicate strings like this in it.

How do I remove duplicate sentences/strings from an array in Perl?

I don't want to use any module to remove the duplicates.

Any solution?

with regards
Vanitha

Well, it can be as simple as this:

@a = ("a", "b", "c", "a", "b", "d");
@a = (map { $_u{$_} = 1; (); } @a or keys %_u);
$, = "\n";
print @a;

Or, if you also want to preserve the order:

@a = qw{a b c a b d};
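print join "\n", grep !$_{$_}++, @a;

Another way, counting occurrences in a hash (note that keys returns the strings in no particular order):

@arr=('a','b','a','b','c');
$hash{$_}++ foreach @arr;
print join ",",keys %hash;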
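
If one wanted to stay with the hash-and-keys approach and still get the original order back, one option is to record the position at which each string first appears and sort the keys by that position. This is only a sketch; the names %first_index and @unique are made up for illustration and do not come from the posts above.

```perl
use strict;
use warnings;

my @arr = ('a', 'b', 'a', 'b', 'c');

# Remember the position at which each string first appears.
my %first_index;
for my $i (0 .. $#arr) {
    $first_index{ $arr[$i] } = $i unless exists $first_index{ $arr[$i] };
}

# keys() returns the strings in no particular order, so sort them
# by the recorded position to restore the original order.
my @unique = sort { $first_index{$a} <=> $first_index{$b} } keys %first_index;

print join(",", @unique), "\n";   # prints a,b,c
```

A plain sort would order the strings alphabetically, which is not the same as the order in which they first appeared; sorting by recorded position avoids that.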

Hi,

I tried the above methods. They remove the duplicates, but the order is not retained, and I want to retain the order as well. I used sort, but it still gives a different order. Here is my array:

@arr=('TDP-43 is a highly conserved, 43-kDa RNA-binding protein implicated to play a role in transcription repression, nuclear organization, and alternative splicing. More recently, this factor has been identified as the major disease protein of several neurodegenerative diseases, including frontotemporal lobar degeneration with ubiquitin-positive inclusions and amyotrophic lateral sclerosis.','For the splicing activity, the factor has been shown to be mainly an exon-skipping promoter.','In this study using the survival of motor neuron (SMN) minigenes as the reporters in transfection assay, we show for the first time that TDP-43 could also act as an exon-inclusion factor. Furthermore, both RNA-recognition motif domains are required for its ability to enhance the SMN2 exon 7 inclusion.','Combined protein-immunoprecipitation and RNA-immunoprecipitation experiments also suggested that this exon inclusion activity might be mediated by multimeric complex(es) consisting of this protein interacting with other splicing factors, including Htra2-beta1.','Our data further evidence TDP-43 as a multifunctional RNA-binding protein for a diverse set of cellular activities.','TDP-43 is a highly conserved, 43-kDa RNA-binding protein implicated to play a role in transcription repression, nuclear organization, and alternative splicing. More recently, this factor has been identified as the major disease protein of several neurodegenerative diseases, including frontotemporal lobar degeneration with ubiquitin-positive inclusions and amyotrophic lateral sclerosis.','For the splicing activity, the factor has been shown to be mainly an exon-skipping promoter.','In this study using the survival of motor neuron (SMN) minigenes as the reporters in transfection assay, we show for the first time that TDP-43 could also act as an exon-inclusion factor. Furthermore, both RNA-recognition motif domains are required for its ability to enhance the SMN2 exon 7 inclusion.','Combined protein-immunoprecipitation and RNA-immunoprecipitation experiments also suggested that this exon inclusion activity might be mediated by multimeric complex(es) consisting of this protein interacting with other splicing factors, including Htra2-beta1.','Our data further evidence TDP-43 as a multifunctional RNA-binding protein for a diverse set of cellular activities.');

The output I got was:

In this study using the survival of motor neuron (SMN) minigenes as the reporters in transfection assay, we show for the first time that TDP-43 could also act as an exon-inclusion factor. Furthermore, both RNA-recognition motif domains are required for its ability to enhance the SMN2 exon 7 inclusion.For the splicing activity, the factor has been shown to be mainly an exon-skipping promoter.TDP-43 is a highly conserved, 43-kDa RNA-binding protein implicated to play a role in transcription repression, nuclear organization, and alternative splicing. More recently, this factor has been identified as the major disease protein of several neurodegenerative diseases, including frontotemporal lobar degeneration with ubiquitin-positive inclusions and amyotrophic lateral sclerosis.Combined protein-immunoprecipitation and RNA-immunoprecipitation experiments also suggested that this exon inclusion activity might be mediated by multimeric complex(es) consisting of this protein interacting with other splicing factors, including Htra2-beta1.Our data further evidence TDP-43 as a multifunctional RNA-binding protein for a diverse set of cellular activities.

But I want the output like this:

TDP-43 is a highly conserved, 43-kDa RNA-binding protein implicated to play a role in transcription repression, nuclear organization, and alternative splicing. More recently, this factor has been identified as the major disease protein of several neurodegenerative diseases, including frontotemporal lobar degeneration with ubiquitin-positive inclusions and amyotrophic lateral sclerosis. For the splicing activity, the factor has been shown to be mainly an exon-skipping promoter. In this study using the survival of motor neuron (SMN) minigenes as the reporters in transfection assay, we show for the first time that TDP-43 could also act as an exon-inclusion factor. Furthermore, both RNA-recognition motif domains are required for its ability to enhance the SMN2 exon 7 inclusion. Combined protein-immunoprecipitation and RNA-immunoprecipitation experiments also suggested that this exon inclusion activity might be mediated by multimeric complex(es) consisting of this protein interacting with other splicing factors, including Htra2-beta1. Our data further evidence TDP-43 as a multifunctional RNA-binding protein for a diverse set of cellular activities.

How can I change the order so that it prints like this?

With regards
Vanitha

Did you read my post?

$ cat p
#! /usr/bin/env perl

@arr =(
'TDP-43 is a highly conserved, 43-kDa RNA-binding protein implicated to play a role in transcription repression, nuclear organization, and alternative splicing. More recently, this factor has been identified as the major disease protein of several neurodegenerative diseases, including frontotemporal lobar degeneration with ubiquitin-positive inclusions and amyotrophic lateral sclerosis.',
'For the splicing activity, the factor has been shown to be mainly an exon-skipping promoter.',
'In this study using the survival of motor neuron (SMN) minigenes as the reporters in transfection assay, we show for the first time that TDP-43 could also act as an exon-inclusion factor. Furthermore, both RNA-recognition motif domains are required for its ability to enhance the SMN2 exon 7 inclusion.',
'Combined protein-immunoprecipitation and RNA-immunoprecipitation experiments also suggested that this exon inclusion activity might be mediated by multimeric complex(es) consisting of this protein interacting with other splicing factors, including Htra2-beta1.',
'Our data further evidence TDP-43 as a multifunctional RNA-binding protein for a diverse set of cellular activities.',
'TDP-43 is a highly conserved, 43-kDa RNA-binding protein implicated to play a role in transcription repression, nuclear organization, and alternative splicing. More recently, this factor has been identified as the major disease protein of several neurodegenerative diseases, including frontotemporal lobar degeneration with ubiquitin-positive inclusions and amyotrophic lateral sclerosis.',
'For the splicing activity, the factor has been shown to be mainly an exon-skipping promoter.',
'In this study using the survival of motor neuron (SMN) minigenes as the reporters in transfection assay, we show for the first time that TDP-43 could also act as an exon-inclusion factor. Furthermore, both RNA-recognition motif domains are required for its ability to enhance the SMN2 exon 7 inclusion.',
'Combined protein-immunoprecipitation and RNA-immunoprecipitation experiments also suggested that this exon inclusion activity might be mediated by multimeric complex(es) consisting of this protein interacting with other splicing factors, including Htra2-beta1.',
'Our data further evidence TDP-43 as a multifunctional RNA-binding protein for a diverse set of cellular activities.'
);

$, = "\n\n";
$\ = "\n";

print grep !$_{$_}++, @arr;

$ ./p
TDP-43 is a highly conserved, 43-kDa RNA-binding protein implicated to play a role in transcription repression, nuclear organization, and alternative splicing. More recently, this factor has been identified as the major disease protein of several neurodegenerative diseases, including frontotemporal lobar degeneration with ubiquitin-positive inclusions and amyotrophic lateral sclerosis.

For the splicing activity, the factor has been shown to be mainly an exon-skipping promoter.

In this study using the survival of motor neuron (SMN) minigenes as the reporters in transfection assay, we show for the first time that TDP-43 could also act as an exon-inclusion factor. Furthermore, both RNA-recognition motif domains are required for its ability to enhance the SMN2 exon 7 inclusion.

Combined protein-immunoprecipitation and RNA-immunoprecipitation experiments also suggested that this exon inclusion activity might be mediated by multimeric complex(es) consisting of this protein interacting with other splicing factors, including Htra2-beta1.

Our data further evidence TDP-43 as a multifunctional RNA-binding protein for a diverse set of cellular activities.
$ 

(radoulov was faster, shorter and probably better. The code below uses the same principle, but is very verbose.)

This program should do it:

@a = qw/a c b a b d/;
%b = ();   # empty hash; {} would store a hash reference as a stray key
@c = ();
foreach(@a) {
        if(!$b{$_}){
                push @c, $_;
                $b{$_}++;
        }
}
print "Output:\n" . join("\n",@c) . "\n";

Feel free to compact it:

@a = qw/a c b a b d/; %b = (); @c = ();
foreach (@a) { push @c, $_ if !$b{$_}++; }
print "Output:\n" . join("\n",@c) . "\n";

Real Perl gurus can probably make it even more cryptic, reusing variables and special variables :)

Basically, you just use a hash table to store the strings, and for each string you check whether it is already in the hash table (which is an O(1) operation on average).
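
The same idea can be wrapped in a small subroutine that also runs under strict and warnings. This is a minimal sketch, not anyone's definitive version; the name unique_in_order is hypothetical and only used here for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (name not from the thread): returns the first
# occurrence of each string, in the order the strings were passed in.
sub unique_in_order {
    my @input = @_;
    my %seen;      # strings encountered so far
    my @result;
    for my $item (@input) {
        next if $seen{$item};   # hash lookup: already stored, skip it
        $seen{$item} = 1;
        push @result, $item;
    }
    return @result;
}

my @words = unique_in_order(qw/a c b a b d/);
print join("\n", @words), "\n";
```

Because the hash is declared with my inside the subroutine, every call starts with an empty hash, so the helper can be reused on several arrays without leftovers from an earlier call.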

OK, I adjusted my previous example so that it now preserves ordering:

# @arr = (....);
@arr = map {
	my @r = ($_u{$_}?():($_));
	$_u{$_} = 1;
	@r;
} @arr;
$, = "\n";
print @arr;

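The main idea: if you have a lot of these lines to process, searching the list of already-seen lines linearly for every new line makes the total work grow roughly quadratically with the input. With a hash, each check takes roughly constant time, so the total work grows only about linearly.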
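
To make the difference concrete, here is a rough sketch that removes duplicates both ways and counts the work done. The test data (200 made-up strings, each repeated 5 times) and all variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Made-up input: 200 distinct strings, repeated in 5 rounds.
my @input;
for my $round (1 .. 5) {
    push @input, "sentence $_" for 1 .. 200;
}

# Variant 1: linear search through the strings kept so far.
my @by_scan;
my $comparisons = 0;
ITEM: for my $item (@input) {
    for my $kept (@by_scan) {
        $comparisons++;
        next ITEM if $kept eq $item;   # duplicate found, skip it
    }
    push @by_scan, $item;
}

# Variant 2: one hash lookup per input string.
my (%seen, @by_hash);
my $lookups = 0;
for my $item (@input) {
    $lookups++;
    push @by_hash, $item unless $seen{$item}++;
}

print "scan: $comparisons comparisons, hash: $lookups lookups\n";
```

Both variants keep the same strings in the same order; only the amount of work differs, and the gap widens as the number of distinct strings grows.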

Hi,

Thanks for the reply!

I want to replace the sentences with other sentences that are in another array.

I tried using the substitution operator, but it's not working properly.

How can I substitute the sentences?

Is there any method to substitute sentences/strings?

With regards
Vanitha

Not sure what you are referring to. Examples please.