My main purpose is to convert every occurrence in the input file of the strings listed in a reference file into "XXXXXXXXXXXXXXX" and write the result to an output file.
Thanks for any advice or suggestion.
sed 's/DFGDSFIODUFIODSUF/XXXXXXXXXXXXXXX/g' input_file > output_file
But the sed command does not seem to work well when dealing with huge data.
Do you have a better suggestion, perhaps a Perl script, to achieve my goal?
Thanks for any advice.
cat reference.txt
CGTGCFTGCGTFREDG
PEOGDKGJDGKLJGKL
DFGDSFIODUFIODSUF
FSDOFJSODIFJSIODFJ
DSFSDFDFSDOFJFOSF
SDFOSDJFOJFPPIPIOP
for i in $(cat reference.txt)
do
sed -i "s/$i/XXXXXXXXXXXXXXX/g" input   # if your sed doesn't support the -i option, use the perl command below
# or
perl -i -pe "s/$i/XXXXXXXXXXXXXXX/g" input
done
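The loop above rereads the whole input once per pattern, which gets slow with a big reference file. A possible alternative is to generate one sed script from reference.txt and make a single pass over the input — a sketch, assuming GNU sed and patterns that contain no `/` or regex metacharacters:

```shell
# turn each reference line into an s/// command, then apply them all in one pass
sed 's|.*|s/&/XXXXXXXXXXXXXXX/g|' reference.txt > replace.sed
sed -f replace.sed input > output
```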
Thanks for your suggestion, it works.
But it seems not to work if my input sequence content looks like this:
problem 1:
>sample_10
TRAAERASDDSDSTERTRTRTERTEDFGDSFIODU
FIODSUFASDSDASFSAFEDFFDFDFDFDFDFDA
.
.
It seems I need to remove the newlines (\n) in the content before starting the conversion, am I right?
The sed command I tried previously works as well;
it just takes a long time when loading huge data for conversion.
Thanks again for your advice ya
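One way to handle the wrapped sequences is to join each record's sequence lines first, so a pattern that was split across a line break can still match — a minimal awk sketch (input and input.oneline are placeholder file names):

```shell
# print headers as-is, accumulate sequence lines, emit one joined line per record
awk '/^>/ {if (seq) print seq; print; seq = ""; next}
     {seq = seq $0}
     END {if (seq) print seq}' input > input.oneline
```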
---------- Post updated at 06:43 AM ---------- Previous update was at 05:47 AM ----------
Hi rdcwayx,
The command you suggested will write the result back into the same file (the input file), am I right?
Besides that, if my reference file contains around 124464 reads (each read around 22 bases long) and the input file contains around 115478631 bases, do you have any other suggestion to speed up the process?
Thanks
open FH, "<", "listfile" or die $!;
chomp(my @pat = <FH>);
close FH;

open FH, "<", "filetobereplaced" or die $!;
while (my $str = <FH>) {
    foreach my $pat (@pat) {
        # replace all occurrences of this pattern; stop after the first pattern that hits
        my $n = ($str =~ s/\Q$pat\E/XXXXXXXXXXXXXXX/g);
        last if $n > 0;
    }
    print $str;
}
close FH;
awk '
NR==FNR {t = $0; gsub(/./, "x"); a[t] = $0; next}
{
    for (i in a) if ($0 ~ i) sub(i, a[i])
}1
' referral.data input.data
>sample_1
SDFDSKLxxxxxxxxxxxxDSFAS
>sample_2
xxxxxxxxxDSUFSDFDSFSDFOSDJFOJFPPIPIOPTPHHPHLHL
Thanks again and a lot, rdcwayx.
You are an expert in the awk language ^^
Do you know how to replace a whole line's content with "x" based on a threshold of consecutive "x" in the line?
e.g. if a line in the output file generated by your awk script contains more than 10 consecutive x, I plan to replace the whole line with x.
awk '
NR==FNR {t = $0; gsub(/./, "x"); a[t] = $0; next}
{
    for (i in a) if ($0 ~ i) sub(i, a[i])
}1
' referral.data input.data
>sample_1
SDFDSKLxxxxxxxxxxxxD
xxxxxxxxxDSUFSDFDSFS
>sample_2
xxxxxxxxxDSUFSDFDSFSDFOSDJFOJFPPIPIOPTPHHPHLHL
Your script shows that sample_1 has 12 consecutive x in its content, which is above my threshold (>= 10 consecutive x in a line).
My desired output will look like this at this moment:
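One reading of that threshold rule can be sketched in awk, assuming any sequence line containing a run of 10 or more consecutive x should be masked entirely while header lines are left alone (masked.data stands for the output of the earlier awk script):

```shell
awk '/^>/ {print; next}            # keep headers untouched
     /xxxxxxxxxx/ {gsub(/./, "x")} # 10+ consecutive x: mask the whole line
     {print}' masked.data > final.data
```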
If my referral data (3560 different entries) and my input data (around 20MB) are very large, do you have a better solution to improve the performance of replacing the data based on the referral data?
Thanks first
Hi rdcwayx,
Why do both of the commands below seem not to work properly?
for i in $(cat reference.txt)
do
sed -i "s/$i/XXXXXXXXXXXXXXX/g" input   # if your sed doesn't support the -i option, use the perl command below
# or
perl -i -pe "s/$i/XXXXXXXXXXXXXXX/g" input
done
Did I miss anything?
Thanks for the advice ya ^^
---------- Post updated at 11:39 PM ---------- Previous update was at 11:14 PM ----------
Hi rdcwayx,
I was just wondering whether we can write the result into another new file using your script below:
for i in $(cat reference.txt)
do
sed -i "s/$i/XXXXXXXXXXXXXXX/g" input   # if your sed doesn't support the -i option, use the perl command below
# or
perl -i -pe "s/$i/XXXXXXXXXXXXXXX/g" input > input.out   # I tried this, but it did not work :(
done
Thanks for advice ya ^^
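The -i flag sends the edited text back into the file itself, so the shell redirect receives nothing. One way to keep the original untouched is to copy it first and run the in-place loop on the copy — a sketch, assuming GNU sed:

```shell
cp input input.out
while read -r pat; do
  sed -i "s/$pat/XXXXXXXXXXXXXXX/g" input.out   # edit only the copy
done < reference.txt
```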
---------- Post updated 10-14-10 at 12:00 AM ---------- Previous update was 10-13-10 at 11:39 PM ----------
I just found out that the command below:
awk '
NR==FNR {t = $0; gsub(/./, "x"); a[t] = $0; next}
{
    for (i in a) if ($0 ~ i) sub(i, a[i])
}1
' referral.data input.data
seems to have missed replacing some of the referral data content in the input data.
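One likely cause is that sub() replaces only the first occurrence of a pattern in a line, so a pattern appearing twice on one line is only half masked; patterns split across line breaks (as seen earlier) would also be missed. Switching to gsub() covers the repeated-occurrence case — a sketch:

```shell
awk '
NR==FNR {t = $0; gsub(/./, "x"); a[t] = $0; next}
{for (i in a) gsub(i, a[i])}   # gsub: every occurrence, not just the first
1' referral.data input.data > output.data
```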