Execution Problems with loading huge data content and converting it

Hi,

I have a long list of referred file content:

CGTGCFTGCGTFREDG
PEOGDKGJDGKLJGKL
DFGDSFIODUFIODSUF
FSDOFJSODIFJSIODFJ
DSFSDFDFSDOFJFOSF
SDFOSDJFOJFPPIPIOP
.
.
.

Input file content:

>sample_1
SDFDSKLFKDSLSDFSDFDFGDSFIODUFIODSUFSDDSFDSSDFDSFAS

>sample_2
DFDFGDSFIODUFIODSUFSDFDSFSDFOSDJFOJFPPIPIOPTPHHPHLHL
.
.

Desired output:

>sample_1
SDFDSKLFKDSLSDFSDFXXXXXXXXXXXXXXXSDDSFDSSDFDSFAS

>sample_2
DFXXXXXXXXXXXXXXXSDFDSFXXXXXXXXXXXXXXXTPHHPHLHL
.
.

My main purpose is to convert the parts of the input file's content that are the same as the content in the referred file into "XXXXXXXXXXXXXXX" and write the result to an output file.
Thanks for any advice or suggestion.

You have your requirement defined ...

But what is the problem? Do you want us to give you a solution, or do you have a solution with issues?!

I would suggest solving it with perl, if you already know perl.

Hi thegeek,
I tried using the sed command.

sed 's/DFGDSFIODUFIODSUF/XXXXXXXXXXXXXXX/g' input_file > output_file

But it seems like the sed command does not work well when dealing with huge data :frowning:
Do you have a better suggestion, maybe a perl script, to achieve my goal?
Thanks for any advice.

So what's the problem with sed?

cat reference.txt

CGTGCFTGCGTFREDG
PEOGDKGJDGKLJGKL
DFGDSFIODUFIODSUF
FSDOFJSODIFJSIODFJ
DSFSDFDFSDOFJFOSF
SDFOSDJFOJFPPIPIOP

for i in $(cat reference.txt)
do
  sed -i "s/$i/XXXXXXXXXXXXXXX/g" input      # if your sed don't support -i option, replace by below perl command
# or
  perl -i -pe  "s/$i/XXXXXXXXXXXXXXX/g" input 
done
1 Like

Hi rdcwayx,

Thanks for your suggestion. It works :slight_smile:
But it seems like it does not work if my input sequence content is like:

problem 1:
>sample_10
TRAAERASDDSDSTERTRTRTERTEDFGDSFIODU
FIODSUFASDSDASFSAFEDFFDFDFDFDFDFDA
.
.

It seems like I need to remove the newlines (\n) in the content before starting the conversion, am I right? (Maybe something like the awk sketch at the end of this post?)
The sed command I tried previously works as well; it just takes time when converting huge data :frowning:
Thanks again for the advice ya :slight_smile:
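
Here is a minimal, untested sketch of what I mean (the file names input and input.oneline are just placeholders); it should join each record's wrapped sequence lines onto a single line, so a pattern can no longer be split by a newline:

awk '/^>/ {if (seq) print seq; print; seq=""; next} {seq=seq $0} END {if (seq) print seq}' input > input.oneline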

---------- Post updated at 06:43 AM ---------- Previous update was at 05:47 AM ----------

Hi rdcwayx,

The command you suggested will write the result back into the same file (the input file), am I right?
Besides that, if my referred file has around 124464 reads (each read around 22 bases long) and the input file has around 115478631 bases, do you have any other suggestion to speed up the process?
Thanks :slight_smile:

open FH,"<listfile";
my @tmp = <FH>;
my @pat = map {s/\n//;$_} @tmp;
close FH;
open FH,"<filetobereplaced";
while(<FH>){
  my $str = $_;
  foreach my $pat (@pat){
     my $tmp = $str=~s/$pat/<>/g;
     last if $tmp > 0;
  }
  print $str;
}
close FH;
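
Saved as, say, replace.pl (just a placeholder name; listfile and filetobereplaced inside are the names to adjust), it prints to standard output, so the result can be redirected into a new file:

perl replace.pl > output.txt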
1 Like

Hi summer,

Thanks for your perl script :slight_smile:
I'm trying it now.
My referred list has around 124464 reads. Hopefully it won't take too long :slight_smile:
Thanks again ya.

Hi Summer,

If my referral data looks like this:

CGTGCF
PEOGDKGJDGKL
DFGDSFIODU
FSDOSOJSIODFJ
DSFSDFFJFOSF
SDFOSDIOP
.
.

I plan to replace those variable-length referral data with "X" based on their respective lengths.
Input data:

>sample_1
SDFDSKLPEOGDKGJDGKLDSFAS

>sample_2
SDFOSDIOPDSUFSDFDSFSDFOSDJFOJFPPIPIOPTPHHPHLHL
.
.

Desired output:

>sample_1
SDFDSKLXXXXXXXXXXXXDSFAS

>sample_2
XXXXXXXXXDSUFSDFDSFSDFOSDJFOJFPPIPIOPTPHHPHLHL
.
.

Thanks again for your advice.

awk '
NR==FNR {t=$0;gsub(/./,"x");a[t]=$0;next}     # map each referral line to a same-length string of x
{
        for (i in a) if ($0~i) sub(i,a[i])    # mask the first occurrence of any matching referral
}1
' referral.data input.data

>sample_1
SDFDSKLxxxxxxxxxxxxDSFAS

>sample_2
xxxxxxxxxDSUFSDFDSFSDFOSDJFOJFPPIPIOPTPHHPHLHL
1 Like

Thanks a lot again, rdcwayx :slight_smile:
You are an expert in awk ^^
Do you know how to replace a whole line's content with "x" based on a threshold of continuous "x" in that line?
e.g. if a line in the output file generated by your awk script has 10 or more continuous x, I plan to replace that whole line with x.

awk '
NR==FNR {t=$0;gsub(/./,"x");a[t]=$0;next}
{
        for (i in a) if ($0~i) sub(i,a[i])
}1
' referral.data input.data

>sample_1
SDFDSKLxxxxxxxxxxxxD
xxxxxxxxxDSUFSDFDSFS

>sample_2
xxxxxxxxxDSUFSDFDSFSDFOSDJFOJFPPIPIOPTPHHPHLHL

With your script, sample_1 has 12 continuous x in its content, which is above my threshold (>=10) of continuous x in a line.
My desired output would then look like this:

>sample_1
xxxxxxxxxxxxxxxxxxxx
xxxxxxxxxDSUFSDFDSFS

>sample_2
xxxxxxxxxDSUFSDFDSFSDFOSDJFOJFPPIPIOPTPHHPHLHL

Thanks again, rdcwayx :slight_smile:

cat infile

>sample_1
SDFDSKLxxxxxxxxxxxxD
xxxxxxxxxDSUFSDFDSFS

>sample_2
xxxxxxxxxDSUFSDFDSFSDFOSDJFOJFPPIPIOPTPHHPHLHL

awk -F "x" 'NF>10 {gsub(/./,FS)}1' infile

>sample_1
xxxxxxxxxxxxxxxxxxxx
xxxxxxxxxDSUFSDFDSFS

>sample_2
xxxxxxxxxDSUFSDFDSFSDFOSDJFOJFPPIPIOPTPHHPHLHL
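
Note that with -F "x" the NF>10 test counts every x on the line, not only a continuous run. If you strictly need 10 or more consecutive x, an untested variant is to match ten literal x instead:

awk '/xxxxxxxxxx/ {gsub(/./,"x")}1' infile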

Hi rdcwayx,

If my referral data (3560 different entries) and input data (around 20MB) are very large, do you have a better solution to improve the performance of replacing the data based on the referral data?
Thanks first :slight_smile:

Split the input data file into smaller files, and run the same awk command on each in a different console.

Then join them together again.

split input.data ABC

nohup awk -f command.awk referral.data ABCaa >ABCaa.data &
nohup awk -f command.awk referral.data ABCab >ABCab.data &
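
When both jobs have finished, the pieces can be joined back in order with cat (final.data is just a placeholder name):

cat ABCaa.data ABCab.data > final.data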
1 Like

Thanks a lot, rdcwayx :slight_smile:
Can I ask what "nohup" means?
Is it also a command?
Thanks ^^

---------- Post updated at 01:54 AM ---------- Previous update was at 01:41 AM ----------

Hi rdcwayx,
I just thought of one situation that needs your advice.
Referral data:

CGTGCF
PEOGDKGJDGKL
DFGDSFIODU
.

Input data:

>sample_1
ADSADAFEFGEGERPEOGDK
GJDGKLASDSFDFE
.
.

Do you have any idea how to generate the output result below?

ADSADAFEFGEGERXXXXXX
XXXXXXASDSFDFE
.

Thanks again ya.
I just think this case might be trickier, since the referral data to be replaced with "X" is split across two lines.
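
I guess the wrapped lines would have to be joined per record first (like the earlier sketch), so the newline can no longer break a match. Something along these lines, untested, and the output stays one line per record:

awk '/^>/ {if (seq) print seq; print; seq=""; next} {seq=seq $0} END {if (seq) print seq}' input.data |
awk 'NR==FNR {t=$0;gsub(/./,"X");a[t]=$0;next} {for (i in a) if ($0~i) gsub(i,a[i])}1' referral.data -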

Hi rdcwayx,
Why does it seem like neither of the commands below is working properly?

for i in $(cat reference.txt)
do
  sed -i "s/$i/XXXXXXXXXXXXXXX/g" input      # if your sed don't support -i option, replace by below perl command
# or
  perl -i -pe  "s/$i/XXXXXXXXXXXXXXX/g" input 
done

Did I miss anything?
Thanks for advice ya ^^

---------- Post updated at 11:39 PM ---------- Previous update was at 11:14 PM ----------

Hi rdcwayx,
I'm just wondering whether we can write the result into a separate new file using your script below.

for i in $(cat reference.txt)
do
  sed -i "s/$i/XXXXXXXXXXXXXXX/g" input      # if your sed don't support -i option, replace by below perl command
# or
  perl -i -pe  "s/$i/XXXXXXXXXXXXXXX/g" input > input.out (I got try this. But it is not worked :( )
done

Thanks for advice ya ^^
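
I guess the -i flag makes perl write back into the file it is editing, so nothing comes out on stdout for the > redirection to catch. Maybe copying the input first and letting the loop edit the copy in place would do it (untested):

cp input input.out
for i in $(cat reference.txt)
do
  perl -i -pe "s/$i/XXXXXXXXXXXXXXX/g" input.out
done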

---------- Post updated 10-14-10 at 12:00 AM ---------- Previous update was 10-13-10 at 11:39 PM ----------

I just found out that the command below:

awk '
NR==FNR {t=$0;gsub(/./,"x");a[t]=$0;next}
{
        for (i in a) if ($0~i) sub(i,a[i])
}1
' referral.data input.data

seems to have missed replacing some of the referral data content in the input data :frowning:
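
Could it be because sub() only replaces the first match in each line, and referral data that is split across a line break never matches at all? Maybe changing sub to gsub would at least catch repeated matches on one line (untested):

awk '
NR==FNR {t=$0;gsub(/./,"x");a[t]=$0;next}
{
        for (i in a) if ($0~i) gsub(i,a[i])
}1
' referral.data input.data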