Execution Problems with loading huge data content and converting it

Hi,

I have a long list of referred file content:

CGTGCFTGCGTFREDG
PEOGDKGJDGKLJGKL
DFGDSFIODUFIODSUF
FSDOFJSODIFJSIODFJ
DSFSDFDFSDOFJFOSF
SDFOSDJFOJFPPIPIOP
.
.
.

Input file content:

>sample_1
SDFDSKLFKDSLSDFSDFDFGDSFIODUFIODSUFSDDSFDSSDFDSFAS

>sample_2
DFDFGDSFIODUFIODSUFSDFDSFSDFOSDJFOJFPPIPIOPTPHHPHLHL
.
.

Desired output:

>sample_1
SDFDSKLFKDSLSDFSDFXXXXXXXXXXXXXXXSDDSFDSSDFDSFAS

>sample_2
DFXXXXXXXXXXXXXXXSDFDSFXXXXXXXXXXXXXXXTPHHPHLHL
.
.

My main purpose is to convert the parts of the input file's content that are the same as the content in the referred file into "XXXXXXXXXXXXXXX" and write the result to an output file.
Thanks for any advice or suggestion.

You have your requirement defined ...

But what is the problem? Do you want us to give you a solution, or do you have a solution with issues?!

I would suggest solving it with perl, if you already know perl.

Hi thegeek,
I tried using the sed command.

sed 's/DFGDSFIODUFIODSUF/XXXXXXXXXXXXXXX/g' input_file > output_file

But it seems like the sed command does not work well when dealing with huge data :frowning:
Do you have a better suggestion, maybe a perl script, to achieve my goal?
Thanks for any advice.

So what's the problem with sed?

cat reference.txt

CGTGCFTGCGTFREDG
PEOGDKGJDGKLJGKL
DFGDSFIODUFIODSUF
FSDOFJSODIFJSIODFJ
DSFSDFDFSDOFJFOSF
SDFOSDJFOJFPPIPIOP

for i in $(cat reference.txt)
do
  sed -i "s/$i/XXXXXXXXXXXXXXX/g" input      # if your sed don't support -i option, replace by below perl command
# or
  perl -i -pe  "s/$i/XXXXXXXXXXXXXXX/g" input 
done
1 Like

Hi rdcwayx,

Thanks for your suggestion. It works :slight_smile:
But it seems like it does not work if my input sequence content is like:

problem 1:
>sample_10
TRAAERASDDSDSTERTRTRTERTEDFGDSFIODU
FIODSUFASDSDASFSAFEDFFDFDFDFDFDFDA
.
.

It seems like I need to remove the newlines (\n) in the content before starting the conversion, am I right? (Maybe something like the awk sketch at the end of this post?)
The sed command I tried previously works as well; it just takes time when converting huge data :frowning:
Thanks again for the advice ya :slight_smile:
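
Here is a minimal, untested sketch of what I mean (the file names input and input.oneline are just placeholders); it should join each record's wrapped sequence lines onto a single line, so a pattern can no longer be split by a newline:

awk '/^>/ {if (seq) print seq; print; seq=""; next} {seq=seq $0} END {if (seq) print seq}' input > input.oneline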

---------- Post updated at 06:43 AM ---------- Previous update was at 05:47 AM ----------

Hi rdcwayx,

The command you suggested will write the result back into the same file (the input file), am I right?
Besides that, if my referred file has around 124464 reads (each read around 22 bases long) and the input file has around 115478631 bases, do you have any other suggestion to speed up the process?
Thanks :slight_smile:

open FH,"<listfile";
my @tmp = <FH>;
my @pat = map {s/\n//;$_} @tmp;
close FH;
open FH,"<filetobereplaced";
while(<FH>){
  my $str = $_;
  foreach my $pat (@pat){
     my $tmp = $str=~s/$pat/<>/g;
     last if $tmp > 0;
  }
  print $str;
}
close FH;
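
Saved as, say, replace.pl (just a placeholder name; listfile and filetobereplaced inside are the names to adjust), it prints to standard output, so the result can be redirected into a new file:

perl replace.pl > output.txt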
1 Like

Hi summer,

Thanks for your perl script :slight_smile:
I'm trying it now.
My referred list has around 124464 reads. Hopefully it won't take too long :slight_smile:
Thanks again ya.

Hi Summer,

If my referral data looks like this:

CGTGCF
PEOGDKGJDGKL
DFGDSFIODU
FSDOSOJSIODFJ
DSFSDFFJFOSF
SDFOSDIOP
.
.

I plan to replace those variable-length referral data with "X" based on their respective lengths.
Input data:

>sample_1
SDFDSKLPEOGDKGJDGKLDSFAS

>sample_2
SDFOSDIOPDSUFSDFDSFSDFOSDJFOJFPPIPIOPTPHHPHLHL
.
.

Desired output:

>sample_1
SDFDSKLXXXXXXXXXXXXDSFAS

>sample_2
XXXXXXXXXDSUFSDFDSFSDFOSDJFOJFPPIPIOPTPHHPHLHL
.
.

Thanks again for your advice.

awk '
NR==FNR {t=$0;gsub(/./,"x");a[t]=$0;next}     # map each referral line to a same-length string of x
{
        for (i in a) if ($0~i) sub(i,a[i])    # mask the first occurrence of any matching referral
}1
' referral.data input.data

>sample_1
SDFDSKLxxxxxxxxxxxxDSFAS

>sample_2
xxxxxxxxxDSUFSDFDSFSDFOSDJFOJFPPIPIOPTPHHPHLHL
1 Like

Thanks a lot again, rdcwayx :slight_smile:
You are an expert in awk ^^
Do you know how to replace a whole line's content with "x" based on a threshold of continuous "x" in that line?
e.g. if a line in the output file generated by your awk script has 10 or more continuous x, I plan to replace that whole line with x.

awk '
NR==FNR {t=$0;gsub(/./,"x");a[t]=$0;next}
{
        for (i in a) if ($0~i) sub(i,a[i])
}1
' referral.data input.data

>sample_1
SDFDSKLxxxxxxxxxxxxD
xxxxxxxxxDSUFSDFDSFS

>sample_2
xxxxxxxxxDSUFSDFDSFSDFOSDJFOJFPPIPIOPTPHHPHLHL

With your script, sample_1 has 12 continuous x in its content, which is above my threshold (>=10) of continuous x in a line.
My desired output would then look like this:

>sample_1
xxxxxxxxxxxxxxxxxxxx
xxxxxxxxxDSUFSDFDSFS

>sample_2
xxxxxxxxxDSUFSDFDSFSDFOSDJFOJFPPIPIOPTPHHPHLHL

Thanks again, rdcwayx :slight_smile:

cat infile

>sample_1
SDFDSKLxxxxxxxxxxxxD
xxxxxxxxxDSUFSDFDSFS

>sample_2
xxxxxxxxxDSUFSDFDSFSDFOSDJFOJFPPIPIOPTPHHPHLHL

awk -F "x" 'NF>10 {gsub(/./,FS)}1' infile

>sample_1
xxxxxxxxxxxxxxxxxxxx
xxxxxxxxxDSUFSDFDSFS

>sample_2
xxxxxxxxxDSUFSDFDSFSDFOSDJFOJFPPIPIOPTPHHPHLHL
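
Note that with -F "x" the NF>10 test counts every x on the line, not only a continuous run. If you strictly need 10 or more consecutive x, an untested variant is to match ten literal x instead:

awk '/xxxxxxxxxx/ {gsub(/./,"x")}1' infile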

Hi rdcwayx,

If my referral data (3560 different entries) and input data (around 20MB) are very large, do you have a better solution to improve the performance of replacing the data based on the referral data?
Thanks first :slight_smile:

Split the input data file into smaller files, and run the same awk command on each in a different console.

Then join them together again.

split input.data ABC

nohup awk -f command.awk referral.data ABCaa >ABCaa.data &
nohup awk -f command.awk referral.data ABCab >ABCab.data &
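
When both jobs have finished, the pieces can be joined back in order with cat (final.data is just a placeholder name):

cat ABCaa.data ABCab.data > final.data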
1 Like

Thanks a lot, rdcwayx :slight_smile:
Can I ask what "nohup" means?
Is it also a command?
Thanks ^^

---------- Post updated at 01:54 AM ---------- Previous update was at 01:41 AM ----------

Hi rdcwayx,
I just thought of one situation that needs your advice.
Referral data:

CGTGCF
PEOGDKGJDGKL
DFGDSFIODU
.

Input data:

>sample_1
ADSADAFEFGEGERPEOGDK
GJDGKLASDSFDFE
.
.

Do you have any idea how to generate the output result below?

ADSADAFEFGEGERXXXXXX
XXXXXXASDSFDFE
.

Thanks again ya.
I just think this case might be trickier, since the referral data to be replaced with "X" is split across two lines.
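
I guess the wrapped lines would have to be joined per record first (like the earlier sketch), so the newline can no longer break a match. Something along these lines, untested, and the output stays one line per record:

awk '/^>/ {if (seq) print seq; print; seq=""; next} {seq=seq $0} END {if (seq) print seq}' input.data |
awk 'NR==FNR {t=$0;gsub(/./,"X");a[t]=$0;next} {for (i in a) if ($0~i) gsub(i,a[i])}1' referral.data -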

Hi rdcwayx,
Why does it seem like neither of the commands below is working properly?

for i in $(cat reference.txt)
do
  sed -i "s/$i/XXXXXXXXXXXXXXX/g" input      # if your sed don't support -i option, replace by below perl command
# or
  perl -i -pe  "s/$i/XXXXXXXXXXXXXXX/g" input 
done

Did I miss anything?
Thanks for advice ya ^^

---------- Post updated at 11:39 PM ---------- Previous update was at 11:14 PM ----------

Hi rdcwayx,
I'm just wondering whether we can write the result into a separate new file using your script below.

for i in $(cat reference.txt)
do
  sed -i "s/$i/XXXXXXXXXXXXXXX/g" input      # if your sed don't support -i option, replace by below perl command
# or
  perl -i -pe  "s/$i/XXXXXXXXXXXXXXX/g" input > input.out (I got try this. But it is not worked :( )
done

Thanks for advice ya ^^
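
I guess the -i flag makes perl write back into the file it is editing, so nothing comes out on stdout for the > redirection to catch. Maybe copying the input first and letting the loop edit the copy in place would do it (untested):

cp input input.out
for i in $(cat reference.txt)
do
  perl -i -pe "s/$i/XXXXXXXXXXXXXXX/g" input.out
done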

---------- Post updated 10-14-10 at 12:00 AM ---------- Previous update was 10-13-10 at 11:39 PM ----------

I just found out that the command below:

awk '
NR==FNR {t=$0;gsub(/./,"x");a[t]=$0;next}
{
        for (i in a) if ($0~i) sub(i,a[i])
}1
' referral.data input.data

seems to have missed replacing some of the referral data content in the input data :frowning:
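
Could it be because sub() only replaces the first match in each line, and referral data that is split across a line break never matches at all? Maybe changing sub to gsub would at least catch repeated matches on one line (untested):

awk '
NR==FNR {t=$0;gsub(/./,"x");a[t]=$0;next}
{
        for (i in a) if ($0~i) gsub(i,a[i])
}1
' referral.data input.data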