Scan and change file data content problem

Input file

>Read_1
XXXXXXXXXXSDFXXXXXDS  (condition 1: if at most 3 letters follow the last "X" on a line, replace those non-"X" letters with "X")
TREXXXXXXXSDFXXXXXDS  (condition 2: if at most 3 letters precede the first "X" on a line, replace those non-"X" letters with "X")
.
.

Desired output:

>Read_1
XXXXXXXXXXSDFXXXXXXX
XXXXXXXXXXSDFXXXXXXX
.
.

I tried the following to handle condition 1, but it did not work:

perl -pe 's/X[^X]{2}$/XXX/g' input

Thanks for any advice.

$ 
$ 
$ cat f25
>Read_1
XXXXXXXXXXSDFXXXXXDS
TREXXXXXXXSDFXXXXXDS
XXXXXXXXXXABCXXXXXXW
AQXXXXXXXXPQRXXXXXXX
XXXXXXXXXXDEFXXXXXXX
$ 
$ 
$ perl -lne 'if (/^([^X]{1,3})(X.*)/){($a,$b)=($1,$2); $a=~s/./X/g; print "$a$b"}
             elsif (/^(.*X)([^X]{1,3})$/){($a,$b)=($1,$2); $b=~s/./X/g; print "$a$b"}
             else {print}                                                            
            ' f25
>Read_1
XXXXXXXXXXSDFXXXXXXX
XXXXXXXXXXSDFXXXXXDS
XXXXXXXXXXABCXXXXXXX
XXXXXXXXXXPQRXXXXXXX
XXXXXXXXXXDEFXXXXXXX
$ 
$ 

tyler_durden

With sed:

sed -e 's/\(^X.*\)XDS$/\1XXX/g' -e 's/^TREX\(.*\)/XXXX\1/g' inputfile

Hi durden_tyler,

Thanks for your reply.
I have edited my previous post slightly to correct a few small mistakes.
Do you have any idea how to archive it?
Whenever the letters (3 or fewer) before the first "X" or after the last "X" are not "X", I want to replace them with "X".
Thanks again for your help.

---------- Post updated at 01:35 AM ---------- Previous update was at 01:29 AM ----------

Hi michaelrozar17,

Do you have any idea or suggestion for achieving my goal?
I have edited my previous post slightly.
Thanks in advance for your advice :)

As I understand it, you need TREXXXXXXXSDFXXXXXDS to be replaced with XXXXXXXXXXSDFXXXXXXX:

sed -e 's/\(^X.*\)XDS$/\1XXX/g' -e 's/^TREX\(.*\)XDS/XXXX\1XXX/g' inputfile
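
The sed commands above hard-code the letters from the sample (TRE, DS), so they will not carry over to a long list with other letters. A sketch that depends only on "non-X" follows. It assumes a sed that accepts -E for extended regular expressions (GNU and BSD sed both do), and mask_ends_sed is a hypothetical function name. Since sed cannot compute the replacement length, each run length from 3 down to 1 gets its own expression.

```shell
# mask_ends_sed: hypothetical helper; file arguments (or stdin) are passed to sed.
mask_ends_sed() {
  # First three expressions: 3, 2 or 1 non-X characters before the first X.
  # Last three expressions: 3, 2 or 1 non-X characters after the last X.
  sed -E \
    -e 's/^[^X]{3}X/XXXX/' -e 's/^[^X]{2}X/XXX/' -e 's/^[^X]X/XX/' \
    -e 's/X[^X]{3}$/XXXX/' -e 's/X[^X]{2}$/XXX/' -e 's/X[^X]$/XX/' \
    "$@"
}

printf 'TREXXXXXXXSDFXXXXXDS\n' | mask_ends_sed   # prints XXXXXXXXXXSDFXXXXXXX
```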

Hi michaelrozar17,

My input file is a long list of data, and "Read_1" is just one part of it.
I am just not sure how to archive it automatically when my input is such a long list :)

I'm quite confused now. What is your requirement? Do you need to replace those letters with X, or to archive something? If you mean archiving, please explain what "archive" means here.

Sorry for confusing you :(
Whenever the letters (3 or fewer) before the first "X" or after the last "X" are not "X", I want to replace them with "X".

# cat infile
XXXXXXXXXXSDFXXXXXDS
TREXXXXXXXSDFXXXXXDS
# ./justdoit infile
XXXXXXXXXXSDFXXXXXXX
XXXXXXXXXXSDFXXXXXXX
## justdoit ##
#!/bin/bash
# Replace up to 3 non-X characters before the first X and after the last X with X.
rm -f tmpfileX
pad=XXX
while IFS= read -r l ; do
  len=${#l}
  # fdst: number of non-X characters before the first X
  fdst=0
  while [ "$fdst" -lt "$len" ] && [ "${l:fdst:1}" != "X" ] ; do
    fdst=$(( fdst + 1 ))
  done
  # ltdst: number of non-X characters after the last X
  ltdst=0
  while [ "$ltdst" -lt "$len" ] && [ "${l:len-ltdst-1:1}" != "X" ] ; do
    ltdst=$(( ltdst + 1 ))
  done
  # Lines without any X (fdst equals len), such as ">Read_1", are left untouched.
  if [ "$fdst" -lt "$len" ] ; then
    if [ "$fdst" -le 3 ] ; then
      l="${pad:0:fdst}${l:fdst}"
    fi
    if [ "$ltdst" -le 3 ] ; then
      l="${l:0:len-ltdst}${pad:0:ltdst}"
    fi
  fi
  printf '%s\n' "$l" >>tmpfileX
done <"$1"
more tmpfileX

Not sure what you mean by "idea to archive it" - I don't see anything about archiving in your post. You probably mean "achieve it".

In any case, the script that follows also handles lines that have non-X letters at both ends.

$
$
$ cat f25
>Read_1
XXXXXXXXXXABCXXXXXXS
XXXXXXXXXXABCXXXXXRS
XXXXXXXXXXABCXXXXQRS
XXXXXXXXXXABCXXXPQRS
PXXXXXXXXXABCXXXXXXX
PQXXXXXXXXABCXXXXXXX
PQRXXXXXXXABCXXXXXXX
PQRSXXXXXXABCXXXXXXX
PXXXXXXXXXABCXXXXXXS
PXXXXXXXXXABCXXXXXRS
PXXXXXXXXXABCXXXXQRS
PXXXXXXXXXABCXXXPQRS
PQXXXXXXXXABCXXXXXXS
PQXXXXXXXXABCXXXXXRS
PQXXXXXXXXABCXXXXQRS
PQXXXXXXXXABCXXXPQRS
PQRXXXXXXXABCXXXXXXS
PQRXXXXXXXABCXXXXXRS
PQRXXXXXXXABCXXXXQRS
PQRXXXXXXXABCXXXPQRS
PQRSXXXXXXABCXXXXXXS
PQRSXXXXXXABCXXXXXRS
PQRSXXXXXXABCXXXXQRS
PQRSXXXXXXABCXXXPQRS
XXXXXXXXXXABCXXXXXXX
$
$
$ perl -lne 'if    (/^([^X]{1,3})(X.*X)([^X]{1,3})$/) {($a,$b,$c)=($1,$2,$3); $a=~s/./X/g; $c=~s/./X/g; print "$a$b$c"}
             elsif (/^([^X]{1,3})(X.*)/)              {($a,$b)=($1,$2); $a=~s/./X/g; print "$a$b"}
             elsif (/^(.*X)([^X]{1,3})$/)             {($a,$b)=($1,$2); $b=~s/./X/g; print "$a$b"}
             else  {print}
            ' f25
>Read_1
XXXXXXXXXXABCXXXXXXX
XXXXXXXXXXABCXXXXXXX
XXXXXXXXXXABCXXXXXXX
XXXXXXXXXXABCXXXPQRS
XXXXXXXXXXABCXXXXXXX
XXXXXXXXXXABCXXXXXXX
XXXXXXXXXXABCXXXXXXX
PQRSXXXXXXABCXXXXXXX
XXXXXXXXXXABCXXXXXXX
XXXXXXXXXXABCXXXXXXX
XXXXXXXXXXABCXXXXXXX
XXXXXXXXXXABCXXXPQRS
XXXXXXXXXXABCXXXXXXX
XXXXXXXXXXABCXXXXXXX
XXXXXXXXXXABCXXXXXXX
XXXXXXXXXXABCXXXPQRS
XXXXXXXXXXABCXXXXXXX
XXXXXXXXXXABCXXXXXXX
XXXXXXXXXXABCXXXXXXX
XXXXXXXXXXABCXXXPQRS
PQRSXXXXXXABCXXXXXXX
PQRSXXXXXXABCXXXXXXX
PQRSXXXXXXABCXXXXXXX
PQRSXXXXXXABCXXXPQRS
XXXXXXXXXXABCXXXXXXX
$
$
$

tyler_durden

awk '
/^...X/{sub("^...X","XXXX")}
/X...$/{sub("X...$","XXXX")}
1' file

Franklin52's idea is much better.
You do not really have to check whether the first 3 or last 3 characters are non-X, because replacing an "X" with an "X" changes nothing. The result is much shorter code.

$
$ cat f25
>Read_1
XXXXXXXXXXABCXXXXXXS
XXXXXXXXXXABCXXXXXRS
XXXXXXXXXXABCXXXXQRS
XXXXXXXXXXABCXXXPQRS
PXXXXXXXXXABCXXXXXXX
PQXXXXXXXXABCXXXXXXX
PQRXXXXXXXABCXXXXXXX
PQRSXXXXXXABCXXXXXXX
PXXXXXXXXXABCXXXXXXS
PXXXXXXXXXABCXXXXXRS
PXXXXXXXXXABCXXXXQRS
PXXXXXXXXXABCXXXPQRS
PQXXXXXXXXABCXXXXXXS
PQXXXXXXXXABCXXXXXRS
PQXXXXXXXXABCXXXXQRS
PQXXXXXXXXABCXXXPQRS
PQRXXXXXXXABCXXXXXXS
PQRXXXXXXXABCXXXXXRS
PQRXXXXXXXABCXXXXQRS
PQRXXXXXXXABCXXXPQRS
PQRSXXXXXXABCXXXXXXS
PQRSXXXXXXABCXXXXXRS
PQRSXXXXXXABCXXXXQRS
PQRSXXXXXXABCXXXPQRS
XXXXXXXXXXABCXXXXXXX
$
$ # Perl equivalent of Franklin52's script
$ perl -plne 's/^...X/XXXX/; s/X...$/XXXX/' f25
>Read_1
XXXXXXXXXXABCXXXXXXX
XXXXXXXXXXABCXXXXXXX
XXXXXXXXXXABCXXXXXXX
XXXXXXXXXXABCXXXPQRS
XXXXXXXXXXABCXXXXXXX
XXXXXXXXXXABCXXXXXXX
XXXXXXXXXXABCXXXXXXX
PQRSXXXXXXABCXXXXXXX
XXXXXXXXXXABCXXXXXXX
XXXXXXXXXXABCXXXXXXX
XXXXXXXXXXABCXXXXXXX
XXXXXXXXXXABCXXXPQRS
XXXXXXXXXXABCXXXXXXX
XXXXXXXXXXABCXXXXXXX
XXXXXXXXXXABCXXXXXXX
XXXXXXXXXXABCXXXPQRS
XXXXXXXXXXABCXXXXXXX
XXXXXXXXXXABCXXXXXXX
XXXXXXXXXXABCXXXXXXX
XXXXXXXXXXABCXXXPQRS
PQRSXXXXXXABCXXXXXXX
PQRSXXXXXXABCXXXXXXX
PQRSXXXXXXABCXXXXXXX
PQRSXXXXXXABCXXXPQRS
XXXXXXXXXXABCXXXXXXX
$
$

tyler_durden
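
One caveat about the dot-based shortcut: it only looks at whether the 4th character from each end is an X, so it relies on the X runs next to the ends being long enough, which holds for all the sample data above. For input where that is not guaranteed, a stricter awk sketch is shown below. The sample line AXBXXXXXXS and the name mask_ends_awk are made up for illustration. On that line the dot version would also mask the interior B, while the version below only touches the letters outside the first and last X.

```shell
# mask_ends_awk: hypothetical helper; file arguments (or stdin) are passed to awk.
mask_ends_awk() {
  awk '
    {
      # Leading run: all non-X characters before the first X (RLENGTH includes that X).
      if (match($0, /^[^X]+X/) && RLENGTH - 1 <= 3) {
        n = RLENGTH - 1
        $0 = substr("XXX", 1, n) substr($0, n + 1)
      }
      # Trailing run: all non-X characters after the last X (RLENGTH includes that X).
      if (match($0, /X[^X]+$/) && RLENGTH - 1 <= 3) {
        n = RLENGTH - 1
        $0 = substr($0, 1, length($0) - n) substr("XXX", 1, n)
      }
      print
    }' "$@"
}

printf 'AXBXXXXXXS\n' | mask_ends_awk   # prints XXBXXXXXXX
```

The run length is checked through RLENGTH rather than an interval expression such as {1,3}, since not every awk implementation supports intervals.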


Thanks again, Franklin52.
Your awk script is wonderful and worked perfectly in my case :)

---------- Post updated at 10:35 AM ---------- Previous update was at 10:33 AM ----------

Sorry that my mistake caused the misunderstanding, tyler_durden :(
You're right.
It should be "achieve" instead of "archive" :(
Thanks a lot for your latest perl command too.
It worked perfectly, and it is easy for me to adapt the perl command to different situations.
Thanks again, tyler_durden.