Search and replace problem

Hi,

I am looking for a bash or awk script to solve the following problem.
Input File 1:

>Min_0-t10270-RA|>Min_0-t10270-RA protein AED:0.41 eAED:0.46 QI:0|0|0|0.25|1|1|4|0|190
MIGLGFKYLDTSYFGGFCEPSEDMNKVCTMRADCCEGIEMRFHDLKLVLEDWRNFTKLST
EEKRLWATPAAEDFF
>Min_0-t10271-RA|>Min_0-t10271-RA protein AED:0.02 eAED:0.02 QI:0|-1|0|1|-1|1|1|0|97
MDWQGQKLAEQLMQIMLVVFAVGSFITGYAIGSFQMMLIIYAAGVVLTTLVTVPNWPFFN
RHPLKWLDPIEAERHPKPQPQPQPASSKKKPTKQHQK

I want the entire header line (starting with '>') to be replaced by its first part (the text before the first '|'). Example:

>Min_0-t10270-RA|>Min_0-t10270-RA protein AED:0.41 eAED:0.46 QI:0|0|0|0.25|1|1|4|0|190

will become:

>Min_0-t10270-RA

Expected output:

>Min_0-t10270-RA
MIGLGFKYLDTSYFGGFCEPSEDMNKVCTMRADCCEGIEMRFHDLKLVLEDWRNFTKLST
EEKRLWATPAAEDFF
>Min_0-t10271-RA
MDWQGQKLAEQLMQIMLVVFAVGSFITGYAIGSFQMMLIIYAAGVVLTTLVTVPNWPFFN
RHPLKWLDPIEAERHPKPQPQPQPASSKKKPTKQHQK

If possible, a few comments in the script would help me understand and learn as well.
Many thanks.

sed 's/^\(>[^|]*\).*/\1/' file

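This is a simple substitution command. It matches each line that starts with ">", captures everything up to (but not including) the first "|", and replaces the whole line with the captured part.
\( \) marks the group whose match is stored in \1.
The first ^ anchors the match to the beginning of the line; the second ^, inside [^|], negates the bracket expression, so it means "any character except |".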
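
For anyone who wants to try this piece by piece, here is a minimal sketch that runs the same sed command on a throwaway two-line sample. The temporary file and the shortened sequence line are assumptions made for illustration only; they are not part of the original data.

```shell
# Create a small throwaway sample (one header line, one shortened sequence line).
sample=$(mktemp)
cat > "$sample" <<'EOF'
>Min_0-t10270-RA|>Min_0-t10270-RA protein AED:0.41 eAED:0.46 QI:0|0|0|0.25|1|1|4|0|190
MIGLGFKYLDTSYFGG
EOF

# s/PATTERN/REPLACEMENT/
#   ^          anchor at the start of the line
#   \( ... \)  capture group, referred to later as \1
#   >          a literal ">"
#   [^|]*      zero or more characters that are not "|"
#   .*         the rest of the line (matched, then thrown away)
#   \1         replacement: only the captured part is kept
result=$(sed 's/^\(>[^|]*\).*/\1/' "$sample")
printf '%s\n' "$result"

rm -f "$sample"
```

Lines that do not start with ">" never match the pattern, so sed prints them unchanged.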


Try

$ awk -F"|" '/^>/{NF=1}1' file

>Min_0-t10270-RA
MIGLGFKYLDTSYFGGFCEPSEDMNKVCTMRADCCEGIEMRFHDLKLVLEDWRNFTKLST
EEKRLWATPAAEDFF
>Min_0-t10271-RA
MDWQGQKLAEQLMQIMLVVFAVGSFITGYAIGSFQMMLIIYAAGVVLTTLVTVPNWPFFN
RHPLKWLDPIEAERHPKPQPQPQPASSKKKPTKQHQK
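
Since comments were requested, here is a hedged, commented sketch of how that awk one-liner can be read. It also shows a variant using `$0 = $1`, which should behave the same way and avoids assigning to NF, something that older awk implementations may not handle identically. The temporary file and the shortened sequence line are assumptions for illustration.

```shell
# Create a small throwaway sample (one header line, one shortened sequence line).
sample=$(mktemp)
cat > "$sample" <<'EOF'
>Min_0-t10270-RA|>Min_0-t10270-RA protein AED:0.41 eAED:0.46 QI:0|0|0|0.25|1|1|4|0|190
MIGLGFKYLDTSYFGG
EOF

# -F'|'    split each line into fields on "|"
# /^>/     only act on lines that start with ">"
# NF = 1   keep just the first field; awk rebuilds the line from it
# 1        a pattern that is always true, so every line is printed
with_nf=$(awk -F'|' '/^>/ { NF = 1 } 1' "$sample")

# Variant: replace the whole line with its first field instead of changing NF.
with_assign=$(awk -F'|' '/^>/ { $0 = $1 } 1' "$sample")

printf '%s\n' "$with_assign"
rm -f "$sample"
```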

You could also try the slightly simpler awk and sed commands:

awk -F'|' '{print $1}' input
        and
sed 's/|.*//' input

The awk command uses "|" as the field separator and prints the 1st field on every input line.

The sed command removes "|" and anything that follows it from every input line that contains a "|" and then prints the resulting lines (whether or not they changed).
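
One difference from the earlier answers is worth spelling out: these simpler commands act on every line, not only on headers. That is harmless for the data in this thread, because the sequence lines contain no "|". The sketch below uses a made-up non-header line containing a "|" (purely hypothetical) to show what would happen otherwise, and how adding a `/^>/` address limits the sed substitution to header lines.

```shell
# Hypothetical sample: the second line is NOT a header but contains a "|".
sample=$(mktemp)
cat > "$sample" <<'EOF'
>id1|extra description
ACDE|FGH
EOF

# Applies to every line: the non-header line is truncated as well.
all_lines=$(sed 's/|.*//' "$sample")

# The /^>/ address restricts the substitution to lines starting with ">".
headers_only=$(sed '/^>/ s/|.*//' "$sample")

printf 'every line:\n%s\n' "$all_lines"
printf 'headers only:\n%s\n' "$headers_only"
rm -f "$sample"
```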

#!/usr/bin/env perl
use strict;
use warnings;

# open the input file for reading ("yourfile" is a placeholder name)
open( my $fh, "<", "yourfile" ) or die "Cannot open file: $!\n";
# go through the file line by line
while ( my $line = <$fh> ) {
	chomp($line);                           # get rid of newline
	$line =~ s/\|.*$// if $line =~ /^>/;    # on header lines, remove the first pipe and everything after it
	print $line . "\n";
}
close($fh);

Hello All,

The following may be a solution too.

Input file:

>Min_0-t10270-RA|>Min_0-t10270-RA protein AED:0.41 eAED:0.46 QI:0|0|0|0.25|1|1|4|0|190
MIGLGFKYLDTSYFGGFCEPSEDMNKVCTMRADCCEGIEMRFHDLKLVLEDWRNFTKLST
EEKRLWATPAAEDFF
>Min_0-t10271-RA|>Min_0-t10271-RA protein AED:0.02 eAED:0.02 QI:0|-1|0|1|-1|1|1|0|97
MDWQGQKLAEQLMQIMLVVFAVGSFITGYAIGSFQMMLIIYAAGVVLTTLVTVPNWPFFN
RHPLKWLDPIEAERHPKPQPQPQPASSKKKPTKQHQK

Command:

awk '/^>/ { sub(/\|.*/, "") } 1' check_data_range

Output will be as follows.

>Min_0-t10270-RA
MIGLGFKYLDTSYFGGFCEPSEDMNKVCTMRADCCEGIEMRFHDLKLVLEDWRNFTKLST
EEKRLWATPAAEDFF
>Min_0-t10271-RA
MDWQGQKLAEQLMQIMLVVFAVGSFITGYAIGSFQMMLIIYAAGVVLTTLVTVPNWPFFN
RHPLKWLDPIEAERHPKPQPQPQPASSKKKPTKQHQK

NOTE: the input file name here is check_data_range.

Thanks,
R. Singh