Can this be made into one single line?

nmsinghe · September 19, 2002, 3:17pm

Can someone please suggest a script to make the following into one single (continuous) line so that a pattern search can be carried out on the resulting single line.

Note: this is a sample (the real data may be shorter or longer) and it will be contained in a text file.

mstnpkpqrktkrntnrrpqdvkfpgggqivggvyllprrgprlgvrapr
rrqpipkarrpegrtwaqpgypwplygneglgwagwllsprgsrpswgpt
kvidtltcgfadlmgyiplvgaplggaaralahgvrvledgvnyatgnlp
llsclttpasayevhnvsgiyhvtndcsnasivyeaadlimhtpgcvpcv
altptlaarnvtiptttirrhvdllvgaaafcsamyvgdlcgsvflvsql
lqdcncsiypghvsghrmawdmmmnwspttalvvsqllripqavvdmvag
yysmagnwakvlivmllfagvdgdthvtggaqakttnrlvsmfasgpsqk
hinrtalncndslqtgflaalfythsfnssgcpermaqcrtidkfdqgwg
dqrpycwhypppqctivpasevcgpvycftpspvvvgttdrfgvptyrwg
ntrppqgnwfgctwmnstgftktcggppcniggvgnntltcptdcfrkhp
pwltprcmvdypyrlwhypctvnftifkvrmyvggvehrlnaacnwtrge
elsplllsttewqvlpcsfttlpalstglihlhqnivdvqylygigsavv
llfllladarvcaclwmmlliaqaeaalenlvvlnsasvagahgilsflv
rlvpgatyalygvwpllllllalpprayamdremaascggavfvglvllt
rliwwlqyfttraeadlhvwipplnarggrdaiillmcavhpelifditk
vlqagitrvpyfvraqglihacmlvrkvagghyvqmafmklgaltgtyiy
raglrdlavavepvvfsdmetkiitwgadtaacgdiilglpvsarrgkei
rglrllapitaysqqtrgllgciitsltgrdknqvegevqvvstatqsfl
vyhgagsktlaapkgpitqmytnvdqdlvgwpkppgarsltpctcgssdl
pvrrrgdsrgsllsprpvsylkgssggpllcpfghavgifraavctrgva
mettmrspvftdnssppavpqsfqvahlhaptgsgkstkvpaayaaqgyk
tlgfgaymskahgidpnirtgvrtittgapvtystygkfladggcsggay
tdsttilgigtvldqaetagarlvvlatatppgsvtvphpnieevalsnt
pieairggrhlifchskkkcdelaaklsglginavayyrgldvsviptig
mtgytgdfdsvidcntcvtqtvdfsldptftietttvpqdavsrsqrrgr
fvtpgerpsgmfdssvlcecydagcawyeltpaetsvrlraylntpglpv
vftglthidahflsqtkqagdnfpylvayqatvcaraqapppswdqmwkc
ptpllyrlgavqnevtlthpitkyimacmsadlevvtstwvlvggvlaal
vivgriilsgrpaivpdrellyqefdemeecashlpyieqgmqlaeqfkq
kqaeaaapvveskwraletfwakhmwnfisgiqylaglstlpgnpaiasl
lttqstllfnilggwvaaqlappsaasafvgagiagaavgsiglgkvlvd
galvafkvmsgempstedlvnllpailspgalvvgvvcaailrrhvgpge
afasrgnhvspthyvpesdaaarvtqilssltitqllkrlhqwinedcst
wdwictvltdfktwlqskllpqlpgvpffscqrgykgvwrgdgimqttcp
ngsmrivgpktcsntwhgtfpinayttgpctpspapnysralwrvaaeey
yvtgmttdnvkcpcqvpapeffsevdgvrlhryapacrpllreevtfqvg
pcepepdvavltsmltdpshitaetakrrlargsppslasssasqlsaps
spdadlieanllwrqemggnitrvesenkvvvldsfdplraeederevsv
fpaampiwarpdynpplleswkdpdyvppvvhgcplppikappippprrk
ssalaelatktfgssessavdsgtatalpdqasddgdkgsdvesyssmpp
sdgswstvseeasedvvccsmsytwtgalitpcaaeesklpinalsnsll
srsaglrqkkvtfdrlqvlddhyrdvlkemkakastvkakllsveeackl
gygakdvrnlsskavnhihsvwkdlledtvtpidttimaknevfcvqpek
fpdlgvrvcekmalydvvstlpqvvmgssygfqyspgqrveflvntwksk
rcfdstvtendirveesiyqccdlapearqaikslterlyiggpltnskg
sgvlttscgntltcylkasaacraaklqdctmlvngddlvvicesagtqe
amtrysappgdppqpeydlelitscssnvsvahdasgkrvyyltrdpttp
htpvnswlgniimyaptlwarmilmthffsillaqeqlekaldcqiygac
iierlhglsafslhsyspgeinrvasclrklgvpplrvwrhrarsvrarl
gkylfnwavktklkltpipaasrldlsgwfvagysggdiyhslsrarprw
gvgiyllpnr

s93366 · September 19, 2002, 3:37pm

cat test.txt |tr -d "\n" > out.txt

should do the trick..

test.txt is the file with the data that you need to remove linefeed from..

out.txt will contain the same file without linefeed..

hope this helps!

/Peter C

Optimus_P · September 19, 2002, 6:31pm

If the file has it in multiple lines, shouldn't it stay in multiple lines? Otherwise you would be tampering with the data.

If you look at the script I posted for the search you are after, it takes a file with multiple lines and does the search.

My perl script would have spit out the following for the data you have above.

+=================================+
|COOL STUFF BY OPTIMUSP at UNIXCOM|
+=================================+
Line Position Found/Text
==== ======== ==========
20     8      1
              pvRRRGDSrgsllsprpvsylkgssggpllcpfghavgifraavctrgva

+-----------------------------+
49     45     1
              iierlhglsafslhsyspgeinrvasclrklgvpplrvwRHRARSvrarl

+-----------------------------+

Kelam_Magnus · September 19, 2002, 11:27pm

You're going to hate yourselves when you see my answer.

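If you can open the file in vi, just use the "join" command with a count. Like this:

NJ

Where N is the number of lines to join and "J" is the join command in vi. Put the cursor on the first line and type the count immediately followed by J, with no space in between (a space is itself a cursor-movement command in vi).

If you want to join 100 lines then:

100J

Be aware that J normally puts a space between the joined lines, so those spaces would still have to be removed afterwards.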

Make sure you are in command mode!!!

Enjoy!!

Perderabo · September 20, 2002, 8:40am

Like Optimus_P, I don't understand why the OP is ignoring my solution to his problem. For the record, when the above input data is run against my script, it outputs:

Line: 20 At position 1 2 unmatched characters
Line: 20 At position 3 MATCH: rrrgds
Line: 20 At position 9 43 trailing characters
pvRRRGDSrgsllsprpvsylkgssggpllcpfghavgifraavctrgva


Line: 49 At position 1 39 unmatched characters
Line: 49 At position 40 MATCH: rhrars
Line: 49 At position 46 6 trailing characters
iierlhglsafslhsyspgeinrvasclrklgvpplrvwRHRARSvrarl

I then joined all of the lines together into one superline. And I commented out the 'echo "$image"' in my script so that it won't print out the line with matches upshifted. When the superline is run against my script, it outputs:

Line: 1 At position 1 952 unmatched characters
Line: 1 At position 953 MATCH: rrrgds
Line: 1 At position 959 289 unmatched characters
Line: 1 At position 1248 MATCH: rgrfvt
Line: 1 At position 1254 1186 unmatched characters
Line: 1 At position 2440 MATCH: rhrars
Line: 1 At position 2446 65 trailing characters

So I got one more match. This explains the motivation for trying to join the lines. I think a better solution is to modify the scripts to find matches across line boundaries. Eliminating the line boundaries is hard and neither of the solutions posted worked very well.

The data file has 2510 letters. That exceeds the maximum line length that vi can handle, at least on HP-UX, so vi didn't work. As for the tr solution, I tried:
tr -d "\n" < file1 > file2
which kinda worked, but it left the file with no newline characters at all. Thus the file had zero lines. I used:
echo >> file2
to correct this problem.
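The two steps described here (delete every newline, then put a single one back) can be combined. A minimal sketch, assuming a POSIX shell; the function name join_lines is made up for illustration, and file1/file2 are the same placeholder names used in this post:

```shell
# join_lines FILE: write the contents of FILE to standard output as one line.
# (join_lines is a hypothetical helper name, not part of any posted script.)
join_lines() {
    tr -d '\n' < "$1"   # delete every newline character
    echo                # add back exactly one, so the result is a proper text line
}

# Example use with the placeholder file names from this thread:
# join_lines file1 > file2
```

With the trailing newline restored, wc -l on the result should report 1 rather than 0.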

At this point, my script worked and spit out the above results, but at some point, ksh will balk at reading a giant line. That's why switching to an algorithm that can match across line boundaries would be the better approach.
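One possible way around the shell's limits on very long lines is to let awk do the scanning on the joined data. This is a sketch of a different technique from the ksh script, not something tested on HP-UX. The function name find_motifs and the output wording are assumptions; the pattern is passed in as an extended regular expression, and the character class in the example is copied from the ksh script in this thread.

```shell
# find_motifs FILE REGEX: join FILE into one line, then report every
# non-overlapping match of REGEX with its 1-based position.
# (find_motifs is a hypothetical name used only for this sketch.)
find_motifs() {
    { tr -d '\n' < "$1"; echo; } |
    awk -v re="$2" '
    {
        s = $0      # part of the sequence not yet scanned
        seen = 0    # number of characters already scanned
        while (match(s, re)) {
            printf "At position %d MATCH: %s\n", seen + RSTART, substr(s, RSTART, RLENGTH)
            seen += RSTART + RLENGTH - 1
            s = substr(s, RSTART + RLENGTH)
        }
    }'
}

# Example use; the character class is the one from the ksh script:
# aa='[acdefghiklmnpqrstvyz]'
# find_motifs file1 "r${aa}r${aa}${aa}[ts]"
```

Like the ksh script, this resumes scanning after the end of each match, so overlapping matches are not reported.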

nmsinghe · September 20, 2002, 11:13am

Yes so far all suggestions are correct but let's clarify some points.

Making the multiple lines into a single line does NOT tamper with the data: it is in fact ONE continuous line, and it is presented on the protein description web pages as multiple lines only for ease of display.

The problem comes in doing matches across line boundaries if the data is NOT presented as one single line to the ksh script.

If we have a solution to make the searches possible across line boundaries then we have a winner.

My wife and I are carrying out tests with the ksh script and we came up with protein sequences that have many lines, sometimes as many as 50.

So either we fix the ksh script or the Perl one, or we have to ask our C gurus, which is why I posted there as well.

Someone appeared to be annoyed about my posting under C.

To clarify we aren't doing homework and this is an essential part of an advanced breast cancer research dissertation.

Thanks to all you sincere guys for your efforts

Perderabo · September 20, 2002, 11:58am

Here is a version of my script that ignores line boundaries during match checking.

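#! /usr/bin/ksh

##  Motif searched for (- is any letter in longset):
##  r-r--s
##  r-r--t

longset="[acdefghiklmnpqrstvyz]"
pattern="r${longset}r${longset}${longset}[ts]"

pos=1           # position in the whole sequence of the first character of $input
linen=0         # current line number
IFS=""          # keep read from trimming the line
while read input ; do
        ((linen=linen+1))
        preamble="Line: ${linen} At position"
        # Prepend the unmatched tail of the previous line so that a match
        # may span the line boundary.
        input="${save4next}${input}"
        save4next=""
        orig="${input}"
        matches=0
        while ((${#input})) ; do
                if [[ $input = *(?)${pattern}*(?) ]] ; then
                        ((matches=matches+1))
                        leftover=${input#*${pattern}}   # text after the first match
                        temp=${input%${leftover}}       # text up to the end of the match
                        lead=${temp%${pattern}}         # text before the match
                        this=${temp#${lead}}            # the match itself
                        input="${leftover}"
                        ((pos=pos+${#lead}))
                        echo $preamble $pos MATCH: $this
                        ((pos=pos+${#this}))
                else
                        # No match left: keep at most the last 5 characters
                        # for the next line and back pos up by that amount.
                        ((pos=pos+${#input}))
                        if (( ${#input} < 5 )) ; then
                                save4next="${input}"
                        else
                                temp="${input%?????}"
                                save4next="${input#$temp}"
                        fi
                        ((pos=pos-${#save4next}))
                        input=""
                fi
        done
done
exit 0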

Note that you cannot have any leading or trailing blanks or tabs on the data lines. I sent your data file through this new script and I got:

Line: 20 At position 953 MATCH: rrrgds
Line: 26 At position 1248 MATCH: rgrfvt
Line: 49 At position 2440 MATCH: rhrars
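Why the script keeps the last five characters of each line: the pattern always matches exactly six letters, so a match that crosses a line boundary can have at most five of them on the earlier line(s). Carrying those five forward is therefore enough. The same idea is sketched below in awk rather than ksh, for readers whose shell lacks ksh extended patterns. scan_stream and its arguments are hypothetical names, and this sketch has not been run against the full data set in this thread.

```shell
# scan_stream REGEX KEEP < FILE
# Report matches of REGEX in the input as if it were one long line, reading it
# line by line. KEEP is the number of trailing characters carried into the
# next line; use (match length - 1), i.e. 5 for a six-letter motif.
# (scan_stream is a hypothetical name used only for this sketch.)
scan_stream() {
    awk -v re="$1" -v keep="$2" '
    {
        buf  = carry $0                 # unmatched tail of earlier lines + this line
        base = total - length(carry)    # characters of the sequence that precede buf
        while (match(buf, re)) {
            printf "Line: %d At position %d MATCH: %s\n", NR, base + RSTART, substr(buf, RSTART, RLENGTH)
            base += RSTART + RLENGTH - 1
            buf = substr(buf, RSTART + RLENGTH)
        }
        total += length($0)
        # keep only the last KEEP characters that were not part of a match
        carry = (length(buf) > keep) ? substr(buf, length(buf) - keep + 1) : buf
    }'
}

# Example use; the character class is the one from the ksh script:
# aa='[acdefghiklmnpqrstvyz]'
# scan_stream "r${aa}r${aa}${aa}[ts]" 5 < file1
```

As in the ksh output, a match that spans two lines is reported against the line on which it ends.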