Shell scripting

The given file contains two tags of interest, <free-energy> and <position>. I want output like the following. The <length> tag does not need to be considered.

(General format of output)

free-energy no.

first position no - second position no
third position no - fourth position no
------
------
Free-energy no
First position no - second position no
Third position no - fourth position no

Output for the given file is
-25.40
9 - 20
25 - 127
-24.80
2 - 13

------------------------------------------------

<structure analysis-ids="UNAFold">
  <!-- Start of folding 1 for sequence 1 ************************ -->
  <model id="_1.1">
    <model-info>
      <free-energy>-25.40</free-energy>
    </model-info>
          <base-id>
            <position>9</position>
          </base-id>
        </base-id-5p>
        <base-id-3p>
          <base-id>
                       <position>20</position>
          </base-id>
        </base-id-3p>
        <length>4</length>
          </helix>
                   <base-id>
                     <position>25</position>
                   </base-id>
                 </base-id-5p>

                     <position>127</position>
                   </base-id>
                 </base-id-3p>
	            <!-- End of folding 1 for sequence 1 ************************ -->
        <!-- Start of folding 2 for sequence 1 ************************ -->
  <model id="_1.2">
    <model-info>
      <free-energy>-24.80</free-energy>
    </model-info>
    <str-annotation>
          <base-id>
            <position>2</position>
          </base-id>
        </base-id-5p>
            <position>13</position>

Please help me write a shell script for this; it would help a lot in bioinformatics research. Thanks in advance.

You can make a sort of cheap XML parser out of sed:

sed -n '
  s/.*<free-energy>\([^<]*\)<.*/\1/p
  t
  /<base-id>/{
    :loop
    N
    /<\/base-id-3p>/!b loop
    s/.*<position>\([^<]*\)<.*<position>\([^<]*\)<.*/\1 - \2/p
    t
    s/^/Malformed: /
    s/\n//g
    w /dev/tty
   }
 ' in_file1 in_file2 . . . .

Narrative: sed runs with -n, so nothing is output unless it is explicitly printed. When it finds the free-energy tag, it extracts the content (everything after the tag up to the next <), makes that the entire new buffer, prints it, and then, purely as a speedup, branches to the end of the script (get next line). When it finds the base-id tag, it loops, appending lines to the buffer until the base-id-3p end tag arrives, extracts the two positions, places them around ' - ' as the entire new buffer, prints it, and branches to the end of the script. If there are not two position tags, it labels the buffer as Malformed, joins it into one line, and writes it to the tty (you can use a log file in batch situations).
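
To try this without a real UNAFold file, here is a minimal sketch that wraps the same two sed rules in a function and feeds it a tiny hand-made sample. The function name extract_pairs and the sample input are made up for illustration, and the Malformed branch is left out so that it runs where no /dev/tty is available.

```shell
#!/bin/sh
# Hypothetical helper (not from the post above): read XML-ish text on stdin,
# print each free-energy value and each "first - second" position pair.
extract_pairs() {
  sed -n '
    s/.*<free-energy>\([^<]*\)<.*/\1/p
    t
    /<base-id>/{
      :loop
      N
      /<\/base-id-3p>/!b loop
      s/.*<position>\([^<]*\)<.*<position>\([^<]*\)<.*/\1 - \2/p
    }
  '
}

# Invented sample with one energy and one 5p/3p pair.
extract_pairs <<'EOF'
<model id="_1.1">
  <free-energy>-25.40</free-energy>
  <base-id-5p>
    <base-id>
      <position>9</position>
    </base-id>
  </base-id-5p>
  <base-id-3p>
    <base-id>
      <position>20</position>
    </base-id>
  </base-id-3p>
  <length>4</length>
</model>
EOF
# prints:
# -25.40
# 9 - 20
```

Note that <base-id-5p> does not trigger the block, because the address /<base-id>/ needs the closing > right after base-id.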

awk -F'[<>]' '/free-energy/{print $3}/position/{printf "%s"(f?RS:" - "),$3;f=!f}' infile

Hi,

I have to insert ## at the 3051st and 3085th positions of a record in a file.

Sample Data:

rd    1254O3104969765S0AP     GY10KQH1413980010008    013

Let us assume that I need to insert "##" at these positions: ##GY10KQH1413980010008 ##

Code I am using:

sed 's/\(.\{3051\}\)/&##/;s/\(.\{3086\}\)/&##/' newtestfile.txt > test

Output:

sed: command garbled: s/\(.\{3051\}\)/&##/;s/\(.\{3086\}\)/&##/

I think I know the problem: the command does not accept such large repeat counts, i.e. 3051 and 3085.
But I have no solution for this problem.
The sed command has this limitation on Solaris machines, whereas the same command works fine on Red Hat Linux.
Please suggest a solution, because I have to stick to Solaris only.

Cheers
Aviroop

@aviroops. Welcome to the forum. Please start a new thread for this.


You could try:

nawk '{$3051=$3051"##";$3086=$3086"##"}1' FS= OFS= infile

or

sed 's/\(.\{100\}\)\{30\}.\{86\}/&##/;s/\(.\{100\}\)\{30\}.\{51\}/&##/' infile

Use nawk or /usr/xpg4/bin/awk, and /usr/xpg4/bin/sed, on Solaris.

GNU sed does not have this small repeat-count limit, and I find it on many modern Solaris installs; the native sed on modern Solaris also accepts fairly large counts.
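
The nested-count trick in the sed command above is easier to see at a small scale. The sketch below uses offsets 12 and 5 in place of 3086 and 3051; the function name mark_offsets and the alphabet input are invented for illustration. POSIX only guarantees repeat counts up to 255 (RE_DUP_MAX), which is presumably why an older sed rejects \{3051\}, while \(.\{100\}\)\{30\}.\{86\} stays under that limit and still matches 100*30+86 = 3086 characters.

```shell
#!/bin/sh
# Scaled-down illustration: \(.\{5\}\)\{2\}.\{2\} matches 5*2+2 = 12 characters.
# The larger offset is handled first, so the second substitution still
# counts characters of the original line, not of the already-modified one.
mark_offsets() {
  sed 's/\(.\{5\}\)\{2\}.\{2\}/&##/;s/.\{5\}/&##/'
}

printf '%s\n' abcdefghijklmnopqrstuvwxyz | mark_offsets
# prints abcde##fghijkl##mnopqrstuvwxyz
```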

Maybe a stupid question, but how important is it to learn shell scripting?

Shell scripting bridges and enhances the space between application programming and the UNIX OS. Many trivial one- or few-time tasks are better done in a wrapper script than in code, making the testing of the code bits much simpler and the code more robust. Also, unplanned activities, like catching up after an outage and similar production support tasks, are easily, predictably and safely accommodated by modified scripts and unmodified code. Shell scripting is really the way to go for all sorts of ad-hoc reports, including data analysis to support design decisions by business case.

I write a lot of simple C to support shell with high performance features that are otherwise lacking, so the shell script can do high volumes of data by having C bits do the heavy lifting. Error handling: detecting, alerting and reporting is better done in a shell script, so:

  • it is not firing during code development,
  • can be changed easily if there is too much or too little being heard from the production run,
  • if the code core dumps, the parent script can report an error.
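
As a rough sketch of that kind of wrapper (the function name run_and_report and the ALERT wording are made up here), the parent script runs the program, looks at its exit status, and does the reporting itself:

```shell
#!/bin/sh
# Hypothetical wrapper: run a command and report how it ended.
run_and_report() {
  status=0
  "$@" || status=$?
  if [ "$status" -gt 128 ]; then
    # Assumption: the shell reports death by signal N as 128+N
    # (true for bash and dash; ksh93 uses a different encoding).
    echo "ALERT: $1 killed by signal $((status - 128))" >&2
  elif [ "$status" -ne 0 ]; then
    echo "ALERT: $1 failed with exit status $status" >&2
  fi
  return "$status"
}

run_and_report true && echo "job finished cleanly"
```

The alerting can then be changed, redirected to a log, or silenced in the script without touching or rebuilding the program it wraps.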

Many housekeeping tasks for the system or the application space are better done in scripting, like compressing, backing up and purging old files. Scripting skills are mostly common with keyboard shell skills, so there is synergy in them: most tricks you learn interactively can support a script that runs unattended or automates that interactive task.

It is not necessary to master every fine detail of any shell (you will forget whatever you do not use), but pick a good one like ksh or bash, and look for the sweet spots you can imagine using. Some are very powerful and subtle, like the ability of (..) sub-shells to either concatenate output or divide input or both without handling a single byte, by the inheritance of FDs. For instance, these two scripts to pass a header line and sort the remaining lines are equivalent, but the second:

  • has lower overhead, as no head and tail processes re-read the data and nothing is written and read an extra time,
  • has lower latency, as the data is not all stored before being processed,
  • puts less stress on /tmp space,
  • does not leave a junk file behind if interrupted.

some_code >/tmp/xxx.tmp
head -1 /tmp/xxx.tmp
tail +2 /tmp/xxx.tmp | sort
rm -f /tmp/xxx.tmp

some_code |(
line
exec sort
)
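
The line utility is not installed everywhere, so here is a sketch of the same sub-shell idea using only the shell's own read; the function name sort_keep_header and the sample data are invented for illustration. It relies on read consuming exactly one line from a pipe, so that sort inherits the same file descriptor with the rest of the input still unread.

```shell
#!/bin/sh
# Hypothetical helper: pass the first line of stdin through unchanged,
# then hand the remaining lines to sort via the inherited stdin.
sort_keep_header() {
  (
    IFS= read -r header && printf '%s\n' "$header"
    exec sort
  )
}

printf '%s\n' name cherry apple banana | sort_keep_header
# prints:
# name
# apple
# banana
# cherry
```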

On a UNIX system? Pretty important. That terminal you're typing commands into all the time is a shell, and it's capable of much more useful and complicated work than just running something when you hit enter. You can use real constructs, even when just typing into the console directly.

It can do a lot of work for you. Some simple shell constructs are exceedingly useful:

# do something, then only if it succeeds, do something else
install-update && /sbin/reboot
# do something, and if it fails, do something else
mangle-the-world || putback-the-world
# do a repetitive task without typing it 9 times
for N in 1 2 3 4 5 6 7 8 9
do
        something-repetitive $N
done
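
The command names above are placeholders. A version you can paste into a terminal, with two made-up functions (step_ok and step_fail) standing in for a command that succeeds and one that fails, might look like this:

```shell
#!/bin/sh
# Hypothetical stand-ins for real commands.
step_ok()   { echo "step ok"; return 0; }
step_fail() { echo "step failed"; return 1; }

# && runs the right-hand side only when the left-hand side succeeds.
step_ok && echo "ran because step_ok succeeded"
# || runs the right-hand side only when the left-hand side fails.
step_fail || echo "ran because step_fail failed"
# The loop body runs once per word in the list.
for n in 1 2 3
do
    echo "pass $n"
done
```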