shell script for extracting out the shortest substring from the given starting and en

pankajd · October 19, 2007, 2:35am

Hi all,
I urgently need help writing a shell script that extracts and prints the shortest substring of a given string, where the first and last characters of that substring are supplied by the user.
For example, if str="abcdpqracdpqaserd"
and the user gives 'a' and 'd' as the first and last characters of the substring (as command-line arguments), the script should extract acd as the shortest substring.
Please give a simple solution.

ghostdog74 · October 19, 2007, 3:22am

str="abcpqracdpqaserd"
startch="a"
endch="d"
awk -v str="$str" -v st="$startch" -v end="$endch" 'BEGIN{
s=index(str,st)     # position of the first start character
e=index(str,end)    # position of the first end character
print substr(str,s,e-s+1)
}'

output:

# ./test.sh
abcpqracd

aigles · October 19, 2007, 4:05am

Another way with sed (first and last can't be special chars) :

str="abcdpqracdpqaserd"
first=a
last=d
substr=$(echo "$str"| sed -n "s/^[^$first]*\($first[^$last]*$last\).*/\1/p")
echo "$substr"

$ sh -x substr.sh
+ str=abcdpqracdpqaserd
+ first=a
+ last=d
++ echo abcdpqracdpqaserd
++ sed -n 's/^[^a]*\(a[^d]*d\).*/\1/p'
+ substr=abcd
+ echo abcd
abcd
$

Jean-Pierre.

summer_cherry · October 19, 2007, 5:35am

Hi,

It really took me a lot of effort. I have tested it on many cases, and they all came out fine. I hope this is what you are after.

input:

abcdpqracdpqaserd
abcdpqracdpqaserd
abcdpqracdpqaserd

output (start:a end:d):

acd
acd
acd

output (start:a end:p):

acdp
acdp
acdp

output (start:a end:r):

abcdpqr
abcdpqr
abcdpqr

code:

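# The input lines are read from a file named "a".
read a    # start character
read b    # end character
sed -e "s/$a[^$b]*$b/|&|/g" a > temp_a
sed 's/^|//' temp_a > temp_b

nawk -v st="$a" -v ed="$b" 'BEGIN{
FS="|"
}
{
tmp=""
for(i=1;i<=NF;i++)
{
	# keep only fields that start with st and end with ed
	if(index($i,st)==1 && substr($i,length($i))==ed)
	{
		if(tmp=="")
		{
			tmp=$i
		}
		else
		{
			if (length($i)<length(tmp))
				tmp=$i
		}
	}
}
print tmp
}
' temp_b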

radoulov · October 19, 2007, 11:37am

With GNU Awk:

awk 'NF>1&&$0=(FS $NF RT){
	if(length<min){
		min=length;rec=$0}
	}END{
print rec
}' FS="$start" RS="$end" min=9^9 filename

$ cat file
abcdpqracdpqaserd
$ start=a
$ end=d
$ awk 'NF>1&&$0=(FS $NF RT){
if(length<min){
min=length;rec=$0}
}END{
print rec
}' FS="$start" RS="$end" min=9^9 file
acd
$ start=a
$ end=p
$ awk 'NF>1&&$0=(FS $NF RT){
if(length<min){
min=length;rec=$0}
}END{
print rec
}' FS="$start" RS="$end" min=9^9 file
acdp
$ start=a
$ end=r
$ awk 'NF>1&&$0=(FS $NF RT){
if(length<min){
min=length;rec=$0}
}END{
print rec
}' FS="$start" RS="$end" min=9^9 file
aser

drl · October 19, 2007, 12:59pm

Hi.

I like the solution from aigles. I don't see one in Perl yet.

The perl RE syntax has special features for the shortest match. Here is the entire code, along with diagnostic code, minimal argument processing, etc:

#!/usr/bin/perl

# @(#) p1       Demonstrate non-greedy matching perl RE syntax.

use warnings;
use strict;

my ($debug);
$debug = 0;
$debug = 1;

my ($lines) = 0;

my ($usage) = "usage: $0 first last\n";
my ($first) = shift || die "$usage";
my ($last)  = shift || die "$usage";

my ($string);

while (<>) {
  print " Bounds on this search: $first, $last\n" unless $lines;
  $lines++;
  chomp;
  print "\n";
  print " Initial string = \"$_\"\n";
  if (/($first.*?$last)/) {
    $string = $1;
    print " Shortest substring = \"$string\"\n";
  }
  else {
    print STDERR " No substring found, continuing.\n";
  }
}

print STDERR " ( Lines read: $lines )\n";

exit(0);

Running this on your test line and a few others in file data1:

% ./p1 a d data1
 Bounds on this search: a, d

 Initial string = "abcdpqracdpqaserd"
 Shortest substring = "abcd"

 Initial string = "abc"
 No substring found, continuing.

 Initial string = "abcdddd"
 Shortest substring = "abcd"
 ( Lines read: 3 )

The heart of the match is in these characters .*?

See the man pages for:

perlre              Perl regular expressions, the rest of the story
perlreref           Perl regular expressions quick reference

for details ... cheers, drl

radoulov · October 19, 2007, 4:33pm

Am I missing something, or did the OP want acd (not abcd) from abcdpqracdpqaserd with a and d?

drl · October 19, 2007, 5:13pm

Hi, radoulov.

I assumed the OP missed something, namely the b key. If not, he can explain how that should be obtained, give another example, etc. ... cheers, drl

radoulov · October 19, 2007, 5:28pm

I assumed he wanted the shortest match.

drl · October 19, 2007, 5:35pm

Hi.

If it were true that we could arbitrarily omit characters, then the shortest match would always be "ad", and we wouldn't need to work so hard.

Do you see any other algorithmic way to get "acd" from "abcdpqracdpqaserd"? -- or did I miss something this time? ... cheers, drl

radoulov · October 19, 2007, 5:39pm

abcdpqracdpqaserd

acd is the shortest match of a[^d]*d
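
To make that concrete, here is a minimal bash sketch that prints the match of a[^d]*d starting at each a, so the candidates can be compared by eye. The function name list_candidates is made up for illustration, and the sketch assumes both bounds are single literal characters.

```shell
#!/bin/bash
# Hypothetical helper: for every position where $first occurs, print the
# substring running up to the next occurrence of $last (if there is one).
list_candidates() {
    local str=$1 first=$2 last=$3
    local i rest tail
    for (( i = 0; i < ${#str}; i++ )); do
        [ "${str:i:1}" = "$first" ] || continue
        rest=${str:i+1}              # text after this start character
        case $rest in
            *"$last"*) ;;            # an end character follows
            *) continue ;;           # no end character left, skip
        esac
        tail=${rest%%"$last"*}       # everything before the first $last
        printf '%s\n' "$first$tail$last"
    done
}

list_candidates abcdpqracdpqaserd a d
# abcd
# acd
# aserd
```

The first candidate is what a single left-to-right match returns; the shortest one only shows up when every start position is tried.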

drl · October 19, 2007, 5:42pm

Hi.

Good eye; got it, thanks. I'll need to scan the entire string, as you did ... cheers, drl

drl · October 19, 2007, 6:37pm

Hi.

Modified perl code to scan entire string:

#!/usr/bin/perl

# @(#) p1       Demonstrate non-greedy matching perl RE syntax.

use warnings;
use strict;

my ($debug);
$debug = 1;
$debug = 0;

my ($lines) = 0;

my ($usage) = "usage: $0 first last\n";
my ($first) = shift || die "$usage";
my ($last)  = shift || die "$usage";

my ($string);
my ($input);
my ($winner);
my ($min) = 1.0E300;

while (<>) {
  print " Bounds on this search: $first, $last\n" unless $lines;
  $lines++;
  chomp;
  print "\n";
  print " Initial string = \"$_\"\n";
  $input = $_;
  pos $input = 0;
  $min    = 1.0E300;
  $winner = 0;

  # See Perl Best Practices, p 250 ff for details on loops like
  # this.
  while ( pos $input < length $input ) {
    if ( $input =~ m{ \G ($first.*?$last) }gcxms ) {
      $string = $1;
      print " matched string :$string:\n" if $debug;
      if ( length $string < $min ) {
        $winner = $string;
        $min    = length $winner;
      }
    }
    else {    # move pointer ahead
      $input =~ m/ \G (.) /gcxms;
    }
    print " so far, winner :$winner:, min :$min:\n" if $debug;
  }
  if ($winner) {
    print " Shortest substring = \"$winner\"\n";
  }
  else {
    print STDERR " No substring found, continuing.\n";
  }
}

print STDERR " ( Lines read: $lines )\n";

exit(0);

Producing:

% ./p1 a d data1
 Bounds on this search: a, d

 Initial string = "abcdpqracdpqaserd"
 Shortest substring = "acd"

 Initial string = "abc"
 No substring found, continuing.

 Initial string = "ad"
 Shortest substring = "ad"

 Initial string = "abcdabcadabcefdadabcd"
 Shortest substring = "ad"

 Initial string = "abc--------------------de"
 Shortest substring = "abc--------------------d"

 Initial string = "abc0123456789defgh   d"
 Shortest substring = "abc0123456789d"
 ( Lines read: 6 )

A tip of the hat to radoulov for noting the discrepancy ... cheers, drl

radoulov · October 20, 2007, 12:09pm

A (not only GNU) Awk solution:

awk -v s="abcdpqracdpqaserd" -v start="a" -v end="d" 'BEGIN{
	re=start"[^"end"]*"end
	min=length(s)+1
		while (match(s,re)){
			all[RLENGTH]=substr(s,RSTART,RLENGTH)
			s=substr(s,RSTART+1)	# resume just past the start of this match
			}
	for(p in all)
		if((p+0)<min)
			{min=p+0;shortest=all[p]}
print shortest
}'

aigles · October 20, 2007, 1:02pm

Another awk solution :

awk -v string="abcdpqracdpqaserd" \
    -v start="a"                  \
    -v end="d"                    \
    '
    BEGIN{
       regex = start "[^" end "]*" end;
       min_length = length(string) + 1;
       while (match(string,regex)) {
          if (RLENGTH < min_length) {
             min_length = RLENGTH;
             shortest   = substr(string, RSTART, RLENGTH);
          }
          string = substr(string, RSTART+1);
       }
       print shortest;
    }
    '

Jean-Pierre.

radoulov · October 20, 2007, 1:21pm

aigles:

... of course :), the array was completely unnecessary,
thanks, Jean-Pierre, for pointing it out.

pankajd · November 22, 2007, 12:55am

summer_cherry:

Hi,
thank you for the code,
but it does not give the correct result, i.e. you have not understood my question properly.
E.g. for the input string "abcdpqracdpqaserd" with first character "a" and last character "r", the shortest substring is "aser" and not "abcdpqr". Please help me write code that produces that.
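
One way to read this requirement: for every occurrence of the last character, pair it with the nearest preceding first character, and keep the shortest such pair. A minimal bash sketch under that reading follows. The function name shortest_between is hypothetical, and both bounds are assumed to be single literal characters.

```shell
#!/bin/bash
# Hypothetical helper: shortest substring of $1 that starts with $2 and
# ends with $3. Prints nothing and returns 1 when no such substring exists.
shortest_between() {
    local str=$1 first=$2 last=$3
    local i ch start=-1 best=""
    for (( i = 0; i < ${#str}; i++ )); do
        ch=${str:i:1}
        # Close a candidate when an end character follows a start character.
        if [ "$ch" = "$last" ] && (( start >= 0 )); then
            if [ -z "$best" ] || (( i - start + 1 < ${#best} )); then
                best=${str:start:i-start+1}
            fi
        fi
        # Remember the most recent start character seen so far.
        if [ "$ch" = "$first" ]; then start=$i; fi
    done
    [ -n "$best" ] && printf '%s\n' "$best"
}

shortest_between abcdpqracdpqaserd a d    # acd
shortest_between abcdpqracdpqaserd a r    # aser
shortest_between abcdpqracdpqaserd a p    # acdp
```

Because the start position is updated each time the first character is seen, a later a always replaces an earlier one, which is why a/r yields aser rather than abcdpqr.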

solaodeji · March 10, 2008, 5:23am

Hello all, I need your help picking out values delimited by a particular FS (-). There are several (-) in the output I get, as you can see in the excerpt below.

|0-0|filter -pstf |6464| 332558| 0| 41|Feb 20 01:56| 0| 0| 0|

|0-0|regular-styl131 |2794| 330622| 0| 41|Feb 20 01:56| 0| 0| 257|

|0-0|msnwr-1-0-styl131 | 38| 333886| 0| 41|Feb 20 01:56| 0| 0| 0|

|0-0|past-1-0-lgl131 | 150| 324889| 0| 101|Feb 20 01:56| 0| 35| 0|

|0-0|tns-sty131 |1315| 333811| 0| 11|Feb 20 01:56| 0| 0| 0|

--TOTAL--------------------------------------79-------------------896--10704686-620-

--TOTAL--------------------------------------795-------------------1596--10704686-5678-

The issue is that I need to extract only the last value on each TOTAL line (620 and 5678 in the excerpt above). The number of characters is not predictable, as the values are constantly changing.

How can I write a shell command that prints only these values?

Thanks

Olusola

radoulov · March 10, 2008, 6:20am

awk '/^--TOTAL/{print $(NF-1)}' FS="-" file
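
A small sketch of why $(NF-1) is the field to print: each TOTAL line ends with a -, so with - as the separator the last field is empty and the wanted value sits just before it. The wrapper name total_last_value is only for illustration; the sample input is the two TOTAL lines from the question.

```shell
#!/bin/sh
# Hypothetical wrapper around the one-liner above; reads lines on stdin.
total_last_value() {
    awk -F- '/^--TOTAL/ { print $(NF-1) }'
}

printf '%s\n' \
  '--TOTAL--------------------------------------79-------------------896--10704686-620-' \
  '--TOTAL--------------------------------------795-------------------1596--10704686-5678-' |
total_last_value
# 620
# 5678
```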