shell script for extracting out the shortest substring from the given starting and en

Hi all,
I urgently need help writing a shell script that extracts and prints the shortest substring of a given string, where the first and last characters of that substring are supplied by the user.
For example:
if str="abcdpqracdpqaserd"
and the user gives 'a' and 'd' as the first and last characters of the substring (i.e. as command-line arguments), the script should extract acd as the shortest substring.
Please suggest a simple solution.

str="abcpqracdpqaserd"
startch="a"
endch="d"
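awk -v str="$str" -v st="$startch" -v end="$endch" 'BEGIN{
s=index(str,st)             # position of the first start character
e=index(str,end)            # position of the first end character
print substr(str,s,e-s+1)   # the third argument of substr is a length
}'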

output:

# ./test.sh
abcpqracd

Another way with sed (first and last can't be special chars) :

str="abcdpqracdpqaserd"
first=a
last=d
substr=$(echo "$str"| sed -n "s/^[^$first]*\($first[^$last]*$last\).*/\1/p")
echo "$substr"
$ sh -x substr.sh
+ str=abcdpqracdpqaserd
+ first=a
+ last=d
++ echo abcdpqracdpqaserd
++ sed 's/^[^a]*\(a[^d]*d\).*/\1/p'
+ substr=abcd
+ echo abcd
abcd
$

Jean-Pierre.

Hi,

It really took me a lot of effort. I have tested it on many cases, and they all pass. I hope this is what you are after.

input:

abcdpqracdpqaserd
abcdpqracdpqaserd
abcdpqracdpqaserd

output (start:a end:d):

acd
acd
acd

output (start:a end:p):

acdp
acdp
acdp

output (start:a end:r):

abcdpqr
abcdpqr
abcdpqr

code:

read a
read b
sed -e "s/$a[^$b]*$b/|&|/g" a > temp_a
sed 's/^|//' temp_a > temp_b

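nawk -v st="$a" -v ed="$b" 'BEGIN{
FS="|"
}
{
tmp=""	# reset the shortest candidate for each input line
for(i=1;i<=NF;i++)
{
	# keep only the pieces that begin with st and end with ed
	if(index($i,st)==1 && substr($i,length($i))==ed)
	{
		if(tmp=="" || length($i)<length(tmp))
			tmp=$i
	}
}
print tmp
}
' temp_b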

With GNU Awk:

awk 'NF>1&&$0=(FS $NF RT){
	if(length<min){
		min=length;rec=$0}
	}END{
print rec
}' FS="$start" RS="$end" min=9^9 filename
$ cat file
abcdpqracdpqaserd
$ start=a
$ end=d
$ awk 'NF>1&&$0=(FS $NF RT){
if(length<min){
min=length;rec=$0}
}END{
print rec
}' FS="$start" RS="$end" min=9^9 file
acd
$ start=a
$ end=p
$ awk 'NF>1&&$0=(FS $NF RT){
if(length<min){
min=length;rec=$0}
}END{
print rec
}' FS="$start" RS="$end" min=9^9 file
acdp
$ start=a
$ end=r
$ awk 'NF>1&&$0=(FS $NF RT){
if(length<min){
min=length;rec=$0}
}END{
print rec
}' FS="$start" RS="$end" min=9^9 file
aser

Hi.

I like the solution from aigles. I don't see one in perl yet.

The perl RE syntax has special features for the shortest match. Here is the entire code, along with diagnostic code, minimal argument processing, etc:

#!/usr/bin/perl

# @(#) p1       Demonstrate non-greedy matching perl RE syntax.

use warnings;
use strict;

my ($debug);
$debug = 0;
$debug = 1;

my ($lines) = 0;

my ($usage) = "usage: $0 first last\n";
my ($first) = shift || die "$usage";
my ($last)  = shift || die "$usage";

my ($string);

while (<>) {
  print " Bounds on this search: $first, $last\n" unless $lines;
  $lines++;
  chomp;
  print "\n";
  print " Initial string = \"$_\"\n";
  if (/($first.*?$last)/) {
    $string = $1;
    print " Shortest substring = \"$string\"\n";
  }
  else {
    print STDERR " No substring found, continuing.\n";
  }
}

print STDERR " ( Lines read: $lines )\n";

exit(0);

Running this on your test line and a few others in file data1:

% ./p1 a d data1
 Bounds on this search: a, d

 Initial string = "abcdpqracdpqaserd"
 Shortest substring = "abcd"

 Initial string = "abc"
 No substring found, continuing.

 Initial string = "abcdddd"
 Shortest substring = "abcd"
 ( Lines read: 3 )

The heart of the match is in these characters .*?

See the man pages for:

perlre              Perl regular expressions, the rest of the story
perlreref           Perl regular expressions quick reference

for details ... cheers, drl

Am I missing something, or did the OP want acd (not abcd) from abcdpqracdpqaserd with a and d?

Hi, radoulov.

I assumed the OP missed something, namely the b key. If not, he can explain how that should be obtained, give another example, etc. ... cheers, drl

I assumed he wanted the shortest match.

Hi.

If it were true that we could arbitrarily omit characters, then the shortest match would always be "ad", and we wouldn't need to work so hard.

Do you see any other algorithmic way to get "acd" from "abcdpqracdpqaserd"? -- or did I miss something this time? ... cheers, drl

abcdpqracdpqaserd

acd is the shortest match of a[^d]*d

:slight_smile:
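
To see why the first match is not necessarily the shortest one, it can help to list every candidate that a[^d]*d produces when the search is restarted one character after each hit. This is only a sketch: the function name list_candidates is made up for illustration, and it assumes the start and end characters are plain letters or digits (nothing special inside a bracket expression).

```shell
# list_candidates STRING START END   (hypothetical helper name)
# Prints every match of START[^END]*END, restarting the search one
# character after the beginning of the previous match.
# Assumes START and END are ordinary characters, not regex specials.
list_candidates() {
    awk -v s="$1" -v st="$2" -v en="$3" 'BEGIN {
        re = st "[^" en "]*" en
        while (match(s, re)) {
            print substr(s, RSTART, RLENGTH)
            s = substr(s, RSTART + 1)   # continue just past this start
        }
    }'
}
```

Calling list_candidates abcdpqracdpqaserd a d prints abcd, acd and aserd, one per line; abcd is simply the leftmost candidate, while acd is the shortest.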

Hi.

Good eye; got it, thanks. I'll need to scan the entire string, as you did ... cheers, drl

Hi.

Modified perl code to scan entire string:

#!/usr/bin/perl

# @(#) p1       Demonstrate non-greedy matching perl RE syntax.

use warnings;
use strict;

my ($debug);
$debug = 1;
$debug = 0;

my ($lines) = 0;

my ($usage) = "usage: $0 first last\n";
my ($first) = shift || die "$usage";
my ($last)  = shift || die "$usage";

my ($string);
my ($input);
my ($winner);
my ($min) = 1.0E300;

while (<>) {
  print " Bounds on this search: $first, $last\n" unless $lines;
  $lines++;
  chomp;
  print "\n";
  print " Initial string = \"$_\"\n";
  $input = $_;
  pos $input = 0;
  $min    = 1.0E300;
  $winner = 0;

  # See Perl Best Practices, p 250 ff for details on loops like
  # this.
  while ( pos $input < length $input ) {
    if ( $input =~ m{ \G ($first.*?$last) }gcxms ) {
      $string = $1;
      print " matched string :$string:\n" if $debug;
      if ( length $string < $min ) {
        $winner = $string;
        $min    = length $winner;
      }
    }
    else {    # move pointer ahead
      $input =~ m/ \G (.) /gcxms;
    }
    print " so far, winner :$winner:, min :$min:\n" if $debug;
  }
  if ($winner) {
    print " Shortest substring = \"$winner\"\n";
  }
  else {
    print STDERR " No substring found, continuing.\n";
  }
}

print STDERR " ( Lines read: $lines )\n";

exit(0);

Producing:

% ./p1 a d data1
 Bounds on this search: a, d

 Initial string = "abcdpqracdpqaserd"
 Shortest substring = "acd"

 Initial string = "abc"
 No substring found, continuing.

 Initial string = "ad"
 Shortest substring = "ad"

 Initial string = "abcdabcadabcefdadabcd"
 Shortest substring = "ad"

 Initial string = "abc--------------------de"
 Shortest substring = "abc--------------------d"

 Initial string = "abc0123456789defgh   d"
 Shortest substring = "abc0123456789d"
 ( Lines read: 6 )

A tip of the hat to radoulov for noting the discrepancy ... cheers, drl

A (not only GNU) Awk solution:

awk -v s="abcdpqracdpqaserd" -v start="a" -v end="d" 'BEGIN{
	re=start"[^"end"]*"end
	min=length(s)+1
	while (match(s,re)){
		all[RLENGTH]=substr(s,RSTART,RLENGTH)
		s=substr(s,RSTART+1)	# resume just after the start of this match
	}
	for(p in all)
		if((p+0)<min)
			{min=p+0;shortest=all[p]}
	print shortest
}'

Another awk solution :wink: :

awk -v string="abcdpqracdpqaserd" \
    -v start="a"                  \
    -v end="d"                    \
    '
    BEGIN{
       regex = start "[^" end "]*" end;
       min_length = length(string) + 1;
       while (match(string,regex)) {
          if (RLENGTH < min_length) {
             min_length = RLENGTH;
             shortest   = substr(string, RSTART, RLENGTH);
          }
          string = substr(string, RSTART+1);
       }
       print shortest;
    }
    '

Jean-Pierre.

... of course :), the array was completely unnecessary,
thanks, Jean-Pierre, for pointing it out.

Hi,
thank you for the code,
but it does not give the correct result; I think my question was misunderstood.
E.g. for the input string "abcdpqracdpqaserd" with first character "a" and last character "r", the shortest substring is "aser" and not "abcdpqr". Please help me write such code.
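
One possible way to get "aser" here is to forbid both the start and the end character between the two ends, i.e. to match a[^ar]*r instead of a[^r]*r, so that a candidate always begins at the last "a" before an "r". The following is a sketch only: shortest_between is a made-up name, and it assumes both characters are plain letters or digits rather than regex specials.

```shell
# shortest_between STRING START END   (hypothetical helper name)
# Prints the shortest substring that begins with START, ends with END
# and contains neither character in between. Prints nothing if there
# is no such substring.
# Assumes START and END are ordinary characters, not regex specials.
shortest_between() {
    awk -v s="$1" -v st="$2" -v en="$3" 'BEGIN {
        re = st "[^" st en "]*" en
        found = 0
        while (match(s, re)) {
            if (!found || RLENGTH < min) {
                min = RLENGTH
                shortest = substr(s, RSTART, RLENGTH)
                found = 1
            }
            s = substr(s, RSTART + 1)   # keep scanning the rest
        }
        if (found) print shortest
    }'
}
```

With the string from this thread, shortest_between abcdpqracdpqaserd a r prints aser, while a d gives acd and a p gives acdp.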

Hello all, I need your help to filter out values separated by a particular FS (-) in this situation. There are several (-) characters in the output I get, as you can see in the excerpt below.

|0-0|filter -pstf |6464| 332558| 0| 41|Feb 20 01:56| 0| 0| 0|

|0-0|regular-styl131 |2794| 330622| 0| 41|Feb 20 01:56| 0| 0| 257|

|0-0|msnwr-1-0-styl131 | 38| 333886| 0| 41|Feb 20 01:56| 0| 0| 0|

|0-0|past-1-0-lgl131 | 150| 324889| 0| 101|Feb 20 01:56| 0| 35| 0|

|0-0|tns-sty131 |1315| 333811| 0| 11|Feb 20 01:56| 0| 0| 0|

--TOTAL--------------------------------------79-------------------896--10704686-620-

--TOTAL--------------------------------------795-------------------1596--10704686-5678-

The issue is that I need to extract only the last values in the TOTAL lines (the coloured values). The number of characters is not predictable, as the values are constantly changing.

How can I write a shell command that prints only these values?

Thanks

Olusola

awk '/^--TOTAL/{print $(NF-1)}' FS="-" file
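
Because each TOTAL line ends with a "-", splitting on "-" leaves an empty last field, so the value wanted is the one before it, $(NF-1). The same command can be wrapped in a small function so it reads either a file or standard input; the name last_total is made up for illustration, and the sketch assumes there is nothing after the final "-" on those lines.

```shell
# last_total [FILE...]   (hypothetical helper name)
# For each line beginning with --TOTAL, print the value that sits
# between the last two "-" characters.
# Assumes the line ends with "-" and nothing follows it.
last_total() {
    awk -F'-' '/^--TOTAL/ { print $(NF-1) }' "$@"
}
```

Run against the excerpt above, last_total file prints 620 and 5678, one per line.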