Get Substring from file name.

Hi,

In my shell script, I am reading all the files in a directory.

And say if the file names are as follows:

adio_idfl_201302_df.txt
dfa_201301_dll.ctl
dalkd_20130301.csv

I would like to extract the numeric piece of each file name and display as follows:

201302
201301
20130301

I think we should be able to do this with the sed command, but I could not completely get it.

 
for i in `ls -1`
do
j=`echo $i | sed 's/_[0-9]*/[0-9]/'`
echo $j
done

Help is appreciated to fix the above code.

Here is one way to extract the number from those files:

for i in *; do
  j=`echo "$i" | sed 's/.*_\([0-9]\+\).*/\1/'`
  echo "$j"
done

Or, with extended regular expressions (sed -r):

for i in *; do
  j=`echo "$i" | sed -r 's/.*_([0-9]+).*/\1/'`
  echo "$j"
done

Because * matches zero or more, _[0-9]* was matching the first '_' (followed by zero digits) instead of the one before '2'. And you have to capture the digits in a group and use \1 in the replacement.
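To see the difference concretely, here is a small sketch that runs the original attempt next to a capture-group version on one of the sample names. The function names first_try and with_group are made up for illustration, and [0-9][0-9]* is used in place of \+ only so that the sketch does not depend on one particular sed.

```shell
# Hypothetical helper names, for illustration only.

# The original attempt: _[0-9]* matches the first "_" followed by zero
# digits, and the replacement text "[0-9]" is inserted literally.
first_try() {
  printf '%s\n' "$1" | sed 's/_[0-9]*/[0-9]/'
}

# Capture-group version: match the whole name, keep only the digits after "_".
with_group() {
  printf '%s\n' "$1" | sed 's/.*_\([0-9][0-9]*\).*/\1/'
}

first_try  'adio_idfl_201302_df.txt'   # prints adio[0-9]idfl_201302_df.txt
with_group 'adio_idfl_201302_df.txt'   # prints 201302
```

The leading .* is greedy, so it backs off only as far as the last underscore that is directly followed by a digit.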

You don't need sed to do such a simple task (provided your shell is capable of doing the following):

i='adio_idfl_201302_df.txt'

echo "${i//[!0-9]/}"
201302

Or you can try:

ls | tr -dc '[:digit:]'

Edit: Thanks, alister.

Another approach using BASH_REMATCH

for file in *
do
        [[ "$file" =~ [0-9]+ ]] && printf "%s\n" "${BASH_REMATCH[0]}"
done

From BASH manual:

BASH_REMATCH

An array variable whose members are assigned by the =~ binary operator to the [[ conditional command. The element with index 0 is the portion of the string matching the entire regular expression. The element with index n is the portion of the string matching the nth parenthesized subexpression. This variable is read-only.
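As a sketch of the "index n" part of that description, the loop can use a parenthesized subexpression and read element 1 instead of element 0. The helper name digits_after_underscore and the choice of "four or more digits right after an underscore" are assumptions made for this example, not something from the posts. This needs bash; the [[ ... =~ ... ]] operator is not available in plain POSIX sh.

```shell
# Sketch only; requires bash.
# digits_after_underscore is a hypothetical helper name.
digits_after_underscore() {
  local re='_([0-9]{4,})'      # group 1: four or more digits right after "_"
  if [[ $1 =~ $re ]]; then
    printf '%s\n' "${BASH_REMATCH[1]}"
  fi
}

for file in adio_idfl_201302_df.txt dfa_201301_dll.ctl dalkd_20130301.csv
do
  digits_after_underscore "$file"
done
```

Keeping the regular expression in a variable and expanding it unquoted avoids quoting surprises on the right-hand side of =~.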

You probably meant ls and not ls * .

Regards,
Alister

$ ls
adio_idfl_201302_df.txt  dalkd_20130301.csv  dfa_201301_dll.ctl
$ ls | tr -dc [:digit:]
20130220130301201301
$ ls | tr -dc '[\n[:digit:]]'
201302
20130301
201301

If the file names contain an extra digit (for instance, a version number somewhere), then most of these methods will not give the correct result. You could instead select on the minimum number of digits that must be present, for example 4:

oldIFS=$IFS
IFS=_.
for file in *_*[0-9][0-9][0-9][0-9]*_*.*
do
  set -- $file
  for i do
    case $i in (*[0-9][0-9][0-9][0-9]*)
      printf "%d\n" "$i"
    esac
  done
done
IFS=$oldIFS

or

oldIFS=$IFS
IFS=_.
for file in *_*[0-9][0-9][0-9][0-9]*_*.*
do
  for i in $file
  do
    case $i in (*[0-9][0-9][0-9][0-9]*)
      printf "%d\n" "$i"
    esac
  done
done
IFS=$oldIFS

Or:

ls | sed -n 's/.*_\([0-9]\{4,\}\)_.*\..*/\1/p'
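A quick way to check what this prints, without depending on the directory contents, is to feed it a fixed list of names. In the sketch below, adio_v2_201302_df.txt is a made-up name with an extra version digit, and the second command is a looser variant added only for comparison; it is an assumption about what might be wanted, not part of the suggestion above.

```shell
# Sketch: feed sample names to sed instead of relying on ls.
# "adio_v2_201302_df.txt" is a made-up name with an extra version digit.
names() {
  printf '%s\n' adio_v2_201302_df.txt dfa_201301_dll.ctl dalkd_20130301.csv
}

# The command as posted: needs "_" on both sides of the digit run.
names | sed -n 's/.*_\([0-9]\{4,\}\)_.*\..*/\1/p'

# A looser variant (hypothetical): only the leading "_" is required.
names | sed -n 's/.*_\([0-9]\{4,\}\).*/\1/p'
```

With these names, the first command prints 201302 and 201301 and skips dalkd_20130301.csv, because no underscore follows its digits; the looser variant prints all three numbers.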

The ones above do not seem to be working. Please see below:

echo $i | sed 's/.*_\([0-9]\+\).*/\1/'
adio_idfl_201302_df.txt
echo $i | sed -r 's/.*_([0-9]+).*/\1/'
sed: illegal option -- r
Usage:  sed [-n] [-u] Script [File ...]
        sed [-n] [-u] [-e Script] ... [-f Script_file] ... [File ...]

Thanks for the help though.

---------- Post updated at 11:36 AM ---------- Previous update was at 11:33 AM ----------

Looks like my shell is not capable of doing this.

"${i//[!0-9]/}": bad substitution

---------- Post updated at 11:38 AM ---------- Previous update was at 11:36 AM ----------

Thanks, this is working.

---------- Post updated at 11:38 AM ---------- Previous update was at 11:38 AM ----------

ls | sed -n 's/.*_\([0-9]\{4,\}\)_.*\..*/\1/p'

Thanks I will be using the above code.

The reason for the failure is that the \+ in the regular expression is a GNU extension.

I see no reason to limit one's options by ever using \+. It may be a bit longer, but \{1,\} works everywhere.

Regards,
Alister
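As a sketch of that suggestion, the first sed answer can be written with \{1,\} in place of \+. The helper name extract_number is made up for illustration, and the three file names are the ones from the original question.

```shell
# extract_number is a hypothetical helper name.
# \{1,\} is the POSIX basic-regex spelling of "one or more".
extract_number() {
  printf '%s\n' "$1" | sed 's/.*_\([0-9]\{1,\}\).*/\1/'
}

for f in adio_idfl_201302_df.txt dfa_201301_dll.ctl dalkd_20130301.csv
do
  extract_number "$f"
done
```

This should print 201302, 201301 and 20130301, one per line, and it uses neither the -r option nor \+.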