Get Substring from file name.

pinnacle · March 16, 2013, 8:42pm

Hi,

In my shell script, I am reading all the files in a directory.

And say if the file names are as follows:

adio_idfl_201302_df.txt
dfa_201301_dll.ctl
dalkd_20130301.csv

I would like to extract the numeric piece of each file name and display as follows:

201302
201301
20130301

I think this should be possible with the sed command, but I could not completely get it to work.

 
for i in `ls -1`
do
j=`echo $i | sed 's/_[0-9]*/[0-9]/'`
echo $j
done

Help is appreciated to fix the above code.

hanson44 · March 16, 2013, 10:06pm

Here is one way to extract the number from those files:

 for i in *; do
  j=`echo $i | sed 's/.*_\([0-9]\+\).*/\1/'`
  echo $j
done

 for i in *; do
  j=`echo $i | sed -r 's/.*_([0-9]+).*/\1/'`
  echo $j
done

Because * means zero or more, _[0-9]* was matching at the first '_' (followed by zero digits) instead of the one before '2'. And you have to use \1 in the replacement to insert the captured digits.
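
One way to see the difference is to let sed mark what each pattern matches, using & (the matched text) in the replacement. This is only a diagnostic sketch on one of the sample names; the variable names and the <...> markers are made up for illustration.

```shell
name='adio_idfl_201302_df.txt'

# "_[0-9]*" allows zero digits, so the leftmost match is the first "_" alone.
loose=$(printf '%s\n' "$name" | sed 's/_[0-9]*/<&>/')

# "_[0-9][0-9]*" needs at least one digit, so the match moves to "_201302".
strict=$(printf '%s\n' "$name" | sed 's/_[0-9][0-9]*/<&>/')

printf '%s\n' "$loose"     # adio<_>idfl_201302_df.txt
printf '%s\n' "$strict"    # adio_idfl<_201302>_df.txt
```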

elixir_sinari · March 16, 2013, 10:12pm

You don't need sed to do such a simple task (provided your shell is capable of doing the following):

i='adio_idfl_201302_df.txt'

echo "${i//[!0-9]/}"
201302
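
The ${var//pattern/} form is an extension found in bash, ksh93 and zsh rather than part of POSIX sh, hence the proviso. For shells without it, a rough alternative that sticks to the POSIX # and % expansions is sketched below. first_digits is a made-up helper name, and unlike the one-liner above it returns only the first run of digits rather than every digit in the name.

```shell
first_digits() {
    # ${1%%[0-9]*} is the non-digit prefix (everything before the first digit);
    # stripping that prefix leaves a string that starts with the digits.
    # Note: s is a global variable, since POSIX sh has no "local".
    s=${1#"${1%%[0-9]*}"}
    # Cut everything from the first non-digit onward.
    printf '%s\n' "${s%%[!0-9]*}"
}

i='adio_idfl_201302_df.txt'
first_digits "$i"    # 201302
```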

jim_mcnamara · March 16, 2013, 10:12pm

Or you can try:

ls | tr -dc '[:digit:]'

Edit:

Thanks alister

Yoda · March 16, 2013, 10:34pm

Another approach using BASH_REMATCH

for file in *
do
        [[ "$file" =~ [0-9]+ ]] && printf "%s\n" "${BASH_REMATCH[0]}"
done

From BASH manual:

BASH_REMATCH

An array variable whose members are assigned by the =~ binary operator to the [[ conditional command.  The element with index  0  is  the  portion  of  the
string  matching  the entire regular expression.  The element with index n is the portion of the string matching the nth parenthesized subexpression.  This
variable is read-only.
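
The parenthesized-subexpression part of that description can be used to pick out only a digit run that follows an underscore. A minimal sketch, assuming bash; the function name date_part and the 4-digit minimum are assumptions for illustration, not something taken from the post above.

```shell
# Requires bash: [[ =~ ]] and BASH_REMATCH are not available in plain sh.
date_part() {
    # "_", then 4 or more digits (captured), then "_" or "."
    local re='_([0-9]{4,})[_.]'
    if [[ $1 =~ $re ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}"   # text matched by the first ( )
    fi
}

for file in adio_idfl_201302_df.txt dfa_201301_dll.ctl dalkd_20130301.csv
do
    date_part "$file"
done
```

Keeping the regular expression in a variable avoids quoting surprises on the right-hand side of =~.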

alister · March 16, 2013, 10:45pm

You probably meant ls and not ls *.

Regards,
Alister

anbu23 · March 17, 2013, 3:00am

$ ls
adio_idfl_201302_df.txt  dalkd_20130301.csv  dfa_201301_dll.ctl
$ ls | tr -dc [:digit:]
20130220130301201301
$ ls | tr -dc '[\n[:digit:]]'
201302
20130301
201301
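
With a POSIX-style tr such as GNU tr, the outer [ and ] in the last command are not required, as far as I know; they are simply added to the set of characters that is kept, which would only matter if a file name contained a literal bracket. A sketch of the same idea without them, using printf with the sample names in place of ls so it can be run anywhere (digits_per_line is a made-up name):

```shell
# Keep only digits and newlines, so each file name leaves one line of digits.
digits_per_line() {
    tr -cd '[:digit:]\n'
}

printf '%s\n' adio_idfl_201302_df.txt dalkd_20130301.csv dfa_201301_dll.ctl |
    digits_per_line
```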

Scrutinizer · March 17, 2013, 5:08am

If the file names contain an extra digit (for instance a version number somewhere), most of these methods will not produce the correct result. You could instead select on the minimum number of digits that must be present, for example 4:

oldIFS=$IFS    # save the current IFS so it can be restored after the loop
IFS=_.
for file in *_*[0-9][0-9][0-9][0-9]*_*.*
do
  set -- $file
  for i do
    case $i in (*[0-9][0-9][0-9][0-9]*)
      printf "%d\n" "$i"
    esac
  done
done
IFS=$oldIFS

or

oldIFS=$IFS    # save the current IFS so it can be restored after the loop
IFS=_.
for file in *_*[0-9][0-9][0-9][0-9]*_*.*
do
  for i in $file
  do
    case $i in (*[0-9][0-9][0-9][0-9]*)
      printf "%d\n" "$i"
    esac
  done
done
IFS=$oldIFS

Or:

ls | sed -n 's/.*_\([0-9]\{4,\}\)_.*\..*/\1/p'
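
Note that, as written, both the glob and the sed expression expect another underscore after the digits, so a name like dalkd_20130301.csv from the original question is skipped. If such names should be included, one possible variation (an assumption about the intended behaviour, not Scrutinizer's code) accepts either _ or . after the digit run. The function name extract_dates and the extra sample name notes_v2.txt are made up for illustration.

```shell
# Print the run of 4 or more digits that sits between "_" and either "_" or ".".
# Names without such a run produce no output because of -n and the p flag.
extract_dates() {
    sed -n 's/.*_\([0-9]\{4,\}\)[_.].*/\1/p'
}

printf '%s\n' adio_idfl_201302_df.txt dfa_201301_dll.ctl dalkd_20130301.csv notes_v2.txt |
    extract_dates
```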

pinnacle · March 17, 2013, 12:38pm

hanson44:

Here is one way to extract the number from those files:
 for i in *; do
  j=`echo $i | sed 's/.*_\([0-9]\+\).*/\1/'`
  echo $j
done
 for i in *; do
  j=`echo $i | sed -r 's/.*_([0-9]+).*/\1/'`
  echo $j
done
Because * means zero or more, _[0-9]* was matching at the first '_' (followed by zero digits) instead of the one before '2'. And you have to use \1 in the replacement to insert the captured digits.

The above do not seem to work. Please see below:

echo $i | sed 's/.*_\([0-9]\+\).*/\1/'
adio_idfl_201302_df.txt

echo $i | sed -r 's/.*_([0-9]+).*/\1/'
sed: illegal option -- r
Usage:  sed [-n] [-u] Script [File ...]
        sed [-n] [-u] [-e Script] ... [-f Script_file] ... [File ...]

Thanks for the help though.

---------- Post updated at 11:36 AM ---------- Previous update was at 11:33 AM ----------

Looks like my shell is not capable of doing this.

"${i//[!0-9]/}": bad substitution

---------- Post updated at 11:38 AM ---------- Previous update was at 11:36 AM ----------

Thanks, this one is working:

ls | sed -n 's/.*_\([0-9]\{4,\}\)_.*\..*/\1/p'

I will be using the above code.

alister · March 17, 2013, 12:44pm

The reason for the failure is that \+ in the regular expression is a GNU extension.

I see no reason to limit one's options by ever using \+. It may be a bit longer, but \{1,\} works everywhere.
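
Applied to the expression that failed above, that substitution might look like the following sketch. The function name extract_num is made up for illustration; the loop over * follows the earlier posts.

```shell
# Same idea as the \+ version, but \{1,\} is standard BRE syntax.
# A name with no "_<digits>" part is printed unchanged.
extract_num() {
    printf '%s\n' "$1" | sed 's/.*_\([0-9]\{1,\}\).*/\1/'
}

for i in *
do
    extract_num "$i"
done
```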

Regards,
Alister