FILE_ID extraction from file name and save it in CSV file after looping through each folders

My files are located on a UNIX server. I want to extract the file_id and file_name from each file name and save them in a CSV file. How do I do that?
I have folders in a UNIX environment; the directory structure is as follows:
year folder -> 12 month folders inside -> 30/31 day folders inside

I ran the ls command in the top folder; the year folders are as follows:
2009 2010 2011 2012
I ran the cd command for year 2012

$ cd 2012 

I ran the ls command in the 2012 year folder

$ ls 
01 02 03 04 05 06 07 08 09 

Then I ran the commands for September

$ cd 09 
$ ls 
01 02 03 04 05 06 07 08 09 10 11 12 13 
$ cd 13 
$ ls 
sasmm_fsbc_durds_id00020532_t20100313192606.dat.trnsfr.gz 
sasmm_fsbc_durds_id00020513_t20120913003312.dat.trnsfr.gz 

There are folders for each year: 2009, 2010, 2011 and 2012.
Each year folder has 12 month folders: 01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 11, 12.
Each month folder has up to 31 day folders: 01, 02, 03, ... 29, 30, 31.

Each day folder contains files.
The file names look like this:
sasmm_fsbc_durds_id00020532_t20100313192606.dat.trnsfr.gz
sasmm_fsbc_durds_id00020513_t20120913003312.dat.trnsfr.gz
I want one CSV file with two columns: the first for file_id and the second for file name.
To obtain the file_id value, loop through each folder, get each file name, take the substring
between "sasmm_fsbc_durds_id000" and "_t" and store it in the file_id column, and store
the file name in the file_name column.

In the above example, for the file sasmm_fsbc_durds_id00020532_t20100313192606.dat.trnsfr.gz:
read the file name sasmm_fsbc_durds_id00020532_t20100313192606.dat.trnsfr.gz,
cut out 20532 and save it in the file_id column, and save the file name in the second column = sasmm_fsbc_durds_id00020532_t20100313192606.dat

CSV file will look like

file_id file_name 
20532 sasmm_fsbc_durds_id00020532_t20100313192606.dat 
20513 sasmm_fsbc_durds_id00020513_t20120913003312.dat 

The file_id is to be cut from the file name. If you look at the file names closely, you can see that
the file_ids come after 000; in the examples above they are 20532 and 20513.

How do I loop through year 2012, its 12 month folders and the 31 day folders inside them, and create
a CSV file with the data shown above?
I am very new to UNIX, please help me out. If you could provide code, that would be great.
Thanks.

The output CSV file should look like this

file_id file_name 
20532 sasmm_fsbc_durds_id00020532_t20100313192606.dat 
20513 sasmm_fsbc_durds_id00020513_t20120913003312.dat

Do we need to search recursively to find the files in each folder, or go down to each day folder?

First you say that filenames in the directory 2012/09/13 are:

sasmm_fsbc_durds_id00020532_t20100313192606.dat.trnsfr.gz 
sasmm_fsbc_durds_id00020513_t20120913003312.dat.trnsfr.gz

and then you say you want the entire filename to be the second field in your output file and say that that field should be:

sasmm_fsbc_durds_id00020532_t20100313192606.dat
sasmm_fsbc_durds_id00020513_t20120913003312.dat

What happened to the .trnsfr.gz at the end of the filenames?

Is the file_id field always supposed to be a string of decimal digits or could other characters appear in the file_id?

Is there any chance that there will be more than one occurrence of _t in a filename after sasmm_fsbc_durds_id000 ?

Should an error be reported if other files exist under 2???/[01][0-9]/[0-3][0-9] with filenames that don't start with sasmm_fsbc_durds_id000 and contain _t after that?

A quick Ksh script that assumes the current directory contains the year directories:

#!/usr/bin/env ksh
find 20[0-1][0-9] -type f | while read path
do
    name=${path##*/}
    name=${name%.trns*}
    id=${name%_*}
    id=${id##*_}
    id=${id:2}
    echo  ${id/~(+E)^[0]+/} $name
done >output-file

Requires ksh (Korn shell), and there are probably more efficient ways to do this.

What happened to the .trnsfr.gz at the end of the filenames?

Yes, in each folder the file names end with
.dat.trnsfr.gz
but when we enter them in the CSV file under the file_name column,
.trnsfr.gz
should be omitted.

The file_id is a number; it should be extracted from the file name itself.

In your code, you have not specified the output file as CSV.
Are you looping through all files inside all folders in a year?
Which code is used for extracting the id from the file name?

How do you specify the column names in the output file?

Do you know how to write the same logic as a simple shell script?

---------- Post updated at 10:18 PM ---------- Previous update was at 09:59 PM ----------

If I use this loop, will it loop through all folders?

FILES=`ls -1`
for FILE in $FILES
do

---------- Post updated at 10:27 PM ---------- Previous update was at 10:18 PM ----------

I ran your script; it gives this error message:

[/work/users/po/prince]$ ./testSBI.sh
./testSBI.sh[8]: id=${id:2}: bad substitution

your code

---------- Post updated at 10:39 PM ---------- Previous update was at 10:27 PM ----------

I removed the line of code that causes the error
and executed your script without it; it threw another error:

./testSBI.sh[10]: ${id/~(+E)[2]+/}: bad substitution


You can set the output file name however you want. Replace output-file with CSV, or whatever you want the output filename to be. The find command will list all files under all directories whose names are of the form 2000 - 2019 (that is what the pattern 20[0-1][0-9] matches), so yes, in a way we are looping through all files, but letting find do the work rather than the script.

The code that extracts the ID from the name is:

id=${name%_*}    # delete from last underbar to the end, and assign to variable id
id=${id##*_}    # delete from front of the string to the last underbar and reassign to id
id=${id:2}   # extract the number (portion of string starting at character 2)

The leading zeros are removed as the variable is expanded in the echo:

${id/~(+E)^[0]+/}

You made no mention of column names, only that the ID was to be first and the filename was to be second. The code prints ID followed by filename. Per your example there is no comma; I was a bit confused with your initial post as you indicated that the file was comma separated values (csv) yet you didn't indicate that the columns should be separated that way.

The code I posted is a simple shell script.

Yes, but it's bad form if you ask me. Something like this would be better:

ls | while read file
do
   echo $file
done
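
Note that ls with no arguments lists only the current directory, so neither form of the loop descends into the month and day folders by itself. A minimal sketch of walking the year/month/day levels with shell globs, assuming the layout described in this thread and that it is run from the directory holding the year folders (the function name list_day_files is made up for illustration):

```shell
#!/bin/sh
# list_day_files YEAR: print every regular file under YEAR/<month>/<day>/.
# Hypothetical helper; assumes two-digit month and day folder names.
list_day_files() {
    for file in "$1"/[01][0-9]/[0-3][0-9]/*
    do
        # An unmatched glob is left as literal text, so test before printing.
        [ -f "$file" ] && printf '%s\n' "$file"
    done
    return 0
}
```

Calling list_day_files 2012 would then print one path per line, which can be piped into a while read loop like the one above.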

Were you using ksh (Korn Shell)? Bash cannot handle the last substitution which eliminates the leading zeros from the ID. If you cannot use ksh, then you'll need to change the echo and delete the zeros with sed or some other mechanism.
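
If neither sed nor a newer ksh is available, the leading zeros can also be removed with plain parameter expansion. This is only a sketch, assuming the file names follow the examples in this thread; the helper name strip_zeros is invented here and is not part of the script posted earlier:

```shell
#!/bin/sh
# strip_zeros STRING: print STRING without its leading zeros
# (hypothetical helper, uses only POSIX parameter expansion).
strip_zeros() {
    s=$1
    # ${s%%[!0]*} is the run of leading zeros; remove that run from the front.
    s=${s#"${s%%[!0]*}"}
    # If the string was all zeros (or empty), print a single 0 instead.
    printf '%s\n' "${s:-0}"
}

name=sasmm_fsbc_durds_id00020532_t20100313192606.dat
id=${name%_*}      # delete from the last underscore to the end
id=${id##*_}       # delete up to the last remaining underscore -> id00020532
id=${id#id}        # drop the literal "id" prefix -> 00020532
id=$(strip_zeros "$id")
printf '%s %s\n' "$id" "$name"
```

This replaces both the ${id:2} and the ${id/~(+E)^[0]+/} steps, which are the two lines that produced the "bad substitution" errors.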

The following seems to do what you requested. You say that you want to create a CSV file, but by definition a CSV file has fields that are separated by commas. You don't show any commas in any of your sample output. This script uses a tab to separate output fields to get the headers to line up with the following data. Although it is written using ksh, it should also work with at least bash and sh:

#!/bin/ksh
printf "file_id\tfile_name\n"
find 2[0-9][0-9][0-9] -name 'sasmm_fsbc_durds_id000[0-9]*_t?*' | while read path
do
        file=$(basename "$path" .trnsfr.gz)
        id=${file#sasmm_fsbc_durds_id000}
        id=${id%%_t*}
        printf "%s\t%s\n" "$id" "$file"
done

Note that this will ignore any files found in and under the year directories that don't match your filename specifications.

To run it, save the above code in a file (e.g., extract) in the same directory where the year directories reside, make it executable by issuing the command:

chmod +x extract

and then issue the command:

./extract > output_file

If you leave off > output_file , the output will be written to your terminal. If you want to save the output in a file with a name other than output_file, replace it with any name you want.
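
Because the script silently ignores files that don't match the pattern, it may be worth checking what was skipped. A possible sketch, reusing the same name pattern as the script above; the function name list_skipped is made up for this example:

```shell
#!/bin/sh
# list_skipped DIR...: print regular files under the given directories whose
# names do NOT match the pattern used by the extraction script.
list_skipped() {
    find "$@" -type f ! -name 'sasmm_fsbc_durds_id000[0-9]*_t?*'
}
```

Running list_skipped 2[0-9][0-9][0-9] from the directory that holds the year directories should list any files the extraction left out; no output means nothing was skipped.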

script

#!/usr/bin/env ksh
OUTFILE=test.txt
find 20[0-1][0-9] -type f | while read path
 do
   name=${path##*/}
   name=${name%.trns*}   
   id=${name%_*}
   id=${id##*_}
   id=${id##*000}
   echo "id: $id"
   echo "file name: $name"
  done  > ${OUTFILE}
exit

MY SCRIPT RESULT

id: 20532
file name: sasmm_fsbc_durds_id00020532_t20120112192606.dat
id: 20533
file name: sasmm_fsbc_durds_id00020533_t20120212192606.dat
id: 20534
file name: sasmm_fsbc_durds_id00020534_t20120312192606.dat

If you really want a CSV, why don't any of your postings show desired output containing a comma? Anyway, the following minor change to my earlier posted script should meet your currently stated requirements:

#!/bin/ksh
printf "file_id,file_name\n"
find 2[0-9][0-9][0-9] -name 'sasmm_fsbc_durds_id000[0-9]*_t?*.dat.trnsfr.gz' | while read path
do
        file=$(basename "$path" .trnsfr.gz)
        id=${file#sasmm_fsbc_durds_id000}
        id=${id%%_t*}
        printf "%s,%s\n" "$id" "$file"
done

I need one more solution, for replacing a few characters in a file name. My folder structure and files are laid out as described in my previous posts.

 
folder years -> folder months -> folder days

In each day folder there will be files, and their file names are as mentioned in the posts above

sasmm_fsbc_durds_id00020532_t20100313192606.dat.trnsfr.gz 

Now I have to loop through each folder sequentially, like year 2009, 2010, 2011 etc., go into the day folders and modify the file names, from March 2011 to 26 Jan 2012

 
sasmm_fsbc_durds_id00020079_t20110301010023.dat.trnsfr.gz 

This is the first file, placed in the folder 2011 -> 03 -> 01.

From this file onwards, modify the file name from

sasmm_fsbc_durds_id00020079_t20110301010023.dat.trnsfr.gz
to
sasmm_fsbc_durds_id0007111_t20110301010023.dat.trnsfr.gz

SECOND FILE file name modification

 
from 
sasmm_fsbc_durds_id00020080_t20110301020123.dat
to
sasmm_fsbc_durds_id0007112_t20110301020123.dat

Here I am assigning 7111 to the very first file and then incrementing by one for each following file.

That is, I am looping through each file and replacing the characters between id000 and _t with 7111, incrementing by 1 sequentially.

 
sasmm_fsbc_durds_id00020318_t20110311022510.dat
sasmm_fsbc_durds_id00020319_t20110311032555.dat
sasmm_fsbc_durds_id00020320_t20110311042632.dat
sasmm_fsbc_durds_id00020321_t20110311052657.dat
sasmm_fsbc_durds_id00020322_t20110311062730.dat
 
will be modified into 
 
sasmm_fsbc_durds_id0007111_t20110311022510.dat
sasmm_fsbc_durds_id0007112_t20110311032555.dat
sasmm_fsbc_durds_id0007113_t20110311042632.dat
sasmm_fsbc_durds_id0007114_t20110311052657.dat
sasmm_fsbc_durds_id0007115_t20110311062730.dat

it goes on till it reaches the year 2012, month january[01] and date 26

 
from 2011 -> 03 -> 01
to   2012 -> 03 - >26

thanks in advance

Before we start on a new request, please tell us if any of our suggestions did what you wanted before this last set of changes so we have some idea as to whether or not we have finally correctly interpreted what you're asking us to do.

Then please clarify your requirements:

  1. Do you want only the files with names matching sasmm_fsbc_durds_id000[0-9]*_t?*.dat.trnsfr.gz to be renamed, or do you want every file matching id000[0-9]*_t to be renamed?
  2. Do you want the file names to restart at 7111 in each directory processed, or do you want files in all of the directories processed to be treated as a single list numbered starting at 7111 and incrementing for each file processed?
  3. Do you have a backup of this directory hierarchy in case something goes horribly wrong during the renaming process?
  4. Are you absolutely sure that no file existing before this renaming process begins has a name that will match any new file name that will be created by this renaming process?
  5. When you say you want these changes applied to files for dates:

    from 2011 -> 03 -> 01
    to   2012 -> 03 - >26

    is that range inclusive or exclusive? (I'm assuming you want files in 2011/03/01 renamed, but it isn't at all clear whether you want files in 2012/03/26 renamed.)

---------- Post updated at 08:24 AM ---------- Previous update was at 08:20 AM ----------

sasmm_fsbc_durds_id00020079_t20110301010023.dat.trnsfr.gz

You have written id00020079 as id000[0-9]*; does this [0-9] cover all numbers starting from 7111?

The following script creates a file containing the mv commands needed to rename the files as you requested, then runs those commands and removes that file. Before running this script, I strongly suggest commenting out the last two lines, running the modified script and verifying that the command file created performs the file moves that you want to perform. This script is written using ksh, but it should also work with at least bash and sh.

#!/bin/ksh
newID=7111
find 2011/0[3-9] 2011/1[0-2] 2012/0[1-2] 2012/03/[01][0-9] 2012/03/2[0-5] \
    -name 'sasmm_fsbc_durds_id000[0-9]*_t?*.dat.trnsfr.gz' | while read path
do
        oldID=${path##*id000}
        oldID=${oldID%_t*}
        newpath=${path%${oldID}_t*}$newID${path##*id000$oldID}
        newID=$((newID + 1))
        printf "mv \"%s\" \"%s\"\n" "$path" "$newpath"
done > mv_commands.$$
. mv_commands.$$
rm mv_commands.$$

---------- Post updated at 10:01 AM ---------- Previous update was at 09:44 AM ----------

I forgot to mention this in my last posting. Instead of the command:

printf "mv \"%s\" \"%s\"\n" "$path" "$newpath"

in the script in my last posting, I could have just used:

mv "$path" "$newpath"

but if there are enough files in one of the directories being processed it would be possible to end up unintentionally renaming one or more of the renamed files (possibly even creating an infinite loop of mv commands). This isn't likely since we're renaming files rather than creating additional files, but the standards don't guarantee that a file will be found at all nor that a file will only be found once if a directory is being changed while the find utility is processing that directory. Using the two step process given in my script avoids this possible complication.
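
A variant of the same two-step idea is sketched below. Since find does not promise any particular output order, this version pipes the list through sort so that the new numbers are handed out in path order. It only prints the mv commands; the function name plan_renames and the output file name in the note after it are assumptions for illustration, not part of the script above:

```shell
#!/bin/sh
# plan_renames START_ID DIR...: print one mv command per matching file,
# numbering the files in sorted path order starting at START_ID.
# Nothing is renamed here; the output is meant to be reviewed first.
plan_renames() {
    newID=$1
    shift
    find "$@" -name 'sasmm_fsbc_durds_id000[0-9]*_t?*.dat.trnsfr.gz' | sort |
    while IFS= read -r path
    do
        dir=${path%/*}                          # directory part of the path
        file=${path##*/}                        # file name only
        tail=${file#sasmm_fsbc_durds_id000}     # e.g. 20079_t2011...gz
        tail=_t${tail#*_t}                      # keep everything from _t on
        printf 'mv "%s" "%s/sasmm_fsbc_durds_id000%s%s"\n' \
            "$path" "$dir" "$newID" "$tail"
        newID=$((newID + 1))
    done
}
```

For example, plan_renames 7111 2011/0[3-9] 2011/1[0-2] > mv_commands writes the commands to a file that can be inspected and then run with sh mv_commands.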

In my first question (file_id extraction from file_name), what if I need to extract file_ids for a date range, such as January 26 2012 to today? How do I specify it?

---------- Post updated at 04:44 PM ---------- Previous update was at 04:40 PM ----------

 
#!/bin/ksh
printf "file_id,file_name\n"
find 2[0-9][0-9][0-9] -name 'sasmm_fsbc_durds_id000[0-9]*_t?*.dat.trnsfr.gz' | while read path
do
        file=$(basename "$path" .trnsfr.gz)
        id=${file#sasmm_fsbc_durds_id000}
        id=${id%%_t*}
        printf "%s,%s\n" "$id" "$file"
done

I tested this solution; it's working absolutely fine for file_id extraction. Thanks a lot Don! In case I want to extract the file_id and file_name combination into a CSV file for a given date range, for example Jan 26 2012 to today, where do I need to make the change and what would the change be?

Thanks..!

The current script selects all directories for years 2000 through 2999 (this comes from the 2[0-9][0-9][0-9] to select those directories in the find command). So for January 26-31, 2012 you need 2012/01/2[6-9] and 2012/01/3[01] and for February 1, 2012 through today you can use 2012/0[2-9]. So replacing:

find 2[0-9][0-9][0-9] -name 'sasmm_fsbc_durds_id000[0-9]*_t?*.dat.trnsfr.gz' | while read path

in my script with:

find 2012/01/2[6-9] 2012/01/3[01] 2012/0[2-9] -name 'sasmm_fsbc_durds_id000[0-9]*_t?*.dat.trnsfr.gz' | while read path

will give you that restricted range.
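
Hand-written directory patterns get awkward for arbitrary ranges. One alternative, shown here only as a sketch, is to turn the directory part of each path into a YYYYMMDD number and compare it against the range limits. It assumes the paths printed by find look like YYYY/MM/DD/filename; the function name in_range is invented for this example:

```shell
#!/bin/sh
# in_range PATH START END: succeed if PATH (YYYY/MM/DD/file) falls within
# START..END inclusive, both limits given as YYYYMMDD.
in_range() {
    day=${1%/*}                              # e.g. 2012/01/26
    day=$(printf '%s' "$day" | tr -d '/')    # e.g. 20120126
    [ "$day" -ge "$2" ] && [ "$day" -le "$3" ]
}
```

Inside the while read path loop, a line such as in_range "$path" 20120126 20120920 || continue would then skip every file outside the range, and the find command itself could stay unchanged.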

In a few folders, the files are saved with names that end with .dat, as shown below. In this case I do not need to cut off .trnsfr.gz, and I want to extract the file_ids from these files along with the other files whose names end with .dat.trnsfr.gz

Files may end with any of the following extensions; I want to extract file_id and file_name as in your previous post.

  1. .dat.trnsfr.gz
  2. .dat
  3. .aud.trnsfr
  4. .aud.trnsfr.gz

I do not want to consider files that end with .aud.trnsfr.gz and .aud.trnsfr, but I want to extract file_id and file_name from the files whose names end with .dat.trnsfr.gz and .dat

For files that end with .dat, I do not want to cut off the tail end, since it already is the file name that I require. But for files that end with .dat.trnsfr.gz, I need to cut off .trnsfr.gz

The format will be as given in the above post.
The 2012 -> 04 -> 08 folder contains files like this

 
sasmm_fsbc_durds_id00016763_t20120408230850.aud.trnsfr.gz
sasmm_fsbc_durds_id00016763_t20120408230850.dat.trnsfr.gz

but the 2012 -> 04 -> 01 folder contains files like this

 
sasmm_fsbc_durds_id00016596_t20120401231148.aud.trnsfr
sasmm_fsbc_durds_id00016596_t20120401231148.dat

sasmm_fsbc_durds_id00016573_t20120401000754.dat

Can you please help me out? Can you also set the end date to 20 Sept 2012 and restrict the data to the range Jan 26, 2012 to Sept 20, 2012?

Thanks...

So, if you change the same line I told you to change last time from:

find 2012/01/2[6-9] 2012/01/3[01] 2012/0[2-9] -name 'sasmm_fsbc_durds_id000[0-9]*_t?*.dat.trnsfr.gz' | while read path

to:

find 2012/01/2[6-9] 2012/01/3[01] 2012/0[2-8] 2012/09/[01][0-9] 2012/09/20 -name 'sasmm_fsbc_durds_id000[0-9]*_t?*.dat*' | while read path

it should do what you want. I would have expected that you'd be able to make a simple change like this yourself given the suggestions you've seen in past posts on this thread.
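
One caveat: the pattern .dat* accepts any name that continues after .dat, not only the two wanted endings. If that matters, the check can be made explicit inside the loop. The following is a sketch under that assumption; csv_line is a made-up helper name, and it would be called with each path read from find:

```shell
#!/bin/sh
# csv_line PATH: print "file_id,file_name" for names ending in .dat or
# .dat.trnsfr.gz; print nothing and return 1 for anything else.
csv_line() {
    file=${1##*/}                               # strip the directory part
    case $file in
        *.dat.trnsfr.gz) file=${file%.trnsfr.gz} ;;   # cut off the tail
        *.dat) ;;                                     # keep the name as it is
        *) return 1 ;;                                # e.g. .aud.trnsfr files
    esac
    id=${file#sasmm_fsbc_durds_id000}
    id=${id%%_t*}
    printf '%s,%s\n' "$id" "$file"
}
```

In the script, the body of the loop could then be reduced to csv_line "$path", with the printf of the header line left as it is.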