File Move & Sort by Name - Kick out Bad File Names & More

I have a dilemma: users are copying image files into "directory 1." The file names include the year each image was taken. I need to put together a script to do the following:

  1. Examine each file name to ensure it follows the proper format (e.g. test-1983_filename-123.tif): no incorrect characters, no parts of the name out of order, etc. (This is in case a user adds a file/image with the incorrect naming convention.)
  2. Reject any images that are incorrectly named.
  3. Determine what year is in the filename (e.g. 1983)
  4. Search a different filesystem/location to see if that year's directory exists (e.g. 1983 - year directory):
    a. If so, copy the file to that location.
    b. If not, create the directory (named for the year, e.g. 1983), then copy the file to that location.

"Directory 1" will potentially have 1000's of files to process and sort.

I currently have a bash script that determines whether files exist within "directory 1," places them into a directory by today's date, then moves them to a different filesystem/location. I need the script to vet each filename, and move it to the respective directory named by the year. I'm most uncertain about how to vet the filenames. Any assistance is greatly appreciated!

More specifically, to allow us to help you to create regular expression syntax and a script:

Absolutely! Thanks for your response.

  1. a. Valid years are the 1900s, the 2000s, and going forward.
     b. All files will have a .tif file extension; anything else is not valid.
  2. Creation date doesn't matter. The year within the file name (e.g. name-2001_filename-123.tif) is typed in by a human. This is where the script must check and confirm that the naming convention is followed.

    Essentially, the name will contain a year and some sort of naming convention indicative of the image. Here's the format:

    locationabbreviation-year-orgnumber_imagenumber.tif

    examples:
    bldg1-1996-org101_F-12345.tif
    bldgzone-2001-org234_ZTW-12345.tif
  3. Don't delete files with bad names; move them to a "bad file" directory or similar. If a file doesn't follow this format, it's kicked out to the "bad file" directory.

How about this as a starting point (needs a recent shell, e.g. bash):

for FN in *.tif; do [[ "$FN" =~ ^[[:alnum:]]+-[[:digit:]]{4}-[[:alnum:]]+_[[:upper:]]+-[[:digit:]]+\.tif$ ]] && echo "OK $FN" || echo "NOK $FN"; done

Please note that the test file name in your first post would fail this check.
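For the copy-by-year part, here is a minimal sketch of how the steps from the first post could be wired together in bash. The function name sort_tifs, its three directory arguments, and the restriction to years 1900-2099 are assumptions made for illustration; adjust the pattern once the field rules are settled.

```shell
#!/usr/bin/env bash
# Sketch only: sort_tifs, its arguments and the 1900-2099 range are assumptions.

# Pattern for locationabbreviation-year-orgnumber_imagenumber.tif.
# It is anchored with ^ and $, and the dot before "tif" is escaped.
# The outer parentheses around the year make it available as BASH_REMATCH[1].
name_re='^[[:alnum:]]+-((19|20)[0-9]{2})-[[:alnum:]]+_[[:upper:]]+-[[:digit:]]+\.tif$'

# sort_tifs SRC DEST BAD
#   SRC  - directory the users copy files into
#   DEST - parent of the per-year directories
#   BAD  - directory that receives incorrectly named files
sort_tifs() {
    local src=$1 dest=$2 bad=$3 path fn year
    mkdir -p "$bad"
    for path in "$src"/*; do
        [[ -f $path ]] || continue        # skip subdirectories and an unmatched glob
        fn=${path##*/}                    # strip the directory part
        if [[ $fn =~ $name_re ]]; then
            year=${BASH_REMATCH[1]}
            mkdir -p "$dest/$year"        # does nothing if the directory already exists
            cp -p -- "$path" "$dest/$year/"
        else
            mv -- "$path" "$bad/"         # kicked out, not deleted
        fi
    done
}
```

It would be called as something like sort_tifs /path/to/directory1 /path/to/archive /path/to/badfiles (all three paths are placeholders). Looping over a glob instead of parsing ls output keeps names with spaces intact, which matters here because such names are exactly the ones that need to reach the "bad file" directory.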

Can you tell us what you mean by "Rejects any images that are incorrectly named", perhaps by providing examples of bad filenames?

Absolutely! I think RudiC may be on to something.

A bad filename could include anything that doesn't follow the naming schema.

Again, naming schema:

locationabbreviation-year-orgnumber_imagenumber.tif

Error examples:

  1. Any of these fields out of order.
  2. Spaces within the name.
  3. Special characters (examples: / * ? & % #).
  4. An invalid year, e.g. 1890. Anything earlier than 1900 is not valid.
  5. One of these 4 categories could also be missing, e.g. the name includes the location abbreviation, year and image number, but leaves out the org number.
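To see how a pattern behaves against each kind of error listed above, one option is a small test loop like the following sketch. The function name is_valid_name and the sample bad names are made up for illustration, and the pattern assumes the image number is capital letters, a hyphen, then digits, as in the two examples given.

```shell
#!/usr/bin/env bash
# is_valid_name FILENAME -> exit status 0 if the name follows the schema.
# Hypothetical helper; the field rules are assumptions taken from the examples.
is_valid_name() {
    local re='^[[:alnum:]]+-([0-9]{4})-[[:alnum:]]+_[[:upper:]]+-[0-9]+\.tif$'
    [[ $1 =~ $re ]] || return 1
    # The year is known to be exactly four digits here; 10# forces base 10.
    (( 10#${BASH_REMATCH[1]} >= 1900 ))
}

for fn in 'bldg1-1996-org101_F-12345.tif' \
          '1996-bldg1-org101_F-12345.tif' \
          'bldg 1-1996-org101_F-12345.tif' \
          'bldg#1-1996-org101_F-12345.tif' \
          'bldg1-1890-org101_F-12345.tif' \
          'bldg1-1996_F-12345.tif'; do
    if is_valid_name "$fn"; then echo "OK  $fn"; else echo "NOK $fn"; fi
done
```

Only the first name should be reported as OK; the others cover a field out of order, a space, a special character, a year before 1900, and a missing org number.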

Does this provide any clarity on the question?

To prevent misunderstandings - YOU should be on to something. With the starters given, you could test, adapt and fine-tune them until they suit your needs, or come back here if you can't get to a satisfying solution. Also, the final solution could be posted so people can see and learn from it for future questions.

No doubt, I will be doing so. My response didn't imply anything other than just that; it was in response to shamrock.

To clarify, my current script is written in bash. As of yesterday, this script has been put on a slight hold, to address some high priority IT Security concerns. So I will not be editing my script and testing the suggestion this week.

To determine whether or not any field is valid and to determine whether or not fields are out of order, you first need to have a clear definition of your filename format and a clear definition of what contents are valid in each field. You have shown us three examples of (presumably valid) filenames:

test-1983_filename-123.tif
bldg1-1996-org101_F-12345.tif
bldgzone-2001-org234_ZTW-12345.tif

and a valid filename format:

locationabbreviation-year-orgnumber_imagenumber.tif

(which has hyphens and underscores that do not match any of the sample filenames). Are hyphens and underscores interchangeable? In addition to being field separators, are hyphens and underscores valid characters in some fields???

And then you tell us that one of the first four of the five fields in your valid filename format can be omitted (.tif always being required). But you don't tell us what happens to the field separators if a field is missing.

You tell us that years 1900 and going forward are valid. Is 10000 a valid year? Is 12345678901234567890 a valid year? Did you really mean that only years 1900-9999 are valid? Or, is there some other specific range of valid years?
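If the answer turns out to be "exactly four digits, 1900 through 9999" (an assumption until the requirements say so), the year can be checked as text with an anchored pattern. The sketch below uses a made-up helper name, is_valid_year:

```shell
#!/usr/bin/env bash
# Assumption: "1900 and going forward" means the four-digit years 1900-9999.
year_re='^(19[0-9]{2}|[2-9][0-9]{3})$'

# is_valid_year STRING -> exit status 0 if STRING is a year in 1900-9999.
is_valid_year() {
    [[ $1 =~ $year_re ]]
}

for y in 1890 1900 1983 2001 9999 10000 12345678901234567890; do
    if is_valid_year "$y"; then echo "$y valid"; else echo "$y invalid"; fi
done
```

With this pattern 1890, 10000 and 12345678901234567890 are rejected and the other four values are accepted. Checking the year as a string also avoids handing arbitrarily long digit strings to shell arithmetic, which works on fixed-width integers.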

You say it is an error if the orgnumber and the imagenumber are out of order, but you give us absolutely no indication of how to tell whether or not any number is a valid orgnumber nor how to tell whether or not any number is a valid imagenumber???

Please give us a clear definition of the requirements for each field including minimum and maximum numbers of characters in the field; a list of valid characters in each field; a clear statement of what the field separators are; and a clear specification of the exact format of a filename without a locationabbreviation, without a year, without an orgnumber and without an imagenumber field.
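Once such a definition exists, it can be written down one field at a time and assembled into a single pattern. In the sketch below every length limit, the "org" prefix rule, and the helper name split_name are placeholder assumptions, not requirements stated in this thread:

```shell
#!/usr/bin/env bash
# Every limit below is a placeholder assumption, to be replaced by the real rules.
loc_re='[[:alnum:]]{1,20}'             # locationabbreviation: 1-20 letters or digits
year_re='19[0-9]{2}|[2-9][0-9]{3}'     # year: 1900-9999
org_re='org[0-9]{1,6}'                 # orgnumber: "org" followed by 1-6 digits
img_re='[[:upper:]]{1,5}-[0-9]{1,8}'   # imagenumber: 1-5 capitals, hyphen, 1-8 digits

# Separators are fixed: hyphen, hyphen, underscore, then the literal ".tif".
name_re="^($loc_re)-($year_re)-($org_re)_($img_re)\.tif\$"

# split_name FILENAME -> prints the four fields, or returns 1 if the name is malformed.
split_name() {
    [[ $1 =~ $name_re ]] || return 1
    printf 'location=%s year=%s org=%s image=%s\n' \
        "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" \
        "${BASH_REMATCH[3]}" "${BASH_REMATCH[4]}"
}

split_name 'bldgzone-2001-org234_ZTW-12345.tif'
# prints: location=bldgzone year=2001 org=org234 image=ZTW-12345
```

Keeping one variable per field means a change to, say, the allowed image number prefix touches a single line, and the capture groups give the script the year without a second parsing step.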