Help with parsing mailbox folder list (identify similar folders)

List sample:

user/xxx/Archives/2010 
user/xxx/BLARG 
user/xxx/BlArG 
user/xxx/Burton 
user/xxx/DAY 
user/yyy/Trainees/Nutrition interns 
user/yyy/Trainees/Primary Care 
user/yyy/Trainees/Psychiatric NP interns 
user/yyy/Trainees/Psychiatric residents 
user/yyy/Trainees/Psychology externs 
user/yyy/Trainees/psychology eXterns
user/zzz/Goose/moose
user/zzz/Goose/mouse
user/zzz/Goose/Moose
user/zzz/Goose/Moose/goose
user/zzz/Goose/Moose/Goose
user/aaa/Boo
user/aaa/Bah/boo
user/aaa/boo/boo
user/aaa/boo/boO
user/aaa/bOo/boo
user/bbb/Zoo
user/bbb/Zoo/boo
user/bbb/ooo/boo
user/bbb/ooo/bOo

I'm helping to migrate a mail server from a case-sensitive folder namespace to a case-insensitive one.
The case-sensitive namespace could hold folders like "test", "TEST" and "Test" as different folders;
the new system will allow only one of these (on the same level, per user).

The current system will _not_ allow two folders on the same level to have an identical name. Where the output below appears to show the opposite (e.g. Trainees),
it means Trainees is a parent folder:

user/yyy/Trainees/Nutrition interns 
user/yyy/Trainees/Primary Care 
user/yyy/Trainees/Psychiatric NP interns 
user/yyy/Trainees/Psychiatric residents 
user/yyy/Trainees/Psychology externs 
user/yyy/Trainees/psychology eXterns

Given this behavior, I don't care about Trainees being the same: the fact that it is identical, and on the same level, indicates it's a parent folder.

But I do care about "Psychology externs" and "psychology eXterns", given their shared parent.

In this example:

user/zzz/Goose/moose
user/zzz/Goose/mouse
user/zzz/Goose/Moose
user/zzz/Goose/Moose/goose
user/zzz/Goose/Moose/Goose

I don't care about the first "/Goose/" - it's a parent with children,
but "moose" I do care about because it's in the same container ("/Goose/") as "/Moose/" - so the new system will not allow this.

Similarly, I care about "Goose/Moose/goose" and "Goose/Moose/Goose" because "goose" and "Goose" are in the same "Goose/Moose/" container,
and again this is unacceptable to the new system.

For each user, I'd like to identify the folder path levels that are identical except in case - e.g.

user/xxx/BLARG* 
user/xxx/BlArG*

user/yyy/Trainees/Psychology externs* 
user/yyy/Trainees/psychology eXterns*

user/zzz/Goose/moose*
user/zzz/Goose/Moose*
user/zzz/Goose/Moose/goose*
user/zzz/Goose/Moose/Goose*

user/aaa/Boo*
user/aaa/boo/boo*
user/aaa/boo/boO*
user/aaa/bOo*/boo

user/bbb/ooo/boo*
user/bbb/ooo/bOo*

Again, I don't care about folder paths that are completely identical (case included) as this will indicate it's a parent folder.

Any ideas or working pseudo code?

Thanks for any info. And I hope this was clear and I didn't miss any edge cases.

Bill :eek:

It's not clear what you are asking for.
Are you asking for a mapping strategy?

I would encourage users to rename everything that differs only by case themselves, and adopt a straightforward rule for those that ignore you. Maybe something like this:

  • Keep everything the same until there is a conflict
  • Resolve the first conflicting name by adding a trailing underscore
  • Add a trailing digit after the underscore if there are multiple conflicts
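
A minimal sketch of that rule, assuming the sibling folder names of one parent arrive one per line on standard input. The function name `suggest_names` and the tab-separated "old, new" output are hypothetical choices for illustration, and starting the digits at 1 for the second conflict is one reading of the rule. It does not check whether a suggested name (say `TEST_`) already exists, so the result would still need a second pass.

```shell
# Hypothetical helper: read sibling folder names (one per line) on stdin and
# print "old<TAB>new", where new follows the underscore/digit rule above.
suggest_names() {
    awk '
        {
            key  = tolower($0)      # names are compared with case ignored
            hits = seen[key]++      # how many earlier names share this key
            if (hits == 0)      out = $0                   # no conflict: keep
            else if (hits == 1) out = $0 "_"               # first conflict
            else                out = $0 "_" (hits - 1)    # later conflicts
            print $0 "\t" out
        }
    '
}
```

For example, `printf '%s\n' test TEST Test | suggest_names` would keep `test`, and suggest `TEST_` and `Test_1` for the other two.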

One way of finding all the problem directories:

We first create a list of all relevant directories.
Then extract every name that is duplicated once case is ignored, and re-search the original list for all spellings of each.
This is a reasonably efficient approach for a large number of directories and a moderate number of case-only duplicates.

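# List every directory under the parent, one per line
find /parent_directory/ -follow -type d -print | sort >/tmp/myworkfile1
# Lowercase the list; uniq -d keeps only names that collide once case is ignored
cat /tmp/myworkfile1 | tr '[:upper:]' '[:lower:]' | sort | uniq -d >/tmp/myworkfile2
# Print every original spelling of each colliding name
cat /tmp/myworkfile2 | while IFS= read -r dir
do
        # -F: match as a fixed string, so "." or "[" in a name is not a regex
        grep -iFx -e "${dir}" /tmp/myworkfile1
done
rm -f /tmp/myworkfile1
rm -f /tmp/myworkfile2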
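
If the folders exist only as a text listing like the sample in the question, rather than as real directories, something along these lines might be closer to the "same parent" rule described there. It is a sketch under assumptions, not a tested migration tool: the function name `list_case_collisions` is made up, the list is assumed to hold one path per line with "/" as the separator, trailing blanks are assumed to be noise and are stripped, and a POSIX awk is assumed (on Solaris that may mean `nawk` or the XPG4 awk rather than the legacy one). Every path prefix is treated as a folder, and a prefix is reported when another prefix has exactly the same parent and a last component that differs only in case.

```shell
# Hypothetical helper: $1 is a file with one folder path per line.
# Prints each folder (path prefix) that has a case-only twin under the
# same, exactly spelled, parent.
list_case_collisions() {
    awk -F/ '
        {
            sub(/[ \t]+$/, "")              # drop trailing blanks (assumption)
            prefix = ""
            for (i = 1; i <= NF; i++) {
                parent = prefix
                prefix = (i == 1 ? $i : prefix "/" $i)
                if (prefix in seen) continue    # folder already recorded
                seen[prefix] = 1
                key = parent "/" tolower($i)    # exact parent, lowercased leaf
                n[key]++
                order[++total] = prefix         # remember first-seen order
                keyof[prefix] = key
            }
        }
        END {
            for (j = 1; j <= total; j++)
                if (n[keyof[order[j]]] > 1) print order[j]
        }
    ' "$1"
}
```

Run against the five `user/zzz` lines from the sample, this should report `Goose/moose`, `Goose/Moose`, `Goose/Moose/goose` and `Goose/Moose/Goose`, and leave `Goose/mouse` and the shared `Goose` parent alone.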

Footnote:
It always helps to know what Operating System and version you have and what Shell you prefer.
The code posted should work with most versions of unix or Linux with Bourne-like Shell (sh, bash, ksh etc.).

For the benefit of the "UUOC" police, I prefer left-to-right processing and have yet to find anything faster than "cat" for placing text records on a pipeline.


I use Bash on Solaris 10, OS X 10.6, or Red Hat Enterprise Linux 5 - if necessary.

tr '[:upper:]' '[:lower:]' | sort | uniq

was what I needed - from there it was pretty clear which were the duplicates.

Your UUOC is fine by me - definitely not an egregious case :slight_smile:

Thanks again!

Bill

I think the bit about identical parent folder paths was unnecessary and confusing - apologies. The paths still need to be unique, and that is determined by their entire length, whatever leading components they share.

Glad the code works.
I had a comparable problem some years ago when consolidating multiple smaller servers into one large server where many users had accounts on more than one of the original computers ... and were not consistent in the upper/lower case naming of their directories.