Cut -d Question

I went through quite a few threads and didn't find anything on this. I also looked on other sites and couldn't turn up an answer.

For completeness' sake, I'm working on Solaris 10 in the Korn shell environment.

I wrote a script for a buddy to help him out with the following issue.

He has a directory of files. Here is an example of one of the filenames:

verylongstringofmixedcharacters==-=-23480732.pdf

He wanted to write a script to remove everything from the "==-=-" and the numbers after it so the file would look like the following:

verylongstringofmixedcharacters.pdf

Using "cut -d= -f1" and "cut -d. -f2", I was able to pull out the "verylongstringofmixedcharacters" and the "pdf" parts. I then built the new filename with the following line:
fileparts=`echo $filepart1'.'$filepart2`

That's not the whole script, but it's the piece that builds the new filename he wanted. When I finished it and sent it off, he gave me the bad news that the verylongstringofmixedcharacters part can sometimes contain an = sign. So it might be "verylong=stringof=mixedcharacters", which breaks my first delimiter of =. My question to you all is the following:

Is there a way to have a multicharacter delimiter with cut? Meaning, could it be "cut -d==-=- -f1"? I've tried it the following ways and received an invalid delimiter message each time:

cut -d==-=- -f1
cut -d"==-=-" -f1
cut -d'==-=-' -f1
cut -d "==-=-" -f1
cut -d '==-=-' -f1

I'm thinking I'll have to use a sed command of some sort to fix this. I'll be looking into sed one-liners that might help after I post this, but I figured I'd stimulate your brains with it. Thanks in advance for your help.
~Ryan

You could set the Field Separator pattern in awk...

$ echo "verylongstringofmixedcharacters==-=-23480732.pdf"|awk 'BEGIN{FS="==-=-"}{print $1}'
verylongstringofmixedcharacters

You can even use multiple patterns...

$ echo "verylongstringofmixedcharacters==-=-23480732.pdf"|awk 'BEGIN{FS="(==-=-|\\.)";OFS="."}{print $1,$3}'
verylongstringofmixedcharacters.pdf
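
One caveat with the two-pattern separator: a "." in the first part of the filename shifts the field numbers, so $3 is no longer the extension. Here is a minimal sketch of one way around that, assuming an awk that treats a multi-character FS as a pattern (as the examples above do). The function name strip_marker and the sample filename are made up for illustration.

```shell
# strip_marker is a hypothetical helper, not part of the original script.
strip_marker() {
    # Split the line on the marker only, then take the text after the
    # last dot of the second field as the extension.
    echo "$1" | awk -F'==-=-' '{ n = split($2, parts, "[.]"); print $1 "." parts[n] }'
}

strip_marker 'verylong=string.of=mixedcharacters==-=-23480732.pdf'
```

With that input this prints verylong=string.of=mixedcharacters.pdf, since neither the = signs nor the dot before the marker take part in the split.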

Or use sed...

$ echo "verylongstringofmixedcharacters==-=-23480732.pdf"|sed 's/\(.*\)==-=-.*\(\.pdf\)/\1\2/'
verylongstringofmixedcharacters.pdf
for file in *==-=-* ; do
    mv "$file" "${file%%==-=-*}.${file##*.}"
done
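
The loop above relies on two shell parameter expansions rather than on cut or sed. Here is a small sketch of what each one yields on its own, using a made-up filename with extra = signs in it. The variable names base and ext are only for illustration.

```shell
file='verylong=stringof=mixedcharacters==-=-23480732.pdf'

base=${file%%==-=-*}   # remove the longest suffix matching the glob "==-=-*"
ext=${file##*.}        # remove the longest prefix matching the glob "*."

echo "${base}.${ext}"
```

This prints verylong=stringof=mixedcharacters.pdf. Note that ${file##*.} keeps only the text after the last dot, so a double extension such as .xls.txt would come out as txt.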

Ygor, I'm very new to sed and have only started to realize the true potential of it. I hate to be a bother, but if you could explain how your command is interpreted, it would be greatly helpful:

sed 's/\(.*\)==-=-.*\(\.pdf\)/\1\2/'

Since posting the original post I was using the following string:

sed 's/==-=-*//g'

For some reason it doesn't recognize the * as the wildcard character, even if I put a \ before it.

Reborg, I'd like to note that not all of the files have a .pdf extension; there might be .xls or .txt or other varying types of files. I used pdf just as an example, sorry if I misled folks there.

Thank you in advance for your help!

see edit above.

For sed, you are dealing with an RE (regular expression), not a GLOB pattern.

* in RE means 0 or more of the preceding character, not any number of any character as in a glob pattern. The single-character wildcard in RE is ".", therefore ".*" corresponds (more or less) to * in a glob pattern.

Search:

\(     start of first bracketed pattern
.*     any number of characters
\)     end of first bracketed pattern
==-=-  literal
.*     any number of characters
\(     start of second bracketed pattern
\.pdf  literal ("." is escaped to prevent its special meaning)
\)     end of second bracketed pattern

Replacement:

\1     first bracketed pattern
\2     second bracketed pattern
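
To see the difference between the glob-style * and the RE-style .* in practice, here is a short sketch that runs the three sed commands discussed so far against the sample filename from the first post. The variable names are made up for illustration.

```shell
name='verylongstringofmixedcharacters==-=-23480732.pdf'

# "-*" means "zero or more hyphens", so only the marker itself is removed.
marker_only=$(echo "$name" | sed 's/==-=-*//g')

# ".*" means "any run of characters", so everything after the marker goes too.
to_end=$(echo "$name" | sed 's/==-=-.*//')

# The two bracketed patterns keep the name and the extension.
renamed=$(echo "$name" | sed 's/\(.*\)==-=-.*\(\.pdf\)/\1\2/')

echo "$marker_only"
echo "$to_end"
echo "$renamed"
```

The three lines printed are verylongstringofmixedcharacters23480732.pdf, verylongstringofmixedcharacters and verylongstringofmixedcharacters.pdf.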

Thank you, I see now how it all fits together. Seeing the actual meaning behind what each piece is doing helps out quite a bit! As stated in the replies above, .pdf isn't necessarily the only file extension this would have to look for. I made a slight change to it and it seems to work:

sed 's/\(.*\)==-=-.*\(\...\)/\1\2/'

I took pdf and replaced it with ..., the only question is, what if there is a greater number than 3 for the file extension? I don't think ... will work then, since it's going for literally 3 characters for that second part. What would be the correct expression to use to say "Any amount of character" and not just 3? Thanks again, both of ye for your input and help!

I think I got it:
sed 's/\(.*\)==-=-.*\(\.\)/\1\2/'

In the second bracket, I changed what it should look for into literally everything. That seemed to do the trick. *cartwheels*

Ygor, I'm directing this question to you, but anyone else who can explain it clearly is welcome to answer (you don't have to dumb it down terribly, just enough to make sense). I'm getting caught up in understanding how the end of this command is interpreted:

sed 's/\(.*\)==-=-.*\(\.\)/\1\2/'

I guess it'd be best if I tell you what I'm seeing and what I understand:

sed - I know what the command does with its associated flags; it's the stream editor.
's - I know this is the substitution flag.
\( - I understand this is used to make the character after it be the "literal" character you see. Thus ( means ( as you see it.
.* - I see Ygor that you stated that .* means any number of characters, but I get slightly confused here. From my understanding, the * character is a wildcard character and the . character is only 1 wildcard character. Does .* in "regular expression" sed terms translate into "Any number of characters?"
\) - This means ) as you see it, like the ( pattern above.
==-=- - I understand that pattern is the literal pattern.
.* - This appears again and does this mean the same thing as the first one? I'm getting confused that since the literal string is immediately before the ., the . will be interpreted literally instead of "any number of characters."
\( - Once again escaping out the character to literally mean (.
\. - I came up with this piece and it seems to work in grabbing the .pdf or .txt extensions, but to be honest I'm unsure why it's working. I thought the . character would be interpreted as 1 wildcard character. Instead it is escaped, if I'm interpreting correctly, and the literal . character is what the second pattern is looking for.
\)\1\2/' - I understand the escaped characters and how it fits the patterns together.

A slight explanation of those couple of spots would be greatly appreciated. I like getting the answers, don't get me wrong, but I like to take that one step further and understand the inner workings. It's how you truly learn a command...

Here are some examples of output to see how this is working out:
ls -l
total 6
-rw-r----- 1 root root 0 Mar 19 20:22 cconvey=acnastatusz+23423==-=-2340289723423089724.txt
-rw-r----- 1 root root 17 Mar 19 19:11 cconveyancestatusg5q0aCC1JK-aBRIok8L+jg==-=-43766338.pdf
-rw-r----- 1 root root 18 Mar 19 19:11 cconveyancestatuskYMXtXkxtren0pSQ-l7J+Q==-=-48489900.pdf
-rw-r----- 1 root root 19 Mar 19 19:12 cconveyancestatusz+45hkPLw9xe78iTNMrNwQ==-=-22077524.pdf

Above you see the list of example files in the directory.

ls | sed 's/\(.*\)==-=-.*\(\.pdf\)/\1\2/'
cconvey=acnastatusz+23423==-=-2340289723423089724.txt
cconveyancestatusg5q0aCC1JK-aBRIok8L+jg.pdf
cconveyancestatuskYMXtXkxtren0pSQ-l7J+Q.pdf
cconveyancestatusz+45hkPLw9xe78iTNMrNwQ.pdf

This is using Ygor's first example. It works on the .pdf files, but the .txt file is left untouched. I then went to work trying to find a way to match any characters at the end (pdf and txt are good, but there are some extensions longer than 3 characters out there).

ls | sed 's/\(.*\)==-=-.*\(\.*\)/\1\2/'
cconvey=acnastatusz+23423
cconveyancestatusg5q0aCC1JK-aBRIok8L+jg
cconveyancestatuskYMXtXkxtren0pSQ-l7J+Q
cconveyancestatusz+45hkPLw9xe78iTNMrNwQ

The third .* was put in because it would stand for Any Number of Characters. As you can see above, it only returns the first part, so I knew I had done something wrong.

ls | sed 's/\(.*\)==-=-.*\(\...\)/\1\2/'
cconvey=acnastatusz+23423.txt
cconveyancestatusg5q0aCC1JK-aBRIok8L+jg.pdf
cconveyancestatuskYMXtXkxtren0pSQ-l7J+Q.pdf
cconveyancestatusz+45hkPLw9xe78iTNMrNwQ.pdf

I used 3 ... to make it grab those 3 characters, whatever they may be. But it still didn't resolve the problem of what if there are extensions greater than 3 characters?

ls | sed 's/\(.*\)==-=-.*\(\....\)/\1\2/'
cconvey=acnastatusz+23423.txt
cconveyancestatusg5q0aCC1JK-aBRIok8L+jg.pdf
cconveyancestatuskYMXtXkxtren0pSQ-l7J+Q.pdf
cconveyancestatusz+45hkPLw9xe78iTNMrNwQ.pdf

I added a 4th ., and it worked, but it seemed like I took the easy way around it, sort of a cheesy way to counter the problem I was having. This led me to my final try at it:

ls | sed 's/\(.*\)==-=-.*\(\.\)/\1\2/'
cconvey=acnastatusz+23423.txt
cconveyancestatusg5q0aCC1JK-aBRIok8L+jg.pdf
cconveyancestatuskYMXtXkxtren0pSQ-l7J+Q.pdf
cconveyancestatusz+45hkPLw9xe78iTNMrNwQ.pdf

I accidentally deleted the 3 dots, hit enter, and received the correct end result. My questions about how it works are above, but these were the examples I tried in order to get it right. I hope you see where my logic was going when trying to get the answer. Thanks again for your patience and help with this.

~Ryan

If you have Python, here's an alternative, no regular expression needed:

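#!/usr/bin/env python3
import glob
import os

os.chdir("yourdir")
for fi in glob.glob("*==-=-*"):
    # Split on the first marker only, so "=" elsewhere in the name is harmless.
    name, _, rest = fi.partition("==-=-")
    # As before, take the piece that follows the first dot after the marker.
    newfilename = name + "." + rest.split(".")[1]
    print(newfilename)
    # os.rename(fi, newfilename)  # uncomment to rename the file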

output:

verylongstringofmixedcharacters.pdf

Regular expressions are very useful. This is from the sed manual...

.        Matches any character

*        Matches a sequence of zero or more repetitions of previous character, grouped regexp, or class.

\CHAR    Matches character CHAR; this is to be used to match special characters

So "." matches any character, "\." matches a dot, ".*" matches any number of characters and "\..*" matches a dot followed by any number of characters.
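
These rules also explain the results of the earlier experiments. sed replaces only the text that the pattern actually matched, and anything outside the match is left in place. Here is a hedged sketch that reruns three of the attempts against one of the sample filenames. The variable names are made up for illustration.

```shell
name='cconvey=acnastatusz+23423==-=-2340289723423089724.txt'

# The match ends at the last dot, so "txt" lies outside the match
# and is simply left where it was.
dot_only=$(echo "$name" | sed 's/\(.*\)==-=-.*\(\.\)/\1\2/')

# "\.*" can match zero dots, so the greedy ".*" before it swallows the
# rest of the line and the second bracket captures nothing.
zero_dots=$(echo "$name" | sed 's/\(.*\)==-=-.*\(\.*\)/\1\2/')

# "\..*" captures the last dot and everything after it.
dot_and_rest=$(echo "$name" | sed 's/\(.*\)==-=-.*\(\..*\)/\1\2/')

echo "$dot_only"
echo "$zero_dots"
echo "$dot_and_rest"
```

The first and third commands both print cconvey=acnastatusz+23423.txt, while the second prints only cconvey=acnastatusz+23423. So the "accidental" version works because the extension was never part of the match, not because the bracket matched everything.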

I agree, but too much of it is not healthy either, especially for maintainability and readability of code.

I love to be thorough and try to test all sorts of output to ensure code stability. I came across an issue and a fix but incorporating it together seems to not be working as planned.

The goal of above was to remove a piece of a file, rename the file, then move/copy it to a different directory. Here is the output after running the sed command previously obtained through all of your help:

sed 's/\(.*\)==-=-.*\(\..*\)/\1\2/'

Present Directory (ls -l)
-rw-r----- 1 root root 54 Mar 23 15:48 cconvey.acalkjdafj+323==-=-2342309808234.xls
-rw-r----- 1 root root 27 Mar 23 15:49 cconveyacalkjdafj+323==-=-2342309808234.xls.txt
-rw-r----- 1 root root 9 Mar 23 15:49 cconvey=acnastatusz+23423==-=-2340289723423089724.txt
-rw-r----- 1 root root 17 Mar 19 19:11 cconveyancestatusg5q0aCC1JK-aBRIok8L+jg==-=-43766338.pdf
-rw-r----- 1 root root 18 Mar 19 19:11 cconveyancestatuskYMXtXkxtren0pSQ-l7J+Q==-=-48489900.pdf
-rw-r----- 1 root root 19 Mar 19 19:12 cconveyancestatusz+45hkPLw9xe78iTNMrNwQ==-=-22077524.pdf

Target Directory After Copy
-rw-r----- 1 root root 54 Mar 23 16:56 cconvey.acalkjdafj+323.xls
-rw-r----- 1 root root 27 Mar 23 16:56 cconveyacalkjdafj+323.txt
-rw-r----- 1 root root 9 Mar 23 16:56 cconvey=acnastatusz+23423.txt
-rw-r----- 1 root root 17 Mar 23 16:56 cconveyancestatusg5q0aCC1JK-aBRIok8L+jg.pdf
-rw-r----- 1 root root 18 Mar 23 16:56 cconveyancestatuskYMXtXkxtren0pSQ-l7J+Q.pdf
-rw-r----- 1 root root 19 Mar 23 16:56 cconveyancestatusz+45hkPLw9xe78iTNMrNwQ.pdf

I wanted to test filenames containing the "." character. The first file listed has a "." before the "==-=-" string and isn't affected, which is good. However, I also tested a file with two extensions at the end (as sometimes found on Unix servers where a file is tar'ed and gzipped). I made a file with an imaginary extension of .xls.txt, and only the .txt portion remains (see cconveyacalkjdafj+323.txt, the second file in the target directory listing).

So I went to work trying to fix it so that it looks not only for a single .### extension at the end but also for a double extension of the form .*.*, where * stands for any run of extension characters.

I tweaked the end of the sed command to look like the following:
sed 's/\(.*\)==-=-.*\(\..*\..*\)/\1\2/'

It works but only for its specific case:

Current Directory
-rw-r----- 1 root root 54 Mar 23 15:48 cconvey.acalkjdafj+323==-=-2342309808234.xls
-rw-r----- 1 root root 27 Mar 23 15:49 cconveyacalkjdafj+323==-=-2342309808234.xls.txt
-rw-r----- 1 root root 9 Mar 23 15:49 cconvey=acnastatusz+23423==-=-2340289723423089724.txt
-rw-r----- 1 root root 17 Mar 19 19:11 cconveyancestatusg5q0aCC1JK-aBRIok8L+jg==-=-43766338.pdf
-rw-r----- 1 root root 18 Mar 19 19:11 cconveyancestatuskYMXtXkxtren0pSQ-l7J+Q==-=-48489900.pdf
-rw-r----- 1 root root 19 Mar 19 19:12 cconveyancestatusz+45hkPLw9xe78iTNMrNwQ==-=-22077524.pdf

Target Directory After Copy
-rw-r----- 1 root root 54 Mar 23 17:18 cconvey.acalkjdafj+323==-=-2342309808234.xls
-rw-r----- 1 root root 27 Mar 23 17:18 cconveyacalkjdafj+323.xls.txt
-rw-r----- 1 root root 9 Mar 23 17:18 cconvey=acnastatusz+23423==-=-2340289723423089724.txt
-rw-r----- 1 root root 17 Mar 23 17:18 cconveyancestatusg5q0aCC1JK-aBRIok8L+jg==-=-43766338.pdf
-rw-r----- 1 root root 18 Mar 23 17:18 cconveyancestatuskYMXtXkxtren0pSQ-l7J+Q==-=-48489900.pdf
-rw-r----- 1 root root 19 Mar 23 17:18 cconveyancestatusz+45hkPLw9xe78iTNMrNwQ==-=-22077524.pdf

You'll see that none of the other files have been touched, but cconveyacalkjdafj+323.xls.txt, the file the original sed command had trouble with, has been fixed.

So the question is, "Is there a way to combine the two to work in one sed statement?" I'm thinking along the lines of how, if you want more than one pattern in a grep, you use egrep with pipes. Is there something similar available in sed for this? As always, thanks in advance for your input!
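
One possible way to cover both cases in a single expression is sketched below. It assumes that the text between "==-=-" and the first following "." never contains a dot itself, as in the numeric IDs in the listings above. The function name clean_name is made up for illustration.

```shell
# clean_name is a hypothetical name used only for this sketch.
clean_name() {
    # [^.]* matches the run of non-dot characters after the marker, so the
    # second bracket starts at the first dot that follows it and keeps
    # every extension from there to the end of the name.
    sed 's/\(.*\)==-=-[^.]*\(\..*\)/\1\2/'
}

printf '%s\n' \
    'cconvey.acalkjdafj+323==-=-2342309808234.xls' \
    'cconveyacalkjdafj+323==-=-2342309808234.xls.txt' \
    'cconvey=acnastatusz+23423==-=-2340289723423089724.txt' | clean_name
```

This prints cconvey.acalkjdafj+323.xls, cconveyacalkjdafj+323.xls.txt and cconvey=acnastatusz+23423.txt. As for the egrep comparison, sed also accepts several -e expressions on one command line and applies them in order, which is another way to combine two substitutions.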