Error while using sed command

Jag02 · January 7, 2019, 5:12am

I have a few csv files in a directory and I am using the sed command to build the output from the filenames. Details are given below:

Directory: /tmp/test/output
Files: JAN_DAT_TES1_201807181400.csv
         JAN_DAT_TES2_201807181500.csv

I want to get the output as

/tmp/test/output/TES1  2018071814
/tmp/test/output/TES2  2018071815

I tried the following code, but it gives the error "Unmatched ) or \)":

find /tmp/test/output -type f -name "*.csv" | sed -e 's/JAN_DAT_(.*\)\_([0-9]\{10\}\).*.csv/\1 \2/g'

Kindly help me correct the mistake in the above command.

bakunin · January 7, 2019, 5:33am

The good news is: you do not need sed for this at all. In fact, as long as you are not manipulating a data stream, you usually should not use sed and should use "variable expansion" instead. It is the shell's equivalent of substr(), trim(), strtok() and similar functions found in other high-level languages.

The first thing you want to do is to cut off the extension ".csv" from the filename:

file="JAN_DAT_TES1_201807181400.csv"
echo "${file%.csv}"

This works the following way: ${variable%pattern} cuts off the pattern from the end of the variable's content if it is found. "pattern" is everything you could use as a filename pattern in the shell: "*" means any number of any characters (like in "file*"), "?" means any one character, etc. Notice also that this only DISPLAYS the resulting string, it does NOT CHANGE the content of the variable! If you want to change it permanently you need to assign the new value:

file="${file%.csv}"
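As a minimal sketch of how wildcards behave inside the pattern, using the filename from the question (the variable names no_ext, no_tail and no_date are chosen only for illustration):

```shell
file="JAN_DAT_TES1_201807181400.csv"

no_ext="${file%.csv}"      # literal pattern: strips ".csv"
no_tail="${file%??.csv}"   # each "?" matches one character: strips "00.csv"
no_date="${file%_*}"       # "*" matches anything: strips the shortest "_..." suffix

echo "$no_ext"     # JAN_DAT_TES1_201807181400
echo "$no_tail"    # JAN_DAT_TES1_2018071814
echo "$no_date"    # JAN_DAT_TES1
echo "$file"       # the original variable is unchanged
```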

Since you also want to cut off the last two zeroes, we can do both in a single step:

file="JAN_DAT_TES1_201807181400.csv"
echo "${file%00.csv}"

The next thing we want is to cut off the date, but we need to preserve it, so we assign a new variable with a copy of the contents of "$file", but with everything up to the last underscore removed. For this there is another expansion, which works like the one I showed you but cuts off from the beginning instead of the end:

file="JAN_DAT_TES1_2018071814"
echo "${file##*_}"

Notice that I used "##" instead of "#". "##" and its companion "%%" cut off the longest possible match whereas "#" and "%" cut off the shortest possible match. That means:

file="JAN_DAT_TES1_2018071814"
echo "${file#*_}"     # gives "DAT_TES1_2018071814"
echo "${file##*_}"    # gives "2018071814"
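The same shortest/longest distinction applies when cutting from the end. A small sketch with the same string (the variable names are illustrative only):

```shell
file="JAN_DAT_TES1_2018071814"

shortest_suffix="${file%_*}"    # removes "_2018071814"
longest_suffix="${file%%_*}"    # removes "_DAT_TES1_2018071814"

echo "$shortest_suffix"   # gives "JAN_DAT_TES1"
echo "$longest_suffix"    # gives "JAN"
```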

Now, putting it all together (notice that you can use a variable as pattern too!):

for file in /some/where/*.csv ; do
     file="${file%00.csv}"
     date="${file##*_}"
     file="${file%_${date}}"
     echo "my file is: $file , my date is: $date"
done

I hope this helps.

bakunin

Jag02 · January 7, 2019, 5:52am

Thank you. But I would like the output to be

/tmp/test/output/TES1  2018071814

The above code gives output as

/tmp/test/JAN_DAT_TES1 2018071814

bakunin · January 7, 2019, 6:38am

In this case: cut off the other parts using what I have shown you:

for file in /some/where/*.csv ; do
     path="${file%/*}"        # remove the filename, leaving the path  "/some/where/JAN_DAT_TES1_201807181400.csv" => "/some/where"
     file="${file%00.csv}"    # remove the extension and the last "00"  "/some/where/JAN_DAT_TES1_201807181400.csv" => "/some/where/JAN_DAT_TES1_2018071814"
     date="${file##*_}"       # extract the date: "/some/where/JAN_DAT_TES1_2018071814" => "2018071814"
     file="${file%_${date}}"  # remove the date from the filename: "/some/where/JAN_DAT_TES1_2018071814"  => "/some/where/JAN_DAT_TES1"
     file="${file##*_}"       # remove everything up to the last "_" from the filename: "/some/where/JAN_DAT_TES1"  => "TES1"

     echo "my file is: $path/$file , my date is: $date"
done
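To get exactly the two-column format asked for, the same steps could be wrapped in a function. This is only a sketch: the name format_csv_path is hypothetical, it assumes every filename ends in "00.csv" as in the examples, and the variables it sets are global (plain POSIX sh has no "local").

```shell
# Hypothetical helper: prints "<dir>/<name>  <date>" for one path.
format_csv_path() {
    path="${1%/*}"          # directory part
    base="${1##*/}"         # filename without the directory
    base="${base%00.csv}"   # drop ".csv" and the trailing "00"
    stamp="${base##*_}"     # text after the last "_": the date
    name="${base%_*}"       # drop "_<date>"
    name="${name##*_}"      # text after the last remaining "_"
    printf '%s/%s  %s\n' "$path" "$name" "$stamp"
}

for f in /tmp/test/output/*.csv ; do
    [ -e "$f" ] || continue   # skip the literal pattern if nothing matches
    format_csv_path "$f"
done
```

For /tmp/test/output/JAN_DAT_TES1_201807181400.csv this prints "/tmp/test/output/TES1  2018071814".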

I hope this helps.

bakunin

Jag02 · January 7, 2019, 7:08am

Thank you! It helped.
If you could please tell me what the error in the sed code was, it would help me learn better.

bakunin · January 7, 2019, 8:42am

sed -e 's/JAN_DAT_(.*\)\_([0-9]\{10\}\).*.csv/\1 \2/g'

The first thing is: sed patterns are always "greedy", which means they match the longest possible part of the line. A pattern like A.*B will NOT match just the leading "AXYZB" but the WHOLE of the following string:

AXYZBBBBBBBBBBBBBBB

Therefore /JAN_DAT_(.*\) is already at least problematic if not wrong. What you want to do is to match up to the next underscore, so you had better exclude the underscore from the matching set:

/JAN_DAT_\([^_]*\)_....

In the example above, you would not write A.*B but A[^B]*B ("A, followed by any number of non-Bs, followed by B") to match only from A to the next following B.
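A quick way to see the difference is to let sed wrap whatever it matched in square brackets ("&" in the replacement stands for the matched text). This is just an illustrative sketch; the variable names are made up:

```shell
line="AXYZBBBBBBBBBBBBBBB"

# Greedy: ".*" runs to the last B, so the brackets enclose the whole line.
greedy=$(printf '%s\n' "$line" | sed 's/A.*B/[&]/')

# Negated set: "[^B]*" stops at the first B, so only "AXYZB" is enclosed.
bounded=$(printf '%s\n' "$line" | sed 's/A[^B]*B/[&]/')

echo "$greedy"
echo "$bounded"
```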

The second thing was the unescaped "(", which I corrected in the pattern above by writing "\(". The same goes for the second opening bracket, before the "[0-9]". You see, sed uses "BRE", "basic regular expressions", unlike e.g. awk, which uses "ERE"s, "extended regular expressions". In BRE you have to escape grouping brackets whereas in ERE you don't. This is also where the error message comes from: each "\)" closes a group, but the unescaped "(" before it is just a literal character and never opened one. I do not know which sed you use; some are able to use EREs too, but I wouldn't recommend relying on that. It is easier to use BREs, which work everywhere, than to write a script on one system only to have it fail on another just because one sed has a non-standard extension and the other doesn't.
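Putting both corrections together, one possible complete command might look like the sketch below. It is an assembled example, not a command taken verbatim from this thread: besides escaping both opening brackets and using [^_]*, it drops the unnecessary backslash before the underscore, escapes the dot in front of "csv", anchors the match at the end of the line and omits the "g" flag. The function name reformat_paths is hypothetical.

```shell
# Hypothetical wrapper around the corrected substitution (BRE syntax).
reformat_paths() {
    sed -e 's/JAN_DAT_\([^_]*\)_\([0-9]\{10\}\).*\.csv$/\1  \2/'
}

# Sample paths from the question, fed in via printf instead of find.
printf '%s\n' \
    /tmp/test/output/JAN_DAT_TES1_201807181400.csv \
    /tmp/test/output/JAN_DAT_TES2_201807181500.csv |
reformat_paths
```

With real files, the input would come from find instead: find /tmp/test/output -type f -name "*.csv" | reformat_paths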

I hope this helps.

bakunin