Need help with sed and regexp

Boxtuna · August 23, 2012, 5:49pm

Hi everyone, I would really appreciate any help I could get on the following topic.
I am not very familiar with regular expressions or with sed; I just know the basic uses. What I am trying to do is the following: I have a huge text file in which I would like to replace all occurrences of a certain pattern with another one. Here is an example:

  "name1" = "12;15";
  "name5" = "7";
  "abc1" = "3";
  "5" = "";
  "-7" = "";
  "hgf" = "12;15";
  "e1" = "8";
  "-5" = "";

Should change to:

  "name1" = "12;15";
  "name5" = "7";
  "abc1" = "3;5;-7";
  "hgf" = "12;15";
  "e1" = "8;-5";

The rule is: any assignment to a variable whose name starts with a letter (like "name1") is preserved, while any assignment to a numerical variable (like "1" or "-1") should be appended to the previous line.

I couldn't get it to work because the condition spans more than one line. I know I am supposed to use the N instruction but don't know how. Any help with doing the above in sed or any other way is very much appreciated.

Thanks.

agama · August 23, 2012, 10:01pm

I think it's easier to code, read, maintain in awk:

awk -F = '
    match( $1, /"(-*[0-9]+)"/, hit ) {
        gsub( "\";", ";" hit[1] "\";", p );
        next;
    }
    {
        if( p )
            print p;
        p = $0;
    }

    END {
        if( p )
            print p;
    }
' input-file >output-file
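
Since the question asked about sed, here is a rough sketch of the same idea for comparison. It uses the hold space rather than N, and it assumes a sed that accepts -E (GNU and BSD sed do) and that every numeric line follows a line ending in ";. The wrapper name merge_sed is made up for illustration.

```shell
# merge_sed is a hypothetical wrapper name, not part of the thread.
merge_sed() {
    sed -n -E '
    # Numeric assignment such as   "5" = "";
    /^[[:space:]]*"-?[0-9.]+"[[:space:]]*=/ {
        # Reduce the line to the bare number
        s/^[[:space:]]*"([^"]*)".*/\1/
        # Append it to the held previous line, then work on that line
        H
        x
        # Turn   ...3";<newline>5   into   ...3;5";
        s/";\n(.*)$/;\1";/
        # Not the last line: put the merged line back on hold and move on
        $!{
            x
            d
        }
        # Last line: print the merged line
        p
        d
    }
    # Ordinary line: hold it and print the previously held line
    x
    1!p
    # Last line: also print the line that was just held
    ${
        x
        p
    }
    ' "$@"
}

printf '%s\n' '  "abc1" = "3";' '  "5" = "";' '  "-7" = "";' | merge_sed
# prints:   "abc1" = "3;5;-7";
```

The awk version is arguably easier to follow, which is the point made above; the sed sketch mainly shows that the job can be done by holding one line back instead of joining lines with N.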

RudiC · August 24, 2012, 3:51am

Please help me out: which awk version supports match(s, r, X)? mawk does not. And what is the resulting action? I couldn't find it in this forum's man pages.

elixir_sinari · August 24, 2012, 4:00am

From man page(s) for gawk:

       match(s, r [, a])       Return the position in s where the regular expression r occurs, or 0 if r is not present, and set
                               the values of RSTART and RLENGTH. Note that the argument order is the same as for the ~ operator:
                               str ~ re. If array a is provided, a is cleared and then elements 1 through n are filled with the
                               portions of s that match the corresponding parenthesized subexpression in r. The 0'th element of a
                               contains the portion of s matched by the entire regular expression r. Subscripts a[n, "start"], and
                               a[n, "length"] provide the starting index in the string and length respectively, of each matching
                               substring.
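
For awks without the array argument, the two-argument form plus substr() can usually recover the same text, because RSTART and RLENGTH describe the whole match. A minimal sketch (the helper name quoted_number is made up for illustration):

```shell
# quoted_number is a hypothetical helper: print the first quoted number per line.
quoted_number() {
    awk '{
        # Two-argument match() sets RSTART and RLENGTH for the whole match
        if (match($0, /"-?[0-9.]+"/))
            print substr($0, RSTART + 1, RLENGTH - 2)   # strip both quotes
    }' "$@"
}

printf '%s\n' '  "-0.1" = "";' | quoted_number
# prints: -0.1
```

Note that the offsets +1 and -2 only strip the quotes because the pattern itself starts and ends with a quote; if the pattern also matched leading blanks, they would be included in the match and the arithmetic would need adjusting.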

RudiC · August 24, 2012, 5:06am

Ah, gawk. Got it, thank you!

Boxtuna · August 24, 2012, 6:53am

agama, thanks a lot! Your awk example works like a charm.

Don_Cragun · August 25, 2012, 1:11am

Note that if you're on a system that doesn't have gawk and awk's match() function only takes two arguments (such as on OS X), the following should also work:

awk -F = '$1 ~ /"(-*[0-9]+)"/ {
        split($1, f, "\"")
        sub( "\";", ";" f[2] "\";", p )
        next
}
        { if(p) print p
        p = $0
}
END { if(p) print p
}' input_file > output_file

Scrutinizer · August 25, 2012, 2:38am

Alternatively

awk -F\" '$2~/[a-z]/{if(p) print p; p=$0; next}{sub(/";/, ";" $2 "&", p)} END{print p}' infile

Don_Cragun · August 25, 2012, 6:42am

I like it. But, the END clause still needs to be:

END{if(p)print p}

in case infile is an empty file.
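
A quick way to see the difference, using /dev/null as a stand-in for an empty infile (the function names are made up for this sketch):

```shell
# Pending-line pattern with an unguarded END block
end_unguarded() { awk '{ p = $0 } END { print p }' "$@"; }
# Same, but END only prints when a line was actually stored
end_guarded()   { awk '{ p = $0 } END { if (p) print p }' "$@"; }

end_unguarded /dev/null | wc -c    # 1 byte: a lone newline
end_guarded   /dev/null | wc -c    # 0 bytes
```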

Scrutinizer · August 25, 2012, 8:55am

Hi, yes.. I left the condition out because the effect would be limited to an extra newline added to an empty file. And I assumed the script would be run against a non-empty file...

Boxtuna · August 27, 2012, 12:57pm

I have actually noticed two small issues with the suggested script:

an empty line in the input file will be removed in the output file
the more serious issue is that any line that has the pattern "123" will be completely removed. Example:

      stop_bin "926","mem_ddr_iobist_vmax"

got removed in the output file, while it shouldn't.

How can I change the condition /"(-*[0-9.]+)"/ to also include the " ="?

Don_Cragun · August 27, 2012, 2:37pm

boxtuna:

I have actually noticed two small issues with the suggested script:

an empty line in the input file will be removed in the output file

the more serious issue is that any line that has the pattern "123" will be completely removed. Example:
   stop_bin "926","mem_ddr_iobist_vmax"
got removed in the output file, while it shouldn't.

How can I change the condition /"(-*[0-9.]+)"/ to also include the " ="?

You aren't giving us enough information:

When lines that don't match the format of the lines you said your input file contained in your first message appear, are they just supposed to be copied to the output?
If one of these lines appears after a line like "name" = "1"; and before an associated line like "2" = "";, what is supposed to happen?
Will there ever be lines with more than one <equals-sign> character? If so, what is supposed to be done with them?
Will there ever be lines with one <equals-sign> character that is not in one of the two forms specified in your first message? If so, what is supposed to be done with them?

Boxtuna · August 27, 2012, 5:43pm

I apologize for not being specific. I know how misleading that can be. And thanks again for the help.

Here's an actual example of an input file and how the output should look:

INPUT

tm_27:

  "Pinlist" = "VDAC_VREF";
  "VDAC_G" = "";
  "ForceCurrent" = "-0.1";
  "-0.1" = "";
  "0.1" = "";
  "-0.122" = "";
  "45.1120" = "";
  "-0.1" = "";
  "-0.1" = "";
  "3456" = "";
  "1" = "";
  "PassVoltMin" = "-950";
  "PassVoltMax" = "-850";
  "-850" = "";
  "-850" = "";
  "-850" = "";
  "-850" = "";
  "-850" = "";
  "-850" = "";
  stop_bin "926","mem_ddr_iobist_vmax",,bad,noreprobe,red,5,over_on;
  stop_bin "927","mem_ddr_iobist_vmax",,bad,noreprobe,red,5,over_on;

OUTPUT

tm_27:
  "Pinlist" = "VDAC_VREF";
  "VDAC_G" = "";
  "ForceCurrent" = "-0.1;-0.1;0.1;-0.122;45.1120;-0.1;-0.1;3456;1";
  "PassVoltMin" = "-950";
  "PassVoltMax" = "-850;-850;-850;-850;-850;-850;-850";
  stop_bin "926","mem_ddr_iobist_vmax",,bad,noreprobe,red,5,over_on;
  stop_bin "927","mem_ddr_iobist_vmax",,bad,noreprobe,red,5,over_on;

The pattern to match is anything like:
"123" =
or
"-123" =
or
"123.12" =
The equal sign is important. In addition, there should be nothing but spaces before this statement. I assume the regex for this is /"(-?[0-9]*\.?[0-9]+)"/
If the pattern is found, then that number should be moved to the previous line as shown above. The previous line will always have something like "name1" = "3456";
Any line that does not have the special pattern should be kept unchanged, even if it is empty.
I hope this is clearer.
I appreciate the help!

---------- Post updated at 11:43 PM ---------- Previous update was at 10:22 PM ----------

I managed to do this by changing the pattern to:

/^[ ]*"(-?[0-9]*\.?[0-9]+)"/
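
To illustrate why anchoring helps, here is a small sketch that contrasts the earlier unanchored condition with an anchored one. The function names are made up, and the trailing [ \t]*= in the anchored pattern is an extra assumption on top of the pattern above, added only to reflect the remark that the equal sign matters.

```shell
# Earlier condition: a quoted number anywhere in the first =-separated field
match_unanchored() { awk -F = '$1 ~ /"-*[0-9.]+"/' "$@"; }
# Anchored condition: only blanks before the quote, and an = after it
match_anchored()   { awk '/^[ \t]*"-?[0-9]*\.?[0-9]+"[ \t]*=/' "$@"; }

sample='  "-0.1" = "";
  stop_bin "926","mem_ddr_iobist_vmax",,bad,noreprobe,red,5,over_on;'

printf '%s\n' "$sample" | match_unanchored   # prints both lines
printf '%s\n' "$sample" | match_anchored     # prints only the "-0.1" line
```

The stop_bin line has no equal sign, so the whole line ends up in $1 and the unanchored pattern finds "926" inside it; the anchored pattern rejects it because the line does not begin with a quoted number.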

agama · August 27, 2012, 9:19pm

Ha! yes the pattern will do the trick; nice job. Changes to my original suggestion which will also keep blank lines:

awk -F = '
    match( $1, /^[ \t]*"(-*[0-9.]+)"/, hit ) {
        gsub( "\";", ";" hit[1] "\";", p );
        next;
    }
    {
        if( p )
            print p;
        if( ! NF )
            print;
        p = $0;
    }

    END {
        if( p )
            print p;
    }
' input-file

If a gawk-like three-argument match() isn't available:

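awk -F = '
    # Two-argument match() sets RSTART/RLENGTH for the whole match,
    # leading blanks included, so the blanks and quotes are stripped
    # from the matched text in a separate step.
    match( $1, /^[ \t]*"[ \t]*-*[0-9.]+"/ ) {
        num = substr( $1, RSTART, RLENGTH );
        gsub( /^[ \t]*"[ \t]*|"$/, "", num );
        gsub( "\";", ";" num "\";", p );
        next;
    }
    {
        if( p )
            print p;
        if( ! NF )
            print;
        p = $0;
    }

    END {
        if( p )
            print p;
    }
' input-file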
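
As a further sketch along the same lines, here is a variant that should run on any POSIX awk. It is not the script from the posts above: the function name merge_numeric, the held flag (used instead of testing p, so empty lines pass through unchanged), and the pattern that requires the equal sign are all assumptions made for illustration. A numeric line is only merged when the held line ends in ";; otherwise it is passed through as an ordinary line.

```shell
# merge_numeric is a hypothetical wrapper name, not part of the thread.
merge_numeric() {
    awk '
    # Numeric assignment such as   "-0.1" = "";   with only blanks before it
    /^[ \t]*"-?[0-9]*\.?[0-9]+"[ \t]*=/ && held {
        num = $0
        sub(/^[ \t]*"/, "", num)      # drop leading blanks and opening quote
        sub(/".*$/, "", num)          # drop closing quote and everything after
        # Append the number before the final "; of the held line, if present
        if (sub(/";[ \t]*$/, ";" num "\";", prev))
            next
    }
    {
        if (held) print prev          # flush the held line, even if empty
        prev = $0
        held = 1
    }
    END { if (held) print prev }
    ' "$@"
}

printf '%s\n' 'tm_27:' '' '  "A" = "1";' '  "-0.5" = "";' '  "2" = "";' \
    '  stop_bin "926","x";' | merge_numeric
```

With that input the numeric lines are folded into "A" = "1;-0.5;2"; while the tm_27: line, the empty line, and the stop_bin line come out unchanged.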