Use sed/awk to do like copy and paste

I have rrd file which is have the gaps and I want to fill it out with some value , I've got 10 NaN record and I try to populate data from 10 records be for NaN to change instead of NaN :frowning:

                        <!-- 2016-05-19 14:10:00 CST / 1463638200 --> <row><v>8.2147341249e+06</v><v>1.1121553155e+07</v></row>
                        <!-- 2016-05-19 14:15:00 CST / 1463638500 --> <row><v>8.2746054049e+06</v><v>1.1065925833e+07</v></row>
                        <!-- 2016-05-19 14:20:00 CST / 1463638800 --> <row><v>8.5873130347e+06</v><v>1.1330819417e+07</v></row>
                        <!-- 2016-05-19 14:25:00 CST / 1463639100 --> <row><v>8.7400510244e+06</v><v>1.1466312671e+07</v></row>
                        <!-- 2016-05-19 14:30:00 CST / 1463639400 --> <row><v>8.8934198478e+06</v><v>1.1612750075e+07</v></row>
                        <!-- 2016-05-19 14:35:00 CST / 1463639700 --> <row><v>8.6970836405e+06</v><v>1.1451382750e+07</v></row>
                        <!-- 2016-05-19 14:40:00 CST / 1463640000 --> <row><v>8.6519721561e+06</v><v>1.1490998652e+07</v></row>
                        <!-- 2016-05-19 14:45:00 CST / 1463640300 --> <row><v>8.7016410558e+06</v><v>1.1758066110e+07</v></row>
                        <!-- 2016-05-19 14:50:00 CST / 1463640600 --> <row><v>8.7026601272e+06</v><v>1.1808612970e+07</v></row>
                        <!-- 2016-05-19 14:55:00 CST / 1463640900 --> <row><v>8.8925026101e+06</v><v>1.2043713450e+07</v></row>
                        <!-- 2016-05-19 15:00:00 CST / 1463641200 --> <row><v>8.7455789370e+06</v><v>1.2420369632e+07</v></row>
                        <!-- 2016-05-19 15:05:00 CST / 1463641500 --> <row><v>8.6264080619e+06</v><v>1.2602318595e+07</v></row>
                        <!-- 2016-05-19 15:10:00 CST / 1463641800 --> <row><v>8.8378430034e+06</v><v>1.3041342755e+07</v></row>
                        <!-- 2016-05-19 15:15:00 CST / 1463642100 --> <row><v>NaN</v><v>NaN</v></row>
                        <!-- 2016-05-19 15:20:00 CST / 1463642400 --> <row><v>NaN</v><v>NaN</v></row>
                        <!-- 2016-05-19 15:25:00 CST / 1463642700 --> <row><v>NaN</v><v>NaN</v></row>
                        <!-- 2016-05-19 15:30:00 CST / 1463643000 --> <row><v>NaN</v><v>NaN</v></row>
                        <!-- 2016-05-19 15:35:00 CST / 1463643300 --> <row><v>NaN</v><v>NaN</v></row>
                        <!-- 2016-05-19 15:40:00 CST / 1463643600 --> <row><v>NaN</v><v>NaN</v></row>
                        <!-- 2016-05-19 15:45:00 CST / 1463643900 --> <row><v>NaN</v><v>NaN</v></row>
                        <!-- 2016-05-19 15:50:00 CST / 1463644200 --> <row><v>NaN</v><v>NaN</v></row>
                        <!-- 2016-05-19 15:55:00 CST / 1463644500 --> <row><v>NaN</v><v>NaN</v></row>
                        <!-- 2016-05-19 16:00:00 CST / 1463644800 --> <row><v>NaN</v><v>NaN</v></row>
                        <!-- 2016-05-19 16:05:00 CST / 1463645100 --> <row><v>9.4998580882e+06</v><v>1.3563528239e+07</v></row>
                        <!-- 2016-05-19 16:10:00 CST / 1463645400 --> <row><v>9.3834388158e+06</v><v>1.3154323934e+07</v></row>

expected result shoule be with out NaN , actually the reson behind this is need to get rid of the gaps when plot the graph..

                        <!-- 2016-05-19 13:50:00 CST / 1463637000 --> <row><v>7.8650237427e+06</v><v>1.1633105231e+07</v></row>
                        <!-- 2016-05-19 13:55:00 CST / 1463637300 --> <row><v>7.7744480110e+06</v><v>1.1049584241e+07</v></row>
                        <!-- 2016-05-19 14:00:00 CST / 1463637600 --> <row><v>8.0235789005e+06</v><v>1.1089589128e+07</v></row>
                        <!-- 2016-05-19 14:05:00 CST / 1463637900 --> <row><v>8.1035284930e+06</v><v>1.1177274228e+07</v></row>
                        <!-- 2016-05-19 14:10:00 CST / 1463638200 --> <row><v>8.2147341249e+06</v><v>1.1121553155e+07</v></row>
                        <!-- 2016-05-19 14:15:00 CST / 1463638500 --> <row><v>8.2746054049e+06</v><v>1.1065925833e+07</v></row>
                        <!-- 2016-05-19 14:20:00 CST / 1463638800 --> <row><v>8.5873130347e+06</v><v>1.1330819417e+07</v></row>
                        <!-- 2016-05-19 14:25:00 CST / 1463639100 --> <row><v>8.7400510244e+06</v><v>1.1466312671e+07</v></row>
                        <!-- 2016-05-19 14:30:00 CST / 1463639400 --> <row><v>8.8934198478e+06</v><v>1.1612750075e+07</v></row>
                        <!-- 2016-05-19 14:35:00 CST / 1463639700 --> <row><v>8.6970836405e+06</v><v>1.1451382750e+07</v></row>
                        <!-- 2016-05-19 14:40:00 CST / 1463640000 --> <row><v>8.6519721561e+06</v><v>1.1490998652e+07</v></row>
                        <!-- 2016-05-19 14:45:00 CST / 1463640300 --> <row><v>8.7016410558e+06</v><v>1.1758066110e+07</v></row>
                        <!-- 2016-05-19 14:50:00 CST / 1463640600 --> <row><v>8.7026601272e+06</v><v>1.1808612970e+07</v></row>
                        <!-- 2016-05-19 14:55:00 CST / 1463640900 --> <row><v>8.8925026101e+06</v><v>1.2043713450e+07</v></row>
                        <!-- 2016-05-19 15:00:00 CST / 1463641200 --> <row><v>8.7455789370e+06</v><v>1.2420369632e+07</v></row>
                        <!-- 2016-05-19 15:05:00 CST / 1463641500 --> <row><v>8.6264080619e+06</v><v>1.2602318595e+07</v></row>
                        <!-- 2016-05-19 15:10:00 CST / 1463641800 --> <row><v>8.8378430034e+06</v><v>1.3041342755e+07</v></row>
                        <!-- 2016-05-19 15:15:00 CST / 1463642100 --> <row><v>8.7400510244e+06</v><v>1.1466312671e+07</v></row> --
                        <!-- 2016-05-19 15:20:00 CST / 1463642400 --> <row><v>8.8934198478e+06</v><v>1.1612750075e+07</v></row> --
                        <!-- 2016-05-19 15:25:00 CST / 1463642700 --> <row><v>8.6970836405e+06</v><v>1.1451382750e+07</v></row> --
                        <!-- 2016-05-19 15:30:00 CST / 1463643000 --> <row><v>8.6519721561e+06</v><v>1.1490998652e+07</v></row> --
                        <!-- 2016-05-19 15:35:00 CST / 1463643300 --> <row><v>8.7016410558e+06</v><v>1.1758066110e+07</v></row> --
                        <!-- 2016-05-19 15:40:00 CST / 1463643600 --> <row><v>8.7026601272e+06</v><v>1.1808612970e+07</v></row> --
                        <!-- 2016-05-19 15:45:00 CST / 1463643900 --> <row><v>8.8925026101e+06</v><v>1.2043713450e+07</v></row> --
                        <!-- 2016-05-19 15:50:00 CST / 1463644200 --> <row><v>8.7455789370e+06</v><v>1.2420369632e+07</v></row> --
                        <!-- 2016-05-19 15:55:00 CST / 1463644500 --> <row><v>8.6264080619e+06</v><v>1.2602318595e+07</v></row> -- 
                        <!-- 2016-05-19 16:00:00 CST / 1463644800 --> <row><v>8.8378430034e+06</v><v>1.3041342755e+07</v></row> --
                        <!-- 2016-05-19 16:05:00 CST / 1463645100 --> <row><v>9.4998580882e+06</v><v>1.3563528239e+07</v></row> 
                        <!-- 2016-05-19 16:10:00 CST / 1463645400 --> <row><v>9.3834388158e+06</v><v>1.3154323934e+07</v></row>  

Parsing XML is not trivial. Because of frequent requests, I've got an awk script which works in some common situations however.

Contents of xml.awk:

BEGIN {
        FS=">"; OFS=">";
        RS="<"; ORS="<"
}

# These should be special variables for match() but aren't.
function rbefore(STR)   { return(substr(STR, N, RSTART-1)); }# before match
function rmid(STR)      { return(substr(STR, RSTART, 1)); }  # First char match
function rall(STR)      { return(substr(STR, RSTART, RLENGTH)); }# Entire match
function rafter(STR)    { return(substr(STR, RSTART+RLENGTH)); }# after match

function aquote(OUT, A, PFIX, TA) { # Turns Q SUBSEP R into A[PFIX":"Q]=R
        if(OUT)
        {
                if(PFIX) PFIX=PFIX":"
                split(OUT, TA, SUBSEP);
                A[toupper(PFIX) toupper(TA[1])]=TA[2];
        }

        return("");
}

# Intended to be less stupid about quoted text in XML/HTML.
# Splits a='b' c='d' e='f' into A[PFIX":"a]=b, A[PFIX":"c]=d, etc.
function qsplit(STR, A, PFIX, X, OUT) {
        while(STR && match(STR, /([ \n\t]+)|[\x27\x22=]/))
        {
                OUT = OUT rbefore(STR);
                RMID=rmid(STR);

                if((RMID == "'") || (RMID == "\""))     # Quote characters
                {
                        if(!Q)          Q=RMID;         # Begin quote section
                        else if(Q == RMID)      Q="";   # End quote section
                        else                    OUT = OUT RMID; # Quoted quote
                } else if(RMID == "=") {
                        if(Q)   OUT=OUT RMID; else OUT=OUT SUBSEP;
                } else if((RMID=="\r")||(RMID=="\n")||(RMID=="\t")||(RMID==" ")) {
                        if(Q)   OUT = OUT rall(STR); # Literal quoted whitespace
                        else    OUT = aquote(OUT, A, PFIX); # Unquoted WS, next block
                }
                STR=rafter(STR); # Strip off the text we've processed already.
        }

        aquote(OUT STR, A, PFIX); # Process any text we haven't already.
}


{ SPEC=0 ; TAG="" }

NR==1 {
        if(ORS == RS) print;
        next } # The first "line" is blank when RS=<

/^[!?]/ { SPEC=1    }   # XML specification junk

# Handle open-tags
match($1, /^[^\/ \r\n\t>]+/) {
        TAG=substr(toupper($1), RSTART, RLENGTH);
        if((!SPEC) && !($1 ~ /\/$/))
        {
                TAGS=TAG "%" TAGS;
                DEP++;
                LTAGS=TAGS
        }

        for(X in ARGS) delete ARGS[X];

        qsplit(rafter($1), ARGS, "", "", "");
}

# Handle close-tags
(!SPEC) && /^[\/]/ {
        sub(/^\//, "", $1);
        LTAGS=TAGS
#        sub("^.*" toupper($1) "%", "", TAGS);
        sub("^" toupper($1) "%", "", TAGS);
        $1="/"$1
        DEP=split(TAGS, TA, "%")-1;
        if(DEP < 0) DEP=0;
}

How to use it in this case:

$ awk -f xml.awk -e 'BEGIN { P[0]=0; P[1]=0; ORS="" }
                /^row>/ { D=0 }
                /^v>/ {
                        if($2 == "NaN") $2=P[D];
                        else            P[D]=$2;
                        D++;
                }
                { print "<" $0; }' datafile

<!-- 2016-05-19 14:10:00 CST / 1463638200 --> <row><v>8.2147341249e+06</v><v>1.1121553155e+07</v></row>
                        <!-- 2016-05-19 14:15:00 CST / 1463638500 --> <row><v>8.2746054049e+06</v><v>1.1065925833e+07</v></row>
                        <!-- 2016-05-19 14:20:00 CST / 1463638800 --> <row><v>8.5873130347e+06</v><v>1.1330819417e+07</v></row>
                        <!-- 2016-05-19 14:25:00 CST / 1463639100 --> <row><v>8.7400510244e+06</v><v>1.1466312671e+07</v></row>
                        <!-- 2016-05-19 14:30:00 CST / 1463639400 --> <row><v>8.8934198478e+06</v><v>1.1612750075e+07</v></row>
                        <!-- 2016-05-19 14:35:00 CST / 1463639700 --> <row><v>8.6970836405e+06</v><v>1.1451382750e+07</v></row>
                        <!-- 2016-05-19 14:40:00 CST / 1463640000 --> <row><v>8.6519721561e+06</v><v>1.1490998652e+07</v></row>
                        <!-- 2016-05-19 14:45:00 CST / 1463640300 --> <row><v>8.7016410558e+06</v><v>1.1758066110e+07</v></row>
                        <!-- 2016-05-19 14:50:00 CST / 1463640600 --> <row><v>8.7026601272e+06</v><v>1.1808612970e+07</v></row>
                        <!-- 2016-05-19 14:55:00 CST / 1463640900 --> <row><v>8.8925026101e+06</v><v>1.2043713450e+07</v></row>
                        <!-- 2016-05-19 15:00:00 CST / 1463641200 --> <row><v>8.7455789370e+06</v><v>1.2420369632e+07</v></row>
                        <!-- 2016-05-19 15:05:00 CST / 1463641500 --> <row><v>8.6264080619e+06</v><v>1.2602318595e+07</v></row>
                        <!-- 2016-05-19 15:10:00 CST / 1463641800 --> <row><v>8.8378430034e+06</v><v>1.3041342755e+07</v></row>
                        <!-- 2016-05-19 15:15:00 CST / 1463642100 --> <row><v>8.8378430034e+06</v><v>1.3041342755e+07</v></row>
                        <!-- 2016-05-19 15:20:00 CST / 1463642400 --> <row><v>8.8378430034e+06</v><v>1.3041342755e+07</v></row>
                        <!-- 2016-05-19 15:25:00 CST / 1463642700 --> <row><v>8.8378430034e+06</v><v>1.3041342755e+07</v></row>
                        <!-- 2016-05-19 15:30:00 CST / 1463643000 --> <row><v>8.8378430034e+06</v><v>1.3041342755e+07</v></row>
                        <!-- 2016-05-19 15:35:00 CST / 1463643300 --> <row><v>8.8378430034e+06</v><v>1.3041342755e+07</v></row>
                        <!-- 2016-05-19 15:40:00 CST / 1463643600 --> <row><v>8.8378430034e+06</v><v>1.3041342755e+07</v></row>
                        <!-- 2016-05-19 15:45:00 CST / 1463643900 --> <row><v>8.8378430034e+06</v><v>1.3041342755e+07</v></row>
                        <!-- 2016-05-19 15:50:00 CST / 1463644200 --> <row><v>8.8378430034e+06</v><v>1.3041342755e+07</v></row>
                        <!-- 2016-05-19 15:55:00 CST / 1463644500 --> <row><v>8.8378430034e+06</v><v>1.3041342755e+07</v></row>
                        <!-- 2016-05-19 16:00:00 CST / 1463644800 --> <row><v>8.8378430034e+06</v><v>1.3041342755e+07</v></row>
                        <!-- 2016-05-19 16:05:00 CST / 1463645100 --> <row><v>9.4998580882e+06</v><v>1.3563528239e+07</v></row>
                        <!-- 2016-05-19 16:10:00 CST / 1463645400 --> <row><v>9.3834388158e+06</v><v>1.3154323934e+07</v></row>

$
1 Like

I have this after try to simulate

awk: xml.awk:3: fatal: cannot open file `-e' for reading (No such file or directory)

Append these lines to the end of xml.awk:

BEGIN { P[0]=0; P[1]=0; ORS="" }
                /^row>/ { D=0 }
                /^v>/ {
                        if($2 == "NaN") $2=P[D];
                        else            P[D]=$2;
                        D++;
                }
                { print "<" $0; }

...then just run awk -f xml.awk datafile

1 Like

I just realized that my awk bundle with Ubuntu 12.04 GNU Awk 3.1.8 after compile with new version GNU Awk 4.1.3, API: 1.1 your script work like a charm.

1 Like

Sorry about that, ubuntu uses mawk by default these days as it's faster.

It should work with nearly any version of GNU awk. It even works with busybox's minimal awk. But mawk lacks -e, annoyingly.

You can work around that by just adding the lines to the file as noted above.

Is it possible to filter only from 2016-05-19 15:15:00 - 2016-05-19 15:55:00 CST ? , haven't look in to script and understand how it really work yet .

Follow your suggestion all NaN in rrd dump file will change to some value.

                        <!-- 2016-05-19 15:15:00 CST / 1463642100 --> <row><v>NaN</v><v>NaN</v></row>
                        <!-- 2016-05-19 15:20:00 CST / 1463642400 --> <row><v>NaN</v><v>NaN</v></row>
                        <!-- 2016-05-19 15:25:00 CST / 1463642700 --> <row><v>NaN</v><v>NaN</v></row>
                        <!-- 2016-05-19 15:30:00 CST / 1463643000 --> <row><v>NaN</v><v>NaN</v></row>
                        <!-- 2016-05-19 15:35:00 CST / 1463643300 --> <row><v>NaN</v><v>NaN</v></row>
                        <!-- 2016-05-19 15:40:00 CST / 1463643600 --> <row><v>NaN</v><v>NaN</v></row>
                        <!-- 2016-05-19 15:45:00 CST / 1463643900 --> <row><v>NaN</v><v>NaN</v></row>
                        <!-- 2016-05-19 15:50:00 CST / 1463644200 --> <row><v>NaN</v><v>NaN</v></row>
                        <!-- 2016-05-19 15:55:00 CST / 1463644500 --> <row><v>NaN</v><v>NaN</v></row>

Luckily your file uses YYYY-MM-SS HH:MM:SS timestamps, which sort alphabetically. Those can be compared with < > <= etc without worrying about any date math. MIN="..." and MAX="..." are where those values are input into awk.

$ awk -f xml.awk -e 'BEGIN { P[0]=0; P[1]=0; ORS="" }
                # Extract YYYY-MM-DD HH:MM:SS time from special tag
                SPEC && match($0, /[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9:]*/) {
                        TIME=rall($0);
                }
                /^row>/ { D=0 }
                /^v>/ {
                        if($2 != "NaN") { P[D]=$2; }
                        else if((TIME >= MIN) && (TIME <= MAX) && P[D])
                        {       $2=P[D] }
                        D++;
                }

                { print "<" $0; }' MIN="2016-05-19 15:15:00" MAX="2016-05-19 15:55:00" datafile

<!-- 2016-05-19 14:10:00 CST / 1463638200 --> <row><v>8.2147341249e+06</v><v>1.1121553155e+07</v></row>
                        <!-- 2016-05-19 14:15:00 CST / 1463638500 --> <row><v>8.2746054049e+06</v><v>1.1065925833e+07</v></row>
                        <!-- 2016-05-19 14:20:00 CST / 1463638800 --> <row><v>8.5873130347e+06</v><v>1.1330819417e+07</v></row>
                        <!-- 2016-05-19 14:25:00 CST / 1463639100 --> <row><v>8.7400510244e+06</v><v>1.1466312671e+07</v></row>
                        <!-- 2016-05-19 14:30:00 CST / 1463639400 --> <row><v>8.8934198478e+06</v><v>1.1612750075e+07</v></row>
                        <!-- 2016-05-19 14:35:00 CST / 1463639700 --> <row><v>NaN</v><v>1.1451382750e+07</v></row>
                        <!-- 2016-05-19 14:40:00 CST / 1463640000 --> <row><v>8.6519721561e+06</v><v>1.1490998652e+07</v></row>
                        <!-- 2016-05-19 14:45:00 CST / 1463640300 --> <row><v>8.7016410558e+06</v><v>1.1758066110e+07</v></row>
                        <!-- 2016-05-19 14:50:00 CST / 1463640600 --> <row><v>8.7026601272e+06</v><v>1.1808612970e+07</v></row>
                        <!-- 2016-05-19 14:55:00 CST / 1463640900 --> <row><v>8.8925026101e+06</v><v>1.2043713450e+07</v></row>
                        <!-- 2016-05-19 15:00:00 CST / 1463641200 --> <row><v>8.7455789370e+06</v><v>1.2420369632e+07</v></row>
                        <!-- 2016-05-19 15:05:00 CST / 1463641500 --> <row><v>8.6264080619e+06</v><v>1.2602318595e+07</v></row>
                        <!-- 2016-05-19 15:10:00 CST / 1463641800 --> <row><v>8.8378430034e+06</v><v>1.3041342755e+07</v></row>
                        <!-- 2016-05-19 15:15:00 CST / 1463642100 --> <row><v>8.8378430034e+06</v><v>1.3041342755e+07</v></row>
                        <!-- 2016-05-19 15:20:00 CST / 1463642400 --> <row><v>8.8378430034e+06</v><v>1.3041342755e+07</v></row>
                        <!-- 2016-05-19 15:25:00 CST / 1463642700 --> <row><v>8.8378430034e+06</v><v>1.3041342755e+07</v></row>
                        <!-- 2016-05-19 15:30:00 CST / 1463643000 --> <row><v>8.8378430034e+06</v><v>1.3041342755e+07</v></row>
                        <!-- 2016-05-19 15:35:00 CST / 1463643300 --> <row><v>8.8378430034e+06</v><v>1.3041342755e+07</v></row>
                        <!-- 2016-05-19 15:40:00 CST / 1463643600 --> <row><v>8.8378430034e+06</v><v>1.3041342755e+07</v></row>
                        <!-- 2016-05-19 15:45:00 CST / 1463643900 --> <row><v>8.8378430034e+06</v><v>1.3041342755e+07</v></row>
                        <!-- 2016-05-19 15:50:00 CST / 1463644200 --> <row><v>8.8378430034e+06</v><v>1.3041342755e+07</v></row>
                        <!-- 2016-05-19 15:55:00 CST / 1463644500 --> <row><v>8.8378430034e+06</v><v>1.3041342755e+07</v></row>
                        <!-- 2016-05-19 16:00:00 CST / 1463644800 --> <row><v>NaN</v><v>NaN</v></row>
                        <!-- 2016-05-19 16:05:00 CST / 1463645100 --> <row><v>9.4998580882e+06</v><v>1.3563528239e+07</v></row>
                        <!-- 2016-05-19 16:10:00 CST / 1463645400 --> <row><v>9.3834388158e+06</v><v>1.3154323934e+07</v></row>

$

I added an NaN to an earlier line to test if it would be fixed or not.

Note the last line of NaN's not corrected from specifying 15:55 instead of 16:00

awk '
NR==FNR {
   if ($7 !~ /NaN/) {f7[NR]=$7; for (i=NR-gc7; i<NR; i++) f7=f7[i-gc7] ? f7[i-gc7] : l7 ; l7=$7; gc7=0} else {gc7++}
   if ($11 !~ /NaN/) {f11[NR]=$11; for (i=NR-gc11; i<NR; i++) f11=f11[i-gc11] ? f11[i-gc11] : l11 ; l11=$11; gc11=0} else {gc11++}
   next
}
{
   if (($7+0) != $7) {sub("NaN", f7[FNR])}
   if (($11+0) != $11) {sub("NaN", f11[FNR])}
}
1
' FS="[<>]" infile infile
1 Like

@rdrtx1 can fill the gapbut when try to restore xml into rrd it's not able to plot the graph.

@Corona688 I'm thinking to duplicate data from yesterday with same time frame to fill the Nan value. so the graph will become to strait line , actually try do some tricky way

 $2=P[D] * rand()

but value is too different than SD ,will try to work around on your code. xml.awk let me iterate through xml a lot easier

Could you completely explain what you want? Show input and output. It will be much easier to make what you actually want than modify something you didn't want into something you do, one tiny but drastic rewrite at a time.

sorry for mess the requested just my random outburst idea btw the script in #8 #9 are really fixing the gaps in rrd xml dump file and definitely solved my problem.