Adding Leading Zeros for date in a file

Mannu2525 · September 8, 2016, 4:04am

Hello,

I have a pipe separated file with two major lines. One is header and another is detail line. Header starts with H and Detail start with D.

Sample Content:

 H|123456|Joes Watson|UK|4/5/2016|12/5/2016|3456|HC|NW||||||
 D|123456|Joes Watson|4/5/2016|12/5/2016|LR|LW||||||||||||||||||
 D|123456|Joes Watson|4/5/2016|6/11/2016|WR|LW||||||||||||||||||
 H|123457|Jack Roes|UK|11/11/2015|12/11/2015|3456|HC|NW||||||
 D|123457|Jack Roes|1/5/2016|1/5/2016|LR|LW||||||||||||||||||
 D|123457|Jack Roes|4/5/2016|6/12/2016|WR|LW||||||||||||||||||

Here Date format should be in mm/dd/yyyy but file has date format in different format - m/dd/yyyy or mm/d/yyyy or m/d/yyyy. If the format is not correct then a leading zero should be added to make it in mm/dd/yyyy like 4/5/2016 to 04/05/2016.

I tried few awk printf commands but nothing works out for multiple occurrences within the line.

Please let me know your possible way to crack this scenario.

Thanks,
Mannu

apmcd47 · September 8, 2016, 4:29am

It looks to me as though when it's a header ($1 == 'H') you are looking at fields 5 and 6 ($5 and $6). When it's a detail line you are looking at fields 4 and 5.

I don't know awk but is seems to me you need a construct like:

if ($1 == 'H') {
   # code for $5
   # identical code but for $6
} elsif ($1 == 'D') {
   # code for $5
   # identical code but for $4
}

As for the modification for the date you need to split on '/' and rebuild using printf.

I don't really know awk so the above construct is probably nonsensical.

Concentrate on modifying field 5 and then try putting the above construct in for the other fields.

Andrew

hergp · September 8, 2016, 4:48am

You can try this awk script. It looks for fields looking like dates, splits them into month, day and year and recombines them using a proper format.

BEGIN {
    FS="|"
    OFS="|"
}
{
    for (field = 1; field <= NF; field++) {
        if (match ($field, ".*/.*/.*")) {
            split ($field, mdy, "/");
            $field = sprintf ("%02d/%02d/%d", mdy[1], mdy[2], mdy[3]);
        }
    }
    print
}

Very old versions of awk, like /usr/bin/awk on Solaris do not understand sprintf . You should be on the safe side with GNU-awk or nawk though.

bakunin · September 8, 2016, 4:54am

In your sample there seem to be no other data similar to dates, so the following abbreviation may (or may not, depending on how representative your sample is) work:

Replace every instance of a single digit followed by a slash by this sequence with a "0" prepended. This would amount to a simple sed -command:

sed 's/\([^0-9]\)\([0-9]\/\)/\10\2/g' /path/to/your/file

This will replace in i.e. "|4/5/2016" the "|4/" with "|04/" and then the "/5/" to "/05/".

I hope this helps.

bakunin

RavinderSingh13 · September 8, 2016, 5:07am

hergp:

You can try this awk script. It looks for fields looking like dates, splits them into month, day and year and recombines them using a proper format.
BEGIN {
   FS="|"
   OFS="|"
}
{
   for (field = 1; field <= NF; field++) {
   if (match ($field, ".*/.*/.*")) {
   split ($field, mdy, "/");
   $field = sprintf ("%02d/%02d/%d", mdy[1], mdy[2], mdy[3]);
   }
   }
   print
}
Very old versions of awk, like /usr/bin/awk on Solaris do not understand sprintf . You should be on the safe side with GNU-awk or nawk though.

Hello hergp,

Thank you for nice code. I think you could change (match ($field, ".*/.*/.*")) to (match($field,/[0-9]+\/[0-9]+\/[0-9]+/)) , which will get only digits(for matching dates only).

Thanks,
R. Singh

drl · September 8, 2016, 7:45am

Hi.

A suite of programs dealing with many aspects of dates is dateutils, found in many repositories ( ArchLinux, Debian, Fedora, FreeBSD, Gentoo, NetBSD, OpenSuSE, OS, Slackware, Ubuntu ).

Most of the codes scan the input lines for data that looks like a date format, then processes it. In this case we ask dconv to read it and write it with the same format:

#!/usr/bin/env bash

# @(#) s1       Demonstrate date reformatting, dateutils.dconv
# Suite dateutils is in many repositories, and:
# https://github.com/hroptatyr/dateutils (verified 2016.09.08)

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C dateutils.dconv
which dateutils.dconv

FILE=${1-data1}

pl " Input data file $FILE:"
cat $FILE

# ...4/5/2016|12/5/2016...
pl " Results:"
dateutils.dconv -S -i "%m/%d/%Y" -f "%m/%d/%Y" < $FILE

exit 0

producing:

$ ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution        : Debian 8.4 (jessie) 
bash GNU bash 4.3.30
dateutils.dconv dconv 0.3.1
/usr/bin/dateutils.dconv

-----
 Input data file data1:
 H|123456|Joes Watson|UK|4/5/2016|12/5/2016|3456|HC|NW||||||
 D|123456|Joes Watson|4/5/2016|12/5/2016|LR|LW||||||||||||||||||
 D|123456|Joes Watson|4/5/2016|6/11/2016|WR|LW||||||||||||||||||
 H|123457|Jack Roes|UK|11/11/2015|12/11/2015|3456|HC|NW||||||
 D|123457|Jack Roes|1/5/2016|1/5/2016|LR|LW||||||||||||||||||
 D|123457|Jack Roes|4/5/2016|6/12/2016|WR|LW||||||||||||||||||

-----
 Results:
 H|123456|Joes Watson|UK|04/05/2016|12/05/2016|3456|HC|NW||||||
 D|123456|Joes Watson|04/05/2016|12/05/2016|LR|LW||||||||||||||||||
 D|123456|Joes Watson|04/05/2016|06/11/2016|WR|LW||||||||||||||||||
 H|123457|Jack Roes|UK|11/11/2015|12/11/2015|3456|HC|NW||||||
 D|123457|Jack Roes|01/05/2016|01/05/2016|LR|LW||||||||||||||||||
 D|123457|Jack Roes|04/05/2016|06/12/2016|WR|LW||||||||||||||||||

If not in your repository or you cannot change your system, see the github source -- that is how I originally obtained the codes until dateutils appeared in my repository.

Best wishes ... cheers, drl

Don_Cragun · September 8, 2016, 5:08pm

bakunin:

In your sample there seem to be no other data similar to dates, so the following abbreviation may (or may not, depending on how representative your sample is) work:

Replace every instance of a single digit followed by a slash by this sequence with a "0" prepended. This would amount to a simple sed -command:
sed 's/$[^0-9]$$[0-9]\/$/\10\2/g' /path/to/your/file
This will replace in i.e. "|4/5/2016" the "|4/" with "|04/" and then the "/5/" to "/05/".

I hope this helps.

bakunin

Not quite. If I have one of the OP's input lines:

 D|123457|Jack Roes|1/5/2016|1/5/2016|LR|LW|||||||||||||||||

the output I get is:

 D|123457|Jack Roes|01/5/2016|01/5/2016|LR|LW||||||||||||||||||

This is because the first match on the BRE is |1/ and you want to match /5/ to change the 5 to 05 , but the leading / for the 2nd match was consumed as the trailing / in the previous match.

This can be fixed by applying that substitute command twice:

sed -e 's/\([^0-9]\)\([0-9]\/\)/\10\2/g' -e 's/\([^0-9]\)\([0-9]\/\)/\10\2/g' /path/to/your/file

or with a slightly simpler second substitute command:

sed -e 'sx\([^0-9]\)\([0-9]/\)x\10\2xg' -e 'sx/\([0-9]/\)x/0\1xg' /path/to/your/file

(using x as the substitution delimiter instead of / to avoid a few backslashes).

Note that this sed solution will also work if a date field also includes time (e.g., 1/5/2016 11:23:45 ) while the awk solutions suggested will fail in this case. And note that this sed solution may "fix" too much if there is a field containing a simple fraction where the numerator is a single digit (e.g., 1/2 off will become 01/2 off and 2/3 majority will become 02/3 majority ). None of these are problems with the given sample input, but with all of those empty fields in the sample we have to wonder what else might appear in those field in other input lines???

MadeInGermany · September 9, 2016, 5:58am

Some sed versions do have the \< and \> boundary anchors; they do not consume a character. And not only allow a /g in one go but also make the substitution easier.

sed 's#\<[0-9]/#0&#g'

--
Just seeing that the Solaris /usr/xpg4/bin/sed counts the \< as a character that is truely a bug (/usr/bin/sed works).

looney · September 9, 2016, 8:56am

madeingermany:

Some sed versions do have the \< and \> boundary anchors; they do not consume a character. And not only allow a /g in one go but also make the substitution easier.
sed 's#\<[0-9]/#0&#g'
--
Just seeing that the Solaris /usr/xpg4/bin/sed counts the \< as a character that is truely a bug (/usr/bin/sed works).

Hi MIG,
How below highlighted part is working.

sed 's#\<[0-9]/#0&#g'

.Why only one sided boundary is being used. Please explain.

MadeInGermany · September 9, 2016, 9:05am

The right side is a /
If it were a \> then the following would be changed as well

4.2.2000
4-2-2001

jlliagre · September 9, 2016, 9:56am

I see no difference in behavior between the Solaris xpg4 and the regular svr4 sed implementations.

Your syntax only works with GNU sed, /usr/gnu/bin/sed on Solaris 11.

Don_Cragun · September 9, 2016, 5:52pm

madeingermany:

Some sed versions do have the \< and \> boundary anchors; they do not consume a character. And not only allow a /g in one go but also make the substitution easier.
sed 's#\<[0-9]/#0&#g'
--
Just seeing that the Solaris /usr/xpg4/bin/sed counts the \< as a character that is truely a bug (/usr/bin/sed works).

Why do you think this is a bug in /usr/xpg4/bin/sed ? The standards say that the REs accepted by sed are Basic Regular Expressions (BREs) with three minor tweaks that don't affect this command. In a BRE the meaning of a backslash followed by a character that is not a BRE special character, a <left-parenthesis>, <right-parenthesis>, <left-curly-bracket>, <right-curly-bracket>, the current delimiter character (the character that delimits the BRE and replacement from the command and flags), and the decimal digits 1 through 9 is undefined (and all of these are valid after a <backslash> only in certain contexts). The BRE special characters are <period>, <left-square-bracket>, <backslash>, <asterisk>, <circumflex>, and <dollar-sign> and all of these are also only special in certain contexts.

According to the standards, using \< or \> in a BRE can be treated by the implementation as a request to match a < or > , respectively; match a word boundary; abort with a BRE syntax error; hunt down and kill the programmer who wrote that BRE, or anything else it chooses to do. (Legal and marketing implications make it unlikely that an implementation will make the next to the last choice in that list. )

MadeInGermany · September 12, 2016, 5:11am

The Solaris man page for both /usr/bin/sed and /usr/xpg4/bin/sed refers to the regexp man page, that states

...
     3.1             \< constrains a RE to match the beginning of
                     a  string  or  to follow a character that is
                     not a  digit,  underscore,  or  letter.  The
                     first  character  matching  the RE must be a
                     digit, underscore, or letter.

Even if it does not explicitly say that the \< should never count as a character, the implementations in /usr/bin/sed and /usr/xpg4/bin/sed should be consistent.

Don_Cragun · September 12, 2016, 1:09pm

madeingermany:

The Solaris man page for both /usr/bin/sed and /usr/xpg4/bin/sed refers to the regexp man page, that states
...
   3.1             \< constrains a RE to match the beginning of
   a  string  or  to follow a character that is
   not a  digit,  underscore,  or  letter.  The
   first  character  matching  the RE must be a
   digit, underscore, or letter.
Even if it does not explicitly say that the \< should never count as a character, the implementations in /usr/bin/sed and /usr/xpg4/bin/sed should be consistent.

Hi MadeInGermany,
OK. Yes, the fact that /usr/xpg4/bin/sed does not behave as documented by the Solaris sed man page is definitely a bug. (It isn't a standards conformance issue, but it is a bug.)

Thanks,
Don