Extract date and time part from filename

Hi,

I am facing one scenario in which I need to extract exact position of date and time from the name of the files. For example, Below is the record in which I need to extract position of YYYYMMDD,HHMISS and YYMMDD . Date and time variables can come more than once. I need to use these position and rename ABC_123_YYYYMMDD_V01P1234YYMMDDHHMISS.txt file in ABC_123_V01P.YYYYMMDD.HHMISS.txt format i.e. If I receive file as ABC_123_20190401_V01P1234190401190242.txt then I should be able to rename it as ABC_123_V01P.20190401.190242.txt.

$ cat RECORD
SOURCE|LOGIN|SERVER|ABC_123_YYYYMMDD_V01P1234YYMMDDHHMISS.txt|ABC_123_V01P.YYYYMMDD.HHMISS.txt|CATEGORY
$

I have tried below code but, I am not getting desired result.

My Code -

$ cat timestamp_format.csv | awk -F"," -v OFS="," '{print $1,$2,$3,$4}' | while IFS="," read FILE_TIMESTAMP FILE_TIMESTAMP_PARAM FILE_TIMESTAMP_REGEX EXP_FILE_TIMESTAMP_FORMAT
do
awk -v OFS="|" -v FILE_TIMESTAMP="${FILE_TIMESTAMP}" -v FILE_TIMESTAMP_REGEX="${FILE_TIMESTAMP_REGEX}" -v EXP_FILE_TIMESTAMP_FORMAT="${EXP_FILE_TIMESTAMP_FORMAT}" 'FNR==NR{a[NR]=$1;b[NR]=$3;c[NR]=$4;CNT=NR;next}
{
SOURCE_FILENAME=$4
TARGET_FILENAME=$5
for (i=1;i<=CNT;i++)
{
VAR=a;
DATE_FORMAT_REGEX=b;
S1=index($4,VAR);
L1=length(VAR);
S_FTP_FILE_TIMESTAMP1=index(SOURCE_FILENAME,FILE_TIMESTAMP);
L_FTP_FILE_TIMESTAMP1=length(FILE_TIMESTAMP);
TEMP_SOURCE_FILENAME=sub(VAR,DATE_FORMAT_REGEX,SOURCE_FILENAME);
S2=index(SOURCE_FILENAME,VAR);
L2=length(VAR);
S_FTP_FILE_TIMESTAMP2=index(SOURCE_FILENAME,FILE_TIMESTAMP);
L_FTP_FILE_TIMESTAMP2=length(FILE_TIMESTAMP);
for (j=1;j<=CNT;j++)
{
DVAR=a[j];
EXP_DATE_FORMAT=c[j];
DS1=index($5,DVAR);
DL1=length(DVAR);
DS_FILE_TIMESTAMP1=index(TARGET_FILENAME,FILE_TIMESTAMP);
DL_FILE_TIMESTAMP1=length(FILE_TIMESTAMP);
TEMP_TARGET_FILENAME=sub(VAR,DATE_FORMAT_REGEX,TARGET_FILENAME);
DS2=index(TARGET_FILENAME,DVAR);
DL2=length(DVAR);
DS_FILE_TIMESTAMP2=index(TARGET_FILENAME,FILE_TIMESTAMP);
DL_FILE_TIMESTAMP2=length(FILE_TIMESTAMP);
if ( S1 > 0 && DS1 > 0 && S_FTP_FILE_TIMESTAMP1 > 0 && DS_FILE_TIMESTAMP1 > 0 )
{
print $0,S1,L1,substr($4,S1,L1),DATE_FORMAT_REGEX,S_FTP_FILE_TIMESTAMP1,L_FTP_FILE_TIMESTAMP1,FILE_TIMESTAMP,FILE_TIMESTAMP_REGEX,S2,L2,substr($4,S2,L2),DATE_FORMAT_REGEX,S_FTP_FILE_TIMESTAMP2,L_FTP_FILE_TIMESTAMP2,FILE_TIMESTAMP,FILE_TIMESTAMP_REGEX,DS1,DL1,substr($5,DS1,DL1),EXP_DATE_FORMAT,DS_FILE_TIMESTAMP1,DL_FILE_TIMESTAMP1,FILE_TIMESTAMP,FILE_TIMESTAMP_REGEX,EXP_FILE_TIMESTAMP_FORMAT,DS2,DL2,substr($5,DS2,DL2),EXP_DATE_FORMAT,DS_FILE_TIMESTAMP2,DL_FILE_TIMESTAMP2,FILE_TIMESTAMP,FILE_TIMESTAMP_REGEX,EXP_FILE_TIMESTAMP_FORMAT
}
}
}
}' FS="," date_format.csv FS="|" RECORD
done
$

Mapping files used are -

$ cat date_format.csv
YYYYMMDD,[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9],........,%Y%m%d
YYYY-MM-DD,[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9],....-..-..,%Y-%m-%d
YYYY_MM_DD,[0-9][0-9][0-9][0-9]_[0-9][0-9]_[0-9][0-9],...._.._..,%Y_%m_%d
YYMMDD,[0-9][0-9][0-9][0-9][0-9][0-9],......,%y%m%d
YY-MM-DD,[0-9][0-9]-[0-9][0-9]-[0-9][0-9],..-..-..,%y-%m-%d
DD-MMM-YYYY,[0-9][0-9]-[a-z|A-Z][a-z|A-Z][a-z|A-Z]-[0-9][0-9][0-9][0-9],..-...-....,%d-%b-%Y
DD_MMM_YYYY,[0-9][0-9]_[a-z|A-Z][a-z|A-Z][a-z|A-Z]_[0-9][0-9][0-9][0-9],.._..._....,%d_%b_%Y
D-MMM-YYYY,[0-9]-[a-z|A-Z][a-z|A-Z][a-z|A-Z]-[0-9][0-9][0-9][0-9],.-...-....,%e-%b-%Y
D_MMM_YYYY,[0-9]_[a-z|A-Z][a-z|A-Z][a-z|A-Z]_[0-9][0-9][0-9][0-9],._..._....,%e_%b_%Y
$
$ cat timestamp_format.csv
XXXXXXXXXX,[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9],..........,%s
HHMISS,[0-9][0-9][0-9][0-9][0-9][0-9],......,%H%M%S
$

Above code is giving me this output, which is wrong -

SOURCE|LOGIN|SERVER|ABC_123_YYYYMMDD_V01P1234YYMMDDHHMISS.txt|ABC_123_V01P.YYYYMMDD.HHMISS.txt|CATEGORY|9|8|YYYYMMDD|........|32|6|HHMISS|......|0|8|ABC_123_|........|32|6|HHMISS|......|14|8|YYYYMMDD|%Y%m%d|23|6|HHMISS|......|%H%M%S|0|8|ABC_123_|%Y%m%d|23|6|HHMISS|......|%H%M%S
SOURCE|LOGIN|SERVER|ABC_123_YYYYMMDD_V01P1234YYMMDDHHMISS.txt|ABC_123_V01P.YYYYMMDD.HHMISS.txt|CATEGORY|9|8|YYYYMMDD|........|32|6|HHMISS|......|0|8|ABC_123_|........|32|6|HHMISS|......|16|6|YYMMDD|%y%m%d|23|6|HHMISS|......|%H%M%S|0|6|ABC_12|%y%m%d|23|6|HHMISS|......|%H%M%S
SOURCE|LOGIN|SERVER|ABC_123_YYYYMMDD_V01P1234YYMMDDHHMISS.txt|ABC_123_V01P.YYYYMMDD.HHMISS.txt|CATEGORY|11|6|YYMMDD|......|32|6|HHMISS|......|0|6|ABC_12|......|32|6|HHMISS|......|14|8|YYYYMMDD|%Y%m%d|23|6|HHMISS|......|%H%M%S|0|8|ABC_123_|%Y%m%d|23|6|HHMISS|......|%H%M%S
SOURCE|LOGIN|SERVER|ABC_123_YYYYMMDD_V01P1234YYMMDDHHMISS.txt|ABC_123_V01P.YYYYMMDD.HHMISS.txt|CATEGORY|11|6|YYMMDD|......|32|6|HHMISS|......|0|6|ABC_12|......|32|6|HHMISS|......|16|6|YYMMDD|%y%m%d|23|6|HHMISS|......|%H%M%S|0|6|ABC_12|%y%m%d|23|6|HHMISS|......|%H%M%S

Expected Output is as follows -

SOURCE|LOGIN|SERVER|ABC_123_YYYYMMDD_V01P1234YYMMDDHHMISS.txt|ABC_123_V01P.YYYYMMDD.HHMISS.txt|CATEGORY|9|8|YYYYMMDD|........|32|6|HHMISS|......|26|6|YYMMDD|......|0|0|||14|8|YYYYMMDD|%Y%m%d|23|6|HHMISS|......|%H%M%S|0|0||||0|0||||

Can someone advise on this code or suggest better approach?

1 Like

Hello Prathmesh,

Very good that you have shared all your efforts and tried to post nice stuff with complete details. Sorry but it is still NOT fully clear, if you edit your post with more descriptive details and more descriptive sample input and specially sample expected output it will be easy for us to help you on this one. Keep up the good job of learning and sharing, cheers and enjoy learning on this site :b:

Thanks,
R. Singh

Thanks Ravinder for your attention.

Basically, I have below pipe delimited record as an input in which column 4 has the filename as received on the source side and column 5 has the filename required as an output. In short, we have to rename filename from format in column 4 to format in column 5. But, here date and time parts are variables in specific format. So, I have created 2 mapping files one with name date_format.csv and other with name timestamp_format.csv which will give predefined formats which are expected in input file.

Input Record -

SOURCE|LOGIN|SERVER|ABC_123_YYYYMMDD_V01P1234YYMMDDHHMISS.txt|ABC_123_V01P.YYYYMMDD.HHMISS.txt|CATEGORY

Expected Output is as follows -

SOURCE|LOGIN|SERVER|ABC_123_YYYYMMDD_V01P1234YYMMDDHHMISS.txt|ABC_123_V01P.YYYYMMDD.HHMISS.txt|CATEGORY|9|8|YYYYMMDD|........|32|6|HHMISS|......|26|6|YYMMDD|......|0|0|||14|8|YYYYMMDD|%Y%m%d|23|6|HHMISS|......|%H%M%S|0|0||||0|0||||

In the above expected output,
SOURCE|LOGIN|SERVER|ABC_123_YYYYMMDD_V01P1234YYMMDDHHMISS.txt|ABC_123_V01P.YYYYMMDD.HHMISS.txt|CATEGORY ---- This part is same as input
9|8|YYYYMMDD|........ ---- 9 is start position of format YYYYMMDD, 8 is length of format YYYYMMDD, YYYYMMDD is date format as mentioned in date_format.csv and ........ is the number of characters in format YYYYMMDD as metioned in date_format.csv.
32|6|HHMISS|...... ---- 32 is start position of format HHMISS, 6 is length of format HHMISS, HHMISS is time format as mentioned in timestamp_format.csv and ...... is the number of characters in format HHMISS as metioned in timestamp_format.csv.
26|6|YYMMDD|...... ---- 26 is start position of format YYMMDD, 6 is length of format YYMMDD, YYMMDD is date format as mentioned in date_format.csv and ...... is the number of characters in format YYMMDD as metioned in date_format.csv.
0|0|| ---- As we do not have second set of HHMISS the results are start_position=0,length=0,format blank (we can keep format as HHMISS|...... if it is easy)
Above part is checked against column 4 and below part is checked against column 5.
14|8|YYYYMMDD|%Y%m%d ---- 14 is start position of format YYYYMMDD, 8 is length of format YYYYMMDD, YYYYMMDD is date format as mentioned in date_format.csv , ........ is the number of characters in format YYYYMMDD as metioned in date_format.csv and %Y%m%d is the date format for YYYYMMDD in date command also defined in date_format.csv.
23|6|HHMISS|......|%H%M%S ---- 23 is start position of format HHMISS, 6 is length of format HHMISS, HHMISS is time format as mentioned in timestamp_format.csv , ...... is the number of characters in format HHMISS as metioned in timestamp_format.csv and %H%M%S is the time format for HHMISS in date command also defined in timestamp_format.csv.
0|0||||0|0|||| ---- remaining parameters are 0 or blank as they represent second occurance of date/time in start_position, length, date/time format, character regex and respective format of date command.

Hope this clarifies the requirement. Let me know if it is still not clear or if you have any better approach to extract date/time part.

This is not easy a request, both to understand and to implement. I have understood that you want to search two input fields, $4 and $5, for date/Ttime format descriptors, and append their metadata to the line. Additional data that shows up in your desired output, as ...... , |0|0| , or %Y%m%d are even harder to implement as specified. Try this a a first starting point:

awk -F"|" '
BEGIN   {MXFM = split ("YYYYMMDD|YYMMDD|HHMISS", FMT)
        }

        {printf "%s", $0
         for (j=4; j<=5; j++)   {TMP = $j
                                 PTR = 0
                                 for (i=1; i<=MXFM; i++) if (match (TMP, FMT))       {printf "|%s|%s|%s|...", PTR + RSTART, RLENGTH, substr (TMP, RSTART, RLENGTH)
                                                                                         TMP = substr (TMP, RSTART+RLENGTH)
                                                                                         PTR += RSTART + RLENGTH - 1
                                                                                        }
                                }
         print ""
        }
'  OFS="|" file
SOURCE|LOGIN|SERVER|ABC_123_YYYYMMDD_V01P1234YYMMDDHHMISS.txt|ABC_123_V01P.YYYYMMDD.HHMISS.txt|CATEGORY|9|8|YYYYMMDD|...|26|6|YYMMDD|...|32|6|HHMISS|...|14|8|YYYYMMDD|...|23|6|HHMISS|...

and see how far it gets you.

1 Like

Thanks for your help. I will try this out and see how it goes.

Thanks RudiC. This is very close, I will try to build on your solution and see how I can use it.
Only difference is I want same number of "." characters as in YYYYMMDD, YYMMDD or HHMISS.

So, Output will look like this -

SOURCE|LOGIN|SERVER|ABC_123_YYYYMMDD_V01P1234YYMMDDHHMISS.txt|ABC_123_V01P.YYYYMMDD.HHMISS.txt|CATEGORY|9|8|YYYYMMDD|........|26|6|YYMMDD|......|32|6|HHMISS|......|14|8|YYYYMMDD|........|23|6|HHMISS|......

Also, Instead of hardcoding YYYYMMDD, YYMMDD and HHMISS in split ("YYYYMMDD|YYMMDD|HHMISS", FMT) can we use separate files to pass them. I am trying to use below files.

$ cat RECORD
SOURCE|LOGIN|SERVER|ABC_123_YYYYMMDD_V01P1234YYMMDDHHMISS.txt|ABC_123_V01P.YYYYMMDD.HHMISS.txt|CATEGORY
$ cat date_format.csv
YYYYMMDD,[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9],........,%Y%m%d
YYYY-MM-DD,[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9],....-..-..,%Y-%m-%d
YYYY_MM_DD,[0-9][0-9][0-9][0-9]_[0-9][0-9]_[0-9][0-9],...._.._..,%Y_%m_%d
YYMMDD,[0-9][0-9][0-9][0-9][0-9][0-9],......,%y%m%d
YY-MM-DD,[0-9][0-9]-[0-9][0-9]-[0-9][0-9],..-..-..,%y-%m-%d
DD-MMM-YYYY,[0-9][0-9]-[a-z|A-Z][a-z|A-Z][a-z|A-Z]-[0-9][0-9][0-9][0-9],..-...-....,%d-%b-%Y
DD_MMM_YYYY,[0-9][0-9]_[a-z|A-Z][a-z|A-Z][a-z|A-Z]_[0-9][0-9][0-9][0-9],.._..._....,%d_%b_%Y
D-MMM-YYYY,[0-9]-[a-z|A-Z][a-z|A-Z][a-z|A-Z]-[0-9][0-9][0-9][0-9],.-...-....,%e-%b-%Y
D_MMM_YYYY,[0-9]_[a-z|A-Z][a-z|A-Z][a-z|A-Z]_[0-9][0-9][0-9][0-9],._..._....,%e_%b_%Y
$ cat timestamp_format.csv
XXXXXXXXXX,[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9],..........,%s
HHMISS,[0-9][0-9][0-9][0-9][0-9][0-9],......,%H%M%S

So, I am trying to write awk using three files RECORD, date_format.csv and timestamp_format.csv to achieve below output.

SOURCE|LOGIN|SERVER|ABC_123_YYYYMMDD_V01P1234YYMMDDHHMISS.txt|ABC_123_V01P.YYYYMMDD.HHMISS.txt|CATEGORY|9|8|YYYYMMDD|........|26|6|YYMMDD|......|32|6|HHMISS|......|14|8|YYYYMMDD|........|23|6|HHMISS|......

Kindly advise how can I use awk on these three files.

Hmmmm - not quite clear yet when you want the dots, and when the printf format strings, or when both... See how far this gets you:

awk '
FNR == 1        {FILENR++
                }
FILENR <= 2     {CNT++
                 CHFM[CNT] = $1
                 RGEX[CNT] = $2
                 DOTS[CNT] = $3
                 FMTS[CNT] = $4
                 next
                }
                {printf "%s", $0
                 for (j=4; j<=5; j++)   {TMP = $j
                                         PTR = 0
                                         for (i=1; i<=CNT; i++) if (match (TMP, CHFM))       {printf "|%s|%s|%s|%s|%s", PTR + RSTART, RLENGTH, CHFM, DOTS, FMTS
                                                                                                 TMP = substr (TMP, RSTART+RLENGTH)
                                                                                                 PTR += RSTART + RLENGTH - 1
                                                                                                }
                                        }
                 print ""
                }
'  FS=","  *format.csv FS="|" OFS="|" RECORD
SOURCE|LOGIN|SERVER|ABC_123_YYYYMMDD_V01P1234YYMMDDHHMISS.txt|ABC_123_V01P.YYYYMMDD.HHMISS.txt|CATEGORY|9|8|YYYYMMDD|........|%Y%m%d|26|6|YYMMDD|......|%y%m%d|32|6|HHMISS|......|%H%M%S|14|8|YYYYMMDD|........|%Y%m%d|23|6|HHMISS|......|%H%M%S
2 Likes

Thanks RudiC. This works perfectly well for me.
If I can ask for one more thing that arise with this result. Now, with the same three input files RECORD, date_format.csv and timestamp_format.csv, I need to group how many dates are in column 4 and 5.

For example, If the RECORD file has records -

$ cat RECORD
SOURCE|LOGIN|SERVER|ABC_123_YYYYMMDD_V01P1234YYMMDDHHMISS.txt|ABC_123_V01P.YYYYMMDD.HHMISS.txt|CATEGORY
SOURCE|LOGIN|SERVER|P_XYZ_1234_ABCD.YYYYMMDD.HHMISS.csv|Report.ABCD_csv.YYYYMMDD.HHMISS.csv|CATEGORY
$

Output should be -

SOURCE|LOGIN|SERVER|ABC_123_YYYYMMDD_V01P1234YYMMDDHHMISS.txt|ABC_123_V01P.YYYYMMDD.HHMISS.txt|CATEGORY|2|1|1|1
SOURCE|LOGIN|SERVER|P_XYZ_1234_ABCD.YYYYMMDD.HHMISS.csv|Report.ABCD_csv.YYYYMMDD.HHMISS.csv|CATEGORY|1|1|1|1

where,
SOURCE|LOGIN|SERVER|ABC_123_YYYYMMDD_V01P1234YYMMDDHHMISS.txt|ABC_123_V01P.YYYYMMDD.HHMISS.txt|CATEGORY and SOURCE|LOGIN|SERVER|P_XYZ_1234_ABCD.YYYYMMDD.HHMISS.csv|Report.ABCD_csv.YYYYMMDD.HHMISS.csv|CATEGORY represents the input record.

Next 2 columns 2|1|1|1 represents, number of date and time fields which is part of date_format.csv and timestamp_format.csv in fields ABC_123_YYYYMMDD_V01P1234YYMMDDHHMISS.txt and P_XYZ_1234_ABCD.YYYYMMDD.HHMISS.csv i.e. column 4 of input file RECORD.
and Next 2 1|1|1|1 columns represents, number of date and time fields which is part of date_format.csv and timestamp_format.csv in fields ABC_123_V01P.YYYYMMDD.HHMISS.txt and Report.ABCD_csv.YYYYMMDD.HHMISS.csv i.e. column 5 of input file RECORD.

So you want to print out info ACROSS lines, i.e. in line 1 should appear info about field 4 of line 1 and 2, and in line2 there should be info on field 5 of line 1 and 2? What if more lines will appear, i.e. line3, line 4, etc.? Where and how should those infos be printed?

Hi RudiC,

Let me explain why and what I am looking exactly.

For input file RECORD, earlier code provided by you provides below output -

Input File -

$ cat RECORD
SOURCE|LOGIN|SERVER|ABC_123_YYYYMMDD_V01P1234YYMMDDHHMISS.txt|ABC_123_V01P.YYYYMMDD.HHMISS.txt|CATEGORY
SOURCE|LOGIN|SERVER|P_XYZ_1234_ABCD.YYYYMMDD.HHMISS.csv|Report.ABCD_csv.YYYYMMDD.HHMISS.csv|CATEGORY
$

Output -

SOURCE|LOGIN|SERVER|ABC_123_YYYYMMDD_V01P1234YYMMDDHHMISS.txt|ABC_123_V01P.YYYYMMDD.HHMISS.txt|CATEGORY|9|8|YYYYMMDD|........|%Y%m%d|26|6|YYMMDD|......|%y%m%d|32|6|HHMISS|......|%H%M%S|14|8|YYYYMMDD|........|%Y%m%d|23|6|HHMISS|......|%H%M%S
SOURCE|LOGIN|SERVER|P_XYZ_1234_ABCD.YYYYMMDD.HHMISS.csv|Report.ABCD_csv.YYYYMMDD.HHMISS.csv|CATEGORY|17|8|YYYYMMDD|........|%Y%m%d|26|6|HHMISS|......|%H%M%S|17|8|YYYYMMDD|........|%Y%m%d|26|6|HHMISS|......|%H%M%S

Here, As you can see in input file for 1st record column 4 filename is ABC_123_YYYYMMDD_V01P1234YYMMDDHHMISS.txt which contains 2 date values (as specified in date_format.csv) and 1 time value (as specified in timestamp_format.csv).
For 2nd record column 4 filename is P_XYZ_1234_ABCD.YYYYMMDD.HHMISS.csv which contains 1 date value (as specified in date_format.csv) and 1 time value (as specified in timestamp_format.csv).

Similarly, For column 5 we have different number of date/time values. Which results in different number of columns in output depending on number of date/time values in filename. Hence, It is difficult to identify exact position of date/time variables for column 4 and 5.

For this reason, I was thinking of adding 4 more columns in output which will help me to identify how much date/time variables present in column 4 and 5.

SOURCE|LOGIN|SERVER|ABC_123_YYYYMMDD_V01P1234YYMMDDHHMISS.txt|ABC_123_V01P.YYYYMMDD.HHMISS.txt|CATEGORY|2|1|1|1
SOURCE|LOGIN|SERVER|P_XYZ_1234_ABCD.YYYYMMDD.HHMISS.csv|Report.ABCD_csv.YYYYMMDD.HHMISS.csv|CATEGORY|1|1|1|1

Here, For 1st output record, 2|1|1|1
-- 2 represents number of date variables in columns 4 as specified in date_format.csv
-- 1 represents number of time variables in columns 4 as specified in timestamp_format.csv
-- 1 represents number of date variables in columns 5 as specified in date_format.csv
-- 1 represents number of time variables in columns 5 as specified in timestamp_format.csv
For 2nd output record, 1|1|1|1
-- 1 represents number of date variables in columns 4 as specified in date_format.csv
-- 1 represents number of time variables in columns 4 as specified in timestamp_format.csv
-- 1 represents number of date variables in columns 5 as specified in date_format.csv
-- 1 represents number of time variables in columns 5 as specified in timestamp_format.csv

Hope this clarifies why I am looking for this. It will be even better if we can use our earlier code and produce output as below -

SOURCE|LOGIN|SERVER|ABC_123_YYYYMMDD_V01P1234YYMMDDHHMISS.txt|ABC_123_V01P.YYYYMMDD.HHMISS.txt|CATEGORY|2|1|1|1|9|8|YYYYMMDD|........|%Y%m%d|26|6|YYMMDD|......|%y%m%d|32|6|HHMISS|......|%H%M%S|14|8|YYYYMMDD|........|%Y%m%d|23|6|HHMISS|......|%H%M%S
SOURCE|LOGIN|SERVER|P_XYZ_1234_ABCD.YYYYMMDD.HHMISS.csv|Report.ABCD_csv.YYYYMMDD.HHMISS.csv|CATEGORY|1|1|1|1|17|8|YYYYMMDD|........|%Y%m%d|26|6|HHMISS|......|%H%M%S|17|8|YYYYMMDD|........|%Y%m%d|26|6|HHMISS|......|%H%M%S

Here, column 7,8,9 and 10 will serve the same purpose as mentioned above to represent number of date/time variables in column 4 and 5 respectively.

That's much clearer, thank you. How about (adding the new fields at the end, so we can print the format info as we go through the file)

awk '
FNR == 1        {FILENR++
                }

FILENR <= 2     {CNT++
                 CHFM[CNT] = $1
                 RGEX[CNT] = $2
                 DOTS[CNT] = $3
                 FMTS[CNT] = $4
                 FIDX[CNT] = FILENR
                 next
                }

                {printf "%s", $0
                 for (j=4; j<=5; j++)   {TMP = $j
                                         PTR = 0
                                         for (i=1; i<=CNT; i++) if (match (TMP, CHFM))       {printf "|%s|%s|%s|%s|%s", PTR + RSTART, RLENGTH, CHFM, DOTS, FMTS
                                                                                                 TMP = substr (TMP, RSTART+RLENGTH)
                                                                                                 PTR += RSTART + RLENGTH - 1
                                                                                                 TYP[j-3,FIDX]++
                                                                                                }
                                        }
                 for (j=1; j<=2; j++) printf "|%s|%s", TYP[j,1], TYP[j,2]
                 print ""
                 split ("", TYP)
                }
'  FS=","  file[45] FS="|" OFS="|" file3
1 Like

Thanks RudiC. Adding new fields at the end will work too.
But, I am not getting desired output for below input -

$ cat RECORD
SOURCE|LOGIN|SERVER|ABC_123_YYYYMMDD_V01P1234YYMMDDHHMISS.txt|ABC_123_V01P.YYYYMMDD.HHMISS.txt|CATEGORY
SOURCE|LOGIN|SERVER|ABC_123_YYYYMMDD_V01P1234YYMMDD.HHMISS.YYYY-MM-DD.txt|ABC_123_V01P.YYYYMMDD.HHMISS.txt|CATEGORY
SOURCE|LOGIN|SERVER|ABC_123_YYYYMMDD_V01P1234YYMMDD.HHMISS.YYYY-MM-DD.XXXXXXXXXX.txt|ABC_123_V01P.YYYY-MM-DD.YYYYMMDD.HHMISS.XXXXXXXXXX.txt|CATEGORY
SOURCE|LOGIN|SERVER|P_XYZ_1234_ABCD.YYYYMMDD.HHMISS.csv|Report.ABCD_csv.YYYYMMDD.HHMISS.csv|CATEGORY
$

Getting below output -

SOURCE|LOGIN|SERVER|ABC_123_YYYYMMDD_V01P1234YYMMDDHHMISS.txt|ABC_123_V01P.YYYYMMDD.HHMISS.txt|CATEGORY|9|8|YYYYMMDD|........|%Y%m%d|26|6|YYMMDD|......|%y%m%d|32|6|HHMISS|......|%H%M%S|14|8|YYYYMMDD|........|%Y%m%d|23|6|HHMISS|......|%H%M%S|2|1|1|1
SOURCE|LOGIN|SERVER|ABC_123_YYYYMMDD_V01P1234YYMMDD.HHMISS.YYYY-MM-DD.txt|ABC_123_V01P.YYYYMMDD.HHMISS.txt|CATEGORY|9|8|YYYYMMDD|........|%Y%m%d|40|10|YYYY-MM-DD|....-..-..|%Y-%m-%d|14|8|YYYYMMDD|........|%Y%m%d|23|6|HHMISS|......|%H%M%S|2||1|1
SOURCE|LOGIN|SERVER|ABC_123_YYYYMMDD_V01P1234YYMMDD.HHMISS.YYYY-MM-DD.XXXXXXXXXX.txt|ABC_123_V01P.YYYY-MM-DD.YYYYMMDD.HHMISS.XXXXXXXXXX.txt|CATEGORY|9|8|YYYYMMDD|........|%Y%m%d|40|10|YYYY-MM-DD|....-..-..|%Y-%m-%d|51|10|XXXXXXXXXX|..........|%s|25|8|YYYYMMDD|........|%Y%m%d|41|10|XXXXXXXXXX|..........|%s|2|1|1|1
SOURCE|LOGIN|SERVER|P_XYZ_1234_ABCD.YYYYMMDD.HHMISS.csv|Report.ABCD_csv.YYYYMMDD.HHMISS.csv|CATEGORY|17|8|YYYYMMDD|........|%Y%m%d|26|6|HHMISS|......|%H%M%S|17|8|YYYYMMDD|........|%Y%m%d|26|6|HHMISS|......|%H%M%S|1|1|1|1

Good grief! What a heck of a problem. We have a "race condidtion" of what pattern occurs first in the pattern file and what occurs first in the string. When found a pattern, and reduced the string, we need to start over with the first pattern again. I think I have found a solution; try and report back:

FNR == 1        {FILENR++
                }

FILENR <= 2     {CHFM = CHFM DL $1
                 RGEX[$1] = $2
                 DOTS[$1] = $3
                 FMTS[$1] = $4
                 FIDX[$1] = FILENR
                 DL = "|"
                 next
                }

                {printf "%s", $0
                 for (j=4; j<=5; j++)   {TMP = $j
                                         PTR = 0
                                         while (match (TMP, CHFM))      {MTCH = substr (TMP, RSTART, RLENGTH)
                                                                         printf "|%s|%s|%s|%s|%s", PTR + RSTART, RLENGTH, MTCH, DOTS[MTCH], FMTS[MTCH]
                                                                         TMP = substr (TMP, RSTART+RLENGTH)
                                                                         PTR += RSTART + RLENGTH - 1
                                                                         TYP[j-3,FIDX[MTCH]]++
                                                                        }
                                        }
                 for (j=1; j<=2; j++) printf "|%s|%s", TYP[j,1]+0, TYP[j,2]+0
                 print ""
                 split ("", TYP)
                }
'  FS=","  file[45] FS="|" OFS="|" file3
SOURCE|LOGIN|SERVER|ABC_123_YYYYMMDD_V01P1234YYMMDDHHMISS.txt|ABC_123_V01P.YYYYMMDD.HHMISS.txt|CATEGORY|9|8|YYYYMMDD|........|%Y%m%d|26|6|YYMMDD|......|%y%m%d|32|6|HHMISS|......|%H%M%S|14|8|YYYYMMDD|........|%Y%m%d|23|6|HHMISS|......|%H%M%S|2|1|1|1
SOURCE|LOGIN|SERVER|ABC_123_YYYYMMDD_V01P1234YYMMDD.HHMISS.YYYY-MM-DD.txt|ABC_123_V01P.YYYYMMDD.HHMISS.txt|CATEGORY|9|8|YYYYMMDD|........|%Y%m%d|26|6|YYMMDD|......|%y%m%d|33|6|HHMISS|......|%H%M%S|40|10|YYYY-MM-DD|....-..-..|%Y-%m-%d|14|8|YYYYMMDD|........|%Y%m%d|23|6|HHMISS|......|%H%M%S|3|1|1|1
SOURCE|LOGIN|SERVER|ABC_123_YYYYMMDD_V01P1234YYMMDD.HHMISS.YYYY-MM-DD.XXXXXXXXXX.txt|ABC_123_V01P.YYYY-MM-DD.YYYYMMDD.HHMISS.XXXXXXXXXX.txt|CATEGORY|9|8|YYYYMMDD|........|%Y%m%d|26|6|YYMMDD|......|%y%m%d|33|6|HHMISS|......|%H%M%S|40|10|YYYY-MM-DD|....-..-..|%Y-%m-%d|51|10|XXXXXXXXXX|..........|%s|14|10|YYYY-MM-DD|....-..-..|%Y-%m-%d|25|8|YYYYMMDD|........|%Y%m%d|34|6|HHMISS|......|%H%M%S|41|10|XXXXXXXXXX|..........|%s|3|2|2|2
SOURCE|LOGIN|SERVER|P_XYZ_1234_ABCD.YYYYMMDD.HHMISS.csv|Report.ABCD_csv.YYYYMMDD.HHMISS.csv|CATEGORY|17|8|YYYYMMDD|........|%Y%m%d|26|6|HHMISS|......|%H%M%S|17|8|YYYYMMDD|........|%Y%m%d|26|6|HHMISS|......|%H%M%S|1|1|1|1
2 Likes

Thank you so much RudiC for your valuable time and help. Solution works perfectly fine.