Help with complex merge of files with a common field

Please help, I am new to shell programming. I have three files, each containing a unique text (key) field (e.g. ABCDEF, XCDUD as shown below) on its own line, followed by data, of which there can be more than one instance. In addition, in some cases there may be no data but only a key field. Please see the example below:

File A contains:
ABCDEF ----> Key
DataA-1 ---> Data
DataA-2 ---> Data
DataA-3 ---> Data
XCDUD -----> Key
DataA-1 ------> Data
UUUUA -----> Key
DataA-1 ------> Data

File B contains:
ABCDEF
DataB-1
DataC-1
XCDUD
DataB-1
UUUUA

File C contains:
ABCDEF
DataC-1
XCDUD
UUUUA
DataC-1

I want to merge these files by the unique key; I am only interested in the merged data separated by line return as shown below:

ABCDEF
DataA-1
DataB-1
DataC-1
DataA-2
DataA-3

XCDUD
DataA-1
DataB-1

UUUUA
DataA-1
DataC-1

Is it possible to script this? If so, how?

How can we distinguish between key and data (is "Data..." the real pattern)?
Do you need the output sorted as in your example, or does the order not matter?

The key is a unique field, I have egreped the original data. It will be as below:

<_05_1:MessageIdentifier>ERR:38736086_1215781057901</_05_1:MessageIdentifier>
<Error:Exception> Error was 121238123... </Error:Exception>
<_05_1:MessageIdentifier>ERR:38736086_1215781057901</_05_1:MessageIdentifier>
<Error:Exception> Error was 4554641..... </Error:Exception>
<_05_1:MessageIdentifier>ERR:38736086_1215781057901</_05_1:MessageIdentifier>
<Error:Exception> Error was 1277123.... </Error:Exception>


<_05_1:MessageIdentifier>ERR:38736086_1215781057901</_05_1:MessageIdentifier>
<06:Detail> Code XYZ... </06:Detail>
<_05_1:MessageIdentifier>ERR:38736086_1215781057901</_05_1:MessageIdentifier>
<_05_1:MessageIdentifier>ERR:38736086_1215781057901</_05_1:MessageIdentifier>
<06:Detail> Code ABC... </06:Detail>

<_05_1:MessageIdentifier>ERR:38736086_1215781057901</_05_1:MessageIdentifier>
<DataPosted> Data....... </DataPosted>
<DataPosted> Data....... </DataPosted>
<DataPosted> Data....... </DataPosted>
<_05_1:MessageIdentifier>ERR:38736086_1215781057901</_05_1:MessageIdentifier>
<DataPosted> Data....... </DataPosted>
<_05_1:MessageIdentifier>ERR:38736086_1215781057901</_05_1:MessageIdentifier>
<DataPosted> Data....... </DataPosted>
<DataPosted> Data....... </DataPosted>
<DataPosted> Data....... </DataPosted>
<DataPosted> Data....... </DataPosted>
<DataPosted> Data....... </DataPosted>
<DataPosted> Data....... </DataPosted>

As I have shown, in some cases there will only be the key and no data to report.

And keys/data in the above sample are?

Sorry, I don't know what happened in the previous post.

The key is a unique field, I have egreped the original data. It will be as below:

<_05_1:MessageIdentifier>ERR:38736086_1215781057901</_05_1:MessageIdentifier>
<Error:Exception> Error was 121238123... </Error:Exception>
<_05_1:MessageIdentifier>ERR:38736086_1215781057901</_05_1:MessageIdentifier>
<Error:Exception> Error was 4554641..... </Error:Exception>
<_05_1:MessageIdentifier>ERR:38736086_1215781057901</_05_1:MessageIdentifier>
<Error:Exception> Error was 1277123.... </Error:Exception>


<_05_1:MessageIdentifier>ERR:38736086_1215781057901</_05_1:MessageIdentifier>
<06:Detail> Code XYZ... </06:Detail>
<_05_1:MessageIdentifier>ERR:38736086_1215781057901</_05_1:MessageIdentifier>
<_05_1:MessageIdentifier>ERR:38736086_1215781057901</_05_1:MessageIdentifier>
<06:Detail> Code ABC... </06:Detail>
<06:Detail> Code AAA... </06:Detail>


<_05_1:MessageIdentifier>ERR:38736086_1215781057901</_05_1:MessageIdentifier>
<DataPosted> Data....... </DataPosted>
<DataPosted> Data....... </DataPosted>
<DataPosted> Data....... </DataPosted>
<_05_1:MessageIdentifier>ERR:38736086_1215781057901</_05_1:MessageIdentifier>
<DataPosted> Data....... </DataPosted>
<_05_1:MessageIdentifier>ERR:38736086_1215781057901</_05_1:MessageIdentifier>
<DataPosted> Data....... </DataPosted>
<DataPosted> Data....... </DataPosted>
<DataPosted> Data....... </DataPosted>
<DataPosted> Data....... </DataPosted>
<DataPosted> Data....... </DataPosted>
<DataPosted> Data....... </DataPosted>

As I have shown, in some cases there will only be the key and no data to report.

See my previous post (from the example I cannot understand what you consider a key and what data).

Sorry, the example was incorrect.

Basically, what I am calling the key is the field <_05_1:MessageIdentifier>ERR:38736086_1215781057901</_05_1:MessageIdentifier>, as the number within it is always unique, e.g. 1215781057901, 1215781057902, 1215781057903 and so on. In each file the data is placed after the key. Each file contains one type of data, so I am trying to report on the data by key.

Originally, I have one file that contains all the data. I egrep <_05_1:MessageIdentifier> and <Error:Exception> into one file, <_05_1:MessageIdentifier> and <06:Detail> into another, and finally <_05_1:MessageIdentifier> and <DataPosted> into a third. I am doing this because I am going to use cut on the data to get what we want before I merge the files. If there is a way of egrepping all the fields and cutting each piece of data, that would sort my problem in one go.

<_05_1:MessageIdentifier>ERR:38736086_1215781057901</_05_1:MessageIdentifier>
<Error:Exception> Error was 121238123... </Error:Exception>
<_05_1:MessageIdentifier>ERR:38736086_1215781057903</_05_1:MessageIdentifier>
<Error:Exception> Error was 4554641..... </Error:Exception>
<_05_1:MessageIdentifier>ERR:38736086_1215781057905</_05_1:MessageIdentifier>
<Error:Exception> Error was 1277123.... </Error:Exception>

<_05_1:MessageIdentifier>ERR:38736086_1215781057901</_05_1:MessageIdentifier>
<06:Detail> Code XYZ... </06:Detail>
<_05_1:MessageIdentifier>ERR:38736086_1215781057903</_05_1:MessageIdentifier>
<_05_1:MessageIdentifier>ERR:38736086_1215781057905</_05_1:MessageIdentifier>
<06:Detail> Code ABC... </06:Detail>
<06:Detail> Code AAA... </06:Detail>

<_05_1:MessageIdentifier>ERR:38736086_1215781057901</_05_1:MessageIdentifier>
<DataPosted> Data....... </DataPosted>
<DataPosted> Data....... </DataPosted>
<DataPosted> Data....... </DataPosted>
<_05_1:MessageIdentifier>ERR:38736086_1215781057903</_05_1:MessageIdentifier>
<DataPosted> Data....... </DataPosted>
<_05_1:MessageIdentifier>ERR:38736086_1215781057905</_05_1:MessageIdentifier>
<DataPosted> Data....... </DataPosted>
<DataPosted> Data....... </DataPosted>
<DataPosted> Data....... </DataPosted>
<DataPosted> Data....... </DataPosted>
<DataPosted> Data....... </DataPosted>
<DataPosted> Data....... </DataPosted>

Something like this?

perl -ne'
  $key = $1 and next if /ERR:\d+_(\d+)/;
  $data{$key} = $data{$key} ?
    $data{$key} . "\n" . $1 :
      $1 if m|DataPosted>(.+?)</Data|;
  END {
    print map { $_ . "\n" . $data{$_} . "\n" } keys %data;
  }' logfile

Radoulov,

Thank you for persevering with my query, I really appreciate it. As I am new to shell scripting, could you please give me some idea of what each line is doing? I have some idea, but I do not completely understand the code. Where are the files to process specified?

Well,
try to execute it first passing your files as arguments:
(just copy/paste it on the command line)

perl -ne'
  $key = $1 and next if /ERR:\d+_(\d+)/;
  $data{$key} = $data{$key} ?
    $data{$key} . "\n" . $1 :
      $1 if m|DataPosted>(.+?)</Data|;
  END {
    print map { $_ . "\n" . $data{$_} . "\n" } keys %data;
  }' fileA fileB fileC ... 

I'm guessing here: I'm not sure whether you want only the text inside the <DataPosted> tags, or whether it can span multiple lines.

Radoulov,

I need the data from all the files. For example, for the first key, I would want:

<_05_1:MessageIdentifier>ERR:38736086_1215781057901</_05_1:MessageIdentifier>
<Error:Exception> Error was 121238123... </Error:Exception>
<06:Detail> Code XYZ... </06:Detail>
<DataPosted> Data....... </DataPosted>
<DataPosted> Data....... </DataPosted>
<DataPosted> Data....... </DataPosted>

and so on

The code works and it does give me the data portion. Can it be expanded to give me the data from the other files too for each key?

1215781057901
Data.......
Data.......
Data.......
1215781057903
Data.......
1215781057905
Data.......
Data.......
Data.......
Data.......
Data.......
Data.......

Yep:

perl -ne'
  $key = $1 and next if /ERR:\d+_(\d+)/;
  $data{$key} = $data{$key} ?
    $data{$key} . "\n" . $1 :
      $1 if m|>(.+?)</|;
  END {
    print map { $_ . "\n" . $data{$_} . "\n" } keys %data;
  }' fileA fileB fileC ...

The new code gives the same result as before:

Code executed:
$ perl -ne'
$key = $1 and next if /ERR:\d+_(\d+)/;
$data{$key} = $data{$key} ?
$data{$key} . "\n" . $1 :
$1 if m|DataPosted>(.+?)</Data|;
END {
print map { $_ . "\n" . $data{$_} . "\n" } keys %data;
}' G01.txt G02.txt G03.txt

Results:
1215781057901
Data.......
Data.......
Data.......
1215781057903
Data.......
1215781057905
Data.......
Data.......
Data.......
Data.......
Data.......
Data.......

G01.txt contents:
<_05_1:MessageIdentifier>ERR:38736086_1215781057901</_05_1:MessageIdentifier>
<Error:Exception> Error was 121238123... </Error:Exception>
<_05_1:MessageIdentifier>ERR:38736086_1215781057903</_05_1:MessageIdentifier>
<Error:Exception> Error was 4554641..... </Error:Exception>
<_05_1:MessageIdentifier>ERR:38736086_1215781057905</_05_1:MessageIdentifier>
<Error:Exception> Error was 1277123.... </Error:Exception>

G02.txt contents:
<_05_1:MessageIdentifier>ERR:38736086_1215781057901</_05_1:MessageIdentifier>
<06:Detail> Code XYZ... </06:Detail>
<_05_1:MessageIdentifier>ERR:38736086_1215781057903</_05_1:MessageIdentifier>
<_05_1:MessageIdentifier>ERR:38736086_1215781057905</_05_1:MessageIdentifier>
<06:Detail> Code ABC... </06:Detail>
<06:Detail> Code AAA... </06:Detail>

G03.txt contents:
<_05_1:MessageIdentifier>ERR:38736086_1215781057901</_05_1:MessageIdentifier>
<DataPosted> Data....... </DataPosted>
<DataPosted> Data....... </DataPosted>
<DataPosted> Data....... </DataPosted>
<_05_1:MessageIdentifier>ERR:38736086_1215781057903</_05_1:MessageIdentifier>
<DataPosted> Data....... </DataPosted>
<_05_1:MessageIdentifier>ERR:38736086_1215781057905</_05_1:MessageIdentifier>
<DataPosted> Data....... </DataPosted>
<DataPosted> Data....... </DataPosted>
<DataPosted> Data....... </DataPosted>
<DataPosted> Data....... </DataPosted>
<DataPosted> Data....... </DataPosted>
<DataPosted> Data....... </DataPosted>

I modified the code after my first post (it was wrong);
use the new one:

m|DataPosted>(.+?)</Data|

is now:

 m|>(.+?)</|

Sorry, the 06:Detail is zero six followed by :Detail. The forum is changing it to an emoticon.

Radoulov, the code below works as expected. Brilliant, thank you. Could you please explain what each line is doing? I am new to this and it all seems a little bewildering.

perl -ne'
  $key = $1 and next if /ERR:\d+_(\d+)/;
  $data{$key} = $data{$key} ?
    $data{$key} . "\n" . $1 :
      $1 if m|>(.+?)</|;
  END {
    print map { $_ . "\n" . $data{$_} . "\n" } keys %data;
  }' G01.txt G02.txt G03.txt