Joining fixed width files

shash · January 31, 2018, 3:29pm

Hi All,

I need to join fixed width files on a column which is position 1 to 3 and need to have all the records from file1

file1.txt

Cu1nullL1L2
Cu2nullL1L2
Cu3nullL1L2

file2.txt

Cu1B1B2
Cu3B1B2

output.txt

Cu1L1B1L2B2
Cu2L1L2
Cu3L1B1L2B3

I tried but not getting the expected resuls.

Any inputs please.

Thanks
Shashi

Don_Cragun · January 31, 2018, 3:52pm

What did you try?

What operating system are you using?

What shell are you using?

Is output.txt supposed to be a fixed-width file too? If so, what fill character is supposed to be used to fill the "empty" space at the end of the Cu2 record?

shash · January 31, 2018, 4:09pm

Hi Don,

I tried to print but unable to join the files and getting error.

awk '{print substr($0,1,3),substr($0,8,2),substr($0,10,2)}' file1.txt
awk '{print substr($0,1,3),substr($0,4,2),substr($0,6,2)}' file2.txt

I 'm using bash.
output.txt should be fixed width file and if there is no matching record then it should have white spaces at the end.

Thanks
Shash

RudiC · January 31, 2018, 5:09pm

How about - for exactly your sample files given in post#1 -

awk '
                {$0 = substr($0,1,3) " " substr($0,length-3,2) " " substr($0,length-1,2)
                }

NR == FNR       {T1[$1] = $2
                 T2[$1] = $3
                 next
                }

                {$0 = $1 $2 T1[$1] $3 T2[$1]
                 $0 = sprintf ("%s%*s", $0, 11-length, "")
                }

1

' file2 file1
Cu1L1B1L2B2
Cu2L1L2    
Cu3L1B1L2B2

Don_Cragun · January 31, 2018, 5:13pm

Hi Shash,
Your desired output doesn't make any sense to me. When you find lines in both files for a given key, the data from the fields in both files are intermixed. When data is only found in one file, why isn't the output supposed to have the data from each field in that input file in the output columns related to those input fields. In other words, with your sample input data, why isn't the output:

Cu1L1B1L2B2
Cu2L1  L2  
Cu3L1B1L2B2

instead of the output you said you want:

Cu1L1B1L2B2
Cu2L1L2    
Cu3L1B1L2B3

especially since there is no B3 anywhere in either of your sample input files?

To get the output shown above, you could use something like:

awk '
FNR == NR {
	data1[key = substr($0, 1, 3), 1] = substr($0, 8, 2)
	data1[key, 2] = substr($0, 10, 2)
	keys[key]
	next
}
{	data2[key = substr($0, 1, 3), 1] = substr($0, 4, 2)
	data2[key, 2] = substr($0, 6, 2)
	keys[key]
}
END {	for(key in keys)
		printf("%s%2.2s%2.2s%2.2s%2.2s\n", key, data1[key, 1],
		    data2[key, 1], data1[key, 2], data2[key, 2])
}' file1.txt file2.txt > output.txt

but note that the order of the output lines may vary. If the output order matters, you need to clearly state how the output order should be determined when:

file1.txt contains keys that do not appear in file2.txt (as in your example),
file2.txt contains keys that do not appear in file1.txt, and
both files contain keys that do not appear in the other file.

Note that I asked what operating system you're using and you didn't answer.

If you're using a Solaris/SunOS system and want to try the above code, change awk to /usr/xpg4/bin/awk or nawk .

The missing 2nd components of the 2nd data1[] and data2[] assignments have been fixed as noted by RudiC in post #5.

RudiC · January 31, 2018, 5:28pm

Hmmm - I'm afraid the second index is missing in the

and

assignments, or do I get this wrong?

Don_Cragun · January 31, 2018, 5:57pm

Yes, indeed. You got it right.

Post #5 has been fixed.

Thanks,
Don

rdrtx1 · January 31, 2018, 7:14pm

awk '
NR==FNR {c=0; for (i=4; i<=length; i+=2) a[substr($0,1,3),c++]=substr($0,i,2); next }
{
   sub(substr($0,1,7),substr($0,1,3));
   c=0;
   l=substr($0,1,3);
   for (i=4; i<=length; i+=2) l=l substr($0,i,2) a[substr($0,1,3),c++];
   print l;
}
' file2.txt file1.txt

shash · February 1, 2018, 4:56am

Thanks for all the inputs. It was of great help.

---------- Post updated at 04:56 AM ---------- Previous update was at 04:21 AM ----------

rdrtx1:

awk '
NR==FNR {c=0; for (i=4; i<=length; i+=2) a[substr($0,1,3),c++]=substr($0,i,2); next }
{
   sub(substr($0,1,7),substr($0,1,3));
   c=0;
   l=substr($0,1,3);
   for (i=4; i<=length; i+=2) l=l substr($0,i,2) a[substr($0,1,3),c++];
   print l;
}
' file2.txt file1.txt

Hi,

Can you please explain me about the code?

Thanks
Shash

rdrtx1 · February 1, 2018, 10:24am

awk '
NR==FNR {                                                                # for first file
   c=0;                                                                  # set column counter to 0
   for (i=4; i<=length; i+=2) a[substr($0,1,3),c++]=substr($0,i,2);      # load associative array with 2 character columns starting at column 4
   next;                                                                 # skip execution of script to next file
}
{                                                                        # for second file
   sub(substr($0,1,7),substr($0,1,3));                                   # remove columns 4 to 7
   c=0;                                                                  # initialize columns counter to 0
   l=substr($0,1,3);                                                     # set base name in output line (first 3 characters)
   for (i=4; i<=length; i+=2) l=l substr($0,i,2) a[substr($0,1,3),c++];  # concatenate 2 character columns starting at column 4 and associative array value
   print l;                                                              # print output line
}
' file2.txt file1.txt

Don_Cragun · February 1, 2018, 10:54am

Hi Shash,
You might note that the code rdrtx1 suggested will not produce any output for any lines in file2.txt unless there is also a line in file1.txt with the same first three characters on a line.

Note also that you said that the output file should be fixed width, but the code suggested by rdrtx1 produces lines that are 12 characters long (including the terminating <newline> character) for lines that are constructed from one line in each input file with a given 1st three characters, but produces 8 character long lines in the output file for lines that are constructed from a line in file1.txt that does not have a matching line in file2.txt .

Did you change your requirements from what you stated in post #3?

Cheers,
Don

shash · February 2, 2018, 4:13am

Hi Don,

rdrtx1 code helped me to only have the records which are in file1.txt which I forgot to specify and I wanted to understand the code.

Thanks
Shash

Don_Cragun · February 2, 2018, 1:02pm

Hi Shash,
With your newly stated requirements, the code RudiC suggested in post #4 does exactly as you requested (all lines from file1.txt with added data from corresponding records in file2.txt when available and space-filled fixed width output).

The code I suggested doesn't meet your updated requirements.

The code rdrtx1 provided gives you two of your three requirements, but does not provide fixed width output.

I was just surprised you chose rdrtx1's suggestion since it is the only one that didn't meet your original requirement of fixed width output. (Note, however, that it deed produce the sample output you posted as your desired output in post #1 in this thread.)

Cheers,
Don