Concatenating lines based on the first field of the lines

Hello All,

This is to request some assistance with an issue that I encountered recently.
The problem is:
I have a pipe-delimited file in which some lines/records are broken. I have to join the broken lines in the file to form the actual records, so that the record count stays the same before and after the file is processed.

Sample data looks like this:
113321|107|E|1|828|20|4032832|EL POETA|VILLALOBOS MIJARES PABLO
NEPTALI RICARDO ELIECER
SABREZ|CA|2000|10000|10600|201407201412
113321|107|E|1|828|20|3924814|ME HACE TANTO BIEN|GUERRERO DE LA PENA
MUNOZ CARLOS ISSAC|CA|1666|10000|8800|201407201412
113321|107|E|1|828|20|4055313|PEPE|ALVAREZ GONZALEZ
ANDERSON MIGUEL|CA|2500|10000|13200|201407201412
113321|107|E|1|828|20|4034084|SIN TI|VILLALOBOS MIJARES PABLO
NEPTALI RICARDO ELIECER|CA|1000|10000|5300|201407201412
Expected output would be like this:
113321|107|E|1|828|20|4032832|EL POETA|VILLALOBOS MIJARES PABLO NEPTALI RICARDO ELIECER SABREZ|CA|2000|10000|10600|201407201412
113321|107|E|1|828|20|3924814|ME HACE TANTO BIEN|GUERRERO DE LA PENA MUNOZ CARLOS ISSAC|CA|1666|10000|8800|201407201412
113321|107|E|1|828|20|4055313|PEPE|ALVAREZ GONZALEZ ANDERSON MIGUEL|CA|2500|10000|13200|201407201412
113321|107|E|1|828|20|4034084|SIN TI|VILLALOBOS MIJARES PABLO NEPTALI RICARDO ELIECER|CA|1000|10000|5300|201407201412
Code that I have tried so far:
awk -v var="$pattern" '/var"\n"/{printf "\n" $0;next}{printf $0}' file.txt
$pattern is a variable that I am passing as 113321.

Any assistance would be greatly appreciated.

Hello svks1985,

Could you please try the following and let me know if it helps.

awk '{printf("%s%s",($0 ~ /^[[:digit:]]/ && NR>1)?RS:((NR>1)?FS:""),$0)} END{print X}'  Input_file

Output will be as follows.

113321|107|E|1|828|20|4032832|EL POETA|VILLALOBOS MIJARES PABLO NEPTALI RICARDO ELIECER SABREZ|CA|2000|10000|10600|201407201412
113321|107|E|1|828|20|3924814|ME HACE TANTO BIEN|GUERRERO DE LA PENA MUNOZ CARLOS ISSAC|CA|1666|10000|8800|201407201412
113321|107|E|1|828|20|4055313|PEPE|ALVAREZ GONZALEZ ANDERSON MIGUEL|CA|2500|10000|13200|201407201412
113321|107|E|1|828|20|4034084|SIN TI|VILLALOBOS MIJARES PABLO NEPTALI RICARDO ELIECER|CA|1000|10000|5300|201407201412

NOTE: This assumes your actual data has the same layout as the sample data shown.

Thanks,
R. Singh

Hello RavinderSingh13

Thanks very much for the response!
The solution you provided certainly worked. However, the data could be different: the leading number (113321) is the same for every record within a file, but another file could have a different number (say 123456), again the same for all of its records. In other words, an occurrence of 123456, or of 113321 in the case cited, marks the start of a new record.

Also, I would really appreciate it if you could explain your code.

Hello svks1985,

The code above should work for any digits at the start of a line. The following breakdown may help, but it is for explanation only; to run the command, use the one-line form from the previous post.

awk '{printf("%s%s"                 #### printf (an awk built-in) prints two strings: a separator, then the line.
,($0 ~ /^[[:digit:]]/ && NR>1)      #### Condition: the line starts with a digit and is not the first line.
?                                   #### ? introduces the value used when the condition above is TRUE.
RS                                  #### RS (record separator), a newline by default, so a new record starts.
:                                   #### : introduces the value used when the condition is NOT TRUE.
        ((NR>1)                     #### Second condition: is the current line number (NR) greater than 1?
        ?                           #### ? introduces the value used when this second condition is TRUE.
        FS                          #### FS (field separator), a single space by default, joins a continuation line.
        :                           #### : introduces the value used when this second condition is NOT TRUE.
        ""),                        #### An empty string, so nothing is printed before the very first line.
$0)}                                #### $0 is the complete current line.
END{                                #### The END section runs after all input has been read.
print X}'                           #### X is an unset (empty) variable, so this prints only the final newline.
Input_file                          #### The name of the input file.
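
For reference, the same logic can also be written in a multi-line form that runs as is, with if/else in place of the nested ternary. This is only a sketch of an equivalent script, not the one-liner from the previous post; the wrapper name join_records is made up for illustration.

```shell
# join_records: hypothetical wrapper name; $1 is the input file.
join_records() {
    awk '
    {
        if (NR == 1)
            sep = ""        # first line: nothing is printed before it
        else if ($0 ~ /^[[:digit:]]/)
            sep = RS        # line starts with a digit: begin a new record
        else
            sep = FS        # continuation line: join with a single space
        printf("%s%s", sep, $0)
    }
    END { print "" }        # terminate the last record with a newline
    ' "$1"
}
```

It would be called as: join_records Input_file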
 

Thanks,
R. Singh

$ awk -F\| '$1~/^[0-9]/{printf("%s%s",(NR>1?"\n":""),$0);next}{printf(" %s",$0)}END{print ""}' input.txt

113321|107|E|1|828|20|4032832|EL POETA|VILLALOBOS MIJARES PABLO NEPTALI RICARDO ELIECER SABREZ|CA|2000|10000|10600|201407201412
113321|107|E|1|828|20|3924814|ME HACE TANTO BIEN|GUERRERO DE LA PENA MUNOZ CARLOS ISSAC|CA|1666|10000|8800|201407201412
113321|107|E|1|828|20|4055313|PEPE|ALVAREZ GONZALEZ ANDERSON MIGUEL|CA|2500|10000|13200|201407201412
113321|107|E|1|828|20|4034084|SIN TI|VILLALOBOS MIJARES PABLO NEPTALI RICARDO ELIECER|CA|1000|10000|5300|201407201412

Provided there are 14 fields and there is no line break in the last field, try:

awk -F\| '{while(NF<14 && (getline n)>0) $0=$0 OFS n}1' file
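
A slightly more general sketch of the same idea takes the expected field count as a parameter instead of hard-coding 14. The names join_by_nf, nf and nxt are illustrative only, and the same caveat applies: a line break inside the last field is not detected.

```shell
# join_by_nf: hypothetical helper name.
# $1 = expected number of fields per record, $2 = input file
join_by_nf() {
    awk -F'|' -v nf="$1" '
    {
        # Keep appending the next physical line, separated by a space,
        # until the record has at least nf fields or the input runs out.
        # Assigning to $0 makes awk re-split the record and update NF.
        while (NF < nf && (getline nxt) > 0)
            $0 = $0 " " nxt
        print
    }' "$2"
}
```

For the sample data it would be called as: join_by_nf 14 file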
perl -pe 's/(?<!\d)\n/ /' file.txt

Nice angle Aia. That should work provided that no field is broken right after a digit and that the last character of the last field is a digit.

Given what was stated in post #3, you might also try something like:

awk '
BEGIN {	FS = OFS = "|"
}
NR == 1 {
	num = $1
	printf("%s", $0)
	next
}
{	printf("%s%s", $1 == num ? "\n" : " ", $0)
}
END {	print ""
}' file.txt

which captures the 1st field from the 1st line of the input and joins the following lines with a <space> separator until it finds another line with the same first field.

As always, if you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk .
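
If the record key is already known in the shell, as with the $pattern variable in post #1, a variant of the script above could take it through -v instead of reading it from the first line. This is a sketch under that assumption; join_on_key and key are illustrative names. Note that an awk variable cannot be used inside /.../, because the text between the slashes is matched literally, so the comparison is made against $1 directly.

```shell
# join_on_key: hypothetical helper name.
# $1 = value of the first field that starts every record, $2 = input file
join_on_key() {
    awk -F'|' -v key="$1" '
    # A line whose first field equals key starts a new record; any other
    # line is a continuation and is appended after a single space.
    { printf("%s%s", NR == 1 ? "" : ($1 == key ? "\n" : " "), $0) }
    END { if (NR) print "" }    # final newline, only if there was input
    ' "$2"
}
```

It would be called as: join_on_key "$pattern" file.txt
A continuation line that happens to begin with the key followed by | would still be taken as the start of a new record.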
