Pattern Match and Rearrange the Fields in UNIX

I need to rearrange the fields of a line as shown below.

Input:

<Subject A="I" B="1039502" C="2015-06-30" D="010101010101">

Output:

<Subject D="010101010101" B="1039502" C="2015-06-30" A="I">

I have been using something like the command below, but I am not getting the desired output:

awk -F ' ' '/Subject/ BEGIN{OFS=" ";} {print $1,$5,$3,$4,$2}' input_File > output_file

Could someone please help?

I don't know how generic this must be but you can try this:

$ awk '{sub(/>/,"",$NF); print $1,$5,$3,$4,$2">"}' infile
<Subject D="010101010101" B="1039502" C="2015-06-30" A="I">

Hello arunkesi,

Could you please try the following and let me know if it helps? It assumes that you need to reverse all the space-separated fields except the leading <Subject string.

awk '{for(i=NF;i>1;i--){sub(/>/,X,$i);A=A?A OFS $i:$i};print "<Subject " A ">";A=""}'  Input_file

Output will be as follows.

<Subject D="010101010101" C="2015-06-30" B="1039502" A="I">

Thanks,
R. Singh

Hi Ravinder,

Unless the OP made an error, the requested order is not a plain reversal but "D B C A".

Thanks Zaxxon. I want to change only the lines that start with <Subject; all the other lines should be retained as in the original file.

Adapt Zaxxon's proposal

$ awk '/subject/ {sub(/>/,"",$NF); print $1,$5,$3,$4,$2">"; next} 1' infile 

I have used the command below, but I am only getting the records that match the pattern; the lines that don't match the pattern are omitted.

awk '/subject/ {sub(/>/,"",$NF); print $1,$5,$3,$4,$2">"; next} 1' infile > outfile

Please use code tags as required by forum rules!

Difficult to believe. The default action for the "1" pattern is "print the line in $0 unconditionally".

BTW, awk patterns are case-sensitive, so /subject/ does not match a line containing "<Subject"; the "subject" might be led in by an upper case "S".
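A minimal sketch of the adapted command with the case issue addressed, assuming the input looks like the sample line from the first post. The function name reorder_subject and the surrounding <Header> lines are made up for illustration.

```shell
# awk patterns are case-sensitive, so accept both "Subject" and "subject".
# Matching lines are rearranged; the trailing 1 prints every other line as-is.
reorder_subject() {
  awk '/^<[Ss]ubject/ {sub(/>/,"",$NF); print $1,$5,$3,$4,$2">"; next} 1'
}

# Hypothetical sample: one <Subject ...> line between two unrelated lines.
printf '%s\n' '<Header>' \
  '<Subject A="I" B="1039502" C="2015-06-30" D="010101010101">' \
  '</Header>' | reorder_subject
```

With this input, the <Header> and </Header> lines should come out unchanged and only the <Subject line should be rearranged.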

Thank you Zaxxon. I have tried to make it more generic. Maybe the following can help the OP.

awk --re-interval '{match($0,/[A-Z]=\"[0-9]{4}\-[0-9]{2}\-[0-9]{2}\"/);val3=substr($0,RSTART,RLENGTH)};{match($0,/[A-Z]=\"[0-1]{12}\"/);val2=substr($0,RSTART,RLENGTH);{match($0,/[A-Z]=\"[0-9]{7}\"/);val4=substr($0,RSTART,RLENGTH)};{match($0,/[A-Z]=\"[a-zA-Z]+\"/);val5=substr($0,RSTART,RLENGTH);print "<Subject " val2 OFS val4 OFS val3 OFS val5 ">"}}' Input_file

Output will be as follows.

<Subject D="010101010101" B="1039502" C="2015-06-30" A="I">

The solution above assumes that each line of Input_file contains four attributes of the following forms: [A-Z]="letters" (one or more letters), [A-Z]="7 digits", [A-Z]="YYYY-MM-DD" (4 digits, 2 digits and 2 digits separated by hyphens), and [A-Z]="12 digits, each 0 or 1". If every line of Input_file follows this syntax, the above solution may help the OP.

EDIT: Adding a non one-liner form of solution as follows.

awk --re-interval '{
    match($0, /[A-Z]="[0-9]{4}-[0-9]{2}-[0-9]{2}"/)   # date, YYYY-MM-DD
    val3 = substr($0, RSTART, RLENGTH)
    match($0, /[A-Z]="[0-1]{12}"/)                    # 12 digits, each 0 or 1
    val2 = substr($0, RSTART, RLENGTH)
    match($0, /[A-Z]="[0-9]{7}"/)                     # exactly 7 digits
    val4 = substr($0, RSTART, RLENGTH)
    match($0, /[A-Z]="[a-zA-Z]+"/)                    # letters only
    val5 = substr($0, RSTART, RLENGTH)
    print "<Subject " val2 OFS val4 OFS val3 OFS val5 ">"
}' Input_file
 

Thanks,
R. Singh

I've come up with this using (GNU) sed.
I have just started to learn the command and would gladly appreciate input on efficiency and style!

sed -n -r 's/(<Subject )(.*)(.* )(.*)(.* )(.*)(.* )(.*)>/\1\8 \6 \4 \2>/p' testfile

<Subject D="010101010101" C="2015-06-30" B="1039502" A="I">

Thanks

.* stands for "any string", so .*.* is any string followed by any string, which matches the same text as a single .* . Your script can therefore be shortened to

sed -n -r 's/(<Subject )(.*) (.*) (.*) (.*)>/\1\5 \4 \3 \2>/p' file

Note that it does not print the unmodified, i.e. unmatched, lines, which the OP requested.


The command:

sed -n -r 's/(<Subject )(.*)(.* )(.*)(.* )(.*)(.* )(.*)>/\1\8 \6 \4 \2>/p' testfile

should work because of greedy matching forcing the 1st .* to grab everything that the following .* could also grab. Stylistically, I generally avoid putting expressions in parentheses when I don't need the string matched by that expression in the replacement. With that in mind consider this simplification:

sed -n -r 's/(<Subject )(.*) (.*) (.*) (.*)>/\1\5 \4 \3 \2>/p' testfile

But, the requested output was:

<Subject D="010101010101" B="1039502" C="2015-06-30" A="I">

while the above sed scripts produce the output:

<Subject D="010101010101" C="2015-06-30" B="1039502" A="I">

The requested output could be achieved by rearranging the replacement string references:

sed -n -r 's/(<Subject )(.*) (.*) (.*) (.*)>/\1\5 \3 \4 \2>/p' testfile

And, of course, to print the lines that don't match the search pattern without changing them, get rid of the -n option and the p flag (with p but without -n, each matching line would be printed twice):

sed -r 's/(<Subject )(.*) (.*) (.*) (.*)>/\1\5 \3 \4 \2>/' testfile

And, if you don't have GNU sed (with the -r option), it can be done with standard basic regular expressions with:

sed 's/\(<Subject \)\(.*\) \(.*\) \(.*\) \(.*\)>/\1\5 \3 \4 \2>/' testfile
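As a quick check of the pass-through behavior, here is a small sketch using basic regular expressions with neither the -n option nor the p flag, so every input line is printed exactly once. The function name rearrange_sed and the two surrounding lines are made up for illustration.

```shell
# Rearrange A B C D into D B C A on <Subject ...> lines.
# No -n option and no p flag: each line is printed once, changed or not.
rearrange_sed() {
  sed 's/\(<Subject \)\(.*\) \(.*\) \(.*\) \(.*\)>/\1\5 \3 \4 \2>/'
}

printf '%s\n' 'first line' \
  '<Subject A="I" B="1039502" C="2015-06-30" D="010101010101">' \
  'last line' | rearrange_sed
```

This relies on the line having exactly four space-separated attributes; with more attributes the greedy first .* would swallow the extra ones.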

Input

<Subject Q="I" W="1039502" E="2015-06-30" R="010101010101">

Output

<Subject R="010101010101" W="1039502" E="2015-06-30" Q="I">

Code

awk '/Subject/ {sub(/>/,"",$NF); print $1,$5,$3,$4,$2">"; next} 1' infile > outfile

Thanks for all the help, this seems to be working fine. But can we tweak this awk one-liner so that it handles all the cases in the input? That is, Q W E R do not have a fixed order in the input (they can be in any position), but we need to search for them and write them to the output in R W E Q order.

Thanks so much for all your help thus far

Try:

awk '
  /Subject/ {
    split($0,F,/(^|=)[^ ]*( |$)/)
    for(i in F) P[F[i]]=i
    sub(/>/,x)
    print $1,$P["R"],$P["W"],$P["E"],$P["Q"] ">"
    next
  }
  1
' file

Or using the order as a variable:

awk -v order="R W E Q" '
  BEGIN{
    split(order,O)
  } 
  /Subject/ {
    split($0,F,/(^|=)[^ ]*( |$)/)
    for(i in F) P[F[i]]=i
    sub(/>/,x)
    print $1,$P[O[1]],$P[O[2]],$P[O[3]],$P[O[4]] ">"
    next
  }
  1
' file

For the latest fixed order you have specified you could also try:

awk '
/^<[Ss]ubject/ {
	sub(/>$/, "", $NF)
	for(i = 2; i <= NF; i++)
		d[substr($i, 1, 1)] = $i
	$0 = $1 OFS d["R"] OFS d["W"] OFS d["E"] OFS d["Q"] ">"
}
1' infile > outfile
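The two ideas above (looking attributes up by name, and passing the order in as a variable) can be combined. The following is only a sketch under a few assumptions: attribute values contain no spaces, every <Subject line carries all the requested attributes, and the function name reorder_by_name is made up for illustration.

```shell
# $1: space-separated attribute names in the desired output order.
reorder_by_name() {
  awk -v order="$1" '
    BEGIN { n = split(order, O, " ") }
    /^<[Ss]ubject/ {
      sub(/>$/, "", $NF)          # drop the closing ">"
      split("", d)                # clear values left over from earlier lines
      for (i = 2; i <= NF; i++) {
        split($i, kv, "=")        # kv[1] is the attribute name
        d[kv[1]] = $i
      }
      out = $1
      for (j = 1; j <= n; j++) out = out OFS d[O[j]]
      print out ">"
      next
    }
    1'
}

printf '%s\n' '<Subject E="2015-06-30" Q="I" R="010101010101" W="1039502">' |
  reorder_by_name "R W E Q"
```

Because the name is taken from everything before the "=" sign, this variant is not limited to single-letter attribute names.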

Hi, thanks for your input, but I seem to be getting an awk field error when the pattern matches.

You have suggestions from me and from Scrutinizer since your post #13 in this thread. Which awk script of those three awk scripts is giving you an error?

What EXACTLY is the error?

What EXACTLY was the input line awk was processing when it reported the error?

Hi Zaxxon,
in code

awk '{sub(/>/,"",$NF); print $1,$5,$3,$4,$2">"}' infile
<Subject D="010101010101" B="1039502" C="2015-06-30" A="I">

Please explain the sub(/>/,"",$NF) part.
Thanks,

The awk command sub(/>/, "", $NF) changes (or substitutes) the string matching the extended regular expression > (which matches a literal greater than sign) to an empty string (specified by "" ), in the last field (specified by $NF ) on the line. Or, simply stated, it removes the trailing greater than sign from the end of that input line.
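To see the effect in isolation, here is a tiny sketch that prints only the last field after the substitution. The function name strip_gt and the shortened sample line are made up for illustration.

```shell
# Remove the first ">" found in the last field, then print that field alone.
strip_gt() {
  awk '{ sub(/>/, "", $NF); print $NF }'
}

printf '%s\n' '<Subject A="I" D="010101010101">' | strip_gt
```

For this sample line the last field is D="010101010101"> and the command should print it without the trailing greater than sign.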

Thanks Mr Don.