combine duplicate records

kshuser · October 26, 2009, 12:30pm

I have a .DAT file like below

23666483030000653-B94030001OLFXXX000000120081227
23797049900000654-E71060001OLFXXX000000220081227
23699281320000655 E71060002OLFXXX000000320081227
22885068900000652 B86860003OLFXXX592123320081227
22885068900000652 B86860003ODL-SP592123420081227
22885068900000652-B94030001ODL-CH592123520081227

I would like to combine duplicate records into a single record, with the extra fields appended at the end of that record (see the example below). In the example file above, the first field is the unique field. I would like my output to follow these rules:

If duplicate records exist (in this case 288506890 has 3 records), check only the records whose benefit type is ODL-SP or ODL-CH:
if ODL-SP exists, then get the amount at positions 34-40
if ODL-CH exists, then get the amount at positions 34-40

Then append those amounts to the final record for this number (288506890), as shown below. If no duplicate record exists, just write the line as is.

22885068900000652 B86860003OLFXXX592123320081227 5921234 5921235

ben_type=`echo $line|cut -c28-33`
(you get ODL-SP spouse, ODL-CH child)
amount=`echo $line|cut -c34-40`
(you get spouse=5921234, child=5921235)

23666483030000653-B94030001OLFXXX000000120081227
23797049900000654-E71060001OLFXXX000000220081227
23699281320000655 E71060002OLFXXX000000320081227
22885068900000652 B86860003OLFXXX592123320081227 5921234 5921235
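The cut positions quoted earlier in this post can be checked against one sample record. This is only a sketch: it uses the ODL-SP line from the sample data and the 1-based column ranges 2-10, 28-33 and 34-40 mentioned in the thread, and it should run in ksh or any POSIX shell.

```shell
# One record from the sample data; cut counts columns starting at 1.
line='22885068900000652 B86860003ODL-SP592123420081227'

no=$(printf '%s\n' "$line" | cut -c2-10)         # record number used to spot duplicates
ben_type=$(printf '%s\n' "$line" | cut -c28-33)  # benefit type (ODL-SP = spouse, ODL-CH = child)
amount=$(printf '%s\n' "$line" | cut -c34-40)    # amount to append to the first record

printf '%s %s %s\n' "$no" "$ben_type" "$amount"
```

For this record the three fields come out as 288506890, ODL-SP and 5921234.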

Can someone please help me with a solution using Unix ksh scripting? Thank you.

binlib · October 26, 2009, 2:07pm

With a name like kshuser and a request for a ksh-only solution, I assume you use ksh93.

while read x; do
  k=${x:0:17}
  if [ "$k" = "$ok" ]; then
    p="$p ${x:33:7}"
  else
    [ -n "$p" ] && echo "$p"
    ok=$k
    p=$x
  fi
done
echo "$p"

If you can add a blank line at the end of file (e.g.

(cat file;echo)

), you can omit the last echo outside the loop.
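A minimal sketch of that blank-line variant follows. To keep it runnable in any POSIX shell it uses cut where the loop above uses the ksh93 ${x:0:17} and ${x:33:7} expansions; the sample function is hypothetical and only stands in for the real file.

```shell
# Hypothetical stand-in for "cat file": prints the sample records.
sample() {
  cat <<'EOF'
23666483030000653-B94030001OLFXXX000000120081227
23797049900000654-E71060001OLFXXX000000220081227
23699281320000655 E71060002OLFXXX000000320081227
22885068900000652 B86860003OLFXXX592123320081227
22885068900000652 B86860003ODL-SP592123420081227
22885068900000652-B94030001ODL-CH592123520081227
EOF
}

ok=
p=
# The trailing "echo" adds one empty line; its empty key differs from the
# previous key, so the last buffered record is printed inside the loop.
result=$( { sample; echo; } | while IFS= read -r x; do
  k=$(printf '%s\n' "$x" | cut -c1-17)              # same as ${x:0:17} in ksh93
  if [ "$k" = "$ok" ]; then
    p="$p $(printf '%s\n' "$x" | cut -c34-40)"      # same as ${x:33:7} in ksh93
  else
    [ -n "$p" ] && printf '%s\n' "$p"
    ok=$k
    p=$x
  fi
done )

printf '%s\n' "$result"
```

With the real data, replace { sample; echo; } with (cat file;echo) as suggested above.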

kshuser · October 26, 2009, 2:19pm

I am kind of new to KSH scripting.

FILE1.DAT has the following records.

23666483030000653-B94030001ODL-Ch000000120081227
23797049900000654-E71060001OLFXXX000000220081227
23699281320000655 E71060002OLFXXX000000320081227
22885068900000652 B86860003OLFXXX592123320081227
22885068900000652 B86860003ODL-Sp592123420081227
22885068900000652-B94030001ODL-Ch592123520081227

I am writing the script below and getting the output file mondaytest.txt.

rec_cnt=1
while read line
do
no=`echo $line|cut -c2-10`
ben_type=`echo $line|cut -c28-33`
amount=`echo $line|cut -c34-40`
if [[ $rec_cnt -eq 1 ]]
then
echo $line >> mondaytest.txt
prior_no=$no
prev_line=$line
else
if [[ $no -eq $prior_no ]]
then
if [[ $ben_type = "ODL-SP" ]]
then
spouse_amt=$amount
prev_line="$prev_line $spouse_amt"
elif [[ $ben_type = "ODL-CH" ]]
then
child_amt=$amount
#prev_line="$prev_line $spouse_amt"
else 
echo 'invalid ben_type'
fi
#echo $prev_line $spouse_amt $child_amt>> mondaytest.txt
echo 'Insert_1' $prev_line $child_amt >> mondaytest.txt
else
echo 'Insert_2' $line >> mondaytest.txt
prev_line=$line
fi
spouse_amt=""
child_amt=""
fi 
(( rec_cnt=rec_cnt + 1 )) 
prior_no=$no
done <FILE.DAT

OUT FILE mondaytest.txt

23666483030000653-B94030001ODL-Ch000000120081227
23797049900000654-E71060001OLFXXX000000220081227
23699281320000655 E71060002OLFXXX000000320081227
22885068900000652 B86860003OLFXXX592123320081227
22885068900000652 B86860003OLFXXX592123320081227 5921234
22885068900000652 B86860003OLFXXX592123320081227 5921234 5921235

I want the output file to have only 4 records, like this:

23666483030000653-B94030001ODL-Ch000000120081227
23797049900000654-E71060001OLFXXX000000220081227
23699281320000655 E71060002OLFXXX000000320081227
22885068900000652 B86860003OLFXXX592123320081227 5921234 5921235

Can you please correct my code so that it produces the expected result above?
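One possible way to restructure the loop, offered as a sketch rather than a drop-in fix: the script above writes a line every time it reads one, so the partly built record is written more than once. Holding the current record in prev_line and writing it only when the number changes (and once more after the loop) avoids that. The combine and sample functions are hypothetical names, and the sketch assumes the benefit types appear in upper case, as the script tests for.

```shell
# Reads records on stdin and prints one combined line per record number.
combine() {
  prior_no=
  prev_line=
  while IFS= read -r line; do
    no=$(printf '%s\n' "$line" | cut -c2-10)
    ben_type=$(printf '%s\n' "$line" | cut -c28-33)
    amount=$(printf '%s\n' "$line" | cut -c34-40)
    if [ -n "$prev_line" ] && [ "$no" = "$prior_no" ]; then
      # Duplicate number: only append the amount, do not write anything yet.
      case $ben_type in
        ODL-SP|ODL-CH) prev_line="$prev_line $amount" ;;
        *) echo "invalid ben_type: $ben_type" >&2 ;;
      esac
    else
      # New number: the buffered record is complete, so write it now.
      [ -n "$prev_line" ] && printf '%s\n' "$prev_line"
      prev_line=$line
      prior_no=$no
    fi
  done
  # Write the last buffered record after the input ends.
  if [ -n "$prev_line" ]; then printf '%s\n' "$prev_line"; fi
}

# Hypothetical sample input standing in for the .DAT file.
sample() {
  cat <<'EOF'
23666483030000653-B94030001OLFXXX000000120081227
23797049900000654-E71060001OLFXXX000000220081227
23699281320000655 E71060002OLFXXX000000320081227
22885068900000652 B86860003OLFXXX592123320081227
22885068900000652 B86860003ODL-SP592123420081227
22885068900000652-B94030001ODL-CH592123520081227
EOF
}

result=$(sample | combine)
printf '%s\n' "$result"
```

Against the real file this would be run as combine < FILE1.DAT > mondaytest.txt.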

Scrutinizer · October 26, 2009, 4:48pm

E.g. like so?

#!/bin/ksh
 echo|cat infile -|while read line; do
  case ${line:27:6} in
    ODL-SP|ODL-CH)
        prev+=" ${line:33:7}" ;;
    *)  [[ -n $prev ]] && print $prev
        prev=$line ;;
  esac
done > outfile

kshuser · October 26, 2009, 4:59pm

But when I ran your code, it generated the output file with no changes compared to the input file.

>echo|cat FILE2.DAT -|while read line
> do
> case {$line:27:6} in
> ODL-SP|ODL-CH)
> prev+=" ${line:33:7}" ;;
> *) [[ -n $prev ]] && print $prev
> prev=$line ;;
> esac
> done > OUT.txt

OUT.txt ...is the same as input file FILE2.DAT

23666483030000653-B94030001OLFXXX000000120081227
23797049900000654-E71060001OLFXXX000000220081227
23699281320000655 E71060002OLFXXX000000320081227
22885068900000652 B86860003OLFXXX592123320081227
22885068900000652 B86860003ODL-SP592123420081227
22885068900000652-B94030001ODL-CH592123520081227

Scrutinizer · October 26, 2009, 5:10pm

The case line should read ${line:27:6}, not {$line:27:6}.
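To illustrate why the misplaced brace leaves the file unchanged, here is a small sketch using one sample record. With the brace before the dollar sign, the shell expands $line in full and keeps the surrounding characters as literal text, so the case pattern never matches and every line falls through to the default branch. The branch variable is a hypothetical name used only for this demonstration.

```shell
line='22885068900000652 B86860003ODL-SP592123420081227'

# Misplaced brace: "{" and ":27:6}" are literal, $line is expanded whole.
# (The intended ksh93 form, ${line:27:6}, takes 6 characters from offset 27.)
wrong="{$line:27:6}"

case $wrong in
  ODL-SP|ODL-CH) branch=append ;;   # never reached with the misplaced brace
  *)             branch=print ;;    # every record ends up here
esac

printf '%s\n%s\n' "$wrong" "$branch"
```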

danmero · October 26, 2009, 6:12pm

What about:

# awk 'NF{a[substr($0,0,9)]=(a[substr($0,0,9)])?a[substr($0,0,9)] FS substr($0,34,7):$0}END{for(i in a)print a[i]}' file
22885068900000652 B86860003OLFXXX592123320081227 5921234 5921235
23797049900000654-E71060001OLFXXX000000220081227
23666483030000653-B94030001OLFXXX000000120081227
23699281320000655 E71060002OLFXXX000000320081227

kshuser · October 27, 2009, 7:54am

In your awk code below, where is the input file passed in, and where is the output file? I see "a" in your code; is this the input file name? I also see the word "file"; is this the input or the output file? What is this doing?

thanks

# awk 'NF{a[substr($0,0,9)]=(a[substr($0,0,9)])?a[substr($0,0,9)] FS substr($0,34,7):$0}END{for(i in a)print a[i]}' file

danmero · October 27, 2009, 8:19am

awk 'NF{a[substr($0,0,9)]=(a[substr($0,0,9)])?a[substr($0,0,9)] FS substr($0,34,7):$0}END{for(i in a)print a[i]}' Input_file > Output_file
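To answer what "a" is: it is an awk array held in memory, not a file. The one-liner can be spread out with comments as in the sketch below. It makes a few assumptions of its own: the key is taken with substr($0,1,9) because awk strings start at position 1, the temporary file is a hypothetical stand-in for Input_file, and the result is piped through sort only because awk does not guarantee the order of "for (i in a)".

```shell
# Hypothetical sample input standing in for Input_file.
input=$(mktemp)
trap 'rm -f "$input"' EXIT
cat > "$input" <<'EOF'
23666483030000653-B94030001OLFXXX000000120081227
23797049900000654-E71060001OLFXXX000000220081227
23699281320000655 E71060002OLFXXX000000320081227
22885068900000652 B86860003OLFXXX592123320081227
22885068900000652 B86860003ODL-SP592123420081227
22885068900000652-B94030001ODL-CH592123520081227
EOF

result=$(awk '
  NF {                                     # skip empty lines
    key = substr($0, 1, 9)                 # leading characters identify the record
    if (key in a)
      a[key] = a[key] FS substr($0, 34, 7) # seen before: append the amount
    else
      a[key] = $0                          # first time: keep the whole line
  }
  END { for (i in a) print a[i] }          # print one combined line per key
' "$input" | sort)

printf '%s\n' "$result"
```

The file name after the closing quote is the input; the output goes to the screen unless it is redirected with > as in the command above.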