A script to format a file (ideally PERL)

Hi forum members. It has been several years since my last post. I currently work with fairly large datasets on a day-to-day basis, handling immigration cases at a law firm. Our input file is filled out by our secretarial staff. The first column is the case ID and sample ID joined by a hyphen, the second column is the sample ID, the third is the relationship, and the fourth is the name.

What I need is an output file where each parent (father or mother) is paired with each child (child, son or daughter), one pair per row, in a specific syntax (please see the output file).

The file is tab separated.

  1. Father and mother would be compared to all children (child, son and daughter)
  2. If it is a son then an M would be used (e.g. ...[Jim Smith][M])
  3. If it is a daughter then an F would be used (e.g. ...[Jane Smith][F])
  4. If the third column says Child then the sex would be left blank (e.g. ...[Randy Davis][])
  5. Sometimes a family has more than one child (up to 8 children), and then each parent would have to be paired with every child in the output format.

Input file

USIM1357-11A	11A	Father	Jim Smith
USIM1357-11B	11B	Mother	Jane Smith
USIM1357-11C	11C	Son	        Jack Smith
V106866-12A	12A	Father	Ralph Davis
V106866-12B	12B	Child	        Randy Davis
V106864-14A	14A	Mother	Jane Jones
V106864-14B	14B	Son	        Jim Jones
V106879-15A	15A	Father	Andre Busby
V106879-15B	15B	Daugther    Jenny Busby
V106611-2A        2A     Father       Kyle Mike
V106611-2B        2B     Son           Evan Mike
V106611-2C        2C     Son           Bob Mike
V106611-2D        2D    Daughter    Jane Mike

Output file

USIM1357-11A11C_[Jim Smith][M] - [Jack Smith][M]
USIM1357-11B11C_[Jane Smith][F] - [Jack Smith][M]
V106866-12A12B_[Ralph Davis][M] - [Randy Davis][]
V106864-14A14B_[Jane Jones][F] - [Jim Jones][M]
V106879-15A15B_[Andre Busby][M] - [Jenny Busby][F]
V106611-2A2B_[Kyle Mike][M] - [Evan Mike][M]
V106611-2A2C_[Kyle Mike][M] - [Bob Mike][M]
V106611-2A2D_[Kyle Mike][M] - [Jane Mike][F]

Above is the output file. It would be best if the script were in Perl, but any code would help.

Thanks

You have shown us the output you want when the input has two parents and one child, and you have shown us the output you want when the input has one parent and more than one child. What output do you want when there are two parents and more than one child, such as with the input:

USIM1357-11A	11A	Father	Jim Smith
USIM1357-11B	11B	Mother	Jane Smith
USIM1357-11C	11C	Son	        Jack Smith
USIM1357-11D	11D	Son	Jim Smith
USIM1357-11E	11E	Daughter	Janet Smith
USIM1357-11F	11F	Son	        Jerry Smith

And, can there be a Father and/or a Mother with no Sons or Daughters? If so, what output do you want?

And, can there be a Son and/or a Daughter with no Mother or Father? If so, what output do you want?

Can there be anything other than Daughter, Father, Mother, and Son (e.g., Grand Daughter, Step Son, Brother, Aunt)?

How come Jane Smith is an [F] - there's no daughter in that case. And why [Ralph Davis][F] - there's just one child?

Hi, thanks for the reply.

No, there is always a family with a child and either one or both parents.

There is always a father and/or mother plus at least one child in our cases.

When there are two parents and more than one child, the output would be:

USIM1357-11A11C_[Jim Smith][M] - [Jack Smith][M]
USIM1357-11A11D_[Jim Smith][M] - [Jim Smith][M]
USIM1357-11A11E_[Jim Smith][M] - [Janet Smith][F]
USIM1357-11A11F_[Jim Smith][M] - [Jerry Smith][M]
USIM1357-11B11C_[Jane Smith][F] - [Jack Smith][M]
USIM1357-11B11D_[Jane Smith][F] - [Jim Smith][M]
USIM1357-11B11E_[Jane Smith][F] - [Janet Smith][F]
USIM1357-11B11F_[Jane Smith][F] - [Jerry Smith][M]

Thanks

---------- Post updated at 09:25 AM ---------- Previous update was at 09:18 AM ----------

Sorry I made the correction above with Ralph Davis.

M and F just stand for male and female in our cases.

I have posted the Perl code you are looking for at the link below; please refer to it:
Perl - Need a Perl code for the below input and output

How about this awk script:

awk -F"\t" '
BEGIN           {A="FSMDC"
                 B="MMFF"
                 C="PCPCC"
                 for (i=1; i<=5; i++)   {GEN[substr(A,i,1)]=substr(B,i,1)
                                         REL[substr(A,i,1)]=substr(C,i,1)
                                        }
                 SEP="XXX"
                }


function PRPnPRT()      {m=split (PAR, PT)
                         n=split (CHL, CT)
                         for (i=1; i<m; i+=2)
                                 for (j=1; j<n; j+=2) print PT[i] CT[j] "_" PT[i+1] " - " CT[j+1]
                         PAR=CHL=""
                        }


                {TYP=substr($3,1,1)
                 CAS=$1; sub (/-.*$/, "", CAS)
                 $4=$4 "[" GEN[TYP] "]"
                }

LCS != CAS      {PRPnPRT() }
END             {PRPnPRT() }

                {if (REL[TYP] == "P") PAR = PAR $1 FS $4 FS
                 if (REL[TYP] == "C") CHL = CHL $2 FS $4 FS
                 LTP=TYP
                 LCS=CAS
                }

' file
USIM1357-11A11C_Jim Smith[M] - Jack Smith[M]
USIM1357-11B11C_Jane Smith[F] - Jack Smith[M]
V106866-12A12B_Ralph Davis[M] - Randy Davis[]
V106864-14A14B_Jane Jones[F] - Jim Jones[M]
V106879-15A15B_Andre Busby[M] - Jenny Busby[F]
V106611-2A2B_Kyle Mike[M] - Evan Mike[M]
V106611-2A2C_Kyle Mike[M] - Bob Mike[M]
V106611-2A2D_Kyle Mike[M] - Jane Mike[F]

---------- Post updated at 18:53 ---------- Previous update was at 18:45 ----------

Overlooked the square bracket around names - modify the $4 assignment: $4="[" $4 "][" GEN[TYP] "]"

Here is a slightly different approach to the problem using awk. Note that the sample input file in post #1 in this thread sometimes uses <tab>, sometimes uses <tab> and a few <space>s, and sometimes uses two or more <space>s as a field separator. (But a single <space> is not a field separator.)
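
To make the field separator rule concrete, here is a small sketch that only splits lines and shows the resulting fields. The function name show_fields is made up for this illustration; the -F value is the same one used in the script that follows.

```shell
# show_fields (hypothetical helper): print every field of each input line
# wrapped in angle brackets, so the effect of the separator is visible.
show_fields() {
    awk -F'\t *|  +' '{
        out = ""
        for (i = 1; i <= NF; i++) out = out "<" $i ">"
        print out
    }'
}

# Runs of two or more spaces separate fields; the single space in the name does not.
printf 'V106611-2D        2D    Daughter    Jane Mike\n' | show_fields
# A tab followed by spaces counts as one separator.
printf 'USIM1357-11C\t11C\tSon\t        Jack Smith\n' | show_fields
```

Both lines come out as four fields, with the name kept whole in the last one.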

The following code makes the assumption that parents are presented before their children:

awk -F'\t *|  +' '
BEGIN {	parent["Father"] = parent["Mother"] = 1
	sex["Daughter"] = sex["Mother"] = "F"
	sex["Father"] = sex["Son"] = "M"
}
function dump(	c, p) {
	for(p = 1; p <= 2; p++) {
		if(!parent[position[p]])
			continue
		for(c = 2; c <= cnt; c++) {
			if(parent[position[c]])
				continue
			printf("%s-%s%s_[%s][%s] - [%s][%s]\n", last, suf[p],
			    suf[c], name[p], sex[position[p]], name[c],
			    sex[position[c]])
		}
	}
	cnt = 0
}
{	# Strip "-" and suffix from case #.
	case = substr($1, 1, length($1) - length($2) - 1)
#printf("$0=%s\n\tcase=%s\n", $0, case)
}
FNR > 1 && last != case {
	dump()
}
{	last = case
	suf[++cnt] = $2
	position[cnt] = $3
	name[cnt] = $4
}
END {	dump()
}' "${1:-file}"

If you change the limits on the for loops in the dump() function from:

	for(p = 1; p <= 2; p++) {
		... ... ...
		for(c = 2; c <= cnt; c++) {

to:

	for(p = 1; p <= cnt; p++) {
		... ... ...
		for(c = 1; c <= cnt; c++) {

then the code will provide the desired output even if children appear in the input before, after, or in between their parents.

With the sample input currently shown in post #1 in this thread contained in a file named file, the above code produces the output:

USIM1357-11A11C_[Jim Smith][M] - [Jack Smith][M]
USIM1357-11B11C_[Jane Smith][F] - [Jack Smith][M]
V106866-12A12B_[Ralph Davis][M] - [Randy Davis][]
V106864-14A14B_[Jane Jones][F] - [Jim Jones][M]
V106879-15A15B_[Andre Busby][M] - [Jenny Busby][]
V106611-2A2B_[Kyle Mike][M] - [Evan Mike][M]
V106611-2A2C_[Kyle Mike][M] - [Bob Mike][M]
V106611-2A2D_[Kyle Mike][M] - [Jane Mike][F]

and, if file contains the sample input I asked about in post #2, it produces the output:

USIM1357-11A11C_[Jim Smith][M] - [Jack Smith][M]
USIM1357-11A11D_[Jim Smith][M] - [Jim Smith][M]
USIM1357-11A11E_[Jim Smith][M] - [Janet Smith][F]
USIM1357-11A11F_[Jim Smith][M] - [Jerry Smith][M]
USIM1357-11B11C_[Jane Smith][F] - [Jack Smith][M]
USIM1357-11B11D_[Jane Smith][F] - [Jim Smith][M]
USIM1357-11B11E_[Jane Smith][F] - [Janet Smith][F]
USIM1357-11B11F_[Jane Smith][F] - [Jerry Smith][M]

As always, if you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.

I have modified the script at the same link below. An OR condition (|) was missing from one of the statements, a copy-and-paste error; apologies for that. I have tested it and it works well.
Perl - Need a Perl code for the below input and output

#!/usr/bin/perl

use strict;
use warnings;

# map relationships to gender.
# accommodate misspelling of Daughter
my %gender = (
    'Father', '[M]',
    'Mother', '[F]',
    'Daugther', '[F]',
    'Daughter', '[F]',
    'Son', '[M]',
    'Child', '[]',
);

my @parents;
my %children;

# read kylle345.file file line by line
# change to the actual file name to read from
open my $fh, '<', 'kylle345.file' or die "Cannot open kylle345.file: $!";

# map parents and children per family into two categories
while(<$fh>){
    chomp;
    my ($family_id, $person) = split('-\w+\s+');
    my ($id, $relation, $name) = split('\s+', $person, 3);
    if($relation =~ /Father|Mother/) {
         push @parents, [$family_id, $id, "[$name]$gender{$relation}"];
         next;
    }
    push @{$children{$family_id}}, [$id, "[$name]$gender{$relation}"];
}
close $fh;

# report maps of parents and children
for my $session (@parents){
    my ($f, $i, $p) = @{$session};
    for my $child (@{$children{$f}}) {
        print "$f-$i$child->[0]_$p - $child->[1]\n";
    }
}
$ perl kylle345.pl
USIM1357-11A11C_[Jim Smith][M] - [Jack Smith][M]
USIM1357-11B11C_[Jane Smith][F] - [Jack Smith][M]
V106866-12A12B_[Ralph Davis][M] - [Randy Davis][]
V106864-14A14B_[Jane Jones][F] - [Jim Jones][M]
V106879-15A15B_[Andre Busby][M] - [Jenny Busby][F]
V106611-2A2B_[Kyle Mike][M] - [Evan Mike][M]
V106611-2A2C_[Kyle Mike][M] - [Bob Mike][M]
V106611-2A2D_[Kyle Mike][M] - [Jane Mike][F]

Has anyone thought about making a data entry form that would handle all of the special cases, as well as verify that the data is correct as it's being entered?
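
Short of a full entry form, a quick check run on the file before formatting would already catch slips like the "Daugther" in post #1. This is only a rough sketch, assuming the four-column layout described in post #1 and the five relationship words used in this thread; the function name check_input and the wording of the messages are made up here.

```shell
# check_input (hypothetical helper): report lines that do not fit the
# expected layout; the exit status is non-zero if any problem was found.
check_input() {
    awk -F'\t *|  +' '
    BEGIN {
        # Relationship words seen in this thread (an assumption, extend as needed).
        n = split("Father Mother Son Daughter Child", rel, " ")
        for (i = 1; i <= n; i++) known[rel[i]] = 1
    }
    NF != 4 {
        printf("line %d: expected 4 fields, found %d\n", FNR, NF)
        bad = 1
        next
    }
    !($3 in known) {
        printf("line %d: unknown relationship \"%s\"\n", FNR, $3)
        bad = 1
    }
    # Column 1 should be the case ID, a hyphen, then the sample ID from column 2.
    substr($1, length($1) - length($2)) != ("-" $2) {
        printf("line %d: column 1 does not end in \"-%s\"\n", FNR, $2)
        bad = 1
    }
    END { exit bad + 0 }
    ' "$@"
}
```

It could be used as `check_input file && echo "input looks fine"` before running any of the scripts above.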