New code for modifying text files in a folder

Hi, I want to write code that can do the following for all the text files in a folder.

The filenames all follow this pattern:

UNIQUEID-LABID_[First Name 1][M] - [First Name 2][F].txt

Each file has a unique ID and a different name and the content in the file looks like this:

Kinship Analysis Report --- Likelihood Ratio
Marker	mmm.fsa			jjj.fsa			Parent/Child (LR)	
CSF1PO	10	12		10	12		1.90183	
TPOX	8	9		9	11		0.97656	
TH01	6	9.3		7	9.3		3.42466	
vWA	18			18	19		4.34783	
D16S539	10	12		9	10		2.33645	
D7S820	10	12		9	12		2.50000	
D13S317	11	12		11	12		1.48845	
D5S818	10	11		11	13		1.02459	
FGA	19	25		19	24		3.62319	
D8S1179	10	14		10	12		10.41667	
D18S51	12			12			17.24138	
D21S11	28	31		29	31		3.16456	
D3S1358	16	17		17			2.76243	
PENTA E	12	14		11	12		2.01613	
PENTA D	10	12		10	12		4.53612
							1.28E+07

I want the output to look like this:

Kinship Analysis Report --- Likelihood Ratio
Marker	First Name 1			First Name 2			Parent/Child (LR)	
CSF1PO	10	12		10	12		1.90183	
TPOX	8	9		9	11		0.97656	
TH01	6	9.3		7	9.3		3.42466	
vWA	18			18	19		4.34783	
D16S539	10	12		9	10		2.33645	
D7S820	10	12		9	12		2.50000	
D13S317	11	12		11	12		1.48845	
D5S818	10	11		11	13		1.02459	
FGA	19	25		19	24		3.62319	
D8S1179	10	14		10	12		10.41667	
D18S51	12			12			17.24138	
D21S11	28	31		29	31		3.16456	
D3S1358	16	17		17			2.76243	
PENTA E	12	14		11	12		2.01613	
PENTA D	10	12		10	12		4.53612	
AMEL	X	Y		X	X
							1.28E+07

Basically, I want to replace each .fsa name in the Marker line with the corresponding name from the filename, and add an AMEL line to each file, where M gives X Y and F gives X X.

Each folder contains 50-plus files, so code that does them all at once would be best.

Thanks

Please test with a small sample of your directory first.

For safety, the script leaves each original untouched and writes the result to a copy named file.bk. If the .bk files look correct, uncomment the last command in the script (the move line) so that each backup is renamed to the original name, then run it in the production/real directory.

Run as:

perl kylle345_run.pl /path/to/directory/*.txt

#!/usr/bin/perl
# kylle345_run.pl

use strict;
use warnings;
use File::Copy qw(move);

# process each file given at the command line
for my $fname (@ARGV){
    my $fh;
    # open the current file    
    if(not open $fh, '<', $fname){
        # file could not be read; skip
        print STDERR "$fname could not be processed\n";
        next;
    }
    print "Processing file: $fname ...\n";

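    # obtain names and sex codes from the file name; skip unless two pairs are found
    my @parts = $fname =~ /\[(.+?)\]\[(\w)\]/g;
    next unless @parts == 4;
    my @names = @parts[0, 2];
    open my $fh_bk, '>', "$fname.bk" or next;
    # AMEL row: M gives X Y, anything else gives X X
    my ($amel1, $amel2) = map { $_ eq 'M' ? "X\tY" : "X\tX" } @parts[1, 3];
    my $before_last = "AMEL\t$amel1\t\t$amel2\n";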

    # read the current file
    while(<$fh>){
         # only at second line substitute the xxx.fsa with parts of filename
         s/\w{3}\.fsa(\s+)\w{3}\.fsa/$names[0]$1$names[1]/ if $. == 2;
         # add a line before last
         if(eof) {
             print $fh_bk "$before_last$_";
             next;
         }
         # write to a copy ending in .bk
         print $fh_bk $_;
    }
    close $fh;
    close $fh_bk;
    # test first and if it does correctly, remove the
    # # from next line to rename backup to original name
    #move "$fname.bk", $fname;
}
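
The check suggested above (look at the .bk copies before renaming) could be done with diff. This is only a sketch: review_backups is a hypothetical helper name, and it assumes the backups sit next to the originals with a .bk suffix, as the script writes them.

```shell
# Hypothetical helper: show what changed between each original and its .bk copy.
review_backups() {
    for orig in "$@"; do
        bk="$orig.bk"
        if [ ! -f "$bk" ]; then
            printf 'no backup for: %s\n' "$orig"
            continue
        fi
        printf '== %s\n' "$orig"
        # diff exits with status 1 when the files differ, which is expected here
        diff "$orig" "$bk" || true
    done
}
```

Called as review_backups /path/to/directory/*.txt, each file should show a changed Marker line and one added AMEL line; anything else is worth a closer look before uncommenting the move.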

Try also

awk '
NR==1   {split(FILENAME, T, "[][]")
        }
NR==2   {sub (/...\.fsa/, T[2])
         sub (/...\.fsa/, T[6])
         EL="AMEL\tX\t" (T[4]=="M"?"Y":"X") "\t\tX\t" (T[8]=="M"?"Y":"X")
        }
NF==1   {print EL
        }

1

' 'UNIQUEID-LABID_[First Name 1][M] - [First Name 2][F].txt'                        

Thank you for the code, but the error "could not be processed" comes up.

Thanks. Is there a way to run it on the entire folder at once?

Thanks

What should the output file names be?

Anyhow, try

awk '
FNR==1  {if (fnm) close(fnm)
         fnm = FILENAME ".mod"
         split(FILENAME, T, "[][]")
        }
FNR==2  {sub (/...\.fsa/, T[2])
         sub (/...\.fsa/, T[6])
         EL="AMEL\tX\t" (T[4]=="M"?"Y":"X") "\t\tX\t" (T[8]=="M"?"Y":"X")
        }
NF==1   {print EL > fnm
        }

        {print > fnm
        }
' UNIQUEID*txt

Yes, that's a warning message I built in to notify you of EACH file that could not be opened for reading, for whatever reason (permissions, etc.). Every notification shows the file name the script tried to open. Take a look at those names and see why they cannot be opened.
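
To narrow down which names fail, a quick readability check from the shell may help. This is a sketch only: check_readable is a hypothetical helper, and it tests the same thing the script's open does (that the name refers to a regular file you can read). Because the names contain spaces and brackets, each one has to reach the program as a single quoted argument.

```shell
# Hypothetical helper: report, per argument, whether it is a readable regular file.
check_readable() {
    for f in "$@"; do
        if [ -f "$f" ] && [ -r "$f" ]; then
            printf 'ok: %s\n' "$f"
        else
            printf 'cannot read: %s\n' "$f"
        fi
    done
}
```

For example, check_readable /path/to/directory/*.txt lets the shell expand the glob and pass every name intact. A name reported as "cannot read" here would also trigger the script's warning.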

Oddly enough I cannot seem to figure out what the issue is. Could it be the spacing (number of tabs)?
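
One way to answer the tab question is to print the Marker line split on tabs, one field per line, so empty fields and stray spaces become visible. A sketch, assuming the header starts with the word Marker; show_fields is a hypothetical helper name.

```shell
# Hypothetical helper: list the tab-separated fields of the Marker line.
show_fields() {
    awk -F '\t' '/^Marker/ {
        for (i = 1; i <= NF; i++) printf "%d:[%s]\n", i, $i
    }' "$1"
}
```

For the header shown in the first post this prints nine numbered fields, with the two sample names in fields 2 and 5. If the whole line shows up as a single field, the file is not tab-separated.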

Please post some more information to help with troubleshooting, i.e.:

What program are you running?
How are you running the program?

  • Show the full command you are invoking, with its arguments.
  • What is the name of the directory you are running the program from?

What is the output?

  • If there are any errors or warnings, please post a representative, real portion of them, not a paraphrased version.
  • What is it doing that is not the desired outcome?

A copied-and-pasted portion of the file names as listed by the command ls might also be helpful; again, not a paraphrased version.

The input files are all .txt (the only thing the files have in common). The output file name can be anything (even replacing the original).

Sorry, I am a bit rusty on coding. Everything does run, but the problem is that the names ending in .fsa do not get replaced by the names from the filename. The AMEL row gets filled in, but that's about it.

Thanks

I created the code based on your first post example:

Kinship Analysis Report --- Likelihood Ratio
Marker	mmm.fsa			jjj.fsa			Parent/Child (LR)	
CSF1PO	10	12		10	12		1.90183	
TPOX	8	9		9	11		0.97656	
TH01	6	9.3		7	9.3		3.42466	
vWA	18			18	19		4.34783	
D16S539	10	12		9	10		2.33645	
D7S820	10	12		9	12		2.50000	
D13S317	11	12		11	12		1.48845	
D5S818	10	11		11	13		1.02459	
FGA	19	25		19	24		3.62319	
D8S1179	10	14		10	12		10.41667	
D18S51	12			12			17.24138	
D21S11	28	31		29	31		3.16456	
D3S1358	16	17		17			2.76243	
PENTA E	12	14		11	12		2.01613	
PENTA D	10	12		10	12		4.53612
							1.28E+07

I suspect there are discrepancies between what you posted and what you actually have. Let me ask you: does mmm.fsa jjj.fsa actually exist in the real file? Is the line starting with "Marker" the second line of the file?

The Marker line does exist and the .fsa names do exist, but they are not always mmm or jjj. They vary each time. I just need them to be replaced with the names from the filename.

Hope that makes sense.

Since you could not confirm whether the line starting with the word "Marker" always lands on the second line, I made some modifications.
Please try this version.

#!/usr/bin/perl

use strict;
use warnings;
use File::Copy qw(move);

for my $fname (@ARGV){
    my $fh;
    if(not open $fh, '<', $fname){
        print STDERR "$fname could not be processed\n";
        next;
    }  
    print "Processing file: $fname ...\n";

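    # names and sex codes come from the "[Name][M|F]" pairs in the file name
    my @parts = $fname =~ /\[(.+?)\]\[(\w)\]/g;
    next unless (scalar @parts == 4);
    my @names = @parts[0, 2];
    open my $fh_bk, '>', "$fname.bk" or next;
    # AMEL row: M gives X Y, anything else gives X X
    my ($amel1, $amel2) = map { $_ eq 'M' ? "X\tY" : "X\tX" } @parts[1, 3];
    my $before_last = "AMEL\t$amel1\t\t$amel2\n";

    while(<$fh>){
         # .fsa names may be any length; assumes they contain no whitespace
         s/(^Marker\s+)\S+\.fsa(\s+)\S+\.fsa(\s+Parent\/Child.*$)/$1$names[0]$2$names[1]$3/;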
         if(eof) {
             print $fh_bk "$before_last$_";
             next;
         }
         print $fh_bk $_;
    }
    close $fh;
    close $fh_bk;
    # test first and if it does correctly, remove the
    # # from next line to rename backup to original name
    #move "$fname.bk", $fname;
}
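
Since the .fsa names vary in length, the same idea can also be sketched in awk with a pattern that takes everything between tabs instead of exactly three characters. This is a hedged sketch, not a tested replacement for the script above: relabel is a hypothetical helper name, the .mod suffix follows the earlier awk post, and it assumes tab-separated files with Unix line endings, no brackets in directory names, and a final line that holds only the combined value.

```shell
# Hypothetical helper: write FILE.mod next to each input, leaving the input untouched.
relabel() {
    for file in "$@"; do
        awk '
        # T[2], T[6] are the names and T[4], T[8] the sex codes from the file name
        FNR == 1  { split(FILENAME, T, "[][]") }
        /^Marker/ { sub(/[^\t]+\.fsa/, T[2])    # first .fsa name, any length
                    sub(/[^\t]+\.fsa/, T[6]) }  # second .fsa name
        # the total line is the only one with a single field: put AMEL before it
        NF == 1   { print "AMEL\tX\t" (T[4] == "M" ? "Y" : "X") "\t\tX\t" (T[8] == "M" ? "Y" : "X") }
                  { print }
        ' "$file" > "$file.mod"
    done
}
```

Run as relabel /path/to/directory/*.txt and inspect the .mod files before replacing anything.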