Extract line from file and save as new file.

Hello,
I have a tab-file "result.txt" that looks like this:

CNV.ID    Sample    Correlation    N.comp    Start.b    End.b    CNV.type    N.exons    BF    Reads.expected    Reads.observed    Reads.ratio    Gene
1    S10.Run.variant_ready    0.999411647    7    381    382    duplication    2    7.61    547    714    1.31    Gene1
2    S10.Run.variant_ready    0.999411647    7    1998    2016    duplication    19    133    14396    18691    1.3    Gene2
3    S11.Run.variant_ready    0.999286215    13    302    302    deletion    1    2.67    188    129    0.686    Gene3
4    S11.Run.variant_ready    0.999286215    13    341    341    deletion    1    4.58    548    386    0.704    Gene4
5    S11.Run.variant_ready    0.999286215    13    383    383    duplication    1    3.61    503    646    1.28    Gene5
6    S12.Run.variant_ready    0.999286215    13    388    388    deletion    1    2.8    45    24    0.533    Gene6

I need to extract each "Sample" (column 2) and store its rows in a separate file. For example, from the above file, I need to extract all lines (i.e., rows 1 and 2 in this case) corresponding to "S10.Run.variant_ready" and store them as S10.txt (i.e., named after the sample).

This is what I have tried so far

while read ID SAMPLE; do echo "$SAMPLE" > "$SAMPLE"; done < result.txt

I don't end up with a file with contents but rather each row as a file name. Please advise.

thanks

A crude bash way could be:-

while read ID SAMPLE DATA
do
   outfile="${SAMPLE%%.*}"              # Chop off everything after the first full-stop to decide the file name
   echo "$DATA" >> "$outfile"           # Quoted in case the line we have has spaces in.
done < result.txt

The >> means we append to a file so you get all matching records.
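
In case the ${SAMPLE%%.*} expansion is unfamiliar, here is a minimal sketch of what it does, using one of the sample names from the post. The variable names short and upto_last are only illustrative.

```shell
SAMPLE="S10.Run.variant_ready"

# %% removes the longest suffix matching ".*", i.e. everything from the first dot on
short="${SAMPLE%%.*}"

# a single % removes the shortest matching suffix, i.e. only the part after the last dot
upto_last="${SAMPLE%.*}"

echo "$short"       # S10
echo "$upto_last"   # S10.Run
```

So %% is the form wanted here, since the file name should stop at the first dot.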

This probably won't be very efficient for large files. In that case you would be better off with awk, so let us know if you need that or if the above is sufficient.

Kind regards,
Robin

Thank you @rbatte1.

The file is not too big. I am not expecting result.txt to have more than 200 lines.
It does create separate files but the header and the first two columns are missing from the output. [The header or first row ends up being a separate file by itself by the name of "Sample.txt"]

Trying this

while read ID SAMPLE DATA
do
   outfile="${SAMPLE%%.*}"              
   echo  "$ID\t$SAMPLE\t$DATA" >> "$outfile"           
done < result.txt
 

but it doesn't produce a tab-delimited file with all the contents.
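
A likely reason, assuming the loop runs under bash with default settings: the builtin echo does not expand \t unless it is called with -e, so the two characters \t end up in the output literally. A minimal sketch that writes real tabs with printf and also copies the header into each file might look like the following. It works in a scratch directory on a cut-down, made-up three-column stand-in for result.txt; the names workdir, tab and header are only illustrative.

```shell
workdir=$(mktemp -d) || exit 1
tab=$(printf '\t')

# Cut-down stand-in for result.txt (header plus three rows)
{
    printf 'CNV.ID\tSample\tGene\n'
    printf '1\tS10.Run.variant_ready\tGene1\n'
    printf '2\tS10.Run.variant_ready\tGene2\n'
    printf '3\tS11.Run.variant_ready\tGene3\n'
} > "$workdir/result.txt"

{
    IFS= read -r header                      # the first line is the header
    while IFS="$tab" read -r ID SAMPLE DATA  # split on tabs only; DATA gets the rest of the line
    do
        outfile="$workdir/${SAMPLE%%.*}.txt"
        # Start each new output file with the header
        [ -e "$outfile" ] || printf '%s\n' "$header" > "$outfile"
        # printf expands \t in its format string
        printf '%s\t%s\t%s\n' "$ID" "$SAMPLE" "$DATA" >> "$outfile"
    done
} < "$workdir/result.txt"

cat "$workdir/S10.txt"
```

Two caveats: because the loop appends, running it again on top of existing output files would add the rows a second time, and since a tab counts as whitespace for read, an empty first or second column would be collapsed.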

Please - what is a "tab file"?

EDIT: How far would

awk 'NR > 1 {split ($2, T, "\."); print $0 > (T[1] ".txt")}' file

get you?

Hello,
I meant a tab-delimited file, sorry, I should have been specific.

The code works by including all the columns, but again the header is missing in all the files.

Also, a very minor modification to the code

awk 'NR > 1 {split ($2, T, "\\."); print $0 > (T[1] ".txt")}' file

I was getting the following error with the single backslash

awk: cmd. line:1: warning: escape sequence `\.' treated as plain `.'

That was NOT an error, but a warning.

You did not mention you wanted the header, neither in the written specification nor in the code sample you posted.
Beyond that, do you get a satisfactory result, or not?

True, not an error, but perhaps it is interesting to ponder why exactly the warning was issued here.

The reason that the escape \ does not matter in this case is that a separator string consisting of a single character other than a space is not treated as an (extended) regex but as a literal character, and the escape sequence \. collapses to exactly that single character. (So "\." is not a regular expression here.)

So in short:

  • "." a literal . character (dot).
  • "\." an escape sequence that does not have a meaning here because it does not turn a . character into a special character, hence the warning. So this also gets interpreted as a single literal dot.
  • "\\." a regular expression denoting a literal dot.

So in this case split ($2, T, "\\.") , split ($2, T, "\.") and split ($2, T, ".") have the same meaning and produce identical results, while only the second one gives a warning.

--
Not relevant in the above case, but to get a regex that consists of a special . (denoting "any" character) one would need to use a regex constant instead: split ($2, T, /./) . On the other hand, to use a single literal dot within a regex constant split ($2, T, /\./) would need to be used.
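
For anyone who wants to check this, here is a small sketch that splits one of the sample names with each form except "\." (left out only to keep the run free of the warning). The variable split_demo and the array names A to D are only illustrative.

```shell
split_demo=$(printf '%s\n' "S10.Run.variant_ready" | awk '{
    n1 = split($0, A, ".")      # single-character string: literal dot
    n2 = split($0, B, "\\.")    # string holding the regex \. : literal dot
    n3 = split($0, C, /\./)     # regex constant for a literal dot
    n4 = split($0, D, /./)      # regex constant matching any character
    print n1, A[1]
    print n2, B[1]
    print n3, C[1]
    print n4, "[" D[1] "]"
}')
printf '%s\n' "$split_demo"
```

The first three lines should each read "3 S10". The last one reports many more pieces with an empty first element, because every character is taken as a separator.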

Yes, apart from the missing header, the code generates the output files. Thank you.

Try

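awk '
NR == 1         {HD = $0                        # remember the header line
                 next
                }
                {split ($2, T, "\\.")           # T[1] is the sample name up to the first dot
                 OUTF = T[1] ".txt"
                 if (!(T[1] in HDPR))   {print HD > OUTF    # first row for this sample: write the header
                                         HDPR[T[1]]
                                        }
                 print $0 > OUTF
                }
' file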
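
One detail that makes this work: in awk, print > file truncates the file only when it is first opened. Later prints to the same file name within the same run are appended, so the header is not overwritten by the rows that follow. A minimal sketch, writing to a scratch file (workdir and out are only illustrative names):

```shell
workdir=$(mktemp -d) || exit 1

awk -v out="$workdir/demo.txt" 'BEGIN {
    print "header" > out   # first use opens and truncates the file
    print "row 1"  > out   # the file is already open, so this line is appended
    print "row 2"  > out
}'

cat "$workdir/demo.txt"
```

This differs from the shell, where each > redirection truncates the file again.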

This works. Thank you.