I have a file1.txt with several hundred thousand lines, each of which has a column 9 containing one of 60 "label" identifiers. Using a labels.txt file containing a list of labels, I'd like to extract 200 random lines from file1.txt for each of the labels in labels.txt.
The problem seems to be the "awk '/${file}/'" part. I say this because I can extract lines for each label, but only if I explicitly specify the label regex (in this case g10.txt also has two random lines with "label=c12" instead of g10):
This successfully extracts lines with the correct label. But I am looking for only two lines for each label in labels.txt, and the lines should be randomly chosen. Can this script be piped to "gshuf -n 2" (or the equivalent) before the file is created?
With FS='[=;]', the 1st field on each line in your sample data is the string of characters before the 1st semicolon, the 2nd field is the string barcodelabel, and the 3rd field (referenced in awk by $3) is the string between the equals sign and the 2nd semicolon (which is what you're looking for to match a string read from labels.txt).
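To see the split concretely, here is a minimal sketch run on one made-up line. The line content (read1, size=5) is an assumption for illustration only; the real sample data may carry different fields around the label.

```shell
# Hypothetical sample line; only the "barcodelabel=<label>;" part follows the description above.
line='read1;barcodelabel=g10;size=5'

# Split on "=" or ";" and show what lands in the first three fields.
fields=$(printf '%s\n' "$line" |
    awk -F'[=;]' '{ printf "1=%s 2=%s 3=%s", $1, $2, $3 }')
echo "$fields"
# prints: 1=read1 2=barcodelabel 3=g10
```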
NR==FNR {
    A[$1]   # First file (labels.txt): build an array keyed on its first column
    next    # Skip the following blocks and process the next line
}
$3 in A {   # Second file (file1.txt): is field 3 a key of the array?
    # If so, print the line to a file named after $3
    print > ($3 ".txt")
}
If so, how can I adapt this code to give me only a random subset of matched lines for each array value? Is there a way to redirect the match specified by $3 in A to gshuf -n 2 before printing?
I know that I could pipe the output file and get my desired output. But I would like to accomplish it all at once if possible--find matching lines in file1.txt based on the labels in labels.txt, and print a random subset of two of these matching lines to an output file named according to the label.
Do you want random lines from the input file? Then shuf the input file before feeding it into the awk proposals above.
If you always want the same lines, but presented in random order, shuf the output files later on.
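A rough sketch of the first suggestion, assuming shuf (or gshuf on macOS) is installed and that the lines look like the made-up sample below. The counter array n and the limit k are names introduced here, not taken from the thread. Because the input is already shuffled, keeping the first k matches per label gives a random subset for that label.

```shell
# Hypothetical test data in a scratch directory; replace with the real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' g10 c12 > labels.txt
for i in 1 2 3 4 5; do
    printf 'read%s;barcodelabel=g10;size=1\n' "$i"
    printf 'read%s;barcodelabel=c12;size=1\n' "$i"
    printf 'read%s;barcodelabel=x99;size=1\n' "$i"
done > file1.txt

# Use shuf where available, otherwise gshuf.
SHUF=$(command -v shuf || command -v gshuf)

# Shuffle file1.txt, then keep only the first k matching lines per label.
"$SHUF" file1.txt |
awk -F'[=;]' -v k=2 '
NR == FNR { A[$1]; next }                          # labels.txt: remember wanted labels
$3 in A && ++n[$3] <= k { print > ($3 ".txt") }    # stdin: first k matches per label
' labels.txt -
```

Each label file then holds at most k lines, and labels missing from labels.txt (x99 in this made-up data) produce no file.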
Or something along these lines (without pre- or post-randomization); not tested:
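BEGIN {
    srand()                         # Seed the random number generator
}
function genrand(n) {
    return int(n * rand()) + 1      # Random integer in 1..n
}
NR==FNR {
    A[$1]   # First file (labels.txt): build an array keyed on its first column
    next    # Skip the following blocks and process the next line
}
$3 in A {   # Second file (file1.txt): is field 3 a key of the array?
    a[++c] = $0                     # Remember the matching line
    file[c] = sprintf("%s.txt", $3) # ... and the output file it belongs to
}
END {
    # Draws two matched lines at random from all labels combined
    # (not two per label, and the same line may be drawn twice)
    for (i = 1; c > 0 && i <= 2; i++) {
        n = genrand(c)
        print a[n] > file[n]
    }
}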
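The END block above draws its two lines from all matches together. If the goal is two random lines per label in a single pass, one option is reservoir sampling inside awk. This is a sketch under assumptions, not the approach posted in the thread: the sample data is made up, and seen, keep and k are names introduced here.

```shell
# Hypothetical test data in a scratch directory; replace with the real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' g10 c12 > labels.txt
for i in 1 2 3 4 5; do
    printf 'read%s;barcodelabel=g10;size=1\n' "$i"
    printf 'read%s;barcodelabel=c12;size=1\n' "$i"
    printf 'read%s;barcodelabel=x99;size=1\n' "$i"
done > file1.txt

awk -F'[=;]' -v k=2 '
BEGIN { srand() }
NR == FNR { A[$1]; next }               # labels.txt: remember wanted labels
$3 in A {
    n = ++seen[$3]                      # matches seen so far for this label
    if (n <= k) {
        keep[$3, n] = $0                # reservoir not full yet: just store
    } else {
        j = int(n * rand()) + 1         # uniform integer in 1..n
        if (j <= k) keep[$3, j] = $0    # replace a stored line with probability k/n
    }
}
END {
    for (label in seen) {
        m = (seen[label] < k) ? seen[label] : k
        out = label ".txt"
        for (i = 1; i <= m; i++) print keep[label, i] > out
        close(out)                      # avoid holding 60 files open at once
    }
}' labels.txt file1.txt
```

Only k lines per label are held in memory, so this should scale to the file sizes mentioned in the question; set k=200 for the full run.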