How to delete a specific number of rows and columns randomly?

rDamascena · March 16, 2024, 10:32pm

01 02 03 04 05 06 07 08 09 10
11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 29 30
31 32 33 34 35 36 37 38 39 40
41 42 43 44 45 46 47 48 49 50
51 52 53 54 55 56 57 58 59 60

This is the range from 1 to 60, formatted as 6 lines and 10 columns.

How can I delete a specific number of rows and columns at random (preferably using awk)? Rows and columns can be handled separately.

Examples:

Delete 2 (or any number from 1 to 6) from any of the 6 lines, for example: from 1 to 10 and from 31 to 40.

Result:

11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 29 30
41 42 43 44 45 46 47 48 49 50
51 52 53 54 55 56 57 58 59 60

Delete 3 (or any number from 1 to 10) from any of the 10 columns, for example: those ending in 1, 8 and 0.

Result:

02 03 04 05 06 07 09
12 13 14 15 16 17 19
22 23 24 25 26 27 29
32 33 34 35 36 37 39
42 43 44 45 46 47 49
52 53 54 55 56 57 59

munkeHoller · March 16, 2024, 11:28pm

@rDamascena, welcome. The forum is a collaboration: you post your challenge AND your attempts, and the team responds with fixes, alternatives, etc. (some of which might be complete solutions).
You specify awk. Why?

rDamascena · March 16, 2024, 11:39pm

Sorry, but I don't have a starting point. I'm new (maybe not even that) at this and I wanted a complete solution.

munkeHoller · March 16, 2024, 11:47pm

@rDamascena, is this coursework (homework)?

Asking for a specific language (awk) raises a number of questions ...
Why (as originally asked )

What [other] programming experience do you have ?

rDamascena · March 16, 2024, 11:58pm

It can be any language; I didn’t think that would be an issue. In fact, the purpose is to reduce the sample space to generate a draw among the remaining dozens. As for programming experience: almost none, to be quite honest.

munkeHoller · March 17, 2024, 12:03am

@rDamascena , please, answer all questions , do not be vague.

PS: I have written a working example, but as you appear to be guarded in responding to basic questions, it remains hidden until you are open and frank wrt your request.
Otherwise, try asking one of the LLMs to help you out; you may make some progress.

example output below

awk -vXcols=2,6,9 -vXrows=2,4 -f prune.awk inputs
01 03 04 05 07 08 10 
21 23 24 25 27 28 30 
41 43 44 45 47 48 50 
51 53 54 55 57 58 60

Paul_Pedant · March 17, 2024, 9:52pm

The shuf command is often used to produce a selection of random numbers. You could use these to control which columns and which rows to delete (or which to keep). You can look that up with the command man -s 1 shuf.

The cut command can be used to select columns by number.

Selecting lines would be quite difficult in some commands like sed. That is a line-based editor, so each line number would need a separate command (but you could generate those commands quite easily from a random sequence).

You might check out Knuth's algorithm S on Rosetta Code.

As it happens, awk can make random numbers, remove columns, and remove lines all in the same process.
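A minimal sketch of that single-process awk approach, assuming a space-separated table on standard input. The function name prune_random and the variables nxr, nxc and seed are made up for illustration; this is not the script withheld earlier in the thread. It reads the whole table into memory, picks the rows and columns to drop with rand(), and prints what is left.

```shell
# prune_random NROWS NCOLS [SEED]  (hypothetical helper; reads the table on stdin)
prune_random() {
  awk -v nxr="$1" -v nxc="$2" -v seed="${3:-}" '
    # Mark n distinct random integers from 1..max as keys of picked[].
    function pick(n, max, picked,    k, c) {
      n += 0; c = 0
      if (n > max) n = max
      while (c < n) {
        k = int(rand() * max) + 1
        if (!(k in picked)) { picked[k] = 1; c++ }
      }
    }
    BEGIN { if (seed != "") srand(seed); else srand() }
    { line[NR] = $0; if (NF > nf) nf = NF }   # slurp the table, note its width
    END {
      pick(nxr, NR, xrow)                     # rows to delete
      pick(nxc, nf, xcol)                     # columns to delete
      for (r = 1; r <= NR; r++) {
        if (r in xrow) continue               # skip a deleted row
        n = split(line[r], f, " ")
        out = ""; sep = ""
        for (c = 1; c <= n; c++)
          if (!(c in xcol)) { out = out sep f[c]; sep = " " }
        print out
      }
    }
  '
}

# Example: drop 2 random rows and 3 random columns from the 6x10 table.
seq -f '%02g' 1 60 | xargs -n 10 | prune_random 2 3
```

Without a seed, srand() uses the time of day, so two runs within the same second are likely to make the same choice. Passing a third argument fixes the seed, which helps when testing.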

MadeInGermany · March 18, 2024, 8:00am

My perception of "random" is "arbitrary" here.
Like "random access" meaning "not in a certain order".

Paul_Pedant · March 18, 2024, 9:44am

But the random part is most of the fun!! OK, I sort the random choices for clarity (and so as not to attempt to rearrange the order of lines or columns). In fact, how you come up with the lists of numbers (random or user-declared) has no real impact on doing the edits.

I can generate these variables through simple pipelines, for use by the obvious commands. The user can do the same by hand. One catch is that the Cols list is the same for every row, but the Rows list can be as long as half of the file (if you are prepared to deal with both "retain" and "delete" lists).

declare -- Rows="1 p; 3 p; 4 p; 5 p; "
declare -- Cols="2,3,5,6,7,9,10"

Those are positive -- the rows and columns to retain. It turns out it is trivial to make the negative version -- the rows and columns to remove.

Anything smarter (like rearranging the column order, or deleting all rows whose line number is prime, or a multiple of 42) needs awk (or similar).

MadeInGermany · March 19, 2024, 7:43am

Implementing the idea from the previous post (bash, but without declare, what is it for?)

Pcols=2,3,5,6,7,9,10 Prows=1,3,4,5
printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" {01..60} | cut -f "$Pcols" | sed -n "${Prows//,/p;}p"

02      03      05      06      07      09      10
22      23      25      26      27      29      30
32      33      35      36      37      39      40
42      43      45      46      47      49      50

The printf prints a table, then cut selects the columns then sed selects the rows.
${Prows//,/p;}p translates to sed commands.
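To make that translation visible, here is a small sketch that prints the generated sed program before using it. The variable name sedcode is only for illustration, and the // substitution form needs bash, ksh or zsh rather than plain sh.

```shell
Prows=1,3,4,5
sedcode="${Prows//,/p;}p"     # replace every comma with "p;", then append a final "p"
printf '%s\n' "$sedcode"      # prints: 1p;3p;4p;5p

# sed -n suppresses automatic printing, so only the listed lines come out.
seq 1 6 | sed -n "$sedcode"   # prints 1, 3, 4 and 5, one per line
```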

Paul_Pedant · March 19, 2024, 9:38am

Sorry -- the declare is merely debug from the script I was withholding until we saw some effort from the OP. I prefer declare over printf because it shows names, quotes, arrays etc, which user debug tends to skimp on.

As the script is now out of the bag, this was mine (the positive "keep-these" version). I am still clinging to the "random" interpretation, simply because shuf is quite cute for generating bulk test data. The OP did mention "to reduce the sample space" which seems like the kind of thing you might prefer to randomise.

Rows="$( shuf -i 1-6 -n 4 | sort -n | sed 's/$/ p; /;' | tr -d '\n' )"
declare -p Rows

Cols="$( shuf -i 1-10 -n 7 | sort -n | sed 's/.*/&,/' | tr -d '\n' )"
Cols="${Cols%,}"
declare -p Cols

sed -n "${Rows}" Cut2D.in | cut -d ' ' -f "${Cols}"

I don't use sed often -- I forgot it will take a comma-separated list of row numbers to avoid those repeated p commands.

The negative "remove-these" version is very similar -- 3 changes. This version would be more compact, if you are removing only a few lines and columns:

.. Change the 'p' print to 'd' delete in the Row-generating sed command.

.. Remove the -n option in the final sed.

.. Add the --complement option to the cut.
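With those three changes applied, the "remove-these" version might look like the sketch below. It assumes GNU cut (for --complement) and generates the demo table inline instead of reading Cut2D.in; the counts 2 and 3 are arbitrary.

```shell
# Change 1: emit 'd' (delete) instead of 'p' (print) for each chosen row.
Rows="$( shuf -i 1-6 -n 2 | sort -n | sed 's/$/d;/' | tr -d '\n' )"

Cols="$( shuf -i 1-10 -n 3 | sort -n | sed 's/$/,/' | tr -d '\n' )"
Cols="${Cols%,}"              # drop the trailing comma

# Change 2: no -n on sed, so undeleted lines print by default.
# Change 3: --complement makes cut drop the listed fields instead of keeping them.
seq -f '%02g' 1 60 | xargs -n 10 |
  sed "${Rows}" | cut --complement -d ' ' -f "${Cols}"
```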

MadeInGermany · March 19, 2024, 10:51am

Really?? I am aware of one comma making a "from,to" range of lines.

A CPU cycle less is sed 's/$/,/'
And sed '$!s/$/,/' will omit the last comma.
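A quick side-by-side of the two variants, using three fixed numbers in place of the shuf output:

```shell
# Append a comma to every line: the joined result ends with a stray comma.
printf '%s\n' 2 5 9 | sed 's/$/,/'   | tr -d '\n'; echo   # prints: 2,5,9,

# $! means "every line except the last", so no trailing comma is produced.
printf '%s\n' 2 5 9 | sed '$!s/$/,/' | tr -d '\n'; echo   # prints: 2,5,9
```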

Thanks, quite useful!

rDamascena · March 19, 2024, 7:09pm

I seem to have found a solution.

shuf -i 1-60 | awk '{printf "%02d\n", $1}' | sort -n | xargs -L 10 > numbers.txt

I use the command above to create the text file with 6 lines and 10 columns.

From here, how do I proceed to use your solution? What is the next step?

Paul_Pedant · March 19, 2024, 11:03pm

My bad on the list of line numbers -- I missed your embedded global substitution. For some reason, I expected sed to behave for lines like cut does for fields: 2,4,7 for a list of single lines, 3-6 for a range of lines.
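The difference between the two tools can be seen on a small sample; the input data here is made up for the demonstration.

```shell
# cut: a comma separates list items, a dash makes a range.
echo 'a b c d e f g' | cut -d ' ' -f 2,4,7    # prints: b d g
echo 'a b c d e f g' | cut -d ' ' -f 3-6      # prints: c d e f

# sed: a comma makes a range; single lines need one command each.
seq 1 7 | sed -n '3,6p'         # prints lines 3, 4, 5 and 6
seq 1 7 | sed -n '2p;4p;7p'     # prints lines 2, 4 and 7
```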

Kudos for the sed '$!s/$/,/'. I suspect I am too old to learn all of sed's syntax.

Paul_Pedant · March 19, 2024, 11:24pm

You don't need to shuffle the lines and then sort them. Also, there is a Bash sequence expansion, and if you give that leading zeros it will format with leading zeros. So making the block of data could be done like:

echo {01..60} | xargs -n 10

I'm not sure what you want to achieve. I assumed you wanted to make a random choice of lines and columns, but your question can also be interpreted as wanting to choose those yourself. I also assumed the numbered table was just an example -- that the real data (the sample space) would be something like personal names, proteins, zipcodes, cities.

@MadeInGermany posted the complete thing -- the one that starts "Implementing the idea from the previous post". It starts by having you specify which columns and rows you want to keep. Then it produces the table with a printf and the expansion of the numbers, cuts out some columns, then cuts out some lines. The three commands are connected in a chain (called a pipeline) and output the cut-down table.

Can you state what else you think is missing ?

munkeHoller · March 19, 2024, 11:37pm

@Paul_Pedant , Given the essence of this forum is collaboration , I had asked the requester to share details/attempts and that posting solution(s) would not be forthcoming (at least from me) until they showed at least some attempt ... alas that request was ignored.

c'est la vie

rDamascena · March 20, 2024, 12:43am

I think I was confusing in my explanation.

How many rows and columns = predefined.

What rows and columns = random.

The choice of rows and columns to be deleted will happen completely randomly.

Only the amount of rows and columns that will be deleted is defined by me, but never which ones. They will be deleted completely at random, without my interference.

Only the quantity is pre-defined, but never which ones: between 1 and 6, for rows and between 1 and 10, for columns.

For example, I want 2 rows and 4 columns to be deleted: 2 between any of the 6 rows and 4 between any of the 10 columns.

Thank you for the help!

rDamascena · March 20, 2024, 7:38am

Based on the solutions that were suggested to me, I found another one that was enough for me. I made minor changes and the result was this:

Rows="$( shuf -i 1-6 -n 2 | sort -n | sed -e 's/$/d;/;' -e '$ s/.$//' | sed -z 's/\n//g' )"
declare -p Rows

Columns="$( shuf -i 1-10 -n 3 | sort -n | sed '$!s/$/,/' | sed -z 's/\n//g' )"
Columns="${Columns%,}"
declare -p Columns

sed "${Rows}" < input.txt | cut --complement -d " " -f "${Columns}"
declare -- Rows="1d;4d"
declare -- Columns="5,9,10"
11 12 13 14 16 17 18
21 22 23 24 26 27 28
41 42 43 44 46 47 48
51 52 53 54 56 57 58

MadeInGermany · March 20, 2024, 10:39am

Well done!

A generalized version; the constants are configured at the beginning.

# Number of columns and rows
nCOLS=10 nROWS=6
# Number of columns and rows to delete
nxCOLS=3 nxROWS=2
 
xrows="$( shuf -i 1-"$nROWS" -n "$nxROWS" | sort -n | sed 'H; 1h; $!d; x; s/\n/,/g' )"
echo "Delete rows: $xrows"
xcols="$( shuf -i 1-"$nCOLS" -n "$nxCOLS" | sort -n | sed 'H; 1h; $!d; x; s/\n/,/g' )"
echo "Delete columns: $xcols"

# Using the bash/ksh/zsh builtin // modifier
sedcode="${xrows//,/d;}d"
# Generate the input table
seq --format="%02.f" 1 $(( nCOLS * nROWS )) | xargs -n "$nCOLS" |
# Delete rows and columns from the input
  sed "$sedcode" | cut --complement -d " " -f "$xcols"

Explanation of the sed code:
H: append to the hold space, with a newline separator
1h: on input line 1, overwrite the hold space
$!d: unless it's the last line, delete and jump to the next input cycle (nothing is printed)
The remaining commands run only on the last input line:
x: get the hold space (exchange it with the pattern space)
s/\n/,/g: substitute newlines with commas
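The join step can be tried on its own with fixed numbers in place of the shuf output. As an alternative that is not used in the post above, paste -s joins lines with a chosen delimiter and gives the same result.

```shell
# Collect every line in the hold space, then turn the newlines into commas.
printf '%s\n' 2 5 9 | sed 'H; 1h; $!d; x; s/\n/,/g'    # prints: 2,5,9

# Same result with paste: -s joins all lines, -d, sets the delimiter.
printf '%s\n' 2 5 9 | paste -sd, -                      # prints: 2,5,9
```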

Paul_Pedant · March 20, 2024, 11:25am

That looks a little odd.

declare -p Rows just prints the variable Rows, but it does that in a format that you would use if you were creating the variable (quotes, full syntax for arrays etc).

$ Rows="3;4;5;6;9"
$ declare -p Rows
declare -- Rows="3;4;5;6;9"   # <= This is a diagnostic output from the -p.
$ declare -- Rows="Whatever"  # <= This is a command that is overwriting Rows
$ declare -p Rows
declare -- Rows="Whatever"

You either want the top 6 lines of code (which create the random values you asked for), or the two specific initialisations just before the table is output (which will always delete the same rows and columns). Not both.

As previously mentioned, using shuf and sort in the same pipeline, along with three sed expressions, is not needed. My best call is echo {01..60} | xargs -n 10.