For loop and read from different directories

baris35 · July 27, 2017, 6:59pm

Hello,
I need to grep/read files from multiple directories, one file at a time, taking turns between the directories.
I mean something like interleaving cards from several decks evenly.

cd /directoryA
for i in *.txt;
do
some codes 
cd ../directoryB
for i in *.txt;
do
some codes
cd ../directoryC
for i in *.txt;
do
some codes
done
done
done

Directory A includes (let's say 100 files in total):

1.txt
2.txt
3.txt
4.txt
5.txt
..
..
..
100.txt

Directory B includes (let's say 30 files in total):

a.txt
b.txt
c.txt
d.txt
e.txt
..
..
..
z.txt

Directory C includes (let's say 900 files in total):

101.txt
102.txt
103.txt
104.txt
105.txt
..
..
..
1000.txt

In the first loop iteration, 1.txt should be read from directory A:

1st loop -> directory A, file 1.txt
2nd loop -> directory B, file a.txt
3rd loop -> directory C, file 101.txt
4th loop -> directory A, file 2.txt

If all files in a folder have already been read in previous iterations, skip that folder.

I'd appreciate it if you could explain how I can do that. The nested for loops are causing problems. Other solutions are also welcome.

Many thanks
Boris

rovf · July 28, 2017, 2:37am

Let me outline the basic idea:

I would first read the content of each directory into a separate array, in this case 3 arrays (A, B, C). Then I would set up an associative array (let's call it SEEN), where I enter the basename from each file which has already been processed. Checking SEEN before processing a file allows me to skip the files which have already been processed. I then would have a single loop, ranging over the indexes of the longest array. Inside the loop, I would use the loop index to access the arrays A, B and C.

One design decision is whether the number of directories is always constant (3) or can vary. Unless there is an inherent reason why it must be 3 directories and not 2 or 4, I would make this variable too.

Next comes the choice of programming language. You need a language which supports arrays and associative arrays. For shell scripting, that means you can use Zsh or bash or - I think - ksh.

If you decide to make the number of directories variable too, you **can** do it in a shell script, but I find it a bit inconvenient. For this type of task, I would consider a more general programming language, such as Ruby or Perl.
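
For the fixed case of three directories, the idea above could be sketched in bash roughly as follows. This is only an illustration, not a tested solution for the real data: the function name interleave3 and the throw-away demo directories are made up here, and instead of a SEEN array the sketch simply checks each array's length, which has the same effect of skipping a directory once it is exhausted.

```shell
#!/usr/bin/env bash
# Sketch only: interleave3 and the demo layout below are hypothetical names.
shopt -s nullglob   # an empty directory gives an empty array, not a literal '*.txt'

# Print the *.txt files of three directories in round-robin order, one path per line.
interleave3() {
  local A=("$1"/*.txt) B=("$2"/*.txt) C=("$3"/*.txt)
  local i max=${#A[@]}
  # The single loop ranges over the indexes of the longest array.
  (( ${#B[@]} > max )) && max=${#B[@]}
  (( ${#C[@]} > max )) && max=${#C[@]}
  for (( i = 0; i < max; i++ )); do
    # A directory that has run out of files is simply skipped.
    (( i < ${#A[@]} )) && printf '%s\n' "${A[i]}"
    (( i < ${#B[@]} )) && printf '%s\n' "${B[i]}"
    (( i < ${#C[@]} )) && printf '%s\n' "${C[i]}"
  done
  return 0
}

# Demo on throw-away directories; replace the echo with the real per-file work.
demo=$(mktemp -d)
mkdir "$demo/A" "$demo/B" "$demo/C"
touch "$demo/A/1.txt" "$demo/A/2.txt" "$demo/A/3.txt" \
      "$demo/B/a.txt" "$demo/C/101.txt" "$demo/C/102.txt"
interleave3 "$demo/A" "$demo/B" "$demo/C" | while IFS= read -r f; do
  echo "processing $f"
done
```

Note that the glob sorts names as text, so with the real data 10.txt would come before 2.txt. If numeric order matters, the arrays would have to be sorted differently before the loop.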

baris35 · July 28, 2017, 4:41am

Hello Rovf,
Thanks for your answer. I will try a different method.
I am gonna mark the thread as solved now.

Hello Rbatte1,
I am sorry that I left out the code tags in the post body.

Kind regards
Boris

rbatte1 · July 28, 2017, 4:49am

Hello Boris/baris35,

I think I understand the principle, but as a starting point, let me indent your code for clarity:-

cd /directoryA
for loopa in *.txt;
do
   some codes 
   cd ../directoryB
   for loopb in *.txt;
   do
      some codes
      cd ../directoryC
      for loopc in *.txt;
      do
         some codes
      done
   done
done

As you can see, I have given each loop its own variable; otherwise the results would be unpredictable.

With this, you would be trying to read everything in directoryC 3,000~ish times (once for every file in directoryA multiplied by every file in directoryB). Is this really what you want?

You also have the problem that you change directory just before each inner loop, but on leaving the loop you do not change back. So for the second and later iterations over directoryB (e.g. file b.txt), your shell would be in directoryC. For the second and later iterations over directoryA (e.g. 2.txt), your shell would also be in directoryC, so your some codes statements would have to handle being in various places. The file list of each for loop has already been expanded by then, so the loop as a whole will iterate over the names you gave it, but very likely in the wrong directory.

If you just want to process each file once, you need to move the done statements, which would give you this:-

cd /directoryA
for i in *.txt;
do
   some codes 
done

cd ../directoryB
for i in *.txt;
do
   some codes
done

cd ../directoryC
for i in *.txt;
do
   some codes
done

If you really do want to process every file in directoryC 3,000~ish times, consider using pushd & popd to handle directory transitions like this:-

pushd /directoryA
for loopa in *.txt;
do
   some codes 
   pushd /directoryB
   for loopb in *.txt;
   do
      some codes
      pushd /directoryC
      for loopc in *.txt;
      do
         some codes
      done
      popd
   done
   popd
done
popd

They will handle the moving in and out of directories safely. It is better to use a fully qualified directory path rather than assuming where you are and issuing cd ../directoryX.
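
If each file only needs to be processed once, the directory changes can be avoided altogether by putting the fully qualified path into the glob. A small sketch of that, where the function name list_txt is made up for illustration and the directories live in a throw-away demo area rather than the real /directoryA and friends:

```shell
#!/usr/bin/env bash
# Sketch only: list_txt is a hypothetical helper, not part of any tool.

# Print every *.txt file under the given directories without ever calling cd.
list_txt() {
  local dir f
  for dir in "$@"; do
    for f in "$dir"/*.txt; do
      [ -e "$f" ] || continue   # glob matched nothing: skip the literal pattern
      printf '%s\n' "$f"
    done
  done
}

# Demo on throw-away directories; the working directory never changes.
demo2=$(mktemp -d)
mkdir "$demo2/directoryA" "$demo2/directoryB"
touch "$demo2/directoryA/1.txt" "$demo2/directoryB/a.txt"
start=$PWD
list_txt "$demo2/directoryA" "$demo2/directoryB"
[ "$PWD" = "$start" ] && echo "still in $start"
```

Because the shell never leaves its starting directory, there is nothing to undo after each loop, and the some codes part always sees a full path in the loop variable.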

Sorry I've gone on for a while, but I hope that this helps.

Can you tell us what you are actually trying to achieve? If you give us the logical steps you want to perform, we can work on it a bit better.

Kind regards,
Robin

MadeInGermany · July 29, 2017, 3:08am

The following bash script is typed from a mobile device and untested.
It uses 3 extra file handles, so it can read in a round-robin order.

while
  read -r a <&3; aexit=$?
  read -r b <&4; bexit=$?
  read -r c <&5; cexit=$?
  [ $aexit -eq 0 ] || [ $bexit -eq 0 ] || [ $cexit -eq 0 ]
do
  [ $aexit -eq 0 ] && printf "%s\n" "$a"
  [ $bexit -eq 0 ] && printf "%s\n" "$b"
  [ $cexit -eq 0 ] && printf "%s\n" "$c"
done 3< <( ls A/ ) 4< <( ls B/ ) 5< <( ls C/ )

and produces the following order

directory A, file 1.txt
directory B, file a.txt
directory C, file 101.txt
directory A, file 2.txt
directory B, file b.txt
directory C, file 102.txt
...

Improvements are welcome.

drl · July 30, 2017, 2:49pm

Hi.

Apologies for the lengthy post.

This is probably not an improvement so much as a meta-answer: it shows how we go about solving problems like this by generalizing.

I'll assume that you can capture the contents of each directory in a file; one such command might be:

ls -1 A > data1

and so on for B, C, etc.

We would then use a local code, gather, to obtain at least one line from each of the files data1, data2, etc.

The companion program is scatter.

As I have noted before, we have not yet decided to publish our codes, but when we post meta-solutions like this, we can post the documentation. Then folks can decide whether this is a reasonable approach for them to pursue.

Here is a demonstration that exercises gather:

 
#!/usr/bin/env bash

# @(#) s1       Demonstrate intersperse, shuffle of lines, gather.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C gather dixf scatter

FILE=${1-data1}

pl " Input data files data*:"
head data*

pl " Results, 1 item from each list:"
gather data*

pl " Results, 2 from column A, one from B and C:"
gather data1:2 data2 data3

pl " Results, as previous, but separator is newline:"
gather -s '\n' data1:2 data2 data3

pl " Help from gather:"
gather -h

pl " Details about gather, scatter:"
dixf gather scatter

exit 0

producing:

$ ./s1
Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution        : Debian 8.8 (jessie) 
bash GNU bash 4.3.30
gather (local) 1.1
dixf (local) 1.49
scatter (local) 1.4

-----
 Input data files data*:
==> data1 <==
1.txt
2.txt
3.txt
4.txt
5.txt

==> data2 <==
a.txt
b.txt
c.txt
d.txt
e.txt

==> data3 <==
101.txt
102.txt
103.txt
104.txt
105.txt

-----
 Results, 1 item from each list:
1.txt
a.txt
101.txt
2.txt
b.txt
102.txt
3.txt
c.txt
103.txt
4.txt
d.txt
104.txt
5.txt
e.txt
105.txt

-----
 Results, 2 from column A, one from B and C:
1.txt 2.txt
a.txt
101.txt
3.txt 4.txt
b.txt
102.txt
5.txt 103.txt
c.txt
104.txt
d.txt
105.txt
e.txt

-----
 Results, as previous, but separator is newline:
1.txt
2.txt
a.txt
101.txt
3.txt
4.txt
b.txt
102.txt
5.txt
103.txt
c.txt
104.txt
d.txt
105.txt
e.txt

-----
 Help from gather:

gather - Read, shuffle, weave, intersperse lines from multiple inputs to STDOUT.

usage: gather [options] -- [files]

options:

--separator=SEP
  Set output token separator to SEP, default " ", and "\n" is
  accepted as NEWLINE.  The SEP is used when more than one line is
  read from a file. This allows one to easily capture short lines
  into a single output line.

--help (or -h)
  print this message and quit.

[files]
  filename1:count1 filename2:count2 ... filename3:count3
  Each filenamei will be read in succession, and will be written
  to standard output, continuing until EOF on every file.


-----
 Details about gather, scatter:

gather	Read, shuffle, weave, intersperse lines from multiple input files. (what)
Path    : ~/bin/gather
Version : 1.1
Length  : 224 lines
Type    : Perl script, ASCII text executable
Shebang : #!/usr/bin/env perl
Help    : probably available with -h,--help
Modules : (for perl codes)
 strict	1.08
 warnings	1.23
 English	1.09
 Carp	1.3301
 Getopt::Long	2.42
 feature	1.36_01

scatter	Write, deal, unravel disperse lines to multiple output files. (what)
Path    : ~/bin/scatter
Version : 1.4
Length  : 190 lines
Type    : Perl script, ASCII text executable
Shebang : #!/usr/bin/env perl
Modules : (for perl codes)
 strict	1.08
 warnings	1.23
 English	1.09
 Carp	1.3301
 Data::Dumper	2.151_01
 Getopt::Long	2.42
 feature	1.36_01

With a shell that supports process substitution (bash, zsh, ksh93), you can also feed command output in directly, like this:

$ gather <( ls -1 A ) <( ls -1 B )
1.txt
a.txt
2.txt
b.txt
3.txt
c.txt
4.txt
d.txt
5.txt
e.txt

Best wishes ... cheers, drl

baris35 · August 2, 2017, 3:40pm

Hello All,
Thanks for your comments.

I'd like to inform you that MadeInGermany's script worked out as expected.
As I am not familiar with the other kinds of software and scripts, I am unable to provide feedback about the output of the other alternatives.

Thank You All for your valuable comments / recommendations

Kind regards
Boris

MadeInGermany · August 14, 2017, 9:15am

paste also reads its input files (here: the output of the ls commands) in round-robin order:

paste -d '\n' <( ls A/ ) <( ls B/ ) <( ls C/ )

Empty lines might occur if an input file is exhausted; you can discard them with

paste -d '\n' <( ls A/ ) <( ls B/ ) <( ls C/ ) | grep .
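
To act on each file rather than just list it, the paste output can be fed into a while read loop. The sketch below is one way to do that, under a few assumptions: it globs full paths with printf instead of calling ls, the function name paste_round_robin is made up, the demo directories are throw-away ones, and file names are assumed not to contain newlines.

```shell
#!/usr/bin/env bash
# Sketch only: paste_round_robin is a hypothetical helper name.

# Print the *.txt files of three directories in round-robin order using paste.
paste_round_robin() {
  paste -d '\n' \
    <(printf '%s\n' "$1"/*.txt) \
    <(printf '%s\n' "$2"/*.txt) \
    <(printf '%s\n' "$3"/*.txt) |
  while IFS= read -r f; do
    # Skips the empty lines from exhausted inputs and any unmatched glob pattern.
    [ -f "$f" ] || continue
    printf '%s\n' "$f"      # replace with the real per-file work
  done
}

# Demo on throw-away directories of unequal size.
demo3=$(mktemp -d)
mkdir "$demo3/A" "$demo3/B" "$demo3/C"
touch "$demo3/A/1.txt" "$demo3/A/2.txt" "$demo3/A/3.txt" \
      "$demo3/B/a.txt" "$demo3/C/101.txt" "$demo3/C/102.txt"
paste_round_robin "$demo3/A" "$demo3/B" "$demo3/C"
```

The [ -f "$f" ] test does the job of grep . here, and because the lines carry the directory part, the loop body can open each file without any cd.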

baris35 · August 15, 2017, 3:27pm

Hello,
I am sorry for my delayed reply.
I will test it tomorrow and keep you posted about the result.

Thanks
Boris

---------- Post updated 08-15-17 at 02:27 PM ---------- Previous update was 08-14-17 at 07:07 PM ----------

Hello MadeInGermany,
This is exactly what I wanted to accomplish!

Here is the output:

root@localhost:/home/ubuntu# paste -d '\n' <( ls 1/ ) <( ls 2/ ) <( ls 3/ )
aa
11
xx
bb
22
yy
cc
33
zz

Thank you so much for your help!

Kind regards
Boris