Search text file for lines present in another text file and redirect output

Hi, I am Rupesh from India. I have a system with an Intel i3 10th-gen processor and an Asus Prime H510M-E motherboard. I have installed Arch Linux and it is working fine.

I have two text files, input1.txt and input2.txt. For each line of input1.txt, I want to check whether it is present in input2.txt, and redirect the lines that are not found to another text file, missing.txt.

input1 text file consists of lines in the following pattern.

This is a video file 1[abcdef].mp4
This is a video file 2[ghijklm].mp4
This is a video file 3[nopqrst].mp4
This is a video file 4[uvwxyz].mp4

input2 text file consists of lines in the following pattern.

This is a video file 1.mp4
This is a video file 2.mp4
This is a video file 4.mp4

Note that one line is missing from the second text file, i.e. "This is a video file 3.mp4".

The second text file consists of the lines from the first text file with the last few characters removed.

My requirement is a script that searches input1.txt for lines containing "This is a video file 1", then for "This is a video file 2", and so on.

If a match is found, the line is ignored. If no match is found, the line must be redirected to another text file called missing.txt.

In the above case missing.txt must contain the following line

This is a video file 3[nopqrst].mp4

You may suggest the comm or diff utilities, but they are not applicable here because some characters are missing from the lines in the second text file.

You may even suggest a text editor like gedit, but I have 1000 lines in the first text file and 800 lines in the second.

I want all 200 missing lines in missing.txt.

I have a script with a while loop that reads all the lines of the second text file, input2.txt.

#!/bin/bash

# IFS= and -r keep leading blanks and backslashes in each line intact
while IFS= read -r name; do

   echo "$name"

done < input2.txt

But I don't know how to search input1.txt or how to use the if construct.

At present I am studying the Linux operating system and its utilities from the beginning; so far I have covered how Linux boots and an introduction to systemd. It will take time before I can write such a script on my own.

Kindly provide a script that searches input1.txt and redirects the unmatched lines to missing.txt.

Regards,
Rupesh.

Sounds like this thread is related to your previous post from 2 years ago - at least on the surface.
You'd need to put a little bit more effort into the "comparison" part though and I'd not do it in pure shell either - use other Linux text processing tools/utilities.


Hi @rupeshforu3,

Here is a very quick proof of concept to demonstrate that this is possible to do from shell:

% cat input1.txt | while read LINE; do fgrep "$(echo ${LINE} | sed 's/\[.*//')" input2.txt > /dev/null || echo "${LINE}" >> missing.txt; done

% cat missing.txt
This is a video file 3[nopqrst].mp4

I used your input1.txt and input2.txt.

As @vgersh99 alluded to, this is somewhat fragile to do in shell, especially with the need to modify the line to look for a subset of the text.

N.B. you will want to read the larger / superset file and look for its lines in the smaller / subset file to identify the lines that exist in the former but are missing from the latter.

Thanks for your positive response.

You're welcome.

Please remember to let us know what worked for you.

If appropriate, please mark a suggestion as the solution.

This might be just a teeny-tiny bit faster than running echo+sed a few hundred times:

while read -r LINE; do
  grep -F "${LINE%[*}" input2.txt > /dev/null || echo "$LINE" >> missing.txt;
done < input1.txt

Not much of an improvement, yet still an improvement :slight_smile:


If you have wdiff, this'll get you most of the way:

wdiff input1.txt input2.txt |grep -v -E '{.*}$'
This is a video file [-3[nopqrst].mp4

Suggestions from other people are as follows

sed 's/\([0-9][0-9]*\).*\./\1./' <(sort input1.txt)  | comm -3 - <(sort input2.txt) 

while read -r line1; do
    if ! grep -q "$line1" input2.txt; then
        echo "$line1" >> missing.txt
    fi
done < input1.txt

diff -u -d -c 0 input1.txt input2.txt

This will not work, as mentioned in the first post: it returns only "truncated" file names, e.g. This is a video file 3.mp4 (missing [nopqrst]), which exist in neither input1.txt nor input2.txt.

This loop will not work correctly, as it searches input2.txt for the complete lines from input1.txt, and no line in input2.txt contains an [abcdef] segment. The correct loop(s) was/were already suggested.

This set of flags cannot be used with the diff command, because some of them are mutually exclusive/conflicting. Is this a contest of "how many failed attempts it's possible for inexperienced users to come up with"? :smiley:


Hi, the code you provided works fine for text files with short lines.

The code provided by you and by other people does not work on files with long lines, meaning that input1.txt contains lines of more than 100 characters.

Bash is not working properly for files with more than 600 lines or with lines longer than 100 columns.

The script I am using worked properly for short files, but it does not work for longer files. I am going to illustrate step by step what actually happens below.

The following are the contents of the current directory with file size.

[work@Rupesh work3]$ ls -l
total 12
-rw-r--r-- 1 work work 446 Apr  9 19:24 input1.txt
-rw-r--r-- 1 work work 390 Apr  9 19:24 input2.txt
-rwxr-xr-x 1 work work 142 Apr  9 19:10 script.sh

The following are the contents of the text file called input1.txt

[work@Rupesh work3]$ cat input1.txt 
Abhaya Veeranjaneya Swami Temple _ Pallagiri Konda _ Nandigama _ Teerthayatra _ 26th July 2022 _ AP [z-mTkgzd9dE].mp4
Acchamamba Perantalu Temple _ Gudipudi _ Guntur _ Teerthayatra _ 30th March 2021 _AP [BHWdnosP2mw].mp4
Achanteswara Swamy Temple _ Achanta _ West Godavari Dist _ Teerthayatra _ 24th April 2023 _ AP [MgQo5Tm726Y].mp4
Adiyogi Shiva Devalayam _ Coimbatore _ Tamil Nadu _ Teerthayatra _ 18th November 2022 _ETV AP [peKr3q83nvM].mp4

The following are the contents of the text file called input2.txt

[work@Rupesh work3]$ cat input2.txt 
Abhaya Veeranjaneya Swami Temple _ Pallagiri Konda _ Nandigama _ Teerthayatra _ 26th July 2022 _ AP.mp4
Acchamamba Perantalu Temple _ Gudipudi _ Guntur _ Teerthayatra _ 30th March 2021 _AP.mp4
Adiyogi Shiva Devalayam _ Coimbatore _ Tamil Nadu _ Teerthayatra _ 18th November 2022 _ETV AP.mp4

The following are the contents of the script file, which previously succeeded but now does not.

[work@Rupesh work3]$ cat script.sh 
cat input1.txt | while read LINE; do grep -F "$(echo ${LINE} | sed 's/\[.*//')" input2.txt > /dev/null || echo "${LINE}" >> missing.txt; done

The following shows that the script executed fine

[work@Rupesh work3]$ ./script.sh 

The following is the current directory listing including file size.

[work@Rupesh work3]$ ls -l
total 16
-rw-r--r-- 1 work work 446 Apr  9 19:24 input1.txt
-rw-r--r-- 1 work work 390 Apr  9 19:24 input2.txt
-rw-r--r-- 1 work work 446 Apr  9 19:29 missing.txt
-rwxr-xr-x 1 work work 142 Apr  9 19:10 script.sh

After executing the script, missing.txt should contain only the one line that is not found in input2.txt, but instead it contains all the lines of input1.txt.

[work@Rupesh work3]$ cat missing.txt 
Abhaya Veeranjaneya Swami Temple _ Pallagiri Konda _ Nandigama _ Teerthayatra _ 26th July 2022 _ AP [z-mTkgzd9dE].mp4
Acchamamba Perantalu Temple _ Gudipudi _ Guntur _ Teerthayatra _ 30th March 2021 _AP [BHWdnosP2mw].mp4
Achanteswara Swamy Temple _ Achanta _ West Godavari Dist _ Teerthayatra _ 24th April 2023 _ AP [MgQo5Tm726Y].mp4
Adiyogi Shiva Devalayam _ Coimbatore _ Tamil Nadu _ Teerthayatra _ 18th November 2022 _ETV AP [peKr3q83nvM].mp4
[work@Rupesh work3]$ 

Previously I created a similar thread related to package searching in Arch Linux. The same thing happened in that case as well.

I request all of you to copy the contents of input1.txt, input2.txt and script.sh and run the script on your system.

If you also get the wrong result, there is a need to modify the original source code of bash.

This thread must be forwarded to the bash maintainer so that he can make it work properly.

I am ready to answer any questions.

If you had mentioned that there are multiple whitespace characters (spaces) in each very long filename, you'd have quickly avoided the misunderstanding. There's nothing wrong with the bash shell, just insufficient information given initially about your actual input data - no shell maintainer involvement required :slight_smile:
Removing the spaces that precede each [ is an easy fix. But I'll let you try figuring it out on your own first, as you probably have enough guidance from all previous replies.
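For readers following along, here is a minimal sketch of why a space before the [ defeats the earlier loops. It only shows the cause, not the fix. The two names are shortened versions of the ones posted above, and the variable names (line, target, prefix, result) are made up for the illustration.

```shell
#!/bin/bash
# Illustration only: shortened sample names, hypothetical variable names.
line='Acchamamba Perantalu Temple _ 30th March 2021 _AP [BHWdnosP2mw].mp4'
target='Acchamamba Perantalu Temple _ 30th March 2021 _AP.mp4'

# Cut everything from the last "[" to the end of the line.
prefix="${line%\[*}"

# The angle brackets make the leftover trailing space visible.
printf 'prefix: <%s>\n' "$prefix"

# grep -F treats the prefix as literal text and looks for it as a substring.
if printf '%s\n' "$target" | grep -qF -- "$prefix"; then
    result='found'
else
    result='not found'
fi
printf 'result: %s\n' "$result"
```

The prefix ends in "_AP " (with a space), while the target continues with "_AP.mp4", so the literal search reports "not found" even though the two names describe the same file.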


@rupeshforu3, why should the forum do ALL your work for you while you simply assume that you can give directions/orders? Make a concerted effort of your own and post that; then maybe folks might be compelled to help.


Thank you for the optimization @Matt-Kita.

I spent some time trying to get the expansion substitution to work, but couldn't get it to my satisfaction so I fell back to and punted with sed.

#incrementalImprovement

Ok thanks for your suggestion and patience.

Your question is why you should do my homework for me, etc.

So I am aborting this work; after learning bash scripting thoroughly I will write a script on my own.

My final question is why a piece of code works when we supply short input and why the same code doesn't work for long input.

You may ask me to provide the output of the command etc., and I am ready to provide it, but you may not have time to examine what is happening.

Thanks for your patience and I will not ask any questions related to the current topic.

That's difficult to say.

We'd need to see which code you're using as well as the long lines of input and output that you're seeing.

There are several things that may affect the outcome; two of the big ones are special characters and quoting.
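As a small illustration of the quoting point, the sketch below compares an unquoted and a quoted expansion of the same line. The sample name is made up and contains runs of spaces on purpose; whether the real file names contain such runs is not known from this thread.

```shell
#!/bin/bash
# Illustration only: a hypothetical name with runs of spaces.
line='Sample  Temple   2022 _ AP.mp4'

# Unquoted: the shell splits the value into words and echo re-joins them
# with single spaces, so the text handed to grep no longer matches the file.
# (An unquoted "[abc]" would additionally be treated as a glob pattern.)
unquoted=$(echo ${line})

# Quoted: the value is passed through unchanged.
quoted=$(printf '%s\n' "${line}")

printf 'unquoted: <%s>\n' "$unquoted"
printf 'quoted:   <%s>\n' "$quoted"
```

The same care applies to reading the file: `while IFS= read -r LINE` keeps leading blanks and backslashes, which a plain `read LINE` does not.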

Hi, I think that if we follow the correct procedure the issue can be resolved, and I think that bash is working fine.

  1. First, for each line in input1.txt, the square brackets and the characters within them must be removed.

  2. The trailing space character must be removed.

  3. The resulting line must be searched for within input2.txt.

  4. If the line is found in input2.txt, the line must be ignored.

  5. If the line is not found, the original line from input1.txt, including the square brackets and their contents, must be redirected to missing.txt.

Previously I used the following code to remove the square brackets, including their contents, from file names.

for x in *.mp4; do mv "$x" "${x// \[*\]/}"; done

We must modify the above code and use it in step 1 and step 2.
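A minimal sketch of how the rename expansion could be applied to a variable instead of a file name, covering steps 1 and 2. The function name strip_id and the variable names are hypothetical, and the sketch assumes every line ends in ".mp4" and has at most one bracketed part.

```shell
#!/bin/bash
# Sketch only: strip_id is a made-up helper, not part of any tool.
strip_id() {
    local name stem
    # Step 1: delete "[...]" wherever it sits (no space required before "[").
    name="${1//\[*\]/}"
    # Set the extension aside so that blanks before it become trailing blanks.
    stem="${name%.mp4}"
    # Step 2: drop trailing blanks. The inner expansion yields only the
    # trailing run of blanks, which the outer expansion then removes.
    stem="${stem%"${stem##*[![:blank:]]}"}"
    printf '%s\n' "${stem}.mp4"
}

with_space=$(strip_id 'This is a video file 1 [abcdefg].mp4')
no_space=$(strip_id 'This is a video file 1[abcdef].mp4')

printf '%s\n' "$with_space" "$no_space"
```

Both calls produce the same text, so lines with and without a space before the [ end up with the same search key.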

I have thought about how to solve the problem and came up with the above. I am sharing this to show my effort.

Let me try and see if I can succeed. The problem is that I am not as talented as you. I have not studied sed, awk, grep etc.

For the past 4 days I have been working on this problem without success, which is why I commented that bash is not working properly. Sorry for that.

@rupeshforu3 , it is good that you are working to get a solution, we are happy to help when the process is one of collaboration.

Simple: the sample supplied is NOT REPRESENTATIVE of the data to be processed. Therefore, a generalised solution cannot be provided when the shape of the input is unknown.

Given the 'latest' info you gave wrt data in the input1/2 files, below is another potential solution; when you share your solution I can reveal it.

cat input1.txt 
This is a video file 1[abcdef].mp4
This is a video file 01[abcdef].mp4
This is a video file 2[ghijklm].mp4
This is a video file 3[nopqrst].mp4
This is a video file 4[uvwxyz].mp4
This is a video file 3334[uvwxyz].mp4
This is a video file 0013334[z].mp4
This is a video file 88888999999[uvwxyz].mp4
This is a video file 55[nopqrst].mp4
This is a video file 65[nopqrst].mp4
Abhaya Veeranjaneya Swami Temple _ Pallagiri Konda _ Nandigama _ Teerthayatra _ 26th July 2022 _ AP [z-mTkgzd9dE].mp4
Acchamamba Perantalu Temple _ Gudipudi _ Guntur _ Teerthayatra _ 30th March 2021 _AP [BHWdnosP2mw].mp4
Achanteswara Swamy Temple _ Achanta _ West Godavari Dist _ Teerthayatra _ 24th April 2023 _ AP [MgQo5Tm726Y].mp4
Adiyogi Shiva Devalayam _ Coimbatore _ Tamil Nadu _ Teerthayatra _ 18th November 2022 _ETV AP [peKr3q83nvM].mp4
hunkemoller@Z800:~/cuc/393779$ cat input2.txt 
This is a video file 1.mp4
This is a video file 2.mp4
This is a video file 4.mp4
This is a video file 6.mp4
This is a video file 5.mp4
Abhaya Veeranjaneya Swami Temple _ Pallagiri Konda _ Nandigama _ Teerthayatra _ 26th July 2022 _ AP.mp4
Acchamamba Perantalu Temple _ Gudipudi _ Guntur _ Teerthayatra _ 30th March 2021 _AP.mp4
Adiyogi Shiva Devalayam _ Coimbatore _ Tamil Nadu _ Teerthayatra _ 18th November 2022 _ETV AP.mp4

comm -3 <(sed 's/\([0-9|]\)\[.*\]/\1/; s/[[:blank:]]\[.*]//' <(sort input1.txt)) <(sort input2.txt)|grep --color=never -E '^[^[:blank:]]'

Achanteswara Swamy Temple _ Achanta _ West Godavari Dist _ Teerthayatra _ 24th April 2023 _ AP.mp4
This is a video file 0013334.mp4
This is a video file 01.mp4
This is a video file 3334.mp4
This is a video file 3.mp4
This is a video file 55.mp4
This is a video file 65.mp4
This is a video file 88888999999.mp4

Hi, I have finally created a small script based on my past experience and on suggestions from people like you. It is as follows.

#!/bin/bash

rm missing.txt

while read LINE; do
   
#   echo $LINE
   NAME="${LINE// \[*\]/}"
   NAME2=${NAME%????}
   NAME3="$(echo -e "${NAME2}" | sed -e 's/^[[:space:]]*//')"
   NAME4="${NAME3}.mp4"
#   echo $NAME2
#   echo ${NAME}
   grep -F "$(echo ${NAME4})" input2.txt > /dev/null || echo "${LINE}" >> missing.txt
done < input1.txt

Many of you suggested single-line code which works some of the time but not always.

But the above code works in all situations. I have created this script step by step.

  1. First I have removed square brackets including its contents from each line of input1.txt using the following code.

NAME="${LINE// \[*\]/}"

  2. I tried to remove trailing space from each line using the following code.

   NAME2=${NAME%????}
   NAME3="$(echo -e "${NAME2}" | sed -e 's/^[[:space:]]*//')"
   NAME4="${NAME3}.mp4"

  3. Finally I have searched all lines of input2.txt containing the contents of variable NAME4 using the following code.

grep -F "$(echo ${NAME4})" input2.txt > /dev/null || echo "${LINE}" >> missing.txt

The most important step is removing the square brackets, including their contents, in step 1.

The above script works fine except for step 2: the code in step 2 can't remove the trailing space.

By trailing space I mean the following.

input1.txt consists of some lines with pattern

This is a video file 1 [abcdefg].mp4

input2.txt consists of lines with pattern

This is video file 1.mp4

The script should ignore this line and not redirect it to missing.txt, but it is being redirected.

I have included code to remove trailing white space before performing the grep. I think the trailing white space has been removed, but the redundant line is still being redirected to missing.txt.

I think that if the script ran properly, missing.txt would contain only 80 lines, but at present it contains 165 lines.

Something is better than nothing, and you may be irritated if I keep saying "not working, not working". So I am aborting the current work.

If you still want to know what's not working, my answer is that the search does not work for lines in input1.txt that contain white space before the square brackets.

Thanks for your patience.
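A few details in the script above may explain the extra lines: the sed expression `s/^[[:space:]]*//` is anchored at the start of the line, so it strips leading rather than trailing blanks; the pattern ` \[*\]` only matches when a space precedes the [; the unquoted `$(echo ${NAME4})` collapses runs of spaces; and `grep -F` without `-x` accepts substring matches. A possible corrected loop is sketched below. The file names are the ones used in this thread, the sample data is made up, and the sketch assumes every name ends in ".mp4".

```shell
#!/bin/bash
# Sketch only: runs in a scratch directory with made-up sample data.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

cat > input1.txt <<'EOF'
This is a video file 1 [abcdefg].mp4
This is a video file 2[ghijklm].mp4
This is a video file 3 [nopqrst].mp4
This is a video file 11[uvwxyz].mp4
EOF
cat > input2.txt <<'EOF'
This is a video file 2.mp4
This is a video file 3.mp4
This is a video file 11.mp4
EOF

: > missing.txt                  # start with an empty result file

while IFS= read -r line; do      # IFS= and -r keep the line exactly as stored
    stem="${line%.mp4}"          # set the extension aside
    stem="${stem%\[*\]}"         # drop a trailing "[id]", if there is one
    stem="${stem%"${stem##*[![:blank:]]}"}"   # drop blanks left before the "["
    key="${stem}.mp4"
    # -F: literal text, -x: the whole line must match, -q: no output
    grep -Fxq -- "$key" input2.txt || printf '%s\n' "$line" >> missing.txt
done < input1.txt

cat missing.txt
```

With this sample, only the "file 1" line lands in missing.txt. A search for the bare prefix "This is a video file 1" would have been satisfied by "This is a video file 11.mp4", which is why the sketch compares whole lines instead.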

Shouldn't it be as simple as this (given your sample inputs)?
awk -F'[][]' -f rupesh.awk file2.txt file1.txt where rupesh.awk is:

FNR==NR {
   f2[$0]
   next
}
!(($1 $3) in f2)

yielding:

This is a video file 01[abcdef].mp4
This is a video file 3[nopqrst].mp4
This is a video file 3334[uvwxyz].mp4
This is a video file 0013334[z].mp4
This is a video file 88888999999[uvwxyz].mp4
This is a video file 55[nopqrst].mp4
This is a video file 65[nopqrst].mp4
Abhaya Veeranjaneya Swami Temple _ Pallagiri Konda _ Nandigama _ Teerthayatra _ 26th July 2022 _ AP [z-mTkgzd9dE].mp4
Acchamamba Perantalu Temple _ Gudipudi _ Guntur _ Teerthayatra _ 30th March 2021 _AP [BHWdnosP2mw].mp4
Achanteswara Swamy Temple _ Achanta _ West Godavari Dist _ Teerthayatra _ 24th April 2023 _ AP [MgQo5Tm726Y].mp4
Adiyogi Shiva Devalayam _ Coimbatore _ Tamil Nadu _ Teerthayatra _ 18th November 2022 _ETV AP [peKr3q83nvM].mp4

or am I missing something obvious?
P.S. please don't mark a post as a "Solution" if you don't have one - it will be confusing for the others later on (and present).
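In the output above, the three temple names that do exist in input2.txt are still listed, because $1 keeps the space that precedes the [. A hedged variant of the same awk idea that trims that space is sketched below. The array name seen, the variables stem and key, and the shortened sample data are illustrative only; the file names follow the awk command above.

```shell
#!/bin/bash
# Sketch only: runs in a scratch directory with shortened, made-up sample data.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

cat > file1.txt <<'EOF'
This is a video file 1[abcdef].mp4
This is a video file 3[nopqrst].mp4
Abhaya Veeranjaneya Swami Temple _ 26th July 2022 _ AP [z-mTkgzd9dE].mp4
Achanteswara Swamy Temple _ 24th April 2023 _ AP [MgQo5Tm726Y].mp4
EOF
cat > file2.txt <<'EOF'
This is a video file 1.mp4
Abhaya Veeranjaneya Swami Temple _ 26th July 2022 _ AP.mp4
EOF

awk -F'[][]' '
FNR == NR { seen[$0]; next }       # first file: remember every line
{
    stem = $1
    sub(/[ \t]+$/, "", stem)       # drop blanks that preceded the bracket
    # With brackets there are three fields: title, id, extension.
    key = (NF >= 3) ? (stem $3) : $0
    if (!(key in seen)) print      # print the original, untouched line
}' file2.txt file1.txt > missing.txt

cat missing.txt
```

With this sample data, missing.txt holds the "file 3" line and the Achanteswara line, in that order.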

As I kept saying "not working, not working" and you might have been irritated, I said that I had found a solution.

If you don't mind, I would like to be your student and learn Unix concepts, including how it works, shell scripting, the X Window System, how to use its utilities etc.

As I am a student I will have doubts, and you must be ready to clarify them.

At present I am reading How Linux Works by Brian Ward and The Linux Command Line by William Shotts.