Need some help with a shell content scanner

I've just started creating my own small content scanner that searches all the visible files on my server, but now I'm stuck. It should scan the files for phrases like the ones in the following example.

What I tried is the following code:

#!/bin/bash
find /home/userid*/public_html/ -size -307200k -exec grep -H -n -i -l 'www.exampleurl1.com/favicon.ico\|www.exampleurl2.com/v/' > /home/mypath/scan_content.php {} \;

That code first finds all files within the public_html folders that are not larger than 307200k, and then scans the content of those files.

That worked fine for the first few thousand files, but then it stopped working. I think there are too many files, so grep can't read all of them, or something else is wrong. There is no error message; the process just stays alive forever with CPU and memory usage of 0.

So it would be great if someone has an idea of how to write this scanner so that it also works with a few hundred thousand files.

Thanks

I just got the tip to use find with xargs and grep to solve the problem, but my combinations just won't work. Hopefully someone can help, because I have never tried anything like that before.

#!/bin/bash
find /home/userid*/public_html/ -size -2048k | xargs grep -H -n -i -l 'phrase1\|phrase2' > /home/filepath/public_html/path/scans/scan_result.php {} \;

I need a pro here to help me with this problem, because I am still a beginner with bash.

It's not clear whether you need just the filenames or the matching lines as well. I'm also assuming that you're using the pipe | as a regex alternation operator, not as a literal part of the files' records. Modify if needed.

find /home/userid*/public_html/ -size -2048k | xargs grep -Eil 'phrase1|phrase2' > /home/filepath/public_html/path/scans/scan_result.php

If you're using GNU find/xargs (most Linuxes), use their -0 option to handle problematic filenames.

Thanks for the reply. I just tried your code but ran into some problems.

First I just tried:

find /home/userid*/public_html/ -size -2048k | xargs grep -Eil 'phrase1|phrase2' > /home/filepath/public_html/path/scans/scan_result.php

There I got a lot of error messages from grep saying that files or folders don't exist.

I also tried it with -0 in the following way:

find /home/userid*/public_html/ -size -2048k | xargs -0 grep -Eil 'phrase1|phrase2' > /home/filepath/public_html/path/scans/scan_result.php

There the problem is that xargs gives me an error telling me that the argument line is too long.

It works fine for me. Try testing the commands separately. First find:

(I guess you know that, as written, find will match both files and directories.)

find /home/userid*/public_html/ -size -2048k

then the grep command on some test files (plural),

grep -Eil 'phrase1|phrase2' test_files*

then all together with xargs (I'm guessing you're on Linux; with GNU find/xargs the right syntax is a bit different: -print0 has to be spelled out explicitly, so that find emits NUL-separated names to match xargs -0):

find /home/userid*/public_html/ -size -2048k -print0 | xargs -0 grep -Eil 'phrase1|phrase2' > output_file

Just found the problem. The first part of the code is working fine, but grep is causing some problems.

I just tried to scan for more than one phrase, and that's where the problem occurs.

find /home/userid*/public_html/ -size -2048k -print0 | xargs -0 grep -Eil 'phrase1|phrase2' > output_file

I just tried it like that:

'www.steampowered.com\|www.icq.com' 

With that pattern there is no output; with just one of the phrases it works.

Am I missing something?

Okay, I just found my mistake; it was just a copy-and-paste error in the grep parameters.

The code works now, but it has the same problem as the one I started with. It starts working, and after a few minutes the CPU and memory usage go down to nearly 0, and that's it.

I am not a pro, so are there any limits? I just tried the scanner with a few thousand files and a few search phrases, but it seems it doesn't get a chance to finish before something times out or something else goes wrong.

Any idea about that, or any other approach to the content search?

I see... That's not a code problem, though, and there are no particular limits as far as find/xargs is concerned; I'd rather suspect a hardware/OS issue.
Try running the same code I posted on a different machine and see how it behaves. That way, at least, you'll have some idea of what's going on.

Just did some more testing. The problem seems to be find. I think there are too many files and subfolders. I just tried limiting the folder depth, and it worked for a smaller number of files.

I have no idea where the limit could be. I use CentOS 5 on a machine with 4 cores and 4 GB of RAM, so if anyone has an idea, please let me know.

Otherwise the solution will take a lot more time.

If you have Python on CentOS, here's an alternative:

#!/usr/bin/env python
import os

# result file the matching lines get appended to
outfile = os.path.join("/home","filepath","public_html","path","scans","scan_result.php")

# walk the whole /home tree: r = current directory, d = its subdirs, f = its files
for r,d,f in os.walk("/home"):
    # only look inside public_html trees
    if "public_html" in r:
        for files in f:
            size=os.path.getsize(os.path.join(r,files))
            # skip anything larger than ~2 MB, like the -size -2048k test in find
            if size <= 2048000:
                o=open(outfile,"a")
                for line in open(os.path.join(r,files)):
                    if "phrase1" in line or "phrase2" in line:
                        o.write(line)
                o.close()

@ghostdog74

Thanks for that piece of beautiful code. :smiley:

I just modified it to save the path of the file along with the content of the matching line, and the load is more than okay. The maximum I have seen while testing is 70% of one CPU, and the load average is just around 1.5.

I always thought that bash was the only thing that could run with low resources on my server. :rolleyes:

Just one question left: the code you posted searches all public_html folders within the home directory. Is there a limit to how many subdirectories are scanned, or does it follow the hierarchy all the way down?

Not true. For example, using too many pipes adds overhead, and the logic matters as well.

Please read the documentation of the os.walk() method here. You can pass arguments (e.g. topdown) to os.walk(), and with topdown=True you can prune the directory list in place to limit your search.
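
Something like this is what I mean. It's just a minimal sketch; the start path and the max_depth value are only placeholders for illustration:

import os

start = "/home"
max_depth = 3   # placeholder value, adjust as needed

# topdown=True (the default) hands you the subdirectory list before descending,
# so emptying it in place stops os.walk from going any deeper
for r, d, f in os.walk(start, topdown=True):
    depth = r[len(start):].count(os.sep)
    if depth >= max_depth:
        del d[:]
    # ... do the size check and phrase matching on f here, as in the scanner above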

Thanks again for the answer. I just want to search all subdirectories, so it should work as you already wrote it in the code example.

I will now start my final test with a few more arguments to find. :wink:

Thanks again.

After my final test, I just found a problem again.

I started the scanner to search for about 5 phrases and write the files into my result file; that works fine so far. But after about an hour, and I don't know how many files, the process is no longer listed in "top", yet it is still not finished.

Now I had a look in WHM, and there I see the python process running with 40% CPU usage, getting smaller every 5 seconds. In my shell, top does not show that load on the CPUs.

Any idea where that problem comes from? I searched, but I can't find any limit that would stop Python.

Put something like this in your code:

            if size <= 2048000:
                # open("logger.txt","a").write("doing "+os.path.join(r,files)+"\n")
                # print "doing ..." + os.path.join(r,files)
                o=open(outfile,"a")
                .......

This will create a logger.txt file; you can tail -f that file to check progress. Or you can just print the progress to stdout. Using top only shows you partial information; it's better to use ps -ef | grep "process".
If the progress line keeps getting printed, then I believe you really do have A LOT of files to process.
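
If it helps, here's a minimal sketch of that progress logging folded into the walker; the 1000-file interval and the logger.txt path are just examples:

#!/usr/bin/env python
# sketch only: same walk as before, plus a simple progress counter
import os

count = 0
for r, d, f in os.walk("/home"):
    if "public_html" in r:
        for files in f:
            path = os.path.join(r, files)
            count += 1
            # append progress every 1000 files so you can tail -f logger.txt
            if count % 1000 == 0:
                open("logger.txt", "a").write("done %d files, now at %s\n" % (count, path))
            # ... size check and phrase matching go here, as in the scanner above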

I already added a line like that, so it always tells me what the script is doing, but I think there really are a lot of files.

There are about 1000 public_html directories and every folder has about 100-200 files.

So I don't know if it is a problem to work through 500,000 files or even more.

Well, if you really have THAT many files, there's really no choice, right? Between using find+xargs and the Python version, you can time both and use the more efficient one. I guess it's already a bonus if you find one that is fast enough. (Or you can wait for some other solutions to come along.)

From what you explained to me, Python is a very nice way to write such a scanner. I just need to make sure the scanner doesn't stop its work after an hour, because that's what happened.

I will now wait until the scanner stops again; with the added print statement I should be able to find out where it happened.