Extracting content of a file

l20N1N · August 21, 2010, 7:19pm

Hello, I'm working on a script to extract the contents of a file (in general, a plain text file with numbers, symbols, and letters) and write them to a .txt file, but my attempt is kind of all over the place. The output must not include duplicates, and the content has to be readable. I jumped around a lot while learning scripting, but I managed to get the hang of tr (translate). I'm fairly new to awk, but I heard it works similarly and can be more effective. I was also wondering whether I'm overcomplicating this when sort and uniq might be able to do the job.

Note: I will be running this script many times. Is it possible to keep updating the output file so that the content from every run accumulates in it?

My logic of the script so far is

1. read (while loop maybe?)
2. sort/uniq -c (to eliminate duplicates)
3. awk (to eliminate gibberish?)

> filename.txt

my code so far:

#!/bin/bash
# Check for input file on command line.
ARGS=1
E_BADARGS=65
E_NOFILE=66

if [ $# -ne "$ARGS" ]  # Correct number of arguments passed to script? Or is this too complicated for something easy?
then
  echo "Usage: `basename $0` filename"
  exit $E_BADARGS
fi

if [ ! -f "$1" ]       # Check if file exists.
then
  echo "File \"$1\" does not exist."
  exit $E_NOFILE
fi


# So far I have it set to translate the input by piping tr into itself. Will this work?
# Or is awk more effective? What about the use of | sort | uniq -c?
# For or while loop?

tr A-Z a-z < "$1" | tr '[:space:]' Z | \
tr -cs '[:alpha:]' Z | tr -s '\173-\377' Z | tr Z ' ' > output.txt


exit 0

konsolebox · August 21, 2010, 7:34pm

I think all of that can be done in awk, using an associative array to flag every entry and prevent a duplicate from being printed a second time. Character conversion is also easily handled. To solve this quickly in one shot, though, can you give us an adequate example of the file's contents and the intended output?
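A minimal sketch of the associative-array idea, assuming whole lines are the unit to deduplicate (the thread has not settled that yet). The function name dedupe_lines is hypothetical; seen is just an illustrative awk array name.

```shell
#!/bin/bash
# Print each distinct line once, in first-seen order.
# seen[] is an awk associative array keyed by the whole line ($0):
# the expression is true only the first time a given line appears.
dedupe_lines() {
    awk '!seen[$0]++' "$@"
}

# Example: three input lines, one of them repeated.
printf 'John Smith\nJane Doe\nJohn Smith\n' | dedupe_lines
```

Unlike sort -u, this keeps the lines in their original order, at the cost of holding every distinct line in memory.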

anon57720281 · August 21, 2010, 7:46pm

You are on the right track; there may be no need for awk.

uniq -c prints a count next to each line; it does not just remove duplicates.

You can simply use sort -u.

Define gibberish.

You can use tr -cd to complement the set (if that's any easier),
e.g. to delete anything that is not alphanumeric or whitespace:

tr -cd '[:space:][:alnum:]'
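Put together with sort -u, that filter could look roughly like the sketch below. The function name clean_and_dedupe is hypothetical, and the sample input is made up for illustration.

```shell
#!/bin/bash
# Keep only whitespace and alphanumeric characters, then drop duplicate lines.
# tr -c complements the set and -d deletes, so everything NOT listed is removed.
clean_and_dedupe() {
    tr -cd '[:space:][:alnum:]' | sort -u
}

# Example: the "!" is stripped and the repeated line appears only once.
printf 'b line\na line!\nb line\n' | clean_and_dedupe
```

Note that this set also strips punctuation, so a phone number like 555-5555 would come out as 5555555. If punctuation should survive, a wider set such as '[:print:]\n' may be a better fit.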

l20N1N · August 21, 2010, 10:38pm

The thing is, the input files vary. They could be logs, records, or database information converted into plain text. The script will need to be able to read everything in them. One file for example had:

John Smith  555-5555  to 555-5555 Hello Jane Doe

another file was an email message so it was all text

The output just needs to contain everything taken from the input. The problem is that it needs to be done cumulatively. For example, I process one file and write it to the output file, then process another file and add it to the same output file. That's where I'm stuck. I read that redirecting will overwrite the existing file, but I was wondering if it can be updated instead.

By gibberish I mean non-printable characters that might be mixed in with the regular text.

Update:

So for the while loop portion where it reads, I can use this code, correct?

while read line
do
    echo "${line}"
done < <(cat file.lst)

where file.lst contains:

/tmp/file1.txt
/tmp/file with space.txt

Would that read a list of files to extract content from and write it to a txt file in /tmp?

---------- Post updated at 07:38 PM ---------- Previous update was at 04:52 PM ----------

Will this also work?

cd <input_file_directory>
for file in `dir -d *` ; do
<exeFile with full path> "$file" <output_file_path/"$file".out>
done

konsolebox · August 21, 2010, 11:09pm

If you intend to do that in bash:

#!/bin/bash

[[ BASH_VERSINFO -ge 4 ]] || {
    echo "Bash version 4.0 or newer is required by this script."
    exit 1
}

declare -A FLAGS=()

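while read -r; do
    REPLY=${REPLY//[^[:print:]]}            # drop non-printable characters
    [[ -z $REPLY ]] && continue             # an empty string cannot be used as an array key
    [[ -n ${FLAGS[$REPLY]} ]] && continue   # this line was already printed
    FLAGS[$REPLY]=.
    printf '%s\n' "$REPLY"
done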

exit 0

bash script.sh < input_file

That one requires version 4.0 or newer of bash.

With bigearsbilly's suggestion:

tr -cd '[:print:]\n' < input_file | sort -u
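For the cumulative output asked about earlier, one possible approach that also works on older bash is to re-sort the existing output together with each new input. This is only a sketch: the function name merge_into and the file names in the usage line are hypothetical, and it assumes the order of lines in the output does not matter.

```shell
#!/bin/bash
# merge_into OUTPUT INPUT...
# Adds the printable content of each INPUT to OUTPUT, keeping one copy of each line.
merge_into() {
    local out=$1
    shift
    touch "$out"    # make sure the output file exists before sort reads it
    # Strip non-printable characters (but keep newlines) from the new input,
    # then sort it together with the existing output. sort reads all of its
    # input before writing, so -o may name a file that is also an input.
    cat -- "$@" | tr -cd '[:print:]\n' | sort -u -o "$out" - "$out"
}

# Usage (hypothetical file names):
#   merge_into output.txt first_input.txt
#   merge_into output.txt second_input.txt
```

Running it repeatedly keeps output.txt free of duplicates across runs, which a plain >> append would not do.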

l20N1N · August 21, 2010, 11:36pm

I'm using bash 3.2.39.

OK, let's try this approach. Let's say I use this simple code:

cat file.txt |while read line; do echo "${line}"; done >> output.txt

Is it possible for me to code it so that the "file.txt" could be all txt files in a directory?

konsolebox · August 21, 2010, 11:50pm

just change it to *.txt

cat *.txt ...
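If the files need to be handled one at a time (for example, to cope with names that contain spaces), a glob in a for loop is a common alternative to parsing a directory listing. A rough sketch; the function name append_all and its two arguments are assumptions made for illustration.

```shell
#!/bin/bash
# append_all DIR OUTPUT
# Appends every .txt file found directly in DIR to OUTPUT.
append_all() {
    local dir=$1 out=$2 file
    for file in "$dir"/*.txt; do
        [ -f "$file" ] || continue    # skip the literal pattern if nothing matched
        cat -- "$file" >> "$out"      # >> appends instead of overwriting
    done
}

# Usage (hypothetical paths):
#   append_all /tmp/input /tmp/output.log
```

Quoting "$file" is what keeps a name like "file with space.txt" in one piece. It is probably safest to keep the output file outside DIR, or give it a different extension, so the glob does not pick it up as an input.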

l20N1N · August 22, 2010, 12:58am

Oh right... I think that should do it for now. I will do some more testing and post back if I have any further questions. Thanks guys!