Shell Scripting | Return list of unique characters in files

clippertm · November 28, 2016, 2:10am

Hi,

I am trying to script the steps below, but I am not very good at it.

Your help would be greatly appreciated.

read all files in the directory in strings

strings *.*

in each file, for each line that contains "ABCD", store characters located at position 521 and 522 of this line (this is where I am stuck)
once all files have been read, print a list of unique values (I guess I would have to use uniq).

RudiC · November 28, 2016, 6:50am

This specification is far from clear or complete. Please explain again in detail, supported by input and output samples, where you come from and what you want to achieve.

rbatte1 · November 28, 2016, 7:25am

Dear clippertm,

I have a few questions to pose in response first:

Is this homework/assignment? There are specific forums for these.
What have you tried so far?
What output/errors do you get?
What OS and version are you using?
What are your preferred tools? (C, shell, perl, awk, etc. & strings, of course!)
What logical process have you considered? (to help steer us to follow what you are trying to achieve)

Most importantly, What have you tried so far?

There are probably many ways to achieve most tasks, so giving us an idea of your style and thoughts will help us guide you to an answer most suitable to you so you can adjust it to suit your needs in future.

We're all here to learn and getting the relevant information will help us all.

Kind regards,
Robin

rovf · November 28, 2016, 7:47am

It's difficult to tell you how to accomplish a task if you don't even specify what programming language you are going to use.

As to your question regarding extracting a single character:

Assuming that you have read the line in question into a shell variable, extracting a character at a certain position can be done in bash or zsh with substring expansion. The offset is zero-based, so the character at position 521 is

    ${line:520:1}

In Zsh, you also have the option to write

    ${line[521,521]}

In your case, you might also consider not solving this within the shell language itself, but piping the selected lines into 'cut'. Given that your problem description is somewhat fuzzy, I can't recommend which solution is the better one. In any case, have a look at the man page for 'cut'.
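
As a rough illustration of the two routes mentioned above (shell substring expansion versus cut), here is a minimal bash sketch. It assumes positions are counted from 1, as cut counts them, while bash offsets start at 0. The helper name extract_pair and the sample line are made up for the example.

```shell
#!/usr/bin/env bash
# Sketch only: pull the characters at 1-based positions 521-522 out of a line.
# extract_pair is a hypothetical helper name, not something from the thread.

extract_pair() {
    local line=$1
    # bash offsets are zero-based: offset 520 is the 521st character
    printf '%s\n' "${line:520:2}"
}

# Build a sample line: "ABCD" + 516 filler characters = 520, then "42"
pad=$(printf '%516s' '' | tr ' ' 'x')
sample="ABCD${pad}42tail"

extract_pair "$sample"                      # shell-only version
printf '%s\n' "$sample" | cut -c 521-522    # same two characters via cut
```

Both commands should print the same two characters. The cut version works on a whole stream at once, which is usually more convenient than looping over lines in the shell.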

clippertm · November 28, 2016, 7:59pm

Hi,

Thank you all for your replies. My apologies if I was not specific enough; I will do my best. No, this is not homework, just a bunch of files I need to analyse for one of my hobbies. I usually use a combination of strings and grep, but it would just take too long this time. Here is what I typically use:

strings *.* | egrep --color 'ABC.{134}10'

I guess you call it shell, correct? My environment is Cygwin. The command shows me all lines (string formatted) starting with 'ABC' that have the value 10 at position 134. It allows me to see whether the value 10 occurred in this bunch of files. I usually only look for a few values, like 10, 11 and 20, so it does not take long.

The issue is that there are now too many files and too many values, which range from 00 to 99. The output I am looking for would simply be:

which is a list of values that occurred at position 134 (without duplicates). I do not know if this can be achieved with shell. I do not mind at all trying new things like awk and perl.

Chubler_XL · November 28, 2016, 8:22pm

You could try this:

strings *.* | grep "ABCD" | cut -c 521-522 | egrep '(10|11|20)' | sort | uniq

If your list of values is quite long you could put them in a file with 1 line per value and use -f option of grep like this:

strings *.* | grep "ABCD" | cut -c 521-522 | grep -F -f want.txt | sort | uniq

-F is for fixed-string matching (faster than using a regex), and -f <filename> reads the list of strings to match from the file filename.
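
A small sketch of the want.txt idea, using synthetic input on stdin instead of real strings output. The function name filter_wanted, the temporary file and the sample records are assumptions for illustration. The extra -x option (match whole lines only) is not in the command above; it is added here so that a list entry can only match the complete two-character field.

```shell
# Sketch only: filter the extracted two-character fields against a list file.
# filter_wanted and the sample values are illustrative assumptions.

filter_wanted() {
    # $1 = file with one wanted value per line; input lines arrive on stdin
    grep "ABCD" | cut -c 521-522 | grep -F -x -f "$1" | sort -u
}

want=$(mktemp)
printf '%s\n' 10 11 20 > "$want"

# Three fake records: "ABCD" + 516 filler characters + a two-digit value
pad=$(printf '%516s' '' | tr ' ' '.')
printf 'ABCD%s%s\n' "$pad" 20 "$pad" 10 "$pad" 77 | filter_wanted "$want"

rm -f "$want"
```

With real data the call would look like strings *.* | filter_wanted want.txt. Only values that are both present in the input and listed in the file are printed, each once.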

Aia · November 28, 2016, 8:31pm

'ABC.{134}10'
ABC = 3
.{134} = 134
10 = 2

The 1 of the number 10 is therefore the 138th character of the matched string, and the 0 is the 139th.
There is no anchor saying that ABC is at the start of the line; it is only the start of the matched string.

I do not see how it can work with what you said:

In that case, it would be something like grep -E '^ABC.{130}(1[01]|2[056])' *
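
The position arithmetic above can be checked against a synthetic line before running anything on real files. This is only a sketch: make_line and check are hypothetical helpers, and the filler character is arbitrary.

```shell
# Sketch only: see where a pattern places the two digits, using a made-up line.

# make_line N VALUE: "ABC", then N filler characters, then VALUE
make_line() {
    printf 'ABC%s%s\n' "$(printf "%${1}s" '' | tr ' ' '_')" "$2"
}

# check N ERE: does a line with N filler characters and value 10 match ERE?
check() {
    if make_line "$1" 10 | grep -q -E "$2"; then
        echo "match"
    else
        echo "no match"
    fi
}

check 130 '^ABC.{130}10'    # 3 + 130 = 133 characters, so 10 sits at 134-135
check 130 'ABC.{134}10'     # too few characters between ABC and 10
check 134 'ABC.{134}10'     # here 10 sits at positions 138-139

# Confirm the position independently of the regex
make_line 130 10 | cut -c 134-135
```

The first and third checks should report a match and the second should not, which is the discrepancy described above: with 134 wildcard characters after ABC, the value ends up at positions 138-139 rather than 134-135.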

clippertm · November 28, 2016, 8:44pm

Thanks Chubler! This expands my horizon and is easy to use. Any idea how to avoid having to input 10, 11 & 20 at all? Perhaps a range from 00 to 99? So that whatever number is at position 521-522 is returned?

---------- Post updated at 08:44 PM ---------- Previous update was at 08:35 PM ----------

Thanks Aia, you are correct, the ABC was included in the 134 or 521, I do not explain myself very well.

Chubler_XL · November 28, 2016, 9:30pm

Sorry I misinterpreted your requirements and believed you had a number of values in the range of 00 thru 99 and not all the values.

To match any two-digit number I would use:

strings *.* | grep "ABCD" | cut -c 521-522 | egrep '[0-9][0-9]' | sort | uniq

Don_Cragun · November 28, 2016, 11:31pm

You might speed that up a little bit with:

strings *.* | awk '/ABCD/ && (s = substr($0, 521, 2)) ~ /[0-9][0-9]/ {print s}' | sort -u

But note that if you use strings on a text file, the output from strings might not be complete lines from your input files. Have you really checked that strings *.* is producing output with the two digits you want in positions 521 and 522?
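
One way to run that check is to count how many of the matching lines are long enough and really have two digits at positions 521-522. The sketch below is an assumption about how such a check could look, not something posted in the thread; check_positions is a hypothetical helper and the three sample records are synthetic.

```shell
# Sketch only: summarise whether lines containing ABCD carry two digits at 521-522.
# check_positions is a hypothetical helper; it reads lines on stdin.
check_positions() {
    awk '/ABCD/ {
        total++
        if (length($0) < 522)                          short++
        else if (substr($0, 521, 2) ~ /^[0-9][0-9]$/)  ok++
        else                                           other++
    }
    END { printf "lines=%d ok=%d short=%d non-digit=%d\n", total, ok, short, other }'
}

# Synthetic input: "ABCD" + 516 filler characters puts the next two at 521-522
pad=$(printf '%516s' '' | tr ' ' '.')
{
    printf 'ABCD%s42\n' "$pad"      # digits where expected
    printf 'ABCD%sxy\n' "$pad"      # long enough, but not digits
    printf 'ABCD truncated\n'       # too short to reach position 522
} | check_positions
```

On real data the call would be strings *.* | check_positions. A large short or non-digit count would suggest that strings is splitting the records differently from what the position-based commands assume.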