Extracting the delimiter from a file

Hi, could someone please tell me how to extract the delimiter from any file and pass it to a Unix script? Since I'm a beginner in Unix, I find it a little bit difficult. How can I use awk to do this?

Please let me know.

hmmm....

A "delimiter" cannot be found out by any means. It is what you *define* as the delimiter.

consider the following file:

field1 ,| field2 | field3 ,|
field1 ,| field2 | field3 ,|
field1 ,| field2 | field3 ,|

You could define "|" as the delimiter. Then you would have 3 fields, separated by "|", which read "field1 ,", " field2 " and " field3 ," (plus an empty fourth field after the trailing "|", if you count it). You could also define "," as the field separator, which would leave you with 3 fields per line: "field1 ", "| field2 | field3 " and "|".
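To make the two readings concrete, here is a small sketch that splits the same line both ways with awk. The temporary file and the helper name show_fields are made up for this illustration; each field is printed in square brackets so that leading and trailing blanks stay visible. Note that awk also reports an empty field after the final "|".

```shell
# Hypothetical sample file holding the three lines from the example above.
sample=$(mktemp)
printf '%s\n' \
    'field1 ,| field2 | field3 ,|' \
    'field1 ,| field2 | field3 ,|' \
    'field1 ,| field2 | field3 ,|' > "$sample"

# show_fields SEP FILE  (made-up helper name for this sketch)
# Prints every field of the first line in brackets, using SEP as separator.
show_fields() {
    awk -F"$1" 'NR == 1 { for (i = 1; i <= NF; i++) printf "[%s]", $i; print "" }' "$2"
}

show_fields '|' "$sample"   # prints [field1 ,][ field2 ][ field3 ,][]
show_fields ',' "$sample"   # prints [field1 ][| field2 | field3 ][|]
```

Both calls run without complaint, and awk has no opinion about which result is the intended one.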

There is no "correct" solution to it; you could define any other character as a delimiter too.

I hope this helps.

bakunin

Suppose I have a file and I don't know what its delimiter is. Is it possible to extract the text between the delimiters without knowing what the delimiter is? Please give an example so I can understand it clearly.

If I tell you that I have kept money for you in my house, but you don't know where my house is, do you think you can get the money?

The same thing goes for the delimiter... :wink:

Just do a cat on the file and look to see what character separates the fields. It will most likely be a comma, space, pipe, tab, or slash.

$ cat myfile
field1|field2|field3
$ cat myfile2
field1,field2,field3

In myfile, "|" is the delimiter. In myfile2, "," is the delimiter.
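If you would rather have a script do the looking, something like the following rough sketch may serve as a starting point. It is only a guess based on plausibility, not a way of knowing: it assumes the separator is one of the likely characters listed above (comma, pipe, slash, tab, space) and that it occurs the same number of times on every line. The function name guess_delim and the temporary file are made up for this example.

```shell
# guess_delim FILE  (made-up helper name for this sketch)
# Prints each candidate character that occurs the same non-zero
# number of times on every line of FILE, one candidate per line.
guess_delim() {
    awk '
    BEGIN {
        # Candidate list is an assumption: comma, pipe, slash, tab, space.
        n = split(", | /", cand, " ")
        cand[++n] = "\t"
        cand[++n] = " "
    }
    {
        for (i = 1; i <= n; i++) {
            # Count literal occurrences of the candidate in this line.
            c = 0; s = $0
            while ((p = index(s, cand[i])) > 0) { c++; s = substr(s, p + 1) }
            if (NR == 1) cnt[i] = c          # remember the first line
            else if (cnt[i] != c) bad[i] = 1 # count differs: rule it out
        }
    }
    END {
        for (i = 1; i <= n; i++)
            if (cnt[i] > 0 && !(i in bad)) print cand[i]
    }' "$1"
}

# Usage: guess the separator, then pass it on to awk (or to any script).
myfile=$(mktemp)
printf 'field1|field2|field3\n' > "$myfile"
delim=$(guess_delim "$myfile")
awk -F"$delim" '{ print $2 }' "$myfile"   # prints field2
```

This only works when exactly one candidate survives. On a file like the "field1 ,| field2 | field3 ,|" sample earlier in this thread, the comma, the pipe and the space all occur equally often on every line, so the function reports all three and you still have to decide yourself.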

Sorry, but this is nonsense, as I have shown above. A "delimiter" is whatever you define to be a delimiter, and it is completely arbitrary.

bakunin

@bakunin,

The whole point of the example was to show that a delimiter is an arbitrary field separator. Generally, if you are working with already existing files such as a csv file, a delimiter has been chosen for you. Even in a csv spreadsheet it's still arbitrary, but it wouldn't make much sense to use anything other than a comma. A delimiter is arbitrary in the sense that any character can be used, but it's not arbitrary in the sense that if the character doesn't actually mark anything, it's not a delimiter.

Webster's Dictionary definition:

a character that marks the beginning or end of a unit of data

In the case of myfile

afielD1|bfiEld2|cfIeld3

The | is the only delimiter because it is the only character that is actually marking the beginning or end of a unit of data.

In a *c*sv file (which is called "comma-separated" for some reason, perhaps) this is right. But then, in a comma-separated file there is no need to find out the delimiter, which is what the thread opener asked about.

This is a misunderstanding: "doesn't [actually] mark anything" means you second-guess what exactly establishes a meaning. Consider:

Let's say the pipe character is used as the delimiter and several of the fields delimited this way are empty. Do these empty fields carry useful information or not?

Furthermore (sorry, this gets somewhat philosophical), "meaning" is not an inherent quality at all. The string "abc" might have a meaning or not, depending on what we agree establishes meaning, depending on context, whatever.

Your argument comes down to "plausibility", and while I agree with you that limiting your search to plausible or obvious solutions helps to solve real-world problems faster most of the time, it simply doesn't help if you are trying to find generalized solutions, as in "write a script to find the delimiter".

Consider the string "a||b||c": does this mean three fields, "a", "b" and "c", delimited by a double pipe char or does it mean 5 fields, two of them empty? Both variants would be plausible enough, both might be correct - or wrong, depending on the intention of the one who wrote the line. But this information cannot be discerned from the file alone at all. You will need some additional information - context - to do so.
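A quick sketch of the two readings in awk, for illustration only (the variable names are made up for this example). awk happily accepts either definition and has no way of telling which one the author of the line had in mind.

```shell
line='a||b||c'

# Reading 1: a single pipe is the separator, so the empty fields count.
single=$(printf '%s\n' "$line" | awk -F'|' '{ print NF }')

# Reading 2: the two-character string "||" is the separator.
# A separator longer than one character is treated as a regular
# expression, so each pipe is bracketed to keep it literal.
double=$(printf '%s\n' "$line" | awk -F'[|][|]' '{ print NF }')

echo "single pipe: $single fields"   # prints single pipe: 5 fields
echo "double pipe: $double fields"   # prints double pipe: 3 fields
```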

Again, this is appealing to some plausibility. Everything can be considered "data": "afie" or "D1" is (or can be) as much data as "afie|D1" or whatever substring you extract from this line. Whether it is data or not depends on your ability to derive meaning from it. Again: context.

If I give you a succession of characters, say "R-O-T", is it data? In other words, does it have a meaning? As long as you don't have additional information you can't decide this question at all. For instance, if you know we are talking in English then this would constitute a word (a verb) and have a meaning. If you know that we are talking in German this would also have a meaning, but a different one ("rot" means "red" and is an adjective), and if you know we are talking in Italian it would have no meaning at all, as there is no word "rot" in Italian. It would be some garbled transmission in this case. This means you need to have (or need to assume) some context (the language) to decide if this string is data or not.

The human brain is very, very good at finding (or constructing) patterns, real ones or, in some pathological cases like the mathematician John Nash, imagined ones. Still, finding a pattern is not discovering some inherent quality of the presented data but imposing some organization on the received information. This organization is imposed on the information from outside and is therefore what I said: arbitrary. None of these organizations is "better" or "more correct" than any other.

bakunin

Hi,

Finally, is there a way to find the delimiter in the file?

Regards
JS