Should I say "field 8" or "column 8" in this case?

hanson44 · March 22, 2013, 5:07am

I saw some recent posts where I thought the terms "field" and "column" were being misused. I work with data a lot, and have my opinions. I'm wondering if those opinions are correct.

***** Rows seem clear - I don't think there is any controversy about what a row is, for either a database or a text file.

***** Flat file columns seem clear - For a flat file such as the following, I don't think there is any controversy about what a column is. A column in the file shown is what "cut -c 1" selects. Several columns may combine to make a field, so "cut -c 1-11" cuts the field in columns 1-11 (the record ID here), such as 09011101001, 09011101002, etc.

09011101001270101192008BNB1102008000027060126720001305591
09011101002230101212008B5P1102008000053110126720001305591
090111010032501011120084XB1102008000085030126720001305591
09011101005250101232008GUW1202008000145050126720001305591
09011101006070101132008E3S1102008000157050126720001305591
09012101007060102062008GWB1102008000186030361080005352411
090111010081601011920082XW1102008000226050126720001305591
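
As a minimal sketch of the distinction just described, the snippet below runs the two cut commands from the post against the first record of the flat file. The variable names (record, first_column, record_id) are only illustrative and not part of the original post.

```shell
# One record from the fixed-width file shown above.
record='09011101001270101192008BNB1102008000027060126720001305591'

# A single column: character position 1.
first_column=$(printf '%s\n' "$record" | cut -c 1)

# Columns 1-11 taken together form one field, the record ID.
record_id=$(printf '%s\n' "$record" | cut -c 1-11)

printf 'column 1:  %s\n' "$first_column"
printf 'record ID: %s\n' "$record_id"
```

Because every record has the same layout, the same character range picks out the record ID on every line.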

***** CSV and TSV "columns" seem misused to me - Here is similar data, in CSV format. I would say "data in field 8" instead of "data in column 8". I think I'm supported by the cut command and its use of "cut -f 8 -d," (--fields) for parsing this kind of data. To me, "column 8" means "cut -c 8". By the time we get out to "field 8", the data doesn't line up vertically anymore, so it doesn't even look like a column. But it seems many, or perhaps most, say "data in column 8". But many of those can barely string together a sentence. So I thought I would ask the experts. Is it more correct to say "column 8" or "field 8" for what "cut -f 8 -d," retrieves in the example below? :rolleyes:

9,1,1,1,1001,27,1,01192008,01,19,2008,BNB,110,2008000027
9,1,1,1,1002,23,1,01212008,01,21,2008,B5P,110,2008000053
9,1,1,1,1003,25,1,01112008,01,11,2008,4XB,110,2008000085
9,1,1,1,1005,25,1,01232008,01,23,2008,GUW,120,2008000145
9,1,1,1,1006,7,1,01132008,01,13,2008,E3S,110,2008000157,2
9,1,2,1,1007,6,1,02062008,02,06,2008,GWB,110,2008000186,2
9,1,1,1,1008,16,1,01192008,01,19,2008,2XW,110,2008000226
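
To make the question concrete, here is a hedged sketch that compares "cut -f 8 -d," with "cut -c 8" on two of the records above, and shows that field 8 starts at a different character position in each. The helper field8_start and the variable names are made up for this illustration.

```shell
# Two records from the comma-separated data above (first and fifth lines).
line_a='9,1,1,1,1001,27,1,01192008,01,19,2008,BNB,110,2008000027'
line_e='9,1,1,1,1006,7,1,01132008,01,13,2008,E3S,110,2008000157,2'

# Field 8: the eighth comma-delimited value.
field8=$(printf '%s\n' "$line_a" | cut -d, -f 8)

# Character 8: whatever sits at the eighth character position.
char8=$(printf '%s\n' "$line_a" | cut -c 8)

# Hypothetical helper: 1-based character position where field 8 begins.
# It measures fields 1-7, then adds 1 for the separating comma and 1 more
# to step onto the first character of field 8.
field8_start() {
    prefix=$(printf '%s\n' "$1" | cut -d, -f 1-7)
    echo $(( ${#prefix} + 2 ))
}

start_a=$(field8_start "$line_a")
start_e=$(field8_start "$line_e")

printf 'field 8: %s\n' "$field8"
printf 'character 8: %s\n' "$char8"
printf 'field 8 starts at character %s and %s\n' "$start_a" "$start_e"
```

Field 6 is "27" in one record and "7" in the other, so everything after it shifts by one character, which is why the eighth field does not form a vertical column.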

drl · March 22, 2013, 7:13am

Hi.

Good topic. It may help all of us communicate better.

I tend to prefer the term fields when talking about variable-width data groups, and columns when considering fixed-width data. The article at http://en.wikipedia.org/wiki/Field_(computer_science) pretty much describes my outlook. If there is a separator character or string, I'd call it variable-width, even if all of the members of a specific group are the same width, primarily because any member could become a different width in the future.

From a historical point of view, the FORTRAN influence caused people to describe data in terms of fields even though the data were fixed-width. Only somewhat later, perhaps in the '70s or '80s, did the idea of free-form data become more prevalent.

Thanks for starting the discussion ... cheers, drl

jim_mcnamara · March 22, 2013, 7:43am

Good points, all.

Like drl, I view fields as objects that are horizontally delimited and not in a fixed position.
I remember fields from FORTRAN and some versions of BASIC. Now the main driver seems to be portability of data files from UNIX into Excel.

UNIX uses field separators:
sort has a notion of fields delimited by -t [character].
From man sort for GNU sort:

-t, --field-separator=SEP

awk has had FS from its inception.
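
The field-separator options mentioned above can be tried side by side. This is only a sketch: the sample rows are borrowed from the first post, the variable names are invented here, and LC_ALL=C is set just to keep the sort order predictable.

```shell
# Three comma-separated records borrowed from the first post.
data='9,1,1,1,1001,27,1,01192008,01,19,2008,BNB,110,2008000027
9,1,1,1,1002,23,1,01212008,01,21,2008,B5P,110,2008000053
9,1,1,1,1003,25,1,01112008,01,11,2008,4XB,110,2008000085'

# sort: -t sets the field separator, -k8,8 limits the sort key to field 8.
# Field 5 of each sorted record is then joined onto one line for display.
sorted_ids=$(printf '%s\n' "$data" | LC_ALL=C sort -t, -k8,8 | cut -d, -f 5 | paste -s -d ' ' -)

# awk: -F sets FS, and $8 is the eighth field of the current record.
awk_field8=$(printf '%s\n' "$data" | awk -F, 'NR == 1 { print $8 }')

printf 'IDs ordered by field 8: %s\n' "$sorted_ids"
printf 'awk field 8 of record 1: %s\n' "$awk_field8"
```

Both tools number fields from 1 and say "field", not "column", for what the separator delimits.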

hanson44 · March 22, 2013, 2:55pm

Yes, I think the sort syntax reinforces what I was trying to say. "sort -t, -k 8 fields.txt" sorts "field 8". The sort man page refers to "field separator" and "field number" and --field-separator. "column" is not even mentioned on the sort man page.

For the position within a field, both sort and cut use "character" instead of "column". In other words, cut says --characters where I would have said --columns. To me "characters" is confusing, as it could be interpreted to be "--characters=ABC" as if looking for those "characters". The man page says "select only these characters". Why did they choose --characters instead of --columns for the option name?

Of course, nobody is going to change the option name at this point. I suppose they could at least improve the cut man page, to say "select only these character positions".

Don_Cragun · March 23, 2013, 1:21am

If you have a tab character, it may occupy one or more columns. If you have a backspace character, that character and the character it follows may occupy only one column. If you are looking at a Kanji character, a single character may occupy two columns. That is why we chose character rather than column for the tag associated with the -c option to cut.

The cut utility does not perform cuts based on the columns in which characters will be displayed. It can perform cuts based on the number of bytes (-b), the number of characters (-c), or the number of fields (-f). There is no option to cut the characters or bytes that will occupy a particular range of column positions (such as recognizing that the three character sequence <a><backspace><underline> immediately following a <newline> character will all occupy output column number one on some output devices). And the way the sequence of characters <a><tab><backspace><c> translates into output columns may vary considerably based not only on the position within a line where it appears but also on the software or hardware that is interpreting that sequence. Does the <backspace> character backspace over the previous output column or over the previous character (<tab> in this case)? What does the <backspace> character do when it is the first character on a line? Again, counting characters provides a clearly defined operation. If we had used output columns instead of characters (or bytes), the behavior required would not match any known existing implementation of the cut utility.

When using a fixed width character set, rows and columns are solid concepts on a CRT, typewriter, or printer and also when talking about entries in a table in a spreadsheet. Columns are much less precise when talking about the contents of a text file.

Characters, on the other hand, are explicitly defined by the LC_CTYPE category of the current locale.
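
The gap between counting characters and counting display columns can be sketched with a tab. The example assumes tab stops every 8 columns, which is the default used by expand; a terminal or printer may be set up differently, and the variable names are illustrative.

```shell
# Three characters: 'a', a tab, and 'b'.
s=$(printf 'a\tb')

# expand rewrites each tab as the spaces needed to reach the next tab stop
# (every 8 columns by default), approximating what a terminal would show.
expanded=$(printf 'a\tb\n' | expand)

char_count=${#s}            # characters in the data
column_count=${#expanded}   # columns used once the tab is expanded

# cut -c counts characters, so 'b' is character 3,
# even though it would be displayed in column 9.
third_char=$(printf 'a\tb\n' | cut -c 3)

printf 'characters: %s\n' "$char_count"
printf 'columns after expand: %s\n' "$column_count"
printf 'cut -c 3 selects: %s\n' "$third_char"
```

Here the tab is one character but advances the output by seven columns, so a character index and a display column stop agreeing as soon as a tab appears.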

hanson44 · March 23, 2013, 2:28am

What you say makes sense for UTF-8 or other multi-byte characters, the case I wasn't considering. I knew all that, but it turns out I didn't really know it. I'm so used to dealing with regular ASCII printing characters taking up one column, but I need to keep locales in mind. Thanks

Don_Cragun · March 23, 2013, 3:15am

Even with single-byte characters, tab and backspace seldom take up one column. I know that these characters aren't in the print class, but cut and sort don't just work on characters for which isprint(char) returns true.

hanson44 · March 23, 2013, 3:56am

I got your point before, and was agreeing. But I don't get your point this time.

I never suggested or thought cut only works with printing characters.

I don't agree that tab and backspace "seldom take up one column". The discussion is about fields and columns in files. I think tab and BS always take up one column (single byte characters). If you are referring to how the file displays on a monitor or printer, I think that's just an artifact. I don't think it's relevant to the issue that tab might be displayed ^I or eight spaces, or 'A' displayed in binary or hex by od, etc.

Don_Cragun · March 23, 2013, 4:34am

The POSIX and UNIX Standards have definitions for the terms byte, character, and column position that are very different from what you described above. The standard definitions are:

Byte:
An individually addressable unit of data storage that is exactly an octet, used to store a character or a portion of a character; see also Section 3.87 (on page 47). A byte is composed of a contiguous sequence of 8 bits. The least significant bit is called the "low-order" bit; the most significant is called the "high-order" bit.

Note: The definition of byte from the ISO C standard is broader than the above and might accommodate hardware architectures with different sized addressable units than octets.

Character:
A sequence of one or more bytes representing a single graphic symbol or control code.

Note: This term corresponds to the ISO C standard term multi-byte character, where a single-byte character is a special case of a multi-byte character. Unlike the usage in the ISO C standard, character here has no necessary relationship with storage space, and byte is used when storage space is discussed.

See the definition of the portable character set in Section 6.1 (on page 125) for a further explanation of the graphical representations of (abstract) characters, as opposed to character encodings.

Column Position:
A unit of horizontal measure related to characters in a line.

It is assumed that each character in a character set has an intrinsic column width independent of any output device. Each printable character in the portable character set has a column width of one. The standard utilities, when used as described in POSIX.1-2008, assume that all characters have integral column widths. The column width of a character is not necessarily related to the internal representation of the character (numbers of bits or bytes).

The column position of a character in a line is defined as one plus the sum of the column widths of the preceding characters in the line. Column positions are numbered starting from 1.

So your description of a column being equivalent to a byte just doesn't compute with what I believe that term means.
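
A small sketch of the byte versus character distinction in the definitions above, assuming UTF-8 encoding. The Kanji is written as octal escapes so the snippet does not depend on how this page is encoded; the variable names are illustrative.

```shell
# U+65E5 (a Kanji) encoded in UTF-8 is the three bytes E6 97 A5.
kanji=$(printf '\346\227\245')

# wc -c counts bytes, whatever the locale.
byte_count=$(printf '%s' "$kanji" | wc -c | tr -d ' ')

# od shows the individual bytes that make up the single character.
hex_bytes=$(printf '%s' "$kanji" | od -An -tx1 | tr -d ' \n')

# wc -m counts characters as defined by LC_CTYPE, so its result depends on
# the current locale (a UTF-8 locale should report 1 here).
char_count=$(printf '%s' "$kanji" | wc -m | tr -d ' ')

printf 'bytes: %s (%s)\n' "$byte_count" "$hex_bytes"
printf 'characters in this locale: %s\n' "$char_count"
```

On a typical terminal this one character is drawn two columns wide, so bytes, characters, and column positions give three different counts for the same data.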

hanson44 · March 24, 2013, 9:49pm

To correct the record, I didn't say "a column is equivalent to a byte". And that's quite an assertion you so boldly made, that the standard definitions of byte and character are "very different" from how I used those terms. I've been using gettext, i18n, Unicode and UTF-8 for many years, and understand bytes and multi-byte chars very well. For the record, so I'm not deemed some kind of radical, I 100% agree with the standard definitions of byte and character you included.

My original post was about the word "column", in the context of a delimited file. We've gotten far off that topic. I previously agreed with you that I failed to think about locales when discussing why the cut option says "--characters" and not "--columns". I was trying to be nice. Could we move on?

Don_Cragun · March 24, 2013, 10:38pm

I apologize. I misinterpreted your statement:

to mean that you were equating bytes to columns.

Sometimes e-mail/forum discussions lead to confusion that would never occur in a face-to-face discussion where a clarification would happen immediately rather than being exacerbated by the delays between posts in a forum like this.