How join works and the specific parameters to my problem?

  1. The problem statement, all variables and given/known data:
    I have two files created from extracting data off of two CSV files, one containing class enrollment on a specific quarter and the other containing grades for that specific quarter. The Enrollment file generated contains course name, instructor and catalog number and a few other columns. The Grade file generated contains GPA and catalog number and a few other columns.

Both files have over 5k rows. I need to join the two files using the catalog number as the key field.

The Enrollment CSV file had some entries with Catalog No. missing, these were, in fact, subtotal entries and they were filtered out after the file was created.

  1. Relevant commands, code, scripts, algorithms:
    awk was used to generated the files using "|" as the field separator.

Example:
Course file
...
Catalog No. | ART 101 | instructor | Intro to Art
...

Grade file
...
Catalog No. | ART 101 | GPA
...

I need to use join to create a merged file of the two based on the Catalog No.

Example:
Merge file
...
Catalog No. | ART 101 | Instructor | Intro to Art | GPA
...

The original CSV files have the courses grouped by department (i.e. ART, CRIM, ENGR, CS, etc.)

  1. The attempts at a solution (include all code and scripts):
    I've attempted two approaches, one where both files are sorted based on Catalog No. and one without sorting.

Every relevant join parameters were used in an attempt to merge the two files but each time, I get either an empty merged file or it just contains one or the other file in it.

join -t"|" -o'1.1 1.2 1.3 2.1 2.3' Courses.txt Grades.txt > Merged.txt

empty

join -t"|" -o'1.1 1.2 1.3 2.1 2.3' Courses.txt Grades.txt > Merged.txt

Gave me: 10000|GEOL|150|EARTH REVEALED|Professor||||

The |||| were where the GPA and other stuff from Grades are supposed to be.

join -t"|" -1 1 -2 1 Courses.txt Grades.txt > Merged.txt
empty

join -t"|" -a1 Courses.txt Grades.txt > Merged.txt
Just Courses entry, none from Grades. Same thing with -a2.

-j was also used, as well as.

Keep in mind I sorted the files so that the catalog numbers would match line to line.

All the examples I've seen regarding how join works has the key fields in both files in order and correspond almost line-to-line so this sort of suggest that join does not search the entire file for matches.

California State University, Los Angeles, California, USA. Albert Cervantes, CS 345

You guys got strange requirements. Gee, where's Los Angeles? In Canada?[COLOR="\#738fbf"]

Should I give up now or is your lack of response suggesting that even the experts here are trumped by my problem?

Could you post some small representative sample input (not only the column headers) and the output you'd like to get given that input.

Okay, is this what you mean:

Courses.txt

Cata#|Department|Crs#|Class name|Instructor
10015|GEOL    |158|NATURAL DISASTERS|Rezaie Boroon Mohammad Hassan|
10016|GEOL    |158|NATURAL DISASTERS|Hamane Angelique C.|
10020|ART     | 101A     |WORLD ART|Anderson Paul A|
10021|ART     | 101B     |WORLD ART|Aguilar-Moreno Jose Manuel|

from Grades.txt

Cata#|Dept|Crs#|GPA|A|B|C|D|E|F...etc.
10015|GEOL|158|1.64|2|2|1|7|2|11|26|9|17|12|5|13|||1|
10016|GEOL|158|1.93|13||3|19||3|40|1||9||22|||1|2
10020|ART|101A|1.19|8|5|1|16|7|7|11|8|3|13|6|77||||13
10021|ART|101B|2.59|38|19|15|17|9|8|16|4|4|9|5|12||||3

Using join, we are to generate:

10015|GEOL    |158|NATURAL DISASTERS|Rezaie Boroon Mohammad Hassan|1.64
10016|GEOL    |158|NATURAL DISASTERS|Hamane Angelique C.|1.93
10020|ART     | 101A     |WORLD ART|Anderson Paul A|1.19
10021|ART     | 101B     |WORLD ART|Aguilar-Moreno Jose Manuel|2.59

You'd think that join -1 1 -2 1 Courses.txt Grades would match field 1 of Courses and field 1 of Grades easily, but I keep getting an empty file when I output it.

To clarify, we're trying to get the professor's name and the average GPA they had for the class they taught.

You were very nearly there.

The input files are not sorted on the first field "Catalogue number". If they were sorted, the heading line would be at the bottom.

You need to either remove the heading line before trying the join or sort the files with unix "sort" which will force the heading line to the bottom.

After sorting the files and twiddling with the output fields.

join -t"|" -o 1.1 1.2 1.3 1.4 1.5 2.4 Courses2.txt Grades2.txt

10015|GEOL    |158|NATURAL DISASTERS|Rezaie Boroon Mohammad Hassan|1.64
10016|GEOL    |158|NATURAL DISASTERS|Hamane Angelique C.|1.93
10020|ART     | 101A     |WORLD ART|Anderson Paul A|1.19
10021|ART     | 101B     |WORLD ART|Aguilar-Moreno Jose Manuel|2.59
Cata#|Department|Crs#|Class name|Instructor|GPA

Actually, there are no heading lines. I just added them here for clarification what each field was for (then radoulov included them in the code tags).

I'll try you solution and get back to you.

Update: Nope, merge file shows this

10015|GEOL    |158|NATURAL DISASTERS|Rezaie Boroon Mohammad Hassan|
10016|GEOL    |158|NATURAL DISASTERS|Hamane Angelique C.|
10020|ART     | 101A     |WORLD ART|Anderson Paul A|
10021|ART     | 101B     |WORLD ART|Aguilar-Moreno Jose Manuel|

Which is just the Courses.txt's content, nothing from Grades.txt.

Perhaps I should backtrack to when I generated the file? It shouldn't matter whether I used awk, sed or something else to create the files and fields, right?

Do your command output parameters match mine?

-o 1.1 1.2 1.3 1.4 1.5 2.4

The grade field is 2.4 (i.e. file 2 field 4). Your original didn't have 2.4 .

Yes, I pretty much did a copy/paste, modify the file names, gave it an output. Your solution is similar to what I included in #3 of my original post, only I tried to include all the fields just to see if anyone of them would output.

I should point out that it just outputs the field separators from the second file as my first posts shows.

Sorry to be a pedant, but what command did you type?
All four of the sample commands posted contain logic errors, syntax errors, or typing errors? Space characters are very important in this type of command line.
These were the command lines posted: The first two are identical and have no space character after "-o". The third omits the "-o" switch. The fourth is not really relevant.

Please post the exact command typed and run it on the sample data posted and then post the output (all within code tags).

On the off-chance .... have any of the data files been copied from a Microsoft Operating System?

Are you asking is there any error output? I've done that with a 2>error.txt, and it was blank. I don't think there's any logic, syntax or typing error as I've followed the book (Tansley) and examples I've found online explaining join.

The original data files were Excel files that the campus generated. We converted the files to CSV before extracting the information. So I only know as far as downloading the file?

So this is what I've done. Course2.txt and Grades2.txt have the 4 sample entries posted.

join -t"|" -o 1.1 1.2 1.3 1.4 1.5 2.4 ~/Lab1/Courses2.txt ~/Lab1/Grades2.txt > Merged.txt

more Merged.txt

Nothing.

join -t"|" -o'1.1 1.2 1.3 1.4 1.5 2.4' ~/Lab1/Courses2.txt  ~/Lab1/Grades2.txt > Merged.txt

more Merged.txt

Nothing.

join -t"|" -o '1.1 1.2 1.3 1.4 1.5 2.4'  ~/Lab1/Courses2.txt  ~/Lab1/Grades2.txt > Merged.txt

more Merged.txt

Nothing.

If it works for you guys, I am stumped.

Working totally from your post and copying the data and the first command, mine gives output:

10015|GEOL |158|NATURAL DISASTERS|Rezaie Boroon Mohammad Hassan|1.64
10016|GEOL |158|NATURAL DISASTERS|Hamane Angelique C.|1.93
10020|ART | 101A |WORLD ART|Anderson Paul A|1.19
10021|ART | 101B |WORLD ART|Aguilar-Moreno Jose Manuel|2.59

I guess we need to know what Operating System and Shell and also version of join.

My "join" is this (According to the "what" command):

what /usr/bin/join

/usr/bin/join:
         $Revision: B.11.11_LR
         Fri Oct 27 00:58:04 PDT 2000 $

---------- Post updated at 11:36 ---------- Previous update was at 11:22 ----------

We need to check the format of the data files. This "sed" just shows funny characters and record terminators. A normal unix text file will be terminated with a "newline" character which is displayed as "$" in this example.

sed -n l Courses2.txt

10015|GEOL |158|NATURAL DISASTERS|Rezaie Boroon Mohammad Hassan|$
10016|GEOL |158|NATURAL DISASTERS|Hamane Angelique C.|$
10020|ART | 101A |WORLD ART|Anderson Paul A|$
10021|ART | 101B |WORLD ART|Aguilar-Moreno Jose Manuel| $

If I deliberately make the file incorrect by converting it to Microsoft text format it will look like this. This may be enough to upset your "join" but mine still works.

sed -n l Courses2.txt

10015|GEOL |158|NATURAL DISASTERS|Rezaie Boroon Mohammad Hassan|\r$
10016|GEOL |158|NATURAL DISASTERS|Hamane Angelique C.|\r$
10020|ART | 101A |WORLD ART|Anderson Paul A|\r$
10021|ART | 101B |WORLD ART|Aguilar-Moreno Jose Manuel| \r$

I am using SSH Secure Shell 3.2.9 (Build 283). Campus operates on XP, I have both Vista and 7. My first attempts were at school (uploading, extracting, attempting to join). When that failed, I tried again at home (reupload, etc.)

If it will help, the original files are these:
Enrollment Excel file
Grade Excel File

Again, I only know as far as downloading the file. I did, however, remove column headings from both files and the sum totals from the bottom of the Enrollment file before saving the Excel file as a CSV (comma delimited).

sed -n l ~/Lab1/Courses2.txt

10015\r|GEOL    |158|NATURAL DISASTERS|Rezaie Boroon Mohammad Hassan|$
10016\r|GEOL    |158|NATURAL DISASTERS|Hamane Angelique C.|$
10020\r|ART     | 101A     |WORLD ART|Anderson Paul A|$
10021\r|ART     | 101B     |WORLD ART|Aguilar-Moreno Jose Manuel|$

Hmm, looks like it's attached to the Catalog column (the catalog column was originally the last column of the Enrollment CSV file).

I've used tr -d "\r" to remove it (that part worked) but now I noticed (via sed) there's also a \ within the Catalog number of some of the entries. Is this a line break in SSH? Either way, it's still not working.

EDIT: Nevermind, was cramming too much stuff into one command line (copying a copy of file, etc.). Looks like it worked.

Thanks y'all.