Summary
join - join lines on a common field
Lines of code: 1200
Principal syscall: write()
Support syscalls: open(), close(), fadvise()
Options: 17 (10 short, 7 long)
Descended from join introduced in Version 7 UNIX (1979)
Added to Textutils in November 1992 [First version]
Number of revisions: 211
The join utility works much like the JOIN operation in relational databases. This is more involved than the simple text parsing seen in other utilities, so the implementation relies on four custom structures to manage I/O.
Helpers:
add_field()
- Adds a field to the outlist
add_field_list()
- Adds to a field list
add_file_name()
- Adds the file name to the input file list
advance_seq()
- Adds another line to a sequence
check_order()
- Verifies the ordering of lines in a file
decode_field_spec()
- Decodes the -o option arguments
delseq()
- Deallocates a sequence
extract_field()
- Retrieves field data from a line
free_spareline()
- Deallocates all spare lines
freeline()
- Deallocates line resources
getseq()
- Builds a sequence from a file line
get_line()
- Parses a line from a file and returns success
init_linep()
- Initializes a line structure
initseq()
- Initializes a sequence to zero entries
join()
- The top level join procedure
keycmp()
- Compares two lines and returns a ternary result
prfield()
- Prints a single field in a line
prfields()
- Prints all fields in a line
prjoin()
- Joins two input lines on a key and prints
reset_line()
- Resets the number of fields in a line to zero
set_join_field()
- Sets the join field value
string_to_join_field()
- Converts a decimal string to represent a field value
SWAPLINES()
- Function-like macro to perform a line swap
xfields()
- Creates the fields structure from a line
die()
- Exit with mandatory non-zero error and message to stderr
error()
- Outputs error message to standard error with possible process termination
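As an illustration of the ternary comparison that keycmp() is described as returning, here is a minimal sketch in C. The name field_cmp() and its signature are hypothetical, and the sketch compares raw bytes with optional ASCII case folding; it makes no attempt at locale-aware collation, which the real utility may apply.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not the coreutils function): compare two join keys
   given as pointer + length, since fields are not NUL-terminated.
   Returns -1, 0, or 1. */
static int field_cmp(const char *k1, size_t len1,
                     const char *k2, size_t len2, bool ignore_case)
{
    assert(k1 != NULL && k2 != NULL);

    /* Compare only the bytes both keys have in common. */
    size_t n = len1 < len2 ? len1 : len2;
    int diff = 0;

    if (ignore_case) {
        for (size_t i = 0; i < n && diff == 0; i++)
            diff = tolower((unsigned char)k1[i]) - tolower((unsigned char)k2[i]);
    } else {
        diff = memcmp(k1, k2, n);
    }

    if (diff != 0)
        return diff < 0 ? -1 : 1;

    /* Common prefix is equal: the shorter key sorts first. */
    return (len1 > len2) - (len1 < len2);
}
```

Normalizing the result to -1, 0, or 1 keeps callers simple: a negative result means the line from file 1 sorts first, a positive result means the line from file 2 sorts first, and zero means the keys pair up.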
Setup
join defines four custom structures needed to keep track of I/O:
struct field
- Tracks a field in a line with a start pointer and a length
struct line
- Holds a line in a buffer and tracks the number of fields and a pointer to each
struct outlist
- A list of output field specifications, each naming a file and a field
struct seq
- A sequence of lines with the same join field
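To make the relationship between these structures concrete, the following is a rough sketch in C. The member names are illustrative assumptions based on the descriptions above, not a copy of the source, and split_fields() is a hypothetical, simplified stand-in for xfields() that splits on a single delimiter character only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* A field: start pointer into the line text plus a length. */
struct field {
    const char *beg;
    size_t len;
};

/* A line: the text plus an array of fields pointing into it. */
struct line {
    const char *buf;       /* line text without the trailing delimiter */
    size_t nfields;        /* fields currently in use */
    size_t nfields_alloc;  /* capacity of the fields array */
    struct field *fields;
};

/* One output specification: which file and which field to print. */
struct outlist {
    int file;
    size_t field;
    struct outlist *next;
};

/* A sequence of lines that share the same join field. */
struct seq {
    size_t count;
    size_t alloc;
    struct line **lines;
};

/* Hypothetical, simplified field splitter. Returns 0 on success,
   -1 if memory runs out. */
static int split_fields(struct line *line, const char *text, char delim)
{
    assert(line != NULL && text != NULL && delim != '\0');
    line->buf = text;
    line->nfields = 0;

    for (const char *p = text;;) {
        const char *end = strchr(p, delim);
        size_t len = end ? (size_t)(end - p) : strlen(p);

        /* Grow the field array geometrically when it is full. */
        if (line->nfields == line->nfields_alloc) {
            size_t n = line->nfields_alloc ? line->nfields_alloc * 2 : 4;
            struct field *f = realloc(line->fields, n * sizeof *f);
            if (f == NULL)
                return -1;
            line->fields = f;
            line->nfields_alloc = n;
        }

        line->fields[line->nfields].beg = p;
        line->fields[line->nfields].len = len;
        line->nfields++;

        if (end == NULL)
            break;
        p = end + 1;
    }
    return 0;
}
```

Because each field only points into the line buffer, splitting a line copies no text; the trade-off is that a field is valid only as long as its line is.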
There are also a few important globals that manage execution:
autocount_1
- The number of fields for file 1 during autoformatting
autocount_2
- The number of fields for file 2 during autoformatting
autoformat
- Flag to infer the output format from the first line of input files
*empty_filler
- The string to print in place of empty fields
eolchar
- The line delimiter, default \n
g_names[]
- The real names of file1 and file2
hard_LC_COLLATE
- Flag set if the LC_COLLATE locale is hard (not the plain C/POSIX locale)
ignore_case
- Flag to ignore letter casing on join fields (-i)
issue_disorder_warning[]
- Flag for each file that has set a warning
join_field_1
- The field number to join on in file 1
join_field_2
- The field number to join on in file 2
join_header_lines
- Flag to use the first line of a file for the header
line_no[]
- The number of lines read from file1 and file2
outlist_end
- The end of the outlist
*outlist_head
- The beginning of the outlist
*prevline[]
- The previous line read from file1 and file2
print_pairables
- Flag to print lines that are matched (cleared by -v)
print_unpairables_1
- Flag to print unpairable lines from file 1
print_unpairables_2
- Flag to print unpairable lines from file 2
seen_unpairable
- Flag set if we've processed a line without a match
*spareline[]
- An additional buffer for a line from file1 and file2, if needed
tab
- The character used for the field delimiter
uni_blank
- A line reference dedicated to separating lines
main() introduces a few local variables:
i
- Integer iterator for file number
joption_count[]
- The join field numbers
fp1
- The file stream for file1
fp2
- The file stream for file2
nfiles
- The number of file arguments provided
operand_status[]
- Tracks the type of operand (file, join arg, etc.)
optc
- The character for the next option to process
optc_status
- The type of the operand we're processing
prev_optc_status
- The type of the previous operand
Parsing
Parsing sets the possible execution parameters for join. The user answers the following questions:
- Should the files be ordered?
- Which field from which file should be the join key?
- Are the keys case sensitive?
- Is there a header line?
- Should the line entries be NUL terminated?
The join field is initialized during parsing (or just after in one case) via set_join_field().
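The two steps involved, converting a decimal string to a field value and then storing it as the join field, might look roughly like the sketch below. The names parse_join_field(), set_field_once(), and FIELD_UNSET are hypothetical, as is the choice to reject a second, conflicting value; the real helpers report errors through die()/error() rather than return codes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed sentinel meaning "no join field chosen yet". */
#define FIELD_UNSET SIZE_MAX

/* Hypothetical parser: turn a 1-based decimal field number into a
   zero-based index. Returns 0 on success, -1 on a nonsensical number. */
static int parse_join_field(const char *str, size_t *out)
{
    assert(str != NULL && out != NULL);

    /* Reject signs, whitespace, and empty strings up front. */
    if (*str < '0' || *str > '9')
        return -1;

    char *end;
    errno = 0;
    unsigned long val = strtoul(str, &end, 10);
    if (errno == ERANGE || *end != '\0' || val == 0)
        return -1;

    *out = (size_t)val - 1;
    return 0;
}

/* Hypothetical setter: accept the first value, and any repeat of the
   same value, but refuse a conflicting one. */
static int set_field_once(size_t *var, size_t val)
{
    assert(var != NULL);
    if (*var != FIELD_UNSET && *var != val)
        return -1;
    *var = val;
    return 0;
}
```

Keeping fields zero-based internally means the stored value can index the field array directly, while users still count from 1 on the command line.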
Parsing failures
These failure cases are explicitly checked:
- User provides a nonsensical field number
- User gives an invalid tab
- Trying to use STDIN for both input files
- Missing source files
- User inputs invalid field specifiers
- Unknown option used
Parsing failures caused by user input produce a short error message followed by the usage instructions. Access-related parsing errors die with an error message.
Execution
The join utility is fairly simple to understand despite the number and depth of the support functions. The high-level operation goes like this:
- Open the two input sources, one of which could be STDIN
- Output the header if requested
- Initialize sequence buffers to hold matching lines
- Read lines from files and compare the keys for match
- If the keys match:
  - Read all matching lines from file 1 and add to the sequence
  - Read all matching lines from file 2 and add to the sequence
  - Print the resulting output sequence
- Verify the file ordering if requested
- If any lines were unpairable, output those lines.
- Clean up all data structures
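The matching steps above amount to a merge join over two sorted inputs. The sketch below shows that core loop in C on in-memory arrays instead of file streams; struct row and merge_join() are hypothetical names, keys are compared with plain strcmp(), and unpairable lines are simply skipped rather than printed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical in-memory stand-in for a parsed input line. */
struct row {
    const char *key;
    const char *val;
};

/* Join two key-sorted arrays, appending "key val1 val2\n" to out for
   every pairing. Returns the number of pairs written. */
static size_t merge_join(const struct row *a, size_t na,
                         const struct row *b, size_t nb,
                         char *out, size_t outsz)
{
    size_t i = 0, j = 0, pairs = 0, used = 0;

    assert(out != NULL && outsz > 0);
    out[0] = '\0';

    while (i < na && j < nb) {
        int cmp = strcmp(a[i].key, b[j].key);
        if (cmp < 0) { i++; continue; }  /* unpairable line in input 1 */
        if (cmp > 0) { j++; continue; }  /* unpairable line in input 2 */

        /* Keys match: find the run of equal keys in each input. */
        size_t ie = i, je = j;
        while (ie < na && strcmp(a[ie].key, a[i].key) == 0) ie++;
        while (je < nb && strcmp(b[je].key, b[j].key) == 0) je++;

        /* Emit the cross product of the two runs. */
        for (size_t x = i; x < ie; x++) {
            for (size_t y = j; y < je; y++) {
                int n = snprintf(out + used, outsz - used, "%s %s %s\n",
                                 a[x].key, a[x].val, b[y].val);
                if (n < 0 || (size_t)n >= outsz - used) {
                    out[used] = '\0';    /* drop the partial line */
                    return pairs;
                }
                used += (size_t)n;
                pairs++;
            }
        }
        i = ie;
        j = je;
    }
    return pairs;
}
```

Collecting each run of equal keys before printing is what lets duplicate keys on both sides produce every combination, which is the role the sequence buffers play in the description above. The approach only works when both inputs are sorted on the key, which is why ordering is verified.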
Failure cases:
- Unable to read from input file
- Bad source file number provided
- Invalid join field
- Files not properly ordered
All failures at this stage output an error message to STDERR and return without displaying usage help.