[GNU Manual] [POSIX requirement] [Linux man] [FreeBSD man]
Summary
sort - sort, merge, or sequence check text files
Lines of code: 4780
Principal syscalls: None
Support syscall: dup2(), fork()
Options: 54 (24 short, 30 long)
Descended from sort in Version 2 UNIX (1972)
Added to Textutils in August 1998 [First version]
Number of revisions: 456
The sort utility is unique among the coreutils in that some tasks are multi-threaded. It implements merge sort through two custom data structures (binary tree and a priority queue [heap]) along with the gnulib hash table. Since sorting is a fundamental problem in computer science, it's no surprise that this utility has an 'academic' feel.
Setup
main()
initializes a few variables used througout the utility:
*key
- A generic key pointer (usually to key_buf)key_buf
- Buffer for a key during processinggkey
- A 'global' default keygkey_only
- Only use the default key*s
- A string used as a key suffix (used several ways)c
- The first character of the option to processcheckonly
- Flag for check sorting only (two modes)mergeonly
- Flag for only merging input filesrandom_source
- Random source file nameneed_random
- Flag for random source usagenthreads
- Number of threads to use for sortingnfiles
- Number of input files to processposixly_correct
- Flag for if the environment enforces POSIXposix_ver
- Holds the posix versiontraditional_usage
- Flag the force non-POSIX key styles**files
- The pointer to the list of files*files_from
- Name of a file containg the list of input filestok
- Holds the token struct for an input file*outfile
- The output file for sortlocale_ok
- Flag set if i18n was properly configured
Then sort goes through a few preliminary steps:
- The usual i18n setup
- Initialize character lookup tables
- Customize signal handlers
- Initialize global key
- Prepare input file arguments
Character tables
sort initializes four tables indexed by character value and used for handling decisions: A boolean table identifying blank characters, to include the newline. A boolean table of nonprintable characters, a table of non-dictionary characters, and a translation table to upper case
Signal handlers
Sorting produces temporary files that need to be cleaned if we receive a termination signal. We define a custom sighandler
that removes the files and apply it to a subset of signals, caught_signals
. During critical section execution, child threads block these signals since temp files are in use.
Parsing
Parsing command line options answers the following questions:
- Which mode are we using, sorting, merging, or neither? (Neither implies sort check)
- For sorting, which key fields are we using?
- What is the sort basis? (numbers, letters, descending, etc)
- Where is input coming from? (stdin, file arguments, a list file)
The most complex task in parsing is setting the key information from user input. It's further complicated by the historical differences between BSD and POSIX.
Parsing failures
These failure cases are explicitly checked:
- Invalid key position values
- Unknown character in field specification
- Request both check-only variants (quiet and diagnose)
- Multiple output files
- Multiple random sources
- Missing or unusable tab characters
- Combining input file list and command line files
- Providing extra operands for check-only mode
Extra Comments
Execution
Sorting executes through three distinct modes: checking a sort, merging input files (assumed to be sorted), or performing a full sort and merge.
The full sort is the most complex path through the utility and involves specific tools including a binary merge trees, child threads, priority queues, and temporary files.
Data structures
Sorting uses a binary merge tree and a priority queue which work together to process individual files then tie them together.
Sorting
The full sort()
pass goes through the following tasks
- Find the maximum number of possible independent threads
- For each input file...
- Buffer all file lines
- Open a temporary file for the upcoming sort
- Construct a merge tree to divide up the work among threads
merge_tree_init()
- Fork child threads for each tree layer. Leaf threads sort and queue work.
sortlines()
- Merge the results up the tree. Root node outputs to temp file
merge_loop()
andmergelines_node()
- Close the temporary file
Now we have temporary files that contain sorted data ready for merger
Merging (after sorting)
Putting the temporary files together in to a single output file can go several ways. We'll look at both the simple and the complicated
Easy merge
The easy case happens when there are few (or one) temporary files.
- Open the temp files as possible in a single pass
open_input_files()
- Open the output target
- Merge the files directly to the output
mergefps()
Complicated merge
We need to do more work if there are more temporary files than we can reasonably manage in a single pass
- Prepare a temporary output file
- Open as many temp files as possible in a single pass
mergefiles()
- Merge the files in to a single temporary file for that pass
mergefps()
- Repeat passes and coalesce files for subsequent passes
- Attempt to merge temp files to the final output
- Close file input descriptors and
mergefps()
until all remaining are processed
Merging Only
The difference with merge-only is that there are no temporary files to process. The user has provided a list of actual files to sort. The merge procedure is effectively the same.
--
There are several ways that file processing could fail:
Failure cases:
- Unable to open any input or output files
- Unusable file names
- Failure to localize
- Unable to close stdin
- Any read or write failures from I/O
- Output file truncation or mode error
- Unable to delete temporary files
- Attempting to merge too many files at once
- Nonsensical number of threads requested
Helpers
There are enough helper functions to separate them by purpose:
Threading helpers:cs_enter()
- Blocks signals may that interrupt executioncs_leave()
- Restores signal handlers to nornaldelete_proc()
- Removes a temp process from the process has tablereap()
- Wait for threads to exit then cleanreap_all()
- Remove and clean all child threadsreap_exited()
- Reap all exited child threadsreap_some()
- Remove at least one child threadregister_proc()
- Adds a temp process to the proccess hash tablewait_proc()
- Wait for and reap a child thread
check_ordering_compatibility()
- Verifies key consistencydebug_key()
- Outputs a line and highlights the keydebug_key_compare()
- Verifies key sortabilitydebug_line()
- Checks all keys against a linedefault_sort_size()
- Computes the default sorting sizeinsertkey()
- Add a new sorting keykey_init()
- Initializes (zeroes) a keyfieldkey_numeric()
- Returns true if the key is a numberkey_to_opts()
- Converts a key to short optionskey_warnings()
- Verifies and complains about given keyskeycompare()
- Compares two lines using given keyslimfield()
- Finds the end of a key field (character after end)mark_key()
- Underlines a key in a lineparse_field_count()
- Parses and bounds keyssequential_sort()
- Recursive compare and swap of line arrayset_ordering()
- Applies the given options to a keysort()
- Starts the sort proceduresort_buffer_size()
- Gets the sort buffer sizesort_die()
- Exits sorting with SORT_FAILUREsortlines()
- Threaded-recursive sorting proceduresortlines_thread()
- Sort procedure for child threads
avoid_trashing_input()
- Ensures no clashing I/O files names during mergemerge()
- Starts the merge proceduremerge_loop()
- Merges lines from the queuemergefiles()
- Verifies files and calls mergefps to merge filesmergelines()
- Compares and merges two lines appropriatelymergelines_node()
- Meges nodes from the merge treemergefps()
- Merges input files to another output fd
compare_nodes()
- Compares two nodes based on priority (level)init_node()
- Initialize a merge tree nodelock_node()
- Lock the node to limit accessmerge_tree_destroy()
- Free a merge treemerge_tree_init()
- Create a new merge tree of given sizeunlock_node()
- Unlock the node for general access
queue_check_insert()
- Check node before queuingqueue_check_insert_parent()
- Check and insert a node's parentqueue_destroy()
- Free the queuequeue_init()
- Allocate space for a new queue of max nodesqueue_insert()
- Add a new node to the queue and re-heapqueue_pop()
- Pop the top priority node from the pq
add_temp_dir()
- Adds a directory for temporary filesasync_safe_die()
- Outputs thread errors safelybadfieldspec()
- Exits from a bad field specificationbegfield()
- Returns a pointer to the start of a fieldbuffer_linelim()
- Return a ponter just beyond the linecheck()
- Verify file line orderingcheck_inputs()
- Verify input file availabilitycheck_output()
- Verify output file availabilitycleanup()
- Remove any temporary files that may remaincompare()
- Compare two linescompare_random()
- Sorts based on key hash (~random order)create_temp()
- Wrapper to create a temp file or failcreate_temp_file()
- Creates a temporary filedebug_width()
- Computes the printed width of a memory region (compress tabs)exit_cleanup()
- High-level exit routinefillbuf()
- Fills a buffer from a streamfind_unit_order()
- Returns the order of magnitude of a numbergeneral_numcompare()
- Generalized number-as-string comparatorgetmonth()
- Returns an integer month from an input stringhuman_numcompare()
- Compares numbers with unit labelsincompatible_options()
- Exits with incompatibilty errorinitbuf()
- Allocates and aligns buffers for linesinittables()
- Initializes character tablesmaybe_create_temp()
- Creates a temporary file if possiblemove_fd()
- Duplicates a file descriptor (dup2 wrapper)nan_compare()
- Compares NaN values via memcmpnumcompare()
- Compares numbers-as-strings and returns resultopen_input_files()
- Opens and prepares multiple input files for useopen_temp()
- Create a file descriptor for temporary filespipe_fork()
- Create a new process and redirect I/O for compressionproctab_comparator()
- Compare function for the process tableproctab_hasher()
- Hash function for the process tablerandom_md5_state_init()
- Initialize a random vectorsighandler()
- Generic defualt signal handlerspecify_nmerge()
- Sets the number of concurrent input filesspecify_nthreads()
- Sets the number of concurrent threadsspecify_sort_size()
- Sets the memory limit while sortingstream_open()
- Creates a FILE stream for a file. Null is STDOUTstruct_month_cmp()
- Month comparison functiontraverse_raw_number()
- Walks a number and returns the largest digitwrite_line()
- Writes a line to a temporary filewrite_unique()
- Writes unique lines (-u), otherwise normal writexfclose()
- Closes a FILE stream and may fail.xfopen()
- Creates a FILE stream. Never null, may fail.xstrxfrm()
- Transforms a string while checking erroszaptemp()
- Removes temporary file from the list