Decoded: shuf (coreutils)

[Back to Project Main Page]

Note: This page explores the design of command-line utilities. It is not a user guide.
[GNU Manual] [No POSIX requirement] [Linux man] [No FreeBSD requirement]

Logical flow of shuf command (coreutils)

Summary

shuf - shuffling text

[Source] [Code Walkthrough]

Lines of code: 611
Principal syscall: write()
Support syscalls: open(), close(), fadvise()
Options: 15 (6 short, 9 long, does not include perm digits)

Added to Coreutils in August 2006 [First version]
Number of revisions: 60 [Code Evolution]

Helpers:
  • input_from_argv() - Loads user-provided arguments in to an array
  • input_size() - Returns the size of the input if possible
  • next_line() - Returns a pointer to the next line
  • read_input() - Reads lines from input for standard permutation. Allocates heap, and returns size
  • read_input_reservoir_sampling() - Reads lines from input used in reservoir sampling
  • write_permuted_output_reservoir() - Outputs permuted lines from the reservoir
  • write_permuted_lines() - Outputs lines using a permuted index
  • write_permuted_numbers() - Outputs a permuted number range
  • write_random_lines() - Outputs a random line with replacement using an entropy source
  • write_random_numbers() - Outputs a random ranged number with replacement using an entropy source
External non-standard helpers:
  • die() - Exit with mandatory non-zero error and message to stderr
  • error() - Outputs error message to standard error with possible process termination
  • randint_all_new() - Creates a new random source structure (in randint.c)
  • randperm_new() - Generates a new permutation index array (in randperm.c)

Setup

shuf manages execution by declaring a long list of local variables in main():

  • echo - Flag if each command line argument is considered a line
  • eolbyte - The line delimiter character. Usually \n, but may be \0.
  • head_lines - The number of output lines requested by the user
  • hi_input - User specified high input number
  • i - integer to hold the number of successful byte written
  • **input_lines - Pointer to the list of lines
  • input_range - Flag if the user specified an input range
  • **line - Pointer to the permuted line array
  • lo_input - User specified low input number
  • n_lines - The number of lines to output
  • n_operands - The number of user-provided arguments on the CLI
  • **operand - Pointer to the beginning of the user provided arguments
  • optc - The character for the next option to process
  • *outfile - The name of the output file
  • *permutation - Pointer to the permutation index array
  • *randint_source - The generated random source structure (specific randread source)
  • *random_source - Name of the entropy source file
  • repeat - Flag to triggle sample with replacement
  • *reservoir - The linebuffer used for reservoir sampling
  • use_reservoir_sampling - Flag if reservoir sampling is used

Parsing

Parsing answers the following questions to define the execution parameters

  • What is the source of input? File, Arguments, Ranges?
  • Do we repeat lines (sample with replacement)?
  • Do we output all input lines, or a subset?
  • How is output delimited?

Parsing failures

These failure cases are explicitly checked:

  • Using both input ranges and arguments for input
  • Adding extra arguments when reading input ranges
  • Nonsensical input ranges or output line counts
  • Providing multiple input ranges
  • Specifying multiple input or output files
  • Unknown option used

User specified parsing failures result in a short error message followed by the usage instructions. Access related parsing errors die with an error message.


Execution

Input comes from three possible sources: Read in from a file, provided as arguments on STDIN, or generated via input range options. The source determines the execution path for the output. After input source is chosen, execution looks like this:

  • Initialize random (default: ISAAC cipher - nonce from /dev/urandom)
  • Prepare the resevoir sample, if necessary
  • Build a permutation index, if necessary
  • Open the output file, if requested
  • Write output lines/numbers either randomly or permuted
  • Repeated last write until all lines written

Failure cases:

  • No lines written
  • Unable to open input source or output file
  • Unable to initialize random source
  • Unable to read from input source
  • Input is too large to read in

All failures at this stage output an error message to STDERR and return without displaying usage help

Extra comments

Source of random

Coreutils includes a self-contained implementation of the ISAAC cipher used in both the shuf and shred utilities. There's a fair amount of plumbing between the entropy source and the utility needed to generate, store, and distribute the numbers. This allows a minimal framework to bring random numbers in to this (and future) utilities. Consider the following diagram:

Coreutils implementation of ISAAC PRNG

Reservoir Sampling

shuf employs reservoir sampling (paper) to quickly reorder a large, unknown-sized input. The key idea is to read all lines sequentially in to the reservoir until it's full, then randomly swap subsequent lines in to the reservoir with decreasing probability. The result is a random selection of lines built in one pass without knowledge of where the end of input lies. The process happens entirely in read_input_reservoir_sampling() and flows like this:

  • Fill the reservoir with input lines (sequential)
    • Realloc reservoir as needed to continue growth
  • Read in a new input line beyond the max reservoir size
  • Select a line to replace (uniform random).
  • If selection falls within the reservoir, swap it out
    • Otherwise, line discard newly read line
  • Repeat until size limit or end of input
  • Return count of lines processed

Reservoir construction alone does not guarantee shuffled output. Consider the case where the input size matches the reservoir size -- the reservoir contains exactly the input in order. While we can mitigate this by enforcing a large minimum input size, there is likely still read-order bias for small inputs. The solution is to sample the reservoir from using our PRNG as an index. (See above). In short, construct the reservoir, then randomly read from the reservoir to produce shuffled results.


[Back to Project Main Page]