Decoded: uniq (coreutils) – MaiZure's Projects


Note: This page explores the design of command-line utilities. It is not a user guide.
[GNU Manual] [POSIX requirement] [Linux man] [FreeBSD man]

Logical flow of uniq command (coreutils)

Summary

uniq - uniquify files (remove adjacent duplicate lines, typically from a sorted file)

[Source] [Code Walkthrough]

Lines of code: 676
Principal syscall: write()
Support syscalls: open(), close(), fadvise()
Options: 15 (9 short, 12 long, does not include legacy digits for field skip)

Ancestor included with Version 3 UNIX (1973). Original man dated late 1972.
Added to Textutils in November 1992 [First version]
Number of revisions: 189 [Code Evolution]

Helpers:

check_file() - The actual uniq procedure
different() - Compares two input lines and returns true if they differ, false if they match
find_field() - Returns the offset to a line's field to compare
size_opt() - Converts input option to a size type
strict_posix2() - Checks if the system is POSIX2 compliant (affects valid syntax)
writeline() - Outputs a line to standard output

External non-standard helpers:

die() - Exit with mandatory non-zero error and message to stderr
error() - Outputs error message to standard error with possible process termination

Setup

uniq keeps several flags and variables as globals, including:

check_chars - The number of characters to check on a line (-w)
hard_LC_COLLATE - Flag set if the LC_COLLATE locale is "hard" (anything other than C or POSIX)
ignore_case - Flag if we ignore case when comparing letters
output_first_repeated - Flag to output the first line of each repeated group
output_later_repeated - Flag to output the second and later lines of each repeated group
output_unique - Flag to output lines that are not repeated
skip_chars - The number of characters to skip after skipping any fields (-s)
skip_fields - The number of fields to skip when comparing lines (-f)
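
To make the roles of skip_fields, skip_chars, check_chars, and ignore_case concrete, here is a minimal Python sketch of how find_field() and different() might use them. It is a model of the logic under stated assumptions, not the coreutils C source: it passes the settings as parameters instead of reading globals, treats only space and tab as field separators, and ignores locale-aware collation (hard_LC_COLLATE).

```python
from typing import Optional


def find_field(line: str, skip_fields: int = 0, skip_chars: int = 0) -> int:
    """Return the offset in `line` where the comparison should start."""
    i, n = 0, len(line)
    for _ in range(skip_fields):
        if i >= n:
            break
        # A field is a run of blanks followed by a run of non-blanks.
        while i < n and line[i] in " \t":
            i += 1
        while i < n and line[i] not in " \t":
            i += 1
    # Skip extra characters, but never run past the end of the line.
    return min(i + skip_chars, n)


def different(old: str, new: str, check_chars: Optional[int] = None,
              ignore_case: bool = False) -> bool:
    """Return True if the compared portions differ, False if they match."""
    if check_chars is not None:
        old, new = old[:check_chars], new[:check_chars]
    if ignore_case:
        old, new = old.lower(), new.lower()
    return old != new
```

A caller would slice each line at the offset returned by find_field() and hand the two slices to different().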

main() introduces a few local variables:

delimiter - The end of line delimiter, \n or \0 (-z)
*file[] - The input and output file names
nfiles - The number of file names stored so far (the next index into file[])
optc - The character for the next option to process
output_option_used - Flag if the user requested a specific output mode
posixly_correct - Flag if the POSIXLY_CORRECT environment variable is set
skip_field_option_type - Holds the skip field processing behaviors (unused, legacy, current)

Parsing

Parsing answers the following questions to define the execution parameters:

What is the range of comparison between lines?
Is the comparison case sensitive?
Which lines should be output (first match, subsequent matches)?
Should we use the NUL delimiter and how should it apply to groups?
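
As an illustration of the "which lines should be output" question, the sketch below shows one way the three output flags could be derived from the -d, -D, and -u options. The flag names come from the globals listed above; the OutputFlags class, the apply_output_option() function, and the exact mapping are assumptions based on the documented behavior of those options, not a copy of the source.

```python
from dataclasses import dataclass


@dataclass
class OutputFlags:
    # Defaults model plain `uniq`: print unique lines and the first
    # line of each repeated group, but not the later repeats.
    output_unique: bool = True
    output_first_repeated: bool = True
    output_later_repeated: bool = False
    output_option_used: bool = False


def apply_output_option(flags: OutputFlags, opt: str) -> OutputFlags:
    """Update `flags` for one short option letter (hypothetical helper)."""
    if opt == "d":      # only one line per repeated group
        flags.output_unique = False
    elif opt == "D":    # every line of each repeated group
        flags.output_unique = False
        flags.output_later_repeated = True
    elif opt == "u":    # only lines that are never repeated
        flags.output_first_repeated = False
    else:
        raise ValueError(f"not an output option: {opt}")
    flags.output_option_used = True
    return flags
```

Under this model, combining -d and -u leaves no flag that selects any line, so nothing would be printed.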

Parsing failures

These failure cases are explicitly checked:

Providing too many file names
Nonsensical number of fields, characters, or bytes to skip/check
Combining a grouping method with an output method
Grouping and printing repeat counts
Printing all duplicated lines and repeat counts
Unknown option used
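
The combination checks in this list can be sketched as a single validation pass. The check_options() function, its parameters, and the message wording are hypothetical and chosen for illustration; the real utility reports these through its own error routines.

```python
from typing import List, Optional


def check_options(file_operands: List[str],
                  grouping: bool = False,
                  count: bool = False,
                  output_option_used: bool = False,
                  output_later_repeated: bool = False) -> Optional[str]:
    """Return an error message for an invalid combination, else None."""
    # At most two file names are accepted: one input and one output.
    if len(file_operands) > 2:
        return "extra operand: " + file_operands[2]
    if grouping and output_option_used:
        return "grouping is mutually exclusive with an output mode"
    if grouping and count:
        return "grouping and printing repeat counts is meaningless"
    if count and output_later_repeated:
        return "printing all duplicated lines and repeat counts is meaningless"
    return None
```

Returning a message instead of exiting keeps the sketch easy to test; a caller would print the message and the usage hint, then exit with a failure status.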

User-specified parsing failures result in a short error message followed by the usage instructions. Access-related parsing errors die with an error message.

Execution

uniq employs a small optimization: depending on the behavior selected by the user, it can take a simpler loop that reduces processing and improves responsiveness. For clarity, only the more complex, general path through file checking is listed here:

Open the input and output files
Initialize the line buffers
While there are still lines of input:
- Check that there is still more input; otherwise exit the loop
- Find the next field
- Compare the lines and if they match, count the match
- Add group or prepend delimiter
- Output the lines if they don't match
- Add end of line delimiter
Close the input files
Free the line buffers
Return success
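
The loop above can be modeled in a few lines of Python. This is a simplified sketch, not the C implementation: it works on an in-memory list instead of files, compares whole lines (a fuller model would compare only the portion selected by find_field() and different()), and leaves out group delimiters. The uniq_lines() name and its parameters are assumptions made for the example.

```python
from typing import Iterable, List


def uniq_lines(lines: Iterable[str], count: bool = False,
               output_unique: bool = True,
               output_first_repeated: bool = True,
               output_later_repeated: bool = False) -> List[str]:
    out: List[str] = []

    def writeline(line: str, match: bool, linecount: int) -> None:
        # Pick the flag that governs this line, then emit it if wanted.
        if linecount == 0:
            wanted = output_unique
        elif not match:
            wanted = output_first_repeated
        else:
            wanted = output_later_repeated
        if wanted:
            out.append(f"{linecount + 1:7d} {line}" if count else line)

    it = iter(lines)
    try:
        prev = next(it)          # no input at all: nothing to do
    except StopIteration:
        return out

    match_count = 0
    for this in it:
        match = this == prev
        match_count += match     # count the match
        if not match or output_later_repeated:
            writeline(prev, match, match_count)
            prev = this
            if not match:
                match_count = 0  # a new group starts here
    writeline(prev, False, match_count)  # flush the final group
    return out
```

For example, `uniq_lines(["a", "a", "b"])` returns `["a", "b"]`, and passing `output_unique=False` (the -d behavior) returns `["a"]`.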

Failure cases:

Too many repeating lines
Unable to open or close I/O files
Unable to read from input source

All failures at this stage output an error message to stderr and return without displaying usage help.
