Summary
uniq - uniquify files (remove duplicate lines from a sorted file)
Lines of code: 676
Principal syscall: write()
Support syscalls: open(), close(), fadvise()
Options: 15 (9 short, 12 long, does not include legacy digits for field skip)
Ancestor included with Version 3 UNIX (1973). Original man dated late 1972.
Added to Textutils in November 1992 [First version]
Number of revisions: 189 [Code Evolution]
check_file()
- The actual uniq procedure
different()
- Checks whether two input strings match; returns false if they match, true if they differ
find_field()
- Returns the offset to a line's field to compare
size_opt()
- Converts input option to a size type
strict_posix2()
- Checks if the system is POSIX2 compliant (affects valid syntax)
writeline()
- Outputs a line to standard output
die()
- Exit with mandatory non-zero error and message to stderr
error()
- Outputs error message to standard error with possible process termination
Setup
uniq keeps several flags and variables as globals, including:
check_chars
- The number of characters to check on a line (-w)
hard_LC_COLLATE
- Flag set if LC_COLLATE names a locale other than C/POSIX, so comparisons must follow locale collation rules
ignore_case
- Flag if we ignore case when comparing letters
output_first_repeated
- Flag to output the first line of a repeating group
output_later_repeated
- Flag to output the second and later lines of a repeating group
output_unique
- Flag to output unique (non-repeated) lines
skip_chars
- The number of characters to skip after skipping any fields
skip_fields
- The number of fields to skip when comparing lines
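To make the roles of skip_fields, skip_chars, check_chars and ignore_case concrete, here is a minimal sketch of field skipping and comparison. It is not the coreutils code: the struct cmp_opts and the names field_offset(), lines_differ() and same_key() are hypothetical, the settings are passed in a struct rather than kept as globals, and locale-aware collation (the hard_LC_COLLATE case) is left out.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <strings.h>

/* Hypothetical container for the comparison settings. */
struct cmp_opts {
    size_t skip_fields;  /* -f: blank-separated fields to skip */
    size_t skip_chars;   /* -s: bytes to skip after the fields */
    size_t check_chars;  /* -w: bytes to compare; SIZE_MAX = no limit */
    bool ignore_case;    /* -i */
};

/* Return the offset in line[0..len) where the comparison should start. */
static size_t field_offset(const char *line, size_t len,
                           const struct cmp_opts *o)
{
    assert(line != NULL && o != NULL);
    size_t i = 0;
    for (size_t f = 0; f < o->skip_fields && i < len; f++) {
        while (i < len && isblank((unsigned char)line[i]))
            i++;                 /* blanks in front of the field */
        while (i < len && !isblank((unsigned char)line[i]))
            i++;                 /* the field itself */
    }
    size_t rest = len - i;
    return i + (o->skip_chars < rest ? o->skip_chars : rest);
}

/* Return true if the two keys differ, false if they match. */
static bool lines_differ(const char *a, size_t alen,
                         const char *b, size_t blen,
                         const struct cmp_opts *o)
{
    if (alen > o->check_chars) alen = o->check_chars;
    if (blen > o->check_chars) blen = o->check_chars;
    if (alen != blen)
        return true;
    if (o->ignore_case)          /* assumes no embedded NUL bytes */
        return strncasecmp(a, b, alen) != 0;
    return memcmp(a, b, alen) != 0;
}

/* Convenience wrapper for NUL-terminated lines without the delimiter. */
static bool same_key(const char *a, const char *b, const struct cmp_opts *o)
{
    size_t alen = strlen(a), blen = strlen(b);
    size_t ao = field_offset(a, alen, o), bo = field_offset(b, blen, o);
    return !lines_differ(a + ao, alen - ao, b + bo, blen - bo, o);
}
```

With skip_fields set to 1, the lines "1 apple" and "2 apple" compare as equal under this sketch because the leading number is skipped; with skip_fields set to 0 they differ.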
main() introduces a few local variables:
delimiter
- The end of line delimiter, \n or \0 (-z)
*file[]
- The input and output file names
nfiles
- The index to file[]
optc
- The character for the next option to process
output_option_used
- Flag if the user requested a specific output mode
posixly_correct
- Flag if the POSIXLY_CORRECT environment variable is set
skip_field_option_type
- Holds the skip field processing behaviors (unused, legacy, current)
Parsing
Parsing answers the following questions to define the execution parameters:
- What is the range of comparison between lines?
- Is the comparison case sensitive?
- Which lines should be output (first match, subsequent matches)?
- Should we use the NUL delimiter and how should it apply to groups?
Parsing failures
These failure cases are explicitly checked:
- Providing too many file names
- Nonsensical number of fields, characters, or bytes to skip/check
- Combining a grouping method with an output method
- Grouping and printing repeat counts
- Printing all duplicated lines and repeat counts
- Unknown option used
User-specified parsing failures result in a short error message followed by the usage instructions. Access-related parsing errors die with an error message.
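The conflict checks can be pictured as a small validation pass that runs once all options have been read. The sketch below is illustrative only: struct parsed_opts, its field names, check_conflicts() and the message wording are assumptions made for this example, not the identifiers or exact messages used by uniq.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of what option parsing collected. */
struct parsed_opts {
    int nfiles;              /* operands seen: input first, then output */
    bool grouping;           /* --group was given */
    bool output_option_used; /* any of -c, -d, -D, -u was given */
    bool count;              /* -c was given */
    bool all_repeated;       /* -D was given */
};

/* Return NULL if the combination is acceptable, otherwise a short
   reason (illustrative wording) suitable for an error message. */
static const char *check_conflicts(const struct parsed_opts *p)
{
    assert(p != NULL);
    if (p->nfiles > 2)
        return "extra operand";
    if (p->grouping && p->output_option_used)
        return "--group cannot be combined with -c, -d, -D or -u";
    if (p->count && p->all_repeated)
        return "printing all duplicates with counts is meaningless";
    return NULL;
}
```

A caller would print the returned reason followed by the usage text and exit with a non-zero status, which matches the behavior described for user-specified parsing failures.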
Execution
uniq employs a small optimization to minimize processing and improve responsiveness, depending on the behavior selected by the user. For brevity, only the more complex path through file checking is outlined here:
- Open the input and output files
- Initialize the line buffers
- While there are still lines of input:
  - Check that there is still more input; otherwise stop
  - Find the next field
  - Compare the lines and if they match, count the match
  - Add group or prepend delimiter
  - Output the lines if they don't match
  - Add end of line delimiter
- Close the input files
- Free the line buffers
- Return successfully
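The loop above can be sketched in a few dozen lines. This is a simplified illustration, not the coreutils implementation: uniq_stream() and emit() are hypothetical names, whole lines are compared (no field skipping or grouping), the delimiter is fixed to a newline, and POSIX getline() stands in for the line buffer routines.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Write one line, optionally preceded by its occurrence count. */
static void emit(FILE *out, const char *line, size_t len,
                 uintmax_t count, bool show_counts)
{
    if (show_counts)
        fprintf(out, "%7ju ", count);
    fwrite(line, 1, len, out);
    fputc('\n', out);            /* end of line delimiter */
}

/* Collapse adjacent duplicate lines from `in` onto `out`.
   Returns 0 on success, -1 on a read error. */
static int uniq_stream(FILE *in, FILE *out, bool show_counts)
{
    assert(in != NULL && out != NULL);
    char *prev = NULL, *cur = NULL;  /* the two line buffers */
    size_t prev_cap = 0, cur_cap = 0;
    ssize_t prev_len = -1, cur_len;  /* -1: no previous line yet */
    uintmax_t repeats = 0;           /* matches seen for `prev` */

    while ((cur_len = getline(&cur, &cur_cap, in)) != -1) {
        if (cur_len > 0 && cur[cur_len - 1] == '\n')
            cur[--cur_len] = '\0';   /* compare without delimiter */

        if (prev_len == cur_len
            && memcmp(prev, cur, (size_t)cur_len) == 0) {
            repeats++;               /* same group: just count it */
            continue;
        }
        if (prev_len != -1)          /* group ended: output it */
            emit(out, prev, (size_t)prev_len, repeats + 1, show_counts);

        /* Swap buffers so `cur` becomes the new reference line. */
        char *tmp = prev; prev = cur; cur = tmp;
        size_t tmp_cap = prev_cap; prev_cap = cur_cap; cur_cap = tmp_cap;
        prev_len = cur_len;
        repeats = 0;
    }

    int status = ferror(in) ? -1 : 0;
    if (status == 0 && prev_len != -1) /* flush the final group */
        emit(out, prev, (size_t)prev_len, repeats + 1, show_counts);

    free(prev);                      /* free the line buffers */
    free(cur);
    return status;
}
```

Calling uniq_stream(stdin, stdout, false) would mimic the default behavior on sorted input. Holding the previous line back until the group ends is what makes counting possible, and it is the reason this path does more work than a plain line-by-line filter.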
Failure cases:
- Too many repeating lines
- Unable to open or close I/O files
- Unable to read from input source
All failures at this stage output an error message to STDERR and return without displaying usage help.
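As an illustration of the "too many repeating lines" case, a guarded counter might look like the sketch below. The helper names bump_repeat_count() and format_failure(), the choice of uintmax_t, and the "uniq: " message prefix are assumptions for this example; the real program reports failures through die() and error().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Increment *count unless that would overflow.
   Returns false (and leaves *count unchanged) at the limit. */
static bool bump_repeat_count(uintmax_t *count)
{
    assert(count != NULL);
    if (*count == UINTMAX_MAX)
        return false;
    ++*count;
    return true;
}

/* Format a one-line failure message into buf. When errnum is non-zero
   the system's description of the error is appended. */
static int format_failure(char *buf, size_t cap,
                          const char *what, int errnum)
{
    if (errnum != 0)
        return snprintf(buf, cap, "uniq: %s: %s\n",
                        what, strerror(errnum));
    return snprintf(buf, cap, "uniq: %s\n", what);
}
```

A caller would write the formatted message to stderr and exit with a non-zero status, without printing the usage text.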