Decoded: ptx (coreutils) – MaiZure's Projects

[Back to Project Main Page]

Note: This page explores the design of command-line utilities. It is not a user guide.
[GNU Manual] [No POSIX requirement] [Linux man] [FreeBSD man]

Logical flow of ptx command (coreutils)

Summary

ptx - produce permuted indexes

[Source] [Code Walkthrough]

Lines of code: 2154
Principal syscalls: fopen(), fread() (both via read_file())
Support syscall: fstat() (also via read_file())
Options: 35 (17 short, 18 long)

Descended from ptx in Version 2 UNIX (1972)
Added to Textutils in August 1998 [First version]
Number of revisions: 114

The ptx utility realized an early computing goal to automate a labor intensive task: creating permuted indexes from a text source. The original use-case was to build the index for the physical UNIX manuals based on the man pages.

Helpers:

compare_occurs() - Compares OCCURS to return which goes first
compare_words() - Compares WORD locations
compile_regex() - Compiles a regular expression
copy_unescaped_string() - Processes a string and evaluates escapes
define_all_fields() - Computes position and length of fields in OCCURS
digest_break_file() - Processes the file of break characters
digest_word_file() - Processes the file of words to ignore
find_occurs_in_text() - Creates OCCURS structures for each input WORD
fix_output_parameters() - Sets the output parameters according to the user's command-line options
generate_all_output() - Prints data lines from the occurs_table[]
initialize_regex() - Initializes pattern match tables
matching_error() - Fails out of regular expression matching
output_one_dumb_line() - Outputs a line as-is with trailing newline char
output_one_roff_line() - Outputs a line in [n|t]roff format (leading typesets)
output_one_tex_line() - Outputs a line in TeX notation: \typeset {...}
print_field() - Prints a BLOCK of text
print_spaces() - Prints a given number of spaces
search_table() - Binary search for a WORD in a WORD_TABLE
sort_found_occurs() - Sorts the global occurs_table[]
swallow_file_in_memory() - Loads a file into contiguous memory and collects statistics

External non-standard helpers:

re_compile_pattern() - Compiles a regular expression to a pattern buffer

Setup

Several global structures, flags, and other variables are needed for collecting and organizing input text. These include:

Structs:

struct BLOCK - An arbitrary space in memory (*start, *end)
struct regex_data - An expression and a compiled pattern
struct OCCURS - The context of a keyword
struct WORD - A single word as a start pointer and size
struct WORD_TABLE - An array of words, including a start, total size, and used size

OCCURS structures are managed through a global table pointer, *occurs_table[]

WORD_TABLEs have two global references in the ignore_table and only_table
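
A minimal sketch of how these structures could be declared, based only on the descriptions above. The field names, the OCCURS members, and the two helper functions (block_length(), word_in()) are assumptions for illustration and are not copied from the ptx source:

```c
#include <stddef.h>
#include <string.h>

/* An arbitrary span of memory, covering [start, end). */
typedef struct {
    const char *start;
    const char *end;
} BLOCK;

/* A single word: pointer to its first byte plus a length.
   The word is NOT NUL-terminated; it points into the loaded text. */
typedef struct {
    const char *start;
    ptrdiff_t size;
} WORD;

/* A growable array of words. */
typedef struct {
    WORD *start;     /* first element of the array */
    size_t alloc;    /* total slots allocated */
    size_t length;   /* slots currently in use */
} WORD_TABLE;

/* The context of one keyword occurrence (members are illustrative). */
typedef struct {
    WORD key;          /* the keyword itself */
    ptrdiff_t left;    /* distance from key back to start of context */
    ptrdiff_t right;   /* distance from key forward to end of context */
    int file_index;    /* which input file the keyword came from */
} OCCURS;

/* Hypothetical helper: number of bytes covered by a BLOCK. */
static ptrdiff_t block_length(BLOCK b)
{
    return b.end - b.start;
}

/* Hypothetical helper: build a WORD that points into existing text. */
static WORD word_in(const char *text, size_t offset, size_t len)
{
    WORD w = { text + offset, (ptrdiff_t) len };
    return w;
}
```

Because a WORD only points into the text already held in memory, no keyword needs to be copied while the index is built.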

Flags:

auto_reference - Flag to track file:line output
gnu_extensions - Flag to enable extended GNU features
ignore_case - Flag to fold lower case into upper case (ignore case differences) during sorting
input_reference - Flag to process leading line text for context
right_reference - Flag to force reference text to the end of a line

main() initializes a few variables used throughout the utility:

optchar - The next argument for processing
file_index - Index to input_file_name[] for multiple input files

Parsing

Parsing for ptx has a few more steps than most utilities because some execution parameters need to be pulled from other files. Like other utilities, we begin by reading the command-line options to answer questions:

What are the input sources?
What is the desired output format?
Are there external parameters in other files (ignore/only lists, etc.)?
Any special character set considerations, such as case handling?

Parsing failures

These failure cases are explicitly checked:

Invalid gap widths or line widths
Extra arguments
Unable to open any parameter files
Using unknown options
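
As an illustration of the first failure case, here is a minimal sketch of validating a width argument with the standard strtol(). The function name parse_width() and the exact acceptance rule (a positive decimal integer that fits in an int) are assumptions; the real utility may use different helpers and limits:

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Parse a positive decimal width from arg.
   Returns 0 and stores the value in *out on success, -1 if invalid. */
static int parse_width(const char *arg, int *out)
{
    char *end;
    long value;

    errno = 0;
    value = strtol(arg, &end, 10);

    /* Reject empty input and trailing junk such as "12abc". */
    if (end == arg || *end != '\0')
        return -1;

    /* Reject overflow, zero, negatives, and values too large for int. */
    if (errno == ERANGE || value <= 0 || value > INT_MAX)
        return -1;

    *out = (int) value;
    return 0;
}
```

A caller would report an "invalid width" style error and exit when this returns -1.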

Execution

ptx executes in three stages: analyzing data, sorting data, and outputting data.

Analyzing Data

The goal is to populate the global occurrence table after reading all input data. Along the way, we apply the inclusion/exclusion filters and regex patterns to focus on relevant data. For each file:

Read the file into memory ( swallow_file_in_memory() )
Check each line against the given regular expressions
Check each word and its context
Verify that the word isn't ignored or is explicitly required
Build an occurrence entry in the global table
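
The word-filtering steps can be sketched as follows. This is a simplified illustration, not the ptx implementation: words are detected with isalpha() instead of regular expressions, only an ignore list is handled (no only list), and the names Word, word_cmp(), in_table(), and find_keywords() are hypothetical. The ignore table must already be sorted so that a binary search works, in the spirit of search_table():

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A word as a pointer into the text plus a length (not NUL-terminated). */
typedef struct {
    const char *start;
    size_t size;
} Word;

/* Byte-wise comparison; on a common prefix the shorter word sorts first. */
static int word_cmp(Word a, Word b)
{
    size_t n = a.size < b.size ? a.size : b.size;
    int c = memcmp(a.start, b.start, n);
    if (c != 0)
        return c;
    return (a.size > b.size) - (a.size < b.size);
}

/* Binary search of a sorted table; returns 1 if w is present, else 0. */
static int in_table(Word w, const Word *table, size_t count)
{
    size_t lo = 0, hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int c = word_cmp(w, table[mid]);
        if (c == 0)
            return 1;
        if (c < 0)
            hi = mid;
        else
            lo = mid + 1;
    }
    return 0;
}

/* Scan text for alphabetic runs and store every word that is not in
   the ignore table. Returns the number of keywords stored in out. */
static size_t find_keywords(const char *text,
                            const Word *ignore, size_t n_ignore,
                            Word *out, size_t max_out)
{
    size_t found = 0;
    const char *p = text;

    while (*p != '\0' && found < max_out) {
        /* Skip separators, then consume one run of letters. */
        while (*p != '\0' && !isalpha((unsigned char) *p))
            p++;
        const char *begin = p;
        while (isalpha((unsigned char) *p))
            p++;

        if (p > begin) {
            Word w = { begin, (size_t) (p - begin) };
            if (!in_table(w, ignore, n_ignore))
                out[found++] = w;
        }
    }
    return found;
}
```

Each stored Word still points into the original text, so the surrounding context can be recovered later by walking left and right from the keyword.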

Sort Data

qsort() the occurrence table using the provided compare_occurs() comparator. The basis is the lexicographic ordering of keywords for each occurrence.
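
A minimal sketch of this sorting step, assuming a simplified occurrence record. The Occurrence type, the fold_case variable, and the tie-break on input position are assumptions for illustration; only the use of qsort() with a keyword comparator comes from the description above:

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for an occurrence entry. */
typedef struct {
    const char *key;   /* start of the keyword, not NUL-terminated */
    size_t size;       /* keyword length in bytes */
    size_t position;   /* hypothetical: offset in the input, for ties */
} Occurrence;

/* When non-zero, letters are folded to upper case before comparing. */
static int fold_case = 1;

/* qsort() comparator: lexicographic order of keywords, then position. */
static int compare_occurrences(const void *pa, const void *pb)
{
    const Occurrence *a = pa;
    const Occurrence *b = pb;
    size_t n = a->size < b->size ? a->size : b->size;

    for (size_t i = 0; i < n; i++) {
        unsigned char ca = (unsigned char) a->key[i];
        unsigned char cb = (unsigned char) b->key[i];
        if (fold_case) {
            ca = (unsigned char) toupper(ca);
            cb = (unsigned char) toupper(cb);
        }
        if (ca != cb)
            return ca < cb ? -1 : 1;
    }

    /* Common prefix: the shorter keyword sorts first. */
    if (a->size != b->size)
        return a->size < b->size ? -1 : 1;

    /* Identical keywords: keep them in input order. */
    return (a->position > b->position) - (a->position < b->position);
}

/* Sort the whole table in place. */
static void sort_occurrences(Occurrence *table, size_t count)
{
    qsort(table, count, sizeof *table, compare_occurrences);
}
```

The explicit tie-break matters because qsort() is not guaranteed to be stable; without it, equal keywords could come out in any order.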

Output Data

The output phase is dictated by the three supported formats: 'dumb' terminal, TeX, and troff. Before the output is processed, several parameters are defined for all formats:

file_index - The input file index
line_ordinal - The line count remaining for the file
reference_width - Holds the length of the reference string
character - A character value
*cursor - Pointer within a string

Many of the globals already mentioned are computed at this stage in preparation for output generation.

Generating output means iterating over the sorted occurs_table[]. For each entry:

Compute the position and length of each field: Reference, tail, before, keyafter, head.
Output the appropriate formats
- Dumb terminal - A single line with all fields in order
- TeX - line begins with '\' and fields are enclosed in curly braces
- troff - line begins with '.' and fields are quoted
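
The three line shapes can be sketched as below. This is a simplified illustration: the Fields type, format_line(), and the macro parameter are hypothetical names, the reference field is omitted, the dumb-terminal branch uses fixed gaps where the real utility pads fields to computed widths, and the exact field layout of the TeX and troff lines should be checked against the GNU manual:

```c
#include <stdio.h>

typedef enum { FORMAT_DUMB, FORMAT_ROFF, FORMAT_TEX } Format;

/* The text fields of one index line (reference omitted for brevity). */
typedef struct {
    const char *tail;
    const char *before;
    const char *keyafter;
    const char *head;
} Fields;

/* Write one index line into buf in the requested format.
   Returns the value of snprintf(): the length the line needs. */
static int format_line(char *buf, size_t size, Format fmt,
                       const char *macro, const Fields *f)
{
    switch (fmt) {
    case FORMAT_ROFF:
        /* Leading '.', macro name, then each field in double quotes. */
        return snprintf(buf, size, ".%s \"%s\" \"%s\" \"%s\" \"%s\"\n",
                        macro, f->tail, f->before, f->keyafter, f->head);
    case FORMAT_TEX:
        /* Leading '\', macro name, then each field in curly braces. */
        return snprintf(buf, size, "\\%s {%s}{%s}{%s}{%s}\n",
                        macro, f->tail, f->before, f->keyafter, f->head);
    case FORMAT_DUMB:
    default:
        /* All fields in order on one line, wider gap before the keyword. */
        return snprintf(buf, size, "%s %s  %s %s\n",
                        f->tail, f->before, f->keyafter, f->head);
    }
}
```

Formatting into a buffer instead of printing directly keeps the sketch easy to test; a caller would pass the result to fputs().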

Execution failure cases concern regular expressions:

A regular expression match or compile fails
A zero-length regular expression match
