[GNU Manual] [No POSIX requirement] [Linux man] [FreeBSD man]
Summary
fmt - reformat paragraph text
Lines of code: 1030
Principal syscall: fwrite()
Support syscall: fadvise()
Options: 16 (7 short, 9 long)
Added to Textutils in September 1994 [First version]
Number of revisions: 173 Helpers:
base_cost()
- Set the basic cost of the input wordcheck_punctuation()
- Compute word punctuationcopy_rest()
- Push a non-conformant line to outputflush_paragraph()
- Output the partial paragraph now due to size contraintsfmt()
- The main format procedure for get, format, and outputfmt_paragraph()
- Computes the current paragraph formatget_paragraph()
- Retrieves an entire paragraph, by line and wordget_line()
- Gets a line from the inputget_prefix()
- Gets a prefix from the inputget_space()
- Gets a blank from the inputline_cost()
- Computes the line costput_paragraph()
- Outputs all the words in a paragraph by lineput_line()
- Outputs all words in a line to include spacesput_word()
- Outputs a single word and recalculates column positionput_space()
- Outputs the appropriate spacesame_para()
- Returns true if a line could belong to the current paragraphset_other_indent()
- Updates the other indent typeset_prefix()
- Updates the prefix type
die()
- Exit with mandatory non-zero error and message to stderrerror()
- Outputs error message to standard error with possible process terminationfull_write()
- Wrapper forwrite()
that retries on interruptgetpagesize()
- Gets the memory page size for the systemio_blksize()
- Gets the optimal block sizeptr_align()
- Ensures that returned pointer is memory alignedsafe_read()
- Reads with retry on interrupt
Setup
Before looking at main()
, fmt.c establishes the basis for computing paragraph formats. These include a goal (target line width) and cost/bonus functions for deviating from the goal. Specifically:
- Defines
SHORT_COST
macro to set the cost of missing the goal - Defines
RAGGED_COST
macro to set the cost of not matching neighboring lines - Defines
LINE_COST
macro to set the basic cost of a line - Defines
WIDOW_COST
macro to set a special cost for breaking a line after the first word - Defines
ORPHAN_COST
macro to set a special cost for breaking a line before the last word - Defines
SENTENCE_BONUS
macro to set a bonus for breaking at the end of a sentence - Defines
NOBREAK_COST
macro sets a large penalty for breaking on a period thats not end of line - Defines
PAREN_BONUS
macro sets a cost for breaking on an open paren - Defines
PUNCT_BONUS
macro sets a cost for breaking on any punctuation - Defines
LINE_CREDIT
macro sets a bonus for breaking just after a long paragraph
These costs make sense when applied to a structure that describes a word as part of a paragraph. The Word struct looks like this:
*text
contains the word textlength
is the length of the wordspace
counts the number of subsequent spacesparen
bit indicates a leading parenthesisperiod
bit indicates ending in a final punctuationpunct
bit indicates ends with a non-terminal punctuationfinal
bit indicates the end of a sentenceline_length
the length of the best line starting herebest_cost
the best case cost of starting a paragraph with this wordnext_break
pointer indicates the Word of the best break
A number of globals support the format optimizer including:
- The
max_width
integer for the maximum line width - The
goal_width
integer is the line length target - The
word[]
array holding the Words of the paragraph under analysis - The
parabuf[]
array holding text of the constructed paragraph
Other globals support I/O buffering, including:
- The
in_column
integer is the column of the most recently read character - The
out_column
integer is the column of the next character to write - The
word_limit
pointer indicates the first invalid word in the array
The above global lists are not exhaustive.
Finally, main() defines a few variables to kick off formatting:
goal_width_option
- The user provided line length goalmax_width_option
- The user provided max widthoptchar
- The option we're currently processingok
- The return value of the utility
Parsing kicks off with the short options passed as a string literal:
"0123456789cstuw:p:g:"
Parsing
Parsing for fmt digs in to the user's ideal paragraph, including:
- How wide are the lines?
- Are there special indentation cases?
- Should spacing be uniform?
- Should only certain lines be formatted?
Parsing failures
There are a few parsing failures to note:
- Numeric width must be the first option
- Width could be too large
- Unknown options are used
Execution
On the surface, fmt is simple:
- Open stdin or file stream for input
- Get a paragraph of data
- Compute the optimal format at the paragraph, line, and word level
- Print the paragraph
- Repeat for all paragraphs in the file
Computing the 'optimal' paragraph deserves more explanation
Optimal Paragraph
Paragraphs are a chain of words with the basic constraint that they must remain in the same order. We must determine the 'best' place to set the line breaks. Parsing provided hints about user expectations for line length, indentations, and spacing. fmt embedded the basic idea of 'cost' associated with line features. Here is the approach for the default format:
- A valid paragraph of data has been retrieved via
get_paragraph()
- Loop through each word backwards from the end of the paragraph:
- Assume cost of a break is maximum
- Scan forward from the word to determine the best cost of a break
- Update any improved costs and the new breakpoint and length
- Consider all words ahead of the current until the last word or max_width reached
- Move back another word and repeat
- Now go through each breakpoint and output the words between all of them. Including spaces and indentation as necessary
There is only one explicit failure case during execution: If we cannot open the input file