TextExtract

Summary: search, grep, and extract text from other pages or groups with search terms and regular expressions, using a search form or markup expression.
Version: 2014-02-22
Prerequisites: PmWiki 2.2.56 (compatible with PHP 5.5)
Status: stable
Maintainer: HansB
Users: +5
Download: extract.php
Discussion: TextExtract-Talk

Questions answered by this recipe

How can I do searches showing query results within context and not just a list of page links?
I'd like to offer more advanced search options like case-sensitive, whole words, regular expressions in addition to the standard search.
How can I show content from different pages if the content matches specific query terms?

Description

Text Extract provides a search form and a markup expression for extracting text lines or paragraphs from multiple pages, using search terms (including regular expressions), and wildcard pagename patterns.

Installation:

Download extract.php, copy it to the cookbook folder, and install it in config.php with:

include_once("$FarmD/cookbook/extract.php");

Usage:

Markup syntax:

As search form:

(:extract <parameters> :)
(:searchresults:)

As markup expression:

{(extract Term1 [Term2] [-Term3] ... name=PageName group=GroupName
        [keyword=value] [keyword=value] ...)}

With (:pagelist:):

(:pagelist fmt=extract Term1 [Term2] [-Term3] 
         <pagelist + textextract parameters>:)

With (:searchbox:):

(:searchbox fmt=extract <parameters>:)
(:searchresults:)

With the PowerTools {(pagelist)} markup expression:

{(pagelist fmt=extract <parameters>)}

Arguments:
  • Text(Pattern) - search terms: display lines containing the Text string or matching the regular expression TextPattern. All single arguments are treated as search terms. Results are returned from pages that match all terms with no prefix or a + (plus) prefix and that match none of the terms prefixed with - (minus).
Options:
  • group=GroupName - source pages from group GroupName. Allowed are Wiki wildcards '*' and '?'.
  • name=PageName - source pages from PageName or Group.PageName. Allowed are Wiki wildcards '*' and '?'. You can specify any number of pagenames, comma-separated, and each could contain wiki wildcards. Page names with a - (minus) prefix will be excluded. Note that wiki wildcard pagename patterns are not the same as regex patterns!
  • name=PageName#section - the text from anchored #section will be taken as source. Allowed are Wiki wildcards '*' and '?' in PageName, but not in #section.
  • name=PageName#sectA#sectB - the text from anchor #sectA to anchor #sectB will be taken as source. Allowed are Wiki wildcards '*' and '?' in PageName.
  • page=GROUP.NAME - full source pagename, can include wildcards * and ?
    As a searchform parameter this will hide the page field.
  • defaultpage=GROUP.NAME - search form parameter to put initial value into page field.
  • pattern=SEARCHTERM - search form parameter for search term which will hide the search field.
  • cut=PATTERN - do not display rows (lines or paragraphs according to unit=) matching PATTERN.
  • count=n - include only n number of pages in the output.
  • lines=n - the text source is the first n lines of a page or page section.
  • lines=-n - the text source is the last n lines of a page or page section.
  • lines=n..m - the text source is the lines from line n to line m (including line m) of a page or page section.
  • lines=n.. - the text source is the lines from line n till end of a page or page section.
  • snip=PATTERN - do not display text matching PATTERN, remove it from the line
  • highlight=COLOR - highlight matches using COLOR for background, default is 'yellow' background.
  • highlight=bold - bold (strong) text highlight.
  • highlight=none - do not use match highlighting.
  • unit=line - single text row (line) is shown.
  • unit=para - default: whole paragraph is shown (separated by empty lines or headings)
  • unit=page - the whole page text is shown (or the part of a source page specified by PageName#section or PageName#sectA#sectB).
  • markup=cut - default: directives and other invisible markup will be removed and ignored.
  • markup=code - lines including directives will be shown as source code.
  • markup=text - show only visible text without markup rendering, shortened by default.
  • markup=on - directives will be active, but only if pattern is '.' or unit=page or unit=para.
  • markup=source - display results as page source code.
  • case=1 - do a case-sensitive search. Default is 0 (case-insensitive search).
  • word=1 - match whole words only. Default is 0.
  • regex=1 - treat search terms as regular expressions (preg). Default is 0 (treat terms as text strings).
  • header=STRING - display STRING on first line.
  • header=count - display results counter on first line.
  • header=full - display extended result count on first line plus a footer to mark end.
  • footer=STRING - display STRING at the end as a footer.
  • phead=link - display page link above extract.
  • phead=STRING - display STRING above extract (replaces the deprecated prefix= option).
  • phead=linkmod - display line with page link and 'modified by' link and modified time above extract
  • pfoot=STRING - display STRING on line below text page extract
  • title=STRING - display STRING on left side in full header, default is 'Text Extract'.
  • timer=1 - display search time in full header.
  • linenum=1, matchnum=1, pagenum=1 - display line-, match-, page-numbers with default color setting.
  • linenum=COLOR - display line numbers in color given with COLOR (color code or recognised name).
  • matchnum=COLOR - display match numbers in color given with COLOR (color code or recognised name).
  • pagenum=COLOR - display page numbers in color given with COLOR (color code or recognised name).
  • linewrap=0 - prohibit automatic linewrapping of preformatted text. Default is 1 (linewrap true).
  • shorten=1 - shorten (truncate) output to 5 words left and 10 words right of terms.
  • shorten=7 (example) - output is shortened to 7 words left of the highlighted term, and 14 (double) words right of it.
  • lwords=n - markup=text output is shortened to n words left of term.
  • rwords=m - markup=text output is shortened to m words right of term.
  • linktext=COLOR - links in markup=text output shown with COLOR. Default is blue.
  • ellipsis=STRING - shortened markup=text output displayed with STRING at shortened end. Default is … (ellipsis)
  • textlinks=1 - if set links will be rendered as text only. This is the default for markup=text, but not for markup='code', 'cut' and 'on'.
  • order=results - pages will be displayed in the order of match results per page, pages with most matches first.
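
As a rough illustration of how several of these options combine, the two expressions below use only parameters from the list above. The search term 'skins', the group PmWiki, and the page PmWiki.Skins are placeholders chosen for the example, not values the recipe requires.

```
{(extract skins group=PmWiki name=-*RecentChanges unit=line linenum=1 header=count phead=link count=10)}

{(extract skins name=PmWiki.Skins lines=1..40 markup=text shorten=7 highlight=bold)}
```

Going by the option descriptions, the first expression should list matching lines with line numbers from at most 10 pages of the group, and the second should search only the first 40 lines of one page and show plain text cut to 7 words left and 14 words right of each match.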

Text(Pattern)

By default, search terms are treated as plain strings. With regex=1 set, or the Regular expression box ticked, each term is treated as a Perl-compatible regular expression.

'cat' will look for all occurrences of 'cat'. The default is a case-insensitive search, so any occurrence of 'Cat', 'CAT', 'cAt' etc. will also be returned.

  • 'cat dog' will look for string 'cat' AND 'dog', both strings need to be present on the source page.
  • '"cat and mouse" dog' will look for string 'cat and mouse' AND string 'dog'.
  • To look for matches of 'cat' OR 'dog' use 'cat|dog' and check 'Regular expression'.
  • To look for matches of 'cat' but NOT 'dog' use 'cat -dog'.
  • To match the word 'cat' and not 'catastrophe' tick the Match whole word box, or use parameter word=1 in the markup expression.
  • When using a regex search be aware that some characters are used as special control characters: the dot ., the star *, the question mark ?, the pipe |, the dollar $, and brackets. To use any of these as a literal character, escape it with a backslash.
  • The regex dot . represents any character, so a single dot as the text pattern matches everything and returns the whole page content. A single dot works this way in default (non-regex) searches too.
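
The term rules above could be written as markup expressions roughly as follows. This is a sketch only: the name pattern Main.* is a placeholder, and quoting the regex term is an assumption made to keep the pipe character together as one argument.

```
{(extract cat dog name=Main.*)}
{(extract "cat and mouse" -dog name=Main.*)}
{(extract 'cat|dog' name=Main.* regex=1)}
{(extract cat name=Main.* word=1 case=1)}
```

In order, these ask for pages containing both 'cat' and 'dog', pages containing the phrase 'cat and mouse' but not 'dog', matches of either 'cat' or 'dog', and the whole word 'cat' in exactly that lowercase spelling.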

To specifically exclude lines matching some text(pattern) put it into the cut= option. With the snip= option on the other hand you can prevent certain words or phrases being shown in any matching lines, but still get the line. Input in cut= and snip= is treated as a regular expression pattern.
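
A minimal sketch of the difference between the two options, with Main.* as a placeholder page pattern:

```
{(extract cat name=Main.* unit=line cut=dog)}
{(extract cat name=Main.* unit=line snip=dog)}
{(extract cat name=Main.* unit=line cut='dog|mouse')}
```

The first expression should drop every matching line that also contains 'dog'. The second should keep those lines but remove the text 'dog' from them. The third relies on cut= being read as a regular expression and excludes lines containing either word; the quotes around the pattern are an assumption.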

PageName source lists

Page names or group names can be specified with name= and group= parameters, and can include wildcard characters star * and question mark ?, ? representing any valid single character, and * representing any string of valid characters. A page name with a minus - or ! in front will be excluded from the pages to be searched.

So name=Test* means all pages beginning with 'Test'; group=PmWiki means all pages in group PmWiki; and name=-*RecentChanges excludes RecentChanges, AllRecentChanges, and similar pages.

If you use full page names like Group.Name note that the wildcard pagename pattern is not a regex pattern, and a dot here means just the separator between the Group and PageName component of a page name! When several expressions are given, they will be combined logically as AND conditions to arrive at a valid source pagelist.

Comma-separated lists of page names can also be given.
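
For example, source lists could be narrowed as sketched below. All page and group names here are placeholders, and the search term 'cat' stands in for any term.

```
{(extract cat name=Test*)}
{(extract cat group=PmWiki name=-*RecentChanges)}
{(extract cat name=Main.HomePage,Cookbook.Text*,-Cookbook.*-Talk)}
```

The last line combines a full page name, a wildcard pattern, and an exclusion in one comma-separated list, as described for the name= option.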

Instead of using all of a page as the source for the text extract, one can specify an anchor defined page section as source with Group.PageName#anchor, or a section between two anchors with Group.PageName#anchor1#anchor2. Within the anchor section part you cannot use wiki wildcards, but if the name contains wildcards, then pages matching the name will be searched, and results only taken from the specified anchor section. You cannot use several names with different anchor sections!
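
A sketch of section-limited sources, assuming the pages contain anchors named #intro and #details (hypothetical anchor names):

```
{(extract cat name=Main.HomePage#intro unit=line)}
{(extract cat name=Main.*#intro#details unit=line)}
{(extract . name=Main.HomePage#intro unit=page)}
```

The second expression searches every page in Main but takes results only from the text between the two anchors. The third uses the single-dot term with unit=page, which by the descriptions above should return the whole #intro section rather than individual matches.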

Search form markup

Markup (:extract:) will produce a search form with a field for entering search terms and a field for entering a page name or pagename with wildcards.
Markup (:searchresults:) is used as marker for showing the results.

Note that entering 'Main/apple' in the standard PmWiki searchbox searches for 'apple' in pages of group 'Main', whereas TextExtract searches for the string 'Main/apple' in the pages or groups specified in the page name field.

Default parameters for markup (:extract:)

  • size=30
  • button='Search'
  • searchlabel='Search for'
  • pageslabel='On pages'
  • wordlabel='Match whole word'
  • caselabel='Match case'
  • regexlabel='Regular expression'
  • header='full'
  • phead='link'

Other optional parameters

  • regex=1 - this will show a checkbox for giving the option to enter a regular expression as search term (regular expression search).
  • Use group= and name= parameters as with pagelist and search markup.
  • page=GROUP.NAME (you can use wildcards * and ?) - this will hide the pagename field of the form, and pass on 'PageName' as source page parameter.
  • pattern=SEARCHTERM - this will hide the search field, and the search always uses the term thus set. Setting both the page= and pattern= options gives a form with just the submit button, useful for letting a user get information with a preprogrammed search.
  • defaultpage=GROUP.NAME to set initial value for page field.
  • All the other keyword=value options from the {(extract ....)} markup expression can be used.
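
Two sketches based on the parameters above. The search term 'skin', the page pattern PmWiki.*, and the button label are placeholders.

```
(:extract page=PmWiki.* pattern=skin button='Show skin notes' unit=line:)
(:searchresults:)

(:extract defaultpage=PmWiki.* regex=1:)
(:searchresults:)
```

The first form should show only a submit button, since both page= and pattern= are preset. The second keeps both input fields, prefills the page field, and offers the regular expression checkbox.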

Notes on page field input:
A single * will search all pages in group (if group= parameter is set), or all pages.
A group name plus an ending / will search pages in that group.
Names with wildcards will search corresponding pages, narrowed down by any page options, like group=.

Examples:

Default Search Form showing fields for search term and for page pattern input.

(:extract:)
(:searchresults:)

Search PmWiki Documentation (by paragraph, ignore hidden markup)

(:extract page=* group=PmWiki name=-RecentChanges regex=1:)
(:searchresults:)

Search PmWiki Documentation (by line, with code)

(:extract page=* group=PmWiki name=-RecentChanges markup=code unit=line regex=1:)
(:searchresults:)

Notes

Styling

You can change styling of results via css:

  • The results are wrapped in a div with class 'te-results'.
  • The header div has class 'te-header'.
  • The footer div has class 'te-footer'.
  • Each page link subheader div has class 'te-pageheader'.

Template variables

You can use some template variables within the values set with parameters header= footer= phead= pfoot=.
Useful for header=

  • {$$time} - search time
  • {$$pattern} - search term(s) from input.
  • {$$listcnt} - number of pages in source page list.
  • {$$pagecnt} - number of results pages.
  • {$$matchcnt} - number of matches (results).
  • {$$rowcnt} - number of result rows.

Useful for phead=

  • {$$pagenum} - consecutive number of source page.
  • {$$source} - source page name. Use as link like [[{$$source}]]
  • {$$pmatchnum} - number of matches on the source page.

Example, imitating header=full (remove line break):

header="%rfloat%{$$matchcnt} results from {$$pagecnt} pages,
 {$$listcnt} pages searched in {$$time} %%[+ '''$[Text Extract]''' +]"
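
The per-page variables can be used in the same way. The following is a sketch rather than a tested setup: the page pattern Main.* is a placeholder, and using the count variables in footer= is an assumption based on the statement that template variables work in all four parameters.

```
(:extract page=Main.* phead="[[{$$source}]] ({$$pmatchnum} matches)" footer="{$$matchcnt} matches on {$$pagecnt} of {$$listcnt} pages":)
(:searchresults:)
```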

PmWiki Search Form and pagelist directives

It is possible to use TextExtract with the PmWiki (:searchbox:) search form or (:pagelist:) directives. This may be useful in situations where you need pagelist options that TextExtract does not supply.

Add fmt=extract and any other TextExtract options within the markup.
Example 1:

Search the PmWiki Documentation

(:searchbox group=PmWiki fmt=extract:)
(:searchresults:)

Example 2:

(:pagelist Search Terms fmt=extract header=full phead=link:)
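
As a further sketch, a pagelist-only option can be mixed with TextExtract options. Here order=-time is a standard pagelist sort option that is not in the TextExtract option list; that it behaves as usual alongside fmt=extract is an assumption, and 'cat' and PmWiki are placeholders.

```
(:pagelist cat group=PmWiki order=-time count=5 fmt=extract unit=line phead=linkmod:)
```

This is meant to show matching lines from the five most recently changed pages in the group, each under a page link with author and modification time.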

Custom Search Form

This form is built with PmWiki Forms markup, using action='search' and fmt='extract' to combine the PmWiki pagelist with TextExtract.

Example search form, not enabled here!

(:input form :)
(:input default request=1:) 
(:input default name 'PmWiki.*':)
||width=30em
||Search for ||(:input text q:) (:input submit post "Search":)||
||On pages ||(:input text name :) ||
|| ||(:input checkbox word 1:) Match whole word ||
|| ||(:input checkbox case 1:) Match case ||
|| ||(:input checkbox regex 1:) Regular expression ||
(:input hidden unit line:)
(:input hidden markup code:)
(:input hidden header full:)
(:input hidden title 'Search Results:':)
(:input hidden phead link:)
(:input hidden matchnum 1:)
(:input hidden timer 1:)
(:input hidden action search:)
(:input hidden fmt extract:)
(:input end:)
(:searchresults:) 

Release Notes

  • 2014-02-22: Updated markup definitions for PHP 5.5 compatibility.
  • 2009-10-15: Improved order=results sorting.
  • 2009-10-15: Added order=results to show pages with highest number of matches first; fixed checkboxes to retain previous setting.
  • 2009-10-02: Fixed count= option for page names with anchored section.
  • 2009-09-28: Fixed quoted parameter handling in {(extract..)}. Fixed result count and output when snip= removes searchterm. Added single dot (.) input to return all page text (not just for regex=1).
  • 2009-09-26: Added config variable $TEModeDefaults for setting markup mode specific default options. Made option shorten= available for all markup modes. Changed activelinks=0 to textlinks=1. Modified handling of vertical spacing, removed custom (:spacer:) markup. Fixed bug handling input of '/' when regex=1.
  • 2009-09-25a: Simplified code for handling of input 'foo/bar' and '/'.
  • 2009-09-25: Added activelinks=0 option (default for markup=text). Modified handling of input 'foo/bar' and '/'. Fixed markup=source output. Modified cleanup of directives. Added stripmagic() to input strings.
  • 2009-09-23: Added markup=text option, including truncating by words.
  • 2009-09-22: Fixed bug with handling escape markup and highlighting.
  • 2009-09-21: Please adjust your markup! I normalised the input syntax to correspond with pagelist syntax! Added #section for use with wildcard PageName pattern; removed action=extract; source pagelist is now always generated via MakePageList(); deprecated (:extractresults:) (use (:searchresults:) instead); deprecated prefix= option (use phead= instead); deprecated page2= option (use name= instead).
  • 2009-09-18: improved timer for more accuracy; corrected {$$listcnt} for use with fmt=extract.
  • 2009-09-17a: added FPL function for pagelist fmt=extract, no custom pagelist template needed when using fmt=extract in pagelist or searchbox.
  • 2009-09-17: Integrated use of (:pagelist ..... fmt=#extract:). Fixed some vertical spacing bugs.
  • 2009-09-16a: Fixed bug in form markup causing inline markup in parameters to be rendered.
  • 2009-09-16: added {$$pattern} template variable; fixed some minor bugs.
  • 2009-09-15: modified search term input to add inclusive and exclusive term options, similar to PmWiki searchbox input; split regex from normal search input; added Regular expression checkbox; added Match whole word checkbox; added template variables for header, footer, phead parameters; changed prefix and suffix to phead and pfoot.
  • 2009-09-07: added wrapper div and style classes.
  • 2009-09-06: large speed optimization; more argument tweaking.
  • 2009-09-05: tweaked argument handling.
  • 2009-09-04: reworked the way options are combined for making pagelist; fixed some form bugs; added 'defaultpage' form parameter; added 'pattern' as form option; fixed 'suffix' bug; silently drop pages for which no read permission exists; escaped markup expressions from output.
  • 2009-09-03: fixed bug in line numbers; expanded line numbers; changed search form to use POST and retain input values; improved markup cleaning for better display; changed some defaults.
  • 2009-09-01: Complete code overhaul for better text processing and maintenance. Added options markup=source, (match) numbers, linewrap, perpagenumbers, highlight styles.
  • 2008-03-07: Added unit=para option to show whole paragraphs, separated by empty lines or headings.
  • 2008-02-12: Changed extractresult markup so output does not get wrapped in <p>..</p> tags.
  • 2008-02-11: Added options group= name= for source pages (same as PageList directive). Improved handling of input from pagelist markup expression (PowerTools). Added option count= and prefix=linkmod.
  • 2008-01-31: Added markup=on option for processing markup directives when pattern is '.' or unit=page. Fixed wrong line handling when unit=page. Added cleanup of form input options. Added qualifying of relative links.
  • 2008-01-29: Added simple filter to suppress bad pattern input by disallowing input of single regex special characters. Added capability to receive input from PmWiki standard search form, with use of custom fmt template.
  • 2008-01-28: Added search form with markup (:extract:) and (:extractresult:). Optimised code. Improved handling of directives and highlighting. Removed timer since results were not very meaningful. Added default option arrays. Added capability to handle comma-separated pagename lists.
  • 2008-01-25a: Added error notice if no pages were found matching the PageName list. Changed full header to include number of pages searched.
  • 2008-01-25: Minor fixes to handling of parameters supplied.
  • 2008-01-24: Further improved highlighting. Added markup expressions to be rendered as source code rather than evaluated in output (same as directives). Improved vertical spacing for both nolinebreaks and linebreaks conditions, by adding custom (:spacer:) markup. Added markup expression {(cleanspacer ...)} as a wrapper for use in form templates to write output directly into a page, to remove the (:spacer:) markup.
  • 2008-01-23: Added handling of -PageName for page exclusion from source list. Added results counter and timer for option 'header'. Added case sensitive and insensitive search option. Improved handling of directives and of highlighting. Renamed 'out' to 'markup'.
  • 2008-01-22: Added 'highlight', 'unit' and 'out' options.
  • 2008-01-21b: Renamed script. Renamed expression to 'extract'. Renamed 'hide' option to 'snip'.
  • 2008-01-21a: Added suffix= option. Added handling of page section as source input. Added support for multiple PageNames, each can also have wiki wildcard characters, unless the pagename has a #section specified.
  • 2008-01-21: Enhanced lines= option. Changed fmt= to prefix=
  • 2008-01-20a: Added lines= parameter
  • 2008-01-20: Initial release

Comments

See discussion at TextExtract-Talk

Page last modified on September 20, 2014, at 11:22 PM