TextExtract-Talk

Back to TextExtract "Please put your new question at the top!"

Comments

header-footer error — fixed with version 2025-01-30

The defaults array $TextExtractOpt needs to specify

	'footer'     =>'',

Calling (extract ... header=something) without specifying a 'footer' results in errors.
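A minimal sketch of why the missing default matters, assuming the recipe merges the markup arguments over $TextExtractOpt. The merge step and the helper name mergeExtractOptions are illustrative and not taken from extract.php; only the 'footer' key comes from the report above.

```php
<?php
// Hypothetical defaults array; only the 'footer' entry is from the report.
$TextExtractOpt = array(
    'header' => '',
    'footer' => '',   // without this entry, $opt['footer'] would be undefined below
);

// Illustrative helper: arguments given in the markup override the defaults.
function mergeExtractOptions(array $defaults, array $args) {
    return array_merge($defaults, $args);
}

// Only 'header' is supplied, as in (extract ... header=something).
$opt = mergeExtractOptions($TextExtractOpt, array('header' => 'something'));

// 'footer' falls back to the empty string instead of raising a notice.
echo $opt['header'] . $opt['footer'], "\n";
```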

Problem with tables

Hi HansB. I noticed that with this part (case 'para':) of the update, TE does not display the entire table (simple or advanced) in the result, but only the row where the searched word is found.

	$newpara = array(); $j=0;
	foreach($rows as $i => $row) {
		$row = rtrim($row);
		if ($row=='') {  continue; }
		$j++;
		$newpara[$j] = '';
		$newpara[$j] .= $row."\n";
	}
	$rows = $newpara;
	break;

I ask to find out if it is an internal problem with my site (still under development in localhost) or it happens to you too. Thank you. - Frank January 28, 2023, at 02:05 PM

Did it behave differently before the update? Why should a table be regarded as a paragraph? We may have sentences and paragraphs within a table, within a cell. Maybe I need to make a special case for this. HansB

Exactly as you said We may have sentences and paragraphs within a table, within a cell. I had considered just that in the past, so all the tables I have in my site are intentionally composed with rows one after the other (no empty rows). Before the update, if I wanted to view the whole table, I searched for a word (that I knew was in it) with unit=para, and if I wanted to view only one row of the table, I searched with unit=line. After the update I no longer have this distinction, a row always appears.
I updated from version 2022-10-31 to version 2023-01-26. This was the code that worked for me:

	$paras = array(); $j=0;
	foreach($rows as $i => $row) {
		$row = rtrim($row);
		if ($row=='') { $j++; continue; }
		$paras[$j] .= $row."\n";
	}
	$rows = $paras;

Of course to get around the problem of having to compose tables like I did, it would be great to have a sort of tab=1, i.e. "if the searched word is within a table (even with empty rows here and there), even part of sentence or paragraph, display the whole table". But is it possible? - Frank January 28, 2023, at 05:04 PM
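The difference between the two loops quoted above can be seen in isolation. This is a sketch only: the function names groupByBlankLines and onePerRow are made up for illustration, and an isset() guard is added so the first loop runs without notices on current PHP.

```php
<?php
// Older behaviour: consecutive non-empty rows are joined into one unit,
// and an empty row starts the next unit.
function groupByBlankLines(array $rows) {
    $paras = array();
    $j = 0;
    foreach ($rows as $row) {
        $row = rtrim($row);
        if ($row == '') { $j++; continue; }
        if (!isset($paras[$j])) $paras[$j] = '';  // guard against an undefined offset
        $paras[$j] .= $row . "\n";
    }
    return $paras;
}

// 2023-01-26 behaviour: the counter advances on every non-empty row,
// so each row ends up as its own unit.
function onePerRow(array $rows) {
    $newpara = array();
    $j = 0;
    foreach ($rows as $row) {
        $row = rtrim($row);
        if ($row == '') { continue; }
        $j++;
        $newpara[$j] = $row . "\n";
    }
    return $newpara;
}

// A two-row table without empty rows, followed by a separate paragraph.
$rows = array('||cell a ||cell b ||', '||cell c ||cell d ||', '', 'A separate paragraph.');
echo count(groupByBlankLines($rows)), ' ', count(onePerRow($rows)), "\n";  // prints "2 3"
```

With the first loop the table stays together as one unit, which is why unit=para used to return the whole table.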

Okay, I see. I updated the script, reverting the code changes you have pointed out. HansB

Thank you - Frank January 28, 2023, at 05:50 PM

PHP8.1 -> pmwiki fails to render due error from count() [line 316 of script]

The function count() now throws a TypeError. How can it be fixed?

Many thanks, John Anglin 22-Oct-2022

Hopefully the latest update (version 22-10-31) has fixed that issue. - HansB

Order of the results for (and within) each page (Unrelated to the list of pages, order=results)

By default the content shown for each page follows the internal layout of the page text, e.g. if I search the terms grape|apple|mandarin (with regex, any unit=, except page; any markup=) the results will be:

(= layout of the page text) Page X/1
... apple ... mandarin ...

... grape ...

... mandarin ...

... apple ...

Page X/2
...

I would like to get (with a new parameter, maybe: dispose=entered or disposition=submitted) the following result:

(≠ layout of the page, order as submitted) Page X/1
... grape ...

... apple ... mandarin ... ^[1]

... apple ...

... mandarin ...

Page X/2
...

^[1] The first occurrence in the same unit= has priority... just an idea, maybe it is better that the precise match has the priority. Probably an obvious technical choice... that I do not know.

I understand that it is a very specific parameter for who knows what and where to look for. I would need it to find paragraphs (named with specific alphanumeric codes) in very long pages and have them in a certain order.
I think it would enhance the recipe and will also be useful to other users in other situations.

I hope it is possible, thank you. --Frank November 06, 2020, at 09:24 AM
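The requested ordering could be sketched as follows. This is not part of the recipe: the function name orderUnitsAsSubmitted is hypothetical, and the rule used here ranks a unit by the earliest submitted term it contains, which is one possible reading of footnote [1].

```php
<?php
// Hypothetical helper: sort extracted units by the position (in the submitted
// term list) of the first submitted term they contain. Units with the same
// rank keep their page order.
function orderUnitsAsSubmitted(array $units, array $terms) {
    $keyed = array();
    foreach ($units as $pos => $unit) {
        $rank = count($terms);                 // units matching no term go last
        foreach ($terms as $i => $term) {
            if (stripos($unit, $term) !== false) { $rank = $i; break; }
        }
        $keyed[] = array($rank, $pos, $unit);
    }
    usort($keyed, function ($a, $b) {
        // compare by rank first, then by original position as tie-breaker
        return ($a[0] <=> $b[0]) ?: ($a[1] <=> $b[1]);
    });
    return array_column($keyed, 2);
}

// Units in page order, terms in the order submitted.
$units = array('... apple ... mandarin ...', '... grape ...', '... mandarin ...', '... apple ...');
$terms = array('grape', 'apple', 'mandarin');
print_r(orderUnitsAsSubmitted($units, $terms));
```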

More options in result header?

Hi Hans, could we have one (or more) additional options for phead parameter, like phead=linktitle which would show page title, or phead=linknametitle which would show both page name and title? Thanks!

PHP 7.2 -> Parameter must be an array or an object that implements Countable

With PHP 7.2, I get the error "Warning: count(): Parameter must be an array or an object that implements Countable in .../extract.php on line 315" if the search text occurs in links. OliverBetz August 26, 2019, at 06:37 PM

You can suppress the warnings by placing a "@" before the count() call, like (@count($new[$j]['rows'])>0). It will behave exactly as before PHP 7.2, returning false if it is not an array. See PHP count() changelog. --Petko August 28, 2019, at 10:33 AM
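Note that from PHP 8 count() throws a TypeError for a non-countable argument, and the "@" operator does not suppress thrown errors. A guard that works on both old and new PHP versions could look like the sketch below. The helper name hasRows is hypothetical; the array shape only mirrors the $new[$j]['rows'] expression quoted above.

```php
<?php
// Hypothetical helper: true only when $value is an array or a Countable
// object holding at least one element. count() is never called otherwise.
function hasRows($value) {
    return (is_array($value) || $value instanceof Countable) && count($value) > 0;
}

// Sample data shaped like the expression in the reply above.
$new = array(
    1 => array('rows' => array('a line')),
    2 => array('rows' => null),
);

var_dump(hasRows($new[1]['rows']));                               // bool(true)
var_dump(hasRows($new[2]['rows']));                               // bool(false)
var_dump(hasRows(isset($new[3]['rows']) ? $new[3]['rows'] : null)); // bool(false)
```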

Two problems with word=1|0 (punctuation and compound words)

Hi Hans, I found this: There is no difference in the results by searching with the option word=1|0 in these two situations:

punctuation (just rtl languages) - because it is in the opposite position compared to the ltr languages (e.g. ',מים' vs 'ὑπόθεσιν,')
compound words (rtl and ltr languages) - (the cause is not clear to me, so I proceed with examples):

There are 5 מַיִם in the test page:

1 as complete word - string 6 (second to last word)
4 as part of compound words (i.e. מַיִם_)
- string 1 (third to last word)
- string 6 (last word - immediately followed by -rtl- punctuation)
- string 8 (middle of the string - immediately followed by -rtl- punctuation)
- string 9 (middle of the string)
מַיִם (word=1) - expected to find just 1 occurrence (i.e. string 6 - second to last word), but: found it plus 2 of the compound words in string 1 and 9.
מַיִם (word=0) - expected to find all 3 occurrences (5 if you consider those with punctuation), but: found 3 (not found those with punctuation, string 8 and 6 - last word).

In the ltr languages sometimes there is the same problem with compound words (e.g. τὰ_ ταὐτὰ - in the page) τὰ (word=1) - τὰ (word=0). But (οὐ_ οὐδέν - οὐχὶ) works: οὐ (word=1) - οὐ (word=0)
At this point it seems to me that the problem emerges more clearly comparing the position of the compounds ( e.g. _οῦ or οῦ_ ) οῦ (word=1) - οῦ (word=0). The same for rtl languages: ו (word=1) - ו (word=0)

I hope there is an easy solution, because I would need to use it also with combined options (e.g. serial=1 strict=1 word=0|1), e.g. אֱלֹהִים+מָיִם (word=1) - אֱלֹהִים+מָיִם (word=0).
P.S. Sorry if I wrote so much, but I wanted to report as many tests as possible. Thanks.

Frank July 11, 2018, at 04:44 PM

Too many occurrences found

Hi Hans, I have a problem. When TE finds a huge number of occurrences, the system is unable to open the wikipage and crashes with: "Allowed memory size of ... bytes exhausted (tried to allocate ... bytes) in .../pmwiki.php on line ...". I tried to increase the parameters in farmconfig.php, but when the occurrences are too many ... well, there is no way out. May I suggest a couple of ideas to solve the problem?

1) Automatic insertion of a virtual (:page-break:) markup, let's say after a fixed number of occurrences found (maybe: 100 for unit=line; 50 for unit=para; 5 for unit=page, etc.)
... or even better (perhaps too complicated?):
2) Make the results load when the user scrolls the page down

What do you think about it?

P.S.
Actually, maybe it would not be a bad idea to suggest the last one (but based on loading-time) as Core Candidate for 'standard pmwiki search' and for a 'standard wikipage loading' (when it is so heavy it can not be opened, see Memory problems).

Frank Feb 16, 2018, at 09:25 PM

It is more likely you need to stop PmWiki from opening too many pages for the search. TextExtract uses the normal PmWiki core search function to find pages which contain the search term, then goes through those pages and extracts the relevant text portions, as required by the parameters set. I don't know how the initial search can be broken into parts. - HansB February 16, 2018, at 02:51 PM

Maybe I understand... but it also happens when I look for certain words on just one page! Let me explain better, I have a couple of pages with so much text (data) that can not be opened, and that's fine, because you can access the portions of data that you need (and when you need) by extracting sections with TE. As long as the lines or the paragraphs found are about 40/50, there is no problem, but when they become higher, the problem mentioned above arises.
Do you think my second idea as Core Candidate could be a solution? Just to know if it is worthwhile that I propose it.

Frank Feb 16, 2018, at 10:35 PM
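The first idea (stop after a fixed number of occurrences) can be sketched with a generator. This is only an illustration of capping the result list: the function name limitedMatches is hypothetical, nothing like it is known to exist in the recipe, and it does not reduce the memory needed to read a very large page in the first place.

```php
<?php
// Hypothetical generator: yields matching lines one at a time and stops
// as soon as $limit matches have been produced.
function limitedMatches(array $lines, $term, $limit) {
    $found = 0;
    foreach ($lines as $line) {
        if (stripos($line, $term) === false) continue;
        yield $line;
        if (++$found >= $limit) return;   // remaining lines are never inspected
    }
}

$lines = array('water one', 'dry', 'water two', 'water three', 'water four');
foreach (limitedMatches($lines, 'water', 2) as $hit) {
    echo $hit, "\n";
}
```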

Browser Caching

Can I enable Browser Caching? I have set $EnableIMSCaching = 1;, but the result page will not be cached. Thank you!

XML pages

Sorry Hans, I am going to disturb you again. I was wondering if there is a way to make TextExtract work for pages stored as XML too, ... obviously if that does not mean to overturn the recipe. Thank you.

Frank Sept 12, 2017, at 00:21 AM

Sorry, Frank, but TextExtract works on the wiki source text of wiki pages. It processes source texts, extracts parts of it, removes perhaps some wiki markup, and adds others (like markup for highlighting the search terms). At the end all gets converted to HTML the standard PmWiki way. In order to process pages stored as XML, one would need to remove all XML tags first. But, more importantly, TextExtract uses PmWiki's PageList functions to search through wiki pages for the search terms, and I have no idea if PageList still works on a pagestore with XML pages. If yes, there may be some hope to tweak TextExtract for XML page stores. - HansB September 12, 2017, at 01:12 AM

Pagelists work, no PageList-functions involved (and the pages are stored in wiki.d as usual), just page-tags are transformed, e.g.:

<?xml version="1.0" encoding="UTF-8"?>
<page xmlns="http://www.pmwiki.org/cookbook/xmlpage" version="...">
<agent>...</agent>
<author>...</author>
...
<text>
...

...</text>
<time>...</time>
<author time="...">...</author>
<csum time="..."> ...</csum>
<diff:...:minor>...d0
&lt; ...
&lt; 
&lt; ...
\ No newline at end of file
</diff:...>
<host time="...">::...</host>
</page>

Frank Sept 12, 2017, at 05:12 PM

So what does PmWiki function ReadPage($pagename) return? TextExtract grabs the text after ReadPage is called, in function TETextRows():

$page = ReadPage($source); //source is page name from list of matches

....

$text = $page['text'];

Perhaps those lines would need modifications for XML pages. - HansB September 12, 2017, at 11:52 AM

PS: just checked, ReadPage() works as normal, and in a simple test with XML converted page store (I had all pages converted first) TextExtract works as well, without any modifications. So what does not work for you? - HansB September 12, 2017, at 12:13 PM

After your reply (PS) it was obvious that the mistake was mine ... and as a matter of fact: I wanted to enable XML just for the pages of a specific group, so I had put in groupX.php:

  $EnablePageStoreXML = 1;
  include_once('cookbook/XMLPageStore.php');
  $WikiDir = new XMLPageStore('wiki.d/{$FullName}');

Now I've moved the last two lines to farmconfig and everything works fine. I made a beginner's mistake, but if you didn't confirm that the recipe was working, I don't think I would have noticed it. Thank you.

Frank Sept 13, 2017, at 01:27 AM

Search and standard markup

I have a problem when I search for small phrases (two or three words concatenated) if I have used in the wikipage the standard markup to stress (bold, highlight, underline, etc.) a word, e.g.

Source: Labor est '''sit''' quo excepturi architecto ut nemo atque.

Search: “est sit quo”
Result: it finds nothing.

Could the search not take into account the standard markup such as: ''...'', '''...''', [+...+], [-...-], %color%...%%, {+...+}, etc., but at the same time achieve a search-result with an active markup (if markup=on)? For example, with the footnote markup [^...^] I can get this result perfectly:

Source: Labor est sit[^...^] quo excepturi architecto ut nemo atque.

Search: “est sit quo”
Result: Labor est sit^[1] quo excepturi architecto ut nemo atque.

Thank you.

Frank Jun 1, 2017, at 10:40 PM

Hi Frank, thanks for your suggestions! I have now reworked the script to make this possible. It was quite a bit of change, as we need to remove inline markup from the text of the pages being searched, during the search. Now any search terms enclosed in double quotes are treated as phrases and the search triggers a "remove inline markups" mode. This can also be achieved by adding phrase=1 to a searchbox markup, or clicking the 'Match phrase' checkbox of the (:extract...:) searchbox. Please note that due to the stripping out of inline markup, the displayed search results do not contain inline markup either. I hope this is a price worth paying for the added functionality. - HansB June 04, 2017, at 12:11 PM
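The idea of the phrase mode described above can be sketched as follows. This is not the recipe's code: stripInlineMarkupSketch is a hypothetical stand-in that covers only a few inline markups, and the patterns are assumptions chosen for the example.

```php
<?php
// Hypothetical stand-in for the recipe's inline-markup removal.
// Only a few markups are covered.
function stripInlineMarkupSketch($text) {
    $patterns = array(
        "/'{2,5}/",            // ''emphasis'', '''strong''', '''''both'''''
        '/\[[+-]+|[+-]+\]/',   // [+big+], [-small-]
        '/\{[+-]|[+-]\}/',     // {+ins+}, {-del-}
    );
    return preg_replace($patterns, '', $text);
}

// A phrase is looked for in the stripped text, not in the raw source.
function phraseFound($text, $phrase) {
    return stripos(stripInlineMarkupSketch($text), $phrase) !== false;
}

$source = "Labor est '''sit''' quo excepturi architecto ut nemo atque.";
var_dump(stripos($source, 'est sit quo') !== false);  // bool(false): raw text has markup in the way
var_dump(phraseFound($source, 'est sit quo'));        // bool(true): markup stripped before matching
```

Because matching happens on the stripped text, the matched unit no longer carries the markup, which is the trade-off mentioned in the reply.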

Thanks for the update. Unfortunately in my specific case it does not solve the problem as I wished because in half of the site I'm developing I use the recipe Glosses and it has an inline markup that divides each word: word1{gloss} word2{gloss} and I hoped they would be displayed (with markup=on).
I'm a beginner in php so I hoped it was simpler, like inserting a search regex command like: when you look for “word1 word2” don't consider the markups in the middle of the words: "(word1).*?(word2)" but replace them with '$1 $2' only as search-replace and not as result-replace too (I hope I explained my thought well, I don't own the technical terms).
Anyway, thank you.
Just a clarification:
I was wrong with the example of [^footnote^] (see above), no inline markup will be displayed as search results when the search is "phrase".

Frank Jun 5, 2017, at 09:30 PM

This looks like a complicated problem. For your specific usage, do you wish to enter for instance "word1 word2", and the search should search for "word1{abc def} word2{xyz gloss}", and have that returned in the result, including markup {...gloss...}? Do I understand you correctly? I am trying to understand that Linguistic Glosses recipe... - HansB June 05, 2017, at 03:09 PM

Exactly, always {gloss} added. If the whole paragraph is extracted with 'unit = para', also the (gloss) and (glossend) markup would be included and it would trigger the recipe.

Frank Jun 5, 2017, at 10:42 PM

It is a relatively simple matter to modify extract.php to remove all markup of form {some gloss text here} when searching, as part of the TERemoveInlineMarkup function. Then a phrase search will find the terms entered correctly. But the gloss is not restored to show in the result. I do not see yet a way to do so. When searching page text for a phrase, first TextExtract (TE) removes inline markup (and gloss markup would need to be added to the function for it to be removed), then the phrase is looked for, and matches are listed as page matches, so TE has a list of pages with matches. Then TE goes through that list, converts text into lines, searches again for the phrase in the lines, with inline markup being stripped out, and builds a search result list of text lines, basically. Once inline markup or gloss markup is stripped out, I do not see a way to add it back in again at the right places. - H.

Consider what follows for what it is, just the idea of a beginner:
H: ...so TE has a list of pages with matches. Then TE goes through that list,
F: ok
H: converts text into lines,
F: this conversion is based on the replacement list in TERemoveInlineMarkup function if I have understood correctly. Maybe you can skip this step for search “phrase” and case markup=on.
H: searches again for the phrase in the lines, with inline markup being stripped out,
F: this search for the phrase (on the pages found already) could be done directly on the text (is it possible?), based on a new function where all inline markups will be skipped by a regex.
Maybe to achieve this, the “phrase” should be

first split (match_split \s): e.g. Labor est '''sit''' -> three words
then searched as a composite string -> e.g. ($a)[\W]+($b)[\W]+($c)

to extract the paragraphs or the lines where it has been found.
H: and builds a search result list of text lines
F: ok
Maybe it's not possible to do so, but I hope this can give you some idea.

Frank Jun 6, 2017, at 05:20 AM
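The composite-string idea above can be tried out in a few lines. This is a sketch under assumptions: phrasePatternSketch is a hypothetical name, and the separator \W+ only tolerates non-word characters between the words, so it copes with quote-style markup but not with markup that contains text of its own, such as {gloss}.

```php
<?php
// Hypothetical helper following the idea above: split the phrase on
// whitespace and allow any run of non-word characters between the words.
function phrasePatternSketch($phrase) {
    $words  = preg_split('/\s+/', trim($phrase));
    $quoted = array_map(function ($w) { return preg_quote($w, '/'); }, $words);
    return '/' . implode('\W+', $quoted) . '/iu';
}

$pattern = phrasePatternSketch('est sit quo');

// Quote-style markup between the words is skipped by \W+.
var_dump(preg_match($pattern, "Labor est '''sit''' quo excepturi"));   // int(1)

// Gloss markup contains word characters, so the same pattern fails here.
var_dump(preg_match($pattern, 'est{gloss} sit{gloss} quo{gloss}'));    // int(0)
```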

Not as easy! The phrase in your example is "est sit". The page text could be anything, containing perhaps a string like "est '''sit'''" or like "''est''{somegloss} sit{more gloss}" or any other combinations of inline markup and, in your case, gloss markup. So we try to find word "est" followed by word "sit", without knowing what characters are in between or surrounding or following these words, and also we want to preserve exactly those characters we do not know. As yet I can only see the solutions of a) stripping out inline markup so we can find a match to phrase ("est sit"), or b) not looking for a phrase at all, but live with extra results by searching for "est" AND "sit", in which case inline markup will be preserved.

As regards to the glosses recipe:

The markup definition at the bottom of the script needs to be updated to use Markup_e() function, in order to be PHP5.5 compatible (no /e parameter allowed in regex).
The {gloss} markup also interferes with other inline markup (the {+ins+} and {-del-} markup for example, there may be others)
in TE results any {gloss} markup will not be rendered, even if it was not stripped (by not using a phrase search), unless we also use unit=para and markup=on, because of the need for the (:gloss:) and (:glossend:) markup.

So some improvements to that recipe may be advantageous. - H.

PS: have a look/go with the latest update. I implemented a stricter search, which will leave out results where the search terms are not in the same paragraph, sentence, line, whatever the unit parameter is set to. Which might mean a search for phrase may not be necessary, because the standard results are pretty good too. - I also added a way to add custom markup patterns for markup removal, if search by phrase is used. - H.

PPS: The latest version also has a new parameter serial, default is serial=0. Setting it to 1 together with strict=1, but not phrase=1, will match the terms only if they are found serially in the same text unit (paragraph, sentence, line), in the order submitted. So "fat cat" entered will match "a fat man and a stray cat" and "a fat catfish" but not "the cat ate a fat fish". - H.
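The serial rule can be illustrated with the three examples given. This is a sketch of the described behaviour, not the recipe's implementation; matchesSerially is a hypothetical name, and stripos() is used, so case folding is only reliable for ASCII text here.

```php
<?php
// Hypothetical helper: true when every term occurs in $unit in the order
// submitted, each one after the end of the previous match.
function matchesSerially($unit, array $terms) {
    $offset = 0;
    foreach ($terms as $term) {
        $pos = stripos($unit, $term, $offset);
        if ($pos === false) return false;
        $offset = $pos + strlen($term);   // continue searching after this match
    }
    return true;
}

$terms = array('fat', 'cat');
var_dump(matchesSerially('a fat man and a stray cat', $terms));  // bool(true)
var_dump(matchesSerially('a fat catfish', $terms));              // bool(true)
var_dump(matchesSerially('the cat ate a fat fish', $terms));     // bool(false)
```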

Thank you very much Hans, I really like this solution, in my case it is perfect, even better than searching "phrases".
Just one last thing (I have bothered you enough already). Running various tests I noticed that TE did not find:

Latin characters: words containing the letters à è é ì ò
Hebrew characters: words with some letters (I use only basic letters, no vocal punctuation sign, etc., e.g. והנחש). For Hebrew, I add that when it does not recognize a letter, it still finds the sentence / paragraph but breaking the word in half, e.g. וה + חש (in this case it did not find the נ in the middle)
Greek characters: no word at all (I use only basic letters, no accents, spirits, diacritical signs, etc., e.g. ιησου ΙΗΣΟΥ)

Is it possible to do something? Thank you.

Frank Jun 7, 2017, at 06:20 AM

Thank you for pointing this out! I think it happened when word=1, word boundaries were not seen properly for UTF-8 characters. This is now fixed in latest update. - HansB June 07, 2017, at 04:20 AM

Now:

Latin characters: All right except for the letter à (TE does not recognize it but finds the words anyway. It seems that it does not distinguish it from á).
Hebrew characters: the problem remained.
Greek characters: perfect.

I saw the regex that you changed. Maybe adding the modifier \u (but I'm not sure). I would test it myself but I do not know where to place it in that particular string (I have found the information here).

Frank Jun 7, 2017, at 02:53 PM
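In PHP's PCRE functions the UTF-8 modifier is written as u after the closing delimiter. The sketch below shows one Unicode-aware way to express a whole-word match; it is an assumption about how such a test could be written, not the regex used in extract.php, and wholeWordPattern is a hypothetical name.

```php
<?php
// Hypothetical helper: whole-word match that treats letters, combining marks,
// digits and the underscore of any script as word characters.
function wholeWordPattern($term) {
    $w = '[\p{L}\p{M}\p{N}_]';
    return '/(?<!' . $w . ')' . preg_quote($term, '/') . '(?!' . $w . ')/u';
}

// 'sar' is only part of 'sarà', so it is not a whole word.
var_dump(preg_match(wholeWordPattern('sar'), 'sarà bello'));    // int(0)
// The full word is found.
var_dump(preg_match(wholeWordPattern('sarà'), 'sarà bello'));   // int(1)
// A Greek capital letter counts as part of the word as well.
var_dump(preg_match(wholeWordPattern('ΑΝΤΩΝ'), 'ΠΑΝΤΩΝ'));      // int(0)
```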

Frank, all works fine here on my local computer, Windows with XAMPP server, Firefox. UTF encoding works, searches work. I uploaded a UTF-8 test page, with proper UTF encoded text samples, and made a search page here on my site:
Search Page - UTF-Test page
Now I see that on the test site searches for RTL languages do not get results, LTR do. On my local machine both do get results. Strange.
P.S: Try a search for a RTL word with the normal PmWiki search. On my testsite this fails. On my local machine it succeeds. So it seems some other issue here, not one of TE. Need to investigate this later... - H.

I tried reinstalling everything (latest versions of XAMPP, Eclipse and of course PmWiki). When I ran Eclipse I got the message: An internal error occurred during: "Modifying Include Path". java.lang.NullPointerException so I thought: there you are! (but I spoke too soon). I tried different versions of Eclipse and with 'Oxygen' the message no longer appeared, I reinstalled my version of PmWiki, I ran the tests... nothing, all the same.
Maybe it is related: I have the same problems about recognizing some hebrew letters in Glosses too. I planned to report it to Petko right after the solution of the search-problem, ... perhaps he is reading this discussion already.

Frank Jun 8, 2017, at 10:30 AM

I am raising this pagelist search issue now on the pmwiki-user mailing list. It may have to do with some server setting I do not know about. - H.

After more tests I found that deleting the .pageindex file in wiki.d made searches for RTL terms possible. - H.

I probably did something wrong. I will try with a new installation, one recipe after another (It will take a while) and I'll let you know. Thank you for your help.

Frank Jun 8, 2017, at 04:30 PM

I was not able to solve the problem. I tried a new pmwiki installation with TE only, but the message "java.lang.NullPointerException" came up again (see above). So I changed everything: from Eclipse to Netbeans, from XAMPP to AMPPS (I have to say it's a great tool for pmwiki beginners), but nothing. Maybe it's a problem with my Mac. Conclusion: I'm forced to live with this problem, and I hope that when I put my site online (at the end of this year) everything will work. Thanks again Hans for your help; at least now it works for ltr languages.

Frank Jun 10, 2017, at 06:15 AM

Okay. I cannot help with problems relating to Eclipse and Java. I do not understand what Eclipse has to do with running PmWiki (can you not run PmWiki without Eclipse?), or what you are trying to achieve there. I hope you can find a way to solve it. It is best to have a working version of a new site on the local machine, before hosting it online. - H.

Concerning search finding international characters, make sure when the pages were saved the xlpage-utf-8.php script was already included, and the page source text contains actual international characters, not HTML entities like "جست‌وج&#160" (or similar). When your browser posts a page text, if an international character is not available in the page encoding, then it will be converted/encoded to a Unicode entity and this will be in the page source, then in the pageindex only the "words" ([0-9a-z]) will be saved, and the search will not find the international characters. If such is your case, you have to re-edit your pages and save the text as actual characters not encoded entities (and the pageindex will be updated). --Petko June 10, 2017, at 03:53 AM
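A quick way to check a page source for the situation described above is sketched here. Both function names are hypothetical, and the sketch works on a plain string; how the text is read from and saved back to the page store is left out. Note that html_entity_decode() also decodes named entities such as &amp;amp;, which may not be wanted in wiki source.

```php
<?php
// Hypothetical check: does the text contain numeric character references
// such as &#1504; or &#x5E0; instead of actual characters?
function hasNumericEntities($text) {
    return preg_match('/&#(?:\d+|x[0-9a-f]+);/i', $text) === 1;
}

// Convert character references back to UTF-8 characters.
function decodeEntities($text) {
    return html_entity_decode($text, ENT_QUOTES, 'UTF-8');
}

$stored = '&#1508;&#1504;&#1497;';        // three Hebrew letters saved as entities
var_dump(hasNumericEntities($stored));     // bool(true)
$fixed = decodeEntities($stored);
var_dump(hasNumericEntities($fixed));      // bool(false)
```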

Unfortunately (it's strange to use this word, but in this case ...) all you have listed (xlpage-utf-8.php inclusion, international characters in .pageindex and in the source pages; charset=UTF-8 in wiki.d pages) is in order.
Now my situation is the following: everything works as expected, except for three characters that are not recognized by TE and by Gloss:

Latin: à (italic) as in sarà (italian);
Hebrew: נ (= n/N) as in פני;
Greek: Π (capital, = P) as in ΠΑΝΤΩΝ.

What I do not understand is that without any markup they are normally displayed in wikipages but inside Gloss markup (I mean here{...} ) they are not recognized (e.g. ΑΝΤΩΝ�) and TE (during the search) considers them word-breaks (e.g. � + ΑΝΤΩΝ) ... however in the TE result they are normally displayed.
This does not prevent me from finishing the site... it just makes me crazy not to understand why it happens! I just hope it's a Mac problem!
P.S. Thank you guys for your efforts, I appreciate it a lot.

Frank Jun 10, 2017, at 02:00 PM

I cannot replicate these errors at all, sorry! As said above, I have some misgivings about the gloss markup recipe, as it seems to corrupt other markup. For instance I just discovered that single or double backslashes at line end are not recognised as markup for joining lines or putting in line breaks, so lines can stay within a paragraph. - In TE results with markup=on and unit=para a {gloss} markup will only be displayed as gloss, if the paragraph contains the gloss (:gloss...:) directives. I wish that recipe would use a less aggressive markup (it uses not even a markup rule), like other inline markup rules, so one could dispense with the use of (:gloss:) and (:glossend:) altogether. - HansB June 12, 2017, at 03:23 AM

I built a simple custom glosses-based inline markup (strictly adapted to my needs) that avoids this problem, in this way in wikipages all the characters are readable and with TE the results are displayed not just with unit=para. But for the recognition of those particular three characters (see above) I will have to be patient. Thanks again.

Frank Jun 12, 2017, at 01:48 PM

Use as a smart include

Is it possible to use this as a smart include, e.g. I want to include all individual lines matching a pattern in another page?

simon May 28, 2016, at 05:25 PM

HansB: perhaps use it as a markup expression.

Output to different page

Is there an equivalent of the search box target=group.page parameter?

simon June 09, 2015, at 06:54 PM

Added in latest update. HansB

Output of extra line or sentence

Using unit=line is it possible to display the line the match is found on, and the following line?

simon May 11, 2015, at 09:57 PM

I do not see a simple solution to do this, as the script processes each line in turn. HansB June 05, 2015, at 08:23 AM

OK, thanks

simon June 05, 2015, at 05:16 PM

I think that what I am looking for is a variation on unit=para, say unit=cell or unit=cellnr, as when I use unit=para the entire table shows up. I.e. unit=cell would be unit=para plus terminated by the next cell, cellnr, or tableend. Similarly for unit=cellnr except cell would not be a terminator of the display.

simon June 09, 2015, at 06:48 PM

UPDATE: With the latest update I added a unit=sent and unit=dsent and unit=dline option. With unit=dsent results are shown in a single sentence, plus the following sentence, if in the paragraph (no empty line between). Similar with unit=dline for a two line display. Not sure it will help you, but I find it a great improvement in cutting down single result size, while giving meaningful results. HansB

PS: a search text extract based on table cells would be really difficult, I think. There are also different kinds of table markup to look for. Apart from the difficulty that most content is not in table cells. I hope the unit=sent option may work instead, as it would use a sentence, in a cell or not, disregarding the line. But it depends perhaps how text content is structured within a table cell. HansB

This was exactly what I was looking for. Thanks very much for the two changes above. simon July 10, 2015, at 04:11 AM

Searching without search term (page search)

The PmWiki search recognises name=pagename, or group=groupname. I use the Text Extract search as a substitute for it. It seems to me it should behave in the same way as the PmWiki search.

simon June 05, 2015, at 05:16 PM

UPDATE: A TextExtract search using a custom search form, the (:extract ...:) search form, or simply fmt=extract with the PmWiki search form will use the same kind of parameters to narrow down the search to specific pages. The latest release will provide a default pagelist when no search terms are supplied, but without the statistics of how many pages were searched and found. I could not manage to get the statistics to show, but it may be better than just the error message that no search term is supplied.

Also: if you enter Group/ it will return a list of pages for that group. If you enter Group/ searchterm it will look for searchterm inside the pages of Group.

HansB June 10, 2015, at 05:19 AM

Output formatted as a table?

I'm trying to list selected items from tables on different pages in one new table using the extract markup. My new table consists of one row per extract statement, each containing the lines that are the extract result. This works very well, but due to the processing sequence, the extract results in each row are an encapsulated table by themselves so there are no continuous columns spanning all rows. (I'd prefer continuous columns as this would make it simple to copy & paste the relevant parts of the listed information, in this case e-mail addresses.) I've tried generating the list with a single extract statement, which is not straightforward as there is no clear pattern in the to-be-searched group names, but this does not yield continuous columns either. Is there a way to get continuous columns (or to isolate the e-mail address from the source line though the search term is somewhere else in the same line)? --Henning February 03, 2010, at 10:31 AM

I don't see how you can get output formatted as one table. But perhaps the last is possible: use the snip= parameter to remove unwanted parts of the result line. - HansB February 03, 2010, at 11:56 AM

Thanks, your answer showed me the way. I use snip=.*mailto.|\| now as the mailto: markup itself is not vital to me and the email address is the rightmost column in the source tables. If the expression looks clumsy, it's because it's the first time I give regular expressions a try :-) --Henning February 03, 2010, at 12:36 PM
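What the expression above removes can be checked outside the wiki. The sketch assumes that snip deletes every match of the pattern from a result line; how the recipe applies it internally is not shown on this page. The helper name applySnipSketch and the sample table row are made up.

```php
<?php
// Assumption: snip removes every match of the pattern from a result line.
// '~' is used as delimiter so the pattern can be passed in unchanged.
function applySnipSketch($line, $snip) {
    return trim(preg_replace('~' . $snip . '~', '', $line));
}

// Made-up table row with the e-mail address in the rightmost column.
$line = '||Jane Doe ||Sales ||mailto:jane@example.com ||';

// '.*mailto.' drops everything up to and including "mailto:",
// '\|' then drops the remaining column separators.
echo applySnipSketch($line, '.*mailto.|\|'), "\n";   // prints "jane@example.com"
```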

Sorting output by page age

Given that I want to extract content from several news pages, like

{(extract '.' News.* -News.RecentChanges -News.Uebersicht -News.*-Talk 
prefix=link lines=3 cut='!')}

Is there a way to sort the page processing by the age of the pages (e.g. last edit or so), so that the latest news are displayed on top? Ingo vonBorstel September 09, 2008, at 10:26 AM

HansB: I think you could install PowerTools and use the (pagelist ...) with normal PageLists#pagelistorder sort order parameters to provide the list of pages inside the (extract ..) expression, for instance like this:

{(extract . (pagelist name=News.*,-News.RecentChanges,-News.Uebersicht,-News.*-Talk
order=-ctime) prefix=link lines=3 cut='!')}

Talk page for the TextExtract recipe (users).