TextExtract-Talk

Please put your new question at the top!

Comments


XML pages

Sorry Hans, I am going to disturb you again. I was wondering if there is a way to make TextExtract work for pages stored as XML too, ... obviously if that does not mean to overturn the recipe. Thank you.

Frank Sept 12, 2017, at 00:21 AM

Sorry, Frank, but TextExtract works on the wiki source text of wiki pages. It processes source texts, extracts parts of them, perhaps removes some wiki markup, and adds other markup (such as markup for highlighting the search terms). At the end everything is converted to HTML the standard PmWiki way. In order to process pages stored as XML, one would need to remove all XML tags first. But, more importantly, TextExtract uses PmWiki's PageList functions to search through wiki pages for the search terms, and I have no idea if PageList still works on a pagestore with XML pages. If yes, there may be some hope to tweak TextExtract for XML page stores. - HansB September 12, 2017, at 01:12 AM

Pagelists work, no PageList-functions involved (and the pages are stored in wiki.d as usual), just page-tags are transformed, e.g.:

<?xml version="1.0" encoding="UTF-8"?>
<page xmlns="http://www.pmwiki.org/cookbook/xmlpage" version="...">
<agent>...</agent>
<author>...</author>
...
<text>
...

...</text>
<time>...</time>
<author time="...">...</author>
<csum time="..."> ...</csum>
<diff:...:minor>...d0
&lt; ...
&lt; 
&lt; ...
\ No newline at end of file
</diff:...>
<host time="...">::...</host>
</page>
Frank Sept 12, 2017, at 05:12 PM

So what does PmWiki function ReadPage($pagename) return? TextExtract grabs the text after ReadPage is called, in function TETextRows():

$page = ReadPage($source); //source is page name from list of matches
....
$text = $page['text'];

Perhaps those lines would need modifications for XML pages. - HansB September 12, 2017, at 11:52 AM

PS: just checked, ReadPage() works as normal, and in a simple test with XML converted page store (I had all pages converted first) TextExtract works as well, without any modifications. So what does not work for you? - HansB September 12, 2017, at 12:13 PM

After your reply (PS) it was obvious that the mistake was mine ... and as a matter of fact: I wanted to enable XML just for the pages of a specific group, so I had put in groupX.php:

  $EnablePageStoreXML = 1;
  include_once('cookbook/XMLPageStore.php');
  $WikiDir = new XMLPageStore('wiki.d/{$FullName}');

Now I've moved the last two lines to farmconfig and everything works fine. I made a beginner's mistake, but if you hadn't confirmed that the recipe was working, I don't think I would have noticed it. Thank you.

Frank Sept 13, 2017, at 01:27 AM

Search and standard markup

I have a problem when I search for small phrases (two or three words concatenated) if I have used in the wikipage the standard markup to stress (bold, highlight, underline, etc.) a word, e.g.

Source: Labor est '''sit''' quo excepturi architecto ut nemo atque.

Search: “est sit quo”
Result: it finds nothing.

Could the search not take into account the standard markup such as: ''...'', '''...''', [+...+], [-...-], %color%...%%, {+...+}, etc., but in the same time achieve a search-result with an active markup (if markup=on)? For example, with the footnote markup [^...^] I can get this result perfectly:

Source: Labor est sit[^...^] quo excepturi architecto ut nemo atque.

Search: “est sit quo”
Result: Labor est sit[1] quo excepturi architecto ut nemo atque.

Thank you.

Frank Jun 1, 2017, at 10:40 PM

Hi Frank, thanks for your suggestions! I have now reworked the script to make this possible. It was quite a bit of change, as we need to remove inline markup from the text of the pages being searched, during the search. Now any search terms enclosed in double quotes are treated as phrases and the search triggers a "remove inline markups" mode. This can also be achieved by adding phrase=1 to a searchbox markup, or clicking the 'Match phrase' checkbox of the (:extract...:) searchbox. Please note that due to the stripping out of inline markup, the displayed search results do not contain inline markup either. I hope this is a price worth paying for the added functionality. - HansB June 04, 2017, at 12:11 PM

Thanks for the update. Unfortunately, in my specific case it does not solve the problem as I had hoped, because in half of the site I'm developing I use the Glosses recipe, which has inline markup that divides each word: word1{gloss} word2{gloss}, and I hoped the glosses would be displayed (with markup=on).
I'm a beginner in PHP, so I hoped it would be simpler, like inserting a search regex such as: when you look for “word1 word2”, don't consider the markup between the words: "(word1).*?(word2)", replacing it with '$1 $2' only for the search and not in the result (I hope I explained my thought well; I don't know the technical terms).
Anyway, thank you.
Just a clarification:
I was wrong about the [^footnote^] example (see above): no inline markup is displayed in the search results when the search is a "phrase" search.

Frank Jun 5, 2017, at 09:30 PM

This looks like a complicated problem. For your specific usage, do you wish to enter for instance "word1 word2", and the search should search for "word1{abc def} word2{xyz gloss}", and have that returned in the result, including markup {...gloss...}? Do I understand you correctly? I am trying to understand that Linguistic Glosses recipe... - HansB June 05, 2017, at 03:09 PM

Exactly, always {gloss} added. If the whole paragraph is extracted with 'unit = para', also the (gloss) and (glossend) markup would be included and it would trigger the recipe.

Frank Jun 5, 2017, at 10:42 PM

It is a relatively simple matter to modify extract.php to remove all markup of the form {some gloss text here} when searching, as part of the TERemoveInlineMarkup function. Then a phrase search will find the terms entered correctly. But the gloss is not restored to show in the result. I do not yet see a way to do so. When searching page text for a phrase, first TextExtract (TE) removes inline markup (and gloss markup would need to be added to the function for it to be removed), then the phrase is looked for, and matches are listed as page matches, so TE has a list of pages with matches. Then TE goes through that list, converts text into lines, searches again for the phrase in the lines, with inline markup being stripped out, and builds a search result list of text lines, basically. Once inline markup or gloss markup is stripped out, I do not see a way to add it back in again at the right places. - H.
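The strip-then-search step described above can be sketched in a few lines. This is only an illustration in Python, not the recipe's PHP code: the pattern list is hypothetical (the real list lives in TERemoveInlineMarkup and may differ), and the function names are made up for this example.

```python
import re

# Hypothetical inline-markup patterns, for illustration only.
INLINE_MARKUP = [
    r"'{2,3}",             # ''emphasis'' and '''strong'''
    r"\[[+-]+|[+-]+\]",    # [+big+] and [-small-]
    r"\{[+-]|[+-]\}",      # {+ins+} and {-del-}
    r"%[^%\s]*%",          # %color% ... %%
    r"\{[^{}+-][^{}]*\}",  # {gloss}-style markup (custom addition)
]

def strip_inline_markup(text):
    """Remove every pattern in INLINE_MARKUP from the text."""
    for pattern in INLINE_MARKUP:
        text = re.sub(pattern, "", text)
    return text

def find_phrase(text, phrase):
    """Return the stripped text if it contains the phrase, else None."""
    plain = strip_inline_markup(text)
    return plain if phrase.lower() in plain.lower() else None

print(find_phrase("Labor est '''sit''' quo excepturi", "est sit quo"))
```

Note that the function can only return the stripped text: once the markup is gone, the sketch has no record of where it was, which is the limitation described above.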

Consider what follows for what it is, just the idea of a beginner:
H: ...so TE has a list of pages with matches. Then TE goes through that list,
F: ok
H: converts text into lines,
F: this conversion is based on the replacement list in TERemoveInlineMarkup function if I have understood correctly. Maybe you can skip this step for search “phrase” and case markup=on.
H: searches again for the phrase in the lines, with inline markup being stripped out,
F: this search for the phrase (on the pages found already) could be done directly on the text (is it possible?), based on a new function where all inline markups will be skipped by a regex.
Maybe to achieve this, the “phrase” should be

  • first split (match_split \s): e.g. Labor est '''sit''' -> three words
  • then searched as a composite string -> e.g. ($a)[\W]+($b)[\W]+($c)

to extract the paragraphs or the lines where it has been found.
H: and builds a search result list of text lines
F: ok
Maybe it's not possible to do so, but I hope this can give you some idea.

Frank Jun 6, 2017, at 05:20 AM
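Frank's split-and-rejoin idea can be tried out with a small sketch. It is written in Python rather than the recipe's PHP, the function names are hypothetical, and it is not how TextExtract works; it only shows what joining the words with \W+ does and does not catch.

```python
import re

def phrase_pattern(phrase):
    """Join the words of the phrase with \\W+ so non-word markup may sit between them."""
    words = [re.escape(word) for word in phrase.split()]
    return re.compile(r"\W+".join(words), re.IGNORECASE)

def find_with_markup(text, phrase):
    """Return the matched span of the original text, markup included, or None."""
    match = phrase_pattern(phrase).search(text)
    return match.group(0) if match else None

print(find_with_markup("Labor est '''sit''' quo excepturi", "est sit quo"))
print(find_with_markup("est{gloss a} sit{gloss b}", "est sit"))
```

The first call returns the span with the ''' markup preserved. The second returns None, because the gloss text between the words contains word characters that \W+ cannot skip, so gloss markup would need a more specific pattern.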

Not as easy! The phrase in your example is "est sit". The page text could be anything, containing perhaps a string like "est '''sit'''" or like "''est''{somegloss} sit{more gloss}" or any other combinations of inline markup and, in your case, gloss markup. So we try to find word "est" followed by word "sit", without knowing what characters are in between or surrounding or following these words, and also we want to preserve exactly those characters we do not know. As yet I can only see the solutions of a) stripping out inline markup so we can find a match to phrase ("est sit"), or b) not looking for a phrase at all, but live with extra results by searching for "est" AND "sit", in which case inline markup will be preserved.

As regards the Glosses recipe:

  • The markup definition at the bottom of the script needs to be updated to use the Markup_e() function, in order to be PHP 5.5 compatible (no /e parameter allowed in regex).
  • The {gloss} markup also interferes with other inline markup (the {+ins+} and {-del-} markup for example; there may be others).
  • In TE results any {gloss} markup will not be rendered, even if it was not stripped (by not using a phrase search), unless we also use unit=para and markup=on, because of the need for the (:gloss:) and (:glossend:) markup.

So some improvements to that recipe may be advantageous. - H.

PS: have a look/go with the latest update. I implemented a stricter search, which will leave out results where the search terms are not in the same paragraph, sentence, line, whatever the unit parameter is set to. Which might mean a search for phrase may not be necessary, because the standard results are pretty good too. - I also added a way to add custom markup patterns for markup removal, if search by phrase is used. - H.

PPS: The latest version also has a new parameter serial, default serial=0. Setting serial=1 together with strict=1, but not phrase=1, will match the terms only if they are found serially, in the order submitted, in the same text unit (paragraph, sentence, line). So "fat cat" entered will match "a fat man and a stray cat" and "a fat catfish" but not "the cat ate a fat fish". - H.
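The serial matching rule can be illustrated with the three examples given above. This is a minimal Python sketch with a hypothetical function name, assuming plain substring matching (no whole-word option) within a single text unit; it is not the recipe's actual code.

```python
def matches_serially(unit, terms):
    """Return True if every term occurs in the unit, in the order given."""
    haystack = unit.lower()
    position = 0
    for term in terms:
        found = haystack.find(term.lower(), position)
        if found == -1:
            return False
        # Continue the search after the end of this term.
        position = found + len(term)
    return True

for unit in ("a fat man and a stray cat", "a fat catfish", "the cat ate a fat fish"):
    print(unit, "->", matches_serially(unit, ["fat", "cat"]))
```

Because each term is searched for only after the end of the previous one, "the cat ate a fat fish" fails: there is no "cat" after "fat".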

Thank you very much Hans, I really like this solution, in my case it is perfect, even better than searching "phrases".
Just one last thing (I have bothered you enough already). Running various tests I noticed that TE did not find:

  • Latin characters: words containing the letters à è é ì ò
  • Hebrew characters: words with some letters (I use only basic letters, no vocal punctuation sign, etc., e.g. והנחש). For Hebrew, I add that when it does not recognize a letter, it still finds the sentence / paragraph but breaking the word in half, e.g. וה + חש (in this case it did not find the נ in the middle)
  • Greek characters: no word at all (I use only basic letters, no accents, spirits, diacritical signs, etc., e.g. ιησου ΙΗΣΟΥ)

Is it possible to do something? Thank you.

Frank Jun 7, 2017, at 06:20 AM

Thank you for pointing this out! I think it happened when word=1, word boundaries were not seen properly for UTF-8 characters. This is now fixed in latest update. - HansB June 07, 2017, at 04:20 AM
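A whole-word search can miss non-ASCII words when the regex engine only treats ASCII letters as word characters. The sketch below reproduces that effect in Python, where the re.ASCII flag stands in for a regex engine that is not Unicode-aware. It is an assumption that this mirrors the cause in the recipe; the function name and sample text are made up.

```python
import re

def whole_word_search(text, word, unicode_aware=True):
    """Search for the word between \\b boundaries, with or without Unicode-aware \\w."""
    flags = 0 if unicode_aware else re.ASCII
    return re.search(r"\b" + re.escape(word) + r"\b", text, flags) is not None

text = "εν ονοματι ιησου χριστου"
print(whole_word_search(text, "ιησου"))                       # Unicode-aware boundaries
print(whole_word_search(text, "ιησου", unicode_aware=False))  # ASCII-only boundaries
```

With ASCII-only boundaries the Greek letters count as non-word characters, so there is no boundary between a space and the first letter and the search finds nothing.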

Now:
  • Latin characters: All right except for the letter à (TE does not recognize it but finds the words anyway. It seems that it does not distinguish it from á).
  • Hebrew characters: the problem remained.
  • Greek characters: perfect.

I saw the regex that you changed. Maybe adding the /u modifier would help (but I'm not sure). I would test it myself but I do not know where to place it in that particular string (I have found the information here).

Frank Jun 7, 2017, at 02:53 PM

Frank, all works fine here on my local computer, Windows with XAMPP server, Firefox. UTF encoding works, searches work. I uploaded a UTF-8 test page, with proper UTF encoded text samples, and made a search page here on my site:
Search Page - UTF-Test page
Now I see that on the test site searches for RTL languages do not get results, while LTR ones do. On my local machine both get results. Strange.
P.S.: Try a search for an RTL word with the normal PmWiki search. On my test site this fails. On my local machine it succeeds. So it seems to be some other issue here, not one of TE. I need to investigate this later... - H.

I tried reinstalling everything (latest versions of XAMPP, Eclipse and of course PmWiki). When I ran Eclipse I got the message: An internal error occurred during: "Modifying Include Path". java.lang.NullPointerException so I thought: there you are! (but I spoke too soon). I tried different versions of Eclipse and with 'Oxygen' the message no longer appeared. I reinstalled my version of PmWiki and ran the tests... nothing, all the same.
Maybe it is related: I have the same problems about recognizing some hebrew letters in Glosses too. I planned to report it to Petko right after the solution of the search-problem, ... perhaps he is reading this discussion already.

Frank Jun 8, 2017, at 10:30 AM

I am raising this pagelist search issue now on the pmwiki-user mailing list. It may have to do with some server setting I do not know about. - H.

After more tests I found that deleting the .pageindex file in wiki.d made searches for RTL terms possible. - H.

I probably did something wrong. I will try with a new installation, one recipe after another (It will take a while) and I'll let you know. Thank you for your help.

Frank Jun 8, 2017, at 04:30 PM

I was not able to solve the problem. I tried a new PmWiki installation with TE only, but the message "java.lang.NullPointerException" came up again (see above). So I changed everything: from Eclipse to NetBeans, from XAMPP to AMPPS (I have to say it's a great tool for PmWiki beginners), but nothing. Maybe it's a problem with my Mac. Conclusion: I'm forced to live with this problem, and I hope that when I put my site online (at the end of this year) everything will work. Thanks again, Hans, for your help; at least now it works for LTR languages.

Frank Jun 10, 2017, at 06:15 AM

Okay. I cannot help with problems relating to Eclipse and Java. I do not understand what Eclipse has to do with running PmWiki (can you not run PmWiki without Eclipse?) or what you are trying to achieve there. I hope you can find a way to solve it. It is best to have a working version of a new site on the local machine before hosting it online. - H.

Concerning search finding international characters, make sure that when the pages were saved the xlpage-utf-8.php script was already included, and that the page source text contains actual international characters, not HTML entities like "&#1580;&#1587;&#1578;&#8204;&#1608;&#1580;&#160" (or similar). When your browser posts a page text, if an international character is not available in the page encoding, then it will be converted/encoded to a Unicode entity and this will be in the page source. Then in the pageindex only the "words" ([0-9a-z]) will be saved, and the search will not find the international characters. If such is your case, you have to re-edit your pages and save the text as actual characters, not encoded entities (and the pageindex will be updated). --Petko June 10, 2017, at 03:53 AM
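The effect of entity-encoded source text on a word index can be seen in a small sketch. The index_words function below is a hypothetical stand-in that keeps only [0-9a-z] runs, as described above; it is not PmWiki's actual indexing code, and the helper names are made up.

```python
import html
import re

def index_words(text):
    """Hypothetical stand-in for an index that keeps only [0-9a-z] runs."""
    return re.findall(r"[0-9a-z]+", text.lower())

def has_numeric_entities(text):
    """Detect numeric character references such as &#1580; in page source."""
    return re.search(r"&#\d+;", text) is not None

def decode_entities(text):
    """Turn entities back into actual characters."""
    return html.unescape(text)

encoded = "&#1580;&#1587;&#1578;"
print(index_words(encoded))       # only the digits of the entities survive
print(has_numeric_entities(encoded))
print(decode_entities(encoded))   # the actual characters
```

A check like has_numeric_entities could help spot pages whose source needs to be re-edited and saved with real characters.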

Unfortunately (it's strange to use this word, but in this case ...) everything you have listed (xlpage-utf-8.php inclusion, international characters in .pageindex and in the source pages, charset=UTF-8 in wiki.d pages) is perfect.
Now my situation is the following: everything works as expected, except for three characters that are not recognized by TE and by Gloss:

  • Latin: à (italic) as in sarà (italian);
  • Hebrew: נ (= n/N) as in פני;
  • Greek: Π (capital, = P) as in ΠΑΝΤΩΝ.

What I do not understand is that without any markup they are normally displayed in wikipages but inside Gloss markup (I mean here{...} ) they are not recognized (e.g. ΑΝΤΩΝ�) and TE (during the search) considers them word-breaks (e.g. � + ΑΝΤΩΝ) ... however in the TE result they are normally displayed.
This does not prevent me from finishing the site... it just makes me crazy not to understand why it happens! I just hope it's a Mac problem!
P.S. Thank you guys for your efforts, I appreciate it a lot.

Frank Jun 10, 2017, at 02:00 PM

I cannot replicate these errors at all, sorry! As said above, I have some misgivings about the gloss markup recipe, as it seems to corrupt other markup. For instance I just discovered that single or double backslashes at line end are not recognised as markup for joining lines or putting in line breaks, so lines can stay within a paragraph. - In TE results with markup=on and unit=para a {gloss} markup will only be displayed as gloss if the paragraph contains the gloss (:gloss...:) directives. I wish that recipe would use a less aggressive markup (it does not even use a markup rule), like other inline markup rules, so one could dispense with the use of (:gloss:) and (:glossend:) altogether. - HansB June 12, 2017, at 03:23 AM

I built a simple custom glosses-based inline markup (strictly adapted to my needs) that avoids this problem. In this way all the characters are readable in wiki pages, and with TE the results are displayed not just with unit=para. But for the recognition of those three particular characters (see above) I will have to be patient. Thanks again.

Frank Jun 12, 2017, at 01:48 PM

Use as a smart include

Is it possible to use this as a smart include, e.g. I want to include all individual lines matching a pattern in another page?

simon May 28, 2016, at 05:25 PM

HansB: perhaps use it as a markup expression.


Output to different page

Is there an equivalent of the search box target=group.page parameter?

simon June 09, 2015, at 06:54 PM

Added in latest update. HansB


Output of extra line or sentence

Using unit=line is it possible to display the line the match is found on, and the following line?

simon May 11, 2015, at 09:57 PM

I do not see a simple solution to do this, as the script processes each line in turn. HansB June 05, 2015, at 08:23 AM

OK, thanks
simon June 05, 2015, at 05:16 PM

I think that what I am looking for is a variation on unit=para, say unit=cell or unit=cellnr, as when I use unit=para the entire table show up. I.e. unit=cell would be unit=para plus terminated by the next cell, cellnr, or tableend. Similarly for unit=cellnr except cell would not be a terminator of the display.

simon June 09, 2015, at 06:48 PM

UPDATE: With the latest update I added a unit=sent and unit=dsent and unit=dline option. With unit=dsent results are shown in a single sentence, plus the following sentence, if in the paragraph (no empty line between). Similar with unit=dline for a two line display. Not sure it will help you, but I find it a great improvement in cutting down single result size, while giving meaningful results. HansB

PS: a search text extract based on table cells would be really difficult, I think. There are also different kinds of table markup to look for. Apart from the difficulty that most content is not in table cells. I hope the unit=sent option may work instead, as it would use a sentence, in a cell or not, disregarding the line. But it depends perhaps how text content is structured within a table cell. HansB
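The idea behind unit=dsent (the matching sentence plus the following one, within the same paragraph) can be sketched as follows. This is an illustrative Python version with a naive sentence splitter and hypothetical function names; the recipe's own sentence detection may well differ.

```python
import re

def split_sentences(paragraph):
    """Naive split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]

def extract_dsent(paragraph, term):
    """Return each matching sentence together with the sentence that follows it."""
    sentences = split_sentences(paragraph)
    results = []
    for i, sentence in enumerate(sentences):
        if term.lower() in sentence.lower():
            results.append(" ".join(sentences[i:i + 2]))
    return results

paragraph = "The cat sat. It purred loudly. Then it left."
print(extract_dsent(paragraph, "cat"))
```

A match in the last sentence of a paragraph yields that sentence alone, since nothing follows it in the same paragraph.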

This was exactly what I was looking for. Thanks very much for the two changes above. simon July 10, 2015, at 04:11 AM

Searching without search term (page search)

The PmWiki search recognises name=pagename, or group=groupname. I use the Text Extract search as a substitute for it. It seems to me it should behave in the same way as the PmWiki search.

simon June 05, 2015, at 05:16 PM

UPDATE: A TextExtract search using a custom search form, the (:extract ...:) search form, or simply fmt=extract with the PmWiki search form will use the same kind of parameters to narrow down the search to specific pages. The latest release will provide a default pagelist when no search terms are supplied, but without the statistics of how many pages were searched and found. I could not manage to get the statistics to show, but it may be better than just the error message that no search term is supplied.

Also: if you enter Group/ it will return a list of pages for that group. If you enter Group/ searchterm it will look for searchterm inside the pages of Group.

HansB June 10, 2015, at 05:19 AM


Output formatted as a table?

I'm trying to list selected items from tables on different pages in one new table using the extract markup. My new table consists of one row per extract statement, each containing the lines that are the extract result. This works very well, but due to the processing sequence, the extract results in each row are an encapsulated table by themselves so there are no continuous columns spanning all rows. (I'd prefer continuous columns as this would make it simple to copy & paste the relevant parts of the listed information, in this case e-mail addresses.) I've tried generating the list with a single extract statement, which is not straightforward as there is no clear pattern in the to-be-searched group names, but this does not yield continuous columns either. Is there a way to get continuous columns (or to isolate the e-mail address from the source line though the search term is somewhere else in the same line)? --Henning February 03, 2010, at 10:31 AM

I don't see how you can get output formatted as one table. But perhaps the last is possible: use the snip= parameter to remove unwanted parts of the result line. - HansB February 03, 2010, at 11:56 AM

Thanks, your answer showed me the way. I use snip=.*mailto.|\| now as the mailto: markup itself is not vital to me and the email address is the rightmost column in the source tables. If the expression looks clumsy, it's because it's the first time I give regular expressions a try :-) --Henning February 03, 2010, at 12:36 PM
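What the pattern .*mailto.|\| removes can be checked outside the wiki. The sketch below assumes that snip= deletes every regex match from the result line, which is how the parameter is described above. The table row, the address, and the function name are made up for illustration.

```python
import re

def apply_snip(line, snip_pattern):
    """Delete every match of the pattern from the line and trim whitespace."""
    return re.sub(snip_pattern, "", line).strip()

# Hypothetical source line from a simple table, with the address in the last column.
row = "||Henning ||Sales ||mailto:henning@example.org ||"
print(apply_snip(row, r".*mailto.|\|"))
```

The first alternative greedily removes everything up to and including "mailto" and the character after it; the second removes the remaining table pipes, leaving only the address.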


Sorting output by page age

Given that I want to extract content from several news pages, like

{(extract '.' News.* -News.RecentChanges -News.Uebersicht -News.*-Talk 
prefix=link lines=3 cut='!')}

Is there a way to sort the page processing by the age of the pages (e.g. last edit or so), so that the latest news is displayed on top? Ingo vonBorstel September 09, 2008, at 10:26 AM

HansB: I think you could install PowerTools and use the (pagelist ...) with normal PageLists#pagelistorder sort order parameters to provide the list of pages inside the (extract ..) expression, for instance like this:

{(extract . (pagelist name=News.*,-News.RecentChanges,-News.Uebersicht,-News.*-Talk
order=-ctime) prefix=link lines=3 cut='!')}

Talk page for the TextExtract recipe (users).