00955: Automatic anchors for document sections

Summary: Automatic anchors for document sections
Created: 2007-07-20 22:45
Status:
Category: Feature
From: HaganFox
Assigned:
Priority: 544441
Version: 2.2
OS: All

Description: Jon Abott started a thread on the list[1] regarding automatic anchors:

Is there a simple way to configure pmWiki such that all headings (H1,
H2, etc) are automatically generated with anchor tags so people can link
directly to a section or subsection?  (I believe MediaWiki has this
functionality by default.)

Wikipedia does indeed have this feature (related to a TOC capability, perhaps?), and it would be nice to see it in PmWiki.

The section's anchor appears immediately preceding the heading. It has the same text as the heading with some character substitutions like these:

  [space] => _
  ( => .28
  ) => .29

Essentially, this markup

!! Heading Name

would produce output similar to what this markup produces now:

[[#Heading_name]]
!! Heading Name

--Hagan

[1] http://pmichaud.com/pipermail/pmwiki-users/2007-July/044949.html


Notes:

The recipe Cookbook:PageTableOfContents does create an anchor at each heading it reports on, but

  1. it only reports two levels deep,
  2. it only reports on (and therefore only creates anchors for) headings from its insertion point down, and
  3. the anchors it creates are simply numbered sequentially, top-down, which means that any given heading's anchor will change when a new heading is added above it.

This sounds like a useful feature. However, I don't use anchors very often but I do use h1 and h2 headings quite often. If this becomes a core feature, I vote for a setting in config.php that would allow users to turn this feature off. --Ian MacGregor


Questions from Pm:

  • What if the heading contains other markups, such as links or wikistyles? Do we have to be smart enough to strip those somehow before producing the heading anchor?
  • If a heading already has an anchor in it, should PmWiki still generate yet another anchor for the heading?
  • Do we convert all punctuation that appears in the heading, or just parens?

Pm


Suggested answers: yes; yes; and as follows (not knowing what MediaWiki does).

We presumably want to generate a usable anchor to link to. The simplest way may be to allow the same heading text (sans markup) to be used in the anchor and have PmWiki transliterate it.

Note that in a few instances this may not quite work.

! My (example) heading "with" punc-uation; etc! But 'not', markup

might generate the anchor (details to be decided)

[[#My_example_heading_with_punc-uation_etc_But_not_markup]]

and could be used as follows

[[#My (example) heading "with" punc-uation; etc! But 'not', markup | link to heading]]

Obviously the vertical bar would not work, and how to handle markup such as

! [[PageName | +

needs to be considered
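
A minimal sketch of the transliteration idea above, assuming that punctuation other than hyphens is simply dropped and that runs of whitespace become a single underscore. The function name transliterate_anchor() is hypothetical, not an existing PmWiki function. Applying the same function to the heading text and to the link target would make the two agree.

```php
<?php
// Hypothetical helper: turn heading text (markup already removed) into an anchor name.
function transliterate_anchor(string $text): string
{
    // Drop everything except word characters, whitespace and hyphens.
    $text = preg_replace('/[^\w\s-]/', '', $text);
    // Collapse each run of whitespace into one underscore.
    return preg_replace('/\s+/', '_', trim($text));
}

$heading = "My (example) heading \"with\" punc-uation; etc! But 'not', markup";
echo transliterate_anchor($heading), "\n";
// prints My_example_heading_with_punc-uation_etc_But_not_markup
```

The sketch does nothing about the vertical bar or about markup embedded in the heading; those would have to be stripped before this step.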


Hagan's answers to Pm's questions:

  • What if the heading contains other markups, such as links or wikistyles? Do we have to be smart enough to strip those somehow before producing the heading anchor?

I'd say yes. Hopefully this complication is not a deal-killer.

  • If a heading already has an anchor in it, should PmWiki still generate yet another anchor for the heading?

I'd say yes because it is consistent. As an author, if I see a "See also" heading in a page I'll know there's a #See_also anchor that I can use to link to that section.

  • Do we convert all punctuation that appears in the heading, or just parens?

My hunch is that we need to convert all or none. Here's my answer on pmwiki-users:

Any character not allowed in an anchor tag, I suppose.  The example on
the PITS page is just something I discovered when I was looking at how
it was done elsewhere.

Stripping out disallowed characters may be adequate.

More feedback...

As noted by Dominique Faure on the pmwiki-users list, another consideration is whether or not to let a heading start with a numeral (since the specification is something like "alphanumeric, starting with a letter"). The Wikipedia link to the engine in my car demonstrates that they don't worry about it there.

Yet another consideration is what to do about duplicate headings that occur within a page.

Personally, I don't like the idea of enumerating the links so they change when another heading is inserted. That makes the link much less useful, except perhaps for a dynamic table of contents.

FWIW, if the link will not be similar to the heading text I'd rather use something like the md5() function, so a heading of "See also" would be linked with [[#h611114a3a55940e855fb96b973f897fc]] (the result of " 'h'.md5('See also') " in PHP).

Update: Using the crc32() function would result in shorter links. This test script

<?php
$foo = 'See also';
echo '<pre>';
echo 'crc32: h'.crc32($foo)."\n";
echo 'md5: h'.md5($foo)."\n";
echo 'sha1: h'.sha1($foo)."\n";
echo '</pre>';

produces this output

crc32: h-441119427
md5: h611114a3a55940e855fb96b973f897fc
sha1: h2d8243a2c0e464492c9d563c4f92c56ae3421bcc

--Hagan
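
The minus sign in the crc32 line above is presumably the signed result that crc32() returns on 32-bit platforms. A sketch of one way to avoid it, assuming a hex anchor is acceptable: hash('crc32b', ...) computes the same checksum as crc32() but always returns eight hex digits. The function name hashed_anchor() is hypothetical.

```php
<?php
// Hypothetical helper: a short anchor that depends only on the heading text,
// not on the heading's position in the page.
function hashed_anchor(string $heading): string
{
    // Same checksum as crc32(), formatted as 8 lowercase hex digits (no sign).
    return 'h' . hash('crc32b', $heading);
}

echo hashed_anchor('See also'), "\n";
```

The leading 'h' keeps the anchor starting with a letter. As with md5(), changing even the capitalization of the heading changes the anchor.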


I posted this suggestion to the list, with a correction by dominique.faure@gmail.com:

> I think this is an excellent suggestion. But there is a way to get
> around having "fidgety numbers". Perhaps it could organize by heading
> intensity (H1-H6) and its number in a hierarchy, such as:
>
> [[#h123]] = The 3rd H3 in the 2nd H2 in the 1st H1.
> [[#h601]] = The 1st H3 in the 6th H1. (the 0 is for a missing H2).
It seems that most people are interested in the heading titles being used, but I propose a variable switch for perhaps a few different ordering systems.
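
The hierarchical numbering quoted above could be sketched roughly as follows, assuming the page's headings are available as a list of levels (1 to 6) in document order. The function name hierarchical_anchors() is hypothetical.

```php
<?php
// Hypothetical helper: one anchor per heading, built from per-level counters.
function hierarchical_anchors(array $levels): array
{
    $counters = array_fill(1, 6, 0);   // one counter for each of H1..H6
    $anchors = [];
    foreach ($levels as $level) {
        $counters[$level]++;
        // A new heading restarts the numbering of every deeper level.
        for ($deeper = $level + 1; $deeper <= 6; $deeper++) {
            $counters[$deeper] = 0;
        }
        // Join the counters from H1 down to this heading's level.
        $anchors[] = 'h' . implode('', array_slice($counters, 0, $level));
    }
    return $anchors;
}

print_r(hierarchical_anchors([1, 2, 2, 3, 3, 3]));
// the last entry is h123: the 3rd H3 in the 2nd H2 in the 1st H1
```

Two caveats with this sketch: a counter of 10 or more makes the digits ambiguous, and inserting a heading still renumbers the later siblings at the same level.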

I'd like to work on this project, but I haven't dug deep enough into PmWiki code yet. I'm going to go dig deeper in the site and my installation to learn what I can about implementation. -Mike


John Rankin wrote this in a post to the pmwiki-users list:

On Thursday, 26 July 2007 11:47 AM, H. Fox <haganfox@users.sourceforge.net> wrote:
>On 26 Jul 2007 10:34:35 +1200, John Rankin <john.rankin@affinity.co.nz> wrote:
>> Turning heading text into an anchor guaranteed to be valid would also need
>> a bit of care; eg
>>
>> !!!Christian Ridderström
>>
>> would need to turn the ö into something else,
>
>Maybe there's some regular-expression magic that can do that.

The trick I used in the citations recipe was to pass the text
through the htmlentities function and remove the & and ; from
the result.
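
A minimal sketch of that trick, assuming UTF-8 input. The function name entity_anchor() is hypothetical and this is not the citations recipe's actual code.

```php
<?php
// Hypothetical helper: replace non-ASCII letters by their HTML entity names.
function entity_anchor(string $text): string
{
    // "ö" becomes "&ouml;" ...
    $encoded = htmlentities($text, ENT_QUOTES, 'UTF-8');
    // ... and removing "&" and ";" leaves the plain letters "ouml".
    return str_replace(['&', ';'], '', $encoded);
}

echo entity_anchor("Christian Ridderstr\u{00F6}m"), "\n";
// prints Christian Ridderstroumlm
```

Spaces are left untouched here, so this would be only one step in producing a valid anchor.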

Petko Yotov wrote a descriptive post to the list, including this description of How MediaWiki Does It:

MediaWiki [1] has this capability from version 1.8 [2], one can link to a section like this:

[[Wiki#Editing wiki pages|Click here]] will link to:
http://en.wikipedia.org/wiki/Wiki#Editing_wiki_pages

where there is a heading "==Editing wiki pages==".

The conversion algorithm heading->anchor_id is quite simple:

  • wiki to plain text (all links and styles removed)
  • whitespace trimmed ("= Title =" and "=Title=" are the same)
  • in case there is a repeated heading in the same page (including "included" pages), a _2 or _3 etc. is added;
  • spaces replaced with underscores;
  • the string is "urlencoded";
  • "%"-signs are replaced with dots.
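
The steps listed above can be sketched as follows. This is an illustration of the list, not MediaWiki's actual code: the function name heading_to_anchor() is hypothetical, and the markup removal is a crude assumption that only handles [[target|text]] links and ''/''' quotes.

```php
<?php
// Hypothetical helper; $seen remembers the headings already used on this page.
function heading_to_anchor(string $heading, array &$seen): string
{
    // Wiki to plain text: keep a link's visible text, drop quote markup.
    $text = preg_replace('/\[\[(?:[^\]|]*\|)?([^\]]*)\]\]/', '$1', $heading);
    $text = str_replace(["'''", "''"], '', $text);
    // Trim whitespace, so "= Title =" and "=Title=" give the same result.
    $text = trim($text);
    // A repeated heading gets _2, _3, etc.
    $count = $seen[$text] = ($seen[$text] ?? 0) + 1;
    if ($count > 1) {
        $text .= "_$count";
    }
    // Spaces to underscores, urlencode, then "%" to ".".
    $text = str_replace(' ', '_', $text);
    return str_replace('%', '.', urlencode($text));
}

$seen = [];
echo heading_to_anchor('Editing wiki pages', $seen), "\n"; // Editing_wiki_pages
echo heading_to_anchor('Intro (draft)', $seen), "\n";      // Intro_.28draft.29
```

The second result shows where the ".28" and ".29" substitutions for parentheses come from.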

This produces HTML that validates and is very easy for writers (PmWiki philosophy n°1). The conversion function also works in other languages, that is, with UTF-8 encoding: while the result is not very pretty in the HTML source for non-Latin languages, it works both with the page table of contents and with links from other pages.

If the section heading changes position, links to it will still work (unlike when all anchor_ids are id1, id2...). If the section heading disappears or changes, the anchor_id also changes, and links will then lead to the top of the page (least surprise).

I would add some items to the above list...

  • use strtolower() so changing capitalization won't affect an existing anchor.
  • prior to "urlencoding", strip certain characters out. Which characters? We'd need to come up with a list. (Maybe it should be locally customizable, too.) Here's a starter list:
    • ","
    • "("
    • ")"

Both of these have the advantage of making the anchor name easier to guess without looking at the page source. The second one (stripping certain characters) also mitigates having a heading start with a non-alphanumeric character.

The list becomes

  • wiki to plain text (all links and styles removed)
  • whitespace trimmed ("= Title =" and "=Title=" are the same)
  • alpha characters to lower case
  • certain special characters stripped (e.g. ",()" and others TBD)
  • in case there is a repeated heading in the same page (including "included" pages), a _2 or _3 etc. is added;
  • spaces replaced with underscores;
  • the string is "urlencoded";
  • "%"-signs are replaced with dots.

...or something like that. ;-)

--Hagan
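
A sketch of the extended list, assuming the heading has already been reduced to plain text. The function name lowercase_anchor() and its $strip parameter (standing in for the locally customizable character list) are hypothetical.

```php
<?php
// Hypothetical helper; $seen remembers the normalized headings used so far.
function lowercase_anchor(string $plain, array &$seen, string $strip = ',()'): string
{
    // Trim, then lower-case so capitalization changes keep the anchor stable.
    $text = strtolower(trim($plain));
    // Strip the configured special characters.
    $text = str_replace(str_split($strip), '', $text);
    // A repeated heading gets _2, _3, etc.
    $count = $seen[$text] = ($seen[$text] ?? 0) + 1;
    if ($count > 1) {
        $text .= "_$count";
    }
    // Spaces to underscores, urlencode, then "%" to ".".
    $text = str_replace(' ', '_', $text);
    return str_replace('%', '.', urlencode($text));
}

$seen = [];
echo lowercase_anchor('My (example) heading, etc', $seen), "\n"; // my_example_heading_etc
```

Because duplicates are counted after lower-casing, "See Also" and "See also" on one page would become see_also and see_also_2.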


Martin Fick wrote this in a post to the pmwiki-users list:

Perhaps the link code could be extended to look for
existing/non-existing anchors in pages and display
links to non-existing anchors slightly differently
making misspelled anchors more obvious?

That would certainly be author-friendly. --Hagan
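
One way this check might look, as a sketch only: it assumes the target page's raw markup is at hand and looks just for explicit [[#name]] anchors. The function names and the CSS class names are hypothetical.

```php
<?php
// Hypothetical helper: list the [[#name]] anchors defined in raw page markup.
function page_anchors(string $markup): array
{
    preg_match_all('/\[\[#([A-Za-z][-.:\w]*)\]\]/', $markup, $m);
    return array_unique($m[1]);
}

// Hypothetical helper: choose a CSS class so links to missing anchors stand out.
function anchor_link_class(string $markup, string $anchor): string
{
    return in_array($anchor, page_anchors($markup), true)
        ? 'anchorlink'
        : 'missinganchor';
}

$page = "[[#intro]]\n!! Intro\n[[#see_also]]\n!! See also";
echo anchor_link_class($page, 'see_also'), "\n"; // anchorlink
echo anchor_link_class($page, 'see_aslo'), "\n"; // missinganchor
```

With automatic heading anchors, the list of known anchors would also have to include the names generated from the headings themselves.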


Mike Shanley wrote this in a post to the pmwiki-users list:

Would abbreviations be a good option for people? At least in the test
page below, not one abbreviation was repeated, and if it were, the _2
and _3 would fix it right up. This would take care of many of the human
errors we've been talking about, though in pages with (way too many)
headings, it might add some new ones.

 #wiki_style_basics = #wsb
 #scopes = #s
 #wikistyle_attributes = #wa
 #applying_wikistyles_to_block = #awtb
 #enabling_Styles  = #es
 #custom_style_shortcuts = #css
 #predefined_style_shortcuts = #pss
 #Examples = #e
 #known_issues = #ki
 #see_also = #sa

Also, how can we use this automatic anchoring to create quick TOCs?
Generally speaking, an (:include function that stripped everything but
headers, or a (:pagelist toc= that tracked headers instead of trail=
tracking bullets would also work... My thought on this is that as long
as we are talking about anchors across an entire wiki, we should also
provide a way to just as easily index these anchors. Right?
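
The abbreviation scheme in the table above could be sketched like this, assuming the abbreviation is the first letter of each underscore-separated word and that clashes get the _2, _3 suffix. The function name abbreviate_anchor() is hypothetical.

```php
<?php
// Hypothetical helper; $seen remembers the abbreviations used so far.
function abbreviate_anchor(string $anchor, array &$seen): string
{
    $abbr = '';
    foreach (explode('_', strtolower($anchor)) as $word) {
        if ($word !== '') {
            $abbr .= $word[0];   // first letter of each word
        }
    }
    // A repeated abbreviation gets _2, _3, etc.
    $count = $seen[$abbr] = ($seen[$abbr] ?? 0) + 1;
    return $count > 1 ? "{$abbr}_$count" : $abbr;
}

$seen = [];
echo abbreviate_anchor('wiki_style_basics', $seen), "\n"; // wsb
echo abbreviate_anchor('enabling_Styles', $seen), "\n";   // es
```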

I agree that this feature is naturally related to a TOC feature. --Hagan

See also PITS.00027 --Simon


Purely "dabble-ware" but maybe does what people are looking for...? Put this somewhere in config.php:

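include_once("$FarmD/scripts/stdmarkup.php");
DisableMarkup("^!");
# Replace each heading line through a callback (PHP 7+ has no /e modifier).
Markup('^!#', 'block', '/^(!{1,6})\\s?(.*)$/', 'HeaderWithAnchor');
function HeaderWithAnchor($m)
{
	$level = strlen($m[1]);
	return "<:block,1><h$level>".Header2Anchor($m[2]).$m[2]."</h$level>";
}
function Header2Anchor($text)
{
	# Strip wikistyles, directives and links, then trim and lower-case.
	$text = strtolower(trim(preg_replace(array('/%.*?%/', '/\(:.*?:\)/', '/\[\[.*?\]\]/'), '', $text)));
	# Make sure the anchor starts with a letter (this also covers empty text).
	if (!preg_match('/^[a-z]/', $text))
		$text = 'h'.$text;
	$text = preg_replace('/[^\w]/', '_', $text);
	$text = preg_replace('/__+/', '_', $text);
	# Append _1, _2, ... until the name is unused on this page.
	for ($i=0, $sfx=''; TrackAnchors($text.$sfx); $i++, $sfx="_$i");
	return Keep("<a name='$text$sfx' id='$text$sfx'></a>", 'L');
}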

(We could put a call to FmtPagename() in there, but to be safe we should wait on this PITS entry to be implemented before we do that.)

If it's helpful I can put it into a cookbook recipe...

--Peter Bowers July 16, 2010, at 05:03 PM