EProtect-Talk

Summary: Discussion of EProtect
Maintainer:
Users: +2 -1

This space is for User-contributed commentary and notes. Please include your name and a date along with your comment.

  • I recommend adding a noscript tag after each script tag (this occurs twice in the function eProtectDecode). Otherwise people who have JavaScript deactivated won't know that there is something they can't see.
    E.g. <noscript>Please activate JavaScript to view the email address</noscript>
    Thank you! 29.04.2006
  • Install error. Any suggestions? "Fatal error: Call to undefined function: markup() in /usr/local/<full_path>/httpdocs/wiki/local/scripts/e-protect.php on line 66" - Ah, I'm running PmWiki 1.x. I'll remove this comment soon.
  • I can see the "encrypted" mail address in my browser, and it also displays on mouse-over!
    • This is correct and intended. If JavaScript is turned on in your browser, it will replace the protected addresses with cleartext when the page is loaded. The HTML source still shows just a garbled address and the JavaScript code to un-garble it; that's all that a mail harvester will see.
  • 17-Feb-2005: If I check a wiki that includes the PmWiki 2 e-protect at http://validator.w3.org, the JavaScript in $HTMLHeaderFmt['eProtect'] causes the check to fail. @Klonk, do you think it's possible to get a "valid" e-protect so that the whole wiki passes the check? newmy
    • 18-Feb-2005: Fixed. I just forgot to put the JavaScript inside a comment; now it's valid XHTML. - Klonk
  • Many spammers don't bother parsing the protocol (i.e. the 'mailto:' part); they just harvest strings like \w+@\w+\(.\w+)+. Could you replace the @ with a random string like [_on_|_from_|_de_|_a_|_of_]? (The regexp in my comment is fake, just to illustrate the point :) -Radu
    • Hmm, I'll think about that, as it would require a total rewrite of the encoding (currently ROT13) and the decoding. I think it would be better to simply create a new variant of ROT13 that also covers the characters ':', '.' and '@', and additionally replaces the '@' character (maybe with '#'). But then I would have to release a new version of the script, as backward compatibility would not be possible. - Klonk
    • The spammer will simply get a wrzlbrmft@pfrtlboing.dlu address, which isn't of much use to him. Joachim Durchholz
  • I think this recipe should be treated as a base recipe to which everyone using it adds some creativity of their own, because as soon as a recipe is published it becomes known and gets used. Have a look at http://aspirine.org/cgi-bin/trouvemail.pl: they implement the basic and complex mechanisms used by spammers to decode web pages, and you will see that ROT13 and also emails with # are parsed easily.
    Also, any email with a mailto: rendered by a browser could be parsed. Isidor
    • Thanks for that link. I can imagine that there is always a way to retrieve the email address. As you said, there are several ways to hide it. I also thought of using e.g. SHA-1 as encryption. But you have to add some JavaScript code so that the normal viewer can see the email address, so a parser that evaluates the JavaScript will always get the address too. It is a very difficult topic, and a hard trade-off between ease of use and the effort spent on hiding the email address. I think you should use two steps: first, "hide" the email on the website (more or less effectively), and second, use an email client with integrated spam filters, or simply use a mail form. -Klonk
      • Using SHA-1 won't work, because it is a one-way digest: it destroys the original text with no possibility of decrypting it later (recovering an address would require an exhaustive search for a clear-text string that produces the same SHA-1 digest). This holds at least for now; even if SHA-1 were broken badly enough to make such a search feasible, it would need a complex algorithm that the JavaScript used in EProtect could not run in a reasonable time to produce a working link. In addition, SHA-1 output is limited to a fixed number of bits (160), which may not be enough to hold all the significant bits of a possibly long email address (so collisions are unavoidable, even if they are still computationally unpredictable).
      • What you really need is a reversible encryption, but then you need to store the decryption key somewhere in the generated HTML page, along with the encrypted email address, in a way that spammers will not easily recognize. The best solution is then to personalize the provided JavaScript (which runs in the browsers visiting the wiki), together with the wiki's PHP code that computes the encryption on the server, so that your encryption method differs from the one used in other similar wikis. But make sure that the encryption implemented in PHP can actually be reversed by the JavaScript that browsers will execute.
      • Note that there are now spammers using indexing robots that can execute the JavaScript embedded in your HTML page within their own sandbox, in order to reveal what actual browsers would compute. You could break these robots with some advanced JavaScript code (for example, code that extracts information hidden in various parts of the HTML page using DOM) that won't work reliably outside of a true browser (but then your page will have compatibility problems with browsers that implement only a subset of JavaScript).
      • The only way to protect against such smart active bots is an encryption that requires authentication. Another way would be for wikis not to reveal any email address at all, not even with this method, unless the visitor is authenticated on the wiki site. Or the wiki could generate a link to a protection page that it hosts, where the visitor must decipher a visual cryptogram and give their own email address in order to receive the requested information, i.e. the email address that was inserted in the wiki.
  • Good points, Isidor. Then how about making the email link href='#' onClick="mailIt(encoded string);return false;", with 'mailIt' being a configurable or random string, and the respective function declared in the head of the page (in the template probably) opening a href="mailto:decoded string" -Radu March 15, 2005, at 03:56 PM
  • What about encoding the email in such a way that only a part is encoded or e.g. simply a ".com" is added at the end? - Klonk
  • What about a script that finds any e-mail address (not just mailto's) and converts @'s to &'s? Is that possible? Appleton
    • Yes, make it a markup. I.e. edit config.php and add something like Markup('hide-@', '>links', '/@/', '&'); to it. That means: "rule is named 'hide-@'" (important if other rules reference it), "it will be executed after links processing", "find all @ signs" (the pattern is a regular expression, so it needs its delimiters), "replace them with '&'". Note that the second parameter determines when the rule is executed, and that this rule rather indiscriminately replaces all @ signs, whether they are part of an email address or not. It might be advisable to run the rule even later, say at '>block' or even '<_end'. JoachimDurchholz
      • That's an important feature to add, because many bots can now easily recognize ROT13-encoded email addresses if they find a matching '@' in the page. In fact, I think the encrypted email address should use at least two separate methods, one for the user part and one for the domain name part (in addition to replacing the '@' character).
      • With the current method, keeping the '@' can also produce strings that look like actual email addresses. It's a good idea not to use the syntax of a valid email address in the encrypted form. Don't give spammers any hint about where email addresses might be in your generated HTML pages.
      • As an improvement, EProtect should be able to generate output in two separate parts. One part contains the encoded email addresses (marked, for example, with a specific CSS class with an unpredictable name, possibly used only once per HTML page) and is present at each place in the HTML page where an email address should appear (if JavaScript is enabled in the visitor's browser). The other part (for example in the header of the page, and executed after the page is completely loaded) contains the JavaScript code and decryption key needed to decrypt the email addresses present in the page (found by their specific CSS class names using DOM), and then replaces the encoded addresses with the actual ones. With this method, no "script" or "noscript" tag is needed at the location of the encrypted email; a simple span containing some fixed text, marked with the HTML class that the JavaScript will search for and decrypt, is enough.
      • For example, suppose the wiki text contains the email address "user@example.com". When EProtect detects that address, it replaces it with something like <span class="eprotect">javascript_needed:SD1SxySDVCsdsTxed45de</span>. This is the string that visitors (or spammers) will "see" without JavaScript enabled.
      • Then, elsewhere on the page, the JavaScript executes and finds (using DOM) every element in the page body marked with class "eprotect". This class name should be personalized at least per site and be changeable at any time by wiki administrators; it could also change dynamically, built from some short random text plus prefixes and suffixes that avoid collisions with other CSS class names needed for the HTML rendering. The script then drops the "javascript_needed:" prefix from the content of each element enumerated by DOM (this prefix can be personalized too, but it should give visitors a hint that JavaScript is needed, and it could also be translated), decrypts the rest, and replaces the whole content of the element with the decrypted email address.
      • This way, the decryption code and the decryption key can be physically separated in the HTML page from the locations where an actual email address is present. This will resist almost all indexing bots used by spammers, except those that can interpret and run the JavaScript in their own local sandbox (emulating a browser). The JavaScript could also optionally perform some safety checks to see whether it actually runs in a true browser (but be prepared to receive complaints from users whose specific browsers or security add-ons disable some browser features).
      • One of these security checks could be an HTTP request to the server to fetch the decryption key separately, associated with the wiki page or with some other distinctive data present in the encrypted data. The wiki server could also perform the decryption itself, instead of the visitor's browser decrypting the content, which allows advanced security checks such as validating the HTTP session against the current session cookie; but then the encrypted email addresses won't be decrypted if the web page is viewed offline. This approach could be very powerful against spammer bots: if they emulate browsers, they will need to perform repeated requests to the server, one for each encrypted email present in each page of the wiki they are trying to index, and they will leave tracks in the server logs! The wiki server could strictly limit the number of decryption requests it accepts (so that legitimate browser users are not harmed, but the server refuses to decrypt many email addresses at high speed).
      • The JavaScript could also limit the speed at which it decrypts multiple emails in the same page: it decrypts one address at a time and pauses for some 15-20 milliseconds (by retriggering another event) before seeking the next encrypted email in the web page. This keeps legitimate browsers safely under the rate limit imposed by the server. Time-based JavaScript events are a powerful way to detect whether an actual browser (implementing JavaScript correctly) is used. For legitimate users, the only impact is that they need to wait a little after the page is loaded before the encrypted email addresses are decrypted! But for malicious indexing robots, having to depend on time and simulate it would be far too complicated or not very productive (severely limiting their performance when trying to collect as many email addresses as possible) if they need to respect time constraints!
      • Another possibility is not to decrypt any email address until there's a mouse hover event on the element containing the encrypted address: when this occurs, only this email address is decrypted, and all the other decrypted addresses are re-encrypted. This way, it's not even possible for a legitimate user to print the web page and use scanner software to get the displayed list of users. It should work correctly for visually impaired users, because the email address will be decrypted in their Braille readers as soon as they focus the reader on it.
  • I have found that this script broke the signature markup [[~Signature | Name]]. Before the script is installed, PmWiki looks for a file Profile/Signature. With this script, it looks for a file CurrentGroup/Signature.
    • Could you give a link so that we can check? Joachim Durchholz May 01, 2005, at 04:43 AM
  • This tests two links, one being e-protected (text, test2 -> mailto:info [snail] pmwiki [period] org). Why doesn't the -> notation for links work in this case? Hm, in this wiki e-protect does not seem to be active. On my site (currently running beta v51) I get both links treated as a single one (but not if I use the | notation), plus some extra spaces around the links.
  • Why does the script only hide [[mailto:whatever@example.com]] and not mailto:whatever@example.com?
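
A minimal sketch of the span-based scheme proposed in the comments above, combined with the suggestion to replace the '@' character. This is not the code EProtect ships: the names CLASS_NAME, PREFIX, AT_TOKEN, encodeAddress, decodeAddress and revealAll are made up for illustration, and plain ROT13 stands in for whatever personalized encoding a site would really use. In a real recipe the encoding step would run in PHP on the server; it is written in JavaScript here only so that both halves can be read side by side.

```javascript
// All names below are hypothetical; none of them come from EProtect itself.
const CLASS_NAME = 'eprotect';        // should be personalized per site
const PREFIX = 'javascript_needed:';  // hint shown while JavaScript is off
const AT_TOKEN = '#';                 // stands in for '@' in the encoded form

// ROT13 over ASCII letters only; applying it twice restores the input.
function rot13(text) {
  return text.replace(/[a-zA-Z]/g, (ch) => {
    const base = ch <= 'Z' ? 65 : 97;
    return String.fromCharCode(((ch.charCodeAt(0) - base + 13) % 26) + base);
  });
}

// Encoding step (server side in a real recipe): the result contains no '@'.
function encodeAddress(address) {
  return PREFIX + rot13(address).split('@').join(AT_TOKEN);
}

// Decoding step: drop the prefix, restore '@', undo ROT13.
// Limitation of this sketch: an address that itself contains '#' would not
// survive the round trip.
function decodeAddress(content) {
  const body = content.startsWith(PREFIX) ? content.slice(PREFIX.length) : content;
  return rot13(body.split(AT_TOKEN).join('@'));
}

// Browser step: replace the content of every marked element with the address.
function revealAll(doc) {
  for (const el of Array.from(doc.getElementsByClassName(CLASS_NAME))) {
    el.textContent = decodeAddress(el.textContent);
  }
}

// Only runs in a browser; in other environments the functions are just defined.
if (typeof document !== 'undefined') {
  document.addEventListener('DOMContentLoaded', () => revealAll(document));
}
```

With these assumptions, the page source would carry <span class="eprotect">javascript_needed:hfre#rknzcyr.pbz</span> for user@example.com, and the script in the page header turns it back into the address once the page has loaded. As noted in the discussion, a bot that runs the JavaScript in a sandbox still gets the address, so this only raises the effort required.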

The recipe as it is excludes blind people from using PmWiki. Of course, if a screen reader program as used by the blind can read the mail address, then so can a mail harvesting robot.
One possible way out that I see is adding a link that leads to a "contact" page that will accept a subject line and a text (the mail address would be implicit - else we'd be offering a mail gateway to arbitrary people).
The implicit assumption here is that a spammer isn't interested in an HTML page that allows him to send a mail to just a single address.
Joachim Durchholz March 25, 2005, at 12:01 PM
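
A rough sketch of what "the mail address would be implicit" could mean in code, assuming a server-side JavaScript handler. CONTACT_ADDRESS and buildContactMail are hypothetical names, and the actual sending of the mail is left out; the point is only that the recipient never comes from the submitted form.

```javascript
// Hypothetical names; this is an illustration, not part of any recipe.
const CONTACT_ADDRESS = 'user@example.com'; // known only to the server

// Build the message from the two fields the contact page accepts.
function buildContactMail(form) {
  // A subject must stay on one line, otherwise a visitor could smuggle in
  // extra mail headers.
  const subject = String(form.subject || '').replace(/[\r\n]+/g, ' ').trim();
  const text = String(form.text || '');
  if (subject === '' || text.trim() === '') {
    throw new Error('Subject and text are both required');
  }
  // Any recipient supplied by the visitor is ignored on purpose, so the
  // page cannot be used as a gateway to arbitrary addresses.
  return { to: CONTACT_ADDRESS, subject, text };
}
```

Because the recipient is fixed, a harvester that finds the contact page learns no address and can reach only that one mailbox.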

There's one rather real problem with this recipe: it uses JavaScript. That's bad, because many people (including myself) have switched it off due to its annoyance factor and its security implications. Unfortunately, switching scripting on and off on a per-site basis isn't implemented in Mozilla yet, so I'm loath to switch JS on even for a wiki that I trust; I have too often forgotten to switch it off when moving to other sites to be comfortable with JS even on an occasional basis.
Requiring mail harvesters to execute JS isn't a real obstacle either. They are known to use OCR with neural nets to get past "use a pass phrase on a GIF that's further distorted and blurred" barriers, which takes far more CPU cycles than executing a few measly scripts.
This leaves me with the impression that we can do some obfuscation to force harvesters into at least some effort, and wait until Internet mail is turned into something that requires authenticated senders...
Joachim Durchholz March 25, 2005, at 12:01 PM

Talk page for the EProtect recipe (users).