Re: [PATCH v5] gitweb: redacted e-mail addresses feature.
Eric Wong
Georgios Kontaxis via GitGitGadget
"Ævar Arnfjörð Bjarmason"
brian m. carlson
Georgios Kontaxis
2021-03-29 03:17:36 UTC
> Georgios Kontaxis via GitGitGadget wrote:
>> Gitweb extracts content from the Git log and makes it accessible
>> over HTTP. As a result, e-mail addresses found in commits are
>> exposed to web crawlers and they may not respect robots.txt.
>> This can result in unsolicited messages.
>> Introduce an 'email-privacy' feature which redacts e-mail addresses
>> from the generated HTML content
> A general reply to the topic: have you considered munging
> addresses in a way that is still human readable, but obviously
> obfuscated?
> On some other project, I settled on HTML "&#8226;" as a replacement
> for '.' for admins who enable that option.  The $USER@$NO_DOT
> remains as-is for easy identification+recognition of hosts.
Thanks for the suggestion.

People have been trying to hinder address harvesting for a while now.
Replacing '@' with "at" and '.' with "dot", adding spaces, etc.
was pretty common at some point, and may still be.
I would expect crawlers to have caught up by now, and that includes
all sorts of character encodings and Unicode look-alike substitutions.
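
To illustrate the concern, here is a rough sketch (not taken from any real
crawler) of how little code it takes to undo the common substitutions.
The munge() helper mirrors the "&#8226;" idea quoted above; harvest(), the
look-alike table and the regex are hypothetical and only for illustration.

```python
import html
import re
import unicodedata

# Hypothetical look-alike table: bullet, middle dot, Lisu tone mark -> '.',
# partial-differential sign -> '@'. A real harvester could list many more.
LOOKALIKES = str.maketrans({
    "\u2022": ".",
    "\u00b7": ".",
    "\ua4f8": ".",
    "\u2202": "@",
})

# Deliberately simple pattern; real-world address grammar is more permissive.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def munge(address):
    """Replace each '.' in the host part with the HTML bullet entity."""
    user, _, host = address.partition("@")
    return user + "@" + host.replace(".", "&#8226;")


def harvest(text):
    """Undo common obfuscations, then extract anything address-shaped."""
    text = html.unescape(text)                  # "&#8226;" -> bullet character
    text = unicodedata.normalize("NFKC", text)  # folds fullwidth/small forms
    text = text.translate(LOOKALIKES)           # map look-alikes to '.' / '@'
    text = re.sub(r"\s+at\s+", "@", text)       # "user at host"
    text = re.sub(r"\s+dot\s+", ".", text)      # "host dot org"
    return EMAIL_RE.findall(text)


if __name__ == "__main__":
    print(munge("user@example.org"))
    print(harvest(munge("user@example.org")))
    print(harvest("user at example dot org"))
```

Each new obfuscation scheme only costs the harvester one more table entry
or one more substitution line, which is why I doubt this buys us much.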

At the end of the day we are looking for something that's easy for humans
to read but hard for scripts to parse as an e-mail address
(and that scripts cannot learn to handle through an additional regex).
I'm not aware of anything like that, apart from CAPTCHAs and the like.

> I also considered Unicode homographs which can look identical
> to replacement characters, too; but rejected that idea since
> it would cause grief for legitimate users who would not notice
> the homograph when pasting into their mail client.
> Anyways, here's the list of candidates I tried:
> homograph∂
> homograph@80x24ͺorg
> homograph@80x24·org
> homograph@80x24•org
> homograph＠
> homograph﹫
> homographⒶ
> homograph@80x24 org
> homograph@80x24․org
> homograph@80x24ꓸorg