Comments
-
Greg Hurrell
Status changed:
- From: New
- To: Open
-
Greg Hurrell
There are a couple of workarounds at the application level:
- set up a "link" record
- or a wiki article "redirect"
-
Greg Hurrell
Although the problem can be worked around at the application level, this is really a lexer-level issue. By the time the app gets the URI it has already been tokenized in this undesirable way.
At a lexer level, I'm really not sure what can be done about this. Here are the relevant state machines:
    uri_chars = (alnum | [@$&'(\*\+=%_~/#] | '-')+ ;
    special_uri_chars = ([:!\(\),;\.\?])+ ;
    uri = ('mailto:'i mail) | (('http'i [sS]? '://' | 'ftp://'i | 'svn://'i) uri_chars (special_uri_chars uri_chars)*) ;
The ")" character is a "special URI char" and it will only be considered to be part of a URI if it is followed by a "non-special" char. We could append a "non-special" char but that would change the URI and most likely lead to 404s. By definition, the "special" char cannot be the last char in a URI.
Nevertheless, this "special/non-special" distinction is very useful in 99.9% of cases. It allows us to write stuff like "my home page (https://wincent.dev/)" and have it work as the author intended with no escaping or other special tricks. If we make the 0.1% case work (URIs ending with a parenthesis) then we make things less convenient in the remaining 99.9%.
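To make the behaviour concrete, here is a rough approximation of the machines in Python's re module. It is only a sketch: the real lexer is generated by Ragel, the 'mailto:' branch is left out because the "mail" machine is not shown here, and first_uri is a made-up helper name.

```python
import re

# Character classes transcribed from the Ragel definitions quoted above.
URI_CHARS = r"[A-Za-z0-9@$&'(*+=%_~/#-]+"
SPECIAL_URI_CHARS = r"[:!(),;.?]+"

# Assumption: only the scheme-based branch of "uri" is modelled here.
URI = re.compile(
    r"(?:https?|ftp|svn)://"
    + URI_CHARS
    + r"(?:" + SPECIAL_URI_CHARS + URI_CHARS + r")*",
    re.IGNORECASE,
)


def first_uri(text):
    """Return the first URI-like token in text, or None (hypothetical helper)."""
    match = URI.search(text)
    return match.group(0) if match else None


# The closing paren is "special" and nothing "non-special" follows it,
# so it stays outside the token, which is what the author wanted.
print(first_uri("my home page (https://wincent.dev/)"))
# https://wincent.dev/

# The same rule drops a paren that really does belong to the URI.
print(first_uri("http://en.wikipedia.org/wiki/ACE_(file_format)"))
# http://en.wikipedia.org/wiki/ACE_(file_format
```

The second call shows the 0.1% case: the token stops one character short of the real URI.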
Funnily enough, if I had used single quotes in my example above then things wouldn't have worked. ie. 'my home page (https://wincent.dev/)'
There, the single quote (') is a "non-special" URI char so both the parenthesis and the single quote get inappropriately slurped into the URI token. But it's a fairly unlikely edge case and I don't think it constitutes a flaw in the state machines.
-
Greg Hurrell
The expected behaviour with respect to "special" chars is defined in the "autolinking" spec file.
I am wondering if perhaps the machine could be modified to accept "special" chars as long as they are followed by at least one "non-special" or "special" char. ie:
- "(hello).html": the ")" and the "." are accepted because they are followed by "h"
- "(hello)": the ")" would not be accepted
- "(hello): foo" would consider "(hello)" to be part of the URI, but not the ":" because it is followed only by a space
- "(hello)." would also consider "(hello)" to be part of the URI
In other words, stuff ending with a parenthesis still wouldn't be automatically tokenized as a URI, but there would be an "escape hatch" for making the lexer tokenize it: adding exactly one more "special" char, such as ":", ".", ",", etc.
I am not sure if this is desirable as someone could conceivably write: "See my website (https://wincent.dev/)." and the parenthesis would now be erroneously interpreted as being part of the URI.
Perhaps the only way to resolve this definitively would be to introduce some sort of escaping into the lexer, and I think that's a line I don't necessarily want to cross (importing an idea from the domain of programming into what is supposed to be a lightweight, easy-to-use markup).
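For what it's worth, one possible reading of the modified rule can be sketched with a Python regex and a lookahead. This is an approximation and not the Ragel machine itself; the example.com URLs and the proposed_uri helper are made up for illustration, and the 'mailto:' branch is again left out.

```python
import re

# Class bodies transcribed from the existing machines.
NON_SPECIAL = r"A-Za-z0-9@$&'(*+=%_~/#\-"
SPECIAL = r":!(),;.?"

# Assumed reading of the proposal: a "special" char is accepted only when the
# next char is itself a URI char, whether "special" or "non-special".
PROPOSED = re.compile(
    rf"(?:https?|ftp|svn)://[{NON_SPECIAL}]+"
    rf"(?:[{NON_SPECIAL}]|[{SPECIAL}](?=[{NON_SPECIAL}{SPECIAL}]))*",
    re.IGNORECASE,
)


def proposed_uri(text):
    """Return the first token matched under the proposed rule, or None."""
    match = PROPOSED.search(text)
    return match.group(0) if match else None


print(proposed_uri("http://example.com/(hello).html"))
# http://example.com/(hello).html
print(proposed_uri("http://example.com/(hello)"))
# http://example.com/(hello
print(proposed_uri("http://example.com/(hello): foo"))
# http://example.com/(hello)
print(proposed_uri("See my website (https://wincent.dev/)."))
# https://wincent.dev/)
```

The last call reproduces the downside: a closing paren followed by a full stop gets pulled into the token.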
-
Greg Hurrell
Thinking about this a little more, I believe a percent escape can be used:
Yes, it's relatively ugly. Slightly less ugly if we do it consistently for both left and right parens:
So I think that's the end of the issue because:
- We don't want to change the existing behaviour because it is already the most convenient behaviour possible in 99.9% of cases.
- We don't want to introduce an explicit escaping mechanism (such as using a backslash or some other marker) as that would be introducing a programming-domain idea, and quite an invasive one, into what should be simple markup.
- An easy workaround (percent escapes) exists.
I think this even allows us to enclose such URLs inside parens and have everything just work as expected (eg. this link will work: http://en.wikipedia.org/wiki/ACE_%28file_format%29).
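If the escaping needs to be done by hand or in a script, a minimal sketch might look like the following. escape_parens is a hypothetical helper and not part of the wikitext code; it only swaps the two paren characters for their percent escapes.

```python
def escape_parens(uri):
    """Percent-escape parentheses so none is left as a trailing "special" char."""
    return uri.replace("(", "%28").replace(")", "%29")


print(escape_parens("http://en.wikipedia.org/wiki/ACE_(file_format)"))
# http://en.wikipedia.org/wiki/ACE_%28file_format%29
```

Both "%" and digits are "non-special" URI chars in the existing machines, so the escaped form no longer ends in a "special" char.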
-
Greg Hurrell
Status changed:
- From: Open
- To: Closed