Is Tidy still being maintained?

aditsu
Hi, is anybody still working on HTML Tidy? There has been no commit in CVS for more than a year now, and only about 4 commits in 2009. And there are lots of bugs that nobody is fixing.
I brought JTidy pretty much in sync with Tidy, to the point where it actually performs better, and I have to emulate bugs to get similar results. I have a bunch of bugs to report, but the ones I reported last year are still sitting there untouched.

Adrian

Re: Is Tidy still being maintained?

Bjoern Hoehrmann
* aditsu wrote:
>Hi, is anybody still working on HTML Tidy? There has been no commit in CVS
>for more than a year now, and only about 4 commits in 2009. And there are
>lots of bugs that nobody is fixing.

Speaking for myself, if somebody made me aware of a security problem, or
would like me to review a patch, I would be happy to do so. Beyond that,
for quite some time now nobody has been interested enough in the project
to steer discussion on the development list or otherwise engage with the
project actively.

>I brought JTidy pretty much in sync with Tidy, to the point where it
>actually performs better, and I have to emulate bugs to get similar results.
>I have a bunch of bugs to report, but the ones I reported last year are
>still sitting there untouched.

Well, to things like http://tidy.sf.net/issue/2917718 I think I've
always responded by saying that there are very many errors Tidy does
recover from silently; it was never meant to be a fully featured HTML
validator, or something along those lines. (To me, the usability of the
Sourceforge bug tracking system has always been terrible, and it has
become worse recently, so I mostly stay away from the tracker.)

I would say we do still have infrastructure and people in place to
support maintenance, but nobody doing much groundwork like working
through the bug tracker, submitting patches, starting discussions on
the development list, and so on.

There are basically four components to Tidy: an HTML parser that can
recover from errors in a manner authors might expect (as sometimes
opposed to what browsers do), a component that checks for some errors
and reports them, one where attempts are made to fix problems, and one
that "pretty prints" documents. The parser is likely going to be
replaced by "HTML5" parsers, and the validation component probably has
to be rewritten to accommodate "HTML5". I am not sure about the "fixup"
part. The pretty printing should really be done by a separate library
that allows for more freedom than Tidy has so far; pretty printing is
useful independently of all the other things (whereas parsing, error
reporting and fixups are not so easily separated).

With that in mind, the main thing that I would expect to happen in terms
of Tidy development is someone submitting a patch to support "HTML5"
elements and attributes in some rough manner, so you can continue to use
Tidy where you are used to it and at the same time use "HTML5" features.
I would not expect much more to come of the project at the moment.
--
Björn Höhrmann · mailto:[hidden email] · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 


Re: Is Tidy still being maintained?

ts-15
Hi,


If possible, would anyone consider a command-line option
to disable the "autofix" and instead have Tidy just report
the errors it finds, along with a count?

Even if not a full validator, it will still
be useful in lots of situations I think...


best regards
Thomas Schulz






Re: Is Tidy still being maintained?

aditsu
In reply to this post by Bjoern Hoehrmann
Hi, thanks for your reply



> Speaking for myself, if somebody made me aware of a security problem, or
> would like me to review a patch, I would be happy to do so, beyond that,
> for quite some time now nobody has been interested enough in the project
> to steer discussion on the development list or otherwise engage with the
> project actively.

I don't think I've seen any security problem.
And I'm not sure I want to get involved in fixing the C code, but I might submit
some patches where I can see a simple fix.

> >I brought JTidy pretty much in sync with Tidy, to the point where it
> >actually performs better, and I have to emulate bugs to get similar results.
> >I have a bunch of bugs to report, but the ones I reported last year are
> >still sitting there untouched.
>
> Well, to things like http://tidy.sf.net/issue/2917718 I think I've
> always responded to saying that there are very many errors Tidy does
> recover from silently, it's never meant to be a fully featured HTML
> Validator, or something along that line.

Well, most of the bugs I found (2917718 included) are regressions. They show up
in the tests that come with Tidy and were introduced at some point; they
didn't exist before.
Some of them are minor, but others cause broken output.

> (To me, usability of the
> Sourceforge bug tracking system has always been terrible, and it has
> become worse recently, so I mostly stay away from the tracker.)

I find it not too bad; of course, there's room for improvement.
Is there a better place to report bugs or submit patches for Tidy?
And if I submit patches (even for minor problems) will they get reviewed, and
committed if acceptable?

> I would say we do still have infrastructure and people in place to
> support maintenance, but nobody doing much groundwork like working
> through the bug tracker, submitting patches, starting discussions on
> the development list, and so on.

So basically nobody is fixing bugs anymore?

> There are basically four components to Tidy, one is a HTML parser that
> can recover from errors in a manner authors might expect (as sometimes
> opposed to what browsers do), a component that checks for some errors
> and reports them, one where attempts are made to fix problems, and one
> that "pretty prints" documents. The parser is likely going to be
> replaced by "HTML5" parsers, the validation component probably has to be
> rewritten to accommodate "HTML5", I am not sure about the "fixup" part,
> and the pretty printing should really be done by a separate library that
> allows for more freedom than Tidy has done so far, pretty printing is
> useful independently of all the other things (whereas parsing and error
> reporting and fixups are not so easily separated).

You're talking about "replacing" and "rewriting" things for HTML5. Why not just
add HTML5 support to what is already there?
If the new HTML5 code doesn't support previous HTML versions, then Tidy will
lose at least 90% of its usefulness.
If it does, then it means reimplementing a huge amount of functionality instead
of reusing it.
Anyway, HTML5 would be major new functionality, but the project is still
suffering from lots of bugs.

> With that in mind, the main thing that I would expect to happen in terms
> of Tidy development is someone submitting a patch to support "HTML5"
> elements and attributes in some rough manner, so you can continue to use
> Tidy where you are used to it, and at the same time use "HTML5" features
> and I would not expect much more to come of the project at the moment.

That sounds more like adding rather than replacing. That would be a good thing.
But I would rather expect bugfixes before "delving into new territory".

Regards,
Adrian


     


Re: Is Tidy still being maintained?

Bjoern Hoehrmann
* Adrian Sandor wrote:
>Is there a better place to report bugs or submit patches for Tidy?
>And if I submit patches (even for minor problems) will they get reviewed, and
>committed if acceptable?

I do think it would help if people mention important bugs in postings to
the developer list. Like I said, there are certain kinds of bugs where I
myself would look at the bugs and patches, if made aware of them.

>You're talking about "replacing" and "rewriting" things for HTML5. Why not just
>add HTML5 support to what is already there?

If somebody both submits a patch for "HTML5" support and also shows that
the level of "HTML5" support added is reasonable, I would be quite happy
to review and possibly commit those changes. I would not work on this
myself at the moment as "HTML5" is very unstable and its development
does not follow a predictable process.
--
Björn Höhrmann · mailto:[hidden email] · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 


Re: Is Tidy still being maintained?

aditsu
> I do think it would help if people mention important bugs in postings to
> the developer list. Like I said, there are certain kinds of bugs where I
> myself would look at the bugs and patches, if made aware of them.

I submitted
https://sourceforge.net/tracker/?func=detail&aid=3119322&group_id=27659&atid=390963
 with test case and patch.
Also,
https://sourceforge.net/tracker/?func=detail&aid=2913493&group_id=27659&atid=390963
 has a trivial fix.
I might be wrong about 2917718 being a regression - it looks like it had been
fixed only in JTidy.
I could backport the fix, but I'm not sure when lexer->token needs to be freed
(that's not a concern in Java).

> If somebody both submits a patch for "HTML5" support and also shows that
> the level of "HTML5" support added is reasonable, I would be quite happy
> to review and possibly commit those changes. I would not work on this
> myself at the moment as "HTML5" is very unstable and its development
> does not follow a predictable process.

As I mentioned before, my main concern is about bug fixes. I don't care much
about HTML5 support at this time.
(But if somebody else has a patch, I will be happy too)

Adrian


     


Tidy and HTML5

Keryx Web
X-posting to WHATWG help from [hidden email]

2010-11-26 09:36, Adrian Sandor wrote:

> As I mentioned before, my main concern is about bug fixes. I don't care much
> about HTML5 support at this time.
> (But if somebody else has a patch, I will be happy too)
>

Here is the deal with HTML5. Pretty soon every browser will have an
HTML5 parser. Except for IE, browsers do not have multiple parsers.

This means that tokenization and DOM tree building will follow the rules
defined in HTML5 - as opposed to not really following any rules at all,
since HTML 4 never defined them.

Simply put, there is no "opt out" of HTML5. An HTML 4 or XHTML 1.x
doctype is nothing more than a contract between developers. Technically
all it does is to set the browser in standards compliance mode.

Thus, I do not see any future in a tool that does not rely on the HTML5
parsing algorithm. Tidy cannot grow from its current code base, but
needs to have the same html5lib at its core that is in the HTML5
validator, which basically is the same as the one being used in Firefox 4.

A basic "Tidy5" implementation would thus look like this:
1. Parse the tag soup into a DOM
2. Serialize HTML from that DOM
3. Compare the start and the end result.
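
The three steps above can be sketched roughly as follows. This is only an illustration under loud assumptions: it uses Python's standard-library html.parser as a stand-in for a real HTML5 parser (it does not implement the HTML5 tree-building rules), and TreeBuilder, serialize and tidy_roundtrip are hypothetical names made up for this sketch.

```python
import difflib
import html
from html.parser import HTMLParser

# Assumption: a small, incomplete set of void elements, enough for the sketch.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class TreeBuilder(HTMLParser):
    """Step 1: build a nested [tag, attrs, children] tree from tag soup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ["#root", [], []]
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = [tag, attrs, []]
        self.stack[-1][2].append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close up to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1][2].append(data)


def serialize(node):
    """Step 2: write the tree back out with every element explicitly closed."""
    if isinstance(node, str):
        return html.escape(node, quote=False)
    tag, attrs, children = node
    inner = "".join(serialize(child) for child in children)
    if tag == "#root":
        return inner
    attr_text = "".join(
        f" {k}" if v is None else f' {k}="{html.escape(v)}"' for k, v in attrs
    )
    if tag in VOID:
        return f"<{tag}{attr_text}>"
    return f"<{tag}{attr_text}>{inner}</{tag}>"


def tidy_roundtrip(source):
    """Step 3: return the serialized result and a diff against the input."""
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    result = serialize(builder.root)
    diff = list(difflib.unified_diff([source], [result], lineterm=""))
    return result, diff
```

An empty diff means the round trip changed nothing; a non-empty one shows where the parser had to repair the markup. A real "Tidy5" would swap the stand-in parser for one that follows the HTML5 algorithm.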

Perhaps any error reporting can be made *during* the parsing process.
Henri Sivonen could probably answer the question if that is possible.

However, there is *much* talk about having a lint tool for HTML, that
goes beyond what the validator does. So in addition to the above, there
can be settings for stuff like:

- Implicit close of elements. Tolerate, require or drop all closing tags?
- Implicit elements - tolerate, require or drop (maybe require body but
drop tbody...)?
- Shortened attributes - tolerate, require or drop?
- HTML 4 style type attributes on <script> and <style> - tolerate,
require or drop?
- Explicit closing of void elements - tolerate, require or drop?
- Full XHTML syntax (convert both ways)
- Indentation. Preferably with an option so that block elements with
very short text content are not broken up into 3 rows as in Tidy today.

Besides purification and linting, such a tool/library can be used for:

- Security. This will require the possibility of white and/or
blacklisting elements and attributes. And preferably also attribute values.

- HTML post processing. This will enable authors to see indented code,
that is explicit, while at the same time such "waste" can be removed
before gzipping. This would be akin to JS minification and it could be
performed on the fly from within PHP, Python, Java, Ruby, C#, server
side JS or whatever. It can also be done manually before uploading from
the development environment to production - or it could be integrated
into the uploading tool!

The *main* feature that Tidy has today is the ability to handle
templates, by preserving/ignoring PHP or other server side code. To
what extent the HTML5 parser can be modified to handle that feature I do
not know.

From a maintenance and bug fixing POV, I see *huge* wins in having a
common base for Tidy, the HTML5 validator and HTML parsing in Gecko.

But the actual possibility thereof is beyond my technical knowledge to
evaluate.


--
Keryx Web (Lars Gunther)
http://keryx.se/
http://twitter.com/itpastorn/
http://itpastorn.blogspot.com/


Re: Tidy and HTML5

aditsu
----- Original Message ----

> From: Keryx Web <[hidden email]>
>
> 2010-11-26 09:36, Adrian Sandor wrote:
>
> > As I mentioned before, my main concern is about bug fixes. I don't care
> > much about HTML5 support at this time.
> > (But if somebody else has a patch, I will be happy too)
> >
>
> Here is the deal with HTML5.

Hi, I'm still trying to figure out the connection between my message and your
reply.
Are you perhaps trying to say that I am headed down the wrong path because the
code in Tidy is garbage and not worth fixing, and it should be replaced with an
html5 parser, which I SHOULD care about instead?

> Simply put, there is no "opt out" of HTML5.

I'm not sure what you mean by that. Sure, browsers may start using an HTML5
parser. But I don't think a majority of websites will switch to HTML5 anytime
soon. And even if they do, not many will break compatibility with HTML 4.x/older
browsers.

> Thus, I do not see any future in a tool that does not rely on the HTML5
> parsing algorithm. Tidy can not grow from its current code base, but needs to
> have the same html5lib at its core that is in the HTML5 validator, which
> basically is the same as the one being used in Firefox 4.

I disagree with both statements. But I think there could be some value in
starting fresh with an HTML5 parser.

> The *main* feature that Tidy has today, is the ability to handle templates, by
> preserving/ignoring PHP or other server side code.

I completely disagree. I'd say that the main features are its ability to
transform broken HTML into valid markup and produce a node tree, while reporting
the problems and corrections. I couldn't care less about php tags, but different
people have different needs.

> From a maintenance and bug fixing POV, I see *huge* wins in having a common
> base for Tidy, the HTML5 validator and HTML parsing in Gecko.
>
> But the actual possibility thereof is beyond my technical knowledge to
> evaluate.

Well, I don't know about that. If somebody can do it, great. I'm not going to do
any major development work in C; if I do anything about HTML5, it will be in
Java.
But for now, at the risk of repeating myself ad nauseam, I'm interested in
getting some bugs fixed.
If that's not going to happen, then I'll have to treat JTidy as a fork rather
than a port.

Regards,
Adrian


     


Re: Tidy and HTML5

Keryx Web
2010-11-26 16:34, Adrian Sandor wrote:

> ----- Original Message ----
>
>> From: Keryx Web <[hidden email]>
>>
>> 2010-11-26 09:36, Adrian Sandor wrote:
>>
>>> As I mentioned before, my main concern is about bug fixes. I don't care
>>> much about HTML5 support at this time.
>>> (But if somebody else has a patch, I will be happy too)
>>>
>>
>> Here is the deal with HTML5.
>
> Hi, I'm still trying to figure out the connection between my message and your
> reply.
> Are you perhaps trying to say that I am headed down the wrong path because the
> code in Tidy is garbage and not worth fixing, and it should be replaced with an
> html5 parser, which I SHOULD care about instead?

I was jumping in on the thread after your message, but in reality I was
commenting on the whole thread. Tidy is not getting any developer love
right now, and I do not think it will get any in the future either.

In a world that is going HTML5, WCAG 2.0 and script heavy sites that
need ARIA to be accessible, Tidy will need more than a little
facelift to stay relevant. Maybe you and a few more developers do not
want more than a few bug fixes, but those wishes will not gain any momentum.

HTML5 has momentum, like it or not.

>> Simply put, there is no "opt out" of HTML5.
>
> I'm not sure what you mean by that. Sure, browsers may start using an HTML5
> parser. But I don't think a majority of websites will switch to HTML5 anytime
> soon. And even if they do, not many will break compatibility with HTML 4.x/older
> browsers.

Except for IE there is no browser that switches between rendering
engines based on doctype or some other metadata. Thus, you may serve
your content with an HTML 4 doctype, but it will still be treated as
HTML5 by every new browser on the planet.

>> Thus, I do not see any future in a tool that does not rely on the HTML5
>> parsing algorithm. Tidy can not grow from its current code base, but needs to
>> have the same html5lib at its core that is in the HTML5 validator, which
>> basically is the same as the one being used in Firefox 4.
>
> I disagree with both statements. But I think there could be some value in
> starting fresh with an HTML5 parser.

"Both statements..." (1) You think there is a future for a tool that
does not follow HTML5 parsing rules? (2) You think some developer might
find such a proposition worthwhile enough to attract developer love?

If so, we disagree, yes.

>> The *main* feature that Tidy has today, is the ability to handle templates, by
>> preserving/ignoring PHP or other server side code.
>
> I completely disagree. I'd say that the main features are its ability to
> transform broken HTML into valid markup and produce a node tree, while reporting
> the problems and corrections. I couldn't care less about php tags, but different
> people have different needs.

My statement was in comparison to a pure validator, that can't be used
for templates.

I agree that Tidy not only reporting errors, but also fixing broken
markup is essential, especially when used server side.

>> From a maintenance and bug fixing POV, I see *huge* wins in having a common
>> base for Tidy, the HTML5 validator and HTML parsing in Gecko.
>>
>> But the actual possibility thereof is beyond my technical knowledge to
>> evaluate.
>
> Well, I don't know about that. If somebody can do it, great. I'm not going to do
> any major development work in C; IF I'll do anything about HTML5, it will be in
> java.

Good news then. Large parts of the parser are written in Java already.
In fact Henri Sivonen wrote it in Java first and then ported it to C++
for Firefox.

> But for now, at the risk of repeating myself ad nauseam, I'm interested in
> getting some bugs fixed.
> If that's not going to happen, then I'll have to treat JTidy as a fork rather
> than a port.

My prediction is that you are going to have to do that. Maybe your fixes
can be back ported to the original code, but I see no one stepping up to
the plate to do any serious work on Tidy as it is today.

If proven wrong I won't be the least bit sad, though. ;-)


--
Keryx Web (Lars Gunther)
http://keryx.se/
http://twitter.com/itpastorn/
http://itpastorn.blogspot.com/


Re: [html5] Tidy and HTML5

Henri Sivonen
In reply to this post by Keryx Web
Keryx Web wrote:
> X-posting to WHATWG help from [hidden email]
>
> 2010-11-26 09:36, Adrian Sandor wrote:
>
> > As I mentioned before, my main concern is about bug fixes. I don't
> > care much
> > about HTML5 support at this time.
> > (But if somebody else has a patch, I will be happy too)

FWIW, I expect supporting HTML5 in Tidy would be an undertaking of the 'rewrite' kind rather than a matter of 'a patch'.

> Thus, I do not see any future in a tool that does not rely on the
> HTML5 parsing algorithm.

I'd agree if Tidy was primarily a markup consumer. I think it's primarily a markup generator. The HTML5 parsing algorithm doesn't aim to fix author typos with the best DWIM imaginable. It just makes HTML consumption interoperable. A tool whose job is to mash invalid input into valid output on the author's side rather than the consumer's side could well compete on how good DWIM it implements compared to the HTML5 parsing algorithm. If I was writing a new Tidy-like tool, I'd probably hack the HTML tokenizer to treat '<' inside a tag token more like legacy Gecko and WebKit treated it (ending the current tag token, emitting it and starting a new one) instead of treating it the way HTML5, IE and Opera treat it.
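
To make the '<'-inside-a-tag idea concrete, here is a toy sketch, not code from any real tokenizer. It assumes a deliberately simplified model that ignores quoted attribute values, comments and doctypes, and the name tokenize is made up for illustration. The only point is the one rule: a '<' seen while a tag is still open ends that tag, emits it, and starts a new one.

```python
def tokenize(source):
    """Return a list of ("tag", text) and ("text", text) tokens.

    Toy model: a '<' met while inside a tag ends the current tag token,
    emits it, and starts a new tag, so a forgotten '>' only damages one tag.
    """
    tokens = []
    buf = []
    in_tag = False
    for ch in source:
        if ch == "<":
            # Flush whatever was pending, whether it was text or an
            # unterminated tag, then start collecting a new tag.
            if buf:
                tokens.append(("tag" if in_tag else "text", "".join(buf)))
            buf = []
            in_tag = True
        elif ch == ">" and in_tag:
            tokens.append(("tag", "".join(buf)))
            buf = []
            in_tag = False
        else:
            buf.append(ch)
    if buf:
        tokens.append(("tag" if in_tag else "text", "".join(buf)))
    return tokens
```

For an input such as `<p class=a<b>bold</b>`, this yields separate `p class=a` and `b` tag tokens rather than keeping the second '<' inside the first tag, which is closer to what an author who forgot a '>' probably meant.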

I think there isn't a future for Tidy if it doesn't preserve valid HTML5 as valid, though.

> A basic "Tidy5" implementation would thus look like this:
> 1. Parse the tag soup into a DOM
> 2. Serialize HTML from that DOM
> 3. Compare the start and the end result.

As I understand what Tidy does, that's not sufficient. The above steps could still result in invalid output. I think the appropriate steps for doing what Tidy aims to do would be:
 1) Parse input into a tree while reporting Parse Errors as defined by the HTML5 spec.
 2) Drop all unknown attributes, emitting an error message for each one.
 3) Remove all unknown elements by replacing them with their children, reporting an error message for each such removal.
 4) While the tree has machine-detectable conformance errors, transform the tree to remove a machine-detectable error and emit an error message when doing so. (This is the hard part and the core of the value Tidy currently offers.)
 5) Serialize the tree.
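
A minimal sketch of steps 2, 3 and 5 on an already-parsed tree may help. Steps 1 and 4 are left out on purpose, since they are the hard parts. Everything here is an assumption made for illustration: the Node class, the tiny KNOWN table of elements and attributes, and the function names are hypothetical and not taken from Tidy or from any HTML5 library.

```python
import html
from dataclasses import dataclass, field

# Assumption: a toy vocabulary; a real tool would derive this from the spec.
KNOWN = {
    "p": {"class", "id"},
    "b": {"class", "id"},
    "a": {"href", "class", "id"},
}


@dataclass
class Node:
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)  # Node instances or str


def clean(node, errors):
    """Return the list of nodes that replaces `node` after steps 2 and 3."""
    if isinstance(node, str):
        return [node]
    kids = []
    for child in node.children:
        kids.extend(clean(child, errors))
    if node.tag not in KNOWN:
        # Step 3: replace an unknown element with its children.
        errors.append(f"removed unknown element <{node.tag}>")
        return kids
    attrs = {}
    for name, value in node.attrs.items():
        if name in KNOWN[node.tag]:
            attrs[name] = value
        else:
            # Step 2: drop an unknown attribute.
            errors.append(f"dropped unknown attribute {name} on <{node.tag}>")
    return [Node(node.tag, attrs, kids)]


def serialize(node):
    """Step 5: write the cleaned tree back out as markup."""
    if isinstance(node, str):
        return html.escape(node, quote=False)
    attrs = "".join(f' {k}="{html.escape(v)}"' for k, v in node.attrs.items())
    inner = "".join(serialize(child) for child in node.children)
    return f"<{node.tag}{attrs}>{inner}</{node.tag}>"
```

Because children are cleaned before their parent, an unknown wrapper around known content disappears while the content survives, and each change leaves a message in the errors list.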

> Perhaps any error reporting can be made *during* the parsing process.
> Henri Sivonen could probably answer the question if that is possible.

Adding implied wrapper elements and dropping stuff could be done using a SAX pipeline without an actual in-memory tree. You need a tree somewhere to recover from all possible HTML errors, though.
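
As a rough illustration of the streaming idea, the sketch below chains two filters over a stream of parse events without ever building a tree. It is an assumption-laden stand-in: it uses plain Python generators and (kind, value) tuples instead of the actual SAX API, and the function names are hypothetical. One stage drops unwanted elements, the other adds an implied tbody around table rows, keeping only a stack of open element names as state.

```python
def drop_elements(events, unwanted):
    """Drop start/end events for unwanted elements, keeping their content."""
    for kind, value in events:
        if kind in ("start", "end") and value in unwanted:
            continue
        yield (kind, value)


def add_implied_tbody(events):
    """Wrap rows that sit directly inside a table in an implied tbody."""
    stack = []  # names of currently open elements; no tree is kept
    for kind, value in events:
        if kind == "start":
            if value == "tr" and stack and stack[-1] == "table":
                stack.append("tbody")
                yield ("start", "tbody")
            stack.append(value)
        elif kind == "end":
            if value == "table" and stack and stack[-1] == "tbody":
                stack.pop()
                yield ("end", "tbody")
            if stack and stack[-1] == value:
                stack.pop()
        yield (kind, value)


def pipeline(events):
    """Compose the two stages, as one would chain SAX filters."""
    return add_implied_tbody(drop_elements(events, {"font"}))
```

This only works when the input events are already properly nested; recovering from arbitrary misnesting is where a tree becomes necessary.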

> - HTML 4 style type attributes on <script> and <style> - tolerate,
> require or drop?

Why would anyone want to require those?

> - Security. This will require the possibility of white and/or
> blacklisting elements and attributes. And preferably also attribute
> values.

Only whitelisting would work for security.
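
A small sketch of what whitelisting elements, attributes and attribute values could look like. The assumptions are stated loudly: Python's standard-library html.parser stands in for a real HTML5 parser, the allowed sets are arbitrary examples, and AllowlistFilter, safe_href and sanitize are hypothetical names. It is an illustration of the principle, not a vetted sanitizer.

```python
import html
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Assumption: example policy only; anything not listed here is dropped.
ALLOWED_TAGS = {"p": set(), "b": set(), "i": set(), "a": {"href"}}
ALLOWED_SCHEMES = {"http", "https", "mailto"}


def safe_href(value):
    """Accept a URL only if its scheme is explicitly allowed."""
    try:
        return urlsplit(value).scheme in ALLOWED_SCHEMES
    except ValueError:
        return False


class AllowlistFilter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # unknown element: drop the tag itself
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_TAGS[tag] or value is None:
                continue
            if name == "href" and not safe_href(value):
                continue
            kept.append(f' {name}="{html.escape(value)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # All text, including the content of dropped elements, is escaped.
        self.out.append(html.escape(data, quote=False))


def sanitize(source):
    parser = AllowlistFilter()
    parser.feed(source)
    parser.close()
    return "".join(parser.out)
```

The design choice is that nothing passes by default: a new element, a new event-handler attribute or a new URL scheme is rejected until someone adds it to the lists, which is what a blacklist cannot guarantee. A side effect of this particular policy is that relative URLs are dropped too, since they have no scheme.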

--
Henri Sivonen
[hidden email]
http://hsivonen.iki.fi/