Native DOM way to get nodes of arbitrary type/name


Marat Tanalin | tanalin.com
Hello.

It would be nice to have a native (usable and performant) DOM way for retrieving DOM nodes by node type (or, alternatively, by node name).

This could be represented by these two simple methods:

    * element.getNodesByType(type) -- to get _all_ nodes
      of specified type contained in the element
      (like `element.getElementsByTagName('*')` for elements);

    * element.getChildNodesByType(type) -- to get _direct child_
      nodes of the element that have the specified type
      (like `element.children` for elements).

where the `type` argument is an Integer (literal or a corresponding predefined named constant as a property of the global `Node` object [1]) representing the type of nodes we need to retrieve.

For example:

    element.getNodesByType(Node.COMMENT_NODE)

would return all comment nodes inside the element.

And:

    element.getChildNodesByType(Node.TEXT_NODE)

would return all text nodes that are direct child nodes of the element.

=====================================================
Can get elements natively, cannot get arbitrary nodes
=====================================================

Currently we have dedicated DOM methods and properties for retrieving _elements_:

    * element.getElementsByTagName();
    * element.children;

and _all_ child nodes regardless of their type:

    * element.childNodes.

But we have no native DOM way to retrieve nodes of any arbitrary type as easily.

========
Usecases
========

    * For example, server-side script could minify HTML code
      by removing all HTML comments as DOM nodes. (DOM is not
      about just JavaScript inside browser. DOM can be used
      on server side for arbitrary DOM-tree modifications.)

    * Another usecase is processing text nodes via JavaScript
      in browser.

Currently we are forced to do this in script: either with regular expressions (widely recognized as the wrong tool for parsing or processing markup) or by traversing the DOM and filtering nodes manually, checking their types one by one.

Retrieving all nodes means first retrieving all elements via the `getElementsByTagName('*')` method, then looping over them and retrieving the direct child nodes of each via the `element.childNodes` property. All of this is not just developer-unfriendly (painful, actually) but also _slow_.

With native `getNodesByType(type)` and `getChildNodesByType(type)` methods, retrieving DOM nodes of arbitrary type would become trivial and much faster than with a pure-script implementation.
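
For reference, the manual traversal described above can be sketched in plain script. This only illustrates the pure-script approach and is not the proposed native API: the standalone function names mirror the proposed methods but are hypothetical, and the numeric constants are the standard values of `Node.TEXT_NODE` and `Node.COMMENT_NODE`.

```javascript
// Standard DOM node-type values (same as Node.TEXT_NODE and Node.COMMENT_NODE).
const TEXT_NODE = 3;
const COMMENT_NODE = 8;

// Hypothetical script-level stand-in for element.getChildNodesByType(type):
// returns the direct child nodes of `parent` whose nodeType equals `type`.
function getChildNodesByType(parent, type) {
  return Array.from(parent.childNodes || []).filter(
    (node) => node.nodeType === type
  );
}

// Hypothetical script-level stand-in for element.getNodesByType(type):
// returns all descendants of `root` with the given nodeType, in document order.
function getNodesByType(root, type, result = []) {
  for (const child of Array.from(root.childNodes || [])) {
    if (child.nodeType === type) {
      result.push(child);
    }
    getNodesByType(child, type, result); // recurse into the subtree
  }
  return result;
}
```

In a browser, `getNodesByType(document.body, COMMENT_NODE)` would collect every comment node under `<body>`.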

======================
Node type or node name
======================

Alternatively, we could have methods to search nodes not by type, but by node name. For example:

    element.getNodesByNodeName('#comment')

would return all comment nodes exactly like `element.getNodesByType(Node.COMMENT_NODE)` described above.

Searching by node name looks more flexible. For example, it would make it possible to get all child elements with a specified tag name, which is currently impossible: `element.children` returns all child elements regardless of their tag name, and there is no `element.getChildElementsByTagName()` method. This will probably become possible with `findAll('> SOME_TAG_NAME')`, but the `findAll()` approach would probably still be slower than `element.getChildNodesByType('SOME_TAG_NAME')`, since `findAll()` involves selector parsing while `getChildNodesByType()` does not.

Maybe the best option is just to allow both node-type Integer and node-name String as argument for `getNodesByType()` / `getChildNodesByType()` without need to choose one. For example:

    element.getNodesByType('#comment')

could effectively be exact equivalent to:

    element.getNodesByType(Node.COMMENT_NODE)
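
A minimal sketch of how one lookup could accept either argument form. The function names are hypothetical, and the case-insensitive name comparison is an assumption made here (HTML element node names are reported in upper case), not something the proposal specifies.

```javascript
// Decide whether `node` matches a numeric node type (e.g. 8) or a
// node-name string (e.g. '#comment' or 'TH').
// Assumption: names are compared case-insensitively.
function matchesTypeOrName(node, typeOrName) {
  if (typeof typeOrName === 'number') {
    return node.nodeType === typeOrName;
  }
  return (
    String(node.nodeName).toLowerCase() === String(typeOrName).toLowerCase()
  );
}

// Hypothetical stand-in for a getNodesByType() that accepts either form;
// walks all descendants of `root` in document order.
function getNodesByTypeOrName(root, typeOrName, result = []) {
  for (const child of Array.from(root.childNodes || [])) {
    if (matchesTypeOrName(child, typeOrName)) {
      result.push(child);
    }
    getNodesByTypeOrName(child, typeOrName, result);
  }
  return result;
}
```

With this sketch, `getNodesByTypeOrName(element, 8)` and `getNodesByTypeOrName(element, '#comment')` select the same nodes.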

Anyway, whether to search by node type, by node name, or by both does not matter too much.

What really matters is the idea of a native DOM way (usable and performant compared with pure-script ways) to retrieve nodes of _any_ arbitrary type/name.

Thanks.

[1] https://developer.mozilla.org/en-US/docs/Web/API/Node.nodeType


Re: Native DOM way to get nodes of arbitrary type/name

Glenn Maynard
On Fri, Oct 4, 2013 at 2:27 PM, Marat Tanalin <[hidden email]> wrote:
>     * For example, server-side script could minify HTML code
>       by removing all HTML comments as DOM nodes. (DOM is not
>       about just JavaScript inside browser. DOM can be used
>       on server side for arbitrary DOM-tree modifications.)

Browser specifications only target JavaScript inside browsers.  Other environments can define their own additions, but it's not something specs like HTML and DOM try to tackle.  (They have enough work to do already, and browsers are unlikely to spend time implementing APIs that nobody needs on the Web.)

>     * Another usecase is processing text nodes via JavaScript
>       in browser.

FYI, a use case is something you want to accomplish, such as "modify all text on a web site to be in alternating caps".  (That's not a use case for this, of course--that'd be CSS's job.)  Given that, can you give a concrete use case?

I don't know the use cases, so I don't know if any API should be added, but here's how you can do this without manually recursing yourself.  Note that you may be surprised by the results of "all text nodes".  For example, it'll include inline scripts.

    // Collect every node the TreeWalker visits into an array.
    var getAllWalkerResults = function(walker)
    {
        var result = [];
        while(walker.nextNode())
            result.push(walker.currentNode);
        return result;
    };

    // NodeFilter.SHOW_TEXT / SHOW_COMMENT restrict the walk to text / comment nodes.
    // (Despite the names, these return text and comment nodes, not elements.)
    var getAllTextElements = function(element) {
        return getAllWalkerResults(document.createTreeWalker(element, NodeFilter.SHOW_TEXT));
    };
    var getAllCommentElements = function(element) {
        return getAllWalkerResults(document.createTreeWalker(element, NodeFilter.SHOW_COMMENT));
    };
    console.log(getAllTextElements(document));
    console.log(getAllCommentElements(document));
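
The caveat about inline scripts can be handled with a node filter. The sketch below is one possible way to do it and is not part of the original message: `isInlineCodeText` and `getContentTextNodes` are hypothetical names, and the walker-based helper needs a browser-like DOM (`document`, `NodeFilter`).

```javascript
// True when a text node's parent is a <script> or <style> element, i.e. the
// text is code or CSS rather than page content.
function isInlineCodeText(textNode) {
  const parent = textNode.parentNode;
  if (!parent || !parent.nodeName) {
    return false;
  }
  const name = parent.nodeName.toLowerCase();
  return name === 'script' || name === 'style';
}

// Collect text nodes under `root`, skipping <script>/<style> contents.
// Requires a browser-like DOM; not called here.
function getContentTextNodes(root) {
  const walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT, {
    acceptNode: (node) =>
      isInlineCodeText(node)
        ? NodeFilter.FILTER_REJECT
        : NodeFilter.FILTER_ACCEPT,
  });
  const result = [];
  while (walker.nextNode()) {
    result.push(walker.currentNode);
  }
  return result;
}
```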

--
Glenn Maynard


Re: Native DOM way to get nodes of arbitrary type/name

Marat Tanalin | tanalin.com
04.10.2013, 23:58, "Glenn Maynard" <[hidden email]>:
> On Fri, Oct 4, 2013 at 2:27 PM, Marat Tanalin <[hidden email]> wrote:

>>     * Another usecase is processing text nodes via JavaScript
>>       in browser.
>
> FYI, a use case is something you want to accomplish, such as
> "modify all text on a web site to be in alternating caps".
> (That's not a use case for this, of course--that'd be CSS's job.)
> Given that, can you give a concrete use case?

There are multiple possible usecases for text processing (and text nodes are just one of node types different from element nodes), for example:

    * applying typography tricks like hanging punctuation [1];

    * automatic (re)formatting of texts in web-based WYSIWYG editors
      (e.g. replacing `--` with `—`, or inserting nonbreaking spaces,
      or removing processing-instruction nodes);

    * removing whitespace-only text nodes between child elements
      of an element (in particular, to work around browser bugs --
      for example, Safari 5 and older have a well-known bug where
      whitespace width is not zero even when font size is zero);

    * online client-side (functioning without sending anything to server)
      HTML-processing tools based on browser's DOM;

    * joky transformations of texts (e.g. shuffling letters in words
      during All Fools' Day).

I have encountered the task of retrieving nodes of arbitrary type often enough to finally write and send this proposal.

Also, it looks inconsistent that we have native ways to retrieve element nodes but no native ways to retrieve nodes of other types. We have multiple node types (element nodes, text nodes, comment nodes, etc.), but we can _natively_ search for _element_ nodes only, which looks like a sort of discrimination against nodes of other types. ;-)

An important advantage of the methods I've proposed is that they are _universal/general_ enough to get nodes of _any_ type, without polluting the DOM standard with dedicated methods for each node type (e.g. `getCommentNodes()`, `getTextNodes()`, `getProcessingInstructionNodes()` [+ their `getChild-` variants], like the existing `getElementsByTagName()`).

Also, as I've already mentioned in the original message, `getChildNodesByType()` would make it possible to get direct child elements with a specific tag name (e.g. get the `TH` cells, but not the `TD` cells, that are direct children of a `TR` element), which is currently impossible and will potentially be slower anyway with the upcoming `findAll()`.

> here's how you can do this without manually recursing yourself.

Thanks, the `createTreeWalker()` functionality is interesting and somewhat neater, but in essence, retrieving nodes of arbitrary type with it is still pure-script, not nearly as usable as a native method, and most likely noticeably slower than a potential native implementation.

> Note that you may be surprised by the results of "all text nodes".  For example, it'll include inline scripts.

I'm aware of that. Inline scripts in general are considered bad form, so that's not a problem for me (as well as for probably any web developer following good practices).

Thanks.

[1] http://www.artlebedev.com/mandership/120/


Re: Native DOM way to get nodes of arbitrary type/name

Glenn Maynard
On Fri, Oct 4, 2013 at 3:58 PM, Marat Tanalin <[hidden email]> wrote:
>     * applying typography tricks like hanging punctuation [1];

This would belong in CSS.
 
>     * automatic (re)formatting of texts in web-based WYSIWYG editors
>       (e.g. replacing `--` with `—`, or inserting nonbreaking spaces,
>       or removing processing-instruction nodes);

Maybe.
 
>     * removing whitespace-only text nodes between child elements
>       of an element (to work around browser bugs in particular --
>       for example Safari 5 and older has well-known bug related
>       to that whitespace width is not zero even when font size is zero);

Adding features to work around browser bugs doesn't make sense.  The features won't exist until a future version of the browser anyway, so they should just fix the bug.
 
>     * online client-side (functioning without sending anything to server)
>       HTML-processing tools based on browser's DOM;

This seems like a description of the API, rather than a use case.
 
>     * joky transformations of texts (e.g. shuffling letters in words
>       during All Fools' Day).

(Sorry if this seems a bit contrived.  :)

> Important good thing about the methods I've proposed is that they are _universal/general_ enough and provide ability to get nodes of _any_ type -- without need for somewhat polluting DOM standard with dedicated methods separately for each node type (e.g. `getCommentNodes()`, `getTextNodes()`, `getProcessingInstructionNodes()` [+ their `getChild-` variants] like existing `getElementsByTagName()`).
>
> Also, as I've already mentioned in the original message, `getChildNodesByType()` would provide ability to get direct-child elements of specific tag-name (e.g. get `TH` cells, but not `TD` cells that are direct child elements of a `TR` element) which is currently impossible at all and will be potentially slower anyway using upcoming `findAll()`.

I don't really understand what you mean (what does "TH cells but not TD cells" mean?), but you can already use querySelectorAll() to match using CSS selectors, eg. element.querySelectorAll("TH").  ("Potentially slower" isn't very interesting--you first need to show a real performance problem.  CSS selectors are very fast.)
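
On the `TH`/`TD` example: `element.querySelectorAll("TH")` matches `TH` descendants at any depth, so restricting the result to direct children needs either a `:scope >` selector or a filter over `element.children`. A minimal sketch of the latter follows; `getChildElementsByTagName` is a hypothetical helper name borrowed from the original message, not an existing DOM method.

```javascript
// Return the direct child elements of `parent` whose tag name matches
// `tagName` (compared case-insensitively, as tag names are in HTML documents).
function getChildElementsByTagName(parent, tagName) {
  const wanted = tagName.toLowerCase();
  return Array.from(parent.children).filter(
    (child) => child.tagName.toLowerCase() === wanted
  );
}

// In a browser, for a <tr> element `row`:
//   getChildElementsByTagName(row, 'th')   // TH cells only, not TD
//   row.querySelectorAll(':scope > th')    // selector-based equivalent
```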

>> here's how you can do this without manually recursing yourself.
>
> Thanks, `createTreeWalker()` functionality is interesting and somewhat more neat, but, in essence, retrieving nodes of arbitrary type with it is still pure-script, not much usable compared with a native method, and most likely noticeably slower than a potential native implementation.

If you want to make a performance argument, you'll want to show (1) that the non-native implementation is slow enough to cause real-world issues, and (2) that a native implementation is significantly faster.  I doubt that this is materially slower, and I suspect many browsers wouldn't implement this natively at all.
 
>> Note that you may be surprised by the results of "all text nodes".  For example, it'll include inline scripts.
>
> I'm aware of that. Inline scripts in general are considered bad form, so that's not a problem for me (as well as for probably any web developer following good practices).

(I disagree.  I certainly don't try to move every single piece of script into external scripts.)

--
Glenn Maynard


Re: Native DOM way to get nodes of arbitrary type/name

Bjoern Hoehrmann
In reply to this post by Marat Tanalin | tanalin.com
* Marat Tanalin wrote:
>It would be nice to have a native (usable and performant) DOM way for
>retrieving DOM nodes by node type (or, alternatively, by node name).
>
>This could be represented by these two simple methods:
>
>    * element.getNodesByType(type) -- to get _all_ nodes
>      of specified type contained in the element
>      (like `element.getElementsByTagName('*')` for elements);

Use XPath. With the convenient API this would be

  element.selectNodes('.//text()')
  element.selectNodes('.//comment()')
  element.selectNodes('.//processing-instruction()')

but in some browsers you have to use the DOM Level 3 XPath methods,
which are slightly less convenient. Note in particular that XPath
allows you to use predicates to select a subset of "all" the nodes
that would otherwise be returned.

>    * element.getChildNodesByType(type) -- to get _direct child_
>      nodes of the element that have the specified type
>      (like `element.children` for elements).

Drop the `.//` in the paths above for this.
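
With the DOM Level 3 XPath methods mentioned above, a `selectNodes()`-like helper might look as follows. This is a sketch rather than a drop-in replacement: the helper name is borrowed from the convenient API for illustration, and `7` is the standard value of `XPathResult.ORDERED_NODE_SNAPSHOT_TYPE`.

```javascript
// Standard value of XPathResult.ORDERED_NODE_SNAPSHOT_TYPE.
const ORDERED_NODE_SNAPSHOT_TYPE = 7;

// Evaluate an XPath expression relative to `contextNode` and return the
// matching nodes as a plain array, in document order.
function selectNodes(contextNode, expression) {
  // A Document has no ownerDocument, so fall back to the node itself.
  const doc = contextNode.ownerDocument || contextNode;
  const snapshot = doc.evaluate(
    expression,
    contextNode,
    null, // no namespace resolver
    ORDERED_NODE_SNAPSHOT_TYPE,
    null // do not reuse an existing result object
  );
  const nodes = [];
  for (let i = 0; i < snapshot.snapshotLength; i++) {
    nodes.push(snapshot.snapshotItem(i));
  }
  return nodes;
}

// In a browser:
//   selectNodes(element, './/comment()')  // all comment descendants
//   selectNodes(element, 'text()')        // direct child text nodes only
```
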
--
Björn Höhrmann · mailto:[hidden email] · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 


Re: Native DOM way to get nodes of arbitrary type/name

Marat Tanalin | tanalin.com
05.10.2013, 14:12, "Bjoern Hoehrmann" <[hidden email]>:

> * Marat Tanalin wrote:
>
>> It would be nice to have a native (usable and performant) DOM way for
>> retrieving DOM nodes by node type (or, alternatively, by node name).
>>
>> This could be represented by these two simple methods:
>>
>>     * element.getNodesByType(type) -- to get _all_ nodes
>>       of specified type contained in the element
>>       (like `element.getElementsByTagName('*')` for elements);
>
> Use XPath

Thank you for your attention, Bjoern.

My message is not about finding an immediate solution with existing tools (nor is the mailing list a forum for questions of that kind).

My message is a proposal intended to improve the DOM standard in general and to make it more universal and consistent.

XPath is not DOM or a part of it.
XPath is not as intuitive/easy-to-use as DOM.
XPath always involves parsing a "selector".

Of course, XPath could still be utilized for polyfilling the proposed DOM features in older XPath-capable browsers.


RE: Native DOM way to get nodes of arbitrary type/name

Domenic Denicola
Hi Marat,

I'd like to offer some friendly advice on getting your proposal taken seriously. I hope you will take it as it is intended, in good spirits.

The first step in getting people interested in your proposal is demonstrating that you are solving a concrete problem. To do this, there are two key ingredients:

1. Showing that you are solving a real developer-facing problem, and not one of API aesthetics ("I don't like using selector strings" is aesthetics, for example);
2. Showing that there is no way to solve that real developer-facing problem currently.

This may seem like a high barrier, but features are not free in any sense. Someone has to specify them (in the extensive detail required for interop, not in the cursory detail of an email message); someone has to integrate them with the rest of the platform; someone has to apply the human-resourcing pressure to get implementers to allocate valuable developer time to writing the appropriate browser patches and developer documentation; and then there's the ongoing maintenance cost that comes with any API, e.g. its constraints on backwards-compatibility for the indefinite future of the web platform. With these costs in mind, you can see why a feature that might seem obvious to you needs a lot more convincing than you're currently doing. Expecting people to intuitively understand why your feature is worth all those costs is not fair or reasonable.

So far you have not demonstrated very good concrete use cases (recalling Glenn's definition). The best way to do this would be to find an existing web app (not server-side app) that uses existing methods, like XPath or tree walkers, and then show how much cleaner the code would be with your proposal in place. Not a contrived example, but a real, deployed web app with lots of users, that existed prior to you sending your message to the list.

If it is indeed much cleaner, then the question becomes: is the web app you've shown doing something that is common enough to add an API to the DOM for? For that, you will need multiple examples of such web apps, because if it's something that only one large web app is doing, then that web app should just write a library that allows these operations cleanly, and open-source it. Indeed, even if several are doing that, an open-source library may be the best solution for prototyping; that is how the platform gained e.g. CSS selector matching, by first prototyping it in Sizzle and only after it had gained widespread adoption and recognition, making it native.

But what about performance, you ask? Indeed, you bring up this point often in your previous messages. However, I don't think this argument is very strong, for a few reasons. Native code is not generally faster than JS libraries for such tree-walking operations; crossing the boundary from JS to C++ and back again causes significant overhead in all current engines. As a result, many engines have taken to "self-hosting" parts of the platform, writing them purely in JS without using C++ at all. This gets us back to the same performance as an open-source library.

But let's say that the C++ implementation is possibly faster; perhaps it has access to some privileged APIs. Well, you need to prove that, if you want to use performance as an argument. To do so, the best thing to do would be to submit some patches to the various open-source engines, showing how you could use these privileged APIs to create a faster version of your proposal than is possible in a JS library based on XPath or tree walker. The accompanying benchmarks would then provide a good argument, either for exposing your proposed APIs, or perhaps for exposing the privileged fast APIs that make things possible, thus enabling not only your desired operations to be written in performant JS, but also probably other operations that could benefit from such fast-path lower-level APIs.

I hope this is helpful to your goals, and that you can understand where I'm coming from. I'm very interested in getting developers like yourself involved in the standards process and in proposing new or better APIs; in my opinion the WHATWG and company could greatly benefit from help in such areas. (I do love the shape of your APIs; they are very sensible, and much better than many existing ones in the DOM!) But in this particular instance, I'm afraid you might not be tackling the problem in the right way, and so I wanted to give you some perspective to help you be more productive.