Support sub-lists and sub-sub-lists #232

vvasuki · 2018-05-28T20:59:02Z

I've been facing a problem using this excellent tool to markdownify my html pages.


<ul>
  <li>headingStyle (setext or atx)</li>
  <li>horizontalRule (*, -, or _)</li>
  <li>bullet (*, -, or +)</li>
  <li>codeBlockStyle (indented or fenced)</li>
  <li>fence (` or ~)</li>
  <li>emDelimiter (_ or *)</li>
<ul>

  <li>strongDelimiter (** or __)</li>
  <li>linkStyle (inlined or referenced)</li>
  <li>linkReferenceStyle (full, collapsed, or shortcut)</li>
</ul>
</ul>

is transformed to

*   headingStyle (setext or atx)
*   horizontalRule (*, -, or _)
*   bullet (*, -, or +)
*   codeBlockStyle (indented or fenced)
*   fence (` or ~)
*   emDelimiter (_ or *)

*   strongDelimiter (** or __)
*   linkStyle (inlined or referenced)
*   linkReferenceStyle (full, collapsed, or shortcut)

rather than

*   headingStyle (setext or atx)
*   horizontalRule (*, -, or _)
*   bullet (*, -, or +)
*   codeBlockStyle (indented or fenced)
*   fence (` or ~)
*   emDelimiter (_ or *)

    *   strongDelimiter (** or __)
    *   linkStyle (inlined or referenced)
    *   linkReferenceStyle (full, collapsed, or shortcut)

The text was updated successfully, but these errors were encountered:

domchristie · 2018-05-29T08:19:52Z

Hi @vvasuki,

The HTML specification requires that nested lists be contained within a <li>:

Permitted content: zero or more <li> elements, which in turn often contain nested <ol> or <ul> elements.
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/ul

For nested lists to be converted correctly to markdown, your HTML will need to be something like:

<ul>
  <li>headingStyle (setext or atx)</li>
  <li>horizontalRule (*, -, or _)</li>
  <li>bullet (*, -, or +)</li>
  <li>codeBlockStyle (indented or fenced)</li>
  <li>fence (` or ~)</li>
  <li>emDelimiter (_ or *)</li>
  <li>
    <ul>
      <li>strongDelimiter (** or __)</li>
      <li>linkStyle (inlined or referenced)</li>
      <li>linkReferenceStyle (full, collapsed, or shortcut)</li>
    </ul>
  </li>
</ul>

vvasuki · 2018-05-29T14:05:21Z

Thanks for clarifying; but this rule is not strictly followed (please see what google sites produces here: https://sites.google.com/site/sanskritcode/ocr/1-ocr-ing ) and modern browsers interpret sublists not contained within li just fine..

It would indeed be very useful if this was supported here. As an example, there are so many sites pages I need to move to markdown - it would be hard to fix the html before pasting to turndown.

vvasuki · 2018-05-29T14:09:02Z

Also note that for the strictly valid html you presented, turndown outputs:

*   headingStyle (setext or atx)
*   horizontalRule (*, -, or _)
*   bullet (*, -, or +)
*   codeBlockStyle (indented or fenced)
*   fence (` or ~)
*   emDelimiter (_ or *)
*   *   strongDelimiter (** or __)
    *   linkStyle (inlined or referenced)
    *   linkReferenceStyle (full, collapsed, or shortcut)

with an extra * , which is rendered differently.
I suppose that this is why modern browsers distinguish and allow the "sublist outside li" option.

domchristie · 2018-05-30T08:03:56Z

I'm afraid it's not really practical for the library to support invalid HTML.

vvasuki · 2018-06-02T02:51:12Z

Well, should you change your mind -

function fixSublists(node) {
    var ulElements = Array.from(node.getElementsByTagName("ul"));
    var olElements = Array.from(node.getElementsByTagName("ol"));
    var listElements = ulElements.concat(olElements);
    listElements.forEach(function (listNode) {
        if (listNode.previousElementSibling.tagName.toUpperCase() === "LI") {
          // The below moves, not copies. https://stackoverflow.com/questions/7555442/move-an-element-to-another-parent-after-changing-its-id
            listNode.previousElementSibling.appendChild(listNode);
        }
    });
    return node;
}

function RootNode (input) {
...
 root = fixSublists(root);
...
}

( Forked at https://vvasuki.github.io/google-sheets-to-markdown/ )

dnotes · 2023-08-01T01:05:39Z

If you want to allow for this non-standard HTML (and you should—see below!) you can do so without forking the repository by overriding the rules for lists on your TurndownService instance.

let converter = new TurndownService()

converter
.addRule('list', {
  filter: ['ul', 'ol'],

  replacement: function (content, node) {
    var parent = node.parentNode

    // CHANGE: Allow for <ul> and <ol> that are improperly nested outside of <li>.
    if (node.parentNode.nodeName.match(/^(UL|OL)$/i)) {
      content = '    ' + content
        .replace(/^\n+/, '') // remove leading newlines
        .replace(/\n+$/, '\n') // replace trailing newlines with just a single one
        .replace(/\n/gm, '\n    ') // indent
    }

    if (parent.nodeName === 'LI' && parent.lastElementChild === node) {
      return '\n' + content
    } else {
      return '\n\n' + content + '\n\n'
    }
  }
})
.addRule('listItem', {
  filter: 'li',

  replacement: function (content, node, options) {
    content = content
      .replace(/^\n+/, '') // remove leading newlines
      .replace(/\n+$/, '\n') // replace trailing newlines with just a single one
      .replace(/\n/gm, '\n    ') // indent

    var prefix = options.bulletListMarker + '   '
    var parent = node.parentNode

    // CHANGE: Don't output two markers if there is an empty li.
    if (node.children.length === 1 && node.children[0].nodeName.match(/^(UL|OL)$/i) && node.textContent === node.children[0].textContent ) prefix = '    '

    else if (parent.nodeName === 'OL') {
      var start = parent.getAttribute('start')
      // CHANGE: When <ol> is improperly nested outside of <li>, get the numbering correct.
      var index = Array.prototype.indexOf.call(Array.prototype.filter.call(parent.children, el => el.nodeName === 'LI'), node)
      prefix = (start ? Number(start) + index : index + 1) + '.  '
    }

    return (
      prefix + content + (node.nextSibling && !/\n$/.test(content) ? '\n' : '')
    )
  }
})

Beyond tl;dr: Why is this important?

I'm running into these two errors with some regularity in converting texts from HTML to markdown. These departures from standard HTML have a very clear meaning on the webpage, and that meaning is LOST by conversion to markdown with this module. For example, consider this case:

<ol>
  <li>item 1</li>
  <li>item 2</li>
  <ol>
    <li>item 2.1</li>
    <li>item 2.2</li>
    <ol>
      <li>item 2.2.1</li>
      <li>item 2.2.2</li>
      <li>item 2.2.3</li>
    </ol>
  </ol>
  <li>item 3</li>
</ol>

That meaning is very clear, both on the web page and in the HTML itself. If you use turndown on it, the code will be transformed to something like this:

1. item 1
2. item 2

1. item 2.1
2. item 2.2

1. item 2.2.1
2. item 2.2.2
3. item 2.2.3

4. item 3

Now that doesn't really make sense in Markdown, because the hierarchy is gone and item 3 is badly numbered, but it's even worse if you view it on a page because the whole list gets flattened and you completely lose the intent of the original:

<ol>
  <li>item 1</li>
  <li>item 2</li>
  <li>item 2.1</li>
  <li>item 2.2</li>
  <li>item 2.2.1</li>
  <li>item 2.2.2</li>
  <li>item 2.2.3</li>
  <li>item 3</li>
</ol>

So that is why I think this should be supported, since even the non-standard HTML has de-facto support in every browser (i.e. it is displayed meaningfully) and in this case the meaning is getting lost by conversion to Markdown, and it is so easy to fix. I dunno, I might open a pull request even though this is old and @domchristie is totally right about it not being practical to support invalid HTML.

domchristie closed this as completed May 29, 2018

This was referenced Oct 10, 2020

Nested Lists Are Not Indented Correctly #354

Closed

fix: indent sublists correctly akosbalasko/yarle#61

Merged

dnotes mentioned this issue Aug 1, 2023

Support for nested <ul>/<ol> tags without <li> (not technically valid) #445

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support sub-lists and sub-sub-lists #232

Support sub-lists and sub-sub-lists #232

vvasuki commented May 28, 2018

domchristie commented May 29, 2018

vvasuki commented May 29, 2018

vvasuki commented May 29, 2018

domchristie commented May 30, 2018

vvasuki commented Jun 2, 2018

dnotes commented Aug 1, 2023

Support sub-lists and sub-sub-lists #232

Support sub-lists and sub-sub-lists #232

Comments

vvasuki commented May 28, 2018

domchristie commented May 29, 2018

vvasuki commented May 29, 2018

vvasuki commented May 29, 2018

domchristie commented May 30, 2018

vvasuki commented Jun 2, 2018

dnotes commented Aug 1, 2023

Beyond tl;dr: Why is this important?