Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support sub-lists and sub-sub-lists #232

Closed
vvasuki opened this issue May 28, 2018 · 6 comments
Closed

Support sub-lists and sub-sub-lists #232

vvasuki opened this issue May 28, 2018 · 6 comments

Comments

@vvasuki
Copy link

vvasuki commented May 28, 2018

I've been facing a problem using this excellent tool to markdownify my html pages.


<ul>
  <li>headingStyle (setext or atx)</li>
  <li>horizontalRule (*, -, or _)</li>
  <li>bullet (*, -, or +)</li>
  <li>codeBlockStyle (indented or fenced)</li>
  <li>fence (` or ~)</li>
  <li>emDelimiter (_ or *)</li>
<ul>

  <li>strongDelimiter (** or __)</li>
  <li>linkStyle (inlined or referenced)</li>
  <li>linkReferenceStyle (full, collapsed, or shortcut)</li>
</ul>
</ul>

is transformed to

*   headingStyle (setext or atx)
*   horizontalRule (*, -, or _)
*   bullet (*, -, or +)
*   codeBlockStyle (indented or fenced)
*   fence (` or ~)
*   emDelimiter (_ or *)

*   strongDelimiter (** or __)
*   linkStyle (inlined or referenced)
*   linkReferenceStyle (full, collapsed, or shortcut)

rather than

*   headingStyle (setext or atx)
*   horizontalRule (*, -, or _)
*   bullet (*, -, or +)
*   codeBlockStyle (indented or fenced)
*   fence (` or ~)
*   emDelimiter (_ or *)

    *   strongDelimiter (** or __)
    *   linkStyle (inlined or referenced)
    *   linkReferenceStyle (full, collapsed, or shortcut)
@domchristie
Copy link
Collaborator

Hi @vvasuki,

The HTML specification requires that nested lists be contained within a <li>:

Permitted content: zero or more <li> elements, which in turn often contain nested <ol> or <ul> elements.
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/ul

For nested lists to be converted correctly to markdown, your HTML will need to be something like:

<ul>
  <li>headingStyle (setext or atx)</li>
  <li>horizontalRule (*, -, or _)</li>
  <li>bullet (*, -, or +)</li>
  <li>codeBlockStyle (indented or fenced)</li>
  <li>fence (` or ~)</li>
  <li>emDelimiter (_ or *)</li>
  <li>
    <ul>
      <li>strongDelimiter (** or __)</li>
      <li>linkStyle (inlined or referenced)</li>
      <li>linkReferenceStyle (full, collapsed, or shortcut)</li>
    </ul>
  </li>
</ul>

@vvasuki
Copy link
Author

vvasuki commented May 29, 2018

Thanks for clarifying; but this rule is not strictly followed (please see what google sites produces here: https://sites.google.com/site/sanskritcode/ocr/1-ocr-ing ) and modern browsers interpret sublists not contained within li just fine..

It would indeed be very useful if this was supported here. As an example, there are so many sites pages I need to move to markdown - it would be hard to fix the html before pasting to turndown.

@vvasuki
Copy link
Author

vvasuki commented May 29, 2018

Also note that for the strictly valid html you presented, turndown outputs:

*   headingStyle (setext or atx)
*   horizontalRule (*, -, or _)
*   bullet (*, -, or +)
*   codeBlockStyle (indented or fenced)
*   fence (` or ~)
*   emDelimiter (_ or *)
*   *   strongDelimiter (** or __)
    *   linkStyle (inlined or referenced)
    *   linkReferenceStyle (full, collapsed, or shortcut)

with an extra * , which is rendered differently.
I suppose that this is why modern browsers distinguish and allow the "sublist outside li" option.

@domchristie
Copy link
Collaborator

I'm afraid it's not really practical for the library to support invalid HTML.

@vvasuki
Copy link
Author

vvasuki commented Jun 2, 2018

Well, should you change your mind -

function fixSublists(node) {
    var ulElements = Array.from(node.getElementsByTagName("ul"));
    var olElements = Array.from(node.getElementsByTagName("ol"));
    var listElements = ulElements.concat(olElements);
    listElements.forEach(function (listNode) {
        if (listNode.previousElementSibling.tagName.toUpperCase() === "LI") {
          // The below moves, not copies. https://stackoverflow.com/questions/7555442/move-an-element-to-another-parent-after-changing-its-id
            listNode.previousElementSibling.appendChild(listNode);
        }
    });
    return node;
}

function RootNode (input) {
...
 root = fixSublists(root);
...
}

( Forked at https://vvasuki.github.io/google-sheets-to-markdown/ )

@dnotes
Copy link

dnotes commented Aug 1, 2023

If you want to allow for this non-standard HTML (and you should—see below!) you can do so without forking the repository by overriding the rules for lists on your TurndownService instance.

let converter = new TurndownService()

converter
.addRule('list', {
  filter: ['ul', 'ol'],

  replacement: function (content, node) {
    var parent = node.parentNode

    // CHANGE: Allow for <ul> and <ol> that are improperly nested outside of <li>.
    if (node.parentNode.nodeName.match(/^(UL|OL)$/i)) {
      content = '    ' + content
        .replace(/^\n+/, '') // remove leading newlines
        .replace(/\n+$/, '\n') // replace trailing newlines with just a single one
        .replace(/\n/gm, '\n    ') // indent
    }

    if (parent.nodeName === 'LI' && parent.lastElementChild === node) {
      return '\n' + content
    } else {
      return '\n\n' + content + '\n\n'
    }
  }
})
.addRule('listItem', {
  filter: 'li',

  replacement: function (content, node, options) {
    content = content
      .replace(/^\n+/, '') // remove leading newlines
      .replace(/\n+$/, '\n') // replace trailing newlines with just a single one
      .replace(/\n/gm, '\n    ') // indent

    var prefix = options.bulletListMarker + '   '
    var parent = node.parentNode

    // CHANGE: Don't output two markers if there is an empty li.
    if (node.children.length === 1 && node.children[0].nodeName.match(/^(UL|OL)$/i) && node.textContent === node.children[0].textContent ) prefix = '    '

    else if (parent.nodeName === 'OL') {
      var start = parent.getAttribute('start')
      // CHANGE: When <ol> is improperly nested outside of <li>, get the numbering correct.
      var index = Array.prototype.indexOf.call(Array.prototype.filter.call(parent.children, el => el.nodeName === 'LI'), node)
      prefix = (start ? Number(start) + index : index + 1) + '.  '
    }

    return (
      prefix + content + (node.nextSibling && !/\n$/.test(content) ? '\n' : '')
    )
  }
})

Beyond tl;dr: Why is this important?

I'm running into these two errors with some regularity in converting texts from HTML to markdown. These departures from standard HTML have a very clear meaning on the webpage, and that meaning is LOST by conversion to markdown with this module. For example, consider this case:

<ol>
  <li>item 1</li>
  <li>item 2</li>
  <ol>
    <li>item 2.1</li>
    <li>item 2.2</li>
    <ol>
      <li>item 2.2.1</li>
      <li>item 2.2.2</li>
      <li>item 2.2.3</li>
    </ol>
  </ol>
  <li>item 3</li>
</ol>

That meaning is very clear, both on the web page and in the HTML itself. If you use turndown on it, the code will be transformed to something like this:

1. item 1
2. item 2

1. item 2.1
2. item 2.2

1. item 2.2.1
2. item 2.2.2
3. item 2.2.3

4. item 3

Now that doesn't really make sense in Markdown, because the hierarchy is gone and item 3 is badly numbered, but it's even worse if you view it on a page because the whole list gets flattened and you completely lose the intent of the original:

<ol>
  <li>item 1</li>
  <li>item 2</li>
  <li>item 2.1</li>
  <li>item 2.2</li>
  <li>item 2.2.1</li>
  <li>item 2.2.2</li>
  <li>item 2.2.3</li>
  <li>item 3</li>
</ol>

So that is why I think this should be supported, since even the non-standard HTML has de-facto support in every browser (i.e. it is displayed meaningfully) and in this case the meaning is getting lost by conversion to Markdown, and it is so easy to fix. I dunno, I might open a pull request even though this is old and @domchristie is totally right about it not being practical to support invalid HTML.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants