-
Notifications
You must be signed in to change notification settings - Fork 899
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support sub-lists and sub-sub-lists #232
Comments
Hi @vvasuki, The HTML specification requires that nested lists be contained within a
For nested lists to be converted correctly to markdown, your HTML will need to be something like: <ul>
<li>headingStyle (setext or atx)</li>
<li>horizontalRule (*, -, or _)</li>
<li>bullet (*, -, or +)</li>
<li>codeBlockStyle (indented or fenced)</li>
<li>fence (` or ~)</li>
<li>emDelimiter (_ or *)</li>
<li>
<ul>
<li>strongDelimiter (** or __)</li>
<li>linkStyle (inlined or referenced)</li>
<li>linkReferenceStyle (full, collapsed, or shortcut)</li>
</ul>
</li>
</ul> |
Thanks for clarifying; but this rule is not strictly followed (please see what google sites produces here: https://sites.google.com/site/sanskritcode/ocr/1-ocr-ing ) and modern browsers interpret sublists not contained within li just fine.. It would indeed be very useful if this was supported here. As an example, there are so many sites pages I need to move to markdown - it would be hard to fix the html before pasting to turndown. |
Also note that for the strictly valid html you presented, turndown outputs:
with an extra * , which is rendered differently. |
I'm afraid it's not really practical for the library to support invalid HTML. |
Well, should you change your mind -
( Forked at https://vvasuki.github.io/google-sheets-to-markdown/ ) |
If you want to allow for this non-standard HTML (and you should—see below!) you can do so without forking the repository by overriding the rules for lists on your TurndownService instance. let converter = new TurndownService()
converter
.addRule('list', {
filter: ['ul', 'ol'],
replacement: function (content, node) {
var parent = node.parentNode
// CHANGE: Allow for <ul> and <ol> that are improperly nested outside of <li>.
if (node.parentNode.nodeName.match(/^(UL|OL)$/i)) {
content = ' ' + content
.replace(/^\n+/, '') // remove leading newlines
.replace(/\n+$/, '\n') // replace trailing newlines with just a single one
.replace(/\n/gm, '\n ') // indent
}
if (parent.nodeName === 'LI' && parent.lastElementChild === node) {
return '\n' + content
} else {
return '\n\n' + content + '\n\n'
}
}
})
.addRule('listItem', {
filter: 'li',
replacement: function (content, node, options) {
content = content
.replace(/^\n+/, '') // remove leading newlines
.replace(/\n+$/, '\n') // replace trailing newlines with just a single one
.replace(/\n/gm, '\n ') // indent
var prefix = options.bulletListMarker + ' '
var parent = node.parentNode
// CHANGE: Don't output two markers if there is an empty li.
if (node.children.length === 1 && node.children[0].nodeName.match(/^(UL|OL)$/i) && node.textContent === node.children[0].textContent ) prefix = ' '
else if (parent.nodeName === 'OL') {
var start = parent.getAttribute('start')
// CHANGE: When <ol> is improperly nested outside of <li>, get the numbering correct.
var index = Array.prototype.indexOf.call(Array.prototype.filter.call(parent.children, el => el.nodeName === 'LI'), node)
prefix = (start ? Number(start) + index : index + 1) + '. '
}
return (
prefix + content + (node.nextSibling && !/\n$/.test(content) ? '\n' : '')
)
}
}) Beyond tl;dr: Why is this important?I'm running into these two errors with some regularity in converting texts from HTML to markdown. These departures from standard HTML have a very clear meaning on the webpage, and that meaning is LOST by conversion to markdown with this module. For example, consider this case: <ol>
<li>item 1</li>
<li>item 2</li>
<ol>
<li>item 2.1</li>
<li>item 2.2</li>
<ol>
<li>item 2.2.1</li>
<li>item 2.2.2</li>
<li>item 2.2.3</li>
</ol>
</ol>
<li>item 3</li>
</ol> That meaning is very clear, both on the web page and in the HTML itself. If you use turndown on it, the code will be transformed to something like this: 1. item 1
2. item 2
1. item 2.1
2. item 2.2
1. item 2.2.1
2. item 2.2.2
3. item 2.2.3
4. item 3 Now that doesn't really make sense in Markdown, because the hierarchy is gone and item 3 is badly numbered, but it's even worse if you view it on a page because the whole list gets flattened and you completely lose the intent of the original:
So that is why I think this should be supported, since even the non-standard HTML has de-facto support in every browser (i.e. it is displayed meaningfully) and in this case the meaning is getting lost by conversion to Markdown, and it is so easy to fix. I dunno, I might open a pull request even though this is old and @domchristie is totally right about it not being practical to support invalid HTML. |
I've been facing a problem using this excellent tool to markdownify my html pages.
is transformed to
rather than
The text was updated successfully, but these errors were encountered: