lang does not set mainlang when using xelatex #2174

Closed
njbart opened this issue May 24, 2015 · 5 comments

Comments

njbart commented May 24, 2015

http://pandoc.org/releases.html: “mainlang is the last of a comma-separated list of languages in lang.”

This does not work as expected when using xelatex and the default.latex template. – Example:

pandoc -s --latex-engine=xelatex -t latex << EOT

---
lang: arabic,english
#mainlang: english
...
EOT

Actual output contains \usepackage{polyglossia} \setmainlanguage{}.

Only uncommenting mainlang: english produces the expected \usepackage{polyglossia} \setmainlanguage{english}.

Additional feature request: Create a variable for the “other” languages (all except the last of a comma-separated list of languages in lang), too (for example, “otherlang”), and add $if(otherlang)$ \setotherlanguages{$otherlang$}$endif$ to default.latex, after \setmainlanguage{$mainlang$}.
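
The proposed split of lang into a main language and a list of "other" languages could look roughly like the following. This is only a sketch of the convention quoted from the release notes, not pandoc's actual implementation; the function name split_lang is hypothetical.

```python
def split_lang(lang: str) -> tuple[str, list[str]]:
    """Split a comma-separated `lang` value into (mainlang, otherlangs).

    Follows the quoted convention: the last entry is the main language,
    everything before it counts as an "other" language.
    """
    names = [part.strip() for part in lang.split(",") if part.strip()]
    if not names:
        return "", []
    return names[-1], names[:-1]


main, others = split_lang("arabic,english")
print(main)    # english
print(others)  # ['arabic']
```

With values like these, the template could fill \setmainlanguage{} from the first result and \setotherlanguages{} from the second, joined with commas.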

Contributor

ousia commented May 24, 2015

@nickbart1980, sorry if I am missing something, but it seems that LaTeX is especially problematic when dealing with languages.

First of all, LaTeX doesn’t use the same language identifier names as XML (#1614). I think this should be fixed first, so that there are common values (with translation between them).

I’m not a LaTeX user myself (anymore), but I don’t see why we need three variables for languages (lang, mainlang, and the proposed otherlang) instead of only one: lang.

@jgm, I don’t think that lang should be used to contain a list of languages used in the document. Since this value is used to set the document main language in HTML (and ePub) and it should be used for other formats (#1667), lang should only contain a single language.

And when other languages are used in the document, pandoc itself should add this information for polyglossia or babel.

BTW, @nickbart1980, just out of curiosity, how do you mark up passages that contain foreign languages in your Markdown documents?

Author

njbart commented May 24, 2015

Well, I don't think LaTeX is that problematic. Of course, as you point out quite rightly, pandoc should only use language tags built on two-letter ISO 639-1 codes, optionally followed by a region subtag (“en-US”, “fr-FR”, “ar”, …), as language identifiers and map these to babel or polyglossia identifiers as appropriate.
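
As a rough illustration of such a mapping, here is a minimal sketch that reduces a tag to its primary language subtag and looks it up in a small table. The table is an illustrative subset, the names POLYGLOSSIA_NAMES and to_polyglossia are hypothetical, and the region subtag is simply dropped here, which a real implementation would probably not want to do.

```python
from typing import Optional

# Illustrative subset only; a real table would need to cover far more languages.
POLYGLOSSIA_NAMES = {
    "ar": "arabic",
    "de": "german",
    "en": "english",
    "es": "spanish",
    "fr": "french",
    "it": "italian",
}


def to_polyglossia(tag: str) -> Optional[str]:
    """Map a tag such as "fr-FR" to a polyglossia name, or None if unknown."""
    # Keep only the primary language subtag ("fr-FR" -> "fr").
    primary = tag.replace("_", "-").split("-")[0].lower()
    return POLYGLOSSIA_NAMES.get(primary)


print(to_polyglossia("fr-FR"))  # french
print(to_polyglossia("ar"))     # arabic
```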

As to the proliferation of variables, mainlang and otherlang are only to be used internally in templates anyway; ideally users would only have to specify, e.g., lang: fr-FR or lang: fr-FR, en-US, and pandoc would see to the rest. By the way, just specifying locale: fr-FR should be possible too, and this should set lang: fr-FR automatically.

@ousia, your idea “when other languages are used in the document, pandoc itself should add this information for polyglossia or babel” is intriguing: if pandoc can figure out by itself which other languages are used in a document, we could indeed limit the use of lang to specifying the main language. If pandoc could extract all lang attributes used in spans and divs in a document and use these to assemble a list of “other” languages that would indeed be great.
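
To make the idea concrete, here is a minimal sketch of collecting lang attributes from spans and divs. It assumes pandoc's JSON output represents a Span or Div as {"t": ..., "c": [[id, classes, key-value pairs], content]}; the helper collect_langs and the sample data are hypothetical, and the exact JSON layout may differ between pandoc versions.

```python
def collect_langs(node, found=None):
    """Recursively collect `lang` attribute values from Span and Div nodes.

    Assumes a Span or Div looks like
    {"t": "Span", "c": [[id, classes, [[key, value], ...]], content]}.
    Values are returned in order of first appearance, without duplicates.
    """
    if found is None:
        found = []
    if isinstance(node, dict):
        if node.get("t") in ("Span", "Div"):
            _ident, _classes, keyvals = node["c"][0]
            for key, value in keyvals:
                if key == "lang" and value not in found:
                    found.append(value)
        for child in node.values():
            collect_langs(child, found)
    elif isinstance(node, list):
        for child in node:
            collect_langs(child, found)
    return found


# Hand-written sample blocks, standing in for real pandoc JSON output.
sample = [
    {"t": "Para", "c": [
        {"t": "Str", "c": "Hello"},
        {"t": "Span", "c": [["", [], [["lang", "ar"]]],
                            [{"t": "Str", "c": "مرحبا"}]]},
    ]},
    {"t": "Div", "c": [["", [], [["lang", "fr-FR"]]], [
        {"t": "Para", "c": [{"t": "Str", "c": "Bonjour"}]},
    ]]},
]
print(collect_langs(sample))  # ['ar', 'fr-FR']
```

The resulting list is what a writer could hand to the template as the "other" languages, leaving lang free to name only the main one.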

As to markup, for basic testing I have just been using raw LaTeX commands such as \textarabic{} so far, but what I'd like pandoc to adopt is basically the HTML lang attribute, e.g., <span lang="ar"></span>, or for longer passages, <div lang="ar"></div>.

Contributor

ousia commented May 24, 2015

@nickbart1980, different language codes are only one part of the issue of languages in LaTeX.

The other part of the issue is that (it seems that) LaTeX needs to know all languages used in the document before reading the text source. Sorry, but this isn’t a requirement even in ConTeXt. This language list is why we would need three variables in YAML metadata to deal with multilanguage documents.

Sorry, but it makes no sense to me, mainly because pandoc is mimicking LaTeX, and LaTeX's way of dealing with languages is non-standard.

One of my documents would have to have the following language information lang: grc, it, fr, en, de, es.

In the ePub file, I get the following field:

<dc:language>grc, it, fr, en, de, es</dc:language>

calibre-2.28 (latest version released) cannot read this language information. And I wonder whether iBooks or any other ePub reader that enables hyphenation can read the list above. (http://dublincore.org/documents/usageguide/elements.shtml#language allows lists, but it doesn’t prescribe which element should be the main language.)

HTML conversion would include the following element:

<html xmlns="http://www.w3.org/1999/xhtml" lang="grc, it, fr, en, de, es" xml:lang="grc, it, fr, en, de, es">

Well, I would say this is wrong, since an HTML element is supposed to have a single language. This doesn’t prevent child or descendant elements from having different language values.

Adding a new variable, locale: en-US (this would make the fourth), would make sense on Windows (on Linux, and I guess on Mac OS X, the system already seems to provide a locale). But this wouldn’t replace the need for lang, since you may write documents in foreign languages, or your locale may even be in a different language from any of your document's languages.

The whole reason for having a language list in lang is LaTeX. If LaTeX behaves so differently, pandoc should build a list of used languages internally. There are only two options: either pandoc behaves like LaTeX by default, or pandoc provides LaTeX with the extra information it requires. The first option means that you need different sources when generating LaTeX and XML-based documents (the change is tiny, but you need it). The second option enables compilation from exactly the same source. I think that this is exactly what pandoc is about.

Language tags can be added with division and span elements. But this has the same problem as LaTeX language tagging: HTML tags only work with HTML output (which includes ePub), and LaTeX language markup only works with LaTeX.

I think we need a special attribute syntax for language (#895), but a special syntax for division and span elements needs to be implemented first (#168).

If you find it useful, I would ask you to comment on the issue about a special syntax for languages, so that there is a more accurate picture of the need for this feature.

HughP commented May 24, 2015

FWIW: <dc:language>grc, it, fr, en, de, es</dc:language> seems wrong to me. When I have used DC metadata I have always done something like the following:

<dc:language>grc</dc:language>
<dc:language>it</dc:language>
<dc:language>fr</dc:language>
<dc:language>en</dc:language>
<dc:language>de</dc:language>
<dc:language>es</dc:language>

Multiple instances of the same element, each containing a single language declaration, is also my understanding of the DC implementation in the ePub standard: http://www.idpf.org/epub/30/spec/epub30-publications.html#elemdef-opf-dclanguage
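
A minimal sketch of producing one element per language from a comma-separated value, in case it helps to see the difference spelled out. The function name dc_language_elements is hypothetical, and this is not how pandoc's ePub writer is actually structured.

```python
from xml.sax.saxutils import escape


def dc_language_elements(langs):
    """Return one <dc:language> line per language code."""
    return [
        f"<dc:language>{escape(code.strip())}</dc:language>"
        for code in langs
        if code.strip()
    ]


# Split the single comma-separated value into separate elements.
for line in dc_language_elements("grc, it, fr, en, de, es".split(",")):
    print(line)
```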

Author

njbart commented May 25, 2015

Let me try and keep this simple:

  1. I reported a bug, but if this provides the opportunity to introduce some general improvements, so much the better.
  2. The convention “mainlang is the last of a comma-separated list of languages in lang” is not such a bad solution – unless, that is, pandoc could start to figure out the “other” languages automatically from the document content, which I would welcome as a huge improvement.
  3. I think one variable for users to specify the “main” language of a document should be enough; we should get rid of the separate “locale” variable we currently use for setting the CSL locale. (If mainlang and otherlang are used in templates, that's a different thing.)
  4. I don't think we would have to wait for a consensus to emerge on a markdown-specific markup for spans and divs – the simple <span lang="fr-FR"></span>, or for longer passages, <div lang="fr-FR"></div> would be perfectly ok with me. (<span …> and <div …> are used in pandoc markdown a lot already anyway.)
  5. For the LaTeX writer, mapping <span lang="fr-FR"></span> and <div lang="fr-FR"></div> to \textfrench{} and \begin{french}…\end{french} should be straightforward (this is for polyglossia; the babel syntax differs, but macros in the template could probably take care of that).
  6. How information on “main” and “other” languages should be represented in HTML and ePub headers is a different question altogether (though @HughP seems to be right that the ePub specs call for separate elements).
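
The mapping in point 5 could be sketched as follows, for polyglossia only. The helpers wrap_span and wrap_div are hypothetical, they take the polyglossia language name as already resolved, and they do not escape LaTeX special characters in the text, which a real writer would have to do.

```python
def wrap_span(polyglossia_name: str, text: str) -> str:
    """Inline passage: \\text<lang>{...}."""
    return f"\\text{polyglossia_name}{{{text}}}"


def wrap_div(polyglossia_name: str, body: str) -> str:
    """Block passage: \\begin{<lang>} ... \\end{<lang>}."""
    return f"\\begin{{{polyglossia_name}}}\n{body}\n\\end{{{polyglossia_name}}}"


print(wrap_span("french", "Bonjour"))  # \textfrench{Bonjour}
print(wrap_div("french", "Bonjour tout le monde."))
```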


3 participants