
\documentclass{farlamp}

% latexmk -pvc -pdfxe -bibtex -interaction=nonstopmode -outdir=build draft-basis.tex

\addbibresource{references.bib}

\subject{Draft Basis}
\title{What I need for planning the Farlamp draft}
\author{Richard Möhn}
\date{\today}
\addtitledatatopdf

\begin{document}
\maketitle
\RaggedRight

(For a project overview and a glossary, see the \FarlampRepo.)

The questions below are adapted from \textcite[p. 175]{CoR}. Some of the section
headings quote the chapter titles in that book.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Readers}

\paragraph{Who are my readers?}

\begin{itemize}
\item The members of LessWrong and the AI Alignment Forum. Perhaps the members
    of MIRIxDiscord.
\item Ideally conference or workshop attendees, or readers of a journal. I
    don't know if I can get my article into such circles, though.
\end{itemize}

\paragraph{What do they know?}

\begin{itemize}
\item Most know ML and CS better than I.
\item They might not know about IDA.
\end{itemize}

\paragraph{Why should they care about my problem?}

IDA is a major approach to AI alignment. We don't know how reliable it is, and
therefore we don't know whether precautions are needed, or which ones. My
research would provide empirical evidence to help answer these questions.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Ethos}

\paragraph{What kind of ethos or character do I want to project?} From
\textcite[p. 119]{CoR}:

\begin{itemize}
\item ‘[support] claims with evidence that readers accept’
\item ‘[consider] issues from all sides’
\item ‘anticipate and address [readers'] questions and concerns’
\item ‘thoughtfully [consider] other points of view’
\item ‘acknowledge other views and explain [my] principles of reasoning in
    warrants’
\end{itemize}

→ ‘give readers good reason to work \emph{with} [me] in developing and testing
new ideas’


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Question and answer}

\paragraph{Sketch my question and its answer in two or three sentences.}
Question: Can SupAmp or ReAmp stay reliable despite overseer failure?
The answer is to be determined.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Reasons and evidence}

Sketch the reasons and evidence supporting my claim. – TBD


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Acknowledgements and responses}

\begin{itemize}
\item What questions, alternatives and objections are my readers likely to
    raise?
\item How do I respond to them?
\end{itemize}

\begin{itemize}
    \item \cite{ChriRelAmp} talks about policies being capability-amplified.
        That makes it sound as if the policies are the assistants. So shouldn't
        the project be about assistant failure, rather than overseer failure?

        \Response\ It depends on what ‘amplify’ means, which appears to vary:
        in \cite{ChriALBA} it sounds like the overseer is getting amplified, in
        \cite{ChriCapAmp} it sounds like the assistant is getting amplified, and
        in \cite{CotrIDA} both the overseer and the assistant are arguments to
        the \code{Amplify} procedure.

        In the end it doesn't matter for this project. The overseer is a simple
        procedure that will fail if it gets wrong input from an assistant. So I
        might as well stick with what I've written so far and situate failures
        in the overseer.

    \item \cite{ChriRelAmp} assumes assistants that are powerful enough to be
        able to negotiate with each other. In IDA generally the first assistant
        is trained to almost human level. (This might be different with
        low-bandwidth overseers \parencite{SaunUndIDAClOv}; I haven't thought it
        through.) So for distillation we need powerful learning algorithms,
        which don't exist yet. And in \Overfail\ I predict that failure
        tolerance depends on the learning algorithm. Then won't the results of
        this project be irrelevant in a few years (months?), when we have
        different learning algorithms?

        \Response\ I'm not aiming to document the response of specific learning
        algorithms to overseer failure. Rather, I'm testing the hypothesis that
        failure tolerance depends on various parameters, such as learning
        algorithm and regularization strength. And I'm testing the hypothesis
        that only when the overseer failure rate exceeds a certain threshold
        (the failure tolerance of a particular configuration) does the overall
        failure rate blow up. These hypotheses should hold independently of the
        learning algorithm used.
\end{itemize}
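
The threshold hypothesis in the last response can be written down a bit more
precisely. This is only a sketch: the symbols $p$, $c$, $F_c$ and $\tau_c$ are
shorthand I'm introducing here, not notation from the cited posts. Let $p$ be
the overseer failure rate, $c$ a configuration (learning algorithm,
regularization strength and so on), $F_c(p)$ the overall failure rate under
that configuration, and $\tau_c$ the failure tolerance of $c$. The hypothesis
would then read roughly:
\[
    F_c(p) \approx F_c(0) \quad \mbox{for } p \le \tau_c,
    \qquad
    F_c(p) \gg F_c(0) \quad \mbox{for } p > \tau_c .
\]
What exactly counts as ‘blowing up’, and how sharp the transition at $\tau_c$
is, remains to be determined by the experiments.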

To be continued.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Warrants}

\begin{itemize}
\item When may my readers not see the relevance of a reason to a claim?
\item Can I state the warrant that connects them?
\end{itemize}

TBD


\begin{FlushLeft}
% Couldn't find out within reasonable time how to make the formatting of titles
% uniform.
\printbibliography
\end{FlushLeft}
\end{document}