\documentclass{farlamp}
% latexmk -pvc -pdfxe -bibtex -interaction=nonstopmode -outdir=build draft-basis.tex
\addbibresource{references.bib}
\subject{Draft Basis}
\title{What I need for planning the Farlamp draft}
\author{Richard Möhn}
\date{\today}
\addtitledatatopdf
\begin{document}
\maketitle
\RaggedRight
(For a project overview and a glossary, see the \FarlampRepo.)

The questions below are adapted from \textcite[p. 175]{CoR}. Some of the
section headings quote chapter titles from that book.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Readers}
\paragraph{Who are my readers?}
\begin{itemize}
\item The members of LessWrong and the AI Alignment Forum. Perhaps the members
of MIRIxDiscord.
\item Ideally conference or workshop attendees, or readers of a journal. I
don't know whether I can get my article into such circles, though.
\end{itemize}
\paragraph{What do they know?}
\begin{itemize}
\item Most know ML and CS better than I do.
\item They might not know about IDA.
\end{itemize}
\paragraph{Why should they care about my problem?}
IDA is a major approach to AI alignment. We don't know how reliable it is, and
therefore we don't know whether precautions are needed, or which ones. My
research would provide empirical evidence to help answer these questions.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Ethos}
\paragraph{What kind of ethos or character do I want to project?} From
\textcite[p. 119]{CoR}:
\begin{itemize}
\item ‘[support] claims with evidence that readers accept’
\item ‘[consider] issues from all sides’
\item ‘anticipate and address [readers'] questions and concerns’
\item ‘thoughtfully [consider] other points of view’
\item ‘acknowledge other views and explain [my] principles of reasoning in
warrants’
\end{itemize}
→ ‘give readers good reason to work \emph{with} [me] in developing and testing
new ideas’
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Question and answer}
\paragraph{Sketch my question and its answer in two or three sentences.}
Question: Can SupAmp or ReAmp stay reliable despite overseer failure?

Answer: To be determined.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Reasons and evidence}
Sketch the reasons and evidence supporting my claim. – TBD
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Acknowledgements and responses}
\begin{itemize}
\item What questions, alternatives and objections are my readers likely to
raise?
\item How do I respond to them?
\end{itemize}
\begin{itemize}
\item \cite{ChriRelAmp} talks about policies being capability-amplified.
That suggests that the policies are the assistants. So shouldn't the
project be about assistant failure rather than overseer failure?
\Response\ It depends on what ‘amplify’ means, which appears to vary. –
In \cite{ChriALBA} it sounds like the overseer is getting amplified, in
\cite{ChriCapAmp} it sounds like the assistant is getting amplified, and
in \cite{CotrIDA} both the overseer and the assistant are arguments to
the \code{Amplify} procedure.
In the end it doesn't matter for this project. The overseer is a simple
procedure that will fail if it gets wrong input from an assistant, so an
assistant failure shows up as an overseer failure either way. I might as
well stick with what I've written so far and situate failures in the
overseer.
\item \cite{ChriRelAmp} assumes assistants that are powerful enough to
negotiate with each other. In IDA generally, the first assistant is
trained to almost human level. (This might be different with
low-bandwidth overseers \parencite{SaunUndIDAClOv}; I haven't thought it
through.) So for distillation we need powerful learning algorithms,
which don't exist yet. And in \Overfail\ I predict that failure
tolerance depends on the learning algorithm. Won't the results of this
project then be irrelevant in a few years (months?), when we have
different learning algorithms?
\Response\ I'm not aiming to document the response of specific learning
algorithms to overseer failure. Rather, I'm testing the hypothesis that
failure tolerance depends on various parameters, such as learning
algorithm and regularization strength. And I'm testing the hypothesis
that only when the overseer failure rate exceeds a certain threshold
(the failure tolerance of a particular configuration) does the overall
failure rate blow up. These hypotheses should hold independently of the
learning algorithm used.
\end{itemize}
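
The second hypothesis in the last response can be written down a little more
formally. This is only a sketch: the symbols $p$, $c$, $F_c$, $\tau_c$ and
$\varepsilon$ are notation assumed here for illustration and don't come from
the cited sources. Let $p$ be the overseer failure rate, $c$ a configuration
(learning algorithm, regularization strength and so on) and $F_c(p)$ the
resulting overall failure rate. The hypothesis then says that for each $c$ there
is a failure tolerance $\tau_c$ and a small bound $\varepsilon$ such that
\[
  F_c(p) \le \varepsilon \quad \mbox{for } p \le \tau_c,
  \qquad
  F_c(p) \gg \varepsilon \quad \mbox{for } p > \tau_c.
\]
The first hypothesis says that $\tau_c$ varies with $c$. Whether such a
threshold exists at all, and what values $\tau_c$ and $\varepsilon$ take, is
what the experiments would have to show.
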
To be continued.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Warrants}
\begin{itemize}
\item When may my readers not see the relevance of a reason to a claim?
\item Can I state the warrant that connects them?
\end{itemize}
TBD
\begin{FlushLeft}
% Couldn't find out within reasonable time how to make the formatting of titles
% uniform.
\printbibliography
\end{FlushLeft}
\end{document}