Rtftohtml map for RFC2070/HTML4.0 conversion

Context (added 2003)

Before making use of this document, please consult the covering note on the index page for an explanation of its context.

(this note added 2003 to a programmatically-converted document from around 1998.)



Rtftohtml map for RFC2070/HTML4.0 conversion

Converts Microsoft Word (with Symbol fonts) to Unicode

Introduction

The use of ISO 10646/Unicode character representation in HTML was already foreseen by the HTML2.0 spec, RFC1866, and was reinforced by RFC2070, the standards-track RFC for the internationalization of HTML. Specialist browsers have supported it for some time, and version 4 of the popular browsers now supports it usefully. (HTML4.0 contains a lot that they haven’t implemented yet, but this, at least, they do support, so let’s use it.)
html-map has been updated to rtftohtml 3.8. I'm making both "pedantic" and "practical" versions; the practical version uses near equivalences where browser support for the pedantically correct Unicode character seems to be poor. See the TODO list for further discussion.
For HTML4.0 entities information: http://www.w3.org/TR/PR-html40-971107/sgml/entities.html .
For recommendations on mapping, see ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/ (use an FTP client if your web browser gives problems).
Please note that this is a working document, produced for my own purposes. Although it includes some material from reliable sources, it also establishes some character equivalences or substitutes that are purely for the convenience of this application. References are now generally to version 3.8 of rtftohtml; some differences in the 3.6 version are shown (3.6: in parentheses).
The ALT/0nnn column shows how to type these characters in on a PC (there might of course also be keyboard shortcuts for some of them). Don't omit the leading zero!
The Unicode character names are rather verbose - I have abbreviated some for convenience.
(*) - use normal ASCII/iso-8859-1 character (as duplicate or as approximation)
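
The "pedantic" and "practical" variants can be sketched as two lookup tables keyed by rtftohtml character name. This is a minimal illustration in Python, not the actual html-map file format: the dictionary names and the `render` function are hypothetical, and the entries are taken from rows that the tables mark "* or nnnn".

```python
# Hypothetical two-mode lookup; not the real rtftohtml map syntax.
# Pedantic: the exact Unicode character, as a numeric character reference.
PEDANTIC = {
    "ellipsis": "&#8230;",      # U+2026 horizontal ellipsis
    "mathasterisk": "&#8727;",  # U+2217 asterisk operator
    "mathminus": "&#8722;",     # U+2212 minus sign
    "mathtilde": "&#8764;",     # U+223C tilde operator
}

# Practical: the near-equivalent ASCII characters (the "*" alternative).
PRACTICAL = {
    "ellipsis": "...",
    "mathasterisk": "*",
    "mathminus": "-",
    "mathtilde": "~",
}


def render(name: str, pedantic: bool = True) -> str:
    """Return the HTML text for one character name in the chosen mode."""
    table = PEDANTIC if pedantic else PRACTICAL
    return table[name]


if __name__ == "__main__":
    for name in PEDANTIC:
        print(name, render(name), render(name, pedantic=False))
```

Keeping the two modes as parallel tables means the choice is made once per conversion run rather than per character.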

CP-1252 (hex 80-9f), normal font

| Specimen | HTML4.0 WD | rtftohtml name | CP1252 hex | ALT/0nnn | Unicode hex | Unicode decimal | Unicode name (some abbreviated) |
|---|---|---|---|---|---|---|---|
| | sbquo | quotesinglbase | 82 | 130 | 201a | 8218 | single low-9 quote |
| ƒ | fnof | florin | 83 | 131 | 0192 | 402 | small f with hook |
| | bdquo | quotedblbase | 84 | 132 | 201e | 8222 | double low-9 quote |
| | hellip | ellipsis | 85 | 133 | * or 2026 | * or 8230 | (...) horizontal ellipsis |
| | dagger | dagger | 86 | 134 | 2020 | 8224 | dagger |
| | Dagger | daggerdbl | 87 | 135 | 2021 | 8225 | double dagger |
| ˆ | circ | circumflex | 88 | 136 | 02c6 | 710 | modifier circumflex |
| | permil | perthousand | 89 | 137 | 2030 | 8240 | per mille |
| Š | Scaron | Sinvcircumflex | 8a | 138 | 0160 | 352 | capital S caron |
| | lsaquo | guilsinglleft | 8b | 139 | 2039 | 8249 | single left angle quote |
| Œ | OElig | OE | 8c | 140 | 0152 | 338 | capital ligature OE |
| | lsquo | quoteleft | 91 | 145 | 2018 | 8216 | left single quote |
| | rsquo | quoteright | 92 | 146 | 2019 | 8217 | right single quote |
| | ldquo | quotedblleft | 93 | 147 | 201c | 8220 | left double quote |
| | rdquo | quotedblright | 94 | 148 | 201d | 8221 | right double quote |
| | bull | bullet | 95 | 149 | 2022 | 8226 | bullet |
| | ndash | endash | 96 | 150 | 2013 | 8211 | en dash |
| | mdash | emdash | 97 | 151 | 2014 | 8212 | em dash |
| ˜ | tilde | tilde | 98 | 152 | 02dc | 732 | small tilde |
| | trade | trademark | 99 | 153 | 2122 | 8482 | trademark |
| š | scaron | sinvcircumflex | 9a | 154 | 0161 | 353 | small s caron |
| | rsaquo | guilsinglright | 9b | 155 | 203a | 8250 | single right angle quote |
| œ | oelig | oe | 9c | 156 | 0153 | 339 | small ligature OE |
| Ÿ | Yuml | Ydieresis | 9f | 159 | 0178 | 376 | capital Y dieresis |
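
The columns of the CP-1252 table are related mechanically, and the relationship can be checked with a short Python sketch. It relies on Python's built-in `cp1252` codec and the standard `html.entities.codepoint2name` table; the function name `describe_cp1252` and the dictionary keys are made up for this illustration and are not part of rtftohtml.

```python
from html.entities import codepoint2name


def describe_cp1252(byte: int) -> dict:
    """Return the table columns for one CP-1252 byte value.

    Raises ValueError if the codec leaves that position unassigned.
    """
    try:
        char = bytes([byte]).decode("cp1252")
    except UnicodeDecodeError as exc:
        raise ValueError(f"0x{byte:02x} is unassigned in CP-1252") from exc
    cp = ord(char)
    return {
        "cp1252_hex": f"{byte:02x}",
        "alt_code": f"0{byte}",             # ALT/0nnn, with the leading zero
        "unicode_hex": f"{cp:04x}",
        "unicode_decimal": cp,
        "entity": codepoint2name.get(cp),   # HTML 4 entity name, if any
        "numeric_reference": f"&#{cp};",
    }


if __name__ == "__main__":
    for b in (0x82, 0x85, 0x99, 0x9F):
        print(describe_cp1252(b))
```

Note that Python's codec reflects the current code page, which also assigns positions 80, 8e and 9e; those are not in the table above.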


Symbol font, hex 20-3f

| Specimen | HTML4.0 WD | Rtftohtml and [Adobe] names | Symbol font hex | ALT/0nnn | Unicode hex | Unicode decimal | Unicode name (some abbreviated) |
|---|---|---|---|---|---|---|---|
| ! | | exclam | 21 | 33 | * | * | |
| | forall | universal | 22 | 34 | 2200 | 8704 | for all |
| # | | mathnumbersign | 23 | 35 | * | * | |
| | exist | existential | 24 | 36 | 2203 | 8707 | there exists |
| % | | percent | 25 | 37 | * | * | |
| & | amp | ampersand | 26 | 38 | * | * | |
| | ni | suchthat | 27 | 39 | 220b | 8715 | contains as member |
| ( | | parenleft | 28 | 40 | * | * | |
| ) | | parenright | 29 | 41 | * | * | |
| | lowast | mathasterisk [asteriskmath] | 2a | 42 | * or 2217 | * or 8727 | asterisk operator (or use asterisk) |
| + | | mathplus | 2b | 43 | * | * | (use ASCII plus) |
| , | | comma | 2c | 44 | * | * | |
| | minus | mathminus [minus] | 2d | 45 | * or 2212 | * or 8722 | minus sign (or use hyphen) |
| | | ( - ) | 2e-3f | 46-63 | * | * | (as corresponding ASCII codes) |


Symbol font, hex 40-5f

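| Specimen | HTML4.0 WD | Rtftohtml and [Adobe] names | Symbol font hex | ALT/0nnn | Unicode hex | Unicode decimal | Unicode name (some abbreviated) |
|---|---|---|---|---|---|---|---|
| | cong | congruent | 40 | 64 | 2245 | 8773 | approx equal to |
| Α | Alpha | Alpha | 41 | 65 | 0391 | 913 | Alpha |
| Β | Beta | Beta | 42 | 66 | 0392 | 914 | Beta |
| Χ | Chi | Chi | 43 | 67 | 03a7 | 935 | Chi |
| Δ | Delta | Delta | 44 | 68 | 0394 | 916 | Delta |
| Ε | Epsilon | Epsilon | 45 | 69 | 0395 | 917 | Epsilon |
| Φ | Phi | Phi | 46 | 70 | 03a6 | 934 | Phi |
| Γ | Gamma | Gamma | 47 | 71 | 0393 | 915 | Gamma |
| Η | Eta | Eta | 48 | 72 | 0397 | 919 | Eta |
| Ι | Iota | Iota | 49 | 73 | 0399 | 921 | Iota |
| ϑ | thetasym | GreekJ [theta1] | 4a | 74 | 03d1 | 977 | script theta |
| Κ | Kappa | Kappa | 4b | 75 | 039a | 922 | Kappa |
| Λ | Lambda | Lambda | 4c | 76 | 039b | 923 | Lambda |
| Μ | Mu | Mu | 4d | 77 | 039c | 924 | Mu |
| Ν | Nu | Nu | 4e | 78 | 039d | 925 | Nu |
| Ο | Omicron | Omicron | 4f | 79 | 039f | 927 | Omicron |
| Π | Pi | Pi | 50 | 80 | 03a0 | 928 | Pi |
| Θ | Theta | Theta | 51 | 81 | 0398 | 920 | Theta |
| Ρ | Rho | Rho | 52 | 82 | 03a1 | 929 | Rho |
| Σ | Sigma | Sigma | 53 | 83 | 03a3 | 931 | Sigma |
| Τ | Tau | Tau | 54 | 84 | 03a4 | 932 | Tau |
| Υ | Upsilon | Upsilon | 55 | 85 | 03a5 | 933 | Upsilon |
| ς | sigmaf | varsigma [sigma1] | 56 | 86 | 03c2 | 962 | final sigma |
| Ω | Omega | Omega | 57 | 87 | 03a9 | 937 | Omega |
| Ξ | Xi | Xi | 58 | 88 | 039e | 926 | Xi |
| Ψ | Psi | Psi | 59 | 89 | 03a8 | 936 | Psi |
| Ζ | Zeta | Zeta | 5a | 90 | 0396 | 918 | Zeta |
| [ | | bracketleft | 5b | 91 | * | * | [ |
| | there4 | therefore | 5c | 92 | 2234 | 8756 | therefore |
| ] | | bracketright | 5d | 93 | * | * | ] |
| | perp | invtee [perpendicular] | 5e | 94 | 22a5 | 8869 | uptack |
| _ | | underscore | 5f | 95 | * | * | _ |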


Symbol font, hex 60-7f

| Specimen | HTML4.0 WD | Rtftohtml and [Adobe] names | Symbol font hex | ALT/0nnn | Unicode hex | Unicode decimal | Unicode name (some abbreviated) |
|---|---|---|---|---|---|---|---|
| | oline | uppscore [radicalex] | 60 | 96 | * or 203e | *175 or 8254 | spacing overscore (use macron) |
| α | alpha | alpha | 61 | 97 | 03b1 | 945 | alpha |
| β | beta | beta | 62 | 98 | 03b2 | 946 | beta |
| χ | chi | chi | 63 | 99 | 03c7 | 967 | chi |
| δ | delta | delta | 64 | 100 | 03b4 | 948 | delta |
| ε | epsilon | epsilon | 65 | 101 | 03b5 | 949 | epsilon |
| φ | phi | phi | 66 | 102 | 03c6 | 966 | phi |
| γ | gamma | gamma | 67 | 103 | 03b3 | 947 | gamma |
| η | eta | eta | 68 | 104 | 03b7 | 951 | eta |
| ι | iota | iota | 69 | 105 | 03b9 | 953 | iota |
| ϕ | (?) | script_phi [phi1] (3.6: Greekj) | 6a | 106 | 03d5 | 981 | script phi |
| κ | kappa | kappa | 6b | 107 | 03ba | 954 | kappa |
| λ | lambda | lambda | 6c | 108 | 03bb | 955 | lambda |
| μ | mu | mu | 6d | 109 | 03bc | 956 | mu (cf. micro in Latin-1) |
| ν | nu | nu | 6e | 110 | 03bd | 957 | nu |
| ο | omicron | omicron | 6f | 111 | 03bf | 959 | omicron |
| π | pi | pi | 70 | 112 | 03c0 | 960 | pi |
| θ | theta | theta | 71 | 113 | 03b8 | 952 | theta |
| ρ | rho | rho | 72 | 114 | 03c1 | 961 | rho |
| σ | sigma | sigma | 73 | 115 | 03c3 | 963 | sigma |
| τ | tau | tau | 74 | 116 | 03c4 | 964 | tau |
| υ | upsilon | upsilon | 75 | 117 | 03c5 | 965 | upsilon |
| ϖ | piv | omegahat [omega1] | 76 | 118 | 03d6 | 982 | omega pi |
| ω | omega | omega | 77 | 119 | 03c9 | 969 | omega |
| ξ | xi | xi | 78 | 120 | 03be | 958 | xi |
| ψ | psi | psi | 79 | 121 | 03c8 | 968 | psi |
| ζ | zeta | zeta | 7a | 122 | 03b6 | 950 | zeta |
| { | | braceleft | 7b | 123 | * | * | { |
| \| | | bar | 7c | 124 | * | * | some kind of bar |
| } | | braceright | 7d | 125 | * | * | } |
| | sim | mathtilde [similar] | 7e | 126 | * or 223c | * or 8764 | tilde operator (or use ASCII tilde) |


Symbol font, hex a0-bf

(the no-break space is shown in brackets for clarity)

| Specimen | HTML4.0 WD | Rtftohtml and [Adobe] names | Symbol font hex | ALT/0nnn | Unicode hex | Unicode decimal | Unicode name (some abbreviated) |
|---|---|---|---|---|---|---|---|
| [ ] | nbsp | nobrkspace | a0 | 160 | * | * | |
| ϒ | upsih | hammer [Upsilon1] | a1 | 161 | 03d2 | 978 | Upsilon hook |
| | prime | minute | a2 | 162 | 2032 | 8242 | prime |
| | le | lessequal | a3 | 163 | 2264 | 8804 | less than or equal |
| | frasl | fraction | a4 | 164 | 2044 (or 2215 or *) | 8260 (or 8725 or /) | fraction slash (or division slash or /) |
| | infin | infinity | a5 | 165 | 221e | 8734 | infinity |
| ƒ | fnof | florin | a6 | 166 | 0192 | 402 | latin small script f |
| | clubs | club | a7 | 167 | 2663 | 9827 | black club |
| | diams | diamond | a8 | 168 | 2666 | 9830 | black diamond |
| | hearts | heart | a9 | 169 | 2665 | 9829 | black heart |
| | spades | spade | aa | 170 | 2660 | 9824 | black spade |
| | harr | arrowboth | ab | 171 | 2194 | 8596 | left right arrow |
| | larr | arrowleft | ac | 172 | 2190 | 8592 | left arrow |
| | uarr | arrowup | ad | 173 | 2191 | 8593 | up arrow |
| | rarr | arrowright | ae | 174 | 2192 | 8594 | right arrow |
| | darr | arrowdown | af | 175 | 2193 | 8595 | down arrow |
| ° | deg | degree | b0 | 176 | * | *176 | |
| ± | plusmn | plusminus | b1 | 177 | * | * | |
| | Prime | second | b2 | 178 | 2033 or * | 8243 | double prime |
| | ge | greaterequal | b3 | 179 | 2265 | 8805 | greater than or equal |
| × | times | multiply | b4 | 180 | * | *215 | |
| | prop | proportional | b5 | 181 | 221d | 8733 | proportional to |
| | part | partialdiff | b6 | 182 | 2202 | 8706 | partial differential |
| | bull | bullet | b7 | 183 | 2022 | 8226 | bullet |
| ÷ | divide | divide | b8 | 184 | * | *247 | |
| | ne | notequal | b9 | 185 | 2260 | 8800 | not equal to |
| | equiv | equivalence | ba | 186 | 2261 | 8801 | identical to |
| | asymp | approxequal | bb | 187 | 2248 | 8776 | almost equal to |
| | hellip | ellipsis | bc | 188 | * or 2026 | * or 8230 | (three ASCII dots) |
| \| | | arrowvertex | bd | 189 | * | * | (vertical bar) |
| - | | arrowhorizex | be | 190 | * | * | (minus sign) |
| | crarr | carriagereturnmark | bf | 191 | 21b5 | 8629 | down arrow with corner left |


Symbol font, hex c0-df

| Specimen | HTML4.0 WD | Rtftohtml and [Adobe] names | Symbol font hex | ALT/0nnn | Unicode hex | Unicode decimal | Unicode name (some abbreviated) |
|---|---|---|---|---|---|---|---|
| | alefsym | aleph | c0 | 192 | 2135 | 8501 | first transfinite cardinal |
| | image | Ifraktur | c1 | 193 | 2111 | 8465 | black-letter I |
| | real | Rfraktur | c2 | 194 | 211c | 8476 | black-letter R |
| | weierp | weierstrass | c3 | 195 | 2118 | 8472 | script P |
| | otimes | circlemultiply | c4 | 196 | 2297 | 8855 | circled times |
| | oplus | circleplus | c5 | 197 | 2295 | 8853 | circled plus |
| | empty | emptyset | c6 | 198 | 2205 | 8709 | empty set |
| | cap | intersection | c7 | 199 | 2229 | 8745 | intersection |
| | cup | union | c8 | 200 | 222a | 8746 | union |
| | sup | propersuperset | c9 | 201 | 2283 | 8835 | superset of |
| | supe | reflexsuperset (3.6:superset) | ca | 202 | 2287 | 8839 | superset of or equal |
| | nsub | notsubset | cb | 203 | 2284 | 8836 | not a subset of |
| | sub | propersubset | cc | 204 | 2282 | 8834 | subset of |
| | sube | subset, should be reflexsubset | cd | 205 | should be 2286 | 8838 | subset of or equal |
| | isin | element (3.6:wrong) | ce | 206 | 2208 | 8712 | element of |
| | notin | notelement (3.6:wrong) | cf | 207 | 2209 | 8713 | not an element of |
| | ang | angle | d0 | 208 | 2220 | 8736 | angle |
| | nabla | triangle | d1 | 209 | 2207 | 8711 | nabla |
| ® | reg | registered | d2 | 210 | * | *174 | |
| © | copy | copyright | d3 | 211 | * | *169 | |
| | trade | trademark, trademarkserif | d4 | 212 | 2122 | 8482 | trademark |
| | prod | product | d5 | 213 | 220f | 8719 | n-ary product |
| | radic | radical | d6 | 214 | 221a | 8730 | square root |
| | sdot | mathdot (3.6: periodcentered) | d7 | 215 | * or 22c5 | *183 or 8901 | dot operator |
| ¬ | *not | logicalnot | d8 | 216 | * | *172 | |
| | and | logicaland | d9 | 217 | 2227 | 8743 | logical and |
| | or | logicalor | da | 218 | 2228 | 8744 | logical or |
| | hArr | arrowdblboth | db | 219 | 21d4 | 8660 | left right double arrow |
| | lArr | arrowdblleft | dc | 220 | 21d0 | 8656 | left double arrow |
| | uArr | arrowdblup | dd | 221 | 21d1 | 8657 | up double arrow |
| | rArr | arrowdblright | de | 222 | 21d2 | 8658 | right double arrow |
| | dArr | arrowdbldown | df | 223 | 21d3 | 8659 | down double arrow |


Symbol font, hex e0-ff

| Specimen | HTML4.0 WD | Rtftohtml and [Adobe] names | Symbol font hex | ALT/0nnn | Unicode hex | Unicode decimal | Unicode name (some abbreviated) |
|---|---|---|---|---|---|---|---|
| | loz | lozenge | e0 | 224 | 25ca | 9674 | lozenge |
| | lang | angleleft | e1 | 225 | 2329 | 9001 | left pntng angle brckt, bra |
| ® | *reg | registersans | e2 | 226 | * | *174 | registered (approx) |
| © | *copy | copyrightsans | e3 | 227 | * | *169 | copyright (approx) |
| | *trade | trademarksans | e4 | 228 | 2122 | 8482 | trademark (approx) |
| | sum | summation (3.6:Sigma) | e5 | 229 | 2211 | 8721 | n-ary summation |
| [lparentop] | | lparentop | e6 | 230 | - | | |
| [lparenmid] | | lparenmid | e7 | 231 | - | | |
| [lparenbot] | | lparenbot | e8 | 232 | - | | |
| | lceil | lbracktop | e9 | 233 | 2308 | 8968 | left ceiling |
| [lbrackmid] | | lbrackmid | ea | 234 | - | | |
| | lfloor | lbrackbot | eb | 235 | 230a | 8970 | left floor |
| [lbracetop] | | lbracetop | ec | 236 | - | | |
| [lbracemid] | | lbracemid | ed | 237 | - | | |
| [lbracebot] | | lbracebot | ee | 238 | - | | |
| \| | | longbar [bracex] | ef | 239 | * | * | (kludge) |
| | rang | angleright | f1 | 241 | 232a | 9002 | right pntng angle brckt, ket |
| | int | integral | f2 | 242 | 222b | 8747 | integral |
| | | integraltop | f3 | 243 | 2320 | 8992 | top half integral |
| [integralmid] | | integralmid | f4 | 244 | - | | |
| | | integralbot | f5 | 245 | 2321 | 8993 | bottom half integral |
| [rparentop] | | rparentop | f6 | 246 | - | | |
| [rparenmid] | | rparenmid | f7 | 247 | - | | |
| [rparenbot] | | rparenbot | f8 | 248 | - | | |
| | rceil | rbracktop | f9 | 249 | 2309 | 8969 | right ceiling |
| [rbrackmid] | | rbrackmid | fa | 250 | - | | |
| | rfloor | rbrackbot | fb | 251 | 230b | 8971 | right floor |
| [rbracetop] | | rbracetop | fc | 252 | - | | |
| [rbracemid] | | rbracemid | fd | 253 | - | | |
| [rbracebot] | | rbracebot | fe | 254 | - | | |












Relevant Mac codes

(Mac code specimens will only show in the HTML version)

| Specimen | HTML4.0 WD | Rtftohtml and [Adobe] names | Mac code | (n/a) | Unicode hex | Unicode decimal | Unicode name (some abbreviated) |
|---|---|---|---|---|---|---|---|
| | | apple | f0 | | none (2318) | none (8984) | (just a suggestion: place of interest sign) |
| ı | | dotlessi | f5 | | 0131 | 305 | Latin small dotless I |
| ˜ | | tilde | f7 | | 02dc | 732 | small tilde |
| ˘ | | breve | f9 | | 02d8 | 728 | breve |
| ˙ | | dotaccent | fa | | 02d9 | 729 | dot above |
| ˚ | | ring | fb | | 02da | 730 | ring above |
| ˝ | | hungarumlaut | fd | | 02dd | 733 | spacing double acute |
| ˛ | | ogonek | fe | | 02db | 731 | ogonek |
| ˇ | | caron | ff | | 02c7 | 711 | caron |


Notes

One whole section of the Symbol font duplicates the digits and some common punctuation characters, and no attempt was made to keep those distinct from their ASCII equivalents.
html-map treats the serif and sans-serif versions of registered, copyright, and trademark as equivalent.
If you are reading this as an HTML document, then it would have been converted from RTF using rtftohtml and one of the character map files which it describes.
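
The two notes above (ASCII duplicates passed through, serif and sans-serif marks merged) can be illustrated with a small Python sketch. It covers only a handful of Symbol font positions copied from the tables; the names `SYMBOL_TO_UNICODE` and `symbol_to_text` are invented for this example and say nothing about how rtftohtml itself is implemented.

```python
# Illustrative subset only; values are copied from the Symbol font tables.
SYMBOL_TO_UNICODE = {
    0x22: 0x2200,  # universal -> for all
    0x24: 0x2203,  # existential -> there exists
    0x61: 0x03B1,  # alpha
    0xD2: 0x00AE,  # registered (serif)
    0xE2: 0x00AE,  # registersans, treated as the same character
    0xD3: 0x00A9,  # copyright (serif)
    0xE3: 0x00A9,  # copyrightsans, treated as the same character
    0xD4: 0x2122,  # trademarkserif
    0xE4: 0x2122,  # trademarksans, treated as the same character
}


def symbol_to_text(data: bytes) -> str:
    """Convert Symbol-font bytes to a Unicode string (partial sketch)."""
    out = []
    for b in data:
        if 0x2E <= b <= 0x3F:
            # Digits and common punctuation duplicate their ASCII codes.
            out.append(chr(b))
        elif b in SYMBOL_TO_UNICODE:
            out.append(chr(SYMBOL_TO_UNICODE[b]))
        else:
            # Position not covered by this sketch.
            out.append("\ufffd")
    return "".join(out)


if __name__ == "__main__":
    print(symbol_to_text(bytes([0x22, 0x61, 0x31, 0xE4])))
```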

TODO

• rtftohtml3.8 had "reflexsubset" correctly mapped to "⊆", but the *-sym code maps still had 0xcd mapped to "subset". Either correcting 0xcd to "reflexsubset", or amending "subset" from ⊂ to "⊆", gets the right end-result. Making both changes would seem best (I reported this error to the author but it doesn’t seem to have made it into the released version).
• The normal "phi", Symbol font position 0x66, is translated into the Unicode character x03c6, which all the documentation (Unicode, HTML+, HTML4.0 etc.) shows to be correct, and the script_phi (a.k.a. phiv or phi1), Symbol font position 0x6a, into x03d5, again as shown in the documentation. However, using their respective fonts, Alis Tango 3.1 showed the characters apparently interchanged, and Win Netscape 4.01a shows what looks to be the script phi in the first position, and the standard missing-char filler box in the second position. F. Yergeau of Alis sent me a nice email saying there was an error in the Tango character maps; he also explained that the script-looking phi displayed by Netscape is in fact the normal "Times New Roman" upright phi, in spite of its appearance, and why Netscape is unable to display the script phi from the Symbol font.
• The "apple" character (in the Mac character code) seems not to exist in Unicode, though the vendor's mapping uses U+F8FF. Some informants feel that the "clover" sign, as used on many Mac keyboards, would be an acceptable substitute. The corresponding Unicode character is the "place of interest sign", U+2318, and the "Jargon File" offers an explanation for how this came about; see e.g. http://www.comedia.com/Hot/jargon_3.0/JARGON_F/FEATUKEY.HTML
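
The first TODO item describes a two-stage lookup (font code to character name, then name to HTML entity), and the bug is easiest to see in that form. The Python model below is a hypothetical in-memory stand-in, not the real rtftohtml map file syntax; the dictionary and function names are invented, and only the positions 0xcc and 0xcd are shown.

```python
# Hypothetical model of the two-stage map; not the rtftohtml file format.
# Stage 1: Symbol font code -> character name.
CODE_TO_NAME_SHIPPED = {0xCC: "propersubset", 0xCD: "subset"}
CODE_TO_NAME_FIXED = {0xCC: "propersubset", 0xCD: "reflexsubset"}

# Stage 2: character name -> HTML entity.
NAME_TO_ENTITY_SHIPPED = {
    "propersubset": "&sub;",
    "subset": "&sub;",        # wrong end result for 0xcd
    "reflexsubset": "&sube;",
}
# Alternative fix: leave stage 1 alone and amend "subset" instead.
NAME_TO_ENTITY_AMENDED = dict(NAME_TO_ENTITY_SHIPPED, subset="&sube;")


def convert(code: int, code_to_name: dict, name_to_entity: dict) -> str:
    """Run one font code through both lookup stages."""
    return name_to_entity[code_to_name[code]]


if __name__ == "__main__":
    print(convert(0xCD, CODE_TO_NAME_SHIPPED, NAME_TO_ENTITY_SHIPPED))
    print(convert(0xCD, CODE_TO_NAME_FIXED, NAME_TO_ENTITY_SHIPPED))
    print(convert(0xCD, CODE_TO_NAME_SHIPPED, NAME_TO_ENTITY_AMENDED))
```

Either fix alone yields `&sube;` for 0xcd; applying both keeps the two stages consistent with each other.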

Changes 3.6 to 3.8

In order to benefit from the corrections in 3.8 one must use the 3.8 code maps, although for backwards compatibility a self-consistent set of old map files would still work (subject to these limitations).
• At 3.6 there were only two names for vertical bars - bar and longbar - but there are three different characters (vertical bar, broken vertical bar, and the symbol font's "bracex"); the *-gen mappings assigned the name "bar" to both the vertical bar and the broken vertical bar, and the *-sym mappings used the name "longbar". 3.8 introduced “brvbar”, with appropriate adjustments.
• At 3.6 the name "periodcentered" was used for both the middot and the mathematical dot operator: 3.8 introduced “mathdot”.
• At 3.6 there were errors in the naming of four of the set-theory operators. In order to get these into line with the documented Adobe names, 3.8 added the names "reflexsuperset", "element" and "notelement", with appropriate adjustments.
• At 3.6 the iso-8859-1 "micro" sign and the Greek lower-case "mu" were both mapped to the character name "mu" which in turn is mapped to the iso-8859-1 "micro". Pedantically this isn't correct: they are two different characters in their own right, in spite of appearances. 3.8 corrected this.
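
The micro/mu distinction in the last item can be confirmed with Python's standard `unicodedata` module. This is only a check of the Unicode facts, not a description of the rtftohtml maps; the constant and function names are chosen for this example.

```python
import unicodedata

MICRO = "\u00b5"  # iso-8859-1 micro sign
MU = "\u03bc"     # Greek small letter mu


def in_latin1(char: str) -> bool:
    """Report whether a character can be encoded in iso-8859-1."""
    try:
        char.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for ch in (MICRO, MU):
        print(f"U+{ord(ch):04X}", unicodedata.name(ch), in_latin1(ch))
    # Compatibility normalization folds the micro sign into mu,
    # which is the "in spite of appearances" relationship.
    print(unicodedata.normalize("NFKC", MICRO) == MU)
```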

Other References

For historical background, see the splendid material produced as an HTML+ discussion document from 1994 (no author given, but the document names Bob Stayton for doing the research). My CP1252 equivalences were originally drawn from a table by Markus Kuhn that is quoted in a handy web page at http://www.pemberley.com/janeinfo/latin1.html, although this information too can be found in the W3C working draft. Everything has been checked as far as possible against authoritative sources, principally those cited above.

Endnote

"Q: How do you make an em dash? - A: set fire to its serif!" (from a usenet posting by Lars Eighner)