Joachim Breitner's Homepage
More thoughts on a bootstrappable GHC
The bootstrappable builds project tries to find ways of building all our software from source, without relying on binary artifacts. A noble goal, and one that is often thwarted by languages with self-hosting compilers, like GHC: In order to build GHC, you need GHC. A Pull Request against nixpkgs, adding first steps of the bootstrapping pipeline, reminded me of the issue with GHC, which I have noted down some thoughts about before and I played around a bit more.
The most promising attempt to bootstrap GHC was done by rekado in 2017. He observed that Hugs is maybe the most recently maintained bootstrappable (since written in C) Haskell compiler, but noticed that “it cannot deal with mutually recursive module dependencies, which is a feature that even the earliest versions of GHC rely on. This means that running a variant of GHC inside of Hugs is not going to work without major changes.” He then tries to bootstrap another very old Haskell compiler (nhc) with Hugs, and makes good but incomplete progress.
This made me wonder: What if Hugs supported mutually recursive modules? Would that make a big difference? Anthony Clayden keeps advocating Hugs as a viable Haskell implementation, so maybe if that was the main blocker, then adding support to Hugs for that is probably not too hard (at least in a compile-the-strongly-connected-component-as-one-unit mode) and worthwhile?
All of GHC in one file?
That reminded me of a situation I was in before, where I had to combine multiple Haskell modules into one before: For my talk “Lock-step simulation is child’s play” I wrote a multi-player game, a simulation environment for it, and a presentation tool around it, all in the CodeWorld programming environment, which supports only a single module. So I hacked the a small tool hs-all-in-one that takes multiple Haskell modules and combines them into one, mangling the names to avoid name clashes.
This made me wonder: Can I turn all of GHC into one module, and compile that?
At this point I have probably left the direct path towards bootstrapping, but I kinda good hooked.
Using GHC’s
hadrian/ghci
tool, I got it to produce the necessary generated files (e.g. fromhappy
grammars) and spit out the lit of modules that make up GHC, which I could feed tohs-all-in-one
.It uses
haskell-src-exts
for parsing, and it was almost able to parse all of that. It has a different opinion about howMultiWayIf
should be indented, whetherEmptyCase
needs{}
and issues pretty-printing some promoted values, but otherwise the round-tripping worked fine, and I as able to generate a large file (680,000 loc, 41 MB) that passes GHC’s parser.It also uses
haskell-names
to resolve names.This library is less up-to-date with various Haskell features, so I added support for renaming in some pragmas (
ANN
,SPECIALIZE
), pattern signatures etc.For my previous use-case I could just combine all the imports, knowing that I would not introduce conflicts. For GHC, this is far from true: Some modules import
Data.Map.Strict
, othersData.Map.Lazy
, and yet others introduce names that clash with stuff imported from the prelude… so I had to change the tool to fully qualify all imported values. This isn’t so bad, I can do that usinghaskell-names
, if I somehow know what all the modules inbase
,containers
,transformers
andarray
export.The
haskell-names
library itself comes with a hard-coded database ofbase
exports, but it is incomplete and doesn’t help me with, say,containers
.I then wrote a little parser for the
.txt
files thathaddock
produces for the benefit ofhoogle
, and that are conveniently installed along the packages (at least on nix). This would have been great, if these files wouldn’t simply omit all reexported entities! I added some manual hacks (Data.Map
has the same exports asData.IntMap
;Prelude
exports all entities as known byhaskell-names
, but those that are also exported fromData.List
, use the symbol from there…)I played this game of whack-a-mole for a while, solving many of the problems that GHC’s renamer reports, but eventually stopped to write this blog post. I am fairly confident that this could be pulled through, though.
Back to bootstrapping
So what if we could pull this through? We’d have a very large code file that GHC may be able to compile to produce a ghc
binary without exhausting my RAM. But that doesn’t help with bootstrapping yet.
If lack of support for recursive modules is all that Hugs is missing, we’d be done indeed. But quite contrary, it is probably the least of our worries, given that contemporary GHC uses many many other features not supported by Hugs.
Some of them a syntactic and can easily be rewritten to more normal Haskell in a preprocessing step (e.g. MultiWayIf
).
Others are deep and hard (GADTs
, Pattern synonyms, Type Families), and prohibit attempting to compile a current version of GHC (even if its all one module) with Hugs. So one would certainly have to go back in time and find a version of GHC that is not yet using all these features. For example, the first use of GADTs was introduced by Simon Marlow in 2011, so this suggests going back to at least GHC 7.0.4, maybe earlier.
Still, being able to mangle the source code before passing it to Hugs is probably a useful thing. This poses the question whether Hugs can compile such a tool; in particular, is it capable of compiling haskell-src-exts
, which I am not too optimistic about either. Did someone check this already?
So one plan of attack could be
Identify an old version of GHC that
- One can use to bootstrap subsequent versions until today.
- Is old enough to use as few features not supported by hugs as possible.
- Is still new enough so that one can obtain a compatible toolchain.
Wrangle the build system to tell you which files to compile, with which preprocessor flags etc.
Boostrap all pre-processing tools used by GHC (
cpphs
or use plan cpp,happy
,alex
).For every language feature not supported by Hugs, either
- Implement it in Hugs,
- Manually edit the source code to avoid compiling the problematic code, if it is optional (e.g. in an optimization pass)
- Rewrite the problematic code
- Write a pre-processing tool (like the one above) that compiles the feature away
Similarly, since Hugs probably ships a
base
that is different than what GHC, or the libraries used by GHC expects, either adjust Hugs’base
, or modify the GHC code that uses it.
My actual plan, though, for now is to throw these thoughts out, maybe make some noise on Discourse, Mastodon, Twitter and lobste.rs, and then let it sit and hope someone else will pick it up.
Comments
Have something to say? You can post a comment by sending an e-Mail to me at <mail@joachim-breitner.de>, and I will include it here.
The current bootstrap chain of GHC reads in Guix:
which can be easily reduced to:
The version 7.8.4 is not yet bootstrapped with previous versions.
On the other hand, Ricardo Wurmus (rekado) did this chain in Guix:
Therefore, the current state is:
Would 4.08.2 be a good candidate for building it using Hugs?