upload android base code part3

2018-08-08 16:48:17 +08:00 · 2018-08-08 16:48:17 +08:00 · b9e30e05b1
commit b9e30e05b1
parent 71b83c22f1
15122 changed files with 2089659 additions and 0 deletions
--- a/android/build/kati/INTERNALS.md
+++ b/android/build/kati/INTERNALS.md
@ -0,0 +1,548 @@
+Kati internals
+==============
+
+This is an informal document about internals of kati. This document is not meant
+to be a comprehensive document of kati or GNU make. This explains some random
+topics which other programmers may be interested in.
+
+Motivation
+----------
+
+The motivation of kati was to speed up Android platform build. Especially, its
+incremental build time was the main focus. Android platform's build system is a
+very unique system. It provides a DSL, (ab)using Turing-completeness of GNU
+make. The DSL allows developers to write build rules in a descriptive way, but
+the downside is it's complicated and slow.
+
+When we say a build system is slow, we consider "null build" and "full
+build". Null build is a build which does nothing, because all output files are
+already up-to-date. Full build is a build which builds everything, because there
+were nothing which have been already built. Actual builds in daily development
+are somewhere between null build and full build. Most benchmarks below were done
+for null build.
+
+For Android with my fairly beefy workstation, null build took ~100 secs with GNU
+make. This means you needed to wait ~100 secs to see if there's a compile error
+when you changed a single C file. To be fair, things were not that bad. There
+are tools called mm/mmm. They allow developers to build an individual module. As
+they ignore dependencies between modules, they are fast. However, you need to be
+somewhat experienced to use them properly. You should know which modules will be
+affected by your change. It would be nicer if you can just type "make" whenever
+you change something.
+
+This is why we started this project. We decided to create a GNU make clone from
+scratch, but there were some other options. One option was to replace all
+Android.mk by files with a better format. There is actually a longer-term
+project for this. Kati was planned to be a short-term project. Another option
+was to hack GNU make instead of developing a clone. We didn't take this option
+because we thought the source code of GNU make is somewhat complicated due to
+historical reason. It's written in old-style C, has a lot of ifdefs for some
+unknown architectures, etc.
+
+Currently, kati's main mode is --ninja mode. Instead of executing build commands
+by itself, kati generates build.ninja file and
+[ninja](https://github.com/martine/ninja) actually runs commands. There were
+some back-and-forths before kati became the current form. Some experiments
+succeeded and some others failed. We even changed the language for kati. At
+first, we wrote kati in Go. We naively expected we can get enough performance
+with Go. I guessed at least one of the following statements are true: 1. GNU
+make is not very optimized for computation heavy Makefiles, 2. Go is fast for
+our purpose, or 3. we can come up with some optimization tricks for Android's
+build system. As for 3, some of such optimization succeeded but it's performance
+gain didn't cancel the slowness of Go.
+
+Go's performance would be somewhat interesting topic. I didn't study the
+performance difference in detail, but it seemed both our use of Go and Go
+language itself were making the Go version of kati slower. As for our fault, I
+think Go version has more unnecessary string allocations than C++ version
+has. As for Go itself, it seemed GC was the main show-stopper. For example,
+Android's build system defines about one million make variables, and buffers for
+them will be never freed. IIRC, this kind of allocation pattern isn't good for
+non-generational GC.
+
+Go version and test cases were written by ukai and me, and C++ rewrite was done
+mostly by me. The rest of this document is mostly about the C++ version.
+
+Overall architecture
+--------------------
+
+Kati consists of the following components:
+
+* Parser
+* Evaluator
+* Dependency builder
+* Executor
+* Ninja generator
+
+A Makefile has some statements which consist of zero or more expressions. There
+are two parsers and two evaluators - one for statements and the other for
+expressions.
+
+Most of users of GNU make may not care about the evaluator much. However, GNU
+make's evaluator is very powerful and is Turing-complete. For Android's null
+build, most time is spent in this phase. Other tasks, such as building
+dependency graphs and calling stat function for build targets, are not the
+bottleneck. This would be a very Android specific characteristics. Android's
+build system uses a lot of GNU make black magics.
+
+The evaluator outputs a list of build rules and a variable table. The dependency
+builder creates a dependency graph from the list of build rules. Note this step
+doesn't use the variable table.
+
+Then either executor or ninja generator will be used. Either way, kati runs its
+evaluator again for command lines. The variable table is used again for this
+step.
+
+We'll look at each components closely. GNU make is a somewhat different language
+from modern languages. Let's see.
+
+Parser for statements
+---------------------
+
+I'm not 100% sure, but I think GNU make parses and evaluates Makefiles
+simultaneously, but kati has two phases for parsing and evaluation. The reason
+of this design is for performance. For Android build, kati (or GNU make) needs
+to read ~3k files ~50k times. The file which is read most often is read ~5k
+times. It's waste of time to parse such files again and again. Kati can re-use
+parsed results when it needs to evaluate a Makefile second time. If we stop
+caching the parsed results, kati will be two times slower for Android's
+build. Caching parsed statements is done in *file_cache.cc*.
+
+The statement parser is defined in *parser.cc*. In kati, there are four kinds of
+statements:
+
+* Rules
+* Assignments
+* Commands
+* Make directives
+
+Data structures for them are defined in *stmt.h*. Here are examples of these
+statements:
+
+    VAR := yay!      # An assignment
+    all:             # A rule
+    	echo $(VAR)  # A command
+    include xxx.mk   # A make directive (include)
+
+In addition to include directive, there are ifeq/ifneq/ifdef/ifndef directives
+and export/unexport directives. Also, kati internally uses "parse error
+statement". As GNU make doesn't show parse errors in branches which are not
+taken, we need to delay parse errors to evaluation time.
+
+### Context dependent parser
+
+A tricky point of parsing make statements is that the parsing depends on the
+context of the evaluation. See the following Makefile chunk for example:
+
+    $(VAR)
+    	X=hoge echo $${X}
+
+You cannot tell whether the second line is a command or an assignment until
+*$(VAR)* is evaluated. If *$(VAR)* is a rule statement, the second line is a
+command and otherwise it's an assignment. If the previous line is
+
+    VAR := target:
+
+the second line will turn out to be a command.
+
+For some reason, GNU make expands expressions before it decides the type of
+a statement only for rules. Storing assignments or directives in a variable
+won't work as assignments or directives. For example
+
+    ASSIGN := A=B
+    $(ASSIGN):
+
+doesn't assign "*B:*" to *A*, but defines a build rule whose target is *A=B*.
+
+Anyway, as a line starts with a tab character can be either a command statement
+or other statements depending on the evaluation result of the previous line,
+sometimes kati's parser cannot tell the statement type of a line. In this case,
+kati's parser speculatively creates a command statement object, keeping the
+original line. If it turns out the line is actually not a command statement,
+the evaluator re-runs the parser.
+
+### Line concatenations and comments
+
+In most programming languages, line concatenations by a backslash character and
+comments are handled at a very early stage of a language
+implementation. However, GNU make changes the behavior for them depending on
+parse/eval context. For example, the following Makefile outputs "has space" and
+"hasnospace":
+
+    VAR := has\
+    space
+    all:
+    	echo $(VAR)
+    	echo has\
+    nospace
+
+GNU make usually inserts a whitespace between lines, but for command lines it
+doesn't. As we've seen in the previous subsection, sometimes kati cannot tell
+a line is a command statement or not. This means we should handle them after
+evaluating statements. Similar discussion applies for comments. GNU make usually
+trims characters after '#', but it does nothing for '#' in command lines.
+
+We have a bunch of comment/backslash related testcases in the testcase directory
+of kati's repository.
+
+Parser for expressions
+----------------------
+
+A statement may have one or more expressions. The number of expressions in a
+statement depends on the statement's type. For example,
+
+    A := $(X)
+
+This is an assignment statement, which has two expressions - *A* and
+*$(X)*. Types of expressions and their parser are defined in *expr.cc*. Like
+other programming languages, an expression is a tree of expressions. The type of
+a leaf expression is either literal, variable reference,
+[substitution references](http://www.gnu.org/software/make/manual/make.html#Substitution-Refs),
+or make functions.
+
+As written, backslashes and comments change their behavior depending on the
+context. Kati handles them in this phase. *ParseExprOpt* is the enum for the
+contexts.
+
+As a nature of old systems, GNU make is very permissive. For some reason, it
+allows some kind of unmatched pairs of parentheses. For example, GNU make
+doesn't think *$($(foo)* is an error - this is a reference to variable
+*$(foo*. If you have some experiences with parsers, you may wonder how one can
+implement a parser which allows such expressions. It seems GNU make
+intentionally allows this:
+
+http://git.savannah.gnu.org/cgit/make.git/tree/expand.c#n285
+
+No one won't use this feature intentionally. However, as GNU make allows this,
+some Makefiles have unmatched parentheses, so kati shouldn't raise an error for
+them, unfortunately.
+
+GNU make has a bunch of functions. Most users would use only simple ones such as
+*$(wildcard ...)* and *$(subst ...)*. There are also more complex functions such
+as *$(if ...)* and *$(call ...)*, which make GNU make Turing-complete. Make
+functions are defined in *func.cc*. Though *func.cc* is not short, the
+implementation is fairly simple. There is only one weirdness I remember around
+functions. GNU make slightly changes its parsing for *$(if ...)*, *$(and ...)*,
+and *$(or ...)*. See *trim_space* and *trim_right_space_1st* in *func.h* and how
+they are used in *expr.cc*.
+
+Evaluator for statements
+------------------------
+
+Evaluator for statements are defined in *eval.cc*. As written, there are four
+kinds of statements:
+
+* Rules
+* Assignments
+* Commands
+* Make directives
+
+There is nothing tricky around commands and make directives. A rule statement
+have some forms and should be parsed after evaluating expression by the third
+parser. This will be discussed in the next section.
+
+Assignments in GNU make is tricky a bit. There are two kinds of variables in GNU
+make - simple variables and recursive variables. See the following code snippet:
+
+    A = $(info world!)   # recursive
+    B := $(info Hello,)  # simple
+    $(A)
+    $(B)
+
+This code outputs "Hello," and "world!", in this order. The evaluation of
+a recursive variable is delayed until the variable is referenced. So the first
+line, which is an assignment of a recursive variable, outputs nothing. The
+content of the variable *$(A)* will be *$(info world!)* after the first
+line. The assignment in the second line uses *:=* which means this is a simple
+variable assignment. For simple variables, the right hand side is evaluated
+immediately. So "Hello," will be output and the value of *$(B)* will be an empty
+string ($(info ...) returns an empty string). Then, "world!" will be shown when
+the third line is evaluated as *$(A)* is evaluated, and lastly the forth line
+does nothing, as *$(B)* is an empty string.
+
+There are two more kinds of assignments (i.e., *+=* and *?=*). These assignments
+keep the type of the original variable. Evaluation of them will be done
+immediately only when the left hand side of the assignment is already defined
+and is a simple variable.
+
+Parser for rules
+----------------
+
+After evaluating a rule statement, kati needs to parse the evaluated result. A
+rule statement can actually be the following four things:
+
+* A rule
+* A [target specific variable](http://www.gnu.org/software/make/manual/make.html#Target_002dspecific)
+* An empty line
+* An error (there're non-whitespace characters without a colon)
+
+Parsing them is mostly done in *rule.cc*.
+
+### Rules
+
+A rule is something like *all: hello.exe*. You should be familiar with it. There
+are several kinds of rules such as pattern rules, double colon rules, and order
+only dependencies, but they don't complicate the rule parser.
+
+A feature which complicates the parser is semicolon. You can write the first
+build command on the same line as the rule. For example,
+
+    target:
+    	echo hi!
+
+and
+
+    target: ; echo hi!
+
+have the same meaning. This is tricky because kati shouldn't evaluate expressions
+in a command until the command is actually invoked. As a semicolon can appear as
+the result of expression evaluation, there are some corner cases. A tricky
+example:
+
+    all: $(info foo) ; $(info bar)
+    $(info baz)
+
+should output *foo*, *baz*, and then *bar*, in this order, but
+
+    VAR := all: $(info foo) ; $(info bar)
+    $(VAR)
+    $(info baz)
+
+outputs *foo*, *bar*, and then *baz*.
+
+Again, for the command line after a semicolon, kati should also change how
+backslashes and comments are handled.
+
+    target: has\
+    space ; echo no\
+    space
+
+The above example says *target* depends on two targets, *has* and *space*, and
+to build *target*, *echo nospace* should be executed.
+
+### Target specific variables
+
+You may not familiar with target specific variables. This feature allows you to
+define variable which can be referenced only from commands in a specified
+target. See the following code:
+
+    VAR := X
+    target1: VAR := Y
+    target1:
+    	echo $(VAR)
+    target2:
+    	echo $(VAR)
+
+In this example, *target1* shows *Y* and *target2* shows *X*. I think this
+feature is somewhat similar to namespaces in other programming languages. If a
+target specific variable is specified for a non-leaf target, the variable will
+be used even in build commands of prerequisite targets.
+
+In general, I like GNU make, but this is the only GNU make's feature I don't
+like. See the following Makefile:
+
+    hello: CFLAGS := -g
+    hello: hello.o
+    	gcc $(CFLAGS) $< -o $@
+    hello.o: hello.c
+    	gcc $(CFLAGS) -c $< -o $@
+
+If you run make for the target *hello*, *CFLAGS* is applied for both commands:
+
+    $ make hello
+    gcc -g -c hello.c -o hello.o
+    gcc -g hello.o -o hello
+
+However, *CFLAGS* for *hello* won't be used when you build only *hello.o*:
+
+    $ make hello.o
+    gcc  -c hello.c -o hello.o
+
+Things could be even worse when two targets with different target specific
+variables depend on a same target. The build result will be inconsistent. I
+think there is no valid usage of this feature for non-leaf targets.
+
+Let's go back to the parsing. Like for semicolons, we need to delay the
+evaluation of the right hand side of the assignment for recursive variables. Its
+implementation is very similar to the one for semicolons, but the combination of
+the assignment and the semicolon makes parsing a bit trickier. An example:
+
+    target1: ;X=Y echo $(X)  # A rule with a command
+    target2: X=;Y echo $(X)  # A target specific variable
+
+Evaluator for expressions
+-------------------------
+
+Evaluation of expressions is done in *expr.cc*, *func.cc*, and
+*command.cc*. The amount of code for this step is fairly large especially
+because of the number of GNU make functions. However, their implementations are
+fairly straightforward.
+
+One tricky function is $(wildcard ...). It seems GNU make is doing some kind of
+optimization only for this function and $(wildcard ...) in commands seem to be
+evaluated before the evaluation phase for commands. Both C++ kati and Go kati
+are different from GNU make's behavior in different ways, but it seems this
+incompatibility is OK for Android build.
+
+There is an important optimization done for Android. Android's build system has
+a lot of $(shell find ...) calls to create a list of all .java/.mk files under a
+directory, and they are slow. For this, kati has a builtin emulator of GNU
+find. The find emulator traverses the directory tree and creates an in-memory
+directory tree. Then the find emulator returns results of find commands using
+the cached tree. For my environment, the find command emulator makes kati ~1.6x
+faster for AOSP.
+
+The implementations of some IO-related functions in commands are tricky in the
+ninja generation mode. This will be described later.
+
+Dependency builder
+------------------
+
+Now we get a list of rules and a variable table. *dep.cc* builds a dependency
+graph using the list of rules. I think this step is what GNU make is supposed to
+do for normal users.
+
+This step is fairly complex like other components but there's nothing
+strange. There are three types of rules in GNU make:
+
+* explicit rule
+* implicit rule
+* suffix rule
+
+The following code shows the three types:
+
+    all: foo.o
+    foo.o:
+    	echo explicit
+    %.o:
+    	echo implicit
+    .c.o:
+    	echo suffix
+
+In the above example, all of these three rules match the target *foo.o*. GNU
+make prioritizes explicit rules first. When there's no explicit rule for a
+target, it uses an implicit rule with longer pattern string. Suffix rules are
+used only when there are no explicit/implicit rules.
+
+Android has more than one thousand implicit rules and there are ten thousands of
+targets. It's too slow to do matching for them with a naive O(NM)
+algorithm. Kati uses a trie to speed up this step.
+
+Multiple rules without commands should be merged into the rule with a
+command. For example:
+
+    foo.o: foo.h
+    %.o: %.c
+    	$(CC) -c $< -o $@
+
+*foo.o* depends not only on *foo.c*, but also on *foo.h*.
+
+Executor
+--------
+
+C++ kati's executor is fairly simple. This is defined in *exec.cc*. This is
+useful only for testing because this lacks some important features for a build
+system (e.g., parallel build).
+
+Expressions in commands are evaluated at this stage. When they are evaluated,
+target specific variables and some special variables (e.g., $< and $@) should be
+considered. *command.cc* is handling them. This file is used by both the
+executor and the ninja generator.
+
+Evaluation at this stage is tricky when both *+=* and target specific variables
+are involved. Here is an example code:
+
+    all: test1 test2 test3 test4
+    
+    A:=X
+    B=X
+    X:=foo
+    
+    test1: A+=$(X)
+    test1:
+    	@echo $(A)  # X bar
+    
+    test2: B+=$(X)
+    test2:
+    	@echo $(B)  # X bar
+    
+    test3: A:=
+    test3: A+=$(X)
+    test3:
+    	@echo $(A)  # foo
+    
+    test4: B=
+    test4: B+=$(X)
+    test4:
+    	@echo $(B)  # bar
+    
+    X:=bar
+
+*$(A)* in *test3* is a simple variable. Though *$(A)* in the global scope is
+simple, *$(A)* in *test1* is a recursive variable. This means types of global
+variables don't affect types of target specific variables. However, The result
+of *test1* ("X bar") shows the value of a target specific variable is
+concatenated to the value of a global variable.
+
+Ninja generator
+---------------
+
+*ninja.cc* generates a ninja file using the results of other components. This
+step is actually fairly complicated because kati needs to map GNU make's
+features to ninja's.
+
+A build rule in GNU make may have multiple commands, while ninja's has always a
+single command. To mitigate this, the ninja generator translates multiple
+commands into something like *(cmd1) && (cmd2) && ...*. Kati should also escape
+some special characters for ninja and shell.
+
+The tougher thing is $(shell ...) in commands. Current kati's implementation
+translates it into shell's $(...). This works for many cases. But this approach
+won't work when the result of $(shell ...) is passed to another make
+function. For example
+
+    all:
+    	echo $(if $(shell echo),FAIL,PASS)
+
+should output PASS, because the result of $(shell echo) is an empty string. GNU
+make and kati's executor mode output PASS correctly. However, kati's ninja
+generator emits a ninja file which shows FAIL.
+
+I wrote a few experimental patches for this issue, but they didn't
+work well. The current kati's implementation has an Android specific workaround
+for this. See *HasNoIoInShellScript* in *func.cc* for detail.
+
+Ninja regeneration
+------------------
+
+C++ kati has --regen flag. If this flag is specified, kati checks if anything
+in your environment was changed after the previous run. If kati thinks it doesn't
+need to regenerate the ninja file, it finishes quickly. For Android, running
+kati takes ~30 secs at the first run but the second run takes only ~1 sec.
+
+Kati thinks it needs to regenerate the ninja file when one of the followings is
+changed:
+
+* The command line flags passed to kati
+* A timestamp of a Makefile used to generate the previous ninja file
+* An environment variable used while evaluating Makefiles
+* A result of $(wildcard ...)
+* A result of $(shell ...)
+
+Quickly doing the last check is not trivial. It takes ~18 secs to run all
+$(shell ...) in Android's build system due to the slowness of $(shell find
+...). So, for find commands executed by kati's find emulator, kati stores the
+timestamps of traversed directories with the find command itself. For each find
+commands, kati checks the timestamps of them. If they are not changed, kati
+skips re-running the find command.
+
+Kati doesn't run $(shell date ...) and $(shell echo ...) during this check. The
+former always changes so there's no sense to re-run them. Android uses the
+latter to create a file and the result of them are empty strings. We don't want
+to update these files to get empty strings.
+
+TODO
+----
+
+A big TODO is sub-makes invoked by $(MAKE). I wrote some experimental patches
+but nothing is ready to be used as of writing.