
Speeding up Python (Was: FICL has a nice VM interpreter loop...)

There are likely a number of ways to speed up Python.
Many would require extensive changes to the Python source code,
and might be good candidates for Python 2.

One good source of techniques is Smalltalk.
Like Python, Smalltalk's variables can store any type,
and the type can change at any time.
Smalltalk has been going down the learning curve
of implementing dynamically typed languages for over
18 years, and some reasonably fast implementations
have been developed.  Many of the techniques have
been documented in the literature.

The slowest widely used version of Smalltalk is Squeak,
a highly portable open source bytecode interpreter that
runs about 3 times as fast as Python in small C-style
benchmarks.  (See benchmark results at the end of this
message.)  The fastest is likely to be VisualWorks,
which uses dynamic (JIT) compilation and some other
tricks to end up running at 10 times the speed of Squeak.

Some of the techniques used to speed up Smalltalk include:

1. Garbage collection

     Some of the early implementations of Smalltalk used
     reference counting.  However, they found that when they
     implemented actual garbage collection, in addition
     to being able to collect cycles, they reduced overhead.
     Garbage collection actually sped up the implementation.
     Modern garbage collection techniques can often
     have lower overhead than manual memory allocation
     using malloc and free, even if you don't include
     the reference counting overhead.

     Garbage collection doesn't have to be complicated
     nor non-portable.  Conservative garbage collectors
     aimed at C or C++ are likely to have
     porting problems because they try to add
     garbage collection on top of a language implementation
     that doesn't expect it, compiled by optimizing compilers
     that can hide information from the collector.
     By contrast, Squeak has a simple, efficient, yet
     highly portable garbage collector because it is
     built into the language implementation.
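
     The cycle-collection point above can be illustrated with a
     toy mark-and-sweep collector.  This is only a sketch in
     modern Python, not how Squeak actually does it; all the
     names here are invented for the example:

```python
class Obj:
    """A heap object that can reference other heap objects."""
    def __init__(self, name):
        self.name = name
        self.refs = []        # outgoing references
        self.marked = False

def mark(obj):
    """Mark phase: trace everything reachable from a root."""
    if obj.marked:
        return
    obj.marked = True
    for ref in obj.refs:
        mark(ref)

def sweep(heap):
    """Sweep phase: keep marked objects, reclaim the rest."""
    live = [o for o in heap if o.marked]
    for o in live:
        o.marked = False      # reset marks for the next collection
    return live

def collect(heap, roots):
    for r in roots:
        mark(r)
    return sweep(heap)

# Two objects in a reference cycle, unreachable from the root.
# Reference counting alone would never free them; tracing does.
a, b, root = Obj("a"), Obj("b"), Obj("root")
a.refs.append(b)
b.refs.append(a)
heap = [a, b, root]
live = collect(heap, [root])
print([o.name for o in live])   # -> ['root']
```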

2. Tagged values / "Unboxed" integers

     In most implementations of Smalltalk, a reference
     is a 32-bit value that can be either a "small"
     integer or a pointer.  One example might be
     using the lowest bit to differentiate between
     the two.  If it is a 1, it means the other 31 bits
     contain the integer value.  If it is 0, it means
     the value is a pointer.  Since pointers must be
     aligned on a 4-byte boundary, they can be used in
     their natural form without shifting.

     In Smalltalk, this allows integers of up to 31 bits
     to be manipulated and stored without
     the overhead of allocating a real object.
     The virtual machine checks for overflow and converts
     the "small" integer to a large (unlimited precision)
     integer if necessary.  There is no overflow error.
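
     The tagging scheme described above might look like this,
     sketched in Python for clarity (a real VM does this on
     fixed-width machine words; the function names are invented):

```python
INT_MIN, INT_MAX = -(1 << 30), (1 << 30) - 1   # 31-bit signed range

def tag_int(n):
    """Encode a "small" integer: shift left, set the low tag bit to 1."""
    assert INT_MIN <= n <= INT_MAX
    return (n << 1) | 1

def is_small_int(word):
    """Low bit 1 means integer; low bit 0 means an aligned pointer."""
    return word & 1 == 1

def untag_int(word):
    return word >> 1           # arithmetic shift recovers the value

def add_small(a, b):
    """Add two tagged integers, promoting on overflow."""
    result = untag_int(a) + untag_int(b)
    if INT_MIN <= result <= INT_MAX:
        return tag_int(result)
    return result              # a real VM would allocate a boxed big integer

print(untag_int(add_small(tag_int(2), tag_int(3))))   # -> 5
```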

3. Fast instance variable access

     Instance variables in objects are accessed by offset.
     No search is required.  This is the same as in C++.
     However, it does mean that a class's methods have
     to be recompiled when the instance variables of a class
     change.  (The Smalltalk implementation handles this
     recompilation automatically.)
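
     The offset-based access described above can be sketched in
     Python; the point is that name-to-offset resolution happens
     once, at "compile" time (the names below are invented for
     illustration):

```python
class Shape:
    """Per-class layout: maps instance variable names to fixed offsets."""
    def __init__(self, names):
        self.offset = {name: i for i, name in enumerate(names)}

def new_instance(shape):
    return [None] * len(shape.offset)    # flat slot array, like a C struct

def compile_load(shape, name):
    """'Compile' a load: resolve the name to an offset once."""
    i = shape.offset[name]               # the only search, at compile time
    return lambda obj: obj[i]            # runtime access is a plain index

point_shape = Shape(["x", "y"])
p = new_instance(point_shape)
p[point_shape.offset["x"]] = 3
load_x = compile_load(point_shape, "x")
print(load_x(p))                         # -> 3
```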

4. Fast global variable access

     Global and class variables are accessed through a dictionary.
     However, instead of being formed as a set of parallel 
     key/value arrays, the dictionary is formed by a set of
     "association" objects.  Each association contains a key
     and value pair.  The compiled code contains references
     directly to the association.  So, the program can
     change the global value just like in Python, but no
     search is required to find that global value.

     The only difference is that if the global value is deleted
     from the global (or class) dictionary, code already
     compiled will keep using the old value rather than
     raise a "NameError".  If the global is later redefined,
     the old code will still use the old value, since it still
     holds the old association.

     A way around this would be to store an error value 
     in the association before deleting it from the dictionary.
     The LOAD_GLOBAL bytecode could check for the error value,
     and lookup a new value if it found one, or return a name
     error if it didn't.  This check should be faster than
     a hash-table search, which requires a miss check anyway.
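
     The association scheme, including the error-value fix just
     described, might be sketched like this (the class and
     function names are invented for illustration):

```python
class Association:
    """A shared key/value cell; compiled code holds a direct reference."""
    def __init__(self, key, value):
        self.key, self.value = key, value

MISSING = object()        # sentinel stored when a global is deleted

globals_dict = {}

def define_global(name, value):
    assoc = globals_dict.get(name)
    if assoc is None or assoc.value is MISSING:
        assoc = Association(name, value)
        globals_dict[name] = assoc
    else:
        assoc.value = value   # rebind in place: old call sites see it too
    return assoc

def delete_global(name):
    assoc = globals_dict.pop(name)
    assoc.value = MISSING     # old compiled code finds the sentinel

def load_global(assoc):
    """What a LOAD_GLOBAL-style bytecode would do: one read, one check."""
    if assoc.value is MISSING:
        raise NameError(assoc.key)   # or re-look-up in globals_dict
    return assoc.value

# "Compiled code" captures the association once; no hash lookup per access.
cell = define_global("limit", 10)
print(load_global(cell))      # -> 10
define_global("limit", 20)    # rebinding is visible through the same cell
print(load_global(cell))      # -> 20
```
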

5. Fast method dispatch

     In Smalltalk, as in Python, the actual code to call
     is based on the function (method) name and the type
     (class) of the object receiving that call.  Finding
     the actual method to call is called method dispatch.

     There are two ways Smalltalk implementations do
     method dispatch.  One is to maintain a global hash
     table indexed by a simple function (such as XOR)
     of the class and method name.  The hash table is
     generally not large enough to include all combinations,
     but rather serves as a cache of the more commonly used
     ones.  The implementation checks the hash table
     against the class and method name, and if it finds
     a match uses the associated code pointer.
     If it doesn't find a match, it searches the
     class's method dictionary, then its superclasses, etc.

     Another way is to associate with each
     call site storage for a method name, a
     last-used class, and a method address.
     At each call, the code checks
     to see if the class of the current object is the
     same as it was last time.  If it is, it calls
     the same method it used last time.  If not, it
     calls the generalized lookup routine (which may
     include the method cache described above).

     In the common case where the object type doesn't
     change between calls, this code can be
     executed about as quickly as a C++ vtable lookup,
     but it handles a much more general form of dispatch.
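
     The per-call-site cache can be sketched in Python, using the
     class's own method dictionaries to stand in for Smalltalk's
     (CallSite and its fields are invented names, not a real API):

```python
class CallSite:
    """Per-call-site cache: the last receiver class and its method."""
    def __init__(self, selector):
        self.selector = selector
        self.cached_class = None
        self.cached_method = None

    def call(self, receiver, *args):
        cls = type(receiver)
        if cls is not self.cached_class:       # miss: do the full lookup
            self.cached_method = self.lookup(cls)
            self.cached_class = cls
        return self.cached_method(receiver, *args)

    def lookup(self, cls):
        """Generalized lookup: walk the class, then its superclasses."""
        for c in cls.__mro__:
            if self.selector in c.__dict__:
                return c.__dict__[self.selector]
        raise AttributeError(self.selector)

class Dog:
    def speak(self): return "woof"

class Cat:
    def speak(self): return "meow"

site = CallSite("speak")
print(site.call(Dog()))   # miss: full lookup, result cached
print(site.call(Dog()))   # hit: same class as last time, no search
print(site.call(Cat()))   # miss: receiver class changed
```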

6. Above and beyond

     Beyond these techniques, and quite a bit of tuning,
     two additional techniques can give radical speedups.
     The first is dynamic (JIT) compilation, or compiling
     to native code.  VisualWorks uses this, maintaining
     a native code cache of recently executed methods,
     achieving a 10x speedup over Squeak and a 5x speedup
     over other commercial Smalltalks I have tried.
     Of course, you have to implement code generation for
     each system you want to port to.  The language
     overhead still doesn't completely disappear.
     Even with native compilation, VisualWorks is still
     10x slower on my simple benchmarks than optimized C.

     A group developing the Self language, a derivative
     of Smalltalk, developed a number of techniques
     for even faster execution.  Their techniques
     involved guessing the type of a variable, later
     profiling the code and observing the actual type,
     and compiling accordingly.  Using this information,
     they were able to get within a factor of 2-4 of
     optimized C, but they used a lot of memory in the
     process.  They describe their techniques at
     http://self.sunlabs.com.  The papers there are
     worth reading.

     The Self group started out at Stanford.  They later
     moved to Sun.  Then they formed a company,
     Animorphic, to commercialize the technology.
     (Apparently, they got the memory consumption back
     down to a reasonable level.)  But before they
     could do so, they were bought by Sun.  Sun's HotSpot
     compiler for Java is supposed to be based on
     this technology.

Smalltalk implementations probably represent the state
of the art in implementing dynamic object-oriented languages.

Java has borrowed a number of these same techniques.
If you look closely at how the bytecodes are defined
(especially including the "quick" bytecodes) you will
see much the same setup, with the unfortunate exception
of tagged values.  Also, unfortunately, the early Java
implementations had poor memory management,
generally using conservative collectors.
So, even though in theory Java should have less 
overhead than Smalltalk, it doesn't run any faster.

Many of these techniques could be applied to Python.
I would probably start with Garbage Collection.
In addition to speed, this would allow cycles to be
collected.  Not only is this a boon to programmers,
it broadens the techniques that can be used within
Python to implement language extensions, etc.
Finally, it shouldn't be too complex to implement.
(I might try it myself if I have time.)

Of course, smaller optimizations also work.
Eliminating the actual generation of a list with
a "range" argument is one of the most obvious.
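
That is essentially what a lazy "range" gives you: produce the
loop indices one at a time instead of allocating a full list
first.  A generator sketch in modern Python (generators postdate
Python 1.5; this is illustration, not a proposal for the exact
mechanism):

```python
def lazy_range(n):
    """Yield 0 .. n-1 one at a time; no n-element list is built."""
    i = 0
    while i < n:
        yield i
        i += 1

total = 0
for i in lazy_range(1000000):   # constant memory, regardless of n
    total += i
print(total)                    # -> 499999500000
```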

As a side note, when considering the cost and benefit
of a low-level optimization, don't just count
instructions.  Different instructions have different
costs.  An approximate order of cost would be

 cheap   Add, subtract, shift, logical on registers
   |     Load, store, FP operations (except divide/remainder),
   |           cheap operations on memory (Intel)
   |     Integer multiply on some processors (Intel)
   |     Function Call (Intel, Sparc)
   |     Integer multiply (others)
   |     Function Call (others)
   v     Mispredicted branch*
 costly   Divides, remainder, square root, etc.    (around 30x cheap)

Of course, it is best to actually measure the cost.
(But you all knew that.)

Anyway, I hope these ideas are useful.
Good Pythoning.

Greg Gritton

* The bytecode interpreter loop is a likely example
  of a mispredicted branch.  Because the branch location
  changes from one iteration to the next, the processor
  will have a difficult time predicting it.

---------------------- Benchmark results -----------------------

All runs are on a 100MHz Cyrix 6x86 PR-120 with 512K L2 cache 
and 24M RAM run in December 1998 to January 1999.
The times are in milliseconds.

        System                  Bubble  Quick  Perm  Towers
   C (Visual C++ "release")       21     11     21     22
   VisualWorks Smalltalk 3.0     206     99    120     79
   Smalltalk MT 1.5              259    105    239    215
   Dolphin Smalltalk 2.1         805    364    701    538
   Squeak 2.0                   1349    632    961   1266
   Python 1.5                   3577   1555   2844   3115 

These benchmarks are derived from the Stanford integer benchmarks.
They tend to test low-level operations such as array access,
comparisons, and procedure calls.  They were originally written
in C, and because of this and their low-level nature tend to
favor C, so they don't necessarily give an idea of how fast
it is to use a language or system in real life.  However,
they do give somewhat of a worst case.

As an example of this, two of the benchmarks test sorting speed.  
When quicksort, for example, is replaced with Python's native sort
function, the entire benchmark is reduced from 1555ms to 483ms.
And, most of that is now initializing the array.  The sorting
itself takes only 98ms.

I can send or post source code if anyone wishes.