Soon after writing the game, I described some of how I'd optimised the
code in an email to a friend. In case it's of any interest to a wider
audience, here it is (slightly modified, as you might expect).


My first entry was a port of my `ztrack', which was loosely based on
Casio's old "Turbo Drive" LCD driving game. What made this
particularly suited to the speccy was the way the display worked. I
could stick a bitmap on the screen with all the track lanes and cars
in each position then, during the game, I could modify the attributes
to light up the relevant bits of the `LCD'. Finally, a situation where
the attribute system is actually an advantage... :-)

There were just two problems with this:

- a full screen bitmap would take 6144 bytes.

- I wasn't sure if I could port the ztrack code itself into 1k even
  without a memory-hogging bitmap.

The only practical solution to the bitmap problem seemed to be to use
a very small version of one half of the track display, then expand and
mirror it to the screen. Just how small I reckoned it would have to be
was a bit of a shock - around 32x32. With the mirroring, and the track
display not taking up the whole screen vertically, that works out at
an effective full-screen resolution of 64x48, i.e. the resolution of
ZX81 `PLOT' graphics. Less than great. :-(
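
To make the arithmetic concrete, here is a rough Python model of the
expand-and-mirror idea (not the game's Z80 code). The exact 32x32 size,
the 4x scale factor (inferred from 64x48 on a 256x192 screen), the bit
order, and the choice of mirroring the stored half left-to-right are
all assumptions made for illustration.

```python
WIDTH = 32   # assumed width of the stored half-track bitmap, in pixels
SCALE = 4    # assumed expansion factor (256 / 64 = 4)

def expand_and_mirror(half_rows, scale=SCALE):
    """Expand a half-width 1-bit bitmap and mirror it left to right.

    half_rows: list of ints, one per row, bit (WIDTH-1) = leftmost pixel.
    Returns a list of rows, each a list of 0/1 pixel values.
    """
    screen = []
    for row in half_rows:
        # Unpack the row into individual pixels, leftmost first.
        bits = [(row >> (WIDTH - 1 - x)) & 1 for x in range(WIDTH)]
        # Mirror: the stored half followed by its reflection.
        full = bits + bits[::-1]
        # Stretch each pixel horizontally, then repeat the row vertically.
        wide = [b for b in full for _ in range(scale)]
        screen.extend(list(wide) for _ in range(scale))
    return screen

half = [0] * 32
half[0] = 1 << (WIDTH - 1)           # a single pixel at the top-left
screen = expand_and_mirror(half)
print(len(screen[0]), len(screen))   # 256 128
print(32 * 32 // 8, 256 * 192 // 8)  # 128 6144 (bytes: small bitmap vs full screen)
```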

There was only one real approach to the other half of the problem, the
port of the ztrack code - try it and see. I decided to largely ignore
how much space things took up, preferring instead to get the thing
working. Then once I had it working, I could look at how things were
on the memory front.

It all seemed to go rather well. The non-display code was shaping up
to fit in the 500 to 600 byte range on even the initial
implementation, and with the bitmap taking up about 128 bytes, there
should have been a respectable amount left over for display code.
Because obviously, the display code would be simple.

But things were not quite so obvious as I might have hoped. :-) The
problem I'd overlooked was how exactly to hide/show the various car
and lane graphics. The scheme I ended up with was to use what are
essentially fixed-position attribute `sprites', which worked well
enough - but adding sprite data and code to the expansion/mirroring
code, and the surprisingly large main loop, gave me a rather troubling
total. The code was over 1300 bytes.

Now, I've optimised Z80 stuff to save memory before with ZCN, so it
would be fair to say I've got a few chops in this department. :-) But
this one was bloody hard. Having to save over 300 bytes (leaving room
for the Basic wrapper, etc.) on something which was already pretty
small took me about 10 hours in the end, and drove me to some fairly
desperate measures to save room. I can't remember everything, but I'll
just give a few examples:

- probably the simplest one was replacing my `convert pixel line
  co-ord to screen address' routine with a call to the ROM's one. I'd
  originally thought I couldn't use that, as Basic uses a weird
  co-ordinate system (starting 16 pixels up from the bottom-left of
  the screen, with the bottom two character lines inaccessible) which
  the ROM routine follows. But if you call it partway in, past the bit
  that deals with Basic's co-ordinates, you can do lookups for the
  whole screen with it.

  It's fair to say this was one of the less desperate measures. :-)

- my original format for the attribute sprite data was something like
  this:

    defw 05800h+12*32+14
    defb 4
    defb 01100000b
    defb 11110000b
    defb 11110000b
    defb 10010000b

  That's the address of the top-left position (in the attribute area),
  then the number of lines in the sprite bitmap, then the bitmap data
  showing which attributes to change.

  But using a whole separate byte for the number of lines - which
  could actually fit in 3 bits - seemed downright profligate. :-) And
  since the address was known to be in the attribute area, I thought I
  could get away with just having an attribute-area offset (which fits
  in 10 bits, as the area is only 768 bytes), and use the top 4 bits
  of the same 16-bit word for the number of lines, giving:

    defw 4*4096+12*32+14
    defb 01100000b
    defb 11110000b
    defb 11110000b
    defb 10010000b

  Now, that's all very well, but the code to pull the two fields apart
  again mustn't take more than a few extra bytes. Even if I'd needed
  no more code to deal with the new format, I'd only save 25 bytes
  (one per sprite), so keeping it small was important.

  The obvious approach would have involved bit-shifts and ANDs, and
  would have taken 11 bytes. But I had a sneakier approach which ended
  up taking 9. Whether it was worth the extra hassle for two bytes is
  debatable, but still... :-) The Z80 has two instructions intended
  for shifting BCD digits in memory left and right, probably as an aid
  to doing long multiplication/division with BCD numbers. As you might
  expect, these instructions do the shift in 4-bit chunks, and use the
  accumulator as a carry digit. So for RLD you end up with something
  like this happening:

                           _______   ____
                          |       | |    |
                          v       | v    |
                 ,-----+-----.  ,-----+-----.
                 |hi   A   lo|  |hi  (HL) lo|
                 `-----+-----'  `-----+-----'
                          |              ^
                          |______________|

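  That led to me using the following code to get the top then bottom
  half of the byte at HL non-destructively (top nibble in B, bottom
  nibble in A):

    ld d,(hl)       ;keep a copy of the byte
    xor a
    rld             ;a = top nibble, (hl) is trashed
    ld b,a
    ld (hl),d       ;restore the byte
    rrd             ;a = bottom nibble, (hl) is trashed again
    ld (hl),d       ;restore it again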

  To be honest, I was just chuffed to have finally had a reason to use
  RLD/RRD. :-)

- random numbers are vitally important for ztrack, and unfortunately
  the easy way out, using the R register, wasn't good enough. The
  `random' numbers when using that gave really crappy results. So I
  reluctantly put in my usual Z80 random-number routine, which is
  about 50 bytes.

  That stayed in for quite a while. But as I squeezed the rest of the
  code harder and harder and struggled to find something else to cut
  down, the `rand' routine was something which had to give. So I
  turned to the ROM. It doesn't have a directly-callable version of
  Basic's RND, and even if it did the code uses floating-point
  numbers, and is slow by machine-code standards. But I had plenty of
  CPU time to spare, so I had a go.

  The basic idea was to copy the code from the ROM to RAM, so I could
  get it to do a RET, and also so I could stop it leaving a copy of
  the random number on the ROM's FP calculator stack, which would
  eventually fill the memory if left unchecked. The code for that is
  below. It shows another example of the insane extent to which I took
  things - the printer buffer happens to come directly after the
  attribute area, and I happened to have a routine to initialise all
  the attributes just before. So I decided to save 3 bytes by using
  the printer buffer for my copy of the RND code, since DE was already
  pointing at it. :-)

    ;set up basic attrs
    ...
    ldir                ;this leaves de pointing just past the screen memory

    ;copy guts of the ROM's RND routine so we can call it sanely
    ;b is still zero
    ;de is pointing at rnd_bit (printer buffer)
    ld hl,025fdh
    ld c,40
    ldir
    ;stick a RET on the end
    ld a,0c9h
    ld (de),a
    ;stop it doing a dup to leave an FP return value, which otherwise
    ;would gradually fill memory
    ld hl,038h  ;end-calc/nop
    ld (05b18h),hl

  (The "end-calc" is a bytecode used by the ROM's FP calculator, BTW.
  When you call it you use these bytecodes inline, with "end-calc"
  marking the endpoint.)

  I also needed a little wrapper routine to replace `rand', but this
  still cut the RNG overhead to about 20 bytes. By the time I managed
  this one, I have to say, saving 30 bytes seemed like a miracle. :-)
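
For reference, the sequence produced by the ROM's RND is generally
documented as stepping a 16-bit seed with a multiply-and-modulo, and
the following Python sketch models just that integer step. It ignores
the floating-point machinery entirely, and it does not model the
game's wrapper routine (which isn't shown here); the function name
`rom_rnd_step` is made up for illustration.

```python
def rom_rnd_step(seed):
    """One step of the Spectrum ROM's RND seed update (integer model).

    seed is the 16-bit SEED value, 0..65535. The result is also
    always in 0..65535, because (seed + 1) is never a multiple
    of the prime 65537.
    """
    return ((seed + 1) * 75) % 65537 - 1

seed = 0
values = []
for _ in range(3):
    seed = rom_rnd_step(seed)
    values.append(seed)
print(values)   # [74, 5624, 28652]
```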

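Going back to the packed sprite header and the RLD/RRD trick from the
second item above, this Python sketch models the two instructions at
the nibble level and steps through the same seven-instruction sequence.
It is a model for checking the logic, not an emulator; the helper names
(`rld`, `rrd`, `split_nibbles`, `pack_header`) are made up for
illustration, and flags are ignored.

```python
def rld(a, m):
    """Model Z80 RLD. a = accumulator, m = byte at (HL).
    Returns (new_a, new_m)."""
    new_a = (a & 0xF0) | (m >> 4)             # (HL) high nibble -> A low
    new_m = ((m << 4) & 0xF0) | (a & 0x0F)    # (HL) low -> high, A low -> (HL) low
    return new_a, new_m

def rrd(a, m):
    """Model Z80 RRD. Returns (new_a, new_m)."""
    new_a = (a & 0xF0) | (m & 0x0F)           # (HL) low nibble -> A low
    new_m = ((a & 0x0F) << 4) | (m >> 4)      # A low -> (HL) high, (HL) high -> low
    return new_a, new_m

def split_nibbles(m):
    """Mirror of the assembly sequence; returns (b, a, m_after)."""
    d = m                 # ld d,(hl)
    a = 0                 # xor a
    a, m = rld(a, m)      # rld
    b = a                 # ld b,a
    m = d                 # ld (hl),d
    a, m = rrd(a, m)      # rrd
    m = d                 # ld (hl),d
    return b, a, m

def pack_header(lines, offset):
    """Line count in the top 4 bits, attribute offset in the rest."""
    return (lines << 12) | offset

word = pack_header(4, 12 * 32 + 14)       # same values as the defw above
low, high = word & 0xFF, word >> 8        # little-endian: HL points at `high`
b, a, after = split_nibbles(high)
offset = (a << 8) | low
print(hex(word), b, offset, after == high)   # 0x418e 4 398 True
```

The top nibble lands in B as the line count, and the bottom nibble left
in A supplies the upper bits of the attribute offset, with the byte in
memory unchanged at the end.
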
At any rate, I eventually ended up getting the thing to fit. And
wouldn't you know it, no sooner had I struggled to manage that than I
noticed an easy way to save another 9 bytes. Bah.