Writing an emulator in Python - implementation questions (for performance)


Santiago Romero

Hi.

I'm trying to port (just for fun) my old Sinclair Spectrum emulator,
ASpectrum, from C to Python + pygame.

Although the Sinclair Spectrum has a simple 8-bit Z80 microprocessor
at 3.5 MHz, and no additional hardware (excluding the +2/+3 models'
AY sound chip), I'm not sure if my beloved scripting language, Python,
will be fast enough to emulate the Sinclair Spectrum at 100% speed.
There are Java Spectrum emulators available, so it should be possible.

Anyway, this message is meant to prepare the DESIGN so that the
code can run as fast as possible. I mean, I want to know the best way
to do some emulation tasks before writing a single line of code.

My questions:


GLOBAL VARIABLES VS OBJECTS:
==================================

I need the emulation to be as fast as possible. In my C program I
have a struct called "Z80_Processor" that contains all the registers,
memory structures, and so on. I pass that struct to functions such as
Z80Decode(), Z80Execute() or Z80Disassemble(). This allows me (in
my C emulator) to emulate multiple Z80 processors if I want.

As Python is a scripting language and I imagine the emulation will be
slower than the emulator written in C, I've thought of not creating a
Z80_Processor *object*, and instead declaring global variables such as
reg_A, reg_B, reg_PC, an array main_memory[] and so on, and letting
the Z80 functions access those global variables directly.

I'm doing this to avoid OOP's extra overhead... but it makes the
program less modular. Do you think using processor "objects" would
make the execution slower, or am I right to use global variables
and avoid objects in this type of program?

Should I start writing all the code with a Z80CPU object and, if
performance is low, just remove the "object" layer and declare
everything as globals, or should I go directly for globals?
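Both designs can be measured before committing to one; a quick timeit sketch (the class and variable names are illustrative, not from ASpectrum):

```python
import timeit

# Compare attribute access on a CPU object against a module-level global.
setup = (
    "class Z80CPU:\n"
    "    def __init__(self):\n"
    "        self.reg_A = 0\n"
    "cpu = Z80CPU()\n"
    "reg_A = 0\n"
)

n = 1_000_000
t_attr = timeit.timeit("cpu.reg_A + 1", setup=setup, number=n)
t_glob = timeit.timeit("reg_A + 1", setup=setup, number=n)

print("attribute access: %.3fs" % t_attr)
print("global access:    %.3fs" % t_glob)
```

A common middle ground is to keep the Z80CPU object for modularity but copy the hot registers into plain local variables at the top of the execute loop and write them back on exit, since local lookups are the fastest kind in CPython.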



HIGH AND LOW PART OF REGISTERS:
=================================

- In C, I have the following union and code for emulating registers:

typedef union
{
    struct
    {
        unsigned char l, h;
    } B;

    unsigned short W;
} eword;

eword reg_A;

eword reg_A;

This means that reg_A is a 16-bit "variable" that I can access
directly with reg_A.W = value, and I can also access the LOW BYTE and
HIGH BYTE with reg_A.B.l and reg_A.B.h. More importantly, changing W
modifies l and h, and changing l or h modifies W.

How can I implement this in Python? I mean, define a 16-bit variable
so that the high and low bytes can be accessed separately, and so that
changing W, H or L affects the entire variable. I would like to avoid
doing BIT masks to get or change the HIGH or LOW parts of a variable,
and let the compiled code do it by itself.

I know I can write an "object" with set and get methods that
implements this (that could be done in C too), but for emulation I
need the FASTEST implementation possible (something like C's union
trick).
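The "object with set and get methods" version could be sketched with properties like this (the class name is illustrative); note that each property access is itself a function call, so this is convenient rather than fast:

```python
class RegisterPair:
    """A 16-bit register whose high and low bytes stay in sync,
    mimicking the C union trick with bit masks under the hood."""
    __slots__ = ("W",)

    def __init__(self, value=0):
        self.W = value & 0xFFFF

    @property
    def H(self):
        return (self.W >> 8) & 0xFF

    @H.setter
    def H(self, value):
        self.W = ((value & 0xFF) << 8) | (self.W & 0x00FF)

    @property
    def L(self):
        return self.W & 0xFF

    @L.setter
    def L(self, value):
        self.W = (self.W & 0xFF00) | (value & 0xFF)

reg_BC = RegisterPair()
reg_BC.H = 0x12
reg_BC.L = 0x34
print(hex(reg_BC.W))  # 0x1234
```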


MEMORY (ARRAYS):
===========================

To emulate the Spectrum's memory in C, I have a 64KB array: unsigned
char memory_array[65536]. Then I can access memory_array[reg_PC] to
fetch the next opcode (or data for an opcode) and just increment
reg_PC.

Is python's array module the best (and fastest) implementation to
"emulate" the memory?


MEMORY (PAGES):
=============================

The Sinclair Spectrum is an 8-bit computer that can address 64KB of
memory, and that memory is based on 16KB pages (so it sees 4 pages
simultaneously, where page 0 is always ROM). Programs can change
"pages" to point to additional 16KB pages on the 128KB memory models.

I don't know how to emulate paging in python...

My first approach would be to have eight 16KB arrays, and "memcpy()"
memory into the main 64KB array when the program performs page
swapping. I mean (C + pseudocode):


unsigned char main_memory[65536];
unsigned char memory_blocks[8][16384];

// Initial settings
int current_pages[4] = {0, 1, 2, 3};

// The program swaps the last memory page (3) to block 7, so I:
page_to_swap_from = 3;
page_to_map = 7;

// Save the contents of the current page (the page being unmapped):
memcpy( memory_blocks[current_pages[page_to_swap_from]],  // Destination
        &main_memory[16384 * page_to_swap_from],          // Source
        16384 );                                          // 16K

// Now map block 7 into memory page 3:
memcpy( &main_memory[page_to_swap_from * 16384],          // Destination
        memory_blocks[page_to_map],                       // Source
        16384 );                                          // 16K

current_pages[page_to_swap_from] = page_to_map;



memcpy() is very fast in C, but I don't know if doing the same in
Python with arrays would be fast enough, or if there is a better
approach to simulate paging of 16KB blocks into a 64KB memory window
(4 mappable memory blocks).

Maybe another approach based on pointers or something like that?
 

Carl Banks

 Hi.

 I'm trying to port (just for fun), my old Sinclair Spectrum emulator,
ASpectrum, from C to Python + pygame.

The answer to your question is, "Use numpy". More details below.



[snip]
 Should I start writing all the code with a Z80CPU object and if
performance is low, just remove the "object" layer and declare it as
globals,

Yes, but I'd suggest indexing a numpy array rather than using
structures. See below.



[snip]
 How can I implement this in Python, I mean, define a 16 byte variable
so that high and low bytes can be accessed separately and changing W,
H or L affects the entire variable? I would like to avoid doing BIT
masks to get or change HIGH or LOW parts of a variable and let the
compiled code to do it by itself.

You can do clever memory slicing like this with numpy. For instance:

breg = numpy.zeros((16,),numpy.uint8)
wreg = numpy.ndarray((8,),numpy.uint16,breg)

This causes breg and wreg to share the same 16 bytes of memory. You
can define constants to access specific registers:

R1L = 1
R1H = 2
R1 = 1

breg[R1H] = 2
print wreg[R1]


[snip]
 Is python's array module the best (and fastest) implementation to
"emulate" the memory?

I'd use numpy for this as well. (I doubt the Z80 had a 16-bit bus,
but if it did you could use the same memory sharing trick I showed you
with the registers to simulate word reads and writes.)

Note that, when performing operations on single values, neither numpy
nor the array module is necessarily much faster than Python lists;
they might even be slower. But they use a LOT less memory, which is
important for largish arrays.


[snip]
 The Sinclair Spectrum 8 bit computer can address 64KB of memory and
that memory is based on 16KB pages (so it can see 4 pages
simultaneously, where page 0 is always ROM). Programs can change
"pages" to point to aditional 16KB pages in 128KB memory models.

 I don't know how to emulate paging in python...

numpy again. This would mean you'd have to fiddle with addresses a
bit, but that shouldn't be too big a deal. Create the physical
memory:

mem = numpy.zeros((128*1024,),numpy.uint8)

Then create the pages. (This is a regular Python list containing
numpy slices. numpy slices share memory so there is no copying of
underlying data.)

page = [mem[0:16*1024],
        mem[16*1024:32*1024],
        mem[32*1024:48*1024],
        mem[48*1024:64*1024]]

To access the byte at address 42432, you'd have to use bit operations
to get a page number and index (2 and 9664 in this case); then you can
access the memory like this:

page[2][9664] = 33
p = page[3][99]

To swap a page, reassign the slice of main memory,

page[2] = mem[96*1024:112*1024]

Now, accessing address 42432 will access a byte from a different page.

If you don't want to fiddle with indirect pages and would rather just
copy memory around when a page swap occurs, you can do that, too.
(Assigning to a slice copies the data rather than sharing it.) I don't
know if it's as fast as memcpy, but it should be pretty quick.
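A minimal sketch of the copy-on-swap option (the array names and the marker byte are illustrative):

```python
import numpy as np

# 64KB addressable window plus 128KB of physical memory.
main_memory = np.zeros(64 * 1024, dtype=np.uint8)
physical = np.zeros(128 * 1024, dtype=np.uint8)
physical[7 * 16384] = 0xAB            # tag the start of block 7

current_pages = [0, 1, 2, 3]          # which block each window slot holds

def map_block(window_page, block):
    """Write the outgoing page back to physical memory, then copy the
    incoming block into the 16K window slot (slice assignment copies)."""
    lo = window_page * 16384
    old = current_pages[window_page]
    physical[old * 16384:(old + 1) * 16384] = main_memory[lo:lo + 16384]
    main_memory[lo:lo + 16384] = physical[block * 16384:(block + 1) * 16384]
    current_pages[window_page] = block

map_block(3, 7)                       # swap block 7 into window page 3
print(hex(main_memory[3 * 16384]))    # 0xab
```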


Hope these brief suggestions help. If you don't want third party
libraries, then numpy will be of no use. But I guess if you're using
pygame third party modules are ok. So get numpy, it'll make things a
lot easier. It can be a bit daunting to learn, though.


Carl Banks
 

Santiago Romero

 I'm trying to port (just for fun), my old Sinclair Spectrum emulator,
The answer to your question is, "Use numpy".  More details below.

Let's see :)
 How can I implement this in Python, I mean, define a 16 byte variable
so that high and low bytes can be accessed separately and changing W,
H or L affects the entire variable? I would like to avoid doing BIT
masks to get or change HIGH or LOW parts of a variable and let the
compiled code to do it by itself.

You can do clever memory slicing like this with numpy.  For instance:

breg = numpy.zeros((16,),numpy.uint8)
wreg = numpy.ndarray((8,),numpy.uint16,breg)

This causes breg and wreg to share the same 16 bytes of memory.  You
can define constants to access specific registers:

R1L = 1
R1H = 2
R1 = 1

breg[R1H] = 2
print wreg[R1]

And how about speed?

Assuming a 16-bit register named BC which contains two 8-bit registers
(B and C)...

Will the above be faster than shifts and bit operations (<<, and, >>)
with new B and C values to "recalculate" BC when reading or changing
either B, C or BC?
I'd use numpy for this as well.  (I doubt the Z80 had a 16-bit bus,
but if it did you could use the same memory sharing trick I showed you
with the registers to simulate word reads and writes.)

The Z80 has a 16-bit ADDRESS bus and an 8-bit DATA bus. This means you
can address from 0 to 65535 memory cells of 8 bits each. The Z80 has
16-bit operations, but currently in C I write 16-bit values as two
consecutive 8-bit values, without using (unsigned short *) pointers.
But it seems that numpy would allow me to do this better than C in
this case...
Note that, when performing operations on single values, neither numpy
nor array module are necessarily a lot faster than Python lists, might
even be slower.  But they use a LOT less memory, which is important
for largish arrays.

Well, we're talking about a 128KB one-byte-per-cell array; that's the
maximum memory size a Sinclair Spectrum can have, and always by using
page swapping of 16KB blocks in a 64KB addressable space...

If you really think that Python lists can be faster than the numpy or
array modules, let me know.

Maybe I'll make a "benchmark test", doing some millions of read/write
operations and timing the results.

I want to write the emulator as "pythonic" as possible...
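Such a benchmark could be sketched with timeit (container size as above; absolute numbers will vary by machine):

```python
import timeit

# One million single-byte read-modify-write operations per container.
n = 1_000_000
stmt = "m[12345] = (m[12345] + 1) & 0xFF"

setups = {
    "list":  "m = [0] * 65536",
    "array": "import array; m = array.array('B', [0] * 65536)",
    "numpy": "import numpy; m = numpy.zeros(65536, numpy.uint8)",
}

times = {}
for name, setup in setups.items():
    times[name] = timeit.timeit(stmt, setup=setup, number=n)
    print("%-6s %.3fs" % (name, times[name]))
```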
numpy again.  This would mean you'd have to fiddle with addresses a
bit, but that shouldn't be too big a deal.  Create the physical
memory:

mem = numpy.zeros((128*1024,),numpy.uint8)

A 128K array of zeroes...
Then create the pages.  (This is a regular Python list containing
numpy slices. numpy slices share memory so there is no copying of
underlying data.)

page = [mem[0:16*1024],
        mem[16*1024:32*1024],
        mem[32*1024:48*1024],
        mem[48*1024:64*1024]]

Those are just like pointers into the "mem" numpy array, pointing at
specific start indexes, aren't they?
To access the byte at address 42432, you'd have use bit operations to
get a page number and index (2 and 9664 in this case), then you can
access the memory like this:

Do you mean:

page = address / 16384
index = address MOD 16384

?

Or, better, with:

page = address >> 14
index = address & 16383

?

page[2][9664] = 33
p = page[3][99]

To swap a page, reassign the slice of main memory,

page[2] = mem[96*1024:112*1024]

Now, accessing address 42432 will access a byte from a different page.

But wouldn't the above calculations (page, index) be slower than a
plain 64KB numpy array (with no paging mechanism on reads or writes),
copying memory slices when page swaps are requested?
If you don't want to fiddle with indirect pages and would just rather
copy memory around when a page swap occurs, you can do that, too.
(Assigning to a slice copies the data rather than shares.) I don't
know if it's as fast as memset but it should be pretty quick.

That's what I was talking about.
With "page and index" I'm slowing down EACH memory read and write,
and that includes opcode and data fetching...

With "memory copying on page swaps", memory is always read and written
quickly, and if "slice copies" are fast enough, the emulation will run
at 100% speed (I hope) for a 3.5 MHz system...
Hope these brief suggestions help.  If you don't want third party
libraries, then numpy will be of no use.  But I guess if you're using
pygame third party modules are ok.  So get numpy, it'll make things a
lot easier.  It can be a bit daunting to learn, though.

Yes, you've been very helpful!

How about numpy's portability? Is array more portable?

And finally, do you think I'm right to use global variables for
registers, memory and so on, or should I put ALL of it into a single
object and pass it to the functions?

Ummm ... I think I'm going to do some tests with some "for i in range
(1,100000000000)" + time :)

Thanks a lot for your help :)

Any other suggestion is welcome.
 

Dave Angel

Santiago said:
Do you mean:

page = address / 16384
index = address MOD 16384

?

Or, better, with:

page = address >> 14
index = address & 16383

?
<snip>
How about
page, index = divmod(address, 16384)
 

Santiago Romero

You can do clever memory slicing like this with numpy.  For instance:
breg = numpy.zeros((16,),numpy.uint8)
wreg = numpy.ndarray((8,),numpy.uint16,breg)

This causes breg and wreg to share the same 16 bytes of memory.  You
can define constants to access specific registers:

What am I doing wrong?

[sromero@compiler:~]$ cat test.py
#!/usr/bin/python

import numpy

# Register array
breg = numpy.zeros((16,),numpy.uint8)
wreg = numpy.ndarray((8,), numpy.uint16, breg )

reg_A = 1
reg_F = 2
reg_AF = 1
reg_B = 3
reg_C = 4
reg_BC = 3

breg[reg_B] = 5
breg[reg_C] = 10
print breg[reg_B]
print breg[reg_C]
print wreg[reg_BC]


[sromero@compiler:~]$ python test.py
5
10
0

?
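The 0 follows from how the two arrays overlay: word k of wreg covers bytes 2k and 2k+1 of breg, so with reg_B = 3 and reg_C = 4 the pair straddles two different words, and wreg[reg_BC] with reg_BC = 3 reads bytes 6 and 7, which are still zero. A sketch with aligned indices (the values are illustrative; which byte lands in the high half depends on the host CPU's endianness):

```python
import sys
import numpy

breg = numpy.zeros(16, numpy.uint8)
wreg = numpy.ndarray(8, numpy.uint16, breg)

# Each B/C pair must start at an even byte index, and the word index
# is the byte index divided by 2.
reg_B, reg_C, reg_BC = 2, 3, 1

breg[reg_B] = 5
breg[reg_C] = 10

# On a little-endian host the byte at the higher index becomes the
# high half of the word.
if sys.byteorder == "little":
    expected = 10 * 256 + 5
else:
    expected = 5 * 256 + 10
print(wreg[reg_BC], expected)
```

Note also that on a little-endian host this means the Z80's B register (the high byte of BC) would have to sit at the odd byte index of its pair.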
 

greg

Carl said:
You can define constants to access specific registers:

R1L = 1
R1H = 2
R1 = 1

breg[R1H] = 2
print wreg[R1]

But keep in mind that named "constants" at the module level
are really global variables, and therefore incur a dictionary
lookup every time they're used.

For maximum speed, nothing beats writing the numeric literals
directly into the code, unfortunately.

Generally, I think you're going to have quite a battle on
your hands to get a pure Python implementation to run as
fast as a real Z80, if it's even possible at all. And if
you do succeed, the code will be pretty awful (due to things
such as not being able to use named constants).

I would be taking a different approach -- develop a prototype
in Python, concentrating on clarity rather than speed, and
later reimplement the core of the emulator as an extension
module, using Pyrex or Cython or otherwise.
 

Steven D'Aprano

Generally, I think you're going to have quite a battle on your hands to
get a pure Python implementation to run as fast as a real Z80, if it's
even possible at all. And if you do succeed, the code will be pretty
awful (due to things such as not being able to use named constants).

I don't know about that...

Python on a 2GHz processor (which is more or less entry-level for desktop
PCs these days), emulating something which used to run at 2.5MHz? Even if
the Python code does 1000 times more work, the modern processor is nearly
1000 times faster, so your Python code won't be much slower than the
first generation Z80.

The latest Z80 processors operate at 50MHz. That still gives you a factor
of 40 to work with. Write your Python carefully, optimizing only the bits
that *need* optimizing, and perhaps using a few C extensions or Psyco,
and I think you have a good chance of being close enough to the speed of
a real Z80.
 

Steven D'Aprano

Not necessarily, because it involves a function call, and constructing
and deconstructing a result tuple. If you time them, you may well find
that the explicit shift and mask operations turn out to be faster.

It's easy enough to test with timeit; the shift-and-mask version came
in at about 0.55 seconds on my machine.

The shift and mask are a little faster on my machine, but that's
certainly what I would call a micro-optimization. Unless the divmod
call is the bottleneck in your code -- and it almost certainly won't
be -- I don't think it's worth the obfuscation of shift/mask. What
should be done is to write the code in the clearest way you can; then,
only if it is too slow, profile it to see where it actually needs
optimizing.
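A test along those lines can be sketched with timeit (numbers will differ per machine):

```python
import timeit

n = 1_000_000
t_divmod = timeit.timeit("page, index = divmod(a, 16384)",
                         setup="a = 42432", number=n)
t_shift = timeit.timeit("page = a >> 14; index = a & 16383",
                        setup="a = 42432", number=n)

# Both decompositions give the same result, of course:
assert divmod(42432, 16384) == (42432 >> 14, 42432 & 16383)

print("divmod:     %.3fs" % t_divmod)
print("shift/mask: %.3fs" % t_shift)
```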
 

Carl Banks

The answer to your question is, "Use numpy".  More details below.

 Let's see :)
You can do clever memory slicing like this with numpy.  For instance:
breg = numpy.zeros((16,),numpy.uint8)
wreg = numpy.ndarray((8,),numpy.uint16,breg)
This causes breg and wreg to share the same 16 bytes of memory.  You
can define constants to access specific registers:
R1L = 1
R1H = 2
R1 = 1
breg[R1H] = 2
print wreg[R1]

 And how about speed?

Assuming a 16-bit register named BC which contains two 8-bit registers
(B and C)...

 Will the above be faster than shifts and bit operations (<<, and, >>)
with new B and C values to "recalculate" BC when reading or changing
either B, C or BC?

I don't know. Strange thing about Python, the community generally
isn't hellbent on speed so we don't usually memorize a bunch of facts
like "X is faster than Y". I'm personally aware of only a few general
rules of thumb on speed (like "local variables are much faster than
globals" and "function calls have rather high overhead").

Obviously, given the problem you chose, you know you're going to have
to worry about speed from the beginning. But since not many of us
know speed details off-hand, you'll probably get a faster answer by
just running the speed tests yourself.

Having said that, I *suspect* the numpy approach I mentioned will be
quite a bit faster than using bit operations and recalculating BC, but
I really don't know.


Carl Banks
 

Carl Banks

I would be taking a different approach -- develop a prototype
in Python, concentrating on clarity rather than speed, and
later reimplement the core of the emulator as an extension
module, using Pyrex or Cython or otherwise.

But keep in mind he said he was doing it just for fun. I don't know
about you, but I find it much more fun to try to write fast Python
than to write the fast stuff in C. Not as effective, but more fun.


Carl Banks
 

Santiago Romero

I'm going to quote all the answers in a single post, if you all don't
mind:
[greg]
But keep in mind that named "constants" at the module level
are really global variables, and therefore incur a dictionary
lookup every time they're used.

For maximum speed, nothing beats writing the numeric literals
directly into the code, unfortunately.

So, in the end, there are no real constants available in Python? (Like this:)

#define R1 1

Generally, I think you're going to have quite a battle on
your hands to get a pure Python implementation to run as
fast as a real Z80, if it's even possible at all.

Well... I'm trying to emulate the basic Z80, clocked at 3.5 MHz.
I hope Python on a 2 GHz computer can emulate a 3.5 MHz machine...

Finally, if it's not possible... well, then I would just
have some fun... :)


[Steven D'Aprano]
The shift and mask are a little faster on my machine, but that's
certainly what I would call a micro-optimization. Unless the divmod call
is the bottleneck in your code -- and it almost certainly won't be --

It can be a real bottleneck.

An emulator continuously executes machine code instructions. Those
instructions are read from memory, and their operands are read from
memory too. Even the smallest opcode (00h -> NOP) requires 1 memory
read, and the CPU's job is read from memory & decode & execute.

My problem is that I would need to "encapsulate" memory reads and
memory writes in functions. In C I use #define so that:

- No function call is made (the code is inlined)
- I can "call" my macro so I don't have to repeat the code manually
- I can optimize my "#define" macro just once, and all the "calls" are
optimized too.

This way (in C) I can write readable code and the compiler replaces
my "readable macros" with the final code.
I don't think it's worth the obfuscation to use shift/mask.

An idea.

I think I'm going to write a preprocessing layer for my Python
program, to allow macros in my emulator, and generate the final .py
program with a script.

VERY SIMPLE EXAMPLE:

My program:

File emulator.pym:

==================================
#!/usr/bin/python
import blah
import sys

MACRO BEGIN Z80WriteMem( address, value )
blablablah
inc blah
p = p + x
MACRO END

MACRO BEGIN Z80ReadMem( address )
( memory[address>>4][blah] )
MACRO END


(more code)

pepe = @@@Z80ReadMem( reg_A )
(more code)
@@@Z80WriteMem( 0x121212, value )
==================================

And then use a script to replace the @@@ macro calls with the
final code.

This way I could write my emulator with "macros", and my
"preprocessor" would rewrite the final .py files for the "binary"
releases, while I keep the .pym files as the "real" source (because
the .py files would be generated from the .pym macro files).

Can the above be easily done with an already-existing application?
(For example: can m4 do this job?)
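For expression-style macros like Z80ReadMem above (ones that expand to a single expression, so indentation never changes), a small substitution pass is enough. A sketch using the @@@ call syntax, with an illustrative one-argument macro table:

```python
import re

# Macro bodies use {0} as the argument placeholder; only single-argument,
# single-expression macros are handled in this sketch.
MACROS = {
    "Z80ReadMem": "memory[({0}) >> 14][({0}) & 16383]",
}

CALL = re.compile(r"@@@(\w+)\(\s*([^)]*?)\s*\)")

def expand(line):
    """Replace every @@@Name( arg ) call on the line with its expansion."""
    return CALL.sub(lambda m: MACROS[m.group(1)].format(m.group(2)), line)

print(expand("pepe = @@@Z80ReadMem( reg_A )"))
# pepe = memory[(reg_A) >> 14][(reg_A) & 16383]
```

Statement macros like Z80WriteMem are the hard part, since the inserted lines must be re-indented to match the call site.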
 

Gabriel Genellina

Carl said:
You can define constants to access specific registers:
R1L = 1
R1H = 2
R1 = 1
breg[R1H] = 2
print wreg[R1]

But keep in mind that named "constants" at the module level
are really global variables, and therefore incur a dictionary
lookup every time they're used.

For maximum speed, nothing beats writing the numeric literals
directly into the code, unfortunately.

This recipe may improve speed dramatically in those cases:
http://code.activestate.com/recipes/277940/
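A lighter standard trick with a similar effect, for hot functions, is binding the constant as a default argument: CPython evaluates default values once, at definition time, and inside the function the name is a fast local rather than a module-level lookup. A sketch (the function and register names are illustrative):

```python
R1 = 1

def read_r1(wreg, _R1=R1):
    # _R1 was captured when the def statement ran; accessing it here is
    # a local variable lookup, not a global dictionary lookup.
    return wreg[_R1]

print(read_r1([10, 20, 30]))  # 20
```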
 

greg

Santiago said:
Can the above be easily done with another already-existing
application? (example: can m4 do this job)?

The problem you're going to have with something like
m4 is indentation. When you inline a function call,
somehow the inserted code has to be shifted to the
same indentation level as the statement containing
the call.

I don't know of any off-the-shelf macro processor
that can handle that sort of thing easily.

You may want to consider writing a custom one in
Python. You could use the ast facilities to parse
the code, operate on the parse tree and then write
it back out as code.

You might be able to get some ideas from 2to3,
which does similar kinds of source-to-source
translations.
 
