
A month might be what it takes…



Out of curiosity, I'm trying to work out how much quicker, in practical terms, a modern Mac is than a vintage Mac.  One test I'm working on is Primes to a Million - how long would a classic Mac (68k) take to calculate all primes to 1,000,000?  I'm guessing it'll take about 24 hours on a IIci (but I haven't tried it yet).  If I supply a quick and dirty program for your 68k Mac, is anyone here prepared to let me know the run time?  Bonus points for being especially brave and running it on an unexpanded Mac Plus or 512KE (I doubt I can make my software run on anything earlier, because I don't have an old enough C compiler!)


At raw, out-and-out compute? A couple thousand times, give or take - especially if you go all the way back to a 68000 Mac and all the way forward to an M1 Mac, and not just "something on the plateau" like a decade-old Ivy Bridge system.

 

It's tough to make a direct comparison, because early Macs didn't really get used for heavy math per se, and often didn't have enough RAM for heavy compute compared to whatever was being sold into the HPC and/or UNIX/academic markets at the time.

 

Plus, the same code often translates poorly to newer computers due to emulation overhead and/or emulation just being bad. That's the explicit reasoning behind, say, these specific Norton System Info benchmarks: https://doku.stenoweb.net/doku.php?id=macdex:nortonsysteminfobenches 

 

I'd say a 4-meg Plus/SE/Classic is the best spot to start, mostly for logistical reasons; see if you can come up with some benchmark that gets you to a clearly identifiable point, and then extrapolate from there.

 

Take it in steps: first find out how much faster a Quadra 650/800 is than a Plus (using Norton System Info, perhaps); then switch to MacBench 4 and find out how much faster something like a 1 GHz G4 booted into OS 9 is than the Quadra 800; then switch to OS X and use maybe a universal-binary version of MacBench, or perhaps something like an old version of Cinebench, to compare your 1 GHz G4 to a Mac from 2011 or so; then use current Cinebench to compare that Mac to an M1.

 

Unfortunately, because the workload changes each time, you won't get an exact read, but I think you'll still be able to see the arc of performance gains over the last 30-35 years, and that should look impressive regardless.


Funny, I just did a little comparison between a 68040 and a PPC 7455 for the recent #MARCHintosh.

 

I compressed some videos and did some batch image manipulation. The summary: a PPC 7455 (1.25 GHz) is about 100-200 times faster than a 68040 (33 MHz) at the same tasks, using software as similar as possible. (See my #MARCHintosh 2021 pages.)

 

I did not compare with anything really modern though.

 

 


The code is quite simple - it just calculates primes to 1,000,000 using as many cores as the computer has available. (So 8 for my M1 Mac, 24 for my old Mac Pro, 2 for my old Mac Mini - and one for a 68k Mac).  If pthread isn't available (and, looking at my manuals for Think C, it won't be) then I'll omit it.  In any event, to speed execution and to minimise the dependency on memory, it doesn't output the prime when it finds it or store it or anything funky like that - it just increments a counter.  When execution completes, it outputs the time taken to run - and the number of primes found (a useful check to see if the program is running correctly).
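To give a rough idea of the shape of the thing, here's an illustrative sketch (not the actual program - the thread count, names, and timing call are all placeholders, and the 68k build would drop the pthread plumbing and run the whole range in one loop):

/*
 * Illustrative sketch only (not the actual benchmark): count primes
 * below 1,000,000 by trial division, splitting the range across a
 * handful of worker threads.  Build with -pthread on modern systems.
 */
#include <stdio.h>
#include <time.h>
#include <pthread.h>

#define LIMIT   1000000L
#define THREADS 4                    /* set to the number of cores available */

struct chunk {
    long start, end;                 /* half-open range [start, end) to test */
    long count;                      /* primes found in this range */
};

static int is_prime(long n)
{
    long d;

    if (n < 2)
        return 0;
    if (n % 2 == 0)
        return n == 2;
    for (d = 3; d * d <= n; d += 2)
        if (n % d == 0)
            return 0;
    return 1;
}

static void *worker(void *arg)
{
    struct chunk *c = arg;
    long n;

    for (n = c->start; n < c->end; n++)
        if (is_prime(n))
            c->count++;              /* never store or print - just count */
    return NULL;
}

int main(void)
{
    pthread_t tid[THREADS];
    struct chunk chunks[THREADS];
    long per = LIMIT / THREADS, total = 0;
    int i;
    time_t t0 = time(NULL);

    for (i = 0; i < THREADS; i++) {
        chunks[i].start = i * per;
        chunks[i].end   = (i == THREADS - 1) ? LIMIT : (i + 1) * per;
        chunks[i].count = 0;
        pthread_create(&tid[i], NULL, worker, &chunks[i]);
    }
    for (i = 0; i < THREADS; i++) {
        pthread_join(tid[i], NULL);
        total += chunks[i].count;
    }

    printf("%ld primes below %ld in %.0f seconds\n",
           total, LIMIT, difftime(time(NULL), t0));
    return 0;
}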

As you say, all things won't necessarily be equal - compiler efficiency, for one - but I'll take advantage of every optimisation the compiler offers, and I'll use the FPU if available. This is purely about CPU; GPU isn't a consideration.

So… Who's up for burning some cycles?

Posted (edited)

What makes you think you can’t get your code to run on anything older than a 512KE?  As long as you’re not using HFS or any 128k ROM-specific calls (and why would you be, just to compute primes?), code that runs on a 512KE will run on a 512k, and as long as it doesn’t need more than about 90k of RAM, anything that runs on a 512k should run on a 128k.  You shouldn’t need a special/different C compiler.

 

In other words, I am recommending you solicit a volunteer with an original 1984 128k Mac for your worthy experiment ....

Edited by Crutch

Calculating primes can be done by simply counting through them, or it can be done via sieving. If the former, the code will run mostly or completely out of the CPU's cache.
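For contrast, a sieve looks roughly like the sketch below - a minimal Sieve of Eratosthenes for illustration only, not the benchmark discussed here. Note the megabyte-sized table it has to walk, which is why it exercises RAM rather than staying in cache:

/* Minimal Sieve of Eratosthenes sketch (illustration only, not the
 * benchmark used in this thread).  The 1 MB table means it works out
 * of main memory rather than running out of the CPU cache. */
#include <stdio.h>
#include <stdlib.h>

#define LIMIT 1000000L

int main(void)
{
    char *composite = calloc(LIMIT, 1);  /* composite[n] != 0 means n is not prime */
    long p, m, n, count = 0;

    if (composite == NULL)
        return 1;

    for (p = 2; p * p < LIMIT; p++)
        if (!composite[p])
            for (m = p * p; m < LIMIT; m += p)
                composite[m] = 1;

    for (n = 2; n < LIMIT; n++)
        if (!composite[n])
            count++;

    printf("%ld primes below %ld\n", count, LIMIT);
    free(composite);
    return 0;
}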

 

Doing pure counting, finding the 1,000,000th prime (different from what you posted, but it's a small benchmark I use regularly), a 50 MHz m68040 takes about 1220 seconds. A single core of an Intel Core i3 (i3-2125) at 3.3 GHz takes 14.8 seconds, and an AMD Ryzen 2600 at 3.4 GHz does the same in about 5.94 seconds.

 

The Ryzen finishes about 205 times faster overall, but its clock is 68 times faster, which means that clock for clock the Ryzen is just three times faster! This illustrates how purely synthetic certain benchmarks are. I primarily use this to calculate actual clock rate on systems with "turbo" clocking and such:

 

Tweet sized benchmark
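(The actual source is behind that link; the general idea is just trial division with a running count, roughly like the reconstruction below - my own paraphrase, not the real tweet. Set the target to 78,498 or 1,000,000 primes as desired, and it prints the count and the last prime found.)

/* Rough paraphrase of the idea (not the actual tweet-sized code):
 * test odd candidates by trial division until the requested number
 * of primes has been found, then print the count and the last prime. */
#include <stdio.h>

int main(void)
{
    long target = 78498;             /* or 1000000 for the longer run */
    long count = 1, n = 1, last = 2; /* the prime 2 is counted up front */
    long d;

    while (count < target) {
        n += 2;                      /* odd candidates only */
        for (d = 3; d * d <= n; d += 2)
            if (n % d == 0)
                break;
        if (d * d > n) {             /* no divisor found, so n is prime */
            count++;
            last = n;
        }
    }
    printf("%ld, %ld\n", count, last);
    return 0;
}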


So back to the original query: there are 78,498 primes less than 1,000,000. Counting that many primes with my little benchmark takes about 25.4 seconds on a 50 MHz m68040. How much faster is this than a 7.8 MHz m68000? My rough guess, based on the fact that the clock speed is 6.4 times faster and the number of t-states per instruction is probably different by a factor of eight, is that the '040 would be about fifty times faster or so. My guess, then, is that this benchmark running on an original Mac 512K would take about 23 minutes or so.

 

Let's give it a go! Can you compile up this benchmark for 78,498 primes?

 

I'm also interested to see how a Mac binary compares speedwise with a binary compiled by gcc and run in NetBSD ;)

  • 2 weeks later...
On 3/28/2021 at 6:47 PM, johnklos said:

Click on the link above that says, "Tweet sized benchmark". The code is right there!

 

I'm going to try to set up MPW and/or Think C to see if I can compile it for Mac OS. Now who'll test on an m68000 Mac? ;)


It's a shame nobody's replied to this.

For laughs I just compiled your benchmark, set to look for 78,498 primes, using Turbo C++ 1.0 on my Tandy 1000HX; I'll update when it's finished running. I'm grimly curious how its 7.16 MHz NEC V20 stacks up against the 68000. :)

On 4/5/2021 at 10:56 PM, Gorgonops said:

For laughs I just compiled your benchmark, set to look for 78,498 primes, using Turbo C++ 1.0 on my Tandy 1000HX; I'll update when it's finished running. I'm grimly curious how its 7.16 MHz NEC V20 stacks up against the 68000. :)


Ahahaaahaaahaaa! So the grand total is about four hours and 24 minutes, or 15,845 seconds.

[Screenshot: primetime.jpg - the timed run on the Tandy]

For comparison, here's what it takes on a 2.6 GHz i7 2019 MacBook Pro:

 

$ gcc -O2 dosprime.c -o dosprime
$ time ./dosprime 
78498, 999983

real	0m0.222s
user	0m0.214s
sys	0m0.004s


Or about 75,000 times faster, give or take. And, of course, it’s only using one of the six cores so the truth is far, far worse.

Of course, I'm sure this is about the worst conceivable scenario for an 8086-compatible CPU, performing a shedload of modulus operations on 32 bit numbers, but I'm still curious how the 68000 does. Maybe I could make it go a little faster by tweaking the compile settings; I'm not too familiar with Turbo C. ;)


I will run the faster code on my 512k. If I rewrite it in Pascal (I only have THINK Pascal on it right now), would it still be a fairly apples-to-apples comparison for something so simple? I'll look into getting a C++ compiler if not.


Wow! There's a huge performance jump between the V20 and the m68030:

 

                V20          m68030        m68030        m68040
Clock           7.16 MHz     15.667 MHz    31.334 MHz    50 MHz
Bus             8 bit        16 bit        32 bit        32 bit
Time (seconds)  15845        205           83            25.5

Relative performance per MHz (seconds × MHz; lower is better)
                113450       3211          2600          1275

 

For faster machines, I usually do 1,000,000 primes:

 

                m68040    m68060    Core i3-2125    Athlon 5350    Ryzen 2700X    Raspberry Pi 4
Clock           50 MHz    50 MHz    3.3 GHz         2.05 GHz       3.7 GHz        1.5 GHz
Time (seconds)  1296      939       14.89           10.4           5.5            11.8

Relative performance per MHz (seconds × MHz; lower is better)
                64800     46950     49137           21320          20350          17700

 

By comparison, for 1,000,000 primes the m68030 with the 16 bit bus would come in at 145703, and with the 32 bit bus at 124114 (same seconds × MHz metric).

 

Again, because this is run almost entirely in the cache (for pretty much any CPU that has cache), it shows some interesting results, particularly looking at the m68060 running faster per clock than the Core i3, and the Raspberry Pi 4 running faster per MHz than the Ryzen.

On 4/7/2021 at 6:17 PM, johnklos said:

Wow! There's a huge performance jump between the V20 and the m68030:


I suspect that anything that's not a 32 bit CPU is automatically going to take a huge hit. "Longs" don't come naturally to the 8088. (That may put the 68000 in a slightly interesting spot, because hardware-wise you can kind of argue it's really a 16 bit CPU emulating a 32 bit one - i.e., it has a 16 bit ALU, etc.) In general usage a 68030 is certainly going to be a lot faster than a V20, but 35x per clock is probably overstating it a *bit*. ;)

(The spread observed here is actually of very similar magnitude to the difference in BogoMips scores between the two CPUs, coincidentally enough. And as the old Bogomips Mini-Howto stressed, BogoMips wasn't really a very fair "benchmark".)

On 4/7/2021 at 11:37 PM, Gorgonops said:

I suspect that anything that's not a 32 bit CPU is automatically going to take a huge hit.


Just to satisfy my curiosity, I made a "junior" version of the benchmark that finds the primes under 32768 (i.e., the limit of a signed 16 bit integer). I compiled two versions of it on the Tandy, one retaining the 32 bit "LONG" ints and the other using the standard 16 bit ints. The results:

[Screenshot: prime16_32.jpg - the timed 32 bit and 16 bit runs]
 

(Ignore the incorrect printout from the "PRIME16" binary in this shot; I neglected to change the printf format string on that run and didn't catch it until I'd already processed the screenshot. Fixing the format string produces the correct output.)

The "Long" and short of it is the one with 32 bit numbers takes a little over 32 seconds while the 16 bit version takes about ten and a half. So, yeah, doing 32 bit math puts a bad case of the hurts on a 16 bit x86 CPU. Trying 32 and 16 bit INT versions on a 68000 might also be interesting. (I don't anticipate there'd be much if any difference on the 68020+.)

On 4/9/2021 at 9:59 AM, MrFahrenheit said:

How hard would it be to make this app calculate digits of Pi?  There’s a popular benchmarking tool on x86 called y-cruncher. I’ve used it to set some world records myself. It would be interesting to see the Pi calculation times on a 68k Mac. 

 

That would be a different program. Like the difference between sieving and counting, digits of Pi can be calculated by keeping everything in memory or by calculating the digits one at a time. I think that would warrant a different thread...

 

I replied to this with my primes benchmark because it’s small enough to fit in a single tweet. Any Pi program would be a bit bigger.

On 4/8/2021 at 12:38 AM, johnklos said:

Now I'm more and more curious about the results on an m68000...


Since you have access to such a wide range of systems, how would you feel about running the old "flops.c" floating point benchmark on them? I'm grimly curious how that scales to really old systems. (I used to have a collection of results for that going back to the Pentium era, but now the oldest I can find is some scores from G4 systems.)

flops.c results can be very compiler optimization-sensitive, which I suppose is a downside. I'd probably just suggest going with "-O3" across the board on the first crack.
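Something like this, I'd imagine (the exact compiler name and whether you need to add -lm will vary by system):

$ cc -O3 -o flops flops.c
$ ./flops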

On 4/11/2021 at 2:17 PM, Gorgonops said:

Since you have access to such a wide range of systems, how would you feel about running the old "flops.c" floating point benchmark on them?

 

I have results from flops.c, but that almost entirely measures the FPU with practically no help from the CPU. Perhaps that can be another thread. Related to floating point, I'm also collecting ffmpeg performance numbers on modern versus older CPUs such as the m68040. For instance, my LC II's m68881 is almost exactly 1/4 the speed of my LC III+'s m68882; the former runs at half the clock speed and has half the memory bandwidth (16 bits versus 32 bits), so it makes sense.

 

flops.c wouldn't be useful at comparing the m68000, though.

 

That said, I finally had some time to try out MPW, so I've created some Mac primes benchmarks:

 

https://www.klos.com/~john/primesbench.sit

 

Please try them out, particularly on any m68000 machines! Note that the long version counts the first million primes, which could take a full day or more on a Mac Plus-class machine. I'm estimating / guessing that counting the first 78,498 primes on an m68000 will take about half an hour. This guess is based on the speed of an LC III+, scaled by the clock difference, bus width difference, memory access speed, and instruction cache gains.

 

Who wants to give it a go?

On 4/12/2021 at 2:54 AM, Stillwell said:

68000 in a 512k, just under 40 minutes.

 

On a 512K Mac? Awesome! I thought it might fit, but I wasn't sure.

 

I've read that the display hardware reduces memory access throughput by as much as 35% (apple.fandom.com) and that the video hardware of the 128K / 512K / Plus is different than the video hardware of the SE (Big Mess O' Wires), so I'd be interested to compare your 512K's time with that of an SE.

 

We'll also have to run these on other Macs so we can compare the speeds from running in NetBSD to speeds running in Mac OS. For such simple code, I'd expect very small differences. For now I'm on the other side of the country from my Mac hardware, so any runs from other people would be most welcome :)

On 4/12/2021 at 1:23 AM, johnklos said:

I have results from flops.c, but that almost entirely measures the FPU with practically no help from the CPU. Perhaps that can be another thread.


Sure, but given that practically every modern CPU has had floating point built in since the mid-1990s, and that much of the really backbreaking stuff we expect from modern CPUs leverages the floating point hardware (granted, often in the form of SIMD instructions), you certainly can't just ignore it if you're trying to map out how much performance (both absolute and per-clock) has evolved over the years.

But, yes, this is a little off topic for the 68000. (Although hypothetically you can *use* a 68881 with one, it doesn't integrate the same way as it does on the 68020+.) The 68010-based SUN-2 systems didn't have any floating-point hardware, did they? That's the oldest/slowest/closest machine to an original 68000 Mac I can think of that can run something like a "modern" UNIX-oid OS.

