Re: [LF] MATMUL-Performance


Lahey Support
08-15-2003, 01:33 AM
----- Original Message -----
From: [address removed]
To: "Lahey User" <>
Cc: [address removed]
Sent: Sunday, September 09, 2001 9:21 PM
Subject: Re: [LF] MATMUL-Performance


> I have continued with the work of JvO for matrices of rank 2-10, with
> some interesting results. I have used 5 different coding approaches
> (hopefully without bugs):
> 1) Explicit
> 2) Dot_Product
> 3) MATMUL
> 4) Simple matrix multiplication
> 5) Modified matrix multiplication with explicit array indexing.
>
> I have used Salford FTN95 Ver 2.04, both with and without optimisation,
> on Windows 95 4.00.950 B / 200MHz Pentium/MMX.
> The results are very interesting, especially for the effect of
> optimisation on MATMUL. I would be very interested to see the
> performance comparison on other compilers and processors. My recent
> experience of testing on a P4 has shown that the size of the executable
> code (and presumably its ability to be stored in the cache) can have a
> significant effect on performance. The old FTN77 idea of removing
> multiple-dimensioned subscripts doesn't seem to apply any more.
> Has anyone had other experiences of this?
>
> Regards John Campbell
>
> Table of execution times (seconds) for optimised compilation:
>
> Rank   Exp.    dot     Matmul  Sub     OptSub
>
>  2     0.380   0.330   0.550   0.770   0.660
>  3     0.820   0.610   0.660   0.980   1.050
>  4     1.260   0.830   0.930   1.210   1.590
>  5     1.810   1.160   1.260   1.430   2.190
>  6     2.750   1.370   1.650   1.810   3.240
>  7     3.630   1.810   1.920   2.150   4.060
>  8     4.450   2.530   2.630   2.860   5.000
>  9     5.710   2.800   3.130   3.300   6.420
> 10     6.870   3.290   3.520   3.730   7.690
>
> With optimisation, MATMUL and the subroutine call show significant
> improvement. How does optimisation improve MATMUL?
> The explicit array indexing in the modified subroutine (OptSub) appears
> to hamper the improvement from optimisation.
>
> These times are for Salford FTN95 Ver 2.04. What do other compilers show?

CVF is the only compiler available to me which expands the small
MATMUL in-line. It also would be expected to expand the
subroutine in-line. Modern compilers (even C) should be expected
to perform strength reduction on subscripts.
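
The benchmark source (matmul.f90) is not reproduced in this thread, so the following is only a guess at what four of the five variants might look like for small n x n matrices. The module name, procedure names and loop ordering are all hypothetical; the "Explicit" variant (presumably every element written out by hand) is omitted.

```fortran
module mm_variants
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Variant 2 (assumed): one DOT_PRODUCT per result element.
  subroutine mm_dot(a, b, c, n)
    integer, intent(in) :: n
    real(dp), intent(in) :: a(n,n), b(n,n)
    real(dp), intent(out) :: c(n,n)
    integer :: i, j
    do j = 1, n
      do i = 1, n
        c(i,j) = dot_product(a(i,:), b(:,j))
      end do
    end do
  end subroutine mm_dot

  ! Variant 3: the MATMUL intrinsic.
  subroutine mm_intrinsic(a, b, c, n)
    integer, intent(in) :: n
    real(dp), intent(in) :: a(n,n), b(n,n)
    real(dp), intent(out) :: c(n,n)
    c = matmul(a, b)
  end subroutine mm_intrinsic

  ! Variant 4 (assumed): simple triple loop with 2-D subscripts.
  subroutine mm_simple(a, b, c, n)
    integer, intent(in) :: n
    real(dp), intent(in) :: a(n,n), b(n,n)
    real(dp), intent(out) :: c(n,n)
    integer :: i, j, k
    c = 0.0_dp
    do j = 1, n
      do k = 1, n
        do i = 1, n
          c(i,j) = c(i,j) + a(i,k) * b(k,j)
        end do
      end do
    end do
  end subroutine mm_simple

  ! Variant 5 (assumed): FTN77-style 1-D arrays with hand-maintained
  ! running indices instead of 2-D subscripts.
  subroutine mm_indexed(a, b, c, n)
    integer, intent(in) :: n
    real(dp), intent(in) :: a(n*n), b(n*n)
    real(dp), intent(out) :: c(n*n)
    integer :: i, j, k, ik, kj
    real(dp) :: s
    do j = 1, n
      do i = 1, n
        s = 0.0_dp
        ik = i               ! position of a(i,1) in column-major order
        kj = (j - 1)*n + 1   ! position of b(1,j)
        do k = 1, n
          s = s + a(ik) * b(kj)
          ik = ik + n        ! step to a(i,k+1)
          kj = kj + 1        ! step to b(k+1,j)
        end do
        c((j - 1)*n + i) = s
      end do
    end do
  end subroutine mm_indexed

end module mm_variants
```

The running increments of ik and kj in mm_indexed are the hand-written form of the strength reduction that an optimising compiler would be expected to derive from the 2-D subscripts in mm_simple on its own.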

The effect of /arch:p6 is to prevent introduction of prefetch
instructions, which slow down execution of artificial benchmarks
such as this.

CVF 6.6, on 700 MHz P3 laptop
df /optimize:5 /fast /arch:p6 matmul.f90

Rank   Exp.    dot     Matmul  Sub     OptSub

 2     0.090   0.050   0.150   0.150   0.120
 3     0.140   0.090   0.200   0.190   0.200
 4     0.180   0.150   0.270   0.290   0.260
 5     0.240   0.200   0.340   0.340   0.340
 6     0.351   0.320   0.631   0.471   0.451
 7     0.361   0.381   0.481   0.541   0.521
 8     0.431   0.431   0.531   0.591   0.531
 9     0.511   0.511   0.621   0.691   0.621
10     0.611   0.591   0.771   0.761   0.751
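
The timing loop behind these tables is not shown in the thread. As a rough sketch of how such figures might be collected, the hypothetical routine below repeats a small MATMUL nrep times and measures the elapsed time with SYSTEM_CLOCK. The names, the repetition scheme and the checksum (which only keeps the optimiser from discarding the loop) are assumptions, not the original benchmark.

```fortran
module mm_timing
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: i8 = selected_int_kind(18)
contains

  ! Time nrep multiplications of n x n matrices (hypothetical harness).
  subroutine time_matmul(n, nrep, seconds, checksum)
    integer, intent(in) :: n, nrep
    real(dp), intent(out) :: seconds, checksum
    real(dp) :: a(n,n), b(n,n), c(n,n)
    integer(i8) :: t0, t1, rate
    integer :: irep

    a = 1.0_dp
    b = 1.0_dp
    checksum = 0.0_dp

    call system_clock(t0, rate)
    do irep = 1, nrep
      a(1,1) = real(irep, dp)          ! change an input on every pass
      c = matmul(a, b)
      checksum = checksum + c(1,1)     ! use the result on every pass
    end do
    call system_clock(t1)

    if (rate > 0) then
      seconds = real(t1 - t0, dp) / real(rate, dp)
    else
      seconds = 0.0_dp                 ! no clock available
    end if
  end subroutine time_matmul

end module mm_timing
```

For matrices this small the repetition count has to be large before the elapsed time rises above the clock resolution, and the result then reflects loop and call overhead as much as the arithmetic itself.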

