Re: [LF] MATMUL-Performance


Lahey Support
08-15-2003, 01:33 AM
I have continued the work of JvO for square matrices of rank (size) 2 to
10, with some interesting results. I have used 5 different coding
approaches (hopefully without bugs):
1) Explicit loops
2) Dot_Product
3) MATMUL
4) Simple matrix multiplication (subroutine mat_mul)
5) Modified matrix multiplication with explicit array indexing
   (subroutine mat_mul_opt).

I have used Salford FTN95 Ver 2.04, both with and without optimisation, on
Windows 95 4.00.950 B / 200MHz Pentium/MMX.
The results are very interesting, especially for the effect of
optimisation on MATMUL. I would be very interested to see the performance
comparison on other compilers and processors. My recent experience of
testing on a P4 has shown that the size of the executable code (and
presumably its ability to be stored in the cache) can have a significant
effect on performance. The old FTN77 idea of replacing multi-dimensional
subscripts with a single running index doesn't seem to apply any more.
Has anyone had other experiences of this?
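
For readers who have not met the FTN77 trick referred to above: it relies
on Fortran storing arrays in column-major order, so element (i,k) of an
array with leading dimension l sits at position i + (k-1)*l of the same
storage viewed as a 1-D array, and a row can be walked by adding l to a
running index. The following is only a minimal sketch of that mapping, not
part of the test program; the name index_demo and the sizes l = 3, m = 4
are made up for illustration.

```fortran
program index_demo
  implicit none
  integer, parameter :: l = 3, m = 4      ! illustrative sizes (assumption)
  double precision   :: a(l,m), flat(l*m)
  integer :: i, k, ik
  logical :: same

  ! Fill with distinct values so any index mismatch is detectable.
  do k = 1, m
    do i = 1, l
      a(i,k) = 10*i + k
    end do
  end do

  ! Copy a into a 1-D array in array element (column-major) order.
  flat = reshape(a, (/ l*m /))

  ! Walk each row with a running index instead of two subscripts.
  same = .true.
  do i = 1, l
    ik = i                                ! position of element (i,1)
    do k = 1, m
      if (flat(ik) /= a(i,k)) same = .false.
      ik = ik + l                         ! step to element (i,k+1)
    end do
  end do

  print *, 'strided walk matches a(i,k): ', same
end program index_demo
```

The running index ik here plays the same role as ik in mat_mul_opt in the
attached listing.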

Regards John Campbell

Table of execution times (seconds) for normal compilation:

Rank    Exp.     dot  Matmul     Sub  OptSub

   2   0.550   0.600   1.260   1.160   0.880
   3   1.090   1.270   1.700   2.140   1.380
   4   1.920   2.140   2.250   3.350   2.200
   5   2.910   3.300   2.740   4.890   3.130
   6   4.340   4.610   3.740   6.870   4.440
   7   5.830   6.150   4.450   8.950   5.820
   8   7.420   8.020   5.600  11.420   7.250
   9   9.290   9.770   6.760  14.220   8.740
  10  11.590  12.300   8.020  17.360  10.600

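As rank increases, MATMUL shows a significant relative improvement: from
rank 5 upwards it is the fastest of the five approaches.
The explicit array indexing in the subroutine (OptSub) gives a clear
improvement over the simple subroutine (Sub).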

Table of execution times (seconds) for optimised compilation:

Rank    Exp.     dot  Matmul     Sub  OptSub

   2   0.380   0.330   0.550   0.770   0.660
   3   0.820   0.610   0.660   0.980   1.050
   4   1.260   0.830   0.930   1.210   1.590
   5   1.810   1.160   1.260   1.430   2.190
   6   2.750   1.370   1.650   1.810   3.240
   7   3.630   1.810   1.920   2.150   4.060
   8   4.450   2.530   2.630   2.860   5.000
   9   5.710   2.800   3.130   3.300   6.420
  10   6.870   3.290   3.520   3.730   7.690

With optimisation, MATMUL and the simple subroutine call (Sub) show a
significant improvement.
How does optimisation improve MATMUL?
The explicit array indexing in the subroutine (OptSub) appears to hamper
the optimiser: OptSub gains far less from optimisation than Sub does.

These times are for Salford FTN95 Ver 2.04. What do other compilers show?

<<MATMUL.F95>>
! Last change: JDC 10 Sep 2001 1:53 pm
! [W.Schmidt] 2001-09-07 test-intr.f90
! [JvOosterwijk] 2001-09-07 test-intr.f90
! [J.Campbell] 2001-09-10 test-intr.f90
program Test_intr
  implicit NONE

  integer :: I, j, k
  real :: T0, T1, TIMES(5,10)
  integer, parameter :: N = 200000
  integer :: s
  real*8, allocatable :: a(:,:), b(:,:)
  real*8, allocatable :: x(:), y(:), z(:), z1(:), z2(:)
  !
  do s = 2,10
    !
    allocate (a(s,s))
    allocate (b(s,s))
    allocate (x(s))
    allocate (y(s))
    allocate (z(s))
    allocate (z1(s))
    allocate (z2(s))
    !
    forall (i=1:s,j=1:s) a(i,j) = i+j-1.
    forall (i=1:s,j=1:s) b(i,j) = i+j+4.
    forall (i=1:s) x(i) = 1.1*i
    forall (i=1:s) y(i) = 19.*i-26.
    !
    ! A = reshape((/1,2,3,4/), (/s,s/))
    ! B = reshape((/6,7,8,9/), (/s,s/))
    ! X = (/ 1.1d0, 2.2d0 /)
    ! Y = (/ -7d0, 12d0 /)

    call cpu_time(T0)
    do i = -N, N
      ! i have changed the following code
      a(s,s) = n-i ! Against opt.
      do j = 1,s
        z(j) = 0.
        do k = 1,s
          z(j) = z(j) + a(j,k)*x(k) - b(j,k)*y(k)
        end do
      end do
    end do
    call cpu_time(T1)
    write (*, '(a, F8.3)') ' Explicit:', T1 - T0
    write (*, *) z
    TIMES(1,S) = T1-T0

    call cpu_time(T0)
    do i = -N, N
      a(s,s) = n-i ! Against opt.
      do j = 1,s
        Z(j) = dot_product(A(j,:), X) - dot_product(B(j,:), Y)
      end do
    end do
    call cpu_time(T1)
    write (*, '(a, F8.3)') ' Dot_prod:', T1 - T0
    write (*, *) z
    TIMES(2,S) = T1-T0

    call cpu_time(T0)
    do i = -N, N
      a(s,s) = n-i ! Against opt.
      Z = MATMUL(A, X) - MATMUL(B, Y)
    end do
    call cpu_time(T1)
    write (*, '(a, F8.3)') ' MATMUL:', T1 - T0
    write (*, *) z
    TIMES(3,S) = T1-T0
!
    call cpu_time(T0)
    do i = -N, N
      a(s,s) = n-i ! Against opt.
      call mat_mul (z1, a, x, s, s, 1)
      call mat_mul (z2, b, y, s, s, 1)
      call mat_sub (z, z1, z2, s, 1)
    end do
    call cpu_time(T1)
    write (*, '(a, F8.3)') ' Sub MATMUL:', T1 - T0
    write (*, *) z
    TIMES(4,S) = T1-T0
!
    call cpu_time(T0)
    do i = -N, N
      a(s,s) = n-i ! Against opt.
      call mat_mul_opt (z1, a, x, s, s, 1)
      call mat_mul_opt (z2, b, y, s, s, 1)
      call mat_sub (z, z1, z2, s, 1)
    end do
    call cpu_time(T1)
    write (*, '(a, F8.3)') ' Opt Sub MATMUL:', T1 - T0
    write (*, *) z
    TIMES(5,S) = T1-T0
!
    deallocate (a)
    deallocate (b)
    deallocate (x)
    deallocate (y)
    deallocate (z)
    deallocate (z1)
    deallocate (z2)
  end do
  !
  WRITE (*,'(i5,5F10.3)') (s,TIMES(:,S),S=2,10)
end program Test_intr

subroutine mat_mul (c, a, b, l, m, n)
  !
  ! [c] = [a] x [b]
  !
  integer*4 l, m, n
  real*8 c(l,n), a(l,m), b(m,n)
  !
  integer*4 i, j, k
  real*8 s
  !
  do j = 1,n
    do i = 1,l
      s = 0.
      do k = 1,m
        s = s + a(i,k)*b(k,j)
      end do
      c(i,j) = s
    end do
  end do
end subroutine mat_mul

subroutine mat_mul_opt (c, a, b, l, m, n)
  !
  ! [c] = [a] x [b]
  !
  integer*4 l, m, n
  ! real*8 c(l,n), a(l,m), b(m,n)
  real*8 c(*), a(*), b(*)
  !
  integer*4 i, j, k, ij, ik, kj
  real*8 s
  !
  ij = 1
  do j = 1,n
    do i = 1,l
      s = 0.
      ik = i
      kj = (j-1)*m+1
      do k = 1,m
        s = s + a(ik)*b(kj)
        ik = ik+l
        kj = kj+1
      end do
      c(ij) = s
      ij = ij+1
    end do
  end do
end subroutine mat_mul_opt

subroutine mat_sub (c, a, b, l, m)
  !
  ! [c] = [a] - [b]
  !
  integer*4 l, m
  ! real*8 c(l,m), a(l,m), b(l,m)
  real*8 c(*), a(*), b(*)
  !
  integer*4 ij
  !
  do ij = 1,l*m
    c(ij) = a(ij) - b(ij)
  end do
end subroutine mat_sub

(See attached file: Matmul.f95)