ASM for FUN - #6 SUB [Searching inside random numbers]

Started by frktons, April 25, 2010, 07:10:21 PM

Previous topic - Next topic

frktons

Here we have the first counting excercise  :lol

In the attached zip file:
- CombCheck4.bas
- CombCheck4.exe
- CombCheck4.scn

to run the program you have to use the program RandGen3.exe/scn attached in SUB #5
to generate the file with random numbers and after the CombCheck4 to count/display
the repetitions of the single numbers.

The RND function of PB has done a pretty good job as you can see.

This is a simplified counting, and the performance misures all the steps togheter
not the single counting routine.

All the job is done with a simple loop:

SUB CountNumbers
     
    FOR x = 1 TO 320000
        INCR numbers&(n&(x))
    NEXT x
   
END SUB


It is just a short sample of what we can count on the file.
The next SUBs are going to do more complex counting, like the one I described earlier.


Enjoy
Mind is like a parachute. You know what to do in order to use it :-)

clive

Ok, so creating 80000, 4 groups of 4, followed by identifying [1, 2, 3, 4] in any order, as well as any other 4 group pairings.

3 GHz P4 Prescott, takes 220 ms in C, without much effort to optimize.

Here we identified FOUR occurrences of [1,2,3,4],[1,2,4,3],[4,3,2,1],etc from the 320000 groups generated

Buffer    5120000 bytes

Generating..

        16 Ticks,   76402283 Cycles

Saving.. 'rndtest.out'


Evaluating Distribution..

   25600 Expected Average per bin

1 :      25616     16
2 :      25528    -72
3 :      25565    -35
4 :      25677     77
5 :      25915    315
6 :      25689     89
7 :      25880    280
8 :      25484   -116
9 :      25690     90
10 :      25340   -260
11 :      25548    -52
12 :      25686     86
13 :      25382   -218
14 :      25813    213
15 :      25240   -360
16 :      25729    129
17 :      25560    -40
18 :      25385   -215
19 :      25507    -93
20 :      25676     76
21 :      25339   -261
22 :      25727    127
23 :      25488   -112
24 :      25821    221
25 :      25463   -137
26 :      25583    -17
27 :      25844    244
28 :      25361   -239
29 :      25504    -96
30 :      25660     60
31 :      25542    -58
32 :      25476   -124
33 :      25628     28
34 :      25629     29
35 :      25512    -88
36 :      25493   -107
37 :      25708    108
38 :      25533    -67
39 :      25874    274
40 :      25475   -125
41 :      25723    123
42 :      25612     12
43 :      25477   -123
44 :      25528    -72
45 :      25925    325
46 :      25596     -4
47 :      25648     48
48 :      25706    106
49 :      25543    -57
50 :      25672     72

Converting to bit vectors..

        16 Ticks,   71265015 Cycles

Sorting bit vectors..

       187 Ticks,  571231005 Cycles

  4 000000000000000F [1,2,3,4]
  2 0000000000000017 [1,2,3,5]
  3 000000000000001D [1,3,4,5]
  4 000000000000002B [1,2,4,6]
  2 000000000000002D [1,3,4,6]
  2 000000000000002E [2,3,4,6]
... ad tedium
  2 0003804000000000 [39,48,49,50]
  3 0003840000000000 [43,48,49,50]
  2 0003880000000000 [44,48,49,50]
  2 0003900000000000 [45,48,49,50]


The most frequent in this run

  9 0000001000020820 [6,12,18,37]
  9 0000001100480000 [20,23,33,37]
  9 0000044000088000 [16,20,39,43]
  9 0000100420000020 [6,30,35,45]
  9 0000200001001100 [9,13,25,46]
It could be a random act of randomness. Those happen a lot as well.

frktons

For the generation of random numbers did you use a C ANSI function?

the performance you reported:

Generating..

        16 Ticks,   76,402,283 Cycles


is much the same of the PB function RND, that in my core duo 2.6 Ghz takes:

        16 ticks,     60,298,705 Cycles


For identifing the sequences of 4 numbers I should have a look at the C CODE to
understand what you did, hoping I can grasp it's meaning.  :lol

As I get time enough to code the 4 digit search inside the random numbers file
I'll post it and test the performance of the routine to have a performance match.

Regarding this part of the report:


Sorting bit vectors..

       187 Ticks,  571231005 Cycles


That's what I suspected, a lot of time spent doing something that the
algo I've in mind skips altogheter, no need to sort anything.

If I don't fall asleep, in few hours I'll post the new version for the 4 digit searching
and counting.

Mind is like a parachute. You know what to do in order to use it :-)

clive

I used a 32-bit MLS (maximal length sequence) pseudo random binary sequence pulling out 6-bits at a time. Rolling the dice again if the number is outside 0..49, or if I hit a previously selected number. It's derived from some efficient ARM assembler (5 cycles), and probably doesn't translate well in C, or x86. The distribution looked better than rand().

The vectorization takes your groups of 4, and scoreboards the values into a 64-bit representation. Once I've filled this table, I then QuickSort it.

ELEMENTS = (80000 * 4)

  printf("Converting to bit vectors..\n\n");

  TickStart = GetTickCount();
  TSCStart = ReadTSC();

  p = Buffer; // DWORD *, where group of 4 elements [ p[0],p[1],p[2],p[3] ]
  q = BitVector; // __int64 *

  for(i=0; i<ELEMENTS; i++)
  {
    *q = ((__int64)1 << (p[0]-1)) | ((__int64)1 << (p[1]-1)) | ((__int64)1 << (p[2]-1)) | ((__int64)1 << (p[3]-1));

    p += 4; // Advance by 4 entries
    q++; // Advance by single 64-bit entry
  }

  TSCEnd = ReadTSC();
  TickEnd = GetTickCount();

  printf("%10d Ticks, %10d Cycles\n",TickEnd - TickStart, TSCEnd - TSCStart);
It could be a random act of randomness. Those happen a lot as well.

frktons

Quote from: clive on April 26, 2010, 05:11:35 PM
Ok, so creating 80000, 4 groups of 4, followed by identifying [1, 2, 3, 4] in any order, as well as any other 4 group pairings.


  4 000000000000000F [1,2,3,4]
  2 0000000000000017 [1,2,3,5]
  3 000000000000001D [1,3,4,5]
  4 000000000000002B [1,2,4,6]
  2 000000000000002D [1,3,4,6]
  2 000000000000002E [2,3,4,6]
... ad tedium
  2 0003804000000000 [39,48,49,50]
  3 0003840000000000 [43,48,49,50]
  2 0003880000000000 [44,48,49,50]
  2 0003900000000000 [45,48,49,50]


The most frequent in this run

  9 0000001000020820 [6,12,18,37]
  9 0000001100480000 [20,23,33,37]
  9 0000044000088000 [16,20,39,43]
  9 0000100420000020 [6,30,35,45]
  9 0000200001001100 [9,13,25,46]


I can't really understand these results.
Let's see what they mean:


The most frequent in this run

9 0000001000020820 [6,12,18,37]

9 = how many times you found it?
0000001000020820 = the vectorization of a 4 digit group?
[6,12,18,37] = the corresponding 4 digit group?

Please help me to understand the output of your routine.  ::)


Mind is like a parachute. You know what to do in order to use it :-)

clive

Yes, the bit patterns at the end were each identified 9 times in the 320000 entries. I ignored entries found only once.

The bit patterns are 64-bit HEX, where the bits represent 2**(N-1). So bit 0 is 1, bit 49 is 50

N=1, 0001
N=2, 0002
N=3, 0004
N=4, 0008

[1, 2, 3, 4] = 1 or 2 or 4 or 8 = 15 or 0x0F
It could be a random act of randomness. Those happen a lot as well.

frktons

Quote from: clive on April 26, 2010, 09:38:32 PM
Yes, the bit patterns at the end were each identified 9 times in the 320000 entries. I ignored entries found only once.

The bit patterns are 64-bit HEX, where the bits represent 2**(N-1). So bit 0 is 1, bit 49 is 50

N=1, 0001
N=2, 0002
N=3, 0004
N=4, 0008

[1, 2, 3, 4] = 1 or 2 or 4 or 8 = 15 or 0x0F

This puzzles me even more.

We should have a total of 230,300 possible combinations of 4 numbers,
with no repetition and ignoring the order inside the single combinations,
[1,3,4,1] = not allowed because of repetition

[1,2,3,4] = any combination of the 4 digit = we consider 1 combination total

if we have a total of 230,300, and we find some of them 9 times, it means we should
find others 0 times to balance the total.
Did you find many combinations with 0 score?

We are matching 230,300 combinations [all the possible ones] against
80.000 [the random generated ones], or at least this is what I have imagined.
Are you using this pattern?
Mind is like a parachute. You know what to do in order to use it :-)

dedndave

if there are 230,300 combinations and 80,000 groups, most of them should have a 0 count no matter what   :red
it is somewhat unlikely that a group will occur twice and more unlikely it will occur three times if you have a decent RNG

clive

There are 80000 x 4 entries (320000) each with a group of 4 numbers

With potentially 230300 combinations, it looks to cover about 172000 of them. I'd have to say the routine I was using, and rand() both has "sequence" patterns.

I haven't counted the number of 0 counts, because it doesn't generated any.
It could be a random act of randomness. Those happen a lot as well.

dedndave

i thought there were only supposed to be 80,000 four-value groups
no wonder you have the bigger numbers, Clive    :red
it should be easy to figure out how many 0-groups there are
take the number of possible groups minus the number of different groups found

clive

With 80000, I'm getting around ~67650 unique combinations.
It could be a random act of randomness. Those happen a lot as well.

frktons

#26
Quote from: clive on April 27, 2010, 01:57:25 AM
With 80000, I'm getting around ~67650 unique combinations.

That's quite probable.

What about the performance?
My PB algo is taking something around 10-11 minutes, based on early tests,
to match 230,300 combinations against the 80,000 random ones.

What performance did you get? How long does it take to run the whole loop?



             groups occurring 0 times: 162,887
             groups occurring 1 times: 56,191
             groups occurring 2 times: 9,970
             groups occurring 3 times: 1,143
             groups occurring 4 times: 105
             groups occurring 5 times: 4
             groups occurring 6 times: 0
             groups occurring 7 times: 0
             groups occurring 8 times: 0
             groups occurring 9 times: 0
             groups occurring 10 times: 0

             Total groups scanned: 230,300
                     Elapsed MS: 646,936
                     Elapsed CPU Cycles: 1,548,762,482,493



The core routine for matching the combinations:



'---------------------------------------------------------------------------------------
' Counts the occurrences of the 4 digit groups that match all the digits.
' Test all the groups of 4 digits starting from [1,2,3,4] to [47,48,49,50]
' against the 80,000 randomly generated ones.
'---------------------------------------------------------------------------------------

SUB CountNumbers

    groups = 0
    n_idx = 1
    n_target = 4

    FOR x1 = 1 TO 47
        FOR x2 = x1 + 1 TO 48
            FOR x3 = x2 + 1 TO 49
                FOR x4 = x3 + 1 TO 50
                    GOSUB Check
                NEXT x4
            NEXT x3
        NEXT x2
    NEXT x1

    GOTO exit_count

Check:


    FOR x = 1 TO 80000


        numbers&(n&(n_idx)) = 1
        INCR n_idx
        numbers&(n&(n_idx)) = 1
        INCR n_idx
        numbers&(n&(n_idx)) = 1
        INCR n_idx
        numbers&(n&(n_idx)) = 1


        total_group = numbers&(x1) + numbers&(x2) + numbers&(x3) + numbers&(x4)
        IF total_group = n_target THEN
           INCR match
        END IF

        total_group = 0

        numbers&(n&(n_idx)) = 0
        numbers&(n&(n_idx -1)) = 0
        numbers&(n&(n_idx -2)) = 0
        numbers&(n&(n_idx -3)) = 0

        INCR n_idx

    NEXT x

    INCR occurr&(match)

    match = 0
    n_idx = 1


RETURN

exit_count:

END SUB


I didn't do anything to optimize, yet. This is the raw version.  :P
Mind is like a parachute. You know what to do in order to use it :-)

frktons

With the correct use of a couple of registers it runs in less than 8 minutes now  :8)


    #REGISTER NONE
    REGISTER n_idx AS LONG
    REGISTER total_group AS LONG   
Mind is like a parachute. You know what to do in order to use it :-)

dedndave

 :8)  <--- not so fast, Frank
8 minutes is too long, i should think
that is because you are testing all 230,300 combinations

sorting the groups was mentioned earlier (sorting the 4 values in each group - not the groups)
i think i would sort them before they are stored - as they are generated, in fact
that will speed up anything you want to do with them afterward

if they are ordered groups, you could also convert them to ordinals, as well
it depends a lot on what you want to do with the values when you are done
but for each group, you could store the index (dword), the ordinal (dword), and the 4 values (bytes)
now you can even sort the groups - you know what order they were chose in by the index value
tallying the totals like you are would be very fast

frktons

Quote from: dedndave on April 27, 2010, 11:20:38 AM
:8)  <--- not so fast, Frank
8 minutes is too long, i should think
that is because you are testing all 230,300 combinations


You are right, of course. I was enjoying the fact that without
changing the algo, only defining 2 variables in a different way, the
algo got a 30% optimization.  :P <== probably this is more appropriate  :lol

Quote from: dedndave on April 27, 2010, 11:20:38 AM

sorting the groups was mentioned earlier (sorting the 4 values in each group - not the groups)
i think i would sort them before they are stored - as they are generated, in fact
that will speed up anything you want to do with them afterward

if they are ordered groups, you could also convert them to ordinals, as well
it depends a lot on what you want to do with the values when you are done
but for each group, you could store the index (dword), the ordinal (dword), and the 4 values (bytes)
now you can even sort the groups - you know what order they were chose in by the index value
tallying the totals like you are would be very fast

You are right again, in this particular situation, given a file with numbers
that are not changing, an index would speed up the algo a lot.

I think because it could help skipping the combinations [most of them]
that are out of the index range we are searching.

Anyway I've to think about it. I haven't decided yet what kind of statistics
estrapolate from the randomly generated numbers, but it should be
something like: any.  :dazzled:

For example in this case we have searched for 4 digit match, it could be
for 1 digit match or 2 or 3 as well, or it could be: when a specific number appears, what are
the numbers that are more likely to appear in the next 10-20 groups, if any,
or it could be how long [or many groups] between a number appears and the next
occurrence, any combination of the 1 digit - 2 digit - 3 digit inside the groups and
the above, and so on...

Like in a lottery they say: this number doesn't show since 12 months, or in
a web shop: this client doesn't place an order since feb 2009, or the like
in any flavour.
In which season of the year this particular customer places a lot of orders,
or in numbers: when this particular number shows more frequently...

Probably for some task a sorted group [1,2,3,4] is better than [2,1,4,3] and
for some others it is not, or it doesn't affect the performance. I'm not sure. :red

From your comments and suggestions I have to think that it is always better
to have some kind of index.
I don't have an engineer background, so I'm just making suppositions.
Recalling my own past experience, mostly of the times it is better to have
an index than not.

I have to thank you and all the guys in the forum for the good suggestions.
I need some time to digest, I am quite slow at the moment.

Because it is just an exercise, with no particular objective to reach, anything
that shows up is just an occasion to learn something new.

So bear some patience, and let's have some random fun  :P

P.S. When I get time enough I'll try using an index with this particular task
and see how does it work. No algo in my mind at the moment.  :lol


Mind is like a parachute. You know what to do in order to use it :-)