Hello.
I want to count 1s in one byte of data in Assembler, but I dont get it work. Assembling the code is working, but the programs counting is wrong. Seems that the last bit is not counted. Here is my code:
.data
intout db "Einser in EAX: ", 0
.code ; Tell MASM where the code starts
start: ; The CODE entry point to the program
call main
exit
main proc
print OFFSET intout ; achtung: print-makro verändert eax!
xor ebx, ebx ; clear ebx register
mov eax, 00010111b ; 8-Bit Daten
mov ecx, 8 ; Länge der Daten in Bit
xor edx, edx ; clear counter fuer einser
einser_zaehler: ; zaehlt 1er, die in eax enthalten sind
clc
bt eax, ecx ; if bit = 1 then carry flag = 1
jnc not_a_one ; jump if not Carry = 1
inc edx ; If bit = 1 then increment edx
not_a_one:
loop einser_zaehler ; ecx = ecx-1 if ecx!=0 then loop again
print str$(edx)
ret
main endp
end start ; Tell MASM where the program ends
THX for all help!
Greetings from Germany
it might be easier to just shift it out of EAX
xor edx,edx
mov ecx,8
loop_start:
shr eax,1
adc dl,dh
dec ecx
jnz loop_start
oh - willkommen auf dem forum :U
Wow, that was fast!
Thank you very much!
Quote from: dedndave on November 10, 2010, 04:16:26 PM
willkommen auf dem forum :U
:naughty:
Quotewilkommen im Forum
:P
my pleasure - that one was easy
i wish all the questions were as easy to answer :P
well - i got that from google translate :lol
Personally I'd just use the POPCNT command, but as it looks like homework, the following is an interesting method to think about
mov al,byte ptr [i]
mov ah,al
shr ah,1
and ax,05555h
add al,ah
mov ah,al
shr ah,2
and ax,03333h
add al,ah
mov ah,al
shr ah,4
and ax,00F0Fh
add al,ah
mov byte ptr [j],al
POPCNT is nice, if the CPU in use supports it :bg
but, i like the "wiggle and jiggle" method - lol
(i am sure it has a better name)
Quote from: clive on November 10, 2010, 04:51:14 PM
Personally I'd just use the POPCNT command, but as it looks like homework, the following is an interesting method to think about
mov al,byte ptr [i]
mov ah,al
shr ah,1
and ax,05555h
add al,ah
mov ah,al
shr ah,2
and ax,03333h
add al,ah
mov ah,al
shr ah,4
and ax,00F0Fh
add al,ah
mov byte ptr [j],al
Thank you, too. Its good to know.
Quote from: dedndave
but, i like the "wiggle and jiggle" method - lol (i am sure it has a better name)
Don't know, I coded it off my head from first principles, I've seen it used before, and it is consistent with a hardware construct I would use. Pretty sure AMD or Intel has it in an optimization paper. Uses the rule that any two 2**n binary digits added together will always fit in 2**(n+1), the new bit is a carry.
Full-Adder Pyramid, Sieve, Shake-n-Bake?
lol - shake-n-bake - let's go with that one :bg
i have seen something similar in a previous thread
i believe they were using it to reverse the bit order in a dword register, which was a little more complicated
but, the same principal was used
let me see if i can find it...
http://www.masm32.com/board/index.php?topic=12722.msg98224#msg98224
as i recall, Drizz found it in some other forum
MichaelW dubbed it "stir-fry" - lol
very close to your "shake-n-bake" moniker
Here's one that cites AMD/Norbert
http://www.df.lth.se/~john_e/gems/gem002d.html
PDF Page 195 (Printed 179)
http://support.amd.com/us/Processor_TechDocs/25112.PDF
The 8-bit version was a quick hack interview answer, show's you're thinking.
yes - Norbert Jaffa says he derived it from the K6 docs
Quote from: dedndave on November 10, 2010, 05:38:17 PM
yes - Norbert Juffa says he derived it from the K6 docs
He worked/works for AMD, worked on floating point stuff among other things as I recall.
Sorry, I couldn't resist :bg
mov eax, 10101010b
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
16 cycles for Clive, result=4
22 cycles for Dave, result=4
15 cycles for Jochen, result=4
16 cycles for Clive, result=4
22 cycles for Dave, result=4
16 cycles for Jochen, result=4
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
28 cycles for Clive, result=4
105 cycles for Dave, result=4
111 cycles for Jochen, result=4
41 cycles for Brute1, result=4
44 cycles for Brute2, result=4
28 cycles for Clive, result=4
103 cycles for Dave, result=4
113 cycles for Jochen, result=4
41 cycles for Brute1, result=4
41 cycles for Brute2, result=4
Intel(R) Atom(TM) CPU N450 @ 1.66GHz (SSE4)
14 cycles for Clive, result=4
46 cycles for Dave, result=4
115 cycles for Jochen, result=4
24 cycles for Brute1, result=4
26 cycles for Brute2, result=4
16 cycles for Clive, result=4
47 cycles for Dave, result=4
115 cycles for Jochen, result=4
24 cycles for Brute1, result=4
25 cycles for Brute2, result=4
Hi,
PIII
Regards,
Steve
pre-P4 (SSE1)
28 cycles for Clive, result=4
34 cycles for Dave, result=4
16 cycles for Jochen, result=4
19 cycles for Brute1, result=4
19 cycles for Brute2, result=4
28 cycles for Clive, result=4
34 cycles for Dave, result=4
16 cycles for Jochen, result=4
19 cycles for Brute1, result=4
19 cycles for Brute2, result=4
--- ok ---
you guys just enjoy making me look bad :lol
For doing a single byte at a time, why not a table lookup?
yah - 256 bytes is a bit much - lol
but, you could split the byte in 2 and have a 16 byte LUT
it would probably rival the shake-n-bake method
Popcount2.zip:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
16 cycles for Clive, result=4
22 cycles for Dave, result=4
16 cycles for Jochen, result=4
14 cycles for Brute1, result=4
15 cycles for Brute2, result=4
Quote from: MichaelW on November 10, 2010, 07:09:23 PM
For doing a single byte at a time, why not a table lookup?
.data
pcTable db 0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5
db 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6
db 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6
db 2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7
db 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6
db 2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7
db 2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7
db 3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8
...
mov eax, Source
movzx eax, byte ptr pcTable[eax]
Certainly fast - one cycle on my Celeron.
Here is the DWORD version, running in 7 cycles:
push ecx
push ebx
mov ebx, Source
movzx ecx, bh
movzx eax, byte ptr pcTable[ecx]
movzx ecx, bl
movzx edx, byte ptr pcTable[ecx]
bswap ebx
movzx ecx, bh
add eax, edx
movzx edx, byte ptr pcTable[ecx]
add eax, edx
movzx ecx, bl
movzx edx, byte ptr pcTable[ecx]
add eax, edx
pop ebx
pop ecx
algo from hackers delight (http://www.hackersdelight.org/)
mov eax,123
imul eax,08040201h
shr eax,3
and eax,11111111h
imul eax,11111111h
shr eax,28
:eek
Used not proper constant at my test... :green2
Dave, :P
Quote from: dedndave on November 11, 2010, 12:52:11 AM
i don't see how you'd get anything but 0 :lol
Dave, BYTEs, not DWORDs. Put 0FFh into EAX, not 100h or 0FFFFFFFFh
Quote from: Antariy on November 11, 2010, 12:42:17 AM
Which is intention and usage of the algo?
Byte quantities, I thought it was obvious.
Quote from: Antariy on November 11, 2010, 12:42:17 AM
For 123T I got 00654321h.
it's impossible to get that after "SHR 28". You obviusly have different radix set.
123 is sample byte value
ahhhhhhhhh - i see where i went wrong
it is 11111111h - not 11111111b
i saw all the ones and was thinking binary - lol
Dave entangle hex with binary, I'm used hex constant instead of decimal :P
i would have seen it, had i looked in a debugger - lol
i saw "11111111" and thought in my head "0FFh"
For longer than one byte (which is short enough to do tricks or LUTS with) the fastest way has always been something like:
xor ecx, ecx
@@:
lea edx, [eax - 1]
inc ecx
and eax, edx
jnz @B
It should only loop once for each set bit, so on average it should beat the slightly simpler shr/adc as it must loop through zeros at the start of it's number.
As always, the "best" method depends on the "average" data (and also what you can bear as a worst case).
Thanks,
Mirno
It works fine, but it's not particularly fast, about 66 cycles on a P4:
mov eax, 10101010101010101010101010101010b
...
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
1213 cycles for 100*MasmBasic, res= 16
6614 cycles for 100*Mirno, res= 16
1119 cycles for 100*MasmBasic, res= 16
6668 cycles for 100*Mirno, res= 16
it needs a little test for the value EAX = 0 :P
as written, it will return a 1 count
Quote from: dedndave on November 18, 2010, 03:28:30 PM
it needs a little test for the value EAX = 0 :P
as written, it will return a 1 count
Like this?
mov eax, 0 ; 10101010101010101010101010101010b
xor ecx, ecx
test eax, eax
.if !Zero?
@@:
lea edx, [eax - 1]
inc ecx
and eax, edx
jnz @B
.endif
The MasmBasic PopCount (here (http://www.masm32.com/board/index.php?topic=12460.msg124805#msg124805)) works correctly. If only a handful of bits are set, Mirno's algo becomes competitive.
Clearly not as effective at finding bits, but easier to explain to the professor
mov eax, 0 ; 10101010101010101010101010101010b
xor ecx, ecx
@@:
shr eax,1
adc ecx,0
or eax,eax
jnz @B
no - not like this - lol
mov eax, 0 ; 10101010101010101010101010101010b
xor ecx, ecx
test eax, eax
.if !Zero?
@@:
lea edx, [eax - 1]
inc ecx
and eax, edx
jnz @B
.endif
more like this
xor ecx, ecx
test eax, eax
.if !Zero?
@@:
lea edx, [eax - 1]
inc ecx
and eax, edx
jnz @B
.endif
moot point, really
Quote from: dedndave on November 18, 2010, 06:00:37 PM
no - not like this - lol
mov eax, 0 ; 10101010101010101010101010101010b
xor ecx, ecx
test eax, eax
.if !Zero?
@@:
lea edx, [eax - 1]
inc ecx
and eax, edx
jnz @B
.endif
more like this
xor ecx, ecx
test eax, eax
.if !Zero?
@@:
lea edx, [eax - 1]
inc ecx
and eax, edx
jnz @B
.endif
moot point, really
I'm afraid I fail to see the difference, Dave :dazzled:
mov eax, 0 ; 10101010101010101010101010101010b
i don't think you want this line
if you MOV EAX,0 - the answer will always be 0 (http://l.yimg.com/us.yimg.com/i/mesg/emoticons7/40.gif)
at least it will be nice and fast :P
Quote from: dedndave on November 18, 2010, 06:47:37 PM
mov eax, 0 ; 10101010101010101010101010101010b
i don't think you want this line
if you MOV EAX,0 - the answer will always be 0 (http://l.yimg.com/us.yimg.com/i/mesg/emoticons7/40.gif)
at least it will be nice and fast :P
But that was the whole point: That Mirno's algo returned 1 for eax=0...
i am giving you a hard time, my friend :bg