XML Parse Multi Thread Unlimited File Size

Started by TSDNZ, June 24, 2011, 02:49:35 AM

Previous topic - Next topic

TSDNZ

Hi, I am developing a multi threaded XML Parser in assembly, it will be linked to C++ for additional functionality.
I look forward to keeping everyone updated.

TSDNZ

The first step is considering multi threading and where it can speed up the parser.
Taking into account that we are reading from a file that is larger than our memory the following is a what normally happens.

1. Read File (Part of) into memory
2. Parse memory
3. Repeat step 1 & 2 until finished or error.

The Parser is paused while the file is loaded, so this is our first additional thread.
The ReadXMLThread reads the file and signals the parser thread when the file has been read.

If any one is interested I will continue.

hutch--

The only problem I can see with working on files larger than can be loaded into memory is determining the end of a line in relation to a given block size. Its easy enough to use the normal file IO to load files in preset blocks and keep doing it until you reach the last block.

The example is variable length lines of text which is normal in scripting like XML is where the problem lies, you need to be able to parse one line at a time sequentially. Now with a single thread this can probably be done by backstepping the file pointer to the start of the last incomplete line then reading your preset block size from that location.

Now there are a couple of issues here, the parser if its written properly may be faster than the file IO which negates the need for multithreading and alternately if multithreading is viable, you need some method to synchronise the beginnings and ends of different blocks of memory that are handles by different cors.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

dedndave

welcome to the forum   :U

more threads may not necessarily be the answer
for example, if the parser is waiting for the file-read to signal, it doesn't help to be in another thread
however, if you can read in one thread and parse in another at the same time, you may have an advantage
this would imply some sort of dual buffer, i think

a good fast parse algo and proper file handling will go a long ways to make it faster
if you are interested in large files, you might consider mapping the file

TSDNZ

Yes, a dual buffer.

Steps are, for the XMLReaderThread:

1: Alloc Buffer memory
2. Divide by 2 to get the middle, second buffer is this location
3. Load the file into buffer 1
4. Wait for parser to request read, return Buffer1 data
5. Load next block into buffer 2
6. Wait for parser to request read, return buffer2 data
7. Goto step.

And within each step is a quit flag and error checking.

TSDNZ


TSDNZ

No problem with end of line, the reader also has a buffer for quick storage where it will convert if required.

TSDNZ


TSDNZ

Below is the actual LoadNext procedure in assembly, as you can see it is very small code, compared to ReadFile and waiting it is very fast.

LabelAlign4 LoadFirstBlockOfData
   mov      edx, esi

LabelAlign4 LoadNext
   sub      esi, edx
   mov      BufferOverRun, esi
   mov      State, State_ReadRequested

; Spin loop until the data is ready
   AtAtAlign4
   cmp      State, State_DataReady
   jb      @B
   ja      LoadNext_HandleError

   mov      esi, CurrentChar
   mov      edx, LastChar

   lodsd

   ret;

   align 4
   LoadNext_HandleError:
   ; TO DO
   ret

dedndave

this code is far from optimal
however, it is incomplete, too - we can't see the "big picture" from that code snippet
two things i see right off...
1) aligning code labels by 4 isn't likely to yield much advantage - in fact, it will probably slow you down
if you align NEAR branch labels by 16 that are repeatedly used (such as the top of a loop), you may achieve an advantage
aligning SHORT labels is a waste of space and, perhaps, clock cycles
2) LODSD is a slow instruction - you might be better off with...
        mov     eax,[esi]
        add     esi,4

LODSD is also dependant on the direction flag
manipulating the direction flag comes with a severe penalty in clock cycles
by explicitly adjusting the index register (ESI), you free yourself of that overhead

TSDNZ

Yes, all good points, just a habbit to align and not needed here.
I have yet to go through the code and remove such things.

I did not realise lodsd was so slow, I have not tested it, i did assume it would be faster that mov then add.
Just how slow is it, would you recommend mov & add.
I will never change the direction flag.

All my code, once finished i will go through and realign the instructions for performance.
The processors have changed a bit since my last asm, so I will need to brush up on performance.

As you can see it spins, waiting for the State to change.
My Reader thread does the same, except the data is preloaded, once the signal is changed it changes the CurrentChar, LastChar and sets the State to DataReady.
The reader then loads the next block, while the parser parses the data. Then it waits for the Parser again.

For example: If it takes 1 second to parse 1GB and 1 second to read 1GB then normally it would take 2 seconds to parse each 1GB.
Using this method it will take 2 seconds for the first 1GB and 1 second there after for each 1 GB
Very rough estimating I know, but the concept looks good.

baltoro

TSDNZ,
Interesting concept. My only real experience with XML documents is using, MSXML, which is implemented as a number of COM interfaces.
I'm assuming that your parser will conform to the W3C XML 1.0 Standard.
Dang, that seems like an awful lot of work,... :eek
Baltoro

dedndave


TSDNZ

Hi, looking at developing the enginee using macros.
Most of it I can figure out, but I am a little stuck with redefining a constant.

I am creating a temp constant called ElementTemp#, # is the temp element number.
ElementTemp1 textequ <xmlConfig$$ElementOrAttribute1>
This is created by the macro.
When a new element is created, and for example the name is below the first name, then I want this
ElementTemp1 textequ <xmlConfig$$ElementOrAttribute2>
ElementTemp2 textequ <xmlConfig$$ElementOrAttribute1>

But ElementTemp1 is already defined and it is expanded, is there a way around this?
So it becomes
xmlConfig$$ElementOrAttribute1 textequ <xmlConfig$$ElementOrAttribute2>

qWord

Quote from: TSDNZ on July 22, 2011, 08:21:34 AM
But ElementTemp1 is already defined and it is expanded, is there a way around this?
So it becomes
xmlConfig$$ElementOrAttribute1 textequ <xmlConfig$$ElementOrAttribute2>
This behavior appears only, if the line begins with the expansion operator %. Otherwise Element# is load with new content.
FPU in a trice: SmplMath
It's that simple!