Wednesday, February 9, 2011

What would be the best (runtime performance) application or pattern or code or library for matching string patterns

Programmer Question

Hi,

I have been trying to figure out a decent way of matching string patterns. I will try my best to provide as much information as I can regarding what I am trying to do.



The simplest thougt is that there are some specified patterns and we want to know whic of these patterns match completely or partially to a given request. The specified patterns hardly change. The amount of requests are about 10K per day but the results have to pe provided ASAP and thus runtime performance is the highest priority.



I have been thinking of using Assembly Compiled Regular Expression in C# for this, but I am not sure if I am headed in the right direction.



Scenario:

Data File:

Let's assume that data is provided as an XML request in a known schema format. It has anywehere between 5-20 rows of data. Each row has 10-30 columns. Each of the columns also can only have data in a pre-defined pattern. For example:




  1. A1- Will be "3 digits" followed by a
    "." follwed by "2 digits" -
    [0-9]{3}.[0-9]{2}

  2. A2- Will be "1
    character" follwoed by "digits" -
    [A-Z][0-9]{4}



    The sample would be something like:





  

123.45
A5567
456EV
xxx





Rule File:



Rule ID    A1                 A2       
1001 [0-9]{3}.45 A55[0-8]{2}
2002 12[0-9].55 [X-Z][0-9]{4}
3055 [0-9]{3}.45 [X-Z][0-9]{4}


Rule Location - I am planning to store the Rule IDs in some sort of bit mask.

So the rule IDs are then listed as location on a string



Rule ID     Location (from left to right)  
1001 1
2002 2
3055 3


Pattern file: (This is not the final structure, but just a thought)



Column   Pattern                Rule Location
A1 [0-9]{3}.45 101
A1 12[0-9].55 010
A2 A55[0-8]{2} 100
A2 [X-Z][0-9]{4} 011


Now let's assume that I run the SOMEHOW (not sure how I am going to limit the search to save time) I run the regex and make sure that A1 column is only matched aginst A1 patterns and A2 column against A2 patterns. I would end up with the follwoing reults for "Rule Location"



Column   Pattern                Rule Location
A1 [0-9]{3}.45 101
A2 A55[0-8]{2} 100



  • Doing AND on each of the loctions
    gives me the location 1 - 1001 -
    Complete match.

  • Doing XOR on each of the loctions
    gives me the location 3 - 3055 -
    Partial match. (I am purposely not
    doing an OR, because that would have
    returned 1001 and 3055 as the result
    which would be wrong for partial
    match)



The final reulsts I am looking for are:

1001 - Complete Match

3055 - Partial Match



Input:

The data will be provided in some sort of XML request. It can be one big request with 100K Data nodes or 100K requests with one data node only.

Rules:

The matching values have to be intially saved as some sort of pattern to make it easier to write and edit. Let's assume that there are approximately 100K rules.



Output:

Need to know which rules matched completely and partially.



Preferences:

I would prefer doing as much of the coding as I can in C#. However if there is a major performnace boost, I can use a different language.



The "Input" and "Output" are my requirements, how I manage to get the "Output" does not matter. It has to be fast, lets say each Data node has to be processed in approximately 1 second.



Questions:




  1. Are there any existing pattern or
    framewroks to do this?

  2. Is using Regex the right path
    specifically Assembly Compiled
    Regex?

  3. If I end up using Regex how can I
    specify for A1 patterns to only
    match against A1 column?

  4. If I do specify rule locations in a
    bit type pattern. How do I process
    ANDs and XORs when it grows to be
    100K charcter long?



I am looking for any suggestions or options that I should consider.



Thanks..



Find the answer here

No comments:

Post a Comment

LinkWithin

Related Posts with Thumbnails