EMEA Data Loss Prevention User Group

EDM Best Practice- multiple token source data, and handling of empty cells

  • 1.  EDM Best Practice- multiple token source data, and handling of empty cells

    Posted Jan 14, 2014 11:34 AM

    I am in the process of designing EDM indexes for customer data but have encountered a design issue relating to cells containing multiple tokens. I would greatly appreciate it if anyone could confirm:

    1) how DLP handles empty cells (not entire columns, just cells) and if there is any performance impact resulting from this.

    2) what the best practice or recommended approach is for creating EDM indexes for multi-token customer data.

     

    I know that EDM cannot match unstructured content against multi-token index cells. Many customer names, however, have multiple first names and multiple surnames (the example in the DLP training materials is "Mary Jane" and "von Batten"). The two options, as far as I can see, are:

     

    1. Remove part of the name, leaving only one token and excluding all others from the matching process, e.g. "Mary" or "Batten". The obvious downside is that potentially key data is excluded.

    2. Split the original name out into multiple cells, so that matching can be performed against every part of the name, and use EDM.SimpleTextProximityRadius to reduce false positives.

     

    The second option would work perfectly, provided the index file is created with enough columns to accommodate the longest customer name. This would, however, leave empty cells for any customers with shorter names. For example:

    Row 1: Mary | Jane | von | Batten | - all cells are filled with data.

    Row 2: John | | | Smith | - note the two empty cells here.
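
    To make the layout in option 2 concrete, here is a minimal Python sketch of the splitting step. It is only an illustration of how the source file could be prepared, not anything DLP provides: the function names are hypothetical, and it pads short names at the end of the row, whereas the rows above keep the surname in the last column. It says nothing about how DLP itself treats the resulting empty cells.

```python
def split_into_token_columns(names):
    """Split each name on whitespace and pad every row to the longest token count."""
    token_rows = [name.split() for name in names]
    # Width of the widest row; 0 when the input list is empty.
    width = max((len(tokens) for tokens in token_rows), default=0)
    # Pad shorter rows with empty strings so every row has the same column count.
    return [tokens + [""] * (width - len(tokens)) for tokens in token_rows]


def count_empty_cells(rows):
    """Count cells that hold no token at all."""
    return sum(1 for row in rows for cell in row if cell == "")


# The two names from the example rows above.
names = ["Mary Jane von Batten", "John Smith"]
rows = split_into_token_columns(names)

for row in rows:
    print(" | ".join(row))
print("empty cells:", count_empty_cells(rows))
```

    Counting the empty cells up front at least shows how much of the index would be padding before deciding between the two options.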

    If anyone has encountered this issue themselves or has advice regarding best practice, I'd greatly appreciate your input.

    Robin



  • 2.  RE: EDM Best Practice- multiple token source data, and handling of empty cells

    Posted Jan 14, 2014 06:19 PM

    You should read the Admin Guide section "Cleansing EDM files". Since DLP creates a hash for each row in the EDM file, there needs to be a unique identifier in each row (for example, an SSN or serial number). Also, blanks are a no-no.
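
    As a rough pre-check along those lines, the sketch below scans a pipe-delimited source file for rows with blank cells and for repeated identifiers. It is an assumption-laden illustration, not the cleansing procedure from the Admin Guide: the function name, the choice of the first column as the identifier, and the sample rows (including the ID values) are all made up for the example.

```python
import csv
import io


def find_problem_rows(lines, id_column=0, delimiter="|"):
    """Return (rows with blank cells, rows whose identifier was already seen).

    Row numbers are 1-based. id_column is an assumed position for the
    unique identifier; adjust it to match the real file layout.
    """
    blank_rows, duplicate_id_rows = [], []
    seen_ids = set()
    reader = csv.reader(lines, delimiter=delimiter)
    for line_number, row in enumerate(reader, start=1):
        cells = [cell.strip() for cell in row]
        if any(cell == "" for cell in cells):
            blank_rows.append(line_number)
        identifier = cells[id_column] if len(cells) > id_column else ""
        if identifier:
            if identifier in seen_ids:
                duplicate_id_rows.append(line_number)
            seen_ids.add(identifier)
    return blank_rows, duplicate_id_rows


# Made-up sample data: row 2 has blank cells, row 3 repeats the ID from row 1.
sample = io.StringIO(
    "1001|Mary|Jane|von|Batten\n"
    "1002|John|||Smith\n"
    "1001|Ann|Lee|de|Vries\n"
)
blank_rows, duplicate_id_rows = find_problem_rows(sample)
print("rows with blank cells:", blank_rows)
print("rows with duplicate IDs:", duplicate_id_rows)
```

    Running a check like this before indexing makes it easy to see how many rows would need attention if names are split across a fixed number of columns.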