Clarity

 View Only
Expand all | Collapse all

Problem Parsing CSV File in GEL Script

  • 1.  Problem Parsing CSV File in GEL Script

    Posted Mar 22, 2019 03:58 PM

    I'm a relative newcomer to GEL and not a Java guru, so I'm hoping someone out there can give me some advice.

     

    My goal is to use GEL to read a CSV file (which is exported from our ERP system and stored on our SFTP server), then parse each row into individual columns, and then populate a staging table via XOG.

     

    I have almost everything working exactly right except for one problem: the content of the CSV file is giving me headaches.  It's comma-delimited, but the only time quotation marks are used is when there is a comma in a field that is not a delimiter.  In those cases, the value is embedded in quotation marks.  For example:

     

    PART_NUMBER,ITEM_DESCRIPTION,UNIT_COST,QTY_SHIPPED
    3EU48C891,CIRCUIT BREAKER BOX,138.22,50
    4FJ62D837,SWITCH PLATE,0.04,5000
    5GL29E906,"BREAKER, CIRCUIT, 20A",6.94,200

     

    Note that in the first 2 rows the value in the second column DOES NOT contain commas, so there are no quotation marks around the ITEM_DESCRIPTION field. In the third row, the value in the second column DOES contain commas, so there are quotation marks before and after the ITEM_DESCRIPTION value itself.

     

    When I parse the row using files:readFile coded as follows:

     

    <files:readFile fileName="${vFileName}" delimiter="," var="vPartData" embedded="false"/>

     

    I get:

     

    First Row: (OK)
    Field 1 = 3EU48C891
    Field 2 = CIRCUIT BREAKER BOX
    Field 3 = 138.22
    Field 4 = 50

     

    Second Row: (OK)
    Field 1 = 4FJ62D837
    Field 2 = SWITCH PLATE
    Field 3 = 0.04
    Field 4 = 5000

     

    Third Row: (NOT OK)
    Field 1 = 5GL29E906
    Field 2 = "BREAKER
    Field 3 = CIRCUIT
    Field 4 = 20A"

     

    The fields:readFile doesn't recognize that the commas in the second field in the third row aren't delimiters, and it parses the third row incorrectly.  The ITEM_DESCRIPTION field becomes 3 separate columns because my command doesn't recognize that the 2 commas in that value aren't supposed to be delimiters.

     

    Does anyone know the proper delimiter syntax so that all rows will be parsed correctly, even if there are non-delimiting commas in a row?  If you do, you will be my hero.

     

    Thanks,

     

    Alan Brobst

    GE Transportation, a Wabtec Company



  • 2.  Re: Problem Parsing CSV File in GEL Script

    Posted Mar 26, 2019 12:10 AM

    I generally use | as the column delimiter, when reading in delimiter files.  Yes, the comma delimited file should work (and someone else may be able to provide a solution).  The | character is generally not used in text fields, and hence I don't normally hit the issue you are currently experiencing.



  • 3.  Re: Problem Parsing CSV File in GEL Script

    Posted Mar 26, 2019 05:17 AM

    Have you tried embedded="true" - thats the flag about the fields being enclosed in quotes ; but I'm not sure if that would break when (normally) the field is not enclosed in quotes.

     

    (just GUESSing)



  • 4.  Re: Problem Parsing CSV File in GEL Script

    Posted Mar 26, 2019 04:44 PM

    Hi David,

     

    Thanks for that thought, but unfortunately, it didn't work.  When I changed embedded="false" to embedded="true" in my files.readFile line, it wouldn't return any output.

     

    Alan



  • 5.  Re: Problem Parsing CSV File in GEL Script

    Posted Mar 26, 2019 08:58 AM

    I suspect the embed attribute is going to want each field wrapped in double quotes but I would give Dave’s idea a try.

     

    In looking at your data, it appears that only the second field is of concern.

     

    Another option would be to check the length of the row items (I believe it is a string[] type) and if equal to 4 then life is good.  If greater than four you would need to take the first and last two elements of the item array and then combine the elements from 1 to length -3 (zero based).

     

    V/r,

    Gene



  • 6.  Re: Problem Parsing CSV File in GEL Script

    Posted Mar 26, 2019 04:45 PM

    Hi Gene,

     

    Thanks for that suggestion.  My bigger issue is that the number of commas in the second field is inconsistent.  Some values might have 1 comma, some might have 5, etc.  And there are 2 others fields in the source CSV file that can also contain commas, so the permutations are daunting.

     

    Thanks,

     

    Alan



  • 7.  Re: Problem Parsing CSV File in GEL Script

    Posted Mar 26, 2019 08:16 PM

    Left field suggestion, have you considered generating the data as XML, and reading the XML file via the GEL script?



  • 8.  Re: Problem Parsing CSV File in GEL Script
    Best Answer

    Posted Mar 26, 2019 09:54 PM

    How about manual parsing of the CSV file.

     

    Multiple fields that contain ",,," data elements.

     

    <?xml version="1.0" encoding="utf-8"?>
    <gel:script
         xmlns:core="jelly:core"
         xmlns:log="jelly:log"
         xmlns:util="jelly:util"
         xmlns:files="jelly:com.niku.union.gel.FileTagLibrary"
         xmlns:gel="jelly:com.niku.union.gel.GELTagLibrary"
         xmlns:xsd="http://www.w3.org/2001/XMLSchema"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">


         <gel:log>Start this Script</gel:log>

         <core:set var="dataFile" value="gel\TestData.txt"/>
         <gel:log>dataFile ${dataFile}</gel:log>
         <util:file var="fileOne" name="${dataFile}" />
         <util:available file="${fileOne}">
              <gel:log>${dataFile} file is available</gel:log>
         </util:available>

         <files:readFile fileName="${dataFile}" delimiter='"' var="lines" embedded="false"/>
         <core:forEach var="line" items="${lines.rows}" indexVar="rowCnt">
              <core:if test="${rowCnt != 0}">
                   <gel:log>Processing ${rowCnt}</gel:log>
                   <util:tokenize var="tokens" delim=""/>
                   <core:forEach var="field" items="${line}">
                        <core:choose>
                             <core:when test='${field.startsWith(",") || field.endsWith(",")}'>
                                  <core:set var="items" value='${field.split(",")}' />
                                  <core:forEach var="thisField" items="${items}">
                                       <core:if test="${!thisField.isEmpty()}" >
                                            <core:expr value="${tokens.add(thisField)}" />
                                       </core:if>
                                  </core:forEach>
                             </core:when>
                             <core:otherwise>
                                  <core:expr value="${tokens.add(field)}" />
                             </core:otherwise>
                        </core:choose>
                   </core:forEach>
                   <gel:log></gel:log>
                   <core:forEach begin="0" end="${tokens.size()-1}" step="1" indexVar="index">
                        <gel:log>arraylist tokens = ${tokens.get(index)}</gel:log>
                   </core:forEach>
              </core:if>
         </core:forEach>
         <gel:log>End this Script</gel:log>

    </gel:script>

     

    Results:

     

     

    V/r,

    Gene



  • 9.  RE: Re: Problem Parsing CSV File in GEL Script

    Posted Sep 09, 2019 09:43 AM
    Thanks, Gene!  You are my hero.


  • 10.  RE: Re: Problem Parsing CSV File in GEL Script

    Posted Sep 09, 2019 10:58 AM
    Hi Gene,

    You're still my hero, but I encountered an issue -- if the input file contains null values as represented by consecutive commas (e.g., in the CSV content there's something like this):

    1,,CIRCUIT BREAKER,4,6.29

    Instead of parsing tokens as [0] = 1, [1] = null, [2] = CIRCUIT BREAKER, [3] = 4 and [4] = 6.29

    It parses:

    [0] = 1, [1] = CIRCUIT BREAKER, [2] = 4, [3] = 6.29, [4] = null

    That is, they're "shifted 1 to the left" where the null occurs.

    One option would be to substitute ,null, for every occurrence of ,, (e.g., something like a str.replaceAll()) before it splits the line into tokens.  But I'm not smart enough about Java or GEL to figure that one out.  And I'm sure you have a better idea.

    Any suggestions you can provide would be much appreciated!

    Thanks,

    Alan