Automic Workload Automation


Download files from a SFTP in parallel

MarcoTizzoni604411


  • 1.  Download files from a SFTP in parallel

    Posted Jul 05, 2017 09:43 AM
    Hello,
    we have to download a couple of thousand files from a remote SFTP server. We would like to parallelize the download across multiple servers to speed up the process.
    I thought a possible implementation could be:
    1. execute an ls command using the RA_FTP agent
    2. parse the ls output to get the list of files into a variable
    3. for each file, trigger a file download
    I could not get past the first point because the ls output is mixed with SSH debug information, and parsing it is hard and too easy to break.
    Any hints on how to overcome the issue or other possible implementations?

    Best,
    Marco


  • 2.  Download files from a SFTP in parallel

    Posted Jul 05, 2017 09:54 AM
    If possible I would ask the provider of the files to send/upload a list of the files under a defined name; this list can then be downloaded and processed further quite easily.


  • 3.  Download files from a SFTP in parallel

    Posted Jul 05, 2017 10:16 AM
    Unfortunately this is not possible.  :( 


  • 4.  Download files from a SFTP in parallel

    Posted Jul 05, 2017 11:31 AM
    Your primary goal here is to speed up the download of numerous files.  Each RA-FTP job has a "number of transfer threads" option.  I suggest you run some performance tests with a larger setting and see whether you can reach your performance goals that way.

    To find this setting, go to the "FTP" tab, and on the "Command Sequence" title bar, go to the upper right corner and find the "job settings" option.

    From the RA-FTP help:
    "Limits parallel processing when wildcards are used. Default value is 2."


  • 5.  Download files from a SFTP in parallel

    Posted Jul 10, 2017 11:04 AM
    Thanks Pete,
    that could be an approach; however, we want to stay flexible and be able to scale if we need to add more hosts in the future.




  • 6.  Download files from a SFTP in parallel

    Posted Jul 17, 2017 12:02 PM
    Hi,

    So you want to spread the action of downloading these files over multiple servers? That would imply that the bottleneck you're trying to address is the I/O on the downloading server itself, and not the actual network link. While possible, this is a rather unusual scenario. Are you sure about this?

    Not to impose, but unless you are, I'd start by identifying the bottleneck. Especially with huge numbers of small files, ftp, scp and to some extent also sftp can be orders of magnitude too slow (not considering the RA agent, just by virtue of the protocols and their file handling). If you're on UNIX or otherwise able (e.g. with a Windows port), and have full SSH access on the remote side, I'd run a benchmark outside of UC4 with something like rsync, or even pipe your files through tar on the remote and local ends. That might already solve much of your problem. That, or look into a potential I/O problem on that current server :)

    If you still find that multiple downloading servers are faster even with the RA agent out of the picture and a well-performing transfer tool in use, are you sure that's not just because you're now using a greater number of TCP connections? If TCP connections are your bottleneck, that could probably be rectified on a single server as well, without spreading out to multiple servers.

    Failing all of that: if you're on UNIX (hint: it would really help a lot to know what OS this is on ;) I could possibly give you some pointers on how to separate the ls output from the SSH debug info (but why is there debug info in the first place?) and split it into parts to be used, if you'd post an example of the listing. But I doubt this alone will help much: you'd end up with a static split based on the number or names of files, which still won't guarantee an even load distribution through to the end. Also, not to bash on UC4, but even if you put that into a variable and have multiple RA agents parse it, I doubt that would be the racecar option.

    I'd personally think instead about putting my file listing (obtained from the "ls") into an SQLite database and having multiple servers each lock one (or more) records, download the respective files, then remove those records from the table. These "worker" scripts, which process the database records and do the actual downloading (a couple of lines of shell script), could then easily be triggered from UC4. A poor man's message queue :) but very scalable.

    Hope this helps.

    edit: there's also https://www.gnu.org/software/parallel, which could be used to parallelize downloads across multiple machines as well. I haven't used it yet, but it reportedly works like xargs, so it should be able to achieve a proper distribution of load over a list of filenames as well.


  • 7.  Download files from a SFTP in parallel

    Posted Jul 18, 2017 04:36 AM
    Hello Carsten,
    thanks for your help. Maybe I can explain the context better, because it's not just about downloading files.
    What we would like to achieve is a pipeline of jobs. Each pipeline processes exactly one file, can be run independently and is idempotent. Such a setup is, I believe, simpler to understand and manage if anything bad happens during processing, which is likely in the beginning since we have thousands of files.

    To complicate things:
    • we have to check the remote site regularly because we do not know exactly when new files will show up, and a push approach cannot be used.
    • we use Windows, i.e. no rsync (I know it can be set up, but it is a mess and we would rather not do it; also, we have a lot of files but they are big, and to complicate things further, the remote site will move each file once it has been downloaded)
    For the moment I have achieved my goals using 2 workflows. The first workflow:
    1. Runs a Windows JOB on the first available host in an agent group. It connects to the remote site, retrieves the list of files, parses it and writes it to a file in a shared location.
    2. Runs a script job. It reads the file (with PREP_PROCESS_FILE), fires up the processing workflow for each entry, and waits until all triggered workflows have finished (a rough sketch is below).
    The second workflow:
    1. Runs a Windows JOB to download a file.
    2. Runs a number of processing steps.
    With this setup:
    • Only one instance of the first workflow is allowed. This avoids the complexity of managing a queue; it does reduce our parallelism, but maybe that is good enough.
    • If we add new servers we just need to add the new agent to the agent group, i.e. we can easily scale.
    • If the first workflow is still running, we know there is some processing going on, and users can check what it is.
    • The parent/child relation between the first workflow and the triggered workflows is lost (at least as far as I know). This means: 1) when all the triggered workflows have finished, we are unable to report their status back to the parent (if any workflow failed, the parent workflow should exit with an error); 2) users are not able to visually follow the parent/child relation in the monitoring view.
    • Each file can be processed independently, since each workflow can be triggered independently.
    • Using WinSCP we are able to resume file transfers (there is no such option with RA_FTP; I guess it is not supported).
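
    As a reference, here is a minimal sketch of what the script job in step 2 of the first workflow roughly does. The object name JOBP.PROCESS_ONE_FILE and the variables &AGENT# and &LIST_FILE# are placeholders rather than our real objects, and handing the file name over to the child workflow, waiting for the children and the error handling are all left out here:

    ! Read the file list written by the Windows job (step 1) and fire one
    ! processing workflow per file.
    :SET &HND# = PREP_PROCESS_FILE(&AGENT#, &LIST_FILE#)
    :PROCESS &HND#
    :   SET &FILE# = GET_PROCESS_LINE(&HND#)
    :   P "Activating processing workflow for &FILE#"
    !   Activated this way, the child workflow has no parent in the monitoring view.
    :   SET &ACT# = ACTIVATE_UC_OBJECT("JOBP.PROCESS_ONE_FILE")
    :ENDPROCESS
    :CLOSE_PROCESS &HND#
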
    Best,
    Marco


  • 8.  Download files from a SFTP in parallel

    Posted Jul 18, 2017 05:16 AM
    Hi Marco,

    Sounds like you have found your solution then. Thanks for sharing the details.

    Out of curiosity I looked into the resuming of SFTP downloads. The library used in RA_FTP (com.jcraft.jsch) can theoretically resume SFTP downloads just like WinSCP can, but it may be that the RA agent isn't making use of that particular functionality.

    Best,
    Carsten


  • 9.  Download files from a SFTP in parallel

    Posted Jul 18, 2017 06:52 AM
    Do you know if there is a way for a child process to report its status back to the parent even if it is detached, or better, whether the parent/child relation can be preserved as in a workflow, so that all triggered workflows are shown under one parent in the monitoring interface?

    Thanks again,
    Marco


  • 10.  Download files from a SFTP in parallel

    Posted Jul 18, 2017 07:02 AM
    Hi Marco,

    I'm not quite sure what you mean by that. By "child process" you don't mean an OS process but something in UC4? Unfortunately I don't think I can help you with that. There are things like :PSET for passing variables between child and parent in UC4, but I don't believe one can alter the monitor view in any way.
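
    Just to illustrate what I mean (the variable name is only an example): a task inside a workflow can publish a value to its parent with :PSET, e.g.

    ! In a child task: pass &DOWNLOAD_STATUS# on to the parent (the workflow),
    ! instead of keeping it local to this task as a plain :SET would.
    :PSET &DOWNLOAD_STATUS# = "OK"

    but that only passes data around; it does not change how the monitor displays the tasks.
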

    Best,
    Carsten.


  • 11.  Download files from a SFTP in parallel

    Posted Jul 18, 2017 07:13 AM
    If I run a workflow (JOBP) then all JOBS in the workflow will appear under the parent JOBP in the monitoring view. Something like this:
    JOBP
     |--> JOBS#1
     |--> JOBS#2
     |--> JOBS#3
     |--> JOBS#4

    In my case I cannot use a JOBP, because the number of JOBS to run depends on the number of files. What I do instead is use a SCRIPT to activate a new JOBP with ACTIVATE_UC_OBJECT. This way the hierarchical view is lost, because the triggered JOBPs have no parent ID.

    Best,
    Marco





  • 12.  Download files from a SFTP in parallel

    Posted Jul 18, 2017 07:44 AM
    Ah okay, now I understand. However, I still don't think there's any built-in way to display this in the monitor.

    Best,
    Carsten


  • 13.  Download files from a SFTP in parallel

    Posted Jul 18, 2017 08:04 AM
    Marco Tizzoni said:
    In my case I cannot use a JOBP because the number of JOBS to run depends on the number of files.
    What about using a ForEach workflow based on a static VARA or a script array?


  • 14.  Download files from a SFTP in parallel

    Posted Jul 18, 2017 10:15 AM
    Hi Wolfgang,
    that was the original idea, which I liked a lot because it solves the parent/child issue. However, a ForEach workflow does not support parallelism, which means it will wait for each triggered JOBS to end before triggering the next one. At least that is my understanding; maybe you have a better solution in mind.



  • 15.  Download files from a SFTP in parallel

    Posted Jul 18, 2017 10:41 AM
    I have in mind a workflow (maybe a script) that creates a so-called "workload VARA" or a script array holding the files to process in configurable chunks of, say, 100.

    You could then, either via script, with a workflow (MODIFY_TASK) or hardcoded (if there are no more 100-file packs, the ForEach workflow ends with STOP, NOMSG), start many ForEach workflows, each with a bunch of 100 files.

    With that you could limit them either via a Queue or a max-parallel condition.

    -- Just an idea in my mind...



  • 16.  Download files from a SFTP in parallel

    Posted Jul 18, 2017 10:55 AM
    I am not sure I fully get it, but I probably grasp the general idea. I will experiment a bit and see what comes out of it.
    Thanks.
    m-


  • 17.  Download files from a SFTP in parallel

    Posted Jul 18, 2017 04:13 PM
    I played a bit with the logic.
    The tricky part is splitting up the full amount of files (= the list)
    into a useful number of sub-packages.

    Creating an array within a loop is not the best choice for that, I think.
    Possibly the split into some working VARA objects is better;
    with these you can run the ForEach workflow.


  • 18.  Download files from a SFTP in parallel

    Posted Jul 19, 2017 04:20 AM
    Here is a short example (attached) of how I imagined the split into different VARAs.

    just edit/start SCRI.MAIN

    &PART_SIZE# => how many files should be processed within one VARA and ForEach workflow
    &VARA_FOLDER# => folder for the working VARAs in your UI

    VARA.JOBLIST.DYNAMIC contains 25 dummy entries and is the source for the script.

    SCRI.PROCESS_ALL is a dummy for your FTP jobs (I put the working VARA name and the file name into the Archive Key for a better overview).
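
    The core of SCRI.MAIN goes roughly like this simplified sketch (not the exact script from the attachment; the working VARA names VARA.WORK.1, VARA.WORK.2, ... are placeholders and are assumed to exist already, and handing the working VARA name to each ForEach workflow, here called JOBP.FOREACH.PROCESS_ALL, is left out):

    ! Read the full file list from VARA.JOBLIST.DYNAMIC and distribute it into
    ! working VARAs of &PART_SIZE# entries each (VARA.WORK.1, VARA.WORK.2, ...).
    :SET &PART_SIZE# = 100
    :SET &COUNT# = 0
    :SET &VARA_IDX# = 1

    :SET &HND# = PREP_PROCESS_VAR("VARA.JOBLIST.DYNAMIC")
    :PROCESS &HND#
    :   SET &FILE# = GET_PROCESS_LINE(&HND#, 1)
    :   SET &IDX_STR# = FORMAT(&VARA_IDX#)
    :   SET &TARGET_VARA# = STR_CAT("VARA.WORK.", &IDX_STR#)
    :   PUT_VAR &TARGET_VARA#, &FILE#, &FILE#
    :   SET &COUNT# = &COUNT# + 1
    :   IF &COUNT# = &PART_SIZE#
    !     This working VARA is full: start one ForEach workflow for it and move on.
    :     SET &ACT# = ACTIVATE_UC_OBJECT("JOBP.FOREACH.PROCESS_ALL")
    :     SET &VARA_IDX# = &VARA_IDX# + 1
    :     SET &COUNT# = 0
    :   ENDIF
    :ENDPROCESS
    :CLOSE_PROCESS &HND#

    ! Start one last ForEach workflow for the remaining partial chunk, if any.
    :IF &COUNT# > 0
    :   SET &ACT# = ACTIVATE_UC_OBJECT("JOBP.FOREACH.PROCESS_ALL")
    :ENDIF
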






  • 19.  Download files from a SFTP in parallel

    Posted Aug 03, 2017 01:35 PM
    It looks like I have managed to get a parent workflow and all the triggered workflows activated via ACTIVATE_UC_OBJECT shown in the monitoring view under a common parent task.

    This is what I do:
    1. I build a list of all the files that I need to process (task A).
    2. When task A has finished, an empty script (task B) with a breakpoint set is reached, and another script (task C) is triggered via ACTIVATE_UC_OBJECT.
    3. Task C modifies the parent workflow, adding a new sub-workflow per file and setting the &FILENAME# variable in its input prompt.
    4. The breakpoint is removed.
    This is the relevant part of the script (task C); maybe it will prove useful to somebody else.
    :ON_ERROR ABEND
    !:WAIT 5
    !!! STOP THE WF FOR MODIFICATION
    :SET &RUNID# = GET_UC_OBJECT_NR("WF.FULL_COLLECTION.NEW")
    :P "RunID: &RUNID#"
    :SET &STATUS# = GET_STATISTIC_DETAIL(&RUNID#,STATUS)
    :P "Status: &STATUS#"
    :SET &RETSTOP# = MODIFY_TASK(&RUNID#, STOP_MODIFY)
    :P "Stop ret: &RETSTOP#"
    ! To process a file PREP_PROCESS_FILE needs a fixed host.
    ! If the host is unavailable the job will fail. To fix this we use hostgroups and pick
    ! the first available host.
    :SET &HND# = PREP_PROCESS_AGENTGROUP(&HOST_G#,"*",ALL)
    :PROCESS &HND#
    :   SET &STATUS# = GET_PROCESS_LINE(&HND#,2)
    :   IF &STATUS# = "Y"
    :     SET &AGENT# = GET_PROCESS_LINE(&HND#,1)
    :     PRINT "Agent: &AGENT#, Status: &STATUS#"
    :   ENDIF
    :ENDPROCESS
    :CLOSE_PROCESS &HND#
    :SET &HND# = PREP_PROCESS_FILE(&AGENT#, &TEMP_FILE#)
    :PROCESS &HND#
    :   SET &FILE# = GET_PROCESS_LINE(&HND#)
    :   PRINT &FILE#
    :   SET &RET# = MODIFY_TASK(&RUNID#, "WF.FETCH_AND_PREPROCESS",, ADD_TASK)
    :   PRINT &RET#
    :   SET &MODIFY# = MODIFY_TASK(&RUNID#, "WF.FETCH_AND_PREPROCESS", &RET#, VALUE, "PRPT.WF.FETCH_AND_PREPROCESS.FILENAME", "FILENAME#", &FILE#)
    :   PRINT &MODIFY#
    :   SET &MODIFY# = MODIFY_TASK(&RUNID#,, &RET#, ADD_DEPENDENCY, "SCRI.EMPTY",, "ANY_OK")
    :   PRINT &MODIFY#
    :   SET &MODIFY# = MODIFY_TASK(&RUNID#, "END",, ADD_DEPENDENCY,, &RET#, "ANY_OK")
    :   PRINT &MODIFY#
    :ENDPROCESS
    :SET &RET# = MODIFY_TASK(&RUNID#, "SCRI.EMPTY",, BREAKPOINT, "NO")
    :SET &RETCOMMIT# = MODIFY_TASK(&RUNID#, COMMIT)
    :SET &RETGO# = MODIFY_TASK(&RUNID#, GO)
    It still needs improvement (error handling is missing, for example). I also have the problem that all workflows are scheduled on the same agent, even though the parent workflow has the "Workflow tasks of the same AgentGroup should use the same Agent" option unchecked. Any hint on that?

    Best,
    Marco


  • 20.  Download files from a SFTP in parallel

    Posted Aug 11, 2017 07:14 AM
    I found a better way to implement this, using Job Groups.
    1. I created a JOBG.
    2. I set the JOBG as the group of the child workflow.
    3. The first task of the parent workflow activates all needed instances of the child workflow (via ACTIVATE_UC_OBJECT). However, since they are assigned to the JOBG, they do not run immediately.
    4. When the first task has completed, control passes to the JOBG, which is activated by the parent workflow itself.
    5. The Job Group then starts all activated instances in parallel.
    With this solution the hierarchical view in the monitoring panel is preserved, i.e. it shows the parent workflow, underneath it the JOBG, and one level below all the triggered instances.
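
    As a minimal sketch of step 3 (the object names JOBG.DOWNLOAD and JOBP.DOWNLOAD_ONE_FILE are just placeholders, and handing the file name over to each instance is omitted here):

    ! First task of the parent workflow: queue one child instance per file.
    ! This line runs once per file, e.g. inside a file-list loop.
    ! JOBP.DOWNLOAD_ONE_FILE has JOBG.DOWNLOAD set as its group, so the instances
    ! activated here wait until the JOBG task runs later in this same parent
    ! workflow, and only then do they all start in parallel.
    :SET &ACT# = ACTIVATE_UC_OBJECT("JOBP.DOWNLOAD_ONE_FILE")
    :P "Queued child instance with RunID &ACT#"
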


  • 21.  Download files from a SFTP in parallel

    Posted Aug 25, 2017 03:05 PM
    That's exactly the solution we implemented last year. The JOBG object is very powerful!!!!


  • 22.  Download files from a SFTP in parallel

    Posted Aug 26, 2017 08:34 AM
    Just for information, the JOBG is/was used intensively by long-time "UC4" users, ever since it became available. Sometimes returning to the basics is a good thing  ;)
    Reminder: using the PASS_VALUES option in ACTIVATE_UC_OBJECT is also useful for transferring parameters to the child process, like a file name retrieved from a variable populated in a previous job.
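
    Something like this, as a sketch (the object and variable names are only examples, and the exact parameter position of PASS_VALUES should be checked against the ACTIVATE_UC_OBJECT documentation for your AE version):

    ! &FILENAME# matches a PromptSet variable of the child workflow;
    ! PASS_VALUES hands the caller's variable values over to the activated object.
    :SET &FILENAME# = "FILE_0001.csv"
    :SET &ACT# = ACTIVATE_UC_OBJECT("JOBP.DOWNLOAD_ONE_FILE",,,,,,, PASS_VALUES)
    :P "Activated child with RunID &ACT#"
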


  • 23.  Download files from a SFTP in parallel

    Posted Aug 29, 2017 06:45 AM
    Yeah, the thing is that I am a new user of AWA, and while the documentation is quite extensive, it does not explain in which context objects/settings should be used, so it is difficult if you do not have somebody pointing you in the right direction or do not know what to look for. Luckily we had a good consultant in for a few hours and he tipped me off.