I have recently had trouble with the UniversalDataReader (and its successor the FlatFileReader) when using the S3 protocol with the * wildcard. S3 appears to take the wildcard as a literal ‘*’ and returns no sources while the ETL component assumes there are no files matching the pattern and finishes successfully. We’ve worked around this via other protocols, but I thought I would put this out there in case anyone was having this issue with the S3 protocol.
Hi DTaylor,
Could you please give us more information about the behavior you are seeing? Also, if possible, could you post an anonymized example graph? We would like to better understand what might be causing the issue.
Sorry for the delay. I’m unfortunately unable to post an anonymized ETL at this time.
We were accessing a bucket using the S3 protocol with the UniversalDataReader. The ETL needed to read the contents of all files that matched a particular pattern. Previously, we had been using the * character as a part of the pattern and it had matched the files appropriately. At the time of posting, S3 had started regarding the * character as a literal rather than a wildcard, so none of the filenames matched the pattern that I had set up. So, for example, rather than finding all files that matched the pattern ‘ABC123*.txt’, the S3 protocol started telling UniversalDataReader that there were no files with the name ‘ABC123*.txt’. Since UniversalDataReader recognized the wildcard even though S3 did not, the ETL did not error and we had a process that was quietly failing.
We managed to work around this by accessing the bucket via HTTPS, but that’s not ideal when there is a dedicated protocol.
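For anyone who has to guard against the same silent failure outside of the reader, here is a minimal sketch of expanding the wildcard on the client side. It assumes you already have a list of object keys from somewhere (for example, from a prefix-based listing, since as far as I know S3 listing filters by prefix rather than by glob). The names `split_pattern` and `match_keys` are made up for illustration and are not part of CloverETL or the Amazon SDK.

```python
import fnmatch


def split_pattern(pattern):
    """Split a key pattern into (literal prefix, has_wildcard)."""
    for i, ch in enumerate(pattern):
        if ch in "*?[":
            # Everything before the first glob character is a plain prefix.
            return pattern[:i], True
    return pattern, False


def match_keys(keys, pattern):
    """Return the keys matching a glob pattern; raise if nothing matches."""
    prefix, has_wildcard = split_pattern(pattern)
    # Narrow by the literal prefix first, as a prefix-based listing would.
    candidates = [k for k in keys if k.startswith(prefix)]
    if has_wildcard:
        # Note: fnmatch's '*' also matches '/', unlike typical path globbing.
        matched = [k for k in candidates if fnmatch.fnmatchcase(k, pattern)]
    else:
        matched = [k for k in candidates if k == pattern]
    if not matched:
        # Fail loudly instead of finishing successfully with zero inputs.
        raise FileNotFoundError(f"no keys match pattern {pattern!r}")
    return matched


if __name__ == "__main__":
    # Hypothetical keys, only for demonstration.
    keys = ["Monitored/ABC123_01.txt", "Monitored/ABC123_02.txt", "Monitored/XYZ.txt"]
    print(match_keys(keys, "Monitored/ABC123*.txt"))
```

The point of the final check is that an empty match is treated as an error, so a pattern that suddenly stops matching cannot pass unnoticed.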
If you have a Corporate Server, you can use the ListFiles component to list all available files in an S3 bucket and feed those into UniversalDataWriter. Since version 4.2.0, we have been using the official Amazon SDK to access S3. Did you first encounter this change after upgrading to a later version? More details in: https://bug.javlin.eu/browse/CLO-7170.
We encountered this issue on 4.3. We do not have Corporate Server unfortunately, so ListFiles was not an option.
I just tried this on the new 4.5.0-M2 and was told that the algorithm changed in 4.4.0. Would it be possible for you to try a later version? These are the results I got:
s3://***:***@s3.amazonaws.com/cloveretl.svecp/Monitored/cust*.dat - works
s3://***:***@s3.amazonaws.com/cloveretl.svecp/Monitored/* - works
s3://***:***@s3.amazonaws.com/* - does not work, because of insufficient privileges
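The difference between the last URL and the first two is probably where the wildcard sits: in the bucket position rather than in the object key. Expanding a wildcard there would mean enumerating buckets, which is a separate permission from listing objects inside one bucket. A small sketch that tells the two cases apart, assuming path-style URLs as in the examples above (the function name `wildcard_level` is hypothetical):

```python
from urllib.parse import urlsplit


def wildcard_level(url):
    """Report whether a '*' sits in the bucket or the key of a path-style S3 URL."""
    # For path-style URLs the first path segment is the bucket name.
    path = urlsplit(url).path.lstrip("/")
    bucket, _, key = path.partition("/")
    if "*" in bucket:
        return "bucket"  # would require listing buckets
    if "*" in key:
        return "key"     # only requires listing objects in one bucket
    return "none"


if __name__ == "__main__":
    # Placeholder credentials, only for demonstration.
    print(wildcard_level("s3://user:secret@s3.amazonaws.com/cloveretl.svecp/Monitored/cust*.dat"))
    print(wildcard_level("s3://user:secret@s3.amazonaws.com/*"))
```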