Looking for an example of extracting data from a PDF file

michaelxie · November 15, 2012, 12:00am

Hi guys.

My project needs to extract data from PDF files, but I can’t find any example at http://www.cloveretl.com/examples.

Can anyone help me?

I have some experience in sentiment analysis.

I look forward to your comments.

Michael Xie

David_Hlidek · November 15, 2012, 8:21am

Hi Michael,

CloverETL does not support the PDF format directly. I would recommend converting the PDF to a supported format, e.g. txt, html, xml, … There are some free command-line tools available; see e.g. http://penguinpetes.com/b2evo/index.php?p=129. You can launch such a utility via SystemExecute.

Then you can read and parse the result via XmlExtract, UniversalDataReader, …
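
For illustration only, here is a minimal sketch of the conversion step in Python. It assumes the `pdftotext` utility (shipped with Poppler and Xpdf) is installed and on the PATH; the page linked above may describe a different tool. The helper names `build_command` and `convert` are made up for this example. In a graph, the same command line is what SystemExecute would run.

```python
import subprocess


def build_command(pdf_path, txt_path):
    """Build the pdftotext command line (assumed tool, see the note above)."""
    # -layout asks pdftotext to keep the physical layout of the page,
    # which usually makes column-like data easier to parse afterwards.
    return ["pdftotext", "-layout", pdf_path, txt_path]


def convert(pdf_path, txt_path):
    """Run the conversion and return the path of the text file."""
    # check=True raises CalledProcessError if the tool exits with an error;
    # FileNotFoundError is raised if pdftotext is not installed.
    subprocess.run(build_command(pdf_path, txt_path), check=True)
    return txt_path
```

Calling `convert("report.pdf", "report.txt")` (hypothetical file names) would leave a plain-text file that a reader component can then pick up.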

michaelxie · November 18, 2012, 9:31am

Hi, guys

Thanks for your reply.

I have used CAS (http://docs.oracle.com/cd/E29578_01/index.htm) to extract data from the PDF file.

But when I view the extracted data on the edge in CloverETL, all of it is stored in one column (Endeca_document_text). How can I split and manipulate this data in CloverETL to get exactly the information I want from the PDF file? Can you give me a detailed example?

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Maybe the question can be put more simply:

How can I split the content of one big column in CloverETL?

For example, one article is stored in one DB column.

Thank you very much.

David_Hlidek · November 19, 2012, 8:50am

Hi Michael,

I don’t know which component takes the CAS results. If it is a Reader (UniversalDataReader, ComplexDataReader, …), then just prepare metadata with the desired fields and CloverETL will parse the input for you automatically.

If it is another kind of input that produces metadata with just one string field, you can try the following:
* use http://doc.cloveretl.com/documentation/ … eader.html
* connect the edge with your current data to its input and set the URI to port:$0.FieldName:discrete (http://doc.cloveretl.com/documentation/ … aders.html)
* create and attach the desired output metadata (take particular care with the delimiters)
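
To illustrate what the last step amounts to, here is a small Python sketch (not CloverETL code) that parses the content of one big string field into records with named fields. The field names, the `;` delimiter, and the sample text are all assumptions made for the example; in CloverETL the output metadata and its delimiter settings play the role of `field_names` and `delimiter`.

```python
import csv
import io


def split_field(text, field_names, delimiter=";"):
    """Parse one big string into a list of records with named fields."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    records = []
    for row in reader:
        if not row:  # skip empty lines
            continue
        # Pair each value with its field name, trimming surrounding spaces.
        records.append(dict(zip(field_names, (value.strip() for value in row))))
    return records


# Hypothetical content of a single column such as Endeca_document_text.
sample = "Title one; Alice; 2012-11-15\nTitle two; Bob; 2012-11-18\n"
records = split_field(sample, ["title", "author", "date"])
```

If the delimiters do not match the actual text, everything ends up in the first field again, which is why they deserve the most attention.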

I hope this helps.
