Hi guys.
My project needs to extract data from PDF files, but I can’t find any example at http://www.cloveretl.com/examples.
Can anyone help me?
I have some experience in sentiment analysis.
I look forward to your comments.
Michael Xie
Hi Michael,
CloverETL does not support the PDF format directly. I would recommend converting the PDF to a supported format, e.g. txt, html, xml, … There are some free command-line tools available; see e.g. http://penguinpetes.com/b2evo/index.php?p=129. You can launch such a utility via SystemExecute.
Then you can read and parse the result via XmlExtract, UniversalDataReader, …
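To illustrate the conversion step outside of CloverETL, here is a minimal Python sketch. It assumes Poppler's `pdftotext` utility is installed and on the PATH; that tool is my choice for the example and not necessarily the one described at the link above. The helper names and file names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the argument list for Poppler's pdftotext utility."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout keeps the physical layout, which helps with tabular PDFs
        cmd.append("-layout")
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def convert_pdf_to_text(pdf_path, txt_path):
    """Run pdftotext and return the path of the text file it wrote."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    # check=True raises CalledProcessError if the conversion fails
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)
    return Path(txt_path)
```

Calling `convert_pdf_to_text("report.pdf", "report.txt")` would run `pdftotext -layout report.pdf report.txt`. In a graph, the same command line could presumably be given to SystemExecute, with a reader pointed at the resulting text file.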
“kubosj”
Hi, guys
Thanks for your reply.
I used CAS (http://docs.oracle.com/cd/E29578_01/index.htm) to extract data from the PDF file.
But when I view the extracted data on the edge in CloverETL, all of it is stored in one column (Endeca_document_text). How can I split and manipulate this data in CloverETL to get exactly the information I want from the PDF file? Can you give me a detailed example?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Maybe the question can be put more simply:
How can I split the content of one big column in CloverETL?
For example, a whole article is stored in one DB column.
Thank you very much.
Hi Michael,
I don’t know which component takes the CAS results. If it is a Reader (UniversalDataReader, ComplexDataReader, …), then just prepare metadata with the desired fields and CloverETL will parse the input for you automatically.
If there is another kind of input producing metadata with just one string field, you can try the following:
* use http://doc.cloveretl.com/documentation/ … eader.html
* connect the edge with your current data to its input and set the URI to port:$0.FieldName:discrete (http://doc.cloveretl.com/documentation/ … aders.html)
* create and attach the desired output metadata (take special care with the delimiters)
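The idea behind these steps, re-parsing one string field as delimited data, can be sketched in plain Python. This is only an illustration of the technique, not CloverETL code; the function name, the field names, and the `;` delimiter are assumptions made for the example.

```python
import csv
import io


def split_text_column(text, field_names, delimiter=";"):
    """Re-parse one string field as delimited data, one record per line."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    records = []
    for row in reader:
        if not row:  # csv.reader yields an empty list for a blank line
            continue
        if len(row) != len(field_names):
            raise ValueError(
                f"expected {len(field_names)} fields, got {len(row)}: {row}"
            )
        # Pair each value with its field name, trimming stray whitespace
        records.append(dict(zip(field_names, (value.strip() for value in row))))
    return records
```

For instance, `split_text_column("Alice;30;Prague\nBob;25;Brno\n", ["name", "age", "city"])` returns one dictionary per line. The field list plays the role of the output metadata, and a wrong delimiter shows up as a field-count error, which is why the delimiters deserve attention.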
I hope this helps.