Saturday, July 5, 2014

Integrating Alfresco with Tesseract OCR (open source) so that image files are indexed by their content.

This post will help you integrate Tesseract OCR with Alfresco.

What is OCR?

Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic conversion of scanned or photographed images of typewritten or printed text into machine-encoded/computer-readable text. It is widely used as a form of data entry from some sort of original paper data source, whether passport documents, invoices, bank statements, receipts, business cards, mail, or any number of printed records. It is a common method of digitizing printed texts so that they can be electronically edited, searched, stored more compactly, displayed on-line, and used in machine processes such as machine translation, text-to-speech, key data extraction and text mining. OCR is a field of research in pattern recognition, artificial intelligence and computer vision.


With this integration in place, Alfresco indexes the text that Tesseract extracts from image files.

 Required things

1)Alfresco
2)Tesseract OCR Software

Installing Tesseract OCR:

Download Tesseract from (https://code.google.com/p/tesseract-ocr/downloads/list)
and install it by double-clicking tesseract-ocr-setup-3.02.02.exe.

After installation, Tesseract OCR is located in "C:\Program Files (x86)\Tesseract-OCR".

Changes to be made in Alfresco

Files to be added:

1)OCR.bat
2)ocrpng-transform-context.xml
3)ocrjpeg-transform-context.xml
4)ocrtiff-transform-context.xml
5)alfresco-tesseract-search.jar
6)ocrtransform.log

1)OCR.bat

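REM Log each call so we can see what happens
echo from %1 to %2 >>C:\tmp\ocrtransform.log

REM Copy the source image to a working folder (quoted in case the name has spaces)
copy /Y %1 "C:\TMP\%~nx1"

REM Call Tesseract. It appends .txt to the output base name,
REM so pass the target path without its extension (%~dpn2).
"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe" "C:\TMP\%~nx1" "%~dpn2" -l eng
del "C:\TMP\%~nx1"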

Place this batch script in your Alfresco installation directory, "C:\Alfresco".

The batch script passes the uploaded file to Tesseract, which does the actual OCR, and appends a line to ocrtransform.log for each call. Tesseract writes the recognized text to the target file, which Alfresco then reads. The script uses eng as the default language; you can change it, or give several languages joined with + (for example -l eng+deu, provided the matching language data is installed).

The following transformation XML files are to be added in "C:\Alfresco\tomcat\shared\classes\alfresco\extension"

2)ocrpng-transform-context.xml

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<beans default-lazy-init="false" default-autowire="no" default-dependency-check="none">
  <bean id="transformer.worker.ocr.png" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker" lazy-init="default" autowire="default" dependency-check="default">
    <property name="mimetypeService">
      <ref bean="mimetypeService" />
    </property>
    <property name="checkCommand">
      <bean class="org.alfresco.util.exec.RuntimeExec" lazy-init="default" autowire="default" dependency-check="default">
        <property name="commandsAndArguments">
          <map>
            <entry key="Windows.*">
              <list>
                <value>C:\Windows\System32\cmd.exe</value>
                <value>/C</value>
                <value>dir c:\Alfresco\ocr.bat</value>
              </list>
            </entry>
          </map>
        </property>
        <property name="errorCodes">
          <value>1</value>
        </property>
      </bean>
    </property>
    <property name="transformCommand">
      <bean class="org.alfresco.util.exec.RuntimeExec" lazy-init="default" autowire="default" dependency-check="default">
        <property name="commandsAndArguments">
          <map>
            <entry key="Windows.*">
              <list>
                <value>C:\Windows\System32\cmd.exe</value>
                <value>/C</value>
                <value>C:\Alfresco\ocr.bat</value>
                <value>"${source}"</value>
                <value>"${target}"</value>
              </list>
            </entry>
          </map>
        </property>
        <property name="errorCodes">
          <value>1,2</value>
        </property>
      </bean>
    </property>
    <property name="explicitTransformations">
      <list>
        <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails" lazy-init="default" autowire="default" dependency-check="default">
          <property name="sourceMimetype">
            <value>image/png</value>
          </property>
          <property name="targetMimetype">
            <value>text/plain</value>
          </property>
        </bean>
      </list>
    </property>
  </bean>
  <bean id="transformer.ocr.png" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer" lazy-init="default" autowire="default" dependency-check="default">
    <property name="worker">
      <ref bean="transformer.worker.ocr.png" />
    </property>
  </bean>
</beans>

3)ocrjpeg-transform-context.xml

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<beans default-lazy-init="false" default-autowire="no" default-dependency-check="none">
  <bean id="transformer.worker.ocr.jpeg" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker" lazy-init="default" autowire="default" dependency-check="default">
    <property name="mimetypeService">
      <ref bean="mimetypeService" />
    </property>
    <property name="checkCommand">
      <bean class="org.alfresco.util.exec.RuntimeExec" lazy-init="default" autowire="default" dependency-check="default">
        <property name="commandsAndArguments">
          <map>
            <entry key="Windows.*">
              <list>
                <value>C:\Windows\System32\cmd.exe</value>
                <value>/C</value>
                <value>dir c:\Alfresco\ocr.bat</value>
              </list>
            </entry>
          </map>
        </property>
        <property name="errorCodes">
          <value>1</value>
        </property>
      </bean>
    </property>
    <property name="transformCommand">
      <bean class="org.alfresco.util.exec.RuntimeExec" lazy-init="default" autowire="default" dependency-check="default">
        <property name="commandsAndArguments">
          <map>
            <entry key="Windows.*">
              <list>
                <value>C:\Windows\System32\cmd.exe</value>
                <value>/C</value>
                <value>C:\Alfresco\ocr.bat</value>
                <value>"${source}"</value>
                <value>"${target}"</value>
              </list>
            </entry>
          </map>
        </property>
        <property name="errorCodes">
          <value>1,2</value>
        </property>
      </bean>
    </property>
    <property name="explicitTransformations">
      <list>
        <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails" lazy-init="default" autowire="default" dependency-check="default">
          <property name="sourceMimetype">
            <value>image/jpeg</value>
          </property>
          <property name="targetMimetype">
            <value>text/plain</value>
          </property>
        </bean>
      </list>
    </property>
  </bean>
  <bean id="transformer.ocr.jpeg" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer" lazy-init="default" autowire="default" dependency-check="default">
    <property name="worker">
      <ref bean="transformer.worker.ocr.jpeg" />
    </property>
  </bean>
</beans>

4)ocrtiff-transform-context.xml

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<beans default-lazy-init="false" default-autowire="no" default-dependency-check="none">
  <bean id="transformer.worker.ocr.tiff" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker" lazy-init="default" autowire="default" dependency-check="default">
    <property name="mimetypeService">
      <ref bean="mimetypeService" />
    </property>
    <property name="checkCommand">
      <bean class="org.alfresco.util.exec.RuntimeExec" lazy-init="default" autowire="default" dependency-check="default">
        <property name="commandsAndArguments">
          <map>
            <entry key="Windows.*">
              <list>
                <value>C:\Windows\System32\cmd.exe</value>
                <value>/C</value>
                <value>dir c:\Alfresco\ocr.bat</value>
              </list>
            </entry>
          </map>
        </property>
        <property name="errorCodes">
          <value>1</value>
        </property>
      </bean>
    </property>
    <property name="transformCommand">
      <bean class="org.alfresco.util.exec.RuntimeExec" lazy-init="default" autowire="default" dependency-check="default">
        <property name="commandsAndArguments">
          <map>
            <entry key="Windows.*">
              <list>
                <value>C:\Windows\System32\cmd.exe</value>
                <value>/C</value>
                <value>C:\Alfresco\ocr.bat</value>
                <value>"${source}"</value>
                <value>"${target}"</value>
              </list>
            </entry>
          </map>
        </property>
        <property name="errorCodes">
          <value>1,2</value>
        </property>
      </bean>
    </property>
    <property name="explicitTransformations">
      <list>

        <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails" lazy-init="default" autowire="default" dependency-check="default">
          <property name="sourceMimetype">
            <value>image/tiff</value>
          </property>
          <property name="targetMimetype">
            <value>text/plain</value>
          </property>
        </bean>


      </list>
    </property>
  </bean>
  <bean id="transformer.ocr.tiff" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer" lazy-init="default" autowire="default" dependency-check="default">
    <property name="worker">
      <ref bean="transformer.worker.ocr.tiff" />
    </property>
  </bean>

</beans>

These are all the transformation files. We can write any number of them, one for each file format that we want to OCR with Tesseract.
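As a rough sketch of what one more such file could look like, here is the same pattern applied to BMP images. The file name ocrbmp-transform-context.xml and the bean ids transformer.worker.ocr.bmp and transformer.ocr.bmp are hypothetical names chosen for this example; it also assumes that your Alfresco version registers the image/bmp MIME type and that your Tesseract build can read BMP files. The only parts that differ from the files above are the bean ids, which must be unique across all context files, and the source MIME type.

```xml
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<!-- Hypothetical ocrbmp-transform-context.xml: same structure as the PNG/JPEG/TIFF files -->
<beans default-lazy-init="false" default-autowire="no" default-dependency-check="none">
  <!-- Bean id is an assumption; it only has to be unique among the OCR context files -->
  <bean id="transformer.worker.ocr.bmp" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker" lazy-init="default" autowire="default" dependency-check="default">
    <property name="mimetypeService">
      <ref bean="mimetypeService" />
    </property>
    <!-- Availability check: the transformer is usable only if ocr.bat exists -->
    <property name="checkCommand">
      <bean class="org.alfresco.util.exec.RuntimeExec" lazy-init="default" autowire="default" dependency-check="default">
        <property name="commandsAndArguments">
          <map>
            <entry key="Windows.*">
              <list>
                <value>C:\Windows\System32\cmd.exe</value>
                <value>/C</value>
                <value>dir c:\Alfresco\ocr.bat</value>
              </list>
            </entry>
          </map>
        </property>
        <property name="errorCodes">
          <value>1</value>
        </property>
      </bean>
    </property>
    <!-- The actual transformation: ocr.bat receives the source and target paths -->
    <property name="transformCommand">
      <bean class="org.alfresco.util.exec.RuntimeExec" lazy-init="default" autowire="default" dependency-check="default">
        <property name="commandsAndArguments">
          <map>
            <entry key="Windows.*">
              <list>
                <value>C:\Windows\System32\cmd.exe</value>
                <value>/C</value>
                <value>C:\Alfresco\ocr.bat</value>
                <value>"${source}"</value>
                <value>"${target}"</value>
              </list>
            </entry>
          </map>
        </property>
        <property name="errorCodes">
          <value>1,2</value>
        </property>
      </bean>
    </property>
    <!-- Only the source MIME type changes from one format to the next -->
    <property name="explicitTransformations">
      <list>
        <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails" lazy-init="default" autowire="default" dependency-check="default">
          <property name="sourceMimetype">
            <value>image/bmp</value>
          </property>
          <property name="targetMimetype">
            <value>text/plain</value>
          </property>
        </bean>
      </list>
    </property>
  </bean>
  <bean id="transformer.ocr.bmp" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer" lazy-init="default" autowire="default" dependency-check="default">
    <property name="worker">
      <ref bean="transformer.worker.ocr.bmp" />
    </property>
  </bean>
</beans>
```

Like the other transformation files, this one would go in "C:\Alfresco\tomcat\shared\classes\alfresco\extension" and would take effect after Alfresco is restarted.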

5)alfresco-tesseract-search.jar

Download this jar from this link (https://docs.google.com/file/d/0B94FD2QmPSJCNHpuUVlicW95UjA/edit)
and place it in "C:\Alfresco\tomcat\lib".

6)ocrtransform.log

Create an empty file named ocrtransform.log in "C:\TMP".

After that, restart Alfresco.

Then upload image files. The content of each image will be indexed in Alfresco, so we can search for text inside the file.

Testing


Upload an image file to Alfresco.

I will upload this PNG file into Alfresco.

Then search for text that appears in the image, such as "water".

That's it.

3 comments:

  1. How can we do the same for PDF files?

    I want to perform OCR action on pdf files.

    is there any solution for that?

    Can you please help me with this?

    kind regards.

  2. Yes, it will work. Instead of an image, just pass the PDF, and check the packages and interfaces in the Tesseract OCR component. From there you can change the XML configuration as required for PDFs.
