WebSubmit: new fulltext conversion tools
Supporting conversion among PDF, PDF/A, PS(.GZ), DJVU, HTML, XML, TEXT,
OpenOffice.org, Microsoft Office documents. Integrated with BibDocFile, with spell checking.
OCR support and PDF enhancing with OCR information.
kaplun committed Apr 13, 2010
1 parent 0dc6d9c commit 7b3804d
Showing 54 changed files with 3,325 additions and 1,682 deletions.
40 changes: 25 additions & 15 deletions INSTALL
@@ -35,17 +35,16 @@ Contents
mysql-server mysql-client python-mysqldb \
python-4suite-xml python-simplejson python-xml \
python-libxml2 python-libxslt1 gnuplot poppler-utils \
gs-common antiword catdoc wv html2text ppthtml xlhtml \
clisp gettext libapache2-mod-wsgi unzip python-numpy \
python-rdflib python-gnuplot python-magic pdftk \
html2text giflib-tools pstotext
gs-common clisp gettext libapache2-mod-wsgi unzip python-rdflib \
python-gnuplot python-magic pdftk html2text giflib-tools \
pstotext netpbm

You may also want to install some of the following packages,
if you have them available on your concrete architecture:

$ sudo aptitude install rxp python-psyco sbcl cmucl \
pylint pychecker pyflakes python-profiler python-epydoc \
libapache2-mod-xsendfile
libapache2-mod-xsendfile openoffice.org

Moreover, you should install some Message Transfer Agent (MTA)
such as Postfix so that CDS Invenio can email notification
@@ -149,6 +148,9 @@ Contents
files (i.e. to have fulltext indexing) or to stamp submitted
files, then you need as well to install some of the following
tools:
- for Microsoft Office/OpenOffice.org document conversion:
OpenOffice.org
<http://www.openoffice.org/>
- for PDF file stamping: pdftk, pdf2ps
<http://www.accesspdf.com/pdftk/>
<http://www.cs.wisc.edu/~ghost/doc/AFPL/>
@@ -158,16 +160,16 @@ Contents
<http://www.cs.wisc.edu/~ghost/doc/AFPL/>
- for PostScript files: pstotext or ps2ascii
<http://www.cs.wisc.edu/~ghost/doc/AFPL/>
- for MS Word files: antiword, catdoc, or wvText
<http://www.winfield.demon.nl/index.html>
<http://www.ice.ru/~vitus/catdoc/index.html>
<http://sourceforge.net/projects/wvware>
- for MS PowerPoint files: pptHtml and html2text
<http://packages.debian.org/stable/utils/ppthtml>
<http://userpage.fu-berlin.de/~mbayer/tools/html2text.html>
- for MS Excel files: xlhtml and html2text
<http://chicago.sourceforge.net/xlhtml/>
<http://userpage.fu-berlin.de/~mbayer/tools/html2text.html>
- for DjVu creation, elaboration: DjVuLibre
<http://djvu.sourceforge.net>
- to perform OCR: OCRopus (tested only with release 0.3.1)
<http://code.google.com/p/ocropus/>
- to perform different image elaborations: ImageMagick
<http://www.imagemagick.org/>
- to generate PDF after OCR: ReportLab
<http://www.reportlab.org/rl_toolkit.html>
- to analyze images to generate PDF after OCR: netpbm
<http://netpbm.sourceforge.net/>
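
Since every tool in the list above is optional, it can help to see which of them are actually reachable before enabling fulltext conversion. The following is a minimal sketch, not part of this commit. The executable names are assumptions drawn from the list above and from the CFG_PATH_* entries added elsewhere in this change, and the helper names are hypothetical.

```python
import shutil

# Executable names are assumptions based on the tools named in INSTALL and
# on the CFG_PATH_* settings in this commit; adjust them to your platform.
OPTIONAL_TOOLS = [
    "pdftk", "pdf2ps", "pstotext", "ps2ascii",
    "djvutxt", "ocroscript", "convert", "pamfile",
]


def find_tools(names):
    """Map each executable name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names):
    """Return, sorted, the names that could not be found on PATH."""
    found = find_tools(names)
    return sorted(name for name, path in found.items() if path is None)


for tool, location in find_tools(OPTIONAL_TOOLS).items():
    print("%-12s %s" % (tool, location or "not found"))
```

A missing tool only disables the conversions that depend on it, so the result is best read as a report rather than a hard failure.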

h) If you have chosen to install fast XML MARC Python processors
in the step d) above, then you have to install the parsers
@@ -358,6 +360,14 @@ Contents
option, then the first executable that will be found
in your PATH will be chosen for running CDS Invenio.

--with-openoffice-python

Optionally, specify the path to the Python interpreter
embedded with OpenOffice.org. This is normally not
contained in the normal path. If you don't specify this
it won't be possible to use OpenOffice.org to convert from and
to Microsoft Office and OpenOffice.org documents.
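
One way to sanity-check a candidate value for --with-openoffice-python is to ask that interpreter whether it can import the `uno` module, which is the Python bridge shipped with OpenOffice.org. This is a sketch under that assumption, not the check performed by configure. The function name `can_import` is hypothetical.

```python
import subprocess
import sys


def can_import(python_path, module="uno", timeout=30):
    """Return True if the interpreter at python_path can import module."""
    try:
        result = subprocess.run(
            [python_path, "-c", "import " + module],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # The interpreter is missing, not executable, or hung.
        return False
    return result.returncode == 0


# Example call against the interpreter running this sketch; for a real check,
# pass the path you intend to give to --with-openoffice-python and keep the
# default module name.
print(can_import(sys.executable, "os"))
```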

This configuration step is mandatory. Usually, you do this
step only once.

16 changes: 16 additions & 0 deletions Makefile.am
@@ -236,6 +236,21 @@ uninstall-fckeditor-plugin:
@echo "** The FCKeditor plugin was successfully uninstalled. **"
@echo "***********************************************************"

install-pdfa-helper-files:
@echo "***********************************************************"
@echo "** Installing PDF/A helper files, please wait... **"
@echo "***********************************************************"
wget 'http://cdsware.cern.ch/download/invenio-demo-site-files/ISOCoatedsb.icc' -O ${prefix}/etc/websubmit/file_converter_templates/ISOCoatedsb.icc
@echo "***********************************************************"
@echo "** The PDF/A helper files were successfully installed. **"
@echo "***********************************************************"

uninstall-pdfa-helper-files:
rm -f ${prefix}/etc/websubmit/file_converter_templates/ISOCoatedsb.icc
@echo "***********************************************************"
@echo "** The PDF/A helper files were successfully uninstalled. **"
@echo "***********************************************************"

update-v0.3.0-tables update-v0.3.1-tables:
echo "ALTER TABLE idxINDEXNAME CHANGE id_idxINDEX id_idxINDEX mediumint(9) unsigned NOT NULL FIRST;" | ${prefix}/bin/dbexec
echo "ALTER TABLE rnkMETHODNAME CHANGE id_rnkMETHOD id_rnkMETHOD mediumint(9) unsigned NOT NULL FIRST;" | ${prefix}/bin/dbexec
@@ -384,6 +399,7 @@ update-v0.99.1-tables:
echo "INSERT INTO knwKBRVAL (id,m_key,m_value,id_knwKB) SELECT id,m_key,m_value,id_fmtKNOWLEDGEBASES FROM fmtKNOWLEDGEBASEMAPPINGS;" | ${prefix}/bin/dbexec
echo "ALTER TABLE sbmPARAMETERS CHANGE name name varchar(40) NOT NULL default '';" | ${prefix}/bin/dbexec
echo "ALTER TABLE bibdoc CHANGE docname docname varchar(250) COLLATE utf8_bin NOT NULL default 'file';" | ${prefix}/bin/dbexec
echo "ALTER TABLE bibdoc ADD COLUMN text_extraction_date datetime NOT NULL default '0000-00-00';" | ${prefix}/bin/dbexec
echo "ALTER TABLE collection DROP COLUMN restricted;" | ${prefix}/bin/dbexec
echo "ALTER TABLE schTASK CHANGE host host varchar(255) NOT NULL default '';" | ${prefix}/bin/dbexec
echo "ALTER TABLE hstTASK CHANGE host host varchar(255) NOT NULL default '';" | ${prefix}/bin/dbexec
11 changes: 10 additions & 1 deletion THANKS
@@ -126,6 +126,16 @@ Function icon set, and (iii) the activity indicator icon.
<http://wefunction.com/2008/07/function-free-icon-set/>
<http://www.badeziner.com/2008/05/04/120-free-ajax-activity-indicator-gif-icons/>

The unoconv.py script has been adapted from UNOCONV by Dag Wieers.
<http://dag.wieers.com/home-made/unoconv/>

PDFA_def.ps has been adapted from the GPL distribution of GhostScript.
<http://ghostscript.com/>

The ISOCoatedsb.icc ICC profile has been retrieved from the European Color
Initiative.
<http://www.eci.org/>

The PEP8 conformance checking script (pep8.py) was written by Johann
C. Rocholl <[email protected]>. The pep8.py version included
with CDS Invenio was downloaded from
@@ -136,5 +146,4 @@ The asyncproc module to manage asynchronous processes with timeout support
was written by Thomas Bellman <[email protected]>. The asyncproc.py
version included with CDS Invenio was downloaded from
<http://www.lysator.liu.se/~bellman/download/asyncproc.py> on 2009-07-13.

- end of file -
3 changes: 2 additions & 1 deletion config.nice.in
@@ -5,4 +5,5 @@
--with-mysql=@MYSQL@ \
--with-clisp=@CLISP@ \
--with-cmucl=@CMUCL@ \
--with-sbcl=@SBCL@
--with-sbcl=@SBCL@ \
--with-openoffice-python=@OPENOFFICE_PYTHON@
21 changes: 13 additions & 8 deletions config/invenio-autotools.conf.in
@@ -39,24 +39,29 @@ CFG_WEBDIR = @localstatedir@/www
## path to interesting programs:
CFG_PATH_MYSQL = @MYSQL@
CFG_PATH_PHP = @PHP@
CFG_PATH_ACROREAD = @ACROREAD@
CFG_PATH_GZIP = @GZIP@
CFG_PATH_GUNZIP = @GUNZIP@
CFG_PATH_TAR = @TAR@
CFG_PATH_DISTILLER = @PS2PDF@
CFG_PATH_GFILE = @FILE@
CFG_PATH_CONVERT = @CONVERT@
CFG_PATH_PDFTOTEXT = @PDFTOTEXT@
CFG_PATH_PDFTK = @PDFTK@
CFG_PATH_PDFTOPS = @PDFTOPS@
CFG_PATH_PDF2PS = @PDF2PS@
CFG_PATH_PDFINFO = @PDFINFO@
CFG_PATH_PDFTOPPM = @PDFTOPPM@
CFG_PATH_PAMFILE = @PAMFILE@
CFG_PATH_GS = @GS@
CFG_PATH_PS2PDF = @PS2PDF@
CFG_PATH_PDFOPT = @PDFOPT@
CFG_PATH_PSTOTEXT = @PSTOTEXT@
CFG_PATH_PSTOASCII = @PSTOASCII@
CFG_PATH_ANTIWORD = @ANTIWORD@
CFG_PATH_CATDOC = @CATDOC@
CFG_PATH_WVTEXT = @WVTEXT@
CFG_PATH_PPTHTML = @PPTHTML@
CFG_PATH_XLHTML = @XLHTML@
CFG_PATH_HTMLTOTEXT = @HTMLTOTEXT@
CFG_PATH_ANY2DJVU = @ANY2DJVU@
CFG_PATH_DJVUPS = @DJVUPS@
CFG_PATH_DJVUTXT = @DJVUTXT@
CFG_PATH_TIFF2PDF = @TIFF2PDF@
CFG_PATH_OCROSCRIPT = @OCROSCRIPT@
CFG_PATH_OPENOFFICE_PYTHON = @OPENOFFICE_PYTHON@
CFG_PATH_WGET = @WGET@
CFG_PATH_MD5SUM = @MD5SUM@

39 changes: 39 additions & 0 deletions config/invenio.conf
@@ -514,6 +514,35 @@ CFG_BIBDOCFILE_USE_XSENDFILE = 0
## the check to be performed once for every 10 downloads)
CFG_BIBDOCFILE_MD5_CHECK_PROBABILITY = 0.1

## CFG_OPENOFFICE_SERVER_HOST -- the host where an OpenOffice Server is
## listening to. If localhost an OpenOffice server will be started
## automatically if it is not already running.
## Note: if you set this to an empty value this will disable the usage of
## OpenOffice for converting documents.
## If you set this to something different than localhost you'll have to take
## care to have an OpenOffice server running on the corresponding host and
## to install the same OpenOffice release both on the client and on the server
## side.
## In order to launch an OpenOffice server on a remote machine, just start
## the usual 'soffice' executable in this way:
## $> soffice -headless -nologo -nodefault -norestore -nofirststartwizard \
## .. -accept=socket,host=HOST,port=PORT;urp;StarOffice.ComponentContext
CFG_OPENOFFICE_SERVER_HOST = localhost

## CFG_OPENOFFICE_SERVER_PORT -- the port where an OpenOffice Server is
## listening to.
CFG_OPENOFFICE_SERVER_PORT = 2002

## CFG_OPENOFFICE_USER -- the user that will be used to launch the OpenOffice
## client. It is recommended to set this to a user who don't own files, like
## e.g. 'nobody'. You should also authorize your Apache server user to be
## able to become this user, e.g. by adding to your /etc/sudoers the following
## line:
## "apache ALL=(nobody) NOPASSWD: ALL"
## provided that apache is the username corresponding to the Apache user.
## On some machine this might be apache2 or www-data.
CFG_OPENOFFICE_USER = nobody
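
The three OpenOffice settings above combine into a single server command line. The sketch below shows one way to assemble it, reusing the `soffice` flags quoted in the comment and the standard `sudo -u` form that the suggested sudoers line permits. It is not the launcher code from this commit, and the function name `soffice_command` is hypothetical.

```python
def soffice_command(host, port, user=None):
    """Build the argument list for starting an OpenOffice server.

    Returns None when host is empty, mirroring the documented meaning of an
    empty CFG_OPENOFFICE_SERVER_HOST (OpenOffice conversion disabled).
    """
    if not host:
        return None
    accept = "-accept=socket,host=%s,port=%d;urp;StarOffice.ComponentContext" % (
        host,
        int(port),
    )
    command = [
        "soffice", "-headless", "-nologo", "-nodefault",
        "-norestore", "-nofirststartwizard", accept,
    ]
    if user:
        # Assumes the calling account may become `user` via sudo.
        command = ["sudo", "-u", user] + command
    return command


print(soffice_command("localhost", 2002, user="nobody"))
```

Returning a list rather than a single string keeps the semicolons in the `-accept` argument away from shell interpretation when the list is handed to `subprocess`.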

#################################
## Part 6: BibIndex parameters ##
#################################
Expand Down Expand Up @@ -574,6 +603,16 @@ CFG_BIBINDEX_URLOPENER_PASSWORD = mysuperpass
## and to disable this value for speed improvements.
CFG_INTBITSET_ENABLE_SANITY_CHECKS = False

## CFG_BIBINDEX_PERFORM_OCR_ON_DOCNAMES -- regular expression that matches
## docnames for which OCR is desired (set this to .* in order to enable
## OCR in general, set this to empty in order to disable it.)
CFG_BIBINDEX_PERFORM_OCR_ON_DOCNAMES = scan-.*

## CFG_BIBINDEX_SPLASH_PAGES -- regular expression that matches URLs
## that are not to be indexed but that indirectly refers to documents
## that are supposed to be indexed.
CFG_BIBINDEX_SPLASH_PAGES = http://documents\.cern\.ch/setlink\?.*
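
To illustrate how the two regular expressions above might be applied, here is a minimal sketch. It assumes the patterns are matched from the start of the docname or URL (`re.match`) and that an empty pattern disables the feature, as the OCR comment states. The indexer's actual matching code is not shown in this part of the diff, and both helper names are hypothetical.

```python
import re


def wants_ocr(docname, pattern):
    """Return True if OCR should be attempted for this docname."""
    if not pattern:
        # An empty setting disables OCR altogether.
        return False
    return re.match(pattern, docname) is not None


def is_splash_page(url, pattern):
    """Return True if the URL is a splash page rather than a document."""
    if not pattern:
        return False
    return re.match(pattern, url) is not None


OCR_PATTERN = r"scan-.*"
SPLASH_PATTERN = r"http://documents\.cern\.ch/setlink\?.*"

print(wants_ocr("scan-0001", OCR_PATTERN))
print(is_splash_page("http://documents.cern.ch/setlink?base=x", SPLASH_PATTERN))
```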

#######################################
## Part 7: Access control parameters ##
#######################################
19 changes: 19 additions & 0 deletions configure-tests.py
@@ -237,6 +237,25 @@ def wait_for_user(msg):
*****************************************************
""" % msg

try:
import reportlab
except ImportError, msg:
print """
*****************************************************
** IMPORT WARNING %s
*****************************************************
** Note that reportlab module is not really **
** required, but we recommend it you want to **
** enrich PDF with OCR information. **
** **
** You can safely continue installing CDS Invenio **
** now, and add this module anytime later. (I.e. **
** even after your CDS Invenio installation is put **
** into production.) **
*****************************************************
""" % msg
wait_for_user("Press ENTER to continue the installation...")
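
The reportlab check above follows a common pattern: probe for an optional module and warn instead of aborting. A compact sketch of the same idea in current Python syntax is shown below. It is not a replacement for the configure-tests.py code, and the helper names are hypothetical.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


def optional_module_warnings(requirements):
    """Return one warning line per missing optional module.

    `requirements` maps a top-level module name to the feature that needs it.
    """
    return [
        "WARNING: %s not found; it is only needed to %s." % (name, purpose)
        for name, purpose in sorted(requirements.items())
        if not module_available(name)
    ]


for line in optional_module_warnings({"reportlab": "enrich PDF with OCR information"}):
    print(line)
```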

## 4) check for versions of some important modules:
if MySQLdb.__version__ < cfg_min_mysqldb_version:
print """