Scraping a PDF #51

@psychemedia

How do I scrape a local PDF?

I'm running:

  • norma/releases/download/v0.2.26/norma_0.1.SNAPSHOT_all.deb
  • ami/releases/download/v0.2.24/ami2_0.1.SNAPSHOT_all.deb

and, using one of your test files, I'm trying:

norma  -i /contentmineself/trialsjournal_15_1_511.pdf -o /contentmineself/test_ct/

but all it seems to do is copy the PDF and rename it fulltext.pdf?

If I add the switch --transform pdf2html, as per #38, I get:

java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.xmlcml.cmine.args.DefaultArgProcessor.instantiateAndRunMethod(DefaultArgProcessor.java:1049)
    at org.xmlcml.cmine.args.DefaultArgProcessor.runMethodsOfType(DefaultArgProcessor.java:946)
    at org.xmlcml.cmine.args.DefaultArgProcessor.runRunMethodsOnChosenArgOptions(DefaultArgProcessor.java:927)
    at org.xmlcml.cmine.args.DefaultArgProcessor.runAndOutput(DefaultArgProcessor.java:1111)
    at org.xmlcml.norma.Norma.run(Norma.java:23)
    at org.xmlcml.norma.Norma.main(Norma.java:18)
Caused by: java.lang.RuntimeException: Input must be reserved file; found: /contentmineself/trialsjournal_15_1_511.pdf
    at org.xmlcml.norma.NormaArgProcessor.checkAndGetInputFile(NormaArgProcessor.java:282)
    at org.xmlcml.norma.NormaTransformer.transform(NormaTransformer.java:114)
    at org.xmlcml.norma.NormaArgProcessor.runTransform(NormaArgProcessor.java:202)
    ... 10 more
0    [main] DEBUG org.xmlcml.cmine.args.DefaultArgProcessor  - option in exception  or --transform; (1,2147483647); parseTransform; STRING: null / []; pdf2html; [pdf2html]
java.lang.RuntimeException: invoke runTransform fails
    at org.xmlcml.cmine.args.DefaultArgProcessor.instantiateAndRunMethod(DefaultArgProcessor.java:1052)
    at org.xmlcml.cmine.args.DefaultArgProcessor.runMethodsOfType(DefaultArgProcessor.java:946)
    at org.xmlcml.cmine.args.DefaultArgProcessor.runRunMethodsOnChosenArgOptions(DefaultArgProcessor.java:927)
    at org.xmlcml.cmine.args.DefaultArgProcessor.runAndOutput(DefaultArgProcessor.java:1111)
    at org.xmlcml.norma.Norma.run(Norma.java:23)
    at org.xmlcml.norma.Norma.main(Norma.java:18)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.xmlcml.cmine.args.DefaultArgProcessor.instantiateAndRunMethod(DefaultArgProcessor.java:1049)
    ... 5 more
Caused by: java.lang.RuntimeException: Input must be reserved file; found: /contentmineself/trialsjournal_15_1_511.pdf
    at org.xmlcml.norma.NormaArgProcessor.checkAndGetInputFile(NormaArgProcessor.java:282)
    at org.xmlcml.norma.NormaTransformer.transform(NormaTransformer.java:114)
    at org.xmlcml.norma.NormaArgProcessor.runTransform(NormaArgProcessor.java:202)
    ... 10 more
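
Reading the trace, the failure comes from checkAndGetInputFile, which rejects the path because it is not a "reserved file". My guess, which is an assumption on my part and not something I have confirmed in the norma source, is that the check looks at the file name, and that fulltext.pdf (the name the first command produced) is one of the accepted names. A minimal shell sketch of that reading, where is_reserved_input is a hypothetical helper of mine and not part of norma:

```shell
# Hypothetical helper (not part of norma): succeeds only when the path's
# file name is fulltext.pdf, the one reserved-looking name seen in this issue.
# ASSUMPTION: norma's real check and its full list of reserved names may differ.
is_reserved_input() {
  case "$(basename -- "$1")" in
    fulltext.pdf) return 0 ;;
    *) return 1 ;;
  esac
}

# Compare the path I passed with the file the first command created.
for f in /contentmineself/trialsjournal_15_1_511.pdf \
         /contentmineself/test_ct/fulltext.pdf; do
  if is_reserved_input "$f"; then
    echo "reserved name: $f"
  else
    echo "not a reserved name: $f"
  fi
done
```

If that reading is right, the transform presumably needs to be pointed at the fulltext.pdf inside test_ct rather than at the original path; I haven't confirmed the exact invocation.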

My complete install is:

RUN apt-get clean -y && apt-get -y update && apt-get -y upgrade && \
  apt-get -y update && apt-get install -y wget ant unzip openjdk-7-jdk  && \
    apt-get clean -y

RUN wget --no-check-certificate https://github.com/ContentMine/norma/releases/download/v0.2.26/norma_0.1.SNAPSHOT_all.deb

RUN wget --no-check-certificate https://github.com/ContentMine/ami/releases/download/v0.2.24/ami2_0.1.SNAPSHOT_all.deb

RUN dpkg -i norma_0.1.SNAPSHOT_all.deb
RUN dpkg -i ami2_0.1.SNAPSHOT_all.deb

RUN npm install --global getpapers

in a basic Linux environment with Node installed (Docker Hub image node:4.3.2).

Hmm - is this the issue maybe? #21 (comment)
