text2doc
The Datanote feature extraction engine, as a microservice.
TODO
- support multiple formats:
  - datanote: a custom, low-level format supported by Datanote
  - json: basic list of entities
  - gexf: GEXF graph
  - csv: CSV graph (for Neo4J), see https://neo4j.com/developer/guide-import-csv/
List of features
Formats
The API supports multiple output formats to export entities and, for some formats, the relationships between them.
GEXF
TODO
RDFa
Only basic (and custom) RDFa is supported, example:
the Monkey has Ebolavirus
Features
Custom fields
Optional URL parameters:
- locale: `en`, `fr` (example: `?locale=en`, `&locale=fr` ..)
- fields: values to keep (example: `fields=id,label`, `&fields=label,links,target` ..)
- domain: `PoliceReport`, see source for more (example: `?domain=PoliceReport` ..)
- types: `bacteria`, `address`, `event`, see source for more
- format: `graphson`, `gdf`, `gexf` (example: `?format=gdf` ..)
Note: `domain` cannot be used at the same time as `types`. If both are given, `types` takes priority and `domain` has no effect.
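As a small illustration of the parameters and the priority rule above, here is a minimal client-side sketch that assembles a request URL. The function name `buildExtractionUrl` and the option object shape are assumptions made for this example; only the parameter names come from this README. It drops `domain` whenever `types` is present, mirroring what the service is described as doing.

```javascript
// Hypothetical helper (not part of the service): builds a request URL
// from the optional parameters documented above.
function buildExtractionUrl(baseUrl, options = {}) {
  const url = new URL(baseUrl);
  const { locale, fields, domain, types, format } = options;

  if (locale) url.searchParams.set("locale", locale);
  if (fields && fields.length > 0) {
    url.searchParams.set("fields", fields.join(","));
  }

  // types and domain cannot be combined: types wins, domain is dropped.
  if (types && types.length > 0) {
    url.searchParams.set("types", types.join(","));
  } else if (domain) {
    url.searchParams.set("domain", domain);
  }

  if (format) url.searchParams.set("format", format);
  return url.toString();
}

const exampleUrl = buildExtractionUrl("http://localhost:3000", {
  locale: "en",
  types: ["animal", "virus"],
  domain: "PoliceReport", // ignored here because types is set
  format: "gdf",
});
console.log(exampleUrl);
```

Note that `URLSearchParams` percent-encodes the commas in list values (`%2C`); a standard query-string parser on the server side decodes them back.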
Domains and entity types
Current extraction model (you can change this by editing `engine.js`):
PoliceReport: 'email' 'phone' 'location' 'evidence' 'event' 'protagonist' 'position' 'weapon'
generic: 'protagonist'
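To make the relationship between domains and entity types concrete, the model can be pictured as a lookup table from domain name to type list. The sketch below is illustrative only: the `DOMAINS` object and the `resolveTypes` helper are hypothetical names, the actual structure inside `engine.js` may differ, and returning an empty list for an unknown domain is an assumption.

```javascript
// Illustrative view of the model listed above; not the real engine.js code.
const DOMAINS = {
  PoliceReport: [
    "email", "phone", "location", "evidence",
    "event", "protagonist", "position", "weapon",
  ],
  generic: ["protagonist"],
};

// Hypothetical helper: explicit types take priority over the domain.
// Assumption: an unknown or missing domain resolves to an empty list.
function resolveTypes({ types, domain } = {}) {
  if (Array.isArray(types) && types.length > 0) return types;
  const known = Object.prototype.hasOwnProperty.call(DOMAINS, domain);
  return known ? DOMAINS[domain] : [];
}

console.log(resolveTypes({ domain: "generic" }));
console.log(resolveTypes({ domain: "PoliceReport", types: ["weapon"] }));
```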
Usage
Examples use httpie with jq, but you can also use curl or something else.
The Content-Type header is optional, but it can help the app when detection by magic number runs into an encoding issue.
Example with curl
curl -X POST "http://localhost:3000?locale=en&types=animal&format=gdf" -d "THE HIPPO KILLS THE DOLPHIN"
curl -X POST "http://localhost:3000?locale=en&types=protagonist,weapon&format=gdf" -d "James bond buys an ak-47"
curl -X POST "http://localhost:3000" --data-binary "@tests/fixtures/police_en.txt"
curl -X POST "https://file2doc.mutation.one?locale=en&types=protagonist,virus" -d "James Bond has caught the terrorist carrying H5N1"
Example with httpie and jq
https POST "https://file2doc.mutation.one?locale=en&types=virus" body="the monkey died of ebola" | jq
https POST "https://file2doc.mutation.one" body="James Bond" | jq
https POST "https://file2doc.mutation.one?&fields=label,links,link,target&locale=en" body="James Bond" | jq
https POST "https://file2doc.mutation.one?locale=en" body="James Bond" | jq
https POST "https://file2doc.mutation.one?&fields=label,links,link,target" body="James Bond" | jq
https POST "https://file2doc.mutation.one?locale=en&types=protagonist,virus" body="James Bond has caught the terrorist carrying H5N1" | jq
Longer example
https POST "https://file2doc.mutation.one?fields=link,links,target,properties,ngram,begin,end,label,gender,number,firstname,lastname&locale=en" body="James Bond buys an AK-47"
output:
Medical example
https POST "https://file2doc.mutation.one?locale=en&types=virus" body="H5N1" | jq
GDF
curl -X POST "http://localhost:3000?locale=en&types=animal,virus&format=gdf" -d "the monkey has ebola"
nodedef>id VARCHAR,label VARCHAR
entity:animal__monkey,Monkey
entity:virus__ebolavirus,Ebolavirus
edgedef>id VARCHAR,source VARCHAR,target VARCHAR
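The GDF response is line-oriented: a `nodedef>` header followed by node rows, then an `edgedef>` header followed by edge rows. As a rough sketch of how a client might read it, the hypothetical `parseGdf` function below turns that text into plain objects. It assumes one record per line and no quoted commas inside values, which holds for the sample above but is not guaranteed for every response.

```javascript
// Minimal GDF reader (illustrative, not part of the service).
// Assumes one record per line and no quoted commas inside values.
function parseGdf(text) {
  const result = { nodes: [], edges: [] };
  let columns = [];
  let section = null; // array currently being filled (nodes or edges)

  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "") continue;

    const header = line.match(/^(nodedef|edgedef)>(.*)$/);
    if (header) {
      section = header[1] === "nodedef" ? result.nodes : result.edges;
      // "id VARCHAR" -> "id": keep the column name, drop the type.
      columns = header[2].split(",").map((col) => col.trim().split(/\s+/)[0]);
      continue;
    }
    if (section === null) continue; // rows before any header are ignored

    const values = line.split(",");
    const record = {};
    columns.forEach((name, i) => { record[name] = values[i]; });
    section.push(record);
  }
  return result;
}

const sampleGdf = [
  "nodedef>id VARCHAR,label VARCHAR",
  "entity:animal__monkey,Monkey",
  "entity:virus__ebolavirus,Ebolavirus",
  "edgedef>id VARCHAR,source VARCHAR,target VARCHAR",
].join("\n");

const graph = parseGdf(sampleGdf);
console.log(graph.nodes);
```

For real-world GDF files with quoted values, a proper CSV-aware row splitter would be needed in place of `line.split(",")`.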
Graphson
curl -X POST "http://localhost:3000?locale=en&types=animal,virus&format=graphson" -d "the monkey has ebola"
Deployment
To start the service locally: `npm run start`.
To deploy on Now: `npm run deploy`.
For the moment, the NPM_TOKEN key has to be added to the Dockerfile manually (WARNING: do not commit the key). This is due to a limitation on Now: the Docker ARG directive (build-time env variables) does not work there.