cpdoc/dhbb-nlp


Construction of the DHBB corpus.

run udpipe

Run udpipe on each raw file that has a corresponding JSON file in the ner folder:

for f in ner/*.json; do echo raw/$(basename $f .json).raw; done | xargs ~/work/udpipe-1.2.0/bin-osx/udpipe --outfile=udp/{}.conllu --tokenizer="normalized_spaces;ranges" --tag --parse ~/work/udpipe-1.2.0/models/portuguese-bosque-ud-2.5-191206.udpipe
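The one-liner above hands every derived raw/NAME.raw path to udpipe, whether or not that file exists. A minimal sketch of a more defensive listing step follows; the function name `list_raw_inputs` and its two optional directory arguments are hypothetical and not part of this repository.

```shell
# Hypothetical helper: print raw/NAME.raw for every ner/NAME.json,
# reporting (on stderr) any name that has no raw file instead of passing it on.
list_raw_inputs() {
    ner_dir=${1:-ner}
    raw_dir=${2:-raw}
    for f in "$ner_dir"/*.json; do
        [ -e "$f" ] || continue    # the glob matched nothing
        raw="$raw_dir/$(basename "$f" .json).raw"
        if [ -f "$raw" ]; then
            printf '%s\n' "$raw"
        else
            printf 'missing: %s\n' "$raw" >&2
        fi
    done
}
```

Its output can be piped into the same xargs and udpipe invocation shown above, in place of the `for ... echo` loop.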

run NLU

It is better to use ner/nlu.py.

for a sample of files in raw:

for f in raw/?.raw; do 
    curl -G --header "Content-Type: application/json" -u SECRET --data-urlencode "text@$f" "https://api.us-south.natural-language-understanding.watson.cloud.ibm.com/instances/a9eda6db-309d-4e9f-8454-0464bbbf7575/v1/analyze?version=2020-08-01&features=entities,relations&entities.model=073dab23-dd1e-4ded-badf-f502eb06372c&entities.mentions=true&return_analyzed_text=true" > ner/$(basename $f .raw).json;
done

for each document in wks:

for f in `cat wks/documents.json| jq -r '.[] | .name'` ; do
    IN=$(basename $f .txt);
    echo Processing $IN;
    curl -X POST -G --header "Content-Type: application/json" -u $KEY --data-urlencode "text@../raw/$IN.raw"  "$URL/v1/analyze?version=2020-08-01&features=entities,relations&entities.model=$MODEL&entities.mentions=true&return_analyzed_text=true" > $IN.json;
done
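Both loops above repeat the same long query string. As a sketch only, it could be composed in one place; the function name `nlu_analyze_url` is hypothetical, and the query parameters are copied from the commands above rather than taken from any other source.

```shell
# Hypothetical helper: build the /v1/analyze URL from a service URL and a model ID.
# The parameters mirror the curl commands in this README.
nlu_analyze_url() {
    base=$1
    model=$2
    printf '%s/v1/analyze?version=2020-08-01&features=entities,relations&entities.model=%s&entities.mentions=true&return_analyzed_text=true\n' "$base" "$model"
}
```

Inside the loop, the quoted URL argument to curl would then become `"$(nlu_analyze_url "$URL" "$MODEL")"`, with `$URL` and `$MODEL` set as in the loop above.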

ner/wks directory

To compare ner/*.json with the corresponding ner/wks/gt/*.json:

cat wks/documents.json| jq -r '.[] | "echo Processing " + .name + "\ntest-ner -c " + .name + ".json wks/gt/" + .id + ".json"' > temp.sh

To import new files into WKS:

jq '.[] | .name | "raw/" + rtrimstr(".txt") + ".raw" ' -r ner/wks/documents.json > a.list
wc -l raw/*.raw | awk '$1 < 7 && $1 > 2 {print $2}' > b.list
diff --new-line-format="" --unchanged-line-format="" <(sort b.list) <(sort a.list) | shuf | head -50 > n.list
for f in `cat n.list`; do cp $f tmp/`basename $f .raw`.txt; done
rm a.list b.list n.list 
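The diff call above is used as a set difference: it keeps the entries of b.list (raw files with 3 to 6 lines) that are not already in a.list (files known to WKS). The `--new-line-format` and `--unchanged-line-format` options are specific to GNU diff, so a sketch of the same step with `comm` may be more portable; the function name `list_difference` is hypothetical.

```shell
# Hypothetical helper: print lines that occur in the first file but not in the second.
# comm needs sorted input, so both lists are sorted into temporary files first.
list_difference() {
    sorted_a=$(mktemp) || return 1
    sorted_b=$(mktemp) || return 1
    sort "$1" > "$sorted_a"
    sort "$2" > "$sorted_b"
    # -2 drops lines unique to the second file, -3 drops lines common to both
    comm -23 "$sorted_a" "$sorted_b"
    rm -f "$sorted_a" "$sorted_b"
}
```

With it, the third command above could be written as `list_difference b.list a.list | shuf | head -50 > n.list`.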

udp-mini directory

for f in sents/*.sent; do echo $f; done | xargs ~/work/udpipe-1.2.0/bin-osx/udpipe --outfile=udp-mini/{}.conllu --tokenizer="normalized_spaces;presegmented" --tag --parse ~/work/udpipe-1.2.0/models/portuguese-bosque-ud-2.5-191206.udpipe
