The following D3 visualization helps evaluate the crawling process. The request and response headers are examined, and the resulting metrics are provided below.
File size diversity across all MIME types.
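As a rough illustration of how file size diversity per MIME type might be summarized before plotting, here is a minimal Python sketch. The function name `size_stats_by_mime`, the `(mime_type, size)` record shape, and the sample values are assumptions made for this example and are not taken from the dataset.

```python
from collections import defaultdict
from statistics import mean


def size_stats_by_mime(records):
    """Group file sizes by MIME type and summarize each group.

    `records` is an iterable of (mime_type, size_in_bytes) pairs.
    This record shape is an assumption for illustration only.
    """
    sizes = defaultdict(list)
    for mime_type, size in records:
        sizes[mime_type].append(size)
    # One summary per MIME type: how many files, and how widely sizes vary.
    return {
        mime: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
        }
        for mime, values in sizes.items()
    }


# Made-up sample records, not real crawl data.
sample = [("text/html", 1200), ("text/html", 800), ("image/png", 50000)]
stats = size_stats_by_mime(sample)
```

A summary of this shape could then be serialized to JSON and fed to a D3 chart.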
The following dendrogram represents the parser chain called by Apache Tika.
The following D3 visualization identifies all the languages in the polar dataset, as detected by Apache Tika.
The following D3 visualization identifies all the languages in the polar dataset, as detected by Optimaize.
The following lists sample NER entities and their mappings.
The following lists the maximal joint agreement among NER toolkits.
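One way to read "maximal joint" is as the largest set of entities that every toolkit extracted. Under that assumption, it can be sketched as a set intersection in Python. The function name, the toolkit labels (`toolkit_a`, `toolkit_b`, `toolkit_c`), and the entity strings below are hypothetical placeholders, not results from the dataset.

```python
def maximal_joint_agreement(entities_by_toolkit):
    """Return the entities reported by every toolkit.

    `entities_by_toolkit` maps a toolkit label to the collection of
    entity strings it extracted. Both the mapping shape and the
    exact-string matching are assumptions for illustration.
    """
    entity_sets = [set(entities) for entities in entities_by_toolkit.values()]
    if not entity_sets:
        return set()
    # Keep only the entities present in all toolkit outputs.
    return set.intersection(*entity_sets)


# Hypothetical toolkit outputs, made up for this example.
extracted = {
    "toolkit_a": ["Arctic", "NASA", "2015"],
    "toolkit_b": ["Arctic", "NASA"],
    "toolkit_c": ["Arctic", "Greenland"],
}
agreed = maximal_joint_agreement(extracted)
```

Exact string matching is the simplest choice; real toolkit output would likely need normalization (case, whitespace, entity type) before comparison.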