July 27, 2022 11:41 pm GMT

Search text from PDF files stored in an S3 bucket

Does your application allow users to upload PDFs? Maybe they upload resumes, waivers, agreements or signed documents. What if they need to search the contents of these PDFs?

As a developer, you have 3 options:

  1. Search by Filename: Lookup by key/value like filename [Native]
  2. Search by Metadata: Store the metadata in a separate database to perform queries [Database add-on]
  3. Full-Text Search: Extract the contents into a search engine [OCR, Database, Search add-on]
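As a quick illustration of option 2, a metadata lookup can be as small as a table keyed by filename. This sketch uses an in-memory SQLite table; the table and column names are our own illustration, not from the original:

```python
import sqlite3

# In-memory metadata store; a real application would use a persistent database
conn = sqlite3.connect(':memory:')
conn.execute(
    "CREATE TABLE pdf_metadata (filename TEXT PRIMARY KEY, author TEXT, uploaded_at TEXT)"
)
conn.execute(
    "INSERT INTO pdf_metadata VALUES (?, ?, ?)",
    ('prescription.pdf', 'Dr. Smith', '2022-07-27')
)

# Query by metadata rather than by file contents
row = conn.execute(
    "SELECT filename FROM pdf_metadata WHERE author = ?", ('Dr. Smith',)
).fetchone()
print(row[0])  # → prescription.pdf
```

This covers exact-value queries well, but it can't answer "which PDFs mention this phrase?" — that's what the full-text approach below is for.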

Full-text search provides the most intuitive user experience, but it's also the most challenging to build, maintain, and enhance.


In this tutorial, we'll walk you through best practices for PDF file upload, content extraction via OCR (Optical Character Recognition), and searching, so you can add full-text PDF search to your application with ease.

Bonus: At the end you'll find a GitHub repository so you can import the code directly into your application.

Store the file

First, we download the file locally so we can run our OCR extraction logic on it:

```python
import boto3

# Credentials and region are placeholders; bucket_name and s3_file_name
# are assumed to be defined elsewhere in your application
s3_client = boto3.client(
    's3',
    aws_access_key_id='aws_access_key_id',
    aws_secret_access_key='aws_secret_access_key',
    region_name='region_name'
)

# Stream the object from S3 into a local file
with open(s3_file_name, 'wb') as file:
    s3_client.download_fileobj(
        bucket_name,
        s3_file_name,
        file
    )
```

Extract the contents

We'll use the open-source Apache Tika library, whose AutoDetectParser class performs OCR (optical character recognition):

```python
from tika import parser

# Tika auto-detects the file type and extracts the text content
parsed_pdf_content = parser.from_file(s3_file_name)['content']
```
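One caveat worth guarding against: Tika's parse result is a dictionary, and its `'content'` value can be `None` (for example, for image-only PDFs when no OCR backend is available). A small helper (the function name is ours, not from the original) normalizes the result before indexing:

```python
def normalize_content(parsed: dict) -> str:
    """Return a clean, indexable string from a Tika parse result.

    Tika returns a dict whose 'content' value may be None for
    image-only PDFs when OCR is unavailable.
    """
    content = parsed.get('content') or ''
    # Collapse the runs of whitespace and newlines Tika leaves between layout blocks
    return ' '.join(content.split())

print(normalize_content({'content': '  Hello\n\n  world \n'}))  # → Hello world
```

Indexing an empty string is harmless; indexing `None` would fail or pollute the index.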

Insert contents into a search engine

We're using a self-managed OpenSearch node here, but you could use Lucene, Solr, Elasticsearch, or Atlas Search.

Note: if you don't have OpenSearch installed locally, install it first, then run it:

```shell
brew update
brew install opensearch
opensearch
```

OpenSearch will now be accessible at http://localhost:9200. Let's build the index and insert the file contents:

```python
from opensearchpy import OpenSearch

# Connect to the local OpenSearch node
os = OpenSearch("http://localhost:9200/")
index_name = "pdf-search"

# Index the filename alongside the extracted text
doc = {
    "filename": s3_file_name,
    "parsed_pdf_content": parsed_pdf_content
}

response = os.index(
    index=index_name,
    body=doc,
    id=1,
    refresh=True
)
```
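The snippet above relies on OpenSearch's dynamic mapping, which works fine for this tutorial. If you want the fields explicitly typed — the content as analyzed full text, the filename as an exact-match keyword — a hedged sketch of an index body looks like this (the field names match the document above; the shard setting is an illustrative default, not a recommendation):

```python
# Explicit mapping so parsed_pdf_content is analyzed as full text
# and filename is matched exactly rather than tokenized
index_body = {
    "settings": {"index": {"number_of_shards": 1}},
    "mappings": {
        "properties": {
            "filename": {"type": "keyword"},
            "parsed_pdf_content": {"type": "text"}
        }
    }
}

# Against a live node, you would create the index before indexing documents:
# os.indices.create(index="pdf-search", body=index_body)
```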

Creating a PDF search API

We'll use Flask to create a microservice that searches for terms:

```python
from flask import Flask, jsonify, request
from opensearchpy import OpenSearch
from config import *  # provides index_name

app = Flask(__name__)
os = OpenSearch("http://localhost:9200/")

@app.route('/search', methods=['GET'])
def search_file():
    query = request.args.get('q', default=None, type=str)

    # Query payload for OpenSearch: full-text match on the extracted content
    payload = {
        'query': {
            'match': {
                'parsed_pdf_content': query
            }
        }
    }

    response = os.search(
        body=payload,
        index=index_name
    )
    return jsonify(response)

if __name__ == '__main__':
    app.run(host="localhost", port=5011, debug=True)
```

Now we can call the API via:

GET: http://localhost:5011/search?q=SEARCH_TERM

```json
{
  "_shards": {
    "failed": 0,
    "skipped": 0,
    "successful": 1,
    "total": 1
  },
  "hits": {
    "hits": [
      {
        "_id": "1",
        "_index": "pdf-search",
        "_score": 0.29289162,
        "_source": {
          "filename": "prescription.pdf",
          "parsed_pdf_content": "..."
        }
      }
    ],
    "max_score": 0.29289162,
    "total": {
      "relation": "eq",
      "value": 1
    }
  },
  "timed_out": false,
  "took": 40
}
```
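On the client side, the useful part of that response is usually just the hit list. A small helper (the function name is ours) flattens it into (filename, score) pairs:

```python
def extract_hits(response: dict) -> list:
    """Flatten an OpenSearch search response into (filename, score) pairs."""
    return [
        (hit['_source']['filename'], hit['_score'])
        for hit in response.get('hits', {}).get('hits', [])
    ]

# A trimmed-down response of the shape shown above
sample = {
    "hits": {
        "hits": [
            {"_id": "1", "_score": 0.29289162,
             "_source": {"filename": "prescription.pdf", "parsed_pdf_content": "..."}}
        ]
    }
}
print(extract_hits(sample))  # → [('prescription.pdf', 0.29289162)]
```

Using `.get()` with defaults means an empty or error response yields an empty list instead of a `KeyError`.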

Whoo, we did it! We've successfully created an API that offers full-text PDF search.


You can download the repo here: https://github.com/mixpeek/pdf-search-s3

So what's next?

  • Queuing: Ensuring concurrent file uploads are not dropped
  • Security: Adding end-to-end encryption to the data pipeline
  • Enhancements: Including more features like fuzzy matching, highlighting, and autocomplete
  • Rate Limiting: Building thresholds so users don't abuse the system

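For the "Enhancements" item, fuzzy matching and highlighting are both query-time options in OpenSearch, so they mostly mean expanding the search payload. A hedged sketch of what that expanded payload could look like (the parameter values are illustrative defaults, not tuned settings):

```python
query = "presciption"  # note the typo - fuzzy matching should still find "prescription"

payload = {
    "query": {
        "match": {
            "parsed_pdf_content": {
                "query": query,
                "fuzziness": "AUTO"  # tolerate small edit distances based on term length
            }
        }
    },
    "highlight": {
        # Return matched fragments of the content, wrapped in <em> tags by default
        "fields": {"parsed_pdf_content": {}}
    }
}
```

Autocomplete, by contrast, typically requires index-time changes (e.g., a dedicated completion or edge-ngram field), so it's more work than a payload tweak.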
Everything collapsed into just two API calls

If this feels like too much for you to build, maintain, and enhance, Mixpeek has you covered.

Upload

```python
import requests

url = "https://api.mixpeek.com/upload"

# Keep the file handle open for the duration of the upload request
with open('FILE_NAME.pdf', 'rb') as f:
    files = [('file', ('FILE_NAME.pdf', f, 'application/pdf'))]
    response = requests.request("POST", url, files=files)
```

Search

```python
import requests

url = "https://api.mixpeek.com/search?q=SEARCH_QUERY"
response = requests.request("GET", url)
print(response.text)
```

A corresponding Postman Collection is available for your convenience.

Request an API key for free, and review the docs to get started.


Original Link: https://dev.to/mixpeek/search-text-from-pdf-files-stored-in-an-s3-bucket-2084
