Parsing HTML files that may not comply with HTML DOM

$10-30 USD

Closed

Posted

about 5 years ago

Paid on delivery

I have 1 TB of htm files, collected from the same 30 websites since 2011. When I try to write a parser for them (Python, NLP, BeautifulSoup, Decruft, etc.), large portions of known text do not appear. On review, I noticed that some pages put the actual page content into JSON strings that live within JavaScript elements, which makes parsing very cumbersome for me. One example of this behavior is the current [login to view URL] index page.

This project has one deliverable: a conversation with me to discuss modern technologies and techniques that can be used to parse these pages, store the results, and make them available for searching. After our conversation, a second project may be opened on Freelancer exclusively for the winning bidder to perform further work on parsing the HTML, based on the suggestions from our discussion on this project.
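To illustrate the problem described above, here is a minimal sketch of one possible way to recover content that sits in JSON inside `<script>` elements. It uses only the Python standard library (`html.parser`, which tolerates malformed markup, and `json`). The names `ScriptCollector` and `extract_json_blobs` are hypothetical, and the sketch assumes the payload appears as an unescaped JSON object literal in the script text; JSON that is escaped inside a JavaScript string, or built at runtime, would need further handling.

```python
import json
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the raw text of every <script> element in a page."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the last entry.
        if self._in_script:
            self.scripts[-1] += data


def extract_json_blobs(html):
    """Return every non-empty JSON object that decodes from script text."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    decoder = json.JSONDecoder()
    blobs = []
    for script in collector.scripts:
        pos = 0
        while pos < len(script):
            if script[pos] != "{":
                pos += 1
                continue
            try:
                # Try to decode a JSON value starting at this brace.
                value, end = decoder.raw_decode(script, pos)
            except json.JSONDecodeError:
                pos += 1  # Not JSON (e.g. a JavaScript block); keep scanning.
                continue
            if isinstance(value, dict) and value:
                blobs.append(value)
            pos = end
    return blobs


# Hypothetical page: the visible text lives in a script variable, not the DOM.
sample = (
    '<html><body><p>Shell only'
    '<script>window.__DATA__ = {"title": "Hello", "body": "Known text"};</script>'
    '</body>'
)
print(extract_json_blobs(sample))
```

Trying a decode at every opening brace is simple but can be slow on large scripts, so for a 1 TB archive it would likely need to be narrowed to known variable names or script types per site.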

HTML

JavaScript

Python

Software Architecture

Web Scraping

Project ID: 18626886

About the project

5 proposals

Remote project

Active 5 yrs ago

5 freelancers are bidding an average of $43 USD for this job

@zeke

This can be done by combining different techniques, including regular expressions. I have extensive experience parsing HTML files and am ready to start immediately. Please contact me with details if you are interested. Thank you, zeke.

$30 USD in 1 day

4.5

(103 reviews)

7.3

@ideepeners

Hi, I am a web scraping expert and have worked extensively on NLP. I think you need a combination of both to parse your data correctly. Let us discuss further details in chat. I will be glad to work with you. Thanks, Shubham Sharma

$100 USD in 1 day

4.7

(24 reviews)

4.6

@tomivs

I'm an expert Python developer and I've done web scraping, so I think I'm the best candidate for this work... Have you tried running the JavaScript from the HTML? That is a good solution, but it could be slower...

$30 USD in 1 day