Friday, 29 May 2015

Data Scraping Services - Web Scraping Video Tutorial Collection for All Programming Language

Web scraping is a mechanism in which request made to website URL to get  HTML Document text and that text then parsed to extract data from the HTML codes.  Website scraping for data is a generalize approach and can be implemented in any programming language like PHP, Java, C#, Python and many other.

There are many Web scraping software available in market using which you can extract data with no coding knowledge. In many case the scraping doesn’t help due to custom crawling flow for data scraping and in that case you have to make your own web scraping application in one of the programming language you know. In this post I have collected scraping video tutorials for all programming language.

I mostly familiar with web scraping using PHP, C# and some other scraping tools and providing web scraping service.  If you have any scraping requirement send me your requirements and I will get back with sample data scrape and best price.

 Web Scraping Using PHP

You can do web scraping in PHP using CURL library and Simple HTML DOM parsing library.  PHP function file_get_content() can also be useful for making web request. One drawback of scraping using PHP is it can’t parse JavaScript so ajax based scraping can’t be possible using PHP.

Web Scraping Using C#

There are many library available in .Net for HTML parsing and data scraping. I have used Web Browser control and HTML Agility Pack for data extraction in .Net using C#

I have didn’t done web scraping in Java, PERL and Python. I had learned web scraping in node.js using Casper.JS and Phantom.JS library. But I thought below tutorial will be helpful for some one who are Java and Python based.

Web Scraping Using Jsoup in Java

Scraping Stock Data Using Python

Develop Web Crawler Using PERL

Web Scraping Using Node.Js

If you find any other good web scraping video tutorial then you can share the link in comment so other readesr get benefit form that.

Source: http://webdata-scraping.com/web-scraping-video-tutorial-collection-programming-language/

Tuesday, 26 May 2015

Web Scraping Services - Extracting Business Data You Need

Would you like to have someone collect, extract, find or scrap contact details, stats, list, extract data, or information from websites, online stores, directories, and more?

"Hi-Tech BPO Services offers 100% risk-free, quick, accurate and affordable web scraping, data scraping, screen scraping, data collection, data extraction, and website scraping services to worldwide organizations ranging from medium-sized business firms to Fortune 500 companies."

At Hi-Tech BPO Services we are helping global businesses build their own database, mailing list, generate leads, and get access to vast resources of unstructured data available on World Wide Web.

We scrape data from various sources such as websites, blogs, podcasts, and online directories; and convert them into structured formats such as excel, csv, access, text, My SQL using automated and manual scraping technologies. Through our web data scraping services, we crawl through websites and gather sales leads, competitor’s product details, new offers, pricing methodologies, and various other types of information from the web.

Our web scraping services scrape data such as name, email, phone number, address, country, state, city, product, and pricing details among others.

Areas of Expertise in Web Scraping:

•    Contact Details
•    Statistics data from websites
•    Classifieds
•    Real estate portals
•    Social networking sites
•    Government portals
•    Entertainment sites
•    Auction portals
•    Business directories
•    Job portals
•    Email ids and Profiles
•    URLs in an excel spreadsheet
•    Market place portals
•    Search engine and SEO
•    Accessories portals
•    News portals
•    Online shopping portals
•    Hotels and restaurant
•    Event portals
•    Lead generation

Industries we Serve:

Our web scraping services are suitable for industries including real estate, information technology, university, hospital, medicine, property, restaurant, hotels, banking, finance, insurance, media/entertainment, automobiles, marketing, human resources, manufacturing, healthcare, academics, travel, telecommunication and many more.

Why Hi-Tech BPO Services for Web Scraping?

•    Skilled and committed scraping experts
•    Accurate solutions
•    Highly cost-effective pricing strategies
•    Presence of satisfied clients worldwide
•    Using latest and effectual web scraping technologies
•    Ensures timely delivery
•    Round the clock customer support and technical assistance

Get Quick Cost and Time Estimate

Source: http://www.hitechbposervices.com/web-scraping.php

Monday, 25 May 2015

Which language is the most flexible for scraping websites?

3 down vote favorite

I'm new to programming. I know a little python and a little objective c, and I've been going through tutorials for each. Then it occurred to me, I need to know which language is more flexible (python, obj c, something else) for screen scraping a website for content.

What do I mean by "flexible"?

Well, ideally, I need something that will be easy to refactor and tweak for similar projects. I'm trying to avoid doing a lot of re-writing (well, re-coding) if I wanted to switch some of the variables in the program (i.e., the website to be scraped, the content to fetch, etc).

Anyways, if you could please give me your opinion, that would be great. Oh, and if you know any existing frameworks for the language you recommend, please share. (I know a little about Selenium and BeautifulSoup for python already).

4 Answers

I recently wrote a relatively complex web scraper to harvest a TON of data. It had to do some relatively complex parsing, I needed it to stuff it into a database, etc. I'm C# programmer now and formerly a Perl guy.

I wrote my original scraper using Python. I started on a Thursday and by Sunday morning I was harvesting over about a million scores from a show horse site. I used Python and SQLlite because they were fast.

HOWEVER, as I started putting together programs to regularly keep the data updated and to populate the SQL Server that would backend my MVC3 application, I kept hitting snags and gaps in my Python knowledge.

In the end, I completely rewrote the scraper/parser in C# using the HtmlAgilityPack and it works better than before (and just about as fast).

Because I KNEW THE LANGUAGE and the environment so much better I was able to add better database support, better logging, better error handling, etc. etc.

So... short answer.. Python was the fastest to market with a "good enough for now" solution, but the language I know best (C#) was the best long-term solution.

EDIT: I used BeautifulSoup for my original crawler written in Python.

5 down vote

The most flexible is the one that you're most familiar with.

Personally, I use Python for almost all of my utilities. For scraping, I find that its functionality specific to parsing and string manipulation requires little code, is fast and there are a ton of examples out there (strong community). Chances are that someone's already written whatever you're trying to do already, or there's at least something along the same lines that needs very little refactoring.

1 down vote

I think its safe to say that Python is a better place to start than Objective C. Honestly, just about any language meets the "flexible" requirement. All you need is well thought out configuration parameters. Also, a dynamic language like Python can go a long way in increasing flexibility, provided that you account for runtime type errors.

1 down vote

I recently wrote a very simple web-scraper; I chose Common Lisp as I'm learning the language.

On the basis of my experience - both of the language and the availability of help from experienced Lispers - I recommend investigating Common Lisp for your purpose.

There are excellent XML-parsing libraries available for CL, as well as libraries for parsing invalid HTML, which you'll need unless the sites you're parsing consist solely of valid XHTML.

Also, Common Lisp is a good language in which to implement DSLs; a DSL for web-scraping may be a solution to your requirement for flexibility & re-use.

Source: http://programmers.stackexchange.com/questions/74998/which-language-is-the-most-flexible-for-scraping-websites/75006#75006


Saturday, 23 May 2015

Scraping Data: Site-specific Extractors vs. Generic Extractors

Scraping is becoming a rather mundane job with every other organization getting its feet wet with it for their own data gathering needs. There have been enough number of crawlers built – some open-sourced and others internal to organizations for in-house utilities. Although crawling might seem like a simple technique at the onset, doing this at a large-scale is the real deal. You need to have a distributed stack set up to take care of handling huge volumes of data, to provide data in a low-latency model and also to deal with fail-overs. This still is achievable after crossing the initial tech barrier and via continuous optimizations. (P.S. Not under-estimating this part because it still needs a team of Engineers monitoring the stats and scratching their heads at times).

Social Media Scraping

Focused crawls on a predefined list of sites

However, you bump into a completely new land if your goal is to generate clean and usable data sets from these crawls i.e. “extract” data in a format that your DB can process and aid in generating insights. There are 2 ways of tackling this:

a. site-specific extractors which give desired results

b. generic extractors that result in few surprises

Assuming you still do focused crawls on a predefined list of sites, let’s go over specific scenarios when you have to pick between the two-

1. Mass-scale crawls; high-level meta data – Use generic extractors when you have a large-scale crawling requirement on a continuous basis. Large-scale would mean having to crawl sites in the range of hundreds of thousands. Since the web is a jungle and no two sites share the same template, it would be impossible to write an extractor for each. However, you have to settle in with just the document-level information from such crawls like the URL, meta keywords, blog or news titles, author, date and article content which is still enough information to be happy with if your requirement is analyzing sentiment of the data.

cb1c0_one-size

A generic extractor case

Generic extractors don’t yield accurate results and often mess up the datasets deeming it unusable. Reason being

programatically distinguishing relevant data from irrelevant datasets is a challenge. For example, how would the extractor know to skip pages that have a list of blogs and only extract the ones with the complete article. Or delineating article content from the title on a blog page is not easy either.

To summarize, below is what to expect of a generic extractor.

Pros-

•    minimal manual intervention
•    low on effort and time
•    can work on any scale

Cons-

•    Data quality compromised
•    inaccurate and incomplete datasets
•    lesser details suited only for high-level analyses
•    Suited for gathering- blogs, forums, news
•    Uses- Sentiment Analysis, Brand Monitoring, Competitor Analysis, Social Media Monitoring.

2. Low/Mid scale crawls; detailed datasets – If precise extraction is the mandate, there’s no going away from site-specific extractors. But realistically this is do-able only if your scope of work is limited i.e. few hundred sites or less. Using site-specific extractors, you could extract as many number of fields from any nook or corner of the web pages. Most of the times, most pages on a website share similar templates. If not, they can still be accommodated for using site-specific extractors.

cutlery

Designing extractor for each website

Pros-

•    High data quality
•    Better data coverage on the site

Cons-

High on effort and time

Site structures keep changing from time to time and maintaining these requires a lot of monitoring and manual intervention

Only for limited scale

Suited for gathering – any data from any domain on any site be it product specifications and price details, reviews, blogs, forums, directories, ticket inventories, etc.

Uses- Data Analytics for E-commerce, Business Intelligence, Market Research, Sentiment Analysis

Conclusion

Quite obviously you need both such extractors handy to take care of various use cases. The only way generic extractors can work for detailed datasets is if everyone employs standard data formats on the web (Read our post on standard data formats here). However, given the internet penetration to the masses and the variety of things folks like to do on the web, this is being overly futuristic.

So while site-specific extractors are going to be around for quite some time, the challenge now is to tweak the generic ones to work better. At PromptCloud, we have added ML components to make them smarter and they have been working well for us so far.

What have your challenges been? Do drop in your comments.

Source: https://www.promptcloud.com/blog/scraping-data-site-specific-extractors-vs-generic-extractors/

Wednesday, 20 May 2015

Social Media Crawling & Scraping services for Brand Monitoring

Crawling social media sites for extracting information is a fairly new concept – mainly due to the fact that most of the social media networking sites have cropped up in the last decade or so. But it’s equally (if not more) important to grab this ever-expanding User-Generated-Content (UGC) as this is the data that companies are interested in the most – such as product/service reviews, feedback, complaints, brand monitoring, brand analysis, competitor analysis, overall sentiment towards the brand, and so on.

Scraping social networking sites such as Twitter, Linkedin, Google Plus, Instagram etc. is not an easy task for in-house data acquisition departments of most companies as these sites have complex structures and also restrict the amount and frequency of the data that they let out to crawlers. This kind of a task is best left to an expert, such as PromptCloud’s Social Media Data Acquisition Service – which can take care of your end-to-end requirements and provide you with the desired data in a minimal turnaround time. Most of the popular social networking sites such as Twitter and Facebook let crawlers extract data only through their own API (Application Programming Interface), so as to control the amount of information about their users and their activities.

PromptCloud respects all these restrictions with respect to access to content and frequency of hitting their servers to make sure that user information is not compromised and their experience with the site is unhindered.

Social Media Scraping Experts

At PromptCloud, we have developed an expertise in crawling and scraping social media data in real-time. Such data can be from diverse sources such as – Twitter, Linkedin groups, blogs, news, reviews etc. Popular usage of this data is in brand monitoring, trend watching, sentiment/competitor analysis & customer service, among others.

Our low-latency component can extract data on the basis of specific keywords, categories, geographies, or a combination of these. We can also take care of complexities such as multiple languages as well as tweets and profiles of specific users (based on keywords or geographies). Sample XML data can be accessed through this link – demo.promptcloud.com.

Structured data is delivered via a single REST-based API and every time new content is published, the feed gets updated automatically. We also provide data in any other preferred formats (XML, CSV, XLS etc.).

If you have a social media data acquisition problem that you want to get solved, please do get in touch with us.

Source: https://www.promptcloud.com/social-media-networking-sites-crawling-service/

Sunday, 17 May 2015

Scraping Twitter Lists To Boost Social Outreach (+ Free Tool!)

I published a post a few weeks ago describing how to build your own twitter custom audience list, outlining a variety of techniques to build up your list.

This post outlines another method (hat tip to Ade Lewis for the idea) which requires you to scrape Twitter directly.

If you want to skip all the explanations and just want to download the Twitter List Scraper tool, here you go…

Download the Twitter Scraper Tool for Windows or Mac (completely free)

Disclaimer: Scraping Twitter is against their Terms of Service, so if you decide to do this you do it at your own risk.

Some Benchmarks

Building custom audiences on Twitter requires you to identify Twitter usernames that might be interested in your service or product.

In my previous posts, one of the methods I employed was to pull a competitor’s link profile and scrape social accounts from the linking domains.

Once you upload a custom list, Twitter goes through a process of ‘matching’ against profiles in their system, to make sure the user exists and hasn’t opted out of tailored ads.

As our data was scraped from a list of unqualified websites, the data matching wasn’t likely to be perfect.

Experiments

Since I published that post, I have been experimenting a fair bit with list building, and have built up around 10 custom audience lists. I‘ve uploaded a total of 48,857 Twitter usernames using this method, but only 29,260 were matched by Twitter (just less than 60% match rate).

From some other experiments where I have had better control over the input data, this match rate was between 70-80%.

Since we’ll be scraping Twitter directly, I expect our match rate to be much higher – 90%+

Finding Relevant Twitter Lists

So, we’re going to scrape Twitter, and the first step is to find Twitter lists that will contain users potentially interested in what we have to offer.

As an example, we’ll pretend we’re marketing a music website, and we’ve produced a survey we want to collect responses for.

An advanced Google query can give us lists of music bloggers: site:twitter.com inurl:lists inurl:members inurl:music “music blogger”

Source: http://urlprofiler.com/blog/scraping-twitter/

Wednesday, 13 May 2015

Web Scraping - Data Collection or Illegal Activity?

Web Scraping Defined

We've all heard the term "web scraping" but what is this thing and why should we really care about it?  Web scraping refers to an application that is programmed to simulate human web surfing by accessing websites on behalf of its "user" and collecting large amounts of data that would typically be difficult for the end user to access.  Web scrapers process the unstructured or semi-structured data pages of targeted websites and convert the data into a structured format.  Once the data is in a structured format, the user can extract or manipulate the data with ease.  Web scraping is very similar to web indexing (used by most search engines), but the end motivation is typically much different.  Whereas web indexing is used to help make search engines more efficient, web scraping is typically used for different reasons like change detection, market research, data monitoring, and in some cases, theft.

Why Web Scrape?

There are lots of reasons people (or companies) want to scrape websites, and there are tons of web scraping applications available today.  A quick Internet search will yield numerous web scraping tools written in just about any programming language you prefer.  In today's information-hungry environment, individuals and companies alike are willing to go to great lengths to gather information about all sorts of topics.  Imagine a company that would really like to gather some market research on one of their leading competitors...might they be tempted to invoke a web scraper that gathers all the information for them?  Or, what if someone wanted to find a vulnerable site that allowed otherwise not-so-free downloads?  Or, maybe a less than honest person might want to find a list of account numbers on a site that failed to properly secure them.  The list goes on and on.

I should mention that web scraping is not always a bad thing.  Some websites allow web scraping, but many do not.  It's important to know what a website allows and prohibits before you scrape it.

The Problem With Web Scraping

Web scraping rides a fine line between collecting information and stealing information.  Most websites have a copyright disclosure statement that legally protects their website information.  It's up to the reader/user/scraper to read these disclosure statements and follow along legally and ethically.  In fact, the F5.com website presents the following copyright disclosure:  "All content included on this site, such as text, graphics, logos, button icons, images, audio clips, and software, including the compilation thereof (meaning the collection, arrangement, and assembly), is the property of F5 Networks, Inc., or its content and software suppliers, except as may be stated otherwise, and is protected by U.S. and international copyright laws."  It goes on to say, "We reserve the right to make changes to our site and these disclaimers, terms, and conditions at any time."

So, scraper beware!  There have been many court cases where web scraping turned into felony offenses.  One case involved an online activist who scraped the MIT website and ultimately downloaded millions of academic articles.  This guy is now free on bond, but faces dozens of years in prison and $1 million if convicted.  Another case involves a real estate company who illegally scraped listings and photos from a competitor in an attempt to gain a lead in the market.  Then, there's the case of a regional software company that was convicted of illegally scraping a major database company's websites in order to gain a competitive edge.  The software company had to pay a $20 million fine and the guilty scraper is serving three years probation.  Finally, there's the case of a medical website that hosted sensitive patient information.  In this case, several patients had posted personal drug listings and other private information on closed forums located on the medical website.  The website was scraped by a media-rese
arch firm, and all this information was suddenly public.

While many illegal web scrapers have been caught by the authorities, many more have never been caught and still run loose on websites around the world.  As you can see, it's increasingly important to guard against this activity.  After all, the information on your website belongs to you, and you don't want anyone else taking it without your permission.

The Good News

As we've noted, web scraping is a real problem for many companies today.  The good news is that F5 has web scraping protection built into the Application Security Manager (ASM) of its BIG-IP product family.  As you can see in the screenshot below, the ASM provides web scraping protection against bots, session opening anomalies, session transaction anomalies, and IP address whitelisting.

The bot detection works with clients that accept cookies and process JavaScript.  It counts the client's page consumption speed and declares a client as a bot if a certain number of page changes happen within a given time interval.  The session opening anomaly spots web scrapers that do not accept cookies or process JavaScript.  It counts the number of sessions opened during a given time interval and declares the client as a scraper if the maximum threshold is exceeded.  The session transaction anomaly detects valid sessions that visit the site much more than other clients.  This defense is looking at a bigger picture and it blocks sessions that exceed a calculated baseline number that is derived from a current session table.  The IP address whitelist allows known friendly bots and crawlers (i.e. Google, Bing, Yahoo, Ask, etc), and this list can be populated as needed to fit the needs of your organization.

I won't go into all the details here because I'll have some future articles that dive into the details of how the ASM protects against these types of web scraping capabilities.  But, suffice it to say, ASM does a great job of protecting your website against the problem of web scraping.

I'm sure as you studied the screenshot above you also noticed lots of other protection capabilities the ASM provides...brute force attack prevention, customized attack signatures, Denial of Service protection, etc.  You might be wondering how it does all that stuff as well.  Give us a little feedback on the topics you would like to see, and we'll start posting some targeted tech tips for you!

Thanks for reading this introductory web scraping article...and, be sure to come back for the deeper look into how the ASM is configured to handle this problem. For more information, check out this video from Peter Silva where he discusses ASM botnet and web scraping defense.

Source: https://devcentral.f5.com/articles/web-scraping-data-collection-or-illegal-activity

Saturday, 2 May 2015

Lawyers & Attorneys Website Data Scraping Services

There are so many instances where one end’s up needing information from lawyers or bar associations. However, if you approach them directly or look for other ways to get information it might either be difficult or you might not get the information you are looking for. Thus, the best way to go about the scraping lawyer data.

Scraping lawyer data allow you to get information from various attorney websites, bar association websites, or other related websites. Using web scraping tools for getting such information makes it much easier to get all the relevant and important information without actually having to worry about the same.

If you wish to scrape data from lawyer, you are entitled to information such as lawyer name, firm names, address, contact details, history about the lawyers, educational qualifications, the bar association they are part of and much more.

Scraping lawyer data ensure that you also have images of the lawyer you are concentrating on. The result of scrape data form lawyer can be obtained in any format the user wants such as csv, excel, MySql etc. Scraping lawyer data also ensures that none of the information provided are repetitive or redundant.

If you are in need of information regarding any lawyer such as their contact details, address etc. it could end up being a huge and difficult task to get it manually or physically. Thus, taking off the help of scraping tools would ensure that you get all the needed information without actually having to bother about anything at all. The presence of lots of attorney websites and the fact that more and more lawyers are moving to the internet makes getting information easy with the help of some great tools. Scraping data is a very useful and handy method in which one can get all the required and relevant information and that too in a very easy to read format, which makes the method even worthier.

There are quite a few tools or services that you can take help of to get lawyers data scraped. Most of these services also provide with a sample demo and that free of cost. From the sample one can decide if they wish to continue with the services or try some other services. Thus, if you want any information from attorney websites or information about any lawyers, data scraping is a great way to get the same.

Source: https://3idatascraping.wordpress.com/2014/03/18/lawyers-attorneys-website-data-scraping-services/