Saturday, 28 December 2013

New handwriting recognition study compares usage and performance of OCR, ICR and manual data entry

Companies still collect much of their critical information via pen and paper, yet they ultimately need this information to be available in their digital systems. How much do companies rely on this handwritten data, and how do they then convert it to the digital data they ultimately need? This recent study by the Association for Information and Image Management  (AIIM) gives the answers:

Companies rely on handwritten data…

From AIIM

Of the companies surveyed for the study, 50% identified handwritten information as important to their business processes and a full 25% identified it as playing a key role for them. This data could be generated internally, through employee evaluations, culture surveys, site inspections, invoices, walk sheets, etc. It could also come from current or potential clients, in the form of newsletter signup sheets, registration forms, raffle tickets, satisfaction surveys, comment cards, mail-order forms, purchase orders, and even signed contracts.

…But they struggle to convert handwriting to digital data


While companies rely on handwriting to collect data, they need that data entered into their computerized systems quickly and accurately. How are they bridging the paper and digital worlds? The reality is that most companies live with a painful disconnect between their data collection methods and digital data needs. More than half of those surveyed enter the data by hand, while about a third rely on OCR, and another 12% use ICR (intelligent character recognition). Before Captricity, there really were no other options available.

Manual Entry, OCR & ICR are Woefully Inadequate:

Unfortunately, all three options – OCR, ICR, and manual entry – come with significant trade-offs in terms of flexibility, turnaround time, and/or quality.

    All of us at some point have dealt with manual data entry. It’s slow, often expensive, not always accurate, and can lead to significant lag times in getting your data. Many companies tell us they have backlogs of months or even years that their manual data entry staff just have not been able to deal with.

    OCR (Optical Character Recognition) converts images of text into a digital, machine-readable format. While it tends to work adequately for well-scanned and printed text, it is extremely inaccurate for handwriting, and yields only a “bag of text”, not structured data. In other words, if you start with a scanned, typed form, OCR will give you a .txt file, not a data set. And if you start with a form filled in by hand, OCR will give you very little useful data at all.

    ICR (Intelligent Character Recognition) was created to more accurately read handwriting. If you have ever filled out a driver’s license application or customs form, you’re already familiar with the highly-regulated ICR-ready forms, where boxes or “combs” (small vertical lines) separate each letter. While this system can read hand-printed text a bit better, it’s as limiting as a Scantron bubble sheet is to teachers who want to ask open-ended questions. ICR makes free-form text and short answers almost impossible. Furthermore, setting up ICR-compatible forms takes time and expertise, requiring significant up-front investment. For the vast majority of organizations that rely on handwritten data, this is not a practical solution. They are in a tough spot.
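To make the OCR "bag of text" limitation concrete, here is a minimal sketch of the extra parsing step needed to turn flat OCR output into structured data. The form layout, field names and patterns are invented for illustration; a real form would need patterns keyed to its own labels:

```python
import re

# Hypothetical raw OCR output from a scanned, typed order form.
ocr_text = """
Order Form
Name: Jane Smith
Quantity: 12
Total: $45.00
"""

def parse_form(text):
    """Recover structured fields from OCR's flat 'bag of text'."""
    fields = {}
    for label, pattern in [
        ("name", r"Name:\s*(.+)"),
        ("quantity", r"Quantity:\s*(\d+)"),
        ("total", r"Total:\s*\$([\d.]+)"),
    ]:
        match = re.search(pattern, text)
        fields[label] = match.group(1).strip() if match else None
    return fields

record = parse_form(ocr_text)
print(record)  # {'name': 'Jane Smith', 'quantity': '12', 'total': '45.00'}
```

Note that this only works at all because the input was typed and cleanly scanned; with handwritten input, the OCR text itself is usually too inaccurate for any such patterns to match.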

Enter Captricity for REAL handwriting recognition.

Our unique data capture technology was created specifically to turn any handwritten form, no matter the format, into digital data quickly and accurately. Multiple choice, numerical responses, Likert scales, short answers and long answers are no problem! There is minimal set-up and no software to install. Take your completed forms, scan or photograph them, and upload the images to Captricity. Our special mix of computer algorithms and human intelligence extracts data faster than manual re-keying and more accurately than OCR, with more flexibility than ICR.

Source:http://captricity.com/handwriting-recognition-study-on-ocr-icr-manual-data-entry/

Friday, 27 December 2013

Web Scraping - Data Collection or Illegal Activity?

Web Scraping Defined

We've all heard the term "web scraping" but what is this thing and why should we really care about it?  Web scraping refers to an application that is programmed to simulate human web surfing by accessing websites on behalf of its "user" and collecting large amounts of data that would typically be difficult for the end user to access.  Web scrapers process the unstructured or semi-structured data pages of targeted websites and convert the data into a structured format.  Once the data is in a structured format, the user can extract or manipulate the data with ease.  Web scraping is very similar to web indexing (used by most search engines), but the end motivation is typically much different.  Whereas web indexing is used to help make search engines more efficient, web scraping is typically used for different reasons like change detection, market research, data monitoring, and in some cases, theft.
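That unstructured-to-structured conversion is the heart of every scraper. As a toy illustration (the HTML fragment and field layout are invented for the example), the sketch below uses Python's standard-library HTML parser to pull listings out of a page fragment and into structured records:

```python
from html.parser import HTMLParser

class ListingParser(HTMLParser):
    """Collects the text of <li> elements from a page fragment."""
    def __init__(self):
        super().__init__()
        self.in_item = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_item = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_item = False

    def handle_data(self, data):
        if self.in_item and data.strip():
            self.items.append(data.strip())

# Semi-structured page markup (invented for the example).
page = "<ul><li>Widget - $9.99</li><li>Gadget - $19.99</li></ul>"

parser = ListingParser()
parser.feed(page)

# Convert each item into a structured (name, price) record.
records = [tuple(item.split(" - $")) for item in parser.items]
print(records)  # [('Widget', '9.99'), ('Gadget', '19.99')]
```

Once the data is in this structured form, extracting or manipulating it is trivial, which is exactly why scrapers are built.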

Why Web Scrape?

There are lots of reasons people (or companies) want to scrape websites, and there are tons of web scraping applications available today.  A quick Internet search will yield numerous web scraping tools written in just about any programming language you prefer.  In today's information-hungry environment, individuals and companies alike are willing to go to great lengths to gather information about all sorts of topics.  Imagine a company that would really like to gather some market research on one of their leading competitors...might they be tempted to invoke a web scraper that gathers all the information for them?  Or, what if someone wanted to find a vulnerable site that allowed otherwise not-so-free downloads?  Or, maybe a less than honest person might want to find a list of account numbers on a site that failed to properly secure them.  The list goes on and on.

I should mention that web scraping is not always a bad thing.  Some websites allow web scraping, but many do not.  It's important to know what a website allows and prohibits before you scrape it.

The Problem With Web Scraping

Web scraping rides a fine line between collecting information and stealing information.  Most websites have a copyright disclosure statement that legally protects their website information.  It's up to the reader/user/scraper to read these disclosure statements and follow along legally and ethically.  In fact, the F5.com website presents the following copyright disclosure:  "All content included on this site, such as text, graphics, logos, button icons, images, audio clips, and software, including the compilation thereof (meaning the collection, arrangement, and assembly), is the property of F5 Networks, Inc., or its content and software suppliers, except as may be stated otherwise, and is protected by U.S. and international copyright laws."  It goes on to say, "We reserve the right to make changes to our site and these disclaimers, terms, and conditions at any time."

So, scraper beware!  There have been many court cases where web scraping turned into felony offenses.  One case involved an online activist who scraped the MIT website and ultimately downloaded millions of academic articles.  He is now free on bond, but faces dozens of years in prison and up to $1 million in fines if convicted.  Another case involves a real estate company that illegally scraped listings and photos from a competitor in an attempt to gain a lead in the market.  Then, there's the case of a regional software company that was convicted of illegally scraping a major database company's websites in order to gain a competitive edge.  The software company had to pay a $20 million fine and the guilty scraper is serving three years' probation.  Finally, there's the case of a medical website that hosted sensitive patient information.  In this case, several patients had posted personal drug listings and other private information on closed forums located on the medical website.  The website was scraped by a media-research firm, and all this information was suddenly public.

While many illegal web scrapers have been caught by the authorities, many more have never been caught and still run loose on websites around the world.  As you can see, it's increasingly important to guard against this activity.  After all, the information on your website belongs to you, and you don't want anyone else taking it without your permission.

The Good News

As we've noted, web scraping is a real problem for many companies today.  The good news is that F5 has web scraping protection built into the Application Security Manager (ASM) of its BIG-IP product family.  As you can see in the screenshot below, the ASM provides web scraping protection against bots, session opening anomalies, session transaction anomalies, and IP address whitelisting.

Bot detection works with clients that accept cookies and process JavaScript.  It measures the client's page consumption speed and declares the client a bot if a certain number of page changes happen within a given time interval.  Session opening anomaly detection spots web scrapers that do not accept cookies or process JavaScript.  It counts the number of sessions opened during a given time interval and declares the client a scraper if the maximum threshold is exceeded.  Session transaction anomaly detection identifies valid sessions that visit the site much more than other clients.  This defense looks at a bigger picture and blocks sessions that exceed a calculated baseline number derived from the current session table.  The IP address whitelist allows known friendly bots and crawlers (i.e. Google, Bing, Yahoo, Ask, etc.), and this list can be populated as needed to fit the needs of your organization.
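At its core, the session-opening heuristic is simple rate counting over a sliding time window. Here is a rough, generic sketch of the idea (thresholds, window size and the class design are invented for illustration; this is not F5's actual implementation):

```python
from collections import deque
import time

class SessionRateDetector:
    """Flags a client as a likely scraper if it opens more than
    max_sessions new sessions within window_seconds."""
    def __init__(self, max_sessions=20, window_seconds=60):
        self.max_sessions = max_sessions
        self.window = window_seconds
        self.opens = {}  # client IP -> deque of session-open timestamps

    def record_open(self, ip, now=None):
        now = time.time() if now is None else now
        times = self.opens.setdefault(ip, deque())
        times.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while times and now - times[0] > self.window:
            times.popleft()
        return len(times) > self.max_sessions  # True -> likely scraper

detector = SessionRateDetector(max_sessions=3, window_seconds=10)
for t in range(5):
    flagged = detector.record_open("203.0.113.9", now=float(t))
print(flagged)  # True: 5 session opens in 10s exceeds the threshold of 3
```

A production device layers several such signals (cookie handling, JavaScript execution, per-session transaction baselines) rather than relying on a single counter like this.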

I won't go into all the details here because I'll have some future articles that dive into the details of how the ASM protects against these types of web scraping capabilities.  But, suffice it to say, ASM does a great job of protecting your website against the problem of web scraping.


I'm sure as you studied the screenshot above you also noticed lots of other protection capabilities the ASM provides...brute force attack prevention, customized attack signatures, Denial of Service protection, etc.  You might be wondering how it does all that stuff as well.  Give us a little feedback on the topics you would like to see, and we'll start posting some targeted tech tips for you!

Thanks for reading this introductory web scraping article...and, be sure to come back for the deeper look into how the ASM is configured to handle this problem. For more information, check out this video from Peter Silva where he discusses ASM botnet and web scraping defense.

Source:https://devcentral.f5.com/articles/web-scraping-data-collection-or-illegal-activity#.Ur5Qg849BIA

Tips For Easier Product Uploads In Magento

Uploading products into Magento can be a very time consuming task, especially when you need to upload several hundred or even several thousand products.  Fortunately there are some shortcuts that you can take to make this a quicker process.  This guide will provide a high level overview as to the most effective methods for achieving quicker bulk uploads into Magento.  We also suggest you refer to the Magento user guide for detailed tutorials on performing some of these tasks.

Utilize Magento’s Bulk Product Import Feature:

In addition to giving you the ability to add products individually, Magento also provides bulk import capabilities.  By utilizing Magento’s bulk import feature a user can import a large number of products from a single CSV file.  While the formatting requirements for bulk product uploads into Magento are very specific, you can easily get an understanding of how your products must be formatted by exporting existing products out of Magento into a CSV file.  It’s important to note that since some of your products may be configurable or have multiple variations, the CSV formatting requirements for these products will differ from those without variations.
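Once you have exported a sample product to see the expected layout, it is easy to generate the import file programmatically. A minimal sketch with Python's standard csv module follows; the column names here are illustrative only, so always copy the exact headers from your own Magento export:

```python
import csv, io

# Illustrative headers -- copy the real ones from a Magento product export.
headers = ["sku", "name", "price", "qty", "description"]

products = [
    {"sku": "TSHIRT-RED-M", "name": "Red T-Shirt (M)", "price": "14.99",
     "qty": "100", "description": "100% cotton tee."},
    {"sku": "TSHIRT-BLU-L", "name": "Blue T-Shirt (L)", "price": "14.99",
     "qty": "80", "description": "100% cotton tee."},
]

# Build the CSV in memory; write to a file for a real import.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=headers)
writer.writeheader()
writer.writerows(products)

print(buffer.getvalue().splitlines()[0])  # sku,name,price,qty,description
```

The same approach works for the update case mentioned later: include only the SKU column plus the columns you want to change.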

Upload Images In Bulk:

Instead of individually uploading images at the product level, you can upload images in bulk.  Bulk image uploads in Magento are always performed via FTP.  Additionally, images must be appropriately labeled according to the SKU and/or SKU variation of the product they are associated with.

Add Categories and Attributes In Bulk:

While this does not directly correlate to uploading individual products, taking this step will make the entire import process far more efficient.  Preparing all of your attributes and categories for a bulk import before uploading any products will allow you to easily associate products with categories and attributes that you will have already created.

Limit File Size For Bulk Imports:

By keeping the number of products you upload to Magento at one time to a reasonable number, you will prevent any potential server timeouts or delays in successfully uploading all of your product data.  Dividing a CSV file with 500+ products into two or three files and running separate uploads will require a little more time up front but will invariably prevent future headaches.
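That splitting step can itself be scripted rather than done by hand in a spreadsheet. A minimal sketch (the chunk size is arbitrary; pick whatever your server handles comfortably) that divides a products CSV into smaller files, repeating the header row in each:

```python
import csv

def split_csv(path, rows_per_file=250):
    """Write path.partN.csv files, each containing the header row
    plus at most rows_per_file data rows."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)

    out_paths = []
    for i in range(0, len(rows), rows_per_file):
        out_path = f"{path}.part{i // rows_per_file + 1}.csv"
        with open(out_path, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(header)
            writer.writerows(rows[i:i + rows_per_file])
        out_paths.append(out_path)
    return out_paths
```

Each part file is then a complete, valid import on its own, so the uploads can be run one at a time.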

Updating Existing Products:

You can also utilize Magento’s bulk import feature to update existing products on your storefront.  To make bulk updates to existing products via a CSV file you only need to import the SKU field as well as the field(s) which you would like to update.  Once successfully uploaded into Magento your changes will take effect on your storefront.

If you’re looking for a faster and easier way to upload and manage both product data and images in Magento, ClaraStream provides a web based application that will integrate directly with your Magento storefront, allowing you to quickly and easily upload new products or make updates to existing products, and avoid tedious data formatting in spreadsheets and hours of manual data entry.  Take a quick TOUR now to learn how ClaraStream can help you save time uploading and managing your product data.

Source:https://www.clarastream.com/2013/06/tips-for-easier-product-uploads-in-magento/

Writing eCommerce Product Descriptions That Sell, Sell, Sell...

Even in the days of massive retail sites with thousands of products that are often "bulk uploaded" from a database, product descriptions are still a critical factor in deciding whether a visitor to an ecommerce store buys or not. Working together, the description and photos should give the website visitor all the same information, and the same sense of desire, that they'd get by viewing the product in a physical store. If they're left in any doubt about exactly what the features of the product are, or how it will benefit them, they'll move on without hesitation.

Writing product descriptions, along with having great product photos, is therefore a vital tool which the store owner can use to take control of their sales.

Writing product descriptions is an art, but once mastered it can provide SEO benefits as well as compelling visitors to click on the 'buy' button. A best practice includes doing A/B or multivariate testing of different product descriptions to increase their effectiveness. For example, the above test from Talbot recovery tested only text changes on this signup page. Their testing group Fathom recorded a 184% improvement with the copy on the version to the right (with more bullet points) at a 99% confidence level. (Test results supplied by Which Test Won.)

The Challenge Of Writing Effective Product Descriptions

Product descriptions are tough to write well, because in a short space of typically 60-80 words they need to:

    Persuasively describe the benefits of the product and what problem it solves

    Describe any important features which aren't clear from the product photos

    Use SEO keyphrases to make the page rank more highly in search engines

    Differentiate the product from similar ones in a way that encourages purchase

    Perhaps explain why the product should be purchased from that website versus others

Faced with such a challenge, website owners might be tempted to use the standard description provided by the manufacturer, or copy text from a competitor's website. But this could lead to Google penalizing the page as it would contain duplicate content, and it misses a big opportunity to give the ecommerce site a unique voice which builds the brand and keeps visitors coming back.

There are plenty of professional copywriters who specialize in writing product descriptions for ecommerce, to whom the job can be outsourced. Yet many online store owners will take the view that no-one knows the product or market as well as they do, in which case there are a few things to consider when writing product descriptions that sell.

Establishing The 'Voice' Of The Product

To set the tone when writing product descriptions, knowing the audience is half the battle - Moms in their 40s will respond to a different style than teenage boys do. But the tone of voice is important too. For example, Moms in their 40s might be the target market for a fashionable handbag or a game for their child - but those products wouldn't be written about in the same way.

The identity of the brand should also be considered. For example, the J. Peterman Company gives products in their men's and women's ranges a different voice, but the brand's tone is so strong it'd be instantly recognizable even out of context.

Structuring the description

It can be a good idea to separate out information which may not be emotionally captivating, but still important to know, such as product dimensions, so it can be easily browsed without getting in the way of the main product description.  This approach follows the typical buying cycle or funnel through which each buyer moves as they build their interest in a product, which typically results in a desire for more detailed information as the buyer approaches the purchasing stage. The British electrical retailer Comet does this well, by having a separate 'technical specifications' panel. This allows them to concentrate on writing product descriptions that emphasize the benefits, knowing that the nitty-gritty is all in place.

The structure of the main description should be kept in mind too - opening with an attention-grabbing question or statement, moving on to describing how it can fit into the customer's life, and ending with a strong call-to-action. A call-to-action is the customer's reason to take action by clicking the 'buy' button right now: this could include 'free shipping this week only' or 'enter this code for 20% off your purchase'.

Keeping this structure in mind also helps to keep the inspiration flowing when writing product descriptions for tens or hundreds of items.

Writing Product Descriptions That Turn Features Into Benefits

It's often said that people don't buy a drill, they buy a hole in a wall. This means that people buy products to solve a problem, so writing product descriptions is all about showing how the features of the product will benefit the buyer.

That means that it's of no real interest in itself that a shaving foam contains extracts of Aloe Vera (feature), but it becomes relevant when mentioning that it means it won't irritate your skin like other products might (benefit).

The same feature might offer a different benefit depending on the target audience. For example, a 100% cotton t-shirt might have the benefits of being:

    1. Easy to wash (for mothers)

    2. Lightweight to battle the summer heat (for women in their 20s planning a vacation)

    3. Environmentally friendly because it's made of natural rather than man-made fibers (for an audience which is concerned about environmental impact)

Econsultancy has some great examples of product descriptions which effectively sell the benefits and give the reader a vision of how the product will fit into their lifestyle.

Writing Product Descriptions With SEO In Mind

A page of original content about a product is a boon for getting a page indexed in search engines. While writing product descriptions is primarily an exercise in appealing to the potential customer, a few simple considerations will make sure the SEO potential is maximized too:

    Include a headline which uses the targeted SEO keyphrase, but also grabs the reader - just using the name of the product is a missed opportunity

    Use keywords selectively in the description. So if the keyphrase is 'men's cutthroat razor', it's a missed opportunity to call it a 'shaving device' in the description

    Make use of image captions. Rather than just the product name, this is another chance to include a keyword-rich sentence which also appeals to the customer

    Include the keyword in the title and description meta-tags in the source code of the web page

    Include the keyword in alt tags of any images, in title tags associated with links out from the description (if links to other sections are used) and also in the anchor text of any links pointing to the page

    Assign high level headline tags like H1, H2 or H3 to headlines and subheads containing the keywords

    Use the keywords in the file (URL) names associated with the page (as part or all of the page URL, depending on the naming structure associated with the site's shopping cart)

    Consider using keywords in tags associated with the page

While this article doesn't focus on keyword research, it is a wise idea to use search terms which fit multiple parameters including:

    1. Describing the product in the same way the target audience does when looking for the product (often gained from the site's analytics program and a keyword research tool)

    2. Looking for search volume in a keyword research tool to ensure there is sufficient search volume for these terms

    3. Assessing the level of competition for the term (either through pay per click estimators, analyzing the top search results and looking at factors like competitor page rank, number of items in the index, or competitor traffic using audit tools such as Compete.com)

How To Constantly Improve Product Description Writing

While using the above approach as a starting point, there will come a time when the more diligent eCommerce marketer will subject their product descriptions to some type of testing. Typically this means using a web page optimization program (or a pay per click campaign with alternate landing pages) that can test the page against an alternate. While there are many tools for this (Google Website Optimizer is an example of a fully featured tool that is available free), the important point is to subject descriptions to the same rigor of testing as other elements of the page, such as "buy buttons" or offers. And while this type of process may seem to yield small improvements, if done across a large number of pages, with high traffic patterns or over a long period of time, the cumulative results can be quite profitable.
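The "confidence level" reported by testing tools comes from a standard two-proportion significance test, which is straightforward to reproduce. A rough sketch of the arithmetic, with conversion numbers invented purely for illustration:

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Z-score for the difference between two conversion rates,
    using the pooled standard error."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Invented example: version A converts 50/1000 visitors, version B 80/1000.
z = two_proportion_z(50, 1000, 80, 1000)
print(round(z, 2))  # 2.72 -- a |z| above ~2.58 corresponds to ~99% confidence
```

Whether you compute it yourself or trust the tool, the point is the same: do not roll out a new description until the difference clears a sensible confidence threshold.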

Source:http://www.ultracart.com/resources/articles/writing-ecommerce-product-descriptions/

Thursday, 26 December 2013

Data cleaning services and ways to retrieve clean data

Data cleansing, or data scrubbing, is the act of detecting and correcting (or removing) corrupt or inaccurate records from a dataset, table or database. Many companies provide data cleansing services for business, sales and marketing databases. A data cleaning company helps keep your datasets accurate and error free.

After cleaning removes all inconsistencies, a dataset is consistent with other, similar datasets in the system. Data validation is the process of detecting and removing typographical errors. Techniques such as data transformation, statistical analysis, parsing for syntax errors and duplicate elimination are used for cleaning. Clean data needs to meet the criteria listed below:

Accuracy: an aggregate value over the criteria of integrity, consistency and density.

Completeness: missing data must be corrected or supplied.

Density: the proportion of missing values to the total number of values must be known.

Consistency: concerns contradictions and anomalies within the data.

Uniformity: concerns irregularities in units and formats.

Integrity: a combined value of the completeness and validity criteria.

Uniqueness: related to the number of duplicates in the data.

Data cleaning services offered by companies include:

Removal of duplicate records.

Tagging and identification of records.

Removal of duplicate or bogus entries.

Data verification.

Deletion of old records.

Removal of opt-out contacts and third-party suppression lists.

Data cleansing, aggregation and organization.

Identification of incomplete or inaccurate records.

Enhancement of records, including product specifications, orders and images.

Identification of records that appear to be duplicated.

Common problems of data cleaning applications:

Sometimes cleaning causes a loss of information. Invalid and duplicate entries are removed, but often the information available for a given entry is limited, and removing those entries also removes information. Data cleansing is also expensive and time consuming, so it is important to manage it effectively.

Fortunately, the benefits outweigh the costs and challenges.

Most companies these days depend on the existence and quality of their data for business continuity. This data is mainly customer information: customer profiles, details of different products, addresses, key contacts, market research and so on. It is collected from various databases, phone directories and technical documents. Since these sources use different formats or styles, the collected data can be very clumsy and sometimes incomprehensible, and we cannot control the way data is stored in the source databases.

So, the best solution for organizing our data is to implement a process called data cleaning. There are various software tools available on the market that can help clean the data.

It is a very important process for any business whose activities depend on the quality of its data; neglecting it will, in turn, lead to losses.

Ways to clean and retrieve data:

1) When importing data, make sure that a common format is applied everywhere the data is stored; this will ensure consistency.

2) Use dictionary software or MS Word to check frequently for spelling mistakes or grammatical errors. If this must be done manually, it can be very time consuming for the amount of information described above.

3) When copying data from an external source, always paste it into Notepad first so that all formatting is stripped out.
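The first step above (a common format) plus the deduplication discussed earlier can be sketched in a few lines of Python. The record shape and normalization rules here are invented for illustration; real cleaning rules depend on your data:

```python
def clean_records(records):
    """Apply a common format to each record, then drop duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        # Common format: trimmed, title-cased names; lowercase emails.
        norm = (rec["name"].strip().title(), rec["email"].strip().lower())
        if norm not in seen:  # uniqueness criterion: no duplicates
            seen.add(norm)
            cleaned.append({"name": norm[0], "email": norm[1]})
    return cleaned

raw = [
    {"name": "jane smith ", "email": "Jane@Example.com"},
    {"name": "JANE SMITH", "email": "jane@example.com "},  # duplicate
    {"name": "Bob Jones", "email": "bob@example.com"},
]
print(clean_records(raw))  # two records survive; the duplicate is dropped
```

Note the order of operations: normalizing first is what lets the duplicate check catch entries that differ only in spacing or capitalization.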

Source:http://www.tampabaycleaning.com/172-data-cleaning-service-to-clean-ways-to-retrieve-2

The 5 minute guide to scraping data from PDFs

Every data journalist knows the feeling: you’re working on a massive project, you’ve finally found the data… but it is in PDF format.

Last month I had a crime reporter from Cape Town in one of my data journalism training sessions, who had managed to get around 60 PDF pages worth of stats out of the relevant authorities. She explored and analyzed them by hand, which took days. That set me thinking. The problem can’t be all that uncommon, and there must be a good few data journalists out there who could use a quick guide to scraping spreadsheets from PDFs.

The ideal of course is not getting your data in PDF form in the first place. It all comes from the same database, and it shouldn’t be any effort for the people concerned to save the same data in an Excel spreadsheet. The unfortunate truth however is that a lot of officials aren’t willing to do that out of fear that you’ll tinker with their data.

There are some web services like cometdocs or pdftoexcelonline that could help you out. Or you could try to build a scraper yourself, but then you have to read Paul Bradshaw's Scraping for Journalists first.

Tabula

My favourite tool though is Tabula. Tabula describes itself as “a tool for liberating data tables trapped inside PDF files”. It’s fairly easy to use too. All you have to do is import your PDF, select your data, push a button and there is your spreadsheet! You save the scraped page in CSV and from there you can import it into any spreadsheet program.

One small problem is that Tabula only scrapes one PDF page at a time. So 10 PDF pages worth of data gives you 10 spreadsheets.
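Those per-page CSVs are easy to stitch back together. A minimal sketch using Python's standard csv module (file names are assumed; this also assumes every page shares the same header row):

```python
import csv

def merge_csvs(paths, out_path):
    """Concatenate CSV files, keeping only the first file's header row."""
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for i, path in enumerate(paths):
            with open(path, newline="") as f:
                reader = csv.reader(f)
                header = next(reader)  # skip repeated headers
                if i == 0:
                    writer.writerow(header)
                writer.writerows(reader)
```

With something like this on hand, ten Tabula exports become one spreadsheet in a single step.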

Installing Tabula is a piece of cake: download, unzip and run. Tabula is written in Java (so you should have Java installed) and uses Ruby for scraping, which is one of the languages used on Scraperwiki to build tailor-made PDF scrapers.

Source:http://memeburn.com/2013/11/the-5-minute-guide-to-scraping-data-from-pdfs/

Tuesday, 17 December 2013

Website data scraping is not an easy task

Website data scraping is not an easy task, and it takes tremendous time when it comes to analysis and restructuring of the data. It is for these reasons that you should visit us, as we make this process look simple. We have a team of skilled and experienced data scrapers who will make the results of your project future-proof and flexible enough to fit as many situations as you may think of or be finding solutions for.

Indeed, our website data scraping experts are knowledgeable, and they will use their experienced hands to deliver the best data to you within a short duration. In the web data scraping process, the input source is a web resource, and the most common output formats are XLS, CSV, XML, text and Word files. Website Data Scraping excels at scraping data from HTML, XML, text, Word files, images, reports, PDF files and more.

As the world moves faster, businesses place a higher value on time, and the value of manual work is dropping rapidly day by day. Imagine how many days it would take to scrape millions of records manually - possibly years. Since the world is changing extremely fast, we have to keep up with the times and their necessities.

Website Data Scraping presents itself as one of the world's most preferred and reliable data scraping service providers. Website Data Scraping is equipped with the latest tools, techniques, technology and experienced manpower. We upgrade our tools and technology at regular intervals, as clients' necessities change, to deliver tremendous quality to our worldwide clients.

We are capable of dealing with complex web scraping requirements and delivering world-class quality ahead of the expected time. Our outstanding quality, turnaround time and previous clients' feedback allow us to take pride in being a consistent, high-quality web scraping service provider. Quality, turnaround time and price matter a lot to any client, and we try to fulfill all of these needs.

We always treat every client as a priority customer, whether we are getting only $10 of business from them. Website Data Scraping never compromises on quality or delivery time, and for these reasons you can try us for your web data scraping requirements.

Web Data Scraping

Can you imagine getting thousands, lakhs or even millions of web-based records in a usable format in only 2-10 days? Yes, it's now possible with Website Data Scraping. Get thousands of web-based records scraped in only a few days and reuse that data for various purposes.

Business Directory Scraping

Online business directories are the best sources for exploring the contact details of a required service provider. We can help you build your own niche business directory, or support an email marketing campaign by collecting validated email IDs. Don't hesitate to contact us with a business directory link in order to start working.

Web Research and Data Collection

Website Data Scraping has an experienced team for internet searching, web research and data collection to satisfy our clients' requirements and earn some profit for the organization. Our primary aim is to satisfy our customers' needs at the lowest price.

- Business directory scraping: Yellow Pages, Yell, Yelp, Scoot, Manta, lawyers, b2bindex, etc.
- Report mining, document data scraping, PDF and scanned image scraping.
- Metadata scraping, web crawling, text corpora, weather data mining, stock data scraping.
- Job wrapping, resume scraping, student email ID scraping, school and university data scraping.
- Web research, web data mash-ups, internet searching and data collection.
- Product scraping, image scraping, online price comparison and feed aggregation.
- Data scraping from LinkedIn, Twitter, Facebook and other social networking sites.
- Product scraping from eBay, Amazon, e-commerce and online shopping websites.

Source: http://www.bharatbhasha.net/finance-and-business.php/404654

Monday, 16 December 2013

Web Scraping a JavaScript Heavy Website: Keeping Things Simple

One of the most common difficulties with web scraping is pulling information from sites that do a lot of rendering on the client side. When faced with scraping a site like this, many programmers reach for very heavy-handed solutions like headless browsers or frameworks like Selenium. Fortunately, there's usually a much simpler way to get the information you need.

But before we dive into that, let's first take a step back and talk about how browsers work so we know where we're headed. When you navigate to a site that does a lot of rendering in the browser -- like Twitter or Forecast.io -- what really happens?

First, your browser makes a single request for an HTML document. That document contains enough information to bootstrap the loading of the rest of the page. It loads some basic markup, potentially some inline CSS and Javascript, and probably a few <script> and <link> elements that point to other resources that the browser must then download in order to finish rendering the page.

Before the days of heavy JavaScript usage, the original HTML document contained all the content on the page. Any external calls to load CSS or JavaScript were merely to enhance the presentation or behavior of the page, not change the actual content.

But on sites that rely on the client to do most of the page rendering, the original HTML document is essentially a blank slate, waiting to be filled in asynchronously. In the words of Jeremy Edberg -- first paid employee at Reddit and currently a Reliability Architect at Netflix -- when the page first loads, you often "get a rectangle with a lot of divs, and API calls are made to fill out all the divs."

To see exactly what this "rectangle with a lot of divs" looks like, try navigating to sites like Twitter or Forecast.io with Javascript turned off in your browser. This will prevent any client-side rendering from happening and allow you to see what the original page looks like before content is added asynchronously.

Once you've seen the content that comes with the original HTML document, you'll start to realize how much of the content is actually being pulled in asynchronously. But rather than wait for the page to load... and then for some Javascript to load... and then for some data to come back from the asynchronous Javascript requests, why not just skip to the final step?

If you examine the network traffic in your browser as the page is loading, you should be able to see what endpoints the page is hitting to load the data. Flip over to the XHR filter inside the "Network" tab in the Chrome web inspector. These are essentially undocumented API endpoints that the web page is using to pull data. You can use them too!

The endpoints are probably returning JSON-encoded information so that the client-side rendering code can parse it and add it to the DOM. This means it's usually straightforward to call those endpoints directly from your application and parse the response. Now you have the data you need without having to execute Javascript or wait for the page to render or any of that nonsense. Just go right to the source of the data!
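To make that concrete, here's a minimal sketch of hitting one of those JSON endpoints directly with only Python's standard library. The URL, the headers, and the `{"items": [{"name": ...}]}` payload shape are all illustrative assumptions -- substitute whatever you actually see in the "Network" tab.

```python
import json
from urllib.request import Request, urlopen

def parse_items(payload):
    """Pull the fields we care about out of a JSON payload.
    Assumes a hypothetical {"items": [{"name": ...}, ...]} shape."""
    data = json.loads(payload)
    return [item["name"] for item in data.get("items", [])]

def fetch_items(url):
    """Call the undocumented endpoint directly, skipping the browser.
    Some endpoints check headers, so send a browser-like User-Agent
    and the usual XHR marker."""
    req = Request(url, headers={
        "User-Agent": "Mozilla/5.0",
        "X-Requested-With": "XMLHttpRequest",
    })
    with urlopen(req) as resp:
        return parse_items(resp.read().decode("utf-8"))
```

No DOM, no JavaScript engine: one HTTP request and one `json.loads` call gets you the same data the page itself renders from.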

Let's take a look at how we might do this on Twitter's homepage. When a logged-in user navigates to twitter.com, Tweets are added to a user's timeline with calls to this endpoint. Pull that up in your browser and you'll see a JSON object that contains a big blob of HTML that's injected into the page. Make a call to this endpoint and then parse your info from the response, rather than waiting for the entire page to load.
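Since the Twitter response described above wraps a blob of rendered HTML inside a JSON object, you still need one small parsing step. Here's a sketch of unwrapping such a response with only the standard library -- the `items_html` key and the markup shape are illustrative assumptions, not Twitter's actual contract.

```python
import json
from html.parser import HTMLParser

class TweetTextExtractor(HTMLParser):
    """Collects the text content of every <p> element in an HTML blob."""
    def __init__(self):
        super().__init__()
        self._in_p = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self.texts[-1] += data

def tweets_from_response(body):
    # Assume the JSON wraps the rendered markup under "items_html".
    blob = json.loads(body)["items_html"]
    parser = TweetTextExtractor()
    parser.feed(blob)
    return parser.texts
```

A tiny `HTMLParser` subclass like this is enough for a quick scrape and saves you from pulling in a full parsing library.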

It's a similar situation when we look at Forecast.io. The HTML document that's returned from the server provides the skeleton for the page, but all of the forecast information is loaded asynchronously. If you pull up your web inspector, refresh the page and then look for the XHR requests in the "Network" tab, you'll see a call to this endpoint that pulls in all the forecast data for your location.


Now you don't need to load the entire page and wait for the DOM to be ready in order to scrape the information you're looking for. You can go directly to the source to make your application much faster and save yourself a bunch of hassle.

Wanna learn more? I've written a book on web scraping that tons of people have already downloaded. Check it out!

Source: http://tubes.io/blog/2013/08/28/web-scraping-javascript-heavy-website-keeping-things-simple/

Web Screen Scrape With a Software Program

Which software do you use for data mining? How much time does it take to mine the required data, and can it present the results in a customized format? Extracting data from the web is a tedious job if done manually, but the moment you use an application or program, the web screen scrape job becomes easy.

Using an application would certainly make data mining an easy affair, but the problem is which application to choose. The availability of so many software programs makes it difficult to pick one, but you have to select a program, because you can't keep mining data manually. Start your search for a data mining software program by determining your needs. First, note the time a program takes to complete a project.

Quick scraping

The software shouldn't take much time, and if it does, there's no point investing in it. A program that is slow at data mining only saves your labor, not your time. Keep this factor in mind, because you can't keep waiting for hours for the software to deliver your data. Another reason to choose a quick program is that a fast scraping tool gives you the latest data.

Presentation

Extracted data should be presented in a readable format that you can use hassle-free. For instance, the web screen scrape program should be able to deliver data in a spreadsheet, a database file, or any other format the user desires. Data that's difficult to read is good for nothing; presentation matters most. If you aren't able to understand the data, how could you use it in the future?
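As a concrete example of the spreadsheet-friendly output described above, here's a minimal sketch that dumps scraped records to a CSV file using Python's standard library. The field names are illustrative.

```python
import csv

def write_records(records, path):
    """Write a list of dicts (one per scraped record) to a CSV file
    that opens directly in any spreadsheet program. Extra keys in a
    record are silently dropped via extrasaction="ignore"."""
    fields = ["name", "email", "phone"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(records)
```

A file like this imports cleanly into Excel, LibreOffice, or a database loader, which is exactly the kind of readable output the section above is asking for.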

Coded program

Invest in a web screen scrape program coded for your project, not for everyone. It should be dedicated to you rather than made for the general public. There are groups that provide custom-coded programs for data mining. They charge a fee for the programming, but the job they do is worth it. Look for a reliable group and get a software program that makes your data mining job a lot easier.

Whether you are looking for the contact details of your target audience or you want to keep a close watch on social media, you need a web screen scrape service that saves you time and labor. If you're using a software program for data mining, make sure the program works according to your wishes.

Source: http://goarticles.com/article/Web-Screen-Scrape-With-a-Software-Program/7763109/

Custom Book Scanning: Worth it for those books you just can’t find in e-format

I know lots of e-book readers who virtually stop reading paper books once they discover e-books. I’m one of them. We recently sold/gave away more than half of our paper book collection. Basically we kept the hardcover books that look good on the shelf.

But that leaves the question of how to replace beloved favorites. Some can be repurchased as e-books, and I’ve certainly done my share of that. However, as you know, publishers haven’t released their entire backlist, which can leave you stuck. Of course, you could scan them yourself, and I know people who do that. But that’s way more work than I want to tackle.

When Mark Burger of Custom Book Scanning contacted me and offered me a free trial of his service, I was interested.

I’ve had this one book on my shelf for years that I’ve been wanting to have digitized. I almost did it myself, but it was just too much of a hassle. Every so often, I check the publisher to see if it’s been released as an e-version. So far, no luck. So, I sent him the book.

Options

There are many options to select when sending a book for scanning, and I was fortunate to be able to try them all. So, I chose destructive scanning (which provides the best scanning/OCR results), EPUB with a clickable table of contents, an audiobook version, and delivery of the JPEGs of the scans (in case I needed to make changes).

I’m happy to say that it went well. The book looks good. Yes, there are some remaining OCR errors and a few odd line breaks, but the book is perfectly readable, and of better quality than other releases by this publisher. The audiobook uses Ivona (which means you should choose the female voice and get the amazing “Amy” voice). I did run the EPUB through Calibre to change the paragraph style from block to indented, but that’s just a personal preference.


Pricing

Pricing is based on the length of the book, starting at $9.95 for a book of 100 pages or fewer, plus $3 for each additional 100 pages. EPUB or Kindle formatting adds at least $10, or more if you want a clickable table of contents. My book would have cost me $35.95 (not including audio or .jpeg files).

Yes, that’s expensive. If you can buy the book from Amazon, it’s a much better deal. However, if, like me, you have a few beloved books you can’t find in e-version and really want them, I think it’s worth it. Especially since Burger has created the code “THIRTYOFF” for TeleRead readers! If you’re not good at doing math in your head, that would have brought the price for my book down to $25.16.
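For anyone who'd rather let code do the math, here's a quick sketch of the pricing rules as described above. The function names are mine, and the page-count tiers are as I've stated them in this post.

```python
import math

def scan_price(pages, epub=True):
    """Base scanning price: $9.95 for the first 100 pages, plus $3 for
    each additional 100 pages (or part thereof), plus at least $10 for
    EPUB/Kindle formatting."""
    extra_blocks = max(0, math.ceil(pages / 100) - 1)
    price = 9.95 + 3 * extra_blocks
    if epub:
        price += 10.00
    return round(price, 2)

def with_thirtyoff(price):
    """Apply the 30%-off "THIRTYOFF" code for TeleRead readers."""
    return price * 0.7
```

Running `with_thirtyoff(35.95)` works out to the discounted price quoted above, once rounded to cents.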

The Competition

How does it compare to 1DollarScan? The biggest difference is that Custom Book Scanning gives you an EPUB or Mobi file, not just an image .pdf like 1DollarScan. PDFs are a pain, and I wouldn’t want to scan a fiction book with 1DollarScan. Especially when you run a price comparison.

I did a check of what my book would have cost from 1DollarScan. I did select some of the extra options, like OCR (which only makes the book searchable–it doesn’t give you a text file you can modify) and high-quality scan. My book would have cost $21 from 1DollarScan. It’s worth the extra few dollars to have a reflowable document where you can change font size.

Legality

Now, here’s the big question: Is it legal? That’s a good one. While it’s definitely legal to scan and OCR books yourself, it’s questionable whether a service like this is strictly legal. I asked the owner that question, and here was his response:

    “The legality of book scanning and fair use has been a topic in the media recently. Custom Book Scanning respects the works of authors and publishers and takes every measure to prevent piracy. At the same time, we also support the rights of book owners to be able to read their paper books digitally or through audiobook for personal use. Aside from the convenience of having your books on an e-reader, we receive many responses from people who aren’t able to enjoy a traditional book because of being visually impaired and find the text to speech feature on an e-reader or an audiobook to allow them to enjoy those titles as well.”

I think that’s a fair answer. I didn’t worry too much about sending in my book. I don’t see it as that much different from doing the work myself. Except for the part about not having to spend the hours doing it.

Do you want to scan every book in your library? Probably not. Is it worth it for those special books you can’t find in an e-version? Definitely.

Source: http://www.teleread.com/ebooks/custom-book-scanning-service-worth-it-for-those-books-you-just-cant-find-in-e-version/

Two Book Scanning Services Lose Copyright Lawsuit in Japan

I have some bad news today for ebook lovers in Japan. The Tokyo District Court has handed down a ruling that a couple of paid book scanning services (like 1DollarScan here in the US) violated Japanese copyright law.

Presiding Judge Shigeru Osuga has ordered 2 companies to pay a total of $14,000 in fines and stop scanning books. The companies, Sundream Co. and Doraibareggi Japan, have vowed to appeal the ruling and maintain that their actions were legal under Japanese law.

Today’s decision is yet one more step in a years-long court battle. This lawsuit was filed by seven writers, including Jiro Asada, Keigo Higashino, and Kenshi Hirokane, but it is just one of a number of lawsuits that have been filed since 2011.

Seven major publishing companies and 122 writers publicly demanded in September 2011 that a whole host of book scanning companies, more than 100 in total, stop scanning books. This service is known in Japan as jisui, which loosely translates as “cooking for oneself”, and it was rapidly growing into a popular option for readers who wanted ebooks that weren’t available in the Japanese ebook market.

The companies typically charge a few hundred yen for each book and use commercial book scanning equipment.


There are 4 other similar lawsuits going on at the moment in the Tokyo courts, and this ruling does not bode well for the jisui companies involved.

I have to say that I am puzzled by this decision. I had thought that the arguments made by the jisui companies made a lot of sense. They said they were working under the direction of their customers, and since the owner of a book has a legal right in Japan to scan that book the jisui service was covered under that right.

Unfortunately, the judge disagreed, saying in part that: “It is difficult for general readers to set up the equipment necessary to digitize their books. We do not view the situation as one in which the operators were carrying out reproduction under the management of their customers.”

I’m not sure what the technical challenges of scanning a book have to do with a book owner exercising their right to scan it. It’s like saying that you can’t work on someone’s car because the average person lacks the experience and technical skills of a mechanic.

I would also dispute the idea that book scanning is beyond the abilities of the average reader. It is probably more work than they are interested in performing but that doesn’t mean they cannot do it.

Source: http://www.the-digital-reader.com/2013/10/04/book-scanning-services-now-illegal-japan/#.Uq7gcSdvbFw