Link discovery from web scripts
Title: | Link discovery from web scripts |
---|---|
Patent Number: | 8,572,065 |
Publication Date: | October 29, 2013 |
Appl. No: | 11/937751 |
Application Filed: | November 09, 2007 |
Abstract: | A computer-implemented method, a computer system, and computer media for discovering links in scripts are provided. The computer system includes a crawler, a rules engine, and an index that are utilized to store links generated by scripts located in webpages in the index. The crawler traverses a network to locate webpages having scripts. The rules engine parses the located webpages and extracts the scripts based on rules that are satisfied by segments of the extracted scripts. The rules engine evaluates the segments of the extracted scripts to generate links. After the rules engine validates the links, the rules engine transmits the links to the index for storage. |
Inventors: | McDonald, Kieran Richard (Seattle, WA, US); Aaleti, Srinath Reddy (Redmond, WA, US); Qian, Richard J. (Sammamish, WA, US) |
Assignees: | Microsoft Corporation (Redmond, WA, US) |
Claim: | 1. A computer-implemented method for discovering links in a script, the method comprising: receiving webpages associated with one or more scripts, wherein the one or more scripts are Javascripts; processing the webpages to locate the one or more Javascripts, wherein processing the webpages to locate the Javascripts further comprises: extracting markup language tags from the webpages to locate function calls, variables, and constants; and identifying script elements and non-script elements based on the markup language tags; accessing rules corresponding to the one or more Javascripts; parsing the one or more Javascripts based on the rules corresponding to the one or more Javascripts, wherein the rules include base rules that are applied to all web pages, site rules that are applied to all webpages from a specific site, and auto-discovered rules that are applied to specific webpages; identifying segments of the one or more Javascripts that satisfy the rules; evaluating the identified segments of the one or more Javascripts and applying the rules to the extracted function calls, variables, and constants to generate links; storing the generated links in an index; and retrieving content associated with the generated links; optimizing the retrieved content; and storing the optimized content and metadata with the generated links, wherein the metadata comprises types of content associated with the generated links, types of files associated with the generated links, dialog attributes associated with the generated links, pop-up attributes associated with the generated links, and display sizes of the content associated with the generated links. |
Claim: | 2. The computer-implemented method of claim 1, wherein the links include uniform resource locators that reference webpages, videos, audio, or multimedia content. |
Claim: | 3. The computer-implemented method of claim 1, wherein the one or more scripts comprise external scripts that are stored at locations external of the webpages, inline scripts that are contained within the webpages, client-side scripts that are provided by an application executing on a user client device, and event handler scripts that respond to user interaction or web browser events. |
Claim: | 4. One or more computer-readable storage devices having computer-executable instructions embodied thereon that perform a method for generating an index that stores links discovered in scripts, the method comprising: crawling a network to locate webpages; storing in an index metadata corresponding to each located webpage; parsing the located webpages to identify scripts associated with the located webpages; retrieving rules that check the identified scripts for variables, functions, or events, wherein a subset of the variables for the identified scripts are checked to confirm a change of value; when a change of value for the subset of variables is confirmed, generating links based on the variables, functions, or events; evaluating the variables, functions, or events to verify the validity of the generated links; adding the generated links to the index when the generated links are verified; storing the generated links in the index; retrieving content associated with the generated links; optimizing the retrieved content; and storing the metadata with the generated links and the optimized content in the index, wherein the metadata comprises types of content associated with the generated links, types of files associated with the generated links, dialog attributes associated with the generated links, pop-up attributes associated with the generated links, and display sizes of the content associated with the generated links. |
Claim: | 5. The computer-readable storage devices of claim 4, wherein the link is a uniform resource locator. |
Claim: | 6. The computer-readable storage devices of claim 4, wherein the scripts are generated by external files, inline code, or event handlers. |
Claim: | 7. The computer-readable storage devices of claim 4, wherein the links are generated from data received from XML files processed by the identified scripts. |
Claim: | 8. The computer-readable storage devices of claim 7, wherein the identified scripts are Javascripts. |
Claim: | 9. The computer-readable storage devices of claim 4, wherein additional rules are generated for derived functions, in the identified scripts, that encapsulate a function associated with the retrieved rules. |
Claim: | 10. The computer-readable storage device of claim 9, wherein the derived functions generate links. |
Claim: | 11. The computer-readable storage devices of claim 9, wherein the additional rules have a scope that is limited to a subset of the located webpages or a segment of the identified scripts in the located webpages on the network. |
Claim: | 12. A computing system having processor and hardware memories for discovering links in a script of a webpage, the system comprising: a crawler accessing web pages to identify scripts associated with the webpages, wherein the webpages are HTML pages; a rules engine that parses the identified scripts and evaluates portions of the identified scripts based on rules that detect link-generating segments of the identified scripts, wherein the segments of the identified scripts are evaluated based on variables and expressions specified in the identified scripts and matching function patterns located in the segment and the rules to detect links generated by the identified scripts; an index to store the detected links and metadata for the detected links, wherein the metadata comprises types of content associated with detected links, types of files associated with the detected links, dialog attributes associated with the detected links, pop-up attributes associated with the detected links, and display sizes of the content associated with the detected links; and the processor that retrieves the content associated with the detected links; optimizes the retrieved content; and optimized content in the index with the detected links. |
Claim: | 13. The computing system of claim 12, wherein the rules engine verifies the detected links. |
Claim: | 14. The computing system of claim 12, wherein the identified scripts correspond to a multimedia player that utilizes client-side scripts associated with XML files, and the rules engine parses the XML files to generate data and performs string transformations on the generated data to extract links from the XML files. |
Claim: | 15. The computing system of claim 12, wherein the rules check functions, variables, and event handlers specified in the identified scripts and a subset of the rules are scoped to correspond to a subset of the webpages. |
Claim: | 16. The computer-implemented method of claim 1, wherein the base rules further comprise function rules, variable rules, and dynamic rules. |
Claim: | 17. The computer-readable storage devices of claim 4, wherein the index stores metadata with each valid generated link, the metadata for each valid link generated includes identifiers for segments of the script associated with the webpage that generated the link and rules that identified the segment of the script. |
Claim: | 18. The computer-readable storage devices of claim 4, one or more valid generated links are based on webpage global variables that are defined outside of the scripts. |
Claim: | 19. The computer-implemented method of claim 3, wherein the client-side script is associated with media players that play content for a webpage, wherein media files rendered by the media players are accessible based on the link generated by the client-side script. |
Current U.S. Class: | 707/708 |
Patent References Cited: | 6374260, April 2002, Hoffert et al.; 6847977, January 2005, Abajian; 6865593, March 2005, Reshef et al.; 7100109, August 2006, Chartier; 7143088, November 2006, Green; 7260564, August 2007, Lynn; 7536389, May 2009, Prabhakar et al.; 2002/0078136, June 2002, Brodsky; 2003/0046385, March 2003, Vincent; 2004/0030741, February 2004, Wolton et al.; 2004/0059809, March 2004, Benedikt; 2004/0098451, May 2004, Mayo; 2004/0143787, July 2004, Grancharov; 2004/0158617, August 2004, Shanny et al.; 2004/0168132, August 2004, Travieso et al.; 2006/0190561, August 2006, Conboy et al.; 2006/0230011, October 2006, Tuttle et al.; 2008/0235671, September 2008, Kellogg et al.; 2009/0125480, May 2009, Zhang et al. |
Other References: | Shreeraj Shah, “Crawling Ajax-driven Web 2.0 Applications,” www.infosecwriters.com/text_resources/pdf/Crawling_AJAX_SShah.pdf, No date, 9 pp. (cited by applicant); “Free Link Extractor - Extract All HTML Links,” www.selfseo.com/link_extractor.php, Sep. 19, 2007, 1 page (cited by applicant); “Web Application Scanners: Technical Challenges,” Application Security Software, Consulting, and Services Solutions from NT Objectives, Inc., www.ntoobjectives.com/know/techchallenges.php, Sep. 19, 2007, 3 pp. (cited by applicant) |
Assistant Examiner: | Mobin, Hasanul |
Primary Examiner: | Ehichioya, Fred I |
Attorney, Agent or Firm: | Shook, Hardy & Bacon L.L.P. |
Accession Number: | edspgr.08572065 |
Database: | USPTO Patent Grants |
Language: | English |
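
The pipeline recited in the abstract and claim 1 (locate scripts in a page, select base and site rules, find script segments that satisfy a rule, evaluate those segments to generate links, validate, and store in an index) can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the patent: `ScriptExtractor`, `BASE_RULES`, `SITE_RULES`, `evaluate`, `discover_links`, the regular expressions, and the sample page. The evaluator only understands string literals and simple `var` declarations joined with `+`, which is a deliberate simplification of whatever a real rules engine would do.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ScriptExtractor(HTMLParser):
    """Collects the text of inline <script> elements found in a page."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts[-1] += data


# Hypothetical rules: each regex captures, in group "arg", the expression
# that builds a link. Base rules apply to every page, site rules to one host.
BASE_RULES = [
    re.compile(r"window\.open\(\s*(?P<arg>[^,)]+)"),
    re.compile(r"location\.href\s*=\s*(?P<arg>[^;]+)"),
]
SITE_RULES = {
    "example.com": [re.compile(r"playVideo\(\s*(?P<arg>[^,)]+)")],
}

# Matches simple declarations such as: var base = "/media/";
VAR_DECL = re.compile(r"var\s+(\w+)\s*=\s*(['\"])(.*?)\2\s*;")


def evaluate(expr, variables):
    """Resolve a '+'-joined mix of string literals and known variables.

    Returns None when any term cannot be resolved. Literals containing '+'
    are not handled; this is a simplification for illustration.
    """
    parts = []
    for term in expr.split("+"):
        term = term.strip()
        if len(term) >= 2 and term[0] in "'\"" and term[-1] == term[0]:
            parts.append(term[1:-1])
        elif term in variables:
            parts.append(variables[term])
        else:
            return None
    return "".join(parts)


def discover_links(page_url, html):
    """Return an index mapping each validated link to simple metadata."""
    extractor = ScriptExtractor()
    extractor.feed(html)
    extractor.close()
    host = urlparse(page_url).hostname or ""
    rules = BASE_RULES + SITE_RULES.get(host, [])
    index = {}
    for script in extractor.scripts:
        variables = {m.group(1): m.group(3) for m in VAR_DECL.finditer(script)}
        for rule in rules:
            for match in rule.finditer(script):
                value = evaluate(match.group("arg"), variables)
                if value is None:
                    continue  # segment could not be evaluated; no link
                link = urljoin(page_url, value)
                # Validation step: keep only absolute http(s) links.
                if urlparse(link).scheme in ("http", "https"):
                    index[link] = {"source_page": page_url, "rule": rule.pattern}
    return index


SAMPLE_PAGE = """
<html><body>
<script>
var base = "/media/";
function show() { window.open(base + "clip.html", "popup"); }
function go() { location.href = "http://example.com/next" + unknownVar; }
playVideo("/video/" + "42.wmv");
</script>
</body></html>
"""

if __name__ == "__main__":
    for link, meta in discover_links("http://example.com/page.html", SAMPLE_PAGE).items():
        print(link, meta["rule"])
```

In this sketch the `location.href` segment is skipped because `unknownVar` cannot be resolved, and the `playVideo` rule fires only when the page host matches the site-rule key. Recording the matching rule alongside each link loosely mirrors the metadata described in claim 17; the content-type, pop-up, and display-size metadata recited in the independent claims are not modeled.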