Commit
Initial commit importing URL-Detector
0 parents, commit 11180bc
Showing 27 changed files with 4,046 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,93 @@ | ||
# Url Detector | ||
|
||
The URL detector is a library created by the LinkedIn Security Team to detect and extract URLs from long pieces of text. | ||
|
||
It can detect URLs in many forms, such as: | ||
|
||
* __HTML 5 Scheme__ - //www.linkedin.com | ||
* __Usernames__ - user:pass@linkedin.com | ||
* __Email__ - [email protected] | ||
* __IPv4 Address__ - 192.168.1.1/hello.html | ||
* __IPv4 Octets__ - 0x00.0x00.0x00.0x00 | ||
* __IPv4 Decimal__ - http://123123123123/ | ||
* __IPv6 Address__ - ftp://[::]/hello | ||
* __IPv4-mapped IPv6 Address__ - http://[fe30:4:3:0:192.3.2.1]/ | ||
|
||
_Note: For security purposes, it is better to overdetect URLs and check more candidates against blacklists than to miss a URL that was submitted. As a result, some of the strings we detect are not URLs but merely look like them. Also, instead of complying with RFC 3986 (http://www.ietf.org/rfc/rfc3986.txt), we detect based on browser behavior, optimizing for URLs that can be visited through the address bar of Chrome, Firefox, Internet Explorer, and Safari._ | ||
|
||
It can also identify the parts of each detected URL. For example, for the URL `http://user@linkedin.com:39000/hello?boo=ff#frag`: | ||
|
||
* Scheme - "http" | ||
* Username - "user" | ||
* Password - null | ||
* Host - "linkedin.com" | ||
* Port - 39000 | ||
* Path - "/hello" | ||
* Query - "?boo=ff" | ||
* Fragment - "#frag" | ||
|
||
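For comparison, the same breakdown can be sketched with the JDK's own `java.net.URI` class. This is not the library's API: the class name `UrlPartsSketch` and the method `parts` are hypothetical, and the example URL is assumed to carry the user name `user` and host `linkedin.com`, as the list above states.

```java
import java.net.URI;

// Hypothetical helper for illustration only; it uses java.net.URI,
// not the URL detector library described in this README.
class UrlPartsSketch {
    /** Returns scheme, user info, host, port, path, query, and fragment, in that order. */
    static String[] parts(String url) {
        URI uri = URI.create(url);
        return new String[] {
            uri.getScheme(),
            uri.getUserInfo(),
            uri.getHost(),
            String.valueOf(uri.getPort()),
            uri.getPath(),
            uri.getQuery(),
            uri.getFragment()
        };
    }

    public static void main(String[] args) {
        for (String part : parts("http://user@linkedin.com:39000/hello?boo=ff#frag")) {
            System.out.println(part);
        }
    }
}
```

Unlike the list above, `java.net.URI` returns the query and fragment without their leading `?` and `#`, and it reports the user name and password together as a single user-info string. It also follows the RFC rather than browser behavior, so it rejects many of the inputs this library is designed to catch.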
--- | ||
## How to Use: | ||
|
||
Using the URL detector library is simple: import the UrlDetector object and give it some options. In response, you will get a list of the URLs that were detected. | ||
|
||
For example, the following code will find the URL `linkedin.com`: | ||
|
||
```java | ||
|
||
UrlDetector parser = new UrlDetector("hello this is a url Linkedin.com", UrlDetectorOptions.Default); | ||
List<String> found = parser.detect(); | ||
|
||
``` | ||
|
||
### Quote Matching and HTML | ||
Depending on your input string, you may want to handle certain characters in a special way. For example, if you are | ||
parsing HTML, you probably want to break out of things like quotes and brackets. Suppose your input looks like | ||
|
||
> <a href="http://linkedin.com/abc">linkedin.com</a> | ||
You probably want to make sure that the quotes and brackets are not treated as part of the URL. For that reason, | ||
UrlDetectorOptions lets you change the sensitivity level of detection based on your expected input type. This way you can detect | ||
`linkedin.com` instead of `linkedin.com</a>`. | ||
|
||
In code this looks like: | ||
|
||
```java | ||
|
||
UrlDetector parser = new UrlDetector("<a href=\"linkedin.com/abc\">linkedin.com</a>", UrlDetectorOptions.HTML); | ||
List<String> found = parser.detect(); | ||
|
||
``` | ||
|
||
|
||
--- | ||
## About: | ||
|
||
This library was written by the security team at LinkedIn when other options did not exist. Some of the primary authors are: | ||
|
||
* Vlad Shlosberg (vshlos) | ||
* Tzu-Han Jan (tjan) | ||
* Yulia Astakhova (jastakho) | ||
|
||
--- | ||
## Third Party Dependencies | ||
|
||
#### TestNG | ||
* http://testng.org/ | ||
* Copyright © 2004-2014 Cédric Beust | ||
* License: Apache 2.0 | ||
|
||
#### Apache CommonsLang3: org.apache.commons:commons-lang3:3.1 | ||
* http://commons.apache.org/proper/commons-lang/ | ||
* Copyright © 2001-2014 The Apache Software Foundation | ||
* License: Apache 2.0 | ||
|
||
--- | ||
## License | ||
|
||
Copyright 2015 LinkedIn Corp. All rights reserved. | ||
|
||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 | ||
|
||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,38 @@ | ||
|
||
|
||
boolean isDefaultEnvironment() { | ||
!project.hasProperty('overrideBuildEnvironment') | ||
} | ||
|
||
File getEnvironmentScript() | ||
{ | ||
final File env = file(defaultEnvironment ? 'defaultEnvironment.gradle' : project.overrideBuildEnvironment) | ||
assert env.isFile() : "The environment script [$env] does not exist or is not a file." | ||
return env | ||
} | ||
|
||
apply from: environmentScript | ||
|
||
|
||
subprojects { | ||
plugins.withType(JavaPlugin) { | ||
|
||
dependencies { | ||
testCompile spec.external.testng | ||
compile spec.external.'commonsLang' | ||
} | ||
|
||
test { | ||
afterSuite { desc, result -> | ||
if (!desc.parent) { | ||
println ":${project.name} -- Executed ${result.testCount} tests: ${result.successfulTestCount} succeeded, ${result.failedTestCount} failed, ${result.skippedTestCount} skipped" | ||
} | ||
} | ||
} | ||
|
||
test.useTestNG() { | ||
excludeGroups 'integration' | ||
} | ||
} | ||
} | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
logger.lifecycle("Configuring using default environment (defaultEnvironment.gradle)...") | ||
|
||
project.ext.spec = [ | ||
"product": [], | ||
"external": [ | ||
"testng": "org.testng:testng:6.1.1", | ||
"commonsLang": "org.apache.commons:commons-lang3:3.1" | ||
] | ||
] | ||
|
||
subprojects { | ||
repositories { | ||
mavenCentral() | ||
} | ||
|
||
project.buildDir = new File(project.rootProject.buildDir, project.name) | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
#Version | ||
version=0.1.3 | ||
|
||
#long-running Gradle process speeds up local builds | ||
#to stop the daemon run 'ligradle --stop' | ||
org.gradle.daemon=true | ||
|
||
#configures only relevant projects to speed up the configuration of large projects | ||
#useful when specific project/task is invoked e.g: ligradle :cloud:cloud-api:build | ||
org.gradle.configureondemand=true | ||
|
||
#Gradle will run tasks from subprojects in parallel | ||
#Higher CPU usage, faster builds | ||
org.gradle.parallel=true | ||
|
||
#Allows generation of idea/eclipse metadata for a specific subproject and its upstream project dependencies | ||
ide.recursive=true |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
include "url-detector" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
apply plugin: 'java' | ||
apply plugin: 'idea' | ||
|
87 changes: 87 additions & 0 deletions
url-detector/src/main/java/com/linkedin/urls/detection/CharUtils.java
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,87 @@ | ||
package com.linkedin.urls.detection; | ||
|
||
import java.util.ArrayList; | ||
|
||
/** | ||
* Copyright 2015 LinkedIn Corp. All rights reserved. | ||
* Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance | ||
* with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, software distributed under the License is distributed | ||
* on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
* | ||
*/ | ||
public class CharUtils { | ||
/** | ||
* Checks if character is a valid hex character. | ||
*/ | ||
public static boolean isHex(char a) { | ||
return (a >= '0' && a <= '9') || | ||
(a >= 'a' && a <= 'f') || | ||
(a >= 'A' && a <= 'F'); | ||
} | ||
|
||
/** | ||
* Checks if character is a valid alphabetic character. | ||
*/ | ||
public static boolean isAlpha(char a) { | ||
return ((a >= 'a' && a <= 'z') || (a >= 'A' && a <= 'Z')); | ||
} | ||
|
||
/** | ||
* Checks if character is a valid numeric character. | ||
*/ | ||
public static boolean isNumeric(char a) { | ||
return a >= '0' && a <= '9'; | ||
} | ||
|
||
/** | ||
* Checks if character is a valid alphanumeric character. | ||
*/ | ||
public static boolean isAlphaNumeric(char a) { | ||
return isAlpha(a) || isNumeric(a); | ||
} | ||
|
||
/** | ||
* Checks if character is a valid unreserved character. This is defined by the RFC 3986 ABNF | ||
*/ | ||
public static boolean isUnreserved(char a) { | ||
return isAlphaNumeric(a) || a == '-' || a == '.' || a == '_' || a == '~'; | ||
} | ||
|
||
/** | ||
* Checks if character is a dot, including the Unicode full stops that IDN treats as label separators. See: | ||
* http://docs.oracle.com/javase/6/docs/api/java/net/IDN.html#toASCII%28java.lang.String,%20int%29 | ||
*/ | ||
public static boolean isDot(char a) { | ||
return (a == '.' || a == '\u3002' || a == '\uFF0E' || a == '\uFF61'); | ||
} | ||
|
||
/** | ||
* Splits a string without the use of a regex, which could split either by isDot() or %2e | ||
* @param input the input string that will be split by dot | ||
* @return an array of strings that is a partition of the original string split by dot | ||
*/ | ||
public static String[] splitByDot(String input) { | ||
ArrayList<String> splitList = new ArrayList<String>(); | ||
StringBuilder section = new StringBuilder(); | ||
InputTextReader reader = new InputTextReader(input); | ||
while (!reader.eof()) { | ||
char curr = reader.read(); | ||
if (isDot(curr)) { | ||
splitList.add(section.toString()); | ||
section.setLength(0); | ||
} else if (curr == '%' && reader.canReadChars(2) && reader.peek(2).equalsIgnoreCase("2e")) { | ||
reader.read(); | ||
reader.read(); //advance past the 2e | ||
splitList.add(section.toString()); | ||
section.setLength(0); | ||
} else { | ||
section.append(curr); | ||
} | ||
} | ||
splitList.add(section.toString()); | ||
return splitList.toArray(new String[splitList.size()]); | ||
|
||
} | ||
} |
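The `splitByDot` logic above depends on `InputTextReader`, which is not shown in this part of the diff. As a rough, self-contained sketch of the same idea, the version below replaces the reader with plain index arithmetic. The class name `SplitByDotSketch` is hypothetical, and it assumes the reader does nothing beyond sequential reads and look-ahead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone variant for illustration; not part of the commit.
class SplitByDotSketch {
    /** Same dot set as CharUtils.isDot: ASCII dot plus three Unicode full stops. */
    static boolean isDot(char c) {
        return c == '.' || c == '\u3002' || c == '\uFF0E' || c == '\uFF61';
    }

    /** Splits on any dot character or on a percent-encoded dot (%2e or %2E). */
    static String[] splitByDot(String input) {
        List<String> parts = new ArrayList<>();
        StringBuilder section = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            char curr = input.charAt(i);
            if (isDot(curr)) {
                parts.add(section.toString());
                section.setLength(0);
                i++;
            } else if (curr == '%' && input.regionMatches(true, i + 1, "2e", 0, 2)) {
                // Case-insensitive match on the two characters after '%'.
                parts.add(section.toString());
                section.setLength(0);
                i += 3; // skip '%', '2', and 'e'
            } else {
                section.append(curr);
                i++;
            }
        }
        // The final section is always added, so a trailing dot yields an empty last element.
        parts.add(section.toString());
        return parts.toArray(new String[0]);
    }
}
```

With this sketch, `splitByDot("www.linkedin%2Ecom")` yields the three sections `www`, `linkedin`, and `com`, which matches the intent of treating an encoded dot like a literal one.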