README.md

Add mainframe as a source to your data engineering strategy.
Among the motivations for this project, it is possible to highlight:

- Lack of expertise in the Cobol ecosystem, which makes it hard to integrate mainframes into data engineering strategies.
- Lack of support from the open-source community for initiatives in this field.
- The overwhelming majority (if not all) of tools to cope with this domain are proprietary.
- Several institutions struggle daily to maintain their legacy mainframes, which prevents them from evolving to more modern approaches to data management.
- Mainframe data can only take part in data science activities through very expensive investments.
## Features
- Supports primitive types (although some are "Cobol compiler specific").
- Supports REDEFINES, OCCURS and DEPENDING ON fields (e.g. unchecked unions and variable-size arrays).
- Supports nested structures and arrays.
- Supports Hadoop (HDFS, S3, ...) as well as the local file system.
- The COBOL copybooks parser doesn't have a Spark dependency and can be reused for integrating into other data processing engines.
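As an illustration of the REDEFINES and OCCURS ... DEPENDING ON support listed above, a copybook using those clauses could look like the hypothetical layout below (not taken from this repository): `AMOUNT-RAW` is an unchecked union over the same 9 bytes as `AMOUNT`, and `DETAILS` is a variable-size array whose length is given by `DETAIL-COUNT`.

```cobol
01  TRANSACTION-RECORD.
    05  AMOUNT            PIC 9(7)V99.
    05  AMOUNT-RAW REDEFINES AMOUNT.
        10  AMOUNT-TEXT   PIC X(9).
    05  DETAIL-COUNT      PIC 9(2).
    05  DETAILS OCCURS 0 TO 10 TIMES
            DEPENDING ON DETAIL-COUNT.
        10  DETAIL-NAME   PIC X(20).
```

Records described this way map naturally onto a nested Spark schema: the REDEFINES branches become sibling fields over the same bytes, and the OCCURS group becomes an array column.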
## Videos
Code coverage will be generated on path:
### Reading Cobol binary files from Hadoop/local and querying them

1. Create a Spark ```SQLContext```

2. Start a ```sqlContext.read``` operation specifying ```za.co.absa.cobrix.spark.cobol.source``` as the format

3. Inform the path to the copybook describing the files through ```... .option("copybook", "path_to_copybook_file")```.
   - By default the copybook is expected to be in the default Hadoop filesystem (HDFS, S3, etc.).
   - You can specify that a copybook is located in the local file system by adding the `file://` prefix.
   - For example, you can specify a local file like this: `.option("copybook", "file:///home/user/data/copybook.cpy")`.
   - Alternatively, instead of providing a path to a copybook file, you can provide the contents of the copybook itself by using `.option("copybook_contents", "...copybook contents...")`.
   - You can also store the copybook in the JAR itself, in the resources section; in that case use the `jar://` prefix, e.g. `.option("copybook", "jar:///copybooks/copybook.cpy")`.

4. Inform the path to the Hadoop directory containing the files: ```... .load("s3a://path_to_directory_containing_the_binary_files")```

5. Inform the query you would like to run on the Cobol Dataframe
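Putting the steps above together, a minimal read could look like the sketch below. This is an illustrative sketch only: the application name, copybook path and data directory are placeholder values, and it assumes the spark-cobol artifact is on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object CobolReadExample {
  def main(args: Array[String]): Unit = {
    // Step 1: create a Spark session (the SQLContext is available through it).
    val spark = SparkSession.builder()
      .appName("cobol-read-example") // placeholder name
      .master("local[2]")
      .getOrCreate()

    // Steps 2-4: read the binary files, using the copybook as the schema.
    val df = spark.read
      .format("za.co.absa.cobrix.spark.cobol.source")
      .option("copybook", "file:///home/user/data/copybook.cpy") // placeholder path
      .load("hdfs:///data/cobol_records")                        // placeholder directory

    // Step 5: query the resulting Cobol DataFrame.
    df.createOrReplaceTempView("records")
    spark.sql("SELECT COUNT(*) FROM records").show()
  }
}
```

Submitting this with `spark-submit --packages za.co.absa.cobrix:spark-cobol_2.12:<version>` (version elided here) is one way to bring the data source onto the classpath.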
    val spark = SparkSession
      .master("local[2]")
      .config("duration", 2)
      .config("copybook", "path_to_the_copybook")
      .config("path", "path_to_source_directory") // could be both, local or Hadoop (s3://, hdfs://, etc.)