
@complexlogic
Contributor

complexlogic commented Sep 19, 2023

Since the other PR for Matroska (#967) hasn't had any progress in several years, I decided to give it a try myself. I think that the Matroska container format has a lot of potential for audio and would like to see it supported in TagLib.

Matroska is different from other metadata formats in that it uses nested/hierarchical elements to store data. I designed an API which internally parses the nested EBML structure and generates a flat list of tag objects to present to the user. This allows Matroska files to be parsed in a similar manner to the other formats in TagLib.

The user can add tags to and remove tags from the list, and manipulate the tags' attributes (for example, the target level). When the file is written, the nested EBML structure will be re-rendered from the flat list of tags. None of the EBML-related code is part of the public API.
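The flattening described above could be sketched roughly as follows. This is only an illustration and not the code from the PR: the type names NestedTag, NestedSimpleTag and FlatTag and the functions flatten() and regroup() are hypothetical, and the sketch assumes that the target type value is the only attribute that distinguishes one Tag element from another.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for the nested on-disk layout: each Tag element
// has one target level and any number of SimpleTag children.
struct NestedSimpleTag {
  std::string name;
  std::string value;
};

struct NestedTag {
  int targetTypeValue;  // 0 stands for "no target type set"
  std::vector<NestedSimpleTag> simpleTags;
};

// Flat, user-facing view: every entry carries its own target level.
struct FlatTag {
  std::string name;
  std::string value;
  int targetTypeValue;
};

// Walk the nested structure once and emit one flat entry per SimpleTag.
std::vector<FlatTag> flatten(const std::vector<NestedTag> &nested) {
  std::vector<FlatTag> flat;
  for (const auto &tag : nested)
    for (const auto &simple : tag.simpleTags)
      flat.push_back({simple.name, simple.value, tag.targetTypeValue});
  return flat;
}

// Rebuild the nesting for writing: entries that share a target level are
// grouped under one Tag element, in order of first appearance.
std::vector<NestedTag> regroup(const std::vector<FlatTag> &flat) {
  std::vector<NestedTag> nested;
  for (const auto &entry : flat) {
    NestedTag *group = nullptr;
    for (auto &candidate : nested)
      if (candidate.targetTypeValue == entry.targetTypeValue)
        group = &candidate;
    if (!group) {
      nested.push_back({entry.targetTypeValue, {}});
      group = &nested.back();
    }
    group->simpleTags.push_back({entry.name, entry.value});
  }
  return nested;
}
```

A real implementation would also have to carry track, edition, chapter and attachment UIDs from the Targets element, which this sketch leaves out.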

I created the programs examples/matroskareader.cpp and examples/matroskawriter.cpp to demonstrate how the API works.

Current Development Status

  • Reading Files
  • Writing Files
  • Chapter support
  • Attachments support (needed for cover art)
  • Audio Properties
  • Complex Properties
  • PropertyMap
  • Documentation
  • Unit tests

@ufleisch
Contributor

Thanks for your efforts in moving Matroska support forward. This would really be a useful addition to TagLib. I would be really interested in this feature and want to encourage you to go on. But I think that before merging the code, it should also support writing metadata. Reading and writing should work for the elements which are commonly used in Matroska/WebM files. The problem here is that I do not have any mka files, and the mkv files I have are mostly grabbed from DVD using Handbrake. So it is probably a bit hard for me to find out which elements are commonly used.

Before looking at the code, I wanted to get an overview of what works. I created a tiny mkv-file for test purposes using

ffmpeg -t 0.1 -s qcif -f rawvideo -pix_fmt rgb24 -r 25 -i /dev/zero zero_second.mkv

When I use matroskareader on it, I get

Tag Name: ENCODER
Tag Value: Lavf60.3.100
Target Type Value: None

Tag Name: ENCODER
Tag Value: Lavc60.3.100 libx264
Target Type Value: None

Tag Name: DURATION
Tag Value: 00:00:00.120000000
Target Type Value: None

This is also what I see below Tags when using mkvinfo. Now I wanted to add some tags.

mkvpropedit file.mkv -e info -s title="Title"
mkvpropedit file.mkv -e info -s date=2023-09-23T04:30:00Z
mkvpropedit file.mkv -e info -s segment-filename="Segment Filename"
mkvpropedit file.mkv -e info -s prev-filename="Prev Filename"
mkvpropedit file.mkv -e info -s next-filename="Next Filename"
mkvpropedit file.mkv -e info -s muxing-application="Muxing Application"
mkvpropedit file.mkv -e info -s writing-application="Writing Application"
mkvpropedit file.mkv -e track:1 -s language=en
mkvpropedit file.mkv -e track:1 -s name="Name 1"

I can see this information using mkvinfo -a and mediainfo, but not using matroskareader - it still shows the same three tags.

Next, I tried to add some tags using VLC. Unfortunately, this does not seem to work anymore (see bug report). VLC is using TagLib, so maybe this bug will not be fixed until TagLib supports Matroska.

Then I tried Handbrake by converting my tiny mkv file to another mkv file and setting the tags in the Handbrake UI. mediainfo shows the following information:

Movie name                               : handbrake
Description                              : Description
Released date                            : 2023
Encoded date                             : 2023-09-22 15:58:25 UTC
Writing application                      : HandBrake 1.6.1 2023032600
Writing library                          : Lavf59.27.100
ErrorDetectionType                       : Per level 1
ARTIST                                   : Actors
DIRECTOR                                 : Director
GENRE                                    : Genre
SUMMARY                                  : Comment
SYNOPSIS                                 : Plot

mkvinfo also displays the entered information:

|+ Segment information
| + Timestamp scale: 1000000
| + Title: handbrake
| + Multiplexing application: Lavf59.27.100
| + Writing application: HandBrake 1.6.1 2023032600
| + Segment UID: 0x43 0x02 0xc3 0x50 0x41 0xce 0xf3 0x78 0xa1 0xe3 0x1d 0x6e 0xac 0xb7 0x18 0x8b
| + Date: 2023-09-22 15:58:25 UTC
| + Duration: 00:00:00.120000000
(..)
|+ Tags
| + Tag
|  + Targets
|  + Simple
|   + Name: DIRECTOR
|   + String: Director
|  + Simple
|   + Name: ARTIST
|   + String: Actors
|  + Simple
|   + Name: SUMMARY
|   + String: Comment
|  + Simple
|   + Name: DESCRIPTION
|   + String: Description
|  + Simple
|   + Name: GENRE
|   + String: Genre
|  + Simple
|   + Name: SYNOPSIS
|   + String: Plot
|  + Simple
|   + Name: DATE_RELEASED
|   + String: 2023
|  + Simple
|   + Name: ENCODER
|   + String: Lavf59.27.100
| + Tag
|  + Targets
|   + Track UID: 9584013959154292683
|  + Simple
|   + Name: DURATION
|   + String: 00:00:00.120000000

Also matroskareader shows this information correctly.

Tag Name: DIRECTOR
Tag Value: Director
Target Type Value: None

Tag Name: ARTIST
Tag Value: Actors
Target Type Value: None

Tag Name: SUMMARY
Tag Value: Comment
Target Type Value: None

Tag Name: DESCRIPTION
Tag Value: Description
Target Type Value: None

Tag Name: GENRE
Tag Value: Genre
Target Type Value: None

Tag Name: SYNOPSIS
Tag Value: Plot
Target Type Value: None

Tag Name: DATE_RELEASED
Tag Value: 2023
Target Type Value: None

Tag Name: ENCODER
Tag Value: Lavf59.27.100
Target Type Value: None

Tag Name: DURATION
Tag Value: 00:00:00.120000000
Target Type Value: None

This already looks promising! What seems to be missing is the title, in this case "handbrake", which is shown by mkvinfo below "Segment information" and not "Tags". But it seems that not all applications use that title. If I have a look at the example files from matroska-test-files, they use a TITLE tag, which is correctly shown by matroskareader.

Tag Name: TITLE
Tag Value: Big Buck Bunny - test 1
Target Type Value: None

Tag Name: DATE_RELEASED
Tag Value: 2010
Target Type Value: None

Tag Name: COMMENT
Tag Value: Matroska Validation File1, basic MPEG4.2 and MP3 with only SimpleBlock
Target Type Value: None

This is exactly what is contained in the corresponding tag XML file.

tagreader, however, does not show any tags for the mentioned test files. But the mapping of Matroska elements to "standard" tags and TagLib properties should not be a problem once we know what is usually used. Probably, you have more information on that and more realistic example files. But as I said, this looks promising, and I am looking forward to seeing write support.

@complexlogic
Contributor Author

Thanks for the feedback @ufleisch. I recommend reading this page from the Matroska specification for proper tags to use for audio.

I'm going to keep developing this feature and will update the PR with my progress.

@complexlogic
Contributor Author

Updates:

  • Writing tags is now supported.
  • PropertyMap support has been implemented
  • FileRef has been implemented for .mka, .mkv, and .webm extensions.
  • Matroska/WebM files work with both tagreader and tagwriter

The current code always rewrites the file when the new tag is a different size from the old tag size (which is usually the case). This can be slow with large file sizes. I will eventually optimize the writing to intelligently use void elements as padding to avoid rewrites when possible, similar to what is done for other file types.
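As a rough illustration of the padding idea, the check for whether a new tag can be written in place might look like the sketch below. The function name fitsInPlace() and the constant kMinVoidSize are hypothetical. The sketch assumes that oldSize is the size of the existing Tags element plus any Void element directly behind it, and relies on the fact that the smallest Void element is two bytes (one ID byte and one size byte).

```cpp
#include <cstdint>

// Smallest possible Void element: a one-byte ID plus a one-byte size field.
constexpr std::uint64_t kMinVoidSize = 2;

// True when newSize bytes can be written over a region of oldSize bytes
// without moving anything behind it. The region fits if it is filled
// exactly, or if the leftover is large enough to hold a Void element.
bool fitsInPlace(std::uint64_t oldSize, std::uint64_t newSize) {
  if (newSize == oldSize)
    return true;
  return newSize < oldSize && oldSize - newSize >= kMinVoidSize;
}
```

A leftover of exactly one byte is the awkward case: it cannot be covered by a Void element, so this sketch treats it as not fitting.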

For generating sample/test files, I don't recommend using ffmpeg or any program based on it (such as Handbrake). ffmpeg's Matroska metadata implementation is not even close to compliant with the specification. Based on my research, Foobar2000 seems to be the only program with a decent Matroska audio implementation. So I'm using that as the reference implementation where the official Matroska spec is ambiguous.

I plan to work on attachments next, for cover art support.

@ufleisch
Contributor

ufleisch commented Oct 1, 2023

Great, thanks! Concerning rewriting the file when the tag changes: Most audio format implementations in TagLib avoid this by using some padding (such as void elements). But I personally think that it is important to be able to bring the file back into a deterministic state without padding when the tag is completely removed. This should be possible with all audio formats supported by TagLib. You could implement it like this:

  • If the tag is empty after save(), make sure that it is completely removed.
  • The default save() method could be optimized to be fast by using padding instead of shrinking the file.
  • There could be an overload save(bool shrink) which can be called with shrink=true to save without padding.
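The three rules above can be condensed into a small decision function. This is a sketch of the proposal only, not existing TagLib code: SaveAction and chooseSaveAction() are hypothetical names, and fitsInPlace is assumed to be computed elsewhere from the old and new tag sizes.

```cpp
// Hypothetical outcome of a save() call under the proposed rules.
enum class SaveAction {
  RemoveTag,    // tag is empty: remove it completely, leaving no padding
  PadInPlace,   // fast path: overwrite in place and pad with a Void element
  RewriteFile   // write the tag at its exact size and move the rest of the file
};

// tagIsEmpty:  the tag has no entries left after editing
// shrink:      the caller asked for save(true), i.e. no padding
// fitsInPlace: the new tag fits into the space of the old tag plus padding
SaveAction chooseSaveAction(bool tagIsEmpty, bool shrink, bool fitsInPlace) {
  if (tagIsEmpty)
    return SaveAction::RemoveTag;
  if (!shrink && fitsInPlace)
    return SaveAction::PadInPlace;
  return SaveAction::RewriteFile;
}
```

With these rules, an edit followed by removing every tag always ends in RemoveTag, regardless of how the intermediate saves were padded.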

I will have a look at the changes.

@ufleisch
Contributor

I had a look at the current state of the Matroska code, and progress looks promising. I only have a few remarks:

It gives me a good feeling if I can edit the metadata of a file and am then able to revert the changes, thereby bringing the file into the exact state it was in before. So I used tagwriter to set some values (standard tags and properties) on a Matroska file, then deleted them again and checked whether the file was left as it was before. I was not able to delete properties using the -D option of tagwriter, which is no surprise given that matroskatag.cpp has a return false before removeSimpleTags(). So I just tried to delete standard tags. This worked, but there were some leftovers from -y and -T: DATE_RELEASED/50 and PART_NUMBER/30 with values 0. These should probably be removed too.

Regarding the state of the file after the tags have been removed: It was no longer the same. The tags were removed, but other existing tags had been changed (Language from "Not set" to "und"). I would prefer that tags which are not edited be kept as they were, so that the file is brought back into its previous state.

I also had a look at how tags written by other applications are recognized. This looked quite good, only the year was displayed as 0. This is because foobar2000 uses DATE_RECORDED/30, Mp3Tag DATE_RECORDED/50, but tagwriter uses DATE_RELEASED/50. Given that TDRC in ID3v2.4.0 is "Recording time", I would also take DATE_RECORDED. Tags from Handbrake were still not recognized, probably because all tags from Handbrake have a "Target Type Value: None". Maybe TagLib could be a bit more forgiving here.
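A more forgiving year lookup along these lines could be sketched as follows. The SimpleTag struct and both functions are hypothetical, and the search order (DATE_RECORDED before DATE_RELEASED, and target level 50, then 30, then none) is an assumption made for this illustration, not something the specification or the PR prescribes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical flat tag entry; targetTypeValue 0 means "None".
struct SimpleTag {
  std::string name;
  std::string value;
  int targetTypeValue;
};

// Parse a leading four-digit year such as "2023" or "2023-09-22".
// Returns 0 when the value does not start with four digits.
unsigned int leadingYear(const std::string &value) {
  unsigned int year = 0;
  std::size_t digits = 0;
  while (digits < value.size() && digits < 4 &&
         value[digits] >= '0' && value[digits] <= '9') {
    year = year * 10 + static_cast<unsigned int>(value[digits] - '0');
    ++digits;
  }
  return digits == 4 ? year : 0;
}

// Try each name at each target level until a usable year is found.
unsigned int findYear(const std::vector<SimpleTag> &tags) {
  static const char *const names[] = {"DATE_RECORDED", "DATE_RELEASED"};
  static const int levels[] = {50, 30, 0};
  for (const char *name : names)
    for (int level : levels)
      for (const auto &tag : tags)
        if (tag.targetTypeValue == level && tag.name == name) {
          const unsigned int year = leadingYear(tag.value);
          if (year != 0)
            return year;
        }
  return 0;
}
```

With this order, a file carrying only DATE_RELEASED with no target type, as in the Handbrake output shown earlier, still yields a year.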

Concerning the code itself, it is too early to fix minor styling issues, but I have one remark. In matroskatag.h there are a few template functions in the public header file. Probably these "simple tag" template functions could be moved into matroskatag.cpp as members of TagPrivate, then they could directly access tags and would not need a simpleTagsListPrivate() function.

But as I said, it looks promising and I am looking forward to seeing further progress.

@complexlogic
Contributor Author

Thanks for the feedback.

I'm continuing to work on this locally. I have attachments working, but I have discovered some other issues:

  1. The code does not update the Seek Head element. This is necessary when adding a new Tags element, or when the size of an existing Tags element changes (and shifts the offset of other elements in the Seek Head)
  2. The code does not update the cue points. This is necessary when the size of a Tags element changes. Each cue point stores an offset, which needs to be adjusted when the size of any upstream element changes.
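The adjustment both issues call for is essentially the same: every stored position that lies behind the resized element has to move by the size difference. A minimal sketch, assuming all positions and the location of the change are expressed in the same coordinate system, and using a hypothetical helper name shiftOffsets():

```cpp
#include <cstdint>
#include <vector>

// Adjust stored positions after an element starting at `changedAt` grew
// (delta > 0) or shrank (delta < 0). Positions at or before `changedAt`
// stay put, since the resized element itself does not move.
void shiftOffsets(std::vector<std::uint64_t> &positions,
                  std::uint64_t changedAt, std::int64_t delta) {
  for (auto &position : positions)
    if (position > changedAt)
      position = static_cast<std::uint64_t>(
          static_cast<std::int64_t>(position) + delta);
}
```

The same pass would be run over the Seek Head entries and over the cue points. It does not yet account for the stored integers themselves changing size, which is a separate complication.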

If you run mkvalidator on a file written by tagwriter, it will most likely fail due to the issues above. I'm currently working on fixes for this. In the meantime, make sure you only run the code on test files, or files that you have backups for.

@neheb
Contributor

neheb commented Dec 25, 2023

Conflicts

@complexlogic
Contributor Author

@neheb I'm working on some changes locally to fix the issues I described in #1149 (comment). After I have that done, I will rebase before pushing again.

@ufleisch
Contributor

@complexlogic I just want to ask if you are still working on this. If you have local changes, you could push them to the branch, so we could continue working on this feature.

@ufleisch
Contributor

I have fixed the conflicts and pushed to a new branch https://github.com/ufleisch/taglib/tree/matroska.

@hohav

hohav commented Mar 23, 2025

I would love to see this finished. @complexlogic, did you make any progress on the issues you mentioned in #1149 (comment)?

@ufleisch
Contributor

ufleisch commented Aug 3, 2025

Dear @complexlogic

Since we have not heard from you for more than one and a half years now, I assume that you have lost interest in working on Matroska support for TagLib. I can understand that interests and priorities can change over time. But it would be a shame if all your hard work had been in vain, especially since there seem to be quite a few people who would like to see such a feature in TagLib. Therefore, I will try to continue your work. As I do not have the permissions to push into your branch, I will have to close this pull request and create a new one. I have already solved the conflicts and rebased the branch to https://github.com/ufleisch/taglib/tree/matroska. But I have one last request: Since you mentioned that you have already implemented attachments and possibly other features in a local branch, it would be great if you could push this to your public branch. This would save me from having to reprogram these features.

If I do not hear from you within one week, I will close this PR.

@complexlogic
Contributor Author

@ufleisch I apologize for my prior lack of response on this issue.

You are right that I stopped work on this a while ago. But it was because I discovered some difficult technical issues, rather than lack of interest.

The first is the need to update the Cues element, as I previously mentioned. This element holds offsets to many other elements, which are typically after Tags. If the Tags element grows in size, then all downstream offsets need to be updated and the element re-rendered to binary. Matroska uses variable size integers, so increasing the value of any offset could increase the size of the integer (and the element), which could then add to the offsets of other elements, and so on. It gets really complicated.
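To make the size effect concrete, here is a small sketch of the two lengths involved. The function names are hypothetical. It assumes the usual EBML rules: an unsigned integer is stored in as few octets as its value needs (at least one here), and an element size field of n octets carries 7n bits, with the all-ones pattern reserved.

```cpp
#include <cstdint>

// Octets needed to store an unsigned integer value (at least one).
int uintByteLength(std::uint64_t value) {
  int length = 1;
  while (value >>= 8)
    ++length;
  return length;
}

// Octets needed for an element size field. A field of n octets carries
// 7 * n bits, and the all-ones value is reserved, so it cannot be used.
// Sizes that do not fit into 8 octets are not handled by this sketch.
int sizeFieldLength(std::uint64_t size) {
  for (int length = 1; length < 8; ++length)
    if (size < (std::uint64_t{1} << (7 * length)) - 1)
      return length;
  return 8;
}
```

For example, shifting a stored offset from 255 to 256 makes it one octet longer, and a payload growing from 126 to 127 octets makes the size field of its element one octet longer. Either change moves everything behind it again, which is why the update may have to be repeated until no size changes.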

When I started working on this, I did not realize Matroska files were so self-referential with how many offset locations there are.

Most other Matroska implementations I've seen (like Foobar2000) will replace a file's existing Tags with a void element, and create a new Tags at the end of the file. I recall you did not like this, but it really simplifies the logic since you don't have to shift any of the existing elements. It keeps the offsets the same.
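Blanking out an existing element with a Void element of exactly the same total size could look roughly like this. It is a byte-level sketch on an in-memory buffer with a hypothetical function name, not the approach of any particular application. It uses the Void element ID 0xEC and picks either a one-octet or an eight-octet size field so that the total length always matches.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Overwrite `length` bytes starting at `start` with one Void element of
// exactly that total size, so nothing behind it has to move.
void overwriteWithVoid(std::vector<std::uint8_t> &file,
                       std::size_t start, std::size_t length) {
  assert(length >= 2 && start + length <= file.size());
  file[start] = 0xEC;  // Void element ID
  std::size_t header = 0;
  if (length - 2 <= 126) {
    // One-octet size field: marker bit 0x80 plus 7 bits of payload length.
    file[start + 1] = static_cast<std::uint8_t>(0x80 | (length - 2));
    header = 2;
  } else {
    // Eight-octet size field: marker octet 0x01, then 56 bits of length.
    const std::uint64_t payload = length - 9;
    file[start + 1] = 0x01;
    for (int i = 0; i < 7; ++i)
      file[start + 2 + i] =
          static_cast<std::uint8_t>(payload >> (8 * (6 - i)));
    header = 9;
  }
  // Zero the payload so no stale tag data remains readable.
  for (std::size_t i = start + header; i < start + length; ++i)
    file[i] = 0;
}
```

This covers only the blanking step. Appending the new Tags element and updating any size or index fields that refer to it is not shown.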

Another issue is that I discovered that Attachments are only supported for Matroska, but not WebM. So we would have to refactor the code so that Matroska and WebM have separate classes which inherit from a base class, like what is currently being done for the Ogg format. That way we could make Attachments exclusive to the Matroska class.
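One possible shape for such a split is sketched below. All class and member names here are hypothetical and do not reflect TagLib's actual class layout; the only facts taken as given are the two DocType strings, "matroska" and "webm".

```cpp
#include <string>
#include <vector>

// Hypothetical attachment record.
struct AttachedFile {
  std::string fileName;
  std::string mimeType;
};

// Hypothetical common base for everything both formats share.
class EbmlContainerFile {
public:
  virtual ~EbmlContainerFile() = default;
  virtual std::string docType() const = 0;
  virtual bool supportsAttachments() const { return false; }
};

// WebM: shared behaviour only, no attachment API at all.
class WebMFile : public EbmlContainerFile {
public:
  std::string docType() const override { return "webm"; }
};

// Matroska: adds the attachment API on top of the shared behaviour.
class MatroskaFile : public EbmlContainerFile {
public:
  std::string docType() const override { return "matroska"; }
  bool supportsAttachments() const override { return true; }
  void addAttachment(const AttachedFile &file) { files.push_back(file); }
  const std::vector<AttachedFile> &attachments() const { return files; }

private:
  std::vector<AttachedFile> files;
};
```

Keeping addAttachment() off the base class means that code handling a WebM file cannot call it by accident.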

I will commit my local changes and push them soon. I also enabled you to be able to push to this branch. So you can continue to use this PR, or create a separate branch if you wish.

@complexlogic
Contributor Author

Just rebased and pushed all of my local changes, which include Attachments and a partial implementation for Cues.

The matroskawriter example can be used to embed cover art, with the following command-line syntax: matroskawriter FILE ARTWORK. You will see the artwork if you play the file in VLC, mpv, and other programs.

Also improve type safety and consistency.
Avoid use of raw pointers.
Avoid use of raw pointers, fix property interface.
…perties

Also all attached files can be accessed and modified using complex properties.
A complex property can be set with

-C <complex-property-key> <key1=val1,key2=val2,...>

The second parameter can be set to "" to delete complex properties with the
given key. To set complex property values, a simple shorthand syntax can be
used. Multiple maps are separated by ';', values within a map are assigned
with key=value and separated by a ','. Types are automatically detected;
double quotes can be used to force a string. A ByteVector can be constructed
from the contents of a file whose path is given after "file://". There is
no escape character, but hex codes are supported, e.g. "\x2C" to include a
',' and "\x3B" to include a ';'.
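The splitting rules can be illustrated with a small stand-alone parser. This is a simplified sketch with hypothetical function names, not the tagwriter implementation: it keeps all values as strings and leaves out type detection, double-quote handling and "file://" loading.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using PropertyMapList = std::vector<std::map<std::string, std::string>>;

// Replace every "\xNN" sequence with the byte it encodes.
std::string decodeHex(const std::string &text) {
  std::string out;
  for (std::size_t i = 0; i < text.size(); ++i) {
    if (text[i] == '\\' && i + 3 < text.size() && text[i + 1] == 'x' &&
        std::isxdigit(static_cast<unsigned char>(text[i + 2])) &&
        std::isxdigit(static_cast<unsigned char>(text[i + 3]))) {
      out += static_cast<char>(std::stoi(text.substr(i + 2, 2), nullptr, 16));
      i += 3;  // skip the rest of the escape sequence
    } else {
      out += text[i];
    }
  }
  return out;
}

// Split on a separator character, keeping empty parts.
std::vector<std::string> split(const std::string &text, char separator) {
  std::vector<std::string> parts;
  std::size_t begin = 0;
  for (;;) {
    const std::size_t end = text.find(separator, begin);
    parts.push_back(text.substr(begin, end - begin));
    if (end == std::string::npos)
      break;
    begin = end + 1;
  }
  return parts;
}

// Maps are separated by ';', key=value pairs inside a map by ','.
// Hex codes are decoded only after splitting, so "\x2C" survives as ','.
PropertyMapList parseShorthand(const std::string &text) {
  PropertyMapList maps;
  for (const auto &mapText : split(text, ';')) {
    std::map<std::string, std::string> map;
    for (const auto &pair : split(mapText, ',')) {
      const std::size_t equals = pair.find('=');
      if (equals == std::string::npos)
        continue;  // ignore parts without a key
      map[decodeHex(pair.substr(0, equals))] =
          decodeHex(pair.substr(equals + 1));
    }
    if (!map.empty())
      maps.push_back(map);
  }
  return maps;
}
```

An empty input yields an empty list, which corresponds to the "delete" case of passing "" as the second parameter.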

Examples:

Set a GEOB frame in an ID3v2 tag:
examples/tagwriter -C GENERALOBJECT \
  'data=file://file.bin,description=My description,fileName=file.bin,mimeType=application/octet-stream' \
  file.mp3

Set an APIC frame in an ID3v2 tag (same as -p file.jpg 'My description'):
examples/tagwriter -C PICTURE \
  'data=file://file.jpg,description=My description,pictureType=Front Cover,mimeType=image/jpeg' \
  file.mp3

Set an attached file in a Matroska file:
examples/tagwriter -C file.bin \
  'fileName=file.bin,data=file://file.bin,mimeType=application/octet-stream' \
  file.mka

Set simple tag with target type in a Matroska file:
examples/tagwriter -C PART_NUMBER \
  name=PART_NUMBER,targetTypeValue=20,value=2 file.mka

Set simple tag with binary value in a Matroska file:
examples/tagwriter -C BINARY \
  name=BINARY,data=file://file.bin,targetTypeValue=60 file.mka

Some applications like Handbrake store the title only in the segment
info element.

Some encoders write track-specific DURATION tags, which should not be
removed.

@ufleisch
Contributor

I have now worked quite some time on the Matroska code. The architecture and concepts are retained, but I have reduced the use of raw pointers to a minimum. The API uses our explicitly shared containers where possible, thereby offering a Qt-style API which is easy to use. Internally, raw pointers and explicit heap allocations are replaced by unique_ptr to make the code safer and avoid memory leaks. To eliminate other sources of errors, there is a single point where the mapping between IDs and element classes is defined, and an enum is used for the IDs in order to get a warning when some cases are forgotten. I tried to handle all the cases where offsets and sizes have to be updated. Unit tests cover most of the code. The few things which could not be tested (because I do not have files with certain elements) are listed in the comment at the top of test_matroska.cpp.
As you @complexlogic probably know more about Matroska and have a wider selection of Matroska files, it would be good if you could test the current implementation. I have leveraged the properties and complex properties interfaces so that most Matroska tags can be read and written via these generic APIs, i.e. you can read and generate all tags which can be mapped to the SimpleTag class using the standard tagreader and tagwriter applications from the examples folder. Instructions can be found in the comment at the top of test_matroska.cpp.
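The "single point of mapping" idea can be illustrated with an enum and a factory function. The class names and the factory are hypothetical and much simpler than the real code. The numeric IDs are the standard Matroska IDs for these elements.

```cpp
#include <cstdint>
#include <memory>
#include <string>

// Element IDs as an enum, so a switch without a default case lets the
// compiler warn when a newly added ID is not handled.
enum class ElementId : std::uint32_t {
  SeekHead = 0x114D9B74,
  Tags = 0x1254C367,
  Attachments = 0x1941A469,
  Cues = 0x1C53BB6B,
  Void = 0xEC
};

// Hypothetical element classes; real ones would parse and render data.
struct Element {
  virtual ~Element() = default;
  virtual std::string name() const = 0;
};
struct SeekHeadElement : Element {
  std::string name() const override { return "SeekHead"; }
};
struct TagsElement : Element {
  std::string name() const override { return "Tags"; }
};
struct AttachmentsElement : Element {
  std::string name() const override { return "Attachments"; }
};
struct CuesElement : Element {
  std::string name() const override { return "Cues"; }
};
struct VoidElement : Element {
  std::string name() const override { return "Void"; }
};

// The only place where an ID is turned into an element object.
// Ownership is passed to the caller through unique_ptr.
std::unique_ptr<Element> makeElement(ElementId id) {
  switch (id) {
  case ElementId::SeekHead:
    return std::make_unique<SeekHeadElement>();
  case ElementId::Tags:
    return std::make_unique<TagsElement>();
  case ElementId::Attachments:
    return std::make_unique<AttachmentsElement>();
  case ElementId::Cues:
    return std::make_unique<CuesElement>();
  case ElementId::Void:
    return std::make_unique<VoidElement>();
  }
  return nullptr;  // value outside the enum, e.g. an unknown ID from a file
}
```

Because the switch has no default label, compilers that warn about unhandled enumerators will flag a new ElementId that has not been given a case.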

@complexlogic
Contributor Author

@ufleisch I will give it a test this weekend.
